Meta is working on a new tool that leverages generative AI, the technology behind the viral chatbot ChatGPT. Dubbed Voicebox, the tool can be used to create speech from voice samples and simple text input. Meta also claims that Voicebox can filter out unwanted background noise from audio samples. However, unlike other generative AI tools like ChatGPT and Bard, or AI image generators like DALL-E or Midjourney, Voicebox remains unavailable to testers and may remain restricted for the time being. This is because Meta says that Voicebox could be misused and carries significant potential risks.
What is Meta Voicebox, and how does it work?
In simple words, Voicebox is a text-to-speech generator with some audio editing tools. However, Meta says that its AI tool is far more effective than its competitors because Voicebox can replicate intonation and pronunciation. Voicebox’s existing competitor, VALL-E, also lets users create text-to-speech samples with up to 3 seconds of recording. However, Meta claims that Voicebox generates output up to 20 times faster, with fewer errors.
Since Voicebox is not available to the public, the company explains how it works in a research paper and blog post. Meta says that Voicebox is built on a method called “flow matching” for converting text to speech. The model is said to handle complex and unpredictable relationships between text and speech. According to Meta, this approach also allows Voicebox to be trained on a larger and more diverse set of data, making it more powerful and flexible.
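For readers curious about what “flow matching” means in practice, here is a minimal toy sketch in Python. It is not Meta's code: Voicebox itself has not been released, and the function names, the `sigma_min` parameter, and the two-number "audio" vectors are all assumptions made for illustration. The general idea in the flow matching literature is that a model learns a velocity that moves random noise toward real data along a simple path, and new samples are generated by following that velocity step by step. Conditioning on text and surrounding audio, which Voicebox relies on, is left out here.

```python
def path_point(noise, data, t, sigma_min=0.0):
    """Point at time t (0 to 1) on a straight path from noise to data."""
    scale = 1.0 - (1.0 - sigma_min) * t
    return [scale * n + t * d for n, d in zip(noise, data)]


def target_velocity(noise, data, sigma_min=0.0):
    """Velocity the model should predict along that straight path."""
    return [d - (1.0 - sigma_min) * n for n, d in zip(noise, data)]


def flow_matching_loss(predict_velocity, noise, data, t, sigma_min=0.0):
    """Mean squared error between predicted and target velocity."""
    x_t = path_point(noise, data, t, sigma_min)
    predicted = predict_velocity(x_t, t)
    target = target_velocity(noise, data, sigma_min)
    return sum((p - q) ** 2 for p, q in zip(predicted, target)) / len(target)


def euler_sample(predict_velocity, noise, steps=10):
    """Generate a sample by following the velocity from t=0 to t=1."""
    x = list(noise)
    dt = 1.0 / steps
    for i in range(steps):
        v = predict_velocity(x, i * dt)
        x = [a + dt * b for a, b in zip(x, v)]
    return x


if __name__ == "__main__":
    noise = [0.0, 0.0]   # stand-in for random noise
    data = [1.0, 2.0]    # stand-in for a real audio feature vector

    # Hypothetical "perfect" model that already knows the right velocity.
    def oracle(x, t):
        return [1.0, 2.0]

    print(flow_matching_loss(oracle, noise, data, t=0.5))
    print(euler_sample(oracle, noise))
```

In a real system, the hand-written `oracle` would be replaced by a neural network trained to minimise this loss over many noise and data pairs, and the vectors would be audio features rather than two numbers.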
Currently, Voicebox can generate speech in English, French, German, Spanish, Polish and Portuguese. Meta states that the technology is “exciting” because it could help people communicate in a natural and authentic way “even if they don’t speak the same languages.”
As mentioned, Voicebox can also be used for audio editing. In a demo, Meta shows that the tool effectively filtered out the background noise of a dog barking from a sample. Similar audio filtering features are already present in Google Meet and Zoom.
Why is Meta Voicebox unavailable?
Meta says the company is “not making the Voicebox model or code publicly available at this time” due to “potential risks of misuse”. It adds, “While we believe it is important to be open with the AI community and share our research to advance the state of the art in AI, it is also necessary to strike the right balance between openness with responsibility. With these thoughts in mind, today we’re sharing audio samples and a research paper detailing our approach and the results we achieved.”
While this could also mean that Meta is still working on Voicebox and the tool is incomplete, the decision may sit well with some critics. Earlier this year, we used a text-to-image AI generator to create fake images of Elon Musk, Barack Obama, and Donald Trump in various locations and clothing. AI-generated voice samples could turn out to be a nightmare for politicians in India, where the battle against fake news continues on WhatsApp. Altered or AI-generated audio samples could also help hackers carry out extortion.
Meta adds that it plans to “investigate proactive methods to train generative models such that synthetic speech can be more easily detected, such as embedding artificial fingerprints.”