Meta Platforms, the company behind Facebook and Instagram, has unveiled a generative AI text-to-speech tool called Voicebox AI.
Meta Platforms, the company behind Facebook and Instagram, has previewed a generative artificial intelligence text-to-speech tool, named Voicebox AI, which it claims outperforms all existing models.
Like ChatGPT and DALL-E, the generative model takes text as its input. Instead of generating text or images, however, Voicebox renders the words as speech in a variety of voices, and it can also cut out unwanted audio, pauses, and other audio issues.
According to Meta, Voicebox can match an audio style from only two seconds of sample audio. It can also recreate the person’s voice in several languages, with English, French, German, Spanish, Polish, and Portuguese the first few languages Meta has added.
In the preview video, Meta CEO Mark Zuckerberg appears to be revealing the capabilities of Voicebox. Meta did not say in the press release whether it is actually Zuckerberg speaking or whether the team used an audio sample; we assume it is the latter.
Meta has not said when it plans to make Voicebox available to the wider public. As with many generative AI research projects, there are plenty of ways bad actors could use this tool to commit fraud and spread misinformation. Meta has been rather averse to launching AI tools to the general public, although this may be changing as the company has switched its focus from virtual reality and the metaverse to AI.
“Prior to Voicebox, generative AI for speech required specific training for each task using carefully prepared training data,” said Matt Le, a research engineer at Meta AI. “Voicebox uses a new approach to learn just from raw audio and an accompanying transcription. Unlike autoregressive models for audio generation, Voicebox can modify any part of a given sample, not just the end of an audio clip it is given.”
In a research paper published around the same time as the press release, Meta AI said that Voicebox is able to generate a diverse set of audio samples twenty times faster than VALL-E, Microsoft’s own text-to-speech generative tool. It should be noted that neither tool is widely available, and claims made by either research team cannot be fully verified due to a lack of access.
Beyond pranking friends, the audio editing and noise reduction tools should be valuable to audio and sound engineers, who would previously have spent hours removing noise from videos or cleaning up portions of dialogue. It’s not clear how Meta would market this service to engineers, however, as it is not a competitor in the media editing space.
Text-to-speech does appear to be the next generative system to be taken up by the masses. Image generation and editing tools, in the form of DALL-E, Midjourney, and Stable Diffusion, were the first to break through. OpenAI’s ChatGPT was the first generative text tool to hit the public web, and it has been a huge success for the AI research lab, gaining over 100 million users.
Whether text-to-speech tools will have the same broad appeal as ChatGPT remains to be seen. They look to be more in the realm of DALL-E and other image generation tools, which could be assets for digital artists and people in media but aren’t as valuable to the wider public.