Millions of households have voice-controlled devices, but when was the last time you heard synthesized speech for more than a handful of seconds? WellSaid Labs has advanced the field with a voice engine that can quickly and easily generate hours of voice content that sounds as good as or better than the snippets we hear from Siri and Alexa every day.
The company has been working to evolve its technology from a formidable demo to a commercial product since its public debut last year, and in the process has found a lucrative niche to build from.
CTO Michael Petrochuk said the company's technology was initially based on earlier research – Google's Tacotron project, which set a new standard for realism in synthetic speech.
"Despite being released two years ago, Tacotron 2 is still state of the art. But there are a few problems," said Petrochuk. "First, it's not fast – it takes three minutes to produce a second of audio. And it's designed to model 15 seconds of audio. Imagine that in a workflow where you generate 10 minutes of content – that's orders of magnitude off from where we want to be."
WellSaid has completely rebuilt its model with a focus on speed, quality, and length. That may sound like focusing on everything at once, but there are always more parameters one could be optimizing for. The result is a model that can produce extremely high-quality speech in about 15 voices (and multiple languages) faster than real time: creating a minute-long clip takes about 36 seconds instead of hours.
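To put those figures in perspective, here is a quick back-of-the-envelope comparison of real-time factors (seconds of compute per second of audio), using only the numbers quoted in this article:

```python
# Real-time factor (RTF) = generation time / audio duration.
# Figures taken from the article; everything else is simple arithmetic.

# Tacotron 2, per Petrochuk: ~3 minutes of compute per 1 second of audio.
tacotron_rtf = (3 * 60) / 1  # 180x slower than real time

# WellSaid: ~36 seconds to generate a 60-second clip.
wellsaid_rtf = 36 / 60  # 0.6x, i.e. faster than real time

speedup = tacotron_rtf / wellsaid_rtf
print(f"Tacotron RTF: {tacotron_rtf:.0f}x, WellSaid RTF: {wellsaid_rtf:.1f}x")
print(f"Approximate speedup: {speedup:.0f}x")
```

By this rough math, the new engine is on the order of 300 times faster than the baseline it started from.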
This seemingly basic capability brings many advantages. It is not just faster; it also makes working with the results far simpler. As an audio content producer, you can paste in a script hundreds of words long, listen to the output, and then adjust pronunciation or cadence with a few keystrokes. Tacotron changed synthetic speech, but it was never really a product. WellSaid builds on its own advances to create both usable software and, arguably, a better speech system overall.
As evidence, clips generated by the model – kept to 15 seconds so they can be compared with Tacotron and others – reached the milestone of being rated as highly as human voices in listening tests organized by WellSaid. There is no objective measure for this sort of thing, but asking many people to judge how human something sounds is a good place to start.
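The article does not say exactly how WellSaid scored its tests, but listening studies like this are conventionally summarized as a mean opinion score (MOS): each listener rates each clip on a 1–5 naturalness scale, and scores are averaged per system. A minimal sketch with made-up ratings (not WellSaid's actual data):

```python
from statistics import mean

# Hypothetical 1-5 naturalness ratings from a listening test.
# These numbers are illustrative only, not WellSaid's results.
ratings = {
    "human recordings": [5, 4, 5, 4, 4, 5],
    "synthetic clips":  [4, 5, 4, 4, 5, 4],
}

# Mean opinion score per system: the standard summary statistic for such tests.
mos = {system: mean(scores) for system, scores in ratings.items()}
for system, score in mos.items():
    print(f"{system}: MOS {score:.2f}")
```

"Human parity" in this framing simply means the synthetic system's MOS is statistically indistinguishable from that of the human recordings.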
Alongside the team's work toward "human parity" under these conditions, it also released a series of audio clips showing the model handling far more demanding content.
It produced plausible-sounding speech in Spanish, French, and German (I'm not a native speaker of any of them, so I can't say more), and demonstrated its ability with complex, linguistically difficult words (like stoichiometry and halogenation), heteronyms whose pronunciation depends on context (buffet, desert), and so on. The grand finale has to be the continuous eight-hour reading of Mary Shelley's "Frankenstein."
However, audiobooks are not the industry WellSaid is using as its stepping stone. Instead, the company has found its footing in the hugely boring but necessary field of corporate training. You know the kind of video: ones that explain policies, document the use of internal tools, and walk through best practices for sales, management, development tools, and more.
Corporate learning content is generally bespoke, or at least tailored to each business, and can involve hours of audio – an alternative to handing someone a packet to read or gathering everyone in a room to watch a decade-old DVD about office conduct. It's not the most exciting showcase for such powerful technology, but the truth is that no matter how transformative your technology may be, a startup that doesn't make money goes nowhere.
"We found a sweet spot in corporate training, but for product development we have developed these basic elements for an ever larger space," explained growth manager Martín Ramírez. “Voice is everywhere, but we have to be pragmatic about who we are building for today. Finally, we will provide the infrastructure in which each vote can be created and distributed. "
The most obvious near-term expansion is into other languages. English is not "baked in" to WellSaid's system; given training data in other languages, it should perform just as well in them, so that is an easy path forward. But other industries could also use improved voice capabilities: podcasting, games, radio, advertising, government.
A major limitation of the enterprise approach is that the system is meant to be operated by a person, essentially directing a virtual voice actor. That means it is not yet useful for the groups for whom improved synthetic speech is most desirable: people with disabilities that affect their voice, blind people who use voice-based interfaces throughout the day, or even travelers in a foreign country relying on real-time translation tools.
"I see WellSaid waiting for this use case in the near future," said Ramírez, although he and the others were careful not to make any promises. "But today, the way it's built, we really believe that a human producer should interact with the engine in order to reproduce it at a natural level of human parity." The dynamic rendering scenario is approaching quite quickly and we want to be prepared for it, but are not ready for it today. "
The company has "many runways and customers" and is growing rapidly – so no funding needs are required right now, thank you, venture capitalists.