Google has launched a new technology that makes it easier for businesses to add natural-sounding speech capabilities to their applications and services.
Cloud Text-to-Speech is available—currently in beta form—as an API that developers can use to enable voice interaction in a wide range of use cases.
Examples include powering interactive voice response systems in call centers; adding voice response capabilities to TVs, cars and internet of things devices; and automatically converting news articles, books and other text-based media to audiobooks and podcasts.
Developers can choose from 32 different voices in 12 languages when adding voice capabilities to an application, service or device using Cloud Text-to-Speech.
Cloud Text-to-Speech allows developers to customize attributes like speaking rate, pitch and volume gain, according to Dan Aharon, product manager of Cloud AI at Google.
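As a rough illustration of what such customization might look like, the sketch below builds a JSON request body carrying the three attributes Aharon mentions. The field names (speakingRate, pitch, volumeGainDb and the rest) follow the service's public REST interface as generally documented, but the API was in beta at the time of the announcement, so treat them as assumptions and check the current reference before relying on them. The helper build_synthesis_request is hypothetical, and the code only constructs the payload; it does not contact Google's servers.

```python
import json


def build_synthesis_request(text, language_code="en-US",
                            speaking_rate=1.0, pitch=0.0, volume_gain_db=0.0):
    """Return a JSON-serializable request body. No network call is made.

    Field names are assumed from the public REST interface and may differ
    in the beta release described in the article.
    """
    return {
        "input": {"text": text},
        "voice": {"languageCode": language_code},
        "audioConfig": {
            "audioEncoding": "MP3",
            "speakingRate": speaking_rate,   # 1.0 is the normal speed
            "pitch": pitch,                  # 0.0 leaves the pitch unchanged
            "volumeGainDb": volume_gain_db,  # 0.0 leaves the volume unchanged
        },
    }


if __name__ == "__main__":
    # Slightly slower, lower-pitched and louder than the defaults.
    body = build_synthesis_request("Your order has shipped.",
                                   speaking_rate=0.9, pitch=-2.0,
                                   volume_gain_db=3.0)
    print(json.dumps(body, indent=2))
```

A developer would send a body like this to the service with valid credentials and receive encoded audio in return; that step is omitted here because it depends on account setup.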
The technology is designed to pronounce complex text such as names, dates and addresses correctly and authentically without any tweaking or customization, Aharon wrote in a blog post announcing Cloud Text-to-Speech on March 27.
Some of the high-fidelity voices available with the new technology use WaveNet from DeepMind, a UK-based artificial intelligence firm that Google acquired in 2014 and that is now an Alphabet subsidiary.
WaveNet is a deep neural network for generating speech that mimics human voices. The speech generated with WaveNet is far more natural-sounding than that of even the best other text-to-speech systems, according to Google.
The technology differs from the most common current approach to generating speech with computers, which selects short speech fragments and concatenates them into whole utterances.
With concatenative text-to-speech technologies, a large database of speech fragments from a single speaker is first recorded and those fragments are then recombined as needed to make complete sentences, Google noted. This approach makes it hard to modify the voice or alter the emotion or emphasis of the computer-generated speech, according to Google.
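The concatenative approach described above can be sketched in a few lines. This is a toy model, not any vendor's implementation: the fragment names and sample values in FRAGMENTS are invented for illustration, and real systems select among many recorded candidates and smooth the joins between them.

```python
# Toy fragment database: each speech unit maps to a short list of audio samples.
# A real system stores many fragments recorded from a single speaker;
# these names and values are made up for illustration.
FRAGMENTS = {
    "hel": [0.1, 0.3, 0.2],
    "lo": [0.0, -0.2, -0.1],
    "wor": [0.4, 0.1],
    "ld": [-0.3, 0.0],
}


def synthesize(units):
    """Concatenate the recorded samples for each unit, in order."""
    waveform = []
    for unit in units:
        if unit not in FRAGMENTS:
            # Nothing was recorded for this unit, so it cannot be spoken.
            raise KeyError(f"no recorded fragment for unit {unit!r}")
        waveform.extend(FRAGMENTS[unit])
    return waveform


if __name__ == "__main__":
    print(synthesize(["hel", "lo"]))  # [0.1, 0.3, 0.2, 0.0, -0.2, -0.1]
```

Because the output is only ever a recombination of stored recordings, changing the voice or its emphasis means recording a new database, which is the limitation Google describes.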
WaveNet, on the other hand, is designed to produce raw audio waveforms by learning from large volumes of speech samples. “During training, the network extracts the underlying structure of the speech, for example which tones follow one another and what shape a realistic speech waveform should have,” Aharon said.
Given a text input, a fully trained WaveNet model can then generate the corresponding speech waveform much more accurately than other approaches to speech synthesis, he said. Current WaveNet models can generate up to 20 seconds of relatively high-quality audio in just 1 second.
Pricing for the Cloud Text-to-Speech API is based on the number of text characters that are synthesized into audio. For speech that is synthesized without using WaveNet, Google won’t charge anything for the first 4 million characters each month and will charge $4 per 1 million characters after that. Enterprises that want WaveNet voices will get the first 1 million characters for free each month and then will have to pay $16 for each additional million characters.
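To make the pricing concrete, here is a minimal estimate of a monthly bill using only the figures quoted above. The function monthly_cost and the tier labels "standard" and "wavenet" are hypothetical names chosen for this sketch, and it assumes usage beyond the free allowance is billed proportionally per character rather than rounded up to whole millions, which the announcement does not specify.

```python
# Figures from the announcement: free monthly allowance and price per million
# characters for each voice type. Tier names are labels chosen for this sketch.
FREE_CHARS = {"standard": 4_000_000, "wavenet": 1_000_000}
PRICE_PER_MILLION = {"standard": 4.00, "wavenet": 16.00}


def monthly_cost(characters, voice_type="standard"):
    """Estimate the monthly charge in dollars for a given character count.

    Assumes proportional billing past the free allowance (an assumption,
    not something the announcement states).
    """
    billable = max(0, characters - FREE_CHARS[voice_type])
    return billable / 1_000_000 * PRICE_PER_MILLION[voice_type]


if __name__ == "__main__":
    print(monthly_cost(10_000_000, "standard"))  # 24.0
    print(monthly_cost(10_000_000, "wavenet"))   # 144.0
    print(monthly_cost(500_000, "wavenet"))      # 0.0
```

Under these assumptions, 10 million characters a month would cost $24 with the non-WaveNet voices and $144 with WaveNet voices.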