Speech Recognition Finally Finding Its Voice in Mobile Technology

Voice-recognition technology has made significant progress of late, becoming a popular feature of smartphones and of automotive navigation and entertainment systems. A panel of Silicon Valley tech experts says that while it still has its glitches, it could eventually improve to the point where talking to a machine is like talking to a person.

PALO ALTO, Calif. - If speech-recognition technology were human, it would be a 5- or 6-year-old child. At age 1, you can speak to a child, but you have to speak slowly and simply, using small words. By 5 or 6, the child starts to better understand your words and, more importantly, your meaning.

The comparison of computer speech development to human speech development came up during a panel discussion Aug. 20 at a forum hosted by the Churchill Club of Silicon Valley in Palo Alto, Calif. Representatives of a speech-recognition software company, an automaker and Apple co-founder Steve Wozniak discussed where speech recognition has been and where it's going.

Speech is becoming the new computer user interface, said Quentin Hardy, deputy technology editor of The New York Times and moderator of the panel, continuing a long line of UI evolution from the punch card and the command line interface to the mouse and the touch-screen.

With each advance, the interaction became less machine-like and more human. When we want to get someone's attention, we tap them on the shoulder, just as we tap on a screen, said Wozniak, and when we want to talk to someone, we speak.

"We love our computers; we love our phones. We are getting that feeling we get from another person," he said.

Speech-recognition technology has evolved from simply recognizing voice commands to understanding meaning and context, said Ron Kaplan, senior director and distinguished scientist at Nuance, whose voice-recognition technology has been licensed to Apple for use in its Siri personal assistant feature on the iPhone 4S, and to Ford Motor Co. for its MyFordTouch system, which is also based on Microsoft Sync.

"One of the enabling technological advances that makes more accurate speech recognition possible and makes more accurate understanding of intent possible, is the ability to accumulate large amounts of data from lots of user experiences and to sift and organize and build models from it," Kaplan said.

In other words, like a child, its vocabulary and understanding grow the more it hears people speak to it.
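The panelists describe this data-accumulation idea only at a high level. As a deliberately simplified illustration (a toy sketch, not Nuance's actual system), a model's vocabulary and word-frequency estimates can grow with every utterance it "hears":

```python
from collections import Counter

class ToyVocabularyModel:
    """Toy illustration: vocabulary and frequency estimates
    grow as the model accumulates more user utterances."""

    def __init__(self):
        self.counts = Counter()

    def hear(self, utterance: str) -> None:
        # Accumulate data from each user interaction.
        self.counts.update(utterance.lower().split())

    def vocabulary_size(self) -> int:
        return len(self.counts)

    def most_likely_word(self) -> str:
        # The model's "best guess" sharpens as counts accumulate.
        word, _ = self.counts.most_common(1)[0]
        return word

model = ToyVocabularyModel()
for utterance in ["call home", "call mom", "play music", "call mom now"]:
    model.hear(utterance)

print(model.vocabulary_size())   # 6 distinct words so far
print(model.most_likely_word())  # "call"
```

Real systems build far richer statistical models from this kind of data, but the principle is the same: more user experiences to sift and organize mean better predictions.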

Ford opened a lab in Silicon Valley at the beginning of the year; the unit is organized like a startup, far from the bureaucracy at Ford headquarters in Dearborn, Mich. The lab is continually working to improve the accuracy of MyFordTouch, which was introduced in 2007. Drivers use voice commands to get directions, adjust the heating and air conditioning, or change radio stations. Ford parked a 2012 Focus Electric sedan equipped with MyFordTouch in the hotel ballroom where the event was held.

Though it is constantly improving, speech recognition is complicated, said Sheryl Connelly, a futurist at Ford. Because of that complexity, drivers don't yet talk to their cars the way they talk to another person.

"We talk to it as if we're talking to a foreigner. We talk very slowly and stilted and we have unnatural pauses," Connelly said. "That's why we still see hiccups."

About 4 million Fords and Lincolns equipped with MyFordTouch are expected to be on the road by the end of this year, a number projected to reach 13 million by 2015, she said. But as such vehicles roll out across Europe and elsewhere, the system must learn to understand different languages and dialects, which adds to the complexity.

Still, Silicon Valley has emerged as a center for speech-recognition development, as it is for other technology, noted Dan Miller, senior analyst and founder of Opus Research. Over the last eight or nine years, he has seen several startups pursue speech recognition as a business. With every iteration, the goal is for the computer to understand "natural language," as when the actress Zooey Deschanel, in a Siri TV ad, simply says "Let's get tomato soup delivered" and the application understands what she means.
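The "natural language" goal the analysts describe amounts to mapping a free-form sentence onto a structured request. As a minimal sketch of the idea (a keyword-pattern matcher invented for illustration, not how Siri or any shipping assistant actually works):

```python
import re

# Hypothetical intent patterns: each maps loose phrasing to a structured action.
INTENT_PATTERNS = [
    ("order_delivery", re.compile(r"get (?P<item>.+?) delivered")),
    ("get_directions", re.compile(r"directions to (?P<place>.+)")),
]

def parse_intent(utterance: str):
    """Return (intent_name, slots) for the first matching pattern, else None."""
    text = utterance.lower().strip()
    for name, pattern in INTENT_PATTERNS:
        match = pattern.search(text)
        if match:
            return name, match.groupdict()
    return None

print(parse_intent("Let's get tomato soup delivered"))
# ('order_delivery', {'item': 'tomato soup'})
```

Production systems replace these hand-written patterns with statistical models trained on the large volumes of user data Kaplan described, which is what lets them cope with the endless variety of natural phrasing.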

"We've moved along a maturity path where ... people's expectations about what they can do with today's technologies … create its own demand," for improved speech-recognition technology, said Miller. "And then these energetic and imaginative people can come and try to fulfill that."