Can We Talk

Muddle over multimodal standard clouds future of speech recognition apps.

Call it the shock of recognition—again. Just as VoiceXML-based speech recognition applications are hitting their stride, along comes a potentially competing standard backed by—who else?—Microsoft Corp.

Since the VoiceXML 1.0 standard was put in place at the World Wide Web Consortium nearly two years ago, developers have been using the standard to incorporate speech recognition into a wide range of applications, many in customer relationship management.

But interest in applications that combine speech recognition with other forms of input such as a keyboard, stylus or keypad, called multimodal applications, has spurred several companies, including Microsoft, to back the SALT (Speech Application Language Tags) Forum. Now, two camps are emerging: SALT and the W3C's VoiceXML initiative.

In one corner stands a group led by IBM, Motorola Inc. and Opera Software ASA, which has submitted a proposal to combine VoiceXML with XHTML (Extensible HTML) to the W3C standards body. That proposal calls for developers to create multimodal applications on the same markup page, using VoiceXML for speech and XHTML for text and graphics. The proposal lays the groundwork for uniting the protocols effectively, and the W3C is expected to soon form a working group to discuss the submission.

Meanwhile, in the SALT corner are Microsoft, Cisco Systems Inc., Comverse Inc., Intel Corp., Philips Electronics NV and SpeechWorks International Inc. The SALT Forum plans to make a submission to an international standards body soon, although it hasn't yet decided whether it will be the W3C or the Internet Engineering Task Force.

Enterprises that are using speech recognition technology, perhaps with an eye to going multimodal in the future, want to see vendors work out their differences and come to a unified standard.

"As a user of the technology, our position is that we really would like to see open standards," said Joan Madden, project manager at United Parcel Service of America Inc., in Mahwah, N.J. "[VoiceXML] and XHTML seem to be moving more toward open standards."

UPS uses speech recognition technology from Nuance Communications Inc., a speech software developer. The technology enables the company to handle up to 80 percent of its customer inquiries without having to turn customers over to human representatives.

UPS is not alone.

"It can benefit us directly if they nail down a standard," said Roy Probus, reporting analyst for WebMD Corp., in Nashville, Tenn. "We can move between vendors, knowing that our system won't change."

WebMD uses speech recognition technology from Edify Corp. to automate its customer service call center. Today, customers give voice commands, and the system forwards them to the appropriate representative. In the future, WebMD plans to build an automated system that uses a knowledge base to answer questions.

Edify was among 18 speech recognition application developers to recently voice support for the SALT Forum. Yet, like many, the company is also keeping a close eye on the VoiceXML camp.

"It's not an either/or situation," said Ken Waln, chief technology officer for Edify, in Santa Clara, Calif. "We're further along in our support for SALT than we are for [VoiceXML], but we intend to support both standards as they become available."

Waln said the standards are trying to solve the same problem: adding voice capabilities as seamlessly as possible to minimize the learning curve for developers. The standard supported by IBM, Motorola and Opera combines VoiceXML and XHTML on the same markup page. SALT adds voice tags to existing Web markup languages, such as HTML and XML.

"[VoiceXML]'s further along in the voice world; SALT's further along in the multimodal world," Waln said. "But they're both trying to integrate Web and voice, and eventually, they'll converge and become one standard."

Even SpeechWorks, a speech recognition software developer and founding member of the SALT Forum, supports VoiceXML for speech recognition.

"Philosophically, I think both standards are headed in the same direction," said Rob Kassel, product manager for emerging technologies at SpeechWorks, in Boston. Kassel said SpeechWorks threw its support behind SALT after it was unimpressed by earlier efforts to combine VoiceXML with HTML for multimodal applications. He said he hasn't fully evaluated the current submission to the W3C to combine VoiceXML with XHTML.

"The SALT Forum plans to submit its standard to an international standards body. Possibly, it'll be the W3C, and, maybe then, some of these issues do get worked out," Kassel said.

Dave Raggett, a W3C fellow and the consortium's lead for voice and multimodal applications, said he's confident any differences between SALT and the VoiceXML/XHTML specification can be worked out if the SALT Forum submits its specifications to the W3C.

"There's a lot of noise out there now, but it's really just positioning," said Raggett, who helped referee the standards debate between Microsoft and Netscape Communications Corp. in 1995 that led to the adoption of universal HTML standards.

"These companies are going to have to ask themselves, 'Do we want to have strong standards or don't we?'" said Raggett, whose day job is senior architect at Openwave Systems Inc., in Herts, England. "They're going to choose to have strong standards, and the W3C is the place to do that."

That optimism is not universally shared throughout the industry.

"[The W3C] could take pieces of each, and we could move forward with a unified standard," said Matt Callan, director of corporate marketing at Nuance, in Menlo Park, Calif. Nuance has thrown its support behind the VoiceXML/XHTML group. Callan said SALT may have merits as a technology, but the IBM-led group is taking a better path by submitting to the W3C.

"There's a right way for this to be decided, and the SALT Forum isn't the right way," Callan said.

Fran Rabuck, practice leader for mobile computing at Alliance Consulting, in Philadelphia, and an eWeek Corporate Partner, agreed.

"I would prefer something that comes under W3C because, historically, it's a neutral territory," Rabuck said. "The last thing we need is more dueling standards."

That said, Rabuck said multimodal application technology is still too immature for most organizations to get too worked up about, whereas speech recognition technology is coming of age.

"The worst thing is that [the multimodal standards debate] creates confusion in the [VoiceXML] market," Rabuck said. "There's real power in voice systems that will allow you to do things now that don't require the multimodal piece."

Bern Elliot, an analyst at Gartner Inc., in Stamford, Conn., said that multimodal applications are still two years away from becoming mainstream and that the SALT initiative could become a distraction to organizations looking to deploy speech applications today.

"The SALT initiative is not going to help end users at enterprises achieve their speech application goals at this point," Elliot said. "It could slow down the progress." He called the SALT Forum's approach to this point "counterproductive."

"Usually," Elliot said, "the early stages of a standard are done quietly, without a lot of publicity. This politicizes what should be a technical investigation. What would be constructive is that the SALT and XML groups work productively in the W3C together."

To hear the respective speech gurus at IBM and Microsoft talk, though, the debate may only get messier.

"There's room in the market for multiple standards," said James Maston, group product manager for .Net speech technologies at Microsoft, in Redmond, Wash. "[VoiceXML] did a great job for its intended use in the telephony space, but our ultimate goal is to benefit the customers."

SALT, Maston said, can turn 6 million Web developers into speech-enabled application developers since it builds on Web development technologies they already know. And he said it would do it more easily than VoiceXML and XHTML.

Maston said SALT is one part of Microsoft's strategy to spur adoption of technologies that will voice-enable the Web. Another is creating a platform to deploy such technologies built on .Net.

That raises cries of foul from the IBM camp, which charges that Microsoft is just seeking another platform monopoly. "I think it's fairly obvious that they have a history of doing this," said William "Ozzie" Osborne, general manager of IBM's Voice Systems group, in Boca Raton, Fla.

Osborne said IBM supports a combination of existing standards—VoiceXML and XHTML—not a new standard. The W3C submission lays out how best to combine the two.

"As both standards [VoiceXML and XHTML] continue to improve, people can continue to use them," Osborne said. "We don't need a new standard. If you're adding [voice] tags, then you're adding syntax, which means you're changing syntax, and there's already a standard to do that. [The SALT Forum] wants to throw that all away and start all over again."

Osborne said there might be a place for SALT at the multimodal standards table, provided it works with the W3C. "We don't need to get into a 'who's better?' discussion; we don't need two standards and people developing programs for both standards," Osborne said. "Let's put everything into a multimodal working group [at the W3C] and get on with getting to one standard."

Additional reporting by Stan Gibson