Companies that want to broaden the range of features in telephony-based applications now have more options through speech recognition, but the new technology will require the development of greater levels of expertise.
With Version 2.1 of VXML (Voice XML), the World Wide Web Consortium is about to standardize a number of improvements that consolidate several common vendor implementations of features that previously weren't inside the scope of VXML. This will allow developers to build more robust applications and handle exceptions more effectively. Version 2.1 of VXML is currently under final review by the W3C (www.w3c.org/voice).
When organizations evaluate IVR (Interactive Voice Response) systems, one of the biggest choices they have is whether to build their speech platform on products based on VXML or those based on Microsoft Corp.'s competing specification, SALT (Speech Application Language Tags).
VXML, Version 2.0 of which was ratified in October 2001, is the better choice for building a richer application with deeper hooks into underlying corporate applications and data. SALT, established in August 2002, gives companies a way to leverage more-common programming skills and eventually to build IVR applications that extend beyond the telephone.
VXML and SALT are poised to significantly change the way telephony-based applications work by combining speech recognition, DTMF (dual-tone multifrequency) and TTS (text-to-speech) technologies in a single application design-and-development paradigm. Whereas early-generation IVR and touch-tone-based applications required developing to specific telephony hardware, VXML and SALT abstract development using high-level Web application markup languages.
VXML and SALT deliver the same capability at their cores: They separate the speech interface from business logic and data. Both define how an application will manage interaction between the user and application by determining voice interaction through the use of grammars and telephone keypad input through DTMF.
Both borrow from Web application development: Pages built with XML (for VXML) or with tags embedded in a Web page (for SALT) define how the application will flow and validate speech or touch-tone input from the user.
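The separation both specifications aim for can be illustrated with a minimal Python sketch. It is not VXML or SALT code; the grammar contents, carrier codes and function names are hypothetical, chosen only to show a grammar validating speech or keypad input before business logic ever sees it.

```python
# Hypothetical grammar for one dialog state: the spoken phrases and
# keypad digits the application accepts, mapped to a carrier code.
AIRLINE_GRAMMAR = {
    "speech": {"united": "UA", "delta": "DL", "american": "AA"},
    "dtmf": {"1": "UA", "2": "DL", "3": "AA"},
}


def interpret(grammar, mode, raw_input):
    """Return the value the grammar maps the input to, or None if no match."""
    normalized = raw_input.strip().lower()
    return grammar.get(mode, {}).get(normalized)


def book_airline(code):
    """Stand-in for business logic; it never sees raw speech or key presses."""
    return f"Booking flight on carrier {code}"


def handle_turn(mode, raw_input):
    """Validate one user turn against the grammar, then call business logic."""
    code = interpret(AIRLINE_GRAMMAR, mode, raw_input)
    if code is None:
        return "Sorry, I did not understand. Please try again."
    return book_airline(code)
```

Because the grammar is plain data, the same `book_airline` logic serves a caller who says "Delta" and one who presses 2.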
From an architectural perspective, SALT differs from VXML in that it defines a way to build multimodal applications so that a voice application can exist in another form through another interface, such as a Web browser.
A more practical difference is the tight integration between SALT and Microsoft developer tools, namely Visual Studio .Net. This integration will allow anyone familiar with Visual Basic or other Visual Studio development languages to write a speech application. (This can be seen in applications such as Pronexus Inc.'s VBSALT, reviewed here, which can be used to accelerate the development of speech applications in Visual Basic.)
VXML, on the other hand, is supported by most of the traditional IVR and telephony application vendors, giving companies broader platform and development language choices.
VXML organizes an application into a set of documents that define how the application works through dialog states, typically consisting of menus and forms. Grammars then function as the input an application expects for each dialog state, using either speech or DTMF input and organized as a list of valid responses in a file.
These documents are written in XML, so developers can define how a user will interact with an application. For example, in a travel-booking application, a developer might build a set of documents that will step the user through the processes and subdialogs necessary to capture information about travel times, airlines, frequent-flier data and payment. VXML lets the developer choose speech or DTMF input as appropriate while providing a way for the user to make choices within a grammar the application understands.
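To make the travel-booking example concrete, here is a rough sketch of what one such document might look like, wrapped in a few lines of Python that list the steps it defines. The form ID, field names, prompts and grammar file names are all invented for illustration; only the element names and the namespace follow the VoiceXML 2.0 specification.

```python
import xml.etree.ElementTree as ET

VXML_NS = {"v": "http://www.w3.org/2001/vxml"}

# Illustrative VoiceXML document; every id, name and file name is hypothetical.
TRAVEL_DOC = """<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="book_trip">
    <field name="airline">
      <prompt>Which airline would you like?</prompt>
      <grammar src="airlines.grxml" type="application/srgs+xml"/>
    </field>
    <field name="departure_time">
      <prompt>What time would you like to leave?</prompt>
      <grammar src="times.grxml" type="application/srgs+xml"/>
    </field>
    <field name="frequent_flier" type="digits">
      <prompt>Say or key in your frequent-flier number.</prompt>
    </field>
    <subdialog name="payment" src="payment.vxml"/>
  </form>
</vxml>"""


def outline(vxml_text):
    """List (form id, element kind, name, input source) for each step."""
    root = ET.fromstring(vxml_text)
    steps = []
    for form in root.findall("v:form", VXML_NS):
        for item in form:
            kind = item.tag.split("}")[-1]  # drop the namespace prefix
            if kind not in ("field", "subdialog"):
                continue
            grammar = item.find("v:grammar", VXML_NS)
            if grammar is not None:
                source = grammar.get("src")  # external grammar file
            else:
                # built-in field type, or the document a subdialog points to
                source = item.get("type") or item.get("src")
            steps.append((form.get("id"), kind, item.get("name"), source))
    return steps


for step in outline(TRAVEL_DOC):
    print(step)
```

Each field pairs a prompt with the grammar that constrains the answer, and the subdialog hands payment off to a separate document.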
SALT uses a variety of elements to organize applications, with Prompt, Listen and DTMF defining the flow of the applications.
Prompt defines how the application queries the user for input. Its content can range from text files and variables, which are converted to speech, to actual audio files.
Listen controls the speech-based input for the application or grammars, which can be referenced in line in the application or in a separate file. Listen supports a number of tags for controlling the interaction of the speech input with the underlying application logic. It can also be used to capture speech input to help diagnose faults in the application or speech recognition engine for developers tuning the application.
The DTMF element functions similarly to Listen in that it captures keypad input for the application.
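A rough sketch shows how the three elements might sit together in a page. The element IDs, grammar file names and the text box are hypothetical, and the namespace URI and bind attributes are written from memory of the SALT 1.0 specification, so treat them as assumptions to check against the spec. The Python wrapper only lists what the page declares.

```python
import xml.etree.ElementTree as ET

# Assumed SALT namespace URI; verify against the SALT 1.0 specification.
SALT_NS = "http://www.saltforum.org/2002/SALT"

# Illustrative page fragment; ids, file names and the text box are hypothetical.
SALT_PAGE = """<html xmlns:salt="http://www.saltforum.org/2002/SALT">
  <body>
    <input name="txtCity" type="text"/>
    <salt:prompt id="askCity">Which city are you flying to?</salt:prompt>
    <salt:listen id="recoCity">
      <salt:grammar src="cities.grxml"/>
      <salt:bind targetelement="txtCity" value="//city"/>
    </salt:listen>
    <salt:dtmf id="dtmfCity">
      <salt:grammar src="city_digits.grxml"/>
      <salt:bind targetelement="txtCity" value="//city"/>
    </salt:dtmf>
  </body>
</html>"""


def salt_elements(page_text):
    """Return (element, id, grammar src) for each prompt, listen and dtmf tag."""
    root = ET.fromstring(page_text)
    prefix = "{" + SALT_NS + "}"
    found = []
    for element in root.iter():  # document order
        if not element.tag.startswith(prefix):
            continue
        local = element.tag[len(prefix):]
        if local in ("prompt", "listen", "dtmf"):
            grammar = element.find(prefix + "grammar")
            src = grammar.get("src") if grammar is not None else None
            found.append((local, element.get("id"), src))
    return found


for entry in salt_elements(SALT_PAGE):
    print(entry)
```

In this sketch the listen and dtmf elements each carry their own grammar but bind their result to the same text box, which is how spoken and keypad input can feed one field.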
Microsoft has designed SALT to work in conjunction with other interfaces. This lets developers reuse elements of an application on a variety of interfaces, from a Web browser to a PDA.
In one of these multimodal applications, SALT tags would be embedded directly in a Web page. The user accessing that page from a PC would be able to interact with the page using traditional input or speech that is recognized locally. On a less powerful device, such as a PDA, users could interact via speech that is recognized on the server using a server-based speech recognition engine in much the same way telephone-based access to that application handles speech input.
There is an effort under way to bring SALT into the VXML sphere. Microsoft is part of the W3C's Voice Browser Working Group, as are vendors such as Intel Corp. that support both VXML and SALT platforms. Version 3 of VXML is expected to include elements of SALT. The first working draft of VXML 3 is expected at the end of next year; a final version of the standard is slated for 2007.
Technical Analyst Michael Caton can be reached at [email protected]