
VOICEXML DISTRIBUTES VOICE TO THE MASSES

( 01 Apr 2002 )
Nicholas Cravotta, Technical Editor



Today's cell phones, PDAs, and laptops are more versatile than ever. You can use them to surf the Web, reading the limited content available on tiny, monochrome screens. Some support limited voice recognition, but verbally navigating visual interfaces when you can't see the screen is awkward. Finally, you must have a special device—with a charged battery—to access such Web services as information retrieval, including stock quotes, scores, news, and e-mail; e-transactions; or telephony services. But, with VoiceXML (Voice Extensible Markup Language), vendors are now offering these services to users from any phone.

The crafters of VoiceXML designed the spec to build on the existing Web infrastructure. Hence, VoiceXML is a markup language for writing pages, or documents, that you "read" (that is, listen to) through a voice browser. VoiceXML's distributed-processing model is the key to its power. Web browsers on PCs, for example, run on the client, requiring the client to have enough processing resources to handle HTML (Hypertext Markup Language), JavaScript, text, images, formatting, I/O, and the like. Processing for VoiceXML applications, on the other hand, takes place on the server side. The voice browser, or gateway, converts human speech or DTMF (dual-tone, multifrequency) touch-tone key presses to a digital format and converts digital data into synthesized or prerecorded human speech. In other words, whereas standard Web access requires a mini PC, voice Web access requires no more than a standard "dumb" phone, even one with a rotary dial. Such an architecture can have a tremendous impact on how much technology you need to voice-enable handheld devices.

People often refer to current voice systems, such as those that airlines deploy to provide up-to-date flight-status information, as IVR (interactive-voice-response) systems, which manufacturers often base on proprietary architectures and formats. In contrast, VoiceXML's developers wanted to create a standard for developing a voice application that enables reuse of hardware and application software across varied applications. Toward that end, the technology separates application development from resource deployment; that is, a host processor or a DSP farm transparent to the application can perform the actual voice recognition. Hence, any VoiceXML document can run on any voice browser on any server using any voice-recognition engine.

Thus, distributed processing has another ramification: In the past, service creation (the application) and service hosting/control (the equipment and middleware) depended on each other and required a single company or group of cooperating companies to produce a turnkey product. (Turnkey products encounter interesting bottlenecks when users need to add or modify services or when departments within a company want to use the same system but offer different services.) Using VoiceXML, a voice browser/gateway provides the link between various resources, allowing independent creation and management of each resource (Figure 1). Taken a step further, the various components of an interactive-voice-response system can actually reside on physically different networks. Thus, access and service can become commodities. Creating a system becomes a matter of integrating access to—rather than managing—these resources. Thus, resources need not reside within one organization or even location, and organizations can independently scale them. An interactive-voice-response-service provider could conceivably own no equipment.

VOICE BROWSING

Figure 1
You can independently manage, scale, and locate the components of a VoiceXML architecture throughout a network or even across the Internet.

A voice browser serves a similar role to that of a visual browser: It accepts and "displays" documents (Figure 2). VoiceXML manages the dialogue between a human and a machine (see sidebar "Moment of silence"). With VoiceXML, an application server sends a document to a server hosting the voice browser. The documents can be either statically or dynamically generated; dynamically generated documents contain real-time information, such as current stock prices or e-mail from a database server. Each document has "prompts" that secure information from a user. After the user "fills in" the document with the required information, the voice browser sends the information it has collected back to the application server. The application server processes the responses and then provides an appropriate follow-up document.
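The document exchange described above can be sketched from the application server's side. The following Python fragment is a minimal illustration, not a reference implementation: the function name, the form and field names, and the example.com URL are hypothetical. The element names (vxml, form, field, prompt, noinput, nomatch, filled, submit) follow the VoiceXML 2.0 working draft as best understood here. The sketch omits namespace declarations and has not been validated against a voice browser.

```python
import xml.etree.ElementTree as ET


def build_flight_form(submit_url: str) -> str:
    """Return a small VoiceXML document that asks for a flight number.

    submit_url is a hypothetical application-server endpoint that
    receives the collected field when the form is filled.
    """
    vxml = ET.Element("vxml", version="2.0")
    form = ET.SubElement(vxml, "form", id="flight_status")

    # type="digits" asks the browser to use a built-in digits grammar.
    field = ET.SubElement(form, "field", name="flight_number", type="digits")
    ET.SubElement(field, "prompt").text = "What flight do you want to check?"

    # Event handlers for silence and for unrecognized input.
    ET.SubElement(field, "noinput").text = "Sorry, I did not hear you."
    ET.SubElement(field, "nomatch").text = "Sorry, I did not understand."

    # Once the field is filled, send it back to the application server.
    filled = ET.SubElement(field, "filled")
    ET.SubElement(filled, "submit", next=submit_url, namelist="flight_number")

    return ET.tostring(vxml, encoding="unicode")


if __name__ == "__main__":
    print(build_flight_form("http://example.com/flight-status"))
```

A dynamically generated follow-up document, such as one that reads back an arrival time, could be built the same way from database results.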

Like HTML documents, VoiceXML documents are all-text and contain references to nontextual data, such as links to prerecorded audio; previously compiled "grammars," which define the structure and, therefore, the style of interaction between human and computer; or other Web-based resources. VoiceXML supports event handling, such as noinput, nomatch, or help; properties, such as voice-recognition thresholds; telephony features, such as automatic number identification; and connection control, such as disconnect, transfer to a live operator (if a user has a problem the VoiceXML application can't handle), or "bridging," which adds a third party to the conversation. You can also extend VoiceXML to support special features. To accomplish this extension, you use objects, which allow the browser to access platform-specific functions, such as speaker verification, which recognizes a user by his or her voice; running a voice browser on a handheld device; or acquiring the location of a user using a device's GPS (Global Positioning System) capabilities. Using objects, however, affects the portability of an application.

Note that documents are "stateless," meaning the browser has limits on how it can act on information it collects. Additionally, the typical input device, a telephone, is stateless as well, meaning that you can't store "cookies" on it. A user may access services that require multiple sequential or related interactions. For example, a user may wish to suspend a session; transfer it to another person to finish; make corrections to a previous session; or access the session using another format, such as a visual browser. If you want to track individual users, you have to manage your own database at the application level. You may also want to provide a means for tracking general usage patterns for how users access the site. Such information can be extremely useful in determining where a site is difficult to use. For example, you may find it necessary to adjust menus to make frequently used features more accessible on an applicationwide or individual-user basis.
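Because neither the document nor the telephone holds state, any session tracking has to live on the application side. The sketch below shows one way this might look, assuming the caller's number (for example, from automatic number identification) serves as the session key. The class and method names are hypothetical, and a real deployment would presumably use a persistent database rather than an in-memory dictionary.

```python
import time
from dataclasses import dataclass, field


@dataclass
class Session:
    """Answers collected so far for one caller (hypothetical structure)."""
    caller_id: str
    answers: dict = field(default_factory=dict)
    updated_at: float = field(default_factory=time.time)


class SessionStore:
    """In-memory stand-in for an application-level session database."""

    def __init__(self):
        self._sessions = {}

    def save_answer(self, caller_id, name, value):
        # Create the session on first contact, then record the answer.
        session = self._sessions.setdefault(caller_id, Session(caller_id))
        session.answers[name] = value
        session.updated_at = time.time()
        return session

    def resume(self, caller_id):
        # Return the stored session, or None if the caller is unknown.
        return self._sessions.get(caller_id)
```

With such a store, a suspended session can be resumed later, or from another interface such as a visual browser, by looking up the same key.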


WHAT YOU SAY AND HOW YOU SAY IT
The key to a document's flexibility is its grammar. Grammars define the structure and, therefore, the style of interaction between human and computer. You can use several approaches, or styles, to define grammar (see sidebar "Dialogue styles"). In "directed" dialogues, prompts follow a defined pattern for gathering information, and vocabularies can be responses that are valid only in the context of the prompt. Directed dialogues limit a user's options; however, they reduce the complexity of creating a grammar because the user can choose only an option the dialogue currently offers. Furthermore, directed dialogues also reduce the pool of valid responses, thus easing the recognition task.

Mixed-initiative dialogues allow a user to direct the conversation. They approach natural speech, allowing the user to choose almost any command and also to provide parametric information. This approach allows a browser to collect information more quickly than do directed dialogues or basic dialogues, which list menu choices. Mixed-initiative dialogues reduce the number of interactions for collecting information and let the user talk more naturally. As a consequence, the dialogue must account for more possibilities and more variations for each possibility. Additionally, a dialogue should have variations of prompts to give users a different way of hearing the question in case the primary prompt fails to result in an appropriate response. Dialogues allow specification of multiple stages of an interaction on a single document, thus improving responsiveness by reducing the number of client/server exchanges and the delays associated with those exchanges. VoiceXML also includes mechanisms for handling common issues that arise, including unexpected user responses, such as an inappropriate "help" request; recognition errors, which might trigger a response such as "I didn't understand your response"; and platform errors when, for example, a grammar link breaks.

Figure 2
The application server accesses the database to create dynamic VoiceXML pages (a). VoiceXML pages are sent to the voice browser (b). The voice browser executes the VoiceXML document (c), prompting the user either with prerecorded audio files (d) or with speech synthesized from text (e). The user hears prompts and responds (f). Touch tones are translated (g) and verbal responses sent to a speech-recognition engine (h) with an appropriate grammar that is either part of a VoiceXML document or externally referenced using a URL (i). The browser parses the response and stores it (j) and then possibly prompts the user for additional information. When the VoiceXML document is completely filled out (k), stored responses are sent back to the application server (l) for processing. The application server may store some information in a database, serve up another page to the browser, or take action on data, such as processing a transaction.

Grammars define what words and concepts the browser can recognize, and VoiceXML defines what action to take based on the recognized speech. The drafters of VoiceXML 1.0 required no grammar format, following the reasoning that HTML specifies no image formats. However, for applications to be portable—that is, interoperate with each other—the grammars themselves must also be portable. VoiceXML 2.0, currently a working draft that its drafters made public last October, corrects this drawback by requiring support for the W3C (World Wide Web Consortium) speech-grammar and -synthesis languages. Thus, applications written for VoiceXML 1.0 using proprietary grammars may face compatibility issues with the arrival of VoiceXML 2.0. Prudent engineers will adopt the W3C formats to prepare for this changeover.

VoiceXML 2.0 also supplies built-in grammars for collecting fairly standard information formats, such as dates or numbers. Developers can also design subdialogues, functioning somewhat like subroutines, that users can invoke from a document. Subdialogues are useful for collecting multiple pieces of information that a user may not say in one utterance and for which users may want to confirm or validate responses. An example of a subdialogue is: "Are you sure you want to purchase 100,000 shares of bankruptcy.com?" Thus, applications and contexts within an application can share a subdialogue.

Grammars also play an important role in that they define the vocabulary a speech-recognition engine recognizes (see sidebar "Recognize this?"). You can dynamically compile grammars to include responses appropriate to a user and session. You should precompile some grammars, such as large name directories, to save processing and loading time. Note that a document can include a grammar or a URL pointing to one.

Grammars perform only a user-interface role. They access only that data and process only the user input that the dialogue specifies. For example, a browser does not understand whether a user is asking for a flight number or whether a flight is arriving or departing. Rather, it collects this information and sends it to the application server for processing. If a flight number does not exist, the application server determines this fact and might send a document stating this fact and asking for a new flight number. If the flight number does exist, the application server dynamically creates a document with the desired user response, such as "Flight 1346 is scheduled to land at 4:31 pm." One point easy to overlook is that voice plays a substantial role in extending the currently flat phone and online personalities of companies.

VoiceXML promises to substantially increase the number of viable connections to the Internet. By distributing the processing load, even dumb nodes can take advantage of voice browsing. The burden on handheld devices to support speech becomes less critical as speech recognition and synthesis become network commodities.

MOMENT OF SILENCE
Voice is not always the best interface. Sometimes, you need to give a password or an account number that you would rather type because of privacy concerns. Additionally, you can more easily compare data, such as prices and features, side by side on a screen than by hearing them read from a list. Thus, the VoiceXML Forum is already looking at multimodal interfaces, in which developers can pick the most appropriate interface for a task.

Segmenting individual components will speed the adoption of multimodal interfaces across applications. Several companies are developing speech-recognition modules that users can then place in diverse applications, including cars, PDAs, and cell phones. The SALT (Speech Applications Language Tags) Forum promotes such multimodal and telephony-enabled access to information, applications, and Web services from PCs, telephones, tablet PCs, and wireless PDAs. SALT will extend existing XML variants, enabling users to input data using speech, a keyboard, a keypad, a mouse, or a stylus and to output data as synthesized speech, audio, plain text, motion video, or graphics. SALT founders hope to make the specification public in the first quarter of this year and submit it to a standards body by midyear.

DIALOGUE STYLES
The three dialogue styles for VoiceXML are: basic, or menu style; directed-dialogue style; and mixed-initiative-dialogue style. Basic style lists menu choices, and users press a number or speak a choice from a limited vocabulary. For example, the browser would prompt the user, saying, "For arrival times, press or say one. For departure status, press or say two," and the user would respond, "one." The browser would then say, "Press or say the number of the flight, followed by the pound sign," and the user would respond in kind.

In directed-dialogue style, prompts follow a defined pattern for gathering information. Vocabularies are limited to valid responses in the context of the prompt. For example, a browser might ask, "Do you want to check on an arriving flight or a departing flight?" The user would respond, for example, "arriving." The browser would then ask, "What flight do you want to check?" And the user would respond appropriately.

Mixed-initiative dialogue approaches natural speech, allowing users to choose almost any command and also to provide parametric information. For example, the browser might say, "How may I help you?" The user would respond, "I want to know when flight 1346 is scheduled to land."
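The difference between the styles shows up in how much a single response can fill in. The toy Python sketch below works on already-recognized text and extracts two "slots" from one utterance, then reports which slots a follow-up prompt would still need to collect. The slot names, keyword lists, and function names are illustrative assumptions. A real VoiceXML application would express this in a speech grammar handled by the recognition engine, not in regular expressions.

```python
import re

# Illustrative keyword lists; a real grammar would be far richer.
FLIGHT_RE = re.compile(r"\bflight\s+(\d+)\b", re.IGNORECASE)
ARRIVAL_WORDS = ("land", "arriv")
DEPARTURE_WORDS = ("depart", "take off", "leave")


def parse_request(utterance: str) -> dict:
    """Fill as many slots as the utterance allows (mixed-initiative)."""
    text = utterance.lower()
    slots = {"direction": None, "flight_number": None}
    if any(word in text for word in ARRIVAL_WORDS):
        slots["direction"] = "arriving"
    elif any(word in text for word in DEPARTURE_WORDS):
        slots["direction"] = "departing"
    match = FLIGHT_RE.search(utterance)
    if match:
        slots["flight_number"] = match.group(1)
    return slots


def missing_slots(slots: dict) -> list:
    """Slots that a directed follow-up prompt still has to collect."""
    return [name for name, value in slots.items() if value is None]
```

For the mixed-initiative request "I want to know when flight 1346 is scheduled to land," both slots are filled in one turn. A directed-dialogue answer such as "arriving" fills only one, so the dialogue needs a second prompt.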

RECOGNIZE THIS?
VoiceXML offers the advantage of insulating application developers from having to understand the underlying hardware. For example, a developer need not know how a system will recognize voice, just that there is a resource available to handle the task. To some degree, speech recognition has reached the point at which increasing recognition accuracy beyond the current rate of 95 to 98% will require a quantum leap in recognition technology. However, even recognition engines that offer 95 to 98% accuracy will misrecognize one in 20 to one in 50 words, which typically amounts to at least one word per session. Fortunately, you can increase this accuracy at the system level.
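The arithmetic behind the "one in 20 to one in 50" figure can be made explicit. The sketch below assumes, purely for illustration, that word errors are independent and that accuracy is measured per word. Real recognition errors tend to cluster, so these numbers are a rough guide only. The function names are hypothetical.

```python
def expected_errors(word_accuracy: float, words: int) -> float:
    """Average number of misrecognized words in a session."""
    return words * (1.0 - word_accuracy)


def p_at_least_one_error(word_accuracy: float, words: int) -> float:
    """Chance that a session contains one or more misrecognized words,
    assuming independent errors (an illustrative simplification)."""
    return 1.0 - word_accuracy ** words


if __name__ == "__main__":
    for accuracy, words in [(0.95, 20), (0.98, 50)]:
        print(accuracy, words,
              round(expected_errors(accuracy, words), 2),
              round(p_at_least_one_error(accuracy, words), 2))
```

Under these assumptions, a 20-word session at 95% accuracy and a 50-word session at 98% accuracy each average one misrecognized word, and each has roughly a 64% chance of containing at least one error.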

Getting clearer input from the user is a key area for improvement. Many phones are used in noisy environments. Proper noise and echo cancellation at the node improves the captured signal, giving the recognition engine a better signal to work with. Unfortunately, the model for VoiceXML puts the input device itself outside the direct management of the voice system; you have to work with what you get.

Another area of distortion arises from connection quality. Cell phones, for example, compress voice input with loss before sending it over the wireless link. Bringing speech recognition closer to the edge (that is, before the compression) eliminates this problem but requires a more expensive and specialized node/phone. Distributed-speech recognition, such as the Aurora initiative that the European Telecommunications Standards Institute sponsors, aims to address this problem by using a front-end recognition engine to extract voice features that would degrade over mobile links, encode such features, and send them over the link for processing. Distributed-speech recognition also reduces the processing load on the recognition server, enabling overall processing of more voice channels. Voice recognition at the node also enables local applications, such as name dialing.

You can also increase recognition accuracy by using more intelligent tolerances. For example, reducing the acceptable vocabulary reduces the number of similar-sounding phrases and increases the chance of selecting the correct phrase. Context, such as comparing a word that a system recognizes with low confidence with words in the same sentence that it recognizes with high confidence, also improves accuracy. Additionally, the further into a dialogue a transaction proceeds, the more limited the options are and the higher the accuracy of discerning among them is.

Minimizing "dead air," or the time that a user has to wait, is also critical to successful user interaction. Although caching commonly used pages, such as a home page, and prefetching pages likely to be accessed next can reduce delay, intelligent allocation of the speech-recognition resource can also increase performance. Caching grammars before you need them and storing commonly used files, such as particular grammars, local to the gateway, also reduce latency. For recognition resources physically removed from the voice gateway, trusting Internet connections may result in unpredictable and unpreventable delays.
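Caching and prefetching grammars at the gateway can be sketched with Python's standard functools.lru_cache. The fetch function below is a stand-in that only records which URLs it was asked for. In a real gateway it would retrieve and possibly compile the grammar over the network. The function names, the cache size, and the URLs are assumptions for illustration.

```python
from functools import lru_cache

FETCH_LOG = []  # records every simulated network fetch


def fetch_grammar_from_network(url: str) -> str:
    """Stand-in for a slow network fetch and grammar compilation."""
    FETCH_LOG.append(url)
    return f"<grammar src='{url}'/>"


@lru_cache(maxsize=128)
def get_grammar(url: str) -> str:
    """Return a grammar, hitting the network only on a cache miss."""
    return fetch_grammar_from_network(url)


def prefetch(urls) -> None:
    """Warm the cache with grammars likely to be needed next."""
    for url in urls:
        get_grammar(url)
```

After a prefetch, a later request for the same grammar is served locally, so the caller does not wait on the network while the user hears dead air.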

Another issue is how you tie up a recognition resource while it is in use. A DSP board can handle a certain number of voice channels. Reserving one of these channels for use while a message is playing back to a user means that the resource sits idle during that time. If you want to support "barge in," in which a user can interrupt the voice gateway with a command at any time, you may need to leave a channel open or have a way to quickly allocate one. Freeing channels when they are not in use presents potential conflicts. Buffering voice versus using a real-time connection addresses some of these problems but could increase latency to a user-noticeable degree. Buffering, however, enables a preprocessor to remove dead space and hesitations from an input signal, thereby reducing the load on the recognition resource and increasing its effective channel capacity.
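The allocation trade-off can be illustrated with a small channel pool. This is a hypothetical sketch, not how any particular DSP board or gateway works. A call acquires a recognition channel only when it needs one and releases it during playback. If barge-in must be supported, the call would simply hold its channel through playback, at the cost of idle capacity.

```python
class ChannelPool:
    """Toy pool of recognition channels on one resource (hypothetical)."""

    def __init__(self, size: int):
        self._free = list(range(size))
        self._in_use = {}  # call_id -> channel number

    def acquire(self, call_id: str):
        """Give the call a channel, or None if all channels are busy."""
        if call_id in self._in_use:
            return self._in_use[call_id]
        if not self._free:
            return None
        channel = self._free.pop()
        self._in_use[call_id] = channel
        return channel

    def release(self, call_id: str) -> None:
        """Free the call's channel, for example while a prompt plays."""
        channel = self._in_use.pop(call_id, None)
        if channel is not None:
            self._free.append(channel)
```

The None return from acquire is where the conflicts mentioned above surface: a caller who starts speaking while every channel is busy must either be buffered or wait.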

Determining the capacity of a particular resource for an interactive-voice-response system is a complex measurement. For example, some calls are more complex than others. Recognizing a name is more difficult than differentiating between "yes" and "no." Supporting a long list of phrases that mean the same thing also increases the time it takes to complete recognition.

Support for self-monitoring in the recognition resource can yield a profile of typical system usage that is useful for optimization at all levels. Developers will be able to see how people use the system and which parts of the dialogue are more troublesome than others. IT management could more confidently and accurately measure the load on each resource and provision accordingly. Finally, engineers could track traffic through the resource, minimizing overhead and delays.


ACKNOWLEDGMENTS

Special thanks to Peter Danielsen, one of the original authors of the 1.0 spec, an editor of the 2.0 spec, and a distinguished member of the technical staff at Lucent Technologies; and to Dave Spencer, Speech Marketing Manager at Intel.


You can contact Technical Editor Nicholas Cravotta at
(1) 510-558-8906, Fax (1) 510-558-8914
E-mail ednnick@pacbell.net.

 