"Voice Browser" Activity — Voice enabling the Web!

Introduction

W3C is working to expand access to the Web to allow people to interact via keypads, spoken commands, listening to prerecorded speech, synthetic speech and music. This will allow any telephone to be used to access appropriately designed Web-based services, and will be a boon to people with visual impairments and to anyone needing Web access while keeping their hands and eyes free for other things. It will also allow effective interaction with display-based Web content in cases where the mouse and keyboard may be missing or inconvenient.

To fulfil this goal, the W3C Voice Browser working group (members only) is defining a suite of markup languages covering dialog, speech synthesis, speech recognition, call control and other aspects of interactive voice response applications. VoiceXML is a dialog markup language designed for telephony applications, where users are restricted to voice and DTMF (touch tone) input. The other specifications are being designed for use in a variety of contexts, and not just with VoiceXML. Further work is anticipated on enabling their use with other W3C markup languages such as XHTML, XForms and SMIL. This will be done in conjunction with other W3C working groups, including the proposed new Multimodal working group.

Possible applications range from telephone access to Web-based information services and next-generation call centers to hands-and-eyes-free operation in automobiles.

We have set up a public mailing list for discussion of voice browsers and our work in this area. To subscribe, send an email to www-voice-request@w3.org with the word subscribe in the subject line (include the word unsubscribe if you want to unsubscribe). The archive for the list is accessible online. Note: to post a message to the list, you first need to subscribe. This is an anti-spam measure. The W3C Activity lead for voice and multimodal is Dave Raggett <dsr@w3.org>, phone: +44 1225 866 240.

This page will give you an introduction to each of the areas the working group is addressing, the plans for future work, a list of known implementations, and frequently asked questions.

Current status and plans

W3C's work on voice browsers originally started in the context of making the Web accessible to more of us, more of the time. In October 1998, W3C organized a workshop on "Voice Browsers". The workshop brought together people involved in developing voice browsers for accessing Web-based services. The workshop concluded that the time was ripe for W3C to bring together interested parties to collaborate on the development of joint specifications for voice browsers. As a response, an activity proposal and charter were written to establish a W3C "Voice Browser" Activity and Working Group (members only).

Following review by W3C members, this activity was established on 26 March 1999. The W3C staff contact, and activity lead is Dave Raggett (W3C/Openwave). The chair of the Voice Browser working group is Jim Larson (Intel).

Rechartering

The charter expired early this year. An extension was granted to continue work while a Voice Browser Patent Advisory Group (members only) was convened to review IPR associated with the working group's specifications and to make recommendations for the IPR policy for rechartering the group. Voice Browser working group participants were asked to disclose all patents that may be essential to the specifications under development by the working group. See the patent disclosure page for details.

W3C now plans to move rapidly ahead with rechartering, in conjunction with chartering a new working group for multimodal. This follows a multimodal workshop held in September 2000, which addressed the emerging importance of speech recognition and synthesis for the Mobile Web. Future work is anticipated to focus on the maintenance of VoiceXML, call control, voice browser interoperation, speech synthesis, speech grammars, and the object model for enabling the use of these specifications in combination with other W3C specifications such as XHTML, XForms and SMIL.

Work under development

This is intended to give you a brief summary of each of the major work items under development by the Voice Browser working group. The suite of specifications is known as the W3C Speech Interface Framework.

VoiceXML

The VoiceXML 2.0 specification defines an XML language for telephony applications. It is based upon extensive industry experience with VoiceXML 1.0. The principal differences are: a) VoiceXML 2.0 requires support for the W3C speech grammar and speech synthesis languages, which will ensure greater interoperability; b) the specification has been clarified in many areas and reorganized for ease of understanding. For an introduction, here is a VoiceXML 2.0 tutorial. Further tutorials and other resources can be found on the VoiceXML Forum website. The two organizations have signed a memorandum of understanding setting out the goals of both parties. A list of VoiceXML implementations is given below.
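
For illustration only (this sketch is not taken from the specification, and the grammar file name is hypothetical), a minimal VoiceXML 2.0 document collecting a single field might look something like this:

  <?xml version="1.0" encoding="UTF-8"?>
  <vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
    <form id="order">
      <field name="drink">
        <prompt>Would you like coffee, tea, or juice?</prompt>
        <!-- hypothetical speech grammar written in the W3C grammar format -->
        <grammar src="drink.grxml" type="application/srgs+xml"/>
        <filled>
          <prompt>You asked for <value expr="drink"/>.</prompt>
        </filled>
      </field>
    </form>
  </vxml>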

Speech Synthesis

The Speech Synthesis specification defines a markup language for prompting users via a combination of prerecorded speech, synthetic speech and music. You can select voice characteristics (name, gender and age) and the speed, volume, pitch, and emphasis. There is also provision for overriding the synthesis engine's default pronunciation. The specification is expected to re-enter last call, following the incorporation of feedback on the previous draft.
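
As a rough sketch (not drawn verbatim from the draft; the audio URL is a placeholder and the namespace and attribute details depend on the draft in force), a synthesis prompt combining these features might look like:

  <?xml version="1.0" encoding="UTF-8"?>
  <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis">
    <voice gender="female" age="30">
      <!-- prerecorded audio with synthesized fallback text -->
      <audio src="http://example.com/welcome.wav">Welcome to our store.</audio>
      <break time="500ms"/>
      <prosody rate="slow" volume="loud">
        Your order total is <emphasis>twenty dollars</emphasis>.
      </prosody>
    </voice>
  </speak>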

Speech Recognition

The working group is developing specifications for speech and DTMF grammars, together with the means to extract semantic results from the recognition process:

DTMF Grammars

DTMF (touch tone) input is often used as an alternative to speech recognition. It is especially useful in noisy conditions or when the social context makes it awkward to speak. The W3C DTMF grammar format allows authors to specify the expected sequence of digits, and to bind them to the appropriate results.
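
A minimal sketch of such a grammar, written in the XML form of the W3C grammar format (details may differ from the draft in force), could accept a four-digit PIN terminated by the pound key:

  <?xml version="1.0" encoding="UTF-8"?>
  <grammar version="1.0" mode="dtmf" root="pin"
           xmlns="http://www.w3.org/2001/06/grammar">
    <rule id="pin" scope="public">
      <!-- exactly four digits followed by # -->
      <item repeat="4"><ruleref uri="#digit"/></item>
      <item>#</item>
    </rule>
    <rule id="digit">
      <one-of>
        <item>0</item> <item>1</item> <item>2</item> <item>3</item> <item>4</item>
        <item>5</item> <item>6</item> <item>7</item> <item>8</item> <item>9</item>
      </one-of>
    </rule>
  </grammar>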

Speech Grammars

These allow authors to specify rules covering the sequences of words that users are expected to say in particular contexts. The Speech Recognition Grammar specification defines an XML language for context-free speech grammars. There is also a directly equivalent augmented BNF (ABNF) syntax, which some authors may find easier to work with. Some speech engines may be able to strip out ums and ahs, and to perform partial matches. Recognizers may report confidence values. If the utterance has several possible parses, the recognizer may be able to report the most likely alternatives (n-best results).
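
For illustration (a sketch rather than an excerpt from the specification), a simple grammar in the XML form might accept a drink order:

  <?xml version="1.0" encoding="UTF-8"?>
  <grammar version="1.0" xml:lang="en-US" root="drink"
           xmlns="http://www.w3.org/2001/06/grammar">
    <rule id="drink" scope="public">
      <!-- an optional preamble followed by one of the drinks -->
      <item repeat="0-1">I would like</item>
      <one-of>
        <item>coffee</item>
        <item>tea</item>
        <item>orange juice</item>
      </one-of>
    </rule>
  </grammar>

The equivalent rule in the ABNF syntax would be written along the lines of $drink = [I would like] (coffee | tea | orange juice);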

Stochastic (N-Gram) Language Models

In most cases, the prompts are very carefully designed to encourage the user to answer in a form that matches context-free grammar rules. In some applications it is appropriate to use open-ended prompts ("How can I help you?"). In these cases, context-free grammars are unwieldy. The solution is to use a stochastic language model. Such models specify the probability that one word occurs following certain others. The probabilities are computed from a corpus of utterances collected from many users. W3C's work in this area has been given a lower priority compared to other work.
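
As a simple illustration of the idea (not taken from the specification), a trigram model estimates the probability of a word given the two preceding words from corpus counts, for example:

  P(w3 | w1 w2) ≈ count(w1 w2 w3) / count(w1 w2)
  e.g. P("help" | "can I") ≈ count("can I help") / count("can I")

In practice these estimates are smoothed so that word sequences absent from the corpus do not receive zero probability.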

Semantic Interpretation

The recognition process matches an utterance to a speech grammar, building a parse tree as a byproduct. W3C has been working on two approaches to harvesting semantic results from the parse tree. The first approach involves annotating grammar rules with semantic interpretation tags. These are expressed in a syntax based upon a subset of ECMAScript, and when evaluated, yield a value that can be held in an ECMAScript variable. For example, the user utterance:

"I would like a medium coca cola and a large pizza with pepperoni and mushrooms."

could be converted to the following semantic result:

{
  drink: {
    beverage: "coke",
    drinksize: "medium"
  },
  pizza: {
    number: "1",
    pizzasize: "large",
    topping: [ "pepperoni", "mushrooms" ]
  }
}
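
As a rough sketch of how the first approach works (the precise tag syntax varied between drafts, so the assignment style shown here is only an assumption), a grammar rule could attach a semantic tag to each alternative:

  <rule id="drinksize">
    <one-of>
      <!-- ECMAScript-style assignments are illustrative; see the draft for the exact syntax -->
      <item>small <tag>drinksize="small"</tag></item>
      <item>medium <tag>drinksize="medium"</tag></item>
      <item>large <tag>drinksize="large"</tag></item>
    </one-of>
  </rule>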

The second approach represents the result in XML, and is intended to be directly compatible with W3C's work on XForms. This work has been proceeding at a lower priority than the first approach.

Pronunciation Lexicon

Application developers sometimes need the ability to tune speech engines, whether for synthesis or recognition. W3C is developing a markup language for an open portable specification of pronunciation information using a standard phonetic alphabet. The most commonly needed pronunciations are for proper nouns such as surnames or business names. This work has been given a lower priority compared to other work items.
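
No lexicon markup had been published at the time of writing, so the following sketch is purely hypothetical; the element and attribute names are invented and serve only to illustrate the kind of information such a lexicon would carry:

  <!-- hypothetical markup: element and attribute names are illustrative only -->
  <lexicon xml:lang="en-US" alphabet="ipa">
    <entry word="Acme" pronunciation="ˈækmi"/>
  </lexicon>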

Call Control

W3C is working on markup to enable fine-grained control of speech (signal processing) resources and telephony resources in a VoiceXML telephony platform. The scope of these language features is controlling resources in a platform on the network edge, not building network-based call processing applications in a telephone switching system, or controlling an entire telecom network. These components are designed to integrate naturally with existing language elements for defining applications which run in a voice browser framework. This will enable application developers to use markup to perform call screening, whisper call waiting, call transfer, and more. Users can be offered the ability to place outbound calls, conditionally answer calls, and initiate or receive further communications such as another call.
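
The detailed call control markup is still being defined, but VoiceXML already provides a simple transfer element. As a hedged illustration (the destination number is a placeholder), a bridged transfer to a customer service agent might be written as:

  <form id="agent">
    <transfer name="callagent" dest="tel:+1-800-555-1234" bridge="true">
      <prompt>Please wait while I connect you to an agent.</prompt>
    </transfer>
  </form>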

Voice Browser Interoperation

Call control can be used to transfer a user from one voice browser to another on a completely different machine, perhaps in another continent. This work item is focussing on mechanisms to transfer application state such as a session identifier along with the user's audio connections. In a related scenario, the user could start with a visual interaction on a cell phone and follow a link to switch to a VoiceXML application. The ability to transfer a session identifier makes it possible for the Voice Browser application to pick up user preferences and other data entered into the visual application. Finally, the user could transfer from a VoiceXML application to a customer service agent. The agent needs the ability to use their console to view information about the customer, as collected during the preceding VoiceXML application. The ability to transfer a session identifier can be used to retrieve this information from the customer database.

Work no longer under development

Further work on multimodal interaction has been suspended and will be resumed in a new multimodal working group, which W3C plans to set up in the near future. The Voice Browser working group was unable to reach a consensus on developing a specification for reusable dialog components. However, it is very much hoped that the development community will provide libraries written in VoiceXML for such common tasks as entering credit card details, postal addresses, telephone numbers and so forth.

Frequently asked questions

Far more people today have access to a telephone than have access to a computer with an Internet connection. In addition, sales of cellphones are booming, so that many of us have already or soon will have a phone within reach wherever we go. Voice Browsers offer the promise of allowing everyone to access Web-based services from any phone, making it practical to access the Web any time and anywhere, whether at home, on the move, or at work.

It is common for companies to offer services over the phone via menus traversed using the phone's keypad. Voice Browsers offer a great fit for the next generation of call centers, which will become Voice Web portals to the company's services and related websites, whether accessed via the telephone network or via the Internet. Users will be able to choose whether to respond by a key press or a spoken command. Voice interaction holds the promise of naturalistic dialogs with Web-based services.

Voice browsers allow people to access the Web using speech synthesis, pre-recorded audio, and speech recognition. This can be supplemented by keypads and small displays. Voice may also be offered as an adjunct to conventional desktop browsers with high resolution graphical displays, providing an accessible alternative to using the keyboard or screen, for instance in automobiles where hands/eyes free operation is essential. Voice interaction can escape the physical limitations on keypads and displays as mobile devices become ever smaller.

Hitherto, speech recognition and spoken language technologies have, for the most part, had to be handcrafted into applications. The Web offers the potential to vastly expand the opportunities for voice-based applications. The Web page provides the means to scope the dialog with the user, limiting interaction to navigating the page, traversing links and filling in forms. In some cases, this may involve the transformation of Web content into formats better suited to the needs of voice browsing. In others, it may prove effective to author content directly for voice browsers.

Information supplied by authors can increase the robustness of speech recognition and the quality of speech synthesis. Text to speech can be combined with pre-recorded audio material in an analogous manner to the use of images in visual media, drawing upon experience with radio broadcasting. The lessons learned in designing for accessibility can be applied to the broader voice browsing marketplace, making it practical to deliver services to a wide range of platforms.

Q1. Why not just use HTML instead of inventing a new language for voice-enabled web applications?

A1. HTML was designed as a visual language with emphasis on visual layout and appearance. Voice interfaces are much more dialog oriented, with emphasis on verbal presentation and response. Rather than bloating HTML with additional features and elements, new markup languages were designed specifically for speech dialogs.

Q2. How does the W3C Voice Browser Working Group relate to the VoiceXML Forum?

A2. The VoiceXML Forum developed the dialog language VoiceXML 1.0, which it submitted to the W3C Voice Browser Working Group. The Voice Browser working group used that specification as a model for VoiceXML 2.0. In addition, the Voice Browser Working Group has augmented VoiceXML 2.0 with the Speech Recognition Grammar Markup Language and the Speech Synthesis Markup Language. The VoiceXML Forum provides educational, marketing, and conformance testing services. The two groups have a good working relationship, and work closely together to enhance the ability of developers to create web-based voice applications. Both organizations have signed a memorandum of understanding setting out the goals of both parties.

Q3. What are the differences between VoiceXML, VXML, VoXML, and all the other voice markup languages?

A3. Historically, different speech companies created their own voice markup languages with different names. As companies integrated languages together, new names were given to the integrated languages. IBM's original language was SpeechML. AT&T and Lucent both had a language called PML (Phone Markup Language), but each had a different syntax. Motorola's original language was VoxML. IBM, AT&T, Lucent, and Motorola formed the VoiceXML Forum and created VoiceXML (briefly known as VXML). HP Research Labs created TalkML. The World Wide Web Consortium Voice Browser Working Group has specified VoiceXML 2.0, based upon extensive industry experience with VoiceXML 1.0.

Q4. Will WAP and VoiceXML ever be integrated into a single language for specifying a combined verbal/visual interface?

A4. Different standards bodies defined the Wireless Markup Language (WML) and VoiceXML 2.0. A joint W3C/WAP workshop was held in September 2000 to address this question. Some difficult obstacles to integration were identified, including differences in architecture (WAP is a client-based browser, VoiceXML is a server-based browser), as well as differences in language philosophy and style. The workshop adopted the "Hong Kong Manifesto", which basically states that a new W3C working group should be created to address this problem and coordinate activities to specify a multimodal dialog markup language supporting both visual and verbal user interfaces. The W3C Voice Browser Working Group has also approved the "Hong Kong Manifesto." We anticipate that a new working group will be organized in the next few months.

Q5. What is the difference between VoiceXML 2.0 and SMIL?

A5. Synchronized Multimedia Integration Language (SMIL, pronounced "smile") is a presentation language that coordinates the presentation of multiple visual and audio outputs to the user. VoiceXML 2.0 coordinates input from the user and output to the user. Eventually the presentation capabilities of SMIL should be integrated with the output capabilities of VoiceXML 2.0.

Q6. Where can I find specifications of the W3C Voice Browser Activity and how do I provide feedback to the W3C Voice Browser Working Group?

A6. The page to look at is http://www.w3.org/Voice/. You can find links to all of the published drafts and additional background material. Comments and feedback may be e-mailed to www-voice@w3.org, but you have to subscribe first (an anti-spam measure).

Q7. What speech applications cannot currently be supported by the W3C Speech Interface Framework?

A7. While the W3C Speech Interface Framework and its associated languages support a wide range of speech applications in which the user and computer speak with each other, there are several specialized classes of applications requiring greater control of the speech synthesizer and speech recognizer than the current languages support. The Speech Grammar Markup Language does not currently support the fine granularity necessary for detecting speech disfluencies in disabled or foreign-language speakers, which may be required for "learn to speak" applications. There are currently no mechanisms to synchronize a talking head with synthesized speech. The Speech Synthesis Markup Language is not able to specify melodies for applications in which the computer sings. We consider the Natural Language Semantics work a first step towards specifying the semantics of dialogs. Because there are no context or dialog history databases defined, extra mechanisms must be supplied to do advanced natural language processing. Speaker identification and verification and advanced telephony commands are not yet supported in the W3C Speech Interface Framework. Developers are encouraged to define objects that support these features.

Q8. When developing an application, what functions and features belong in the application and what functions and features belong in the browser?

A8. A typical browser implements a specific set of features. We discourage developers from reimplementing these features within the application. New features should be implemented in the application. If and when several applications implement a new feature, the Working Group will consider placing the feature in a markup language specification, and encouraging updates to browsers that incorporate the new feature. We discourage developers from creating downloadable browser enhancements because some browsers may not be able to accept downloads, especially browsers embedded into small devices and appliances.

Q9. What is the relationship between VoiceXML 2.0 and programming languages such as Java and C++?

A9. Objects may be implemented using any programming language.

Q10. How has the voice browser group addressed accessibility?

A10. The voice browser group's work on speech synthesis markup language brings the same level of richness to synthesized aural presentations that users have come to expect with visual presentations driven by HTML. In this respect, our work picks up from the prior W3C work on Aural CSS. Next, our work on making speech interfaces pervasive on the WWW has an enormous accessibility benefit; speech interaction enables information access to a significant percentage of the population that is currently disenfranchised.

As the voice browser group, our focus has naturally been on auditory interfaces, and hence all of our work has a positive impact on the user group facing the most access challenges on the visual WWW today, namely blind and low-vision users. At the same time we are keenly aware of the fact that the move to information access via the auditory channel raises access challenges for users with hearing or speaking impairments. For a hearing-impaired user, synthesized text should be displayed visually. For a speaking-impaired user, verbal responses may instead be entered via a keyboard.

Finally, we realize that every individual is unique in terms of his or her abilities; this is likely to become key as we move towards multimodal interfaces which will need to adjust themselves to the user's current environment and functional abilities. Work on multimodal browsing will address this in the context of user and device profiles.

Q11. Are there IP issues associated with VoiceXML 2.0?

A11. Yes. Some members of the Voice Browser Working Group may have IP claims. Every member of the Voice Browser Working Group is required to make a disclosure statement regarding its IP claims relevant to essential technology for Voice Browsers. You can review these statements at http://www.w3.org/2001/09/voice-disclosures.html

Q12. How will patent policy issues affect future work on VoiceXML?

A12. W3C is currently working on revising its patent policy. If the W3C adopts a patent policy that precludes work on specifications encumbered by patents requiring royalty fees, and if patent holders aren't prepared to waive royalty fees for patents relating to the VoiceXML specification, then W3C would be forced to stop further work on VoiceXML.

Note: W3C does not take a position regarding the validity or scope of any intellectual property right or other rights that might be claimed to pertain to the implementation or use of the technology, nor the extent to which any license under such rights might or might not be available. Copyright of WG deliverables is vested in the W3C.

Q13. Who has implemented VoiceXML interpreters?

A13. Several vendors have implemented VoiceXML 1.0 and are extending their implementations to conform with the markup languages in the W3C Speech Interface Framework. To be listed here, the implementation must be working and available for use by developers. Vendors are listed in alphabetical order:

The BeVocal Café is a voice-developer site providing a VoiceXML 1.0 interpreter, written entirely in Java, and with many pre-tuned grammars and professional audio — including nationwide U.S. street addresses.

General Magic, http://www.generalmagic.com, has also implemented a version of VoiceXML 1.0.

HeyAnita's implementation of VoiceXML 1.0 is now available for use by developers, and offers full interactive debugging support. For more detail see HeyAnita's FreeSpeech Developer Network.

The IBM Voice Server SDK Beta Program is based on VoiceXML Version 1.0 and is available at http://www.alphaworks.ibm.com/tech/voiceserversdk.

Motorola has the Mobile Application Development Toolkit (MADK), a freely downloadable software development kit that supports VoiceXML 1.0 (as well as WML and VoxML). See http://www.motorola.com/MIMS/ISG/spin/mix/.

Nuance offers graphical VoiceXML development tools, a Voice Site Staging Center for rapid prototyping and testing, and a VoiceXML-based voice browser to developers at no cost. See the Nuance Developer Network at http://extranet.nuance.com/developer/ to get started.

The Open VXI VoiceXML interpreter is a portable open source library that interprets the VoiceXML dialog markup language. It is designed to serve as a reference for parties interested in understanding how VoiceXML markup might be executed. Open VXI is now a SourceForge project.

PIPEBEACH offers speechWeb, a carrier-class VoiceXML platform, and a speechWeb Application Partner Program including developer's site, tutorials and access to speechWeb systems for application verification. For more information visit http://www.pipebeach.com.

Telera (http://www.telera.com) offers a VoiceXML 1.0 browser with call control extensions as part of their Voice Web Central Office (VWCO) product offering. Telera has applied the experience gained in running a Telera XML (TXML) based service network for over two years to provide a reliable and scalable Voice Web Infrastructure that allows business application developers to write their applications in VoiceXML in addition to TXML.

Tellme Studio allows anyone to develop their own voice applications and access them over the phone just by providing a URL to their content. Visit http://studio.tellme.com to begin. The Tellme Networks voice service is built entirely with VoiceXML. Call 1-800-555-TELL to try this service.

VoiceGenie has sponsored a developer challenge in association with VoiceXMLCentral, a VoiceXML virtual community and search engine. For details on Voice Genie's developer support, see: http://developer.voicegenie.com.

W3C Staff Contact

Dave Raggett <dsr@w3.org>, W3C, (on assignment from Openwave)

Copyright © 1998-2001 W3C (MIT, INRIA, Keio), All Rights Reserved. W3C liability, trademark, document use and software licensing rules apply. Your interactions with this site are in accordance with our public and Member privacy statements. This page was last updated on 21 September 2001.