Copyright © 2011 W3C® (MIT, ERCIM, Keio), All Rights Reserved. W3C liability, trademark and document use rules apply.
This document is the Final Report of the HTML Speech Incubator Group and presents requirements and other deliverables of the group.
This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.
This document is the 29 June 2011 draft of the Final Report for the HTML Speech Incubator Group. Comments for this document are welcomed to public-xg-htmlspeech@w3.org (archives).
This document was produced according to the HTML Speech Incubator Group's charter. Please consult the charter for participation and intellectual property disclosure requirements.
Publication as a W3C Note does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.
1 Terminology
2 Overview
3 Deliverables
3.1 Prioritized Requirements
3.2 New Requirements
3.3 Use Cases
3.4 Individual Proposals
3.5 Solution Design Agreements and Alternatives
3.6 Proposed Solutions
The key words MUST, MUST NOT, REQUIRED, SHALL, SHALL NOT, SHOULD, SHOULD NOT, RECOMMENDED, MAY, and OPTIONAL in this specification are to be interpreted as described in [IETF RFC 2119].
This document presents the deliverables of the HTML Speech Incubator Group. First, it presents the requirements developed by the group, ordered by priority of interest of the group members. Next, it covers the use cases developed by the group. Next, it briefly describes and points to the major individual proposals sent in to the group as proof-of-concept examples to help the group be aware of both possibilities and tradeoffs. It then presents design possibilities on important topics, providing decisions where the group had consensus and alternatives where multiple strongly differing opinions existed, with a focus on satisfying the high-interest requirements. Finally, the document contains (all or some of) a proposed solution that addresses the high-interest requirements and the design decisions.
The major steps the group took in working towards API recommendations, rather than just the final decisions, are recorded to act as an aid to any future standards-track efforts in understanding the motivations that drove the recommendations. Thus, even if a final standards-track document differs from any API recommendations in this document, the final standard should address the requirements, use cases, and design decisions laid out by this Incubator Group.
According to the charter, the group is to produce one deliverable, this document. The charter goes on to state that the document may include a number of subdeliverables.
The group has developed requirements, some with use cases, and has made progress towards one or more API proposals that are effectively change requests to other existing standard specifications. These subdeliverables follow.
The HTML Speech Incubator Group developed and prioritized requirements as described in the Requirements and use cases document. A summary of the results is presented below with requirements listed in priority order, and segmented into those with strong interest, those with moderate interest, and those with mild interest. Each requirement is linked to its description in the requirements document.
A requirement was classified as having "strong interest" if at least 80% of the group believed it needs to be addressed by any specification developed based on the work of this group. These requirements are:
A requirement was classified as having "moderate interest" if less than 80% but at least 50% of the group believed it needs to be addressed by any specification developed based on the work of this group. These requirements are:
A requirement was classified as having "mild interest" if less than 50% of the group believed it needs to be addressed by any specification developed based on the work of this group. These requirements are:
While discussing some of the use cases, proposals, design decisions, and possible solutions, a few more requirements were discovered and agreed to. These requirements are:
Throughout this process the group has developed many different use cases covering a variety of scenarios. These use cases were developed as part of the requirements process, through the proposals that were submitted, and in our group discussions. It is important that the proposed solutions support as many of these use cases as possible, and as easily as possible.
A Speech Command and Control Shell that allows multiple commands, many of which may take arguments, such as "call [number]", "call [person]", "calculate [math expression]", "play [song]", or "search for [query]".
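As an illustration only (this report does not itself define a script API), such a shell could be built on a recognition interface of the kind the group discussed; the SpeechRecognition name and event shape below are assumptions borrowed from later drafts, and placeCall, evaluate, showResult, playSong, and search stand in for hypothetical application code.

    // Illustrative sketch only: the SpeechRecognition interface is assumed,
    // not defined by this report.
    var recognition = new SpeechRecognition();
    recognition.continuous = true;

    // Map command patterns to handlers; the capture group carries the argument.
    var commands = [
      { pattern: /^call (.+)$/i,       run: function (m) { placeCall(m[1]); } },
      { pattern: /^calculate (.+)$/i,  run: function (m) { showResult(evaluate(m[1])); } },
      { pattern: /^play (.+)$/i,       run: function (m) { playSong(m[1]); } },
      { pattern: /^search for (.+)$/i, run: function (m) { search(m[1]); } }
    ];

    recognition.onresult = function (event) {
      var transcript = event.results[event.resultIndex][0].transcript.trim();
      for (var i = 0; i < commands.length; i++) {
        var match = commands[i].pattern.exec(transcript);
        if (match) { commands[i].run(match); return; }
      }
    };

    recognition.start();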
A use case exists around collecting multiple domain specific inputs sequentially where the later inputs depend on the results of the earlier inputs. For instance, changing which cities are in a grammar of cities in response to the user saying in which state they are located.
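A minimal sketch of this pattern, assuming a recognition interface with grammar support along the lines of later Web Speech API drafts (SpeechGrammarList, addFromString); citiesByState and fillCityField are hypothetical application code.

    // Illustrative sketch only: the recognition and grammar interfaces are assumed.
    function askForState() {
      var rec = new SpeechRecognition();
      rec.onresult = function (event) {
        var state = event.results[0][0].transcript.trim();
        askForCity(state);
      };
      rec.start();
    }

    function askForCity(state) {
      var rec = new SpeechRecognition();
      var grammars = new SpeechGrammarList();
      // Constrain the second recognition to cities in the state just recognized.
      var jsgf = '#JSGF V1.0; grammar cities; public <city> = ' +
                 citiesByState[state].join(' | ') + ';';
      grammars.addFromString(jsgf, 1.0);
      rec.grammars = grammars;
      rec.onresult = function (event) {
        fillCityField(event.results[0][0].transcript);
      };
      rec.start();
    }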
This use case is to collect free form spoken input from the user. This might be particularly relevant to an email system, for instance. When dictating an email, the user will continue to utter sentences until they're done composing their email. The application will provide continuous feedback to the user by displaying words within a brief period of the user uttering them. The application continues listening and updating the screen until the user is done. Sophisticated applications will also listen for command words used to add formatting, perform edits, or correct errors.
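A minimal sketch of continuous dictation with interim feedback, assuming a recognition interface shaped like later Web Speech API drafts; the emailBody text area is hypothetical.

    // Illustrative sketch only: the recognition interface is assumed.
    var recognition = new SpeechRecognition();
    recognition.continuous = true;      // keep listening until the user is done
    recognition.interimResults = true;  // surface hypotheses as the user speaks

    var textBox = document.getElementById('emailBody');

    recognition.onresult = function (event) {
      var finalText = '', interimText = '';
      for (var i = 0; i < event.results.length; i++) {
        var alternative = event.results[i][0];
        if (event.results[i].isFinal) { finalText += alternative.transcript; }
        else                          { interimText += alternative.transcript; }
      }
      // Display words within a brief period of the user uttering them.
      textBox.value = finalText + interimText;
    };

    recognition.start();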
Many web applications incorporate a collection of input fields, generally expressed as forms, with some text boxes to type into and lists to select from, with a "submit" button at the bottom. For example, "find a flight from New York to San Francisco on Monday morning returning Friday afternoon" might fill in a web form with two input elements for origin (place & date), two for destination (place & time), one for mode of transport (flight/bus/train), and a command (find) for the "submit" button. The results of the recognition would end up filling all of these multiple input elements with just one user utterance. This application is valuable because the user just has to initiate speech recognition once to complete the entire screen.
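As a sketch only: a real application would rely on semantic interpretation returned by the recognizer, but the crude regular expression below shows how a single utterance could fill several form fields at once. The recognition interface, field ids, and form are assumptions.

    // Illustrative sketch only: the regular expression stands in for real
    // semantic interpretation of the utterance.
    var recognition = new SpeechRecognition();

    recognition.onresult = function (event) {
      var utterance = event.results[0][0].transcript;
      var m = /from (.+) to (.+) on (.+) returning (.+)/i.exec(utterance);
      if (m) {
        document.getElementById('origin').value      = m[1];
        document.getElementById('destination').value = m[2];
        document.getElementById('departure').value   = m[3];
        document.getElementById('return').value      = m[4];
        document.getElementById('searchForm').submit();
      }
    };

    recognition.start();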
Some speech applications are oriented around determining the user's intent before gathering any specific input, and hence their first interaction may have no visible input fields whatsoever, or may accept speech input that is far less constrained than the fields on the screen. For example, the user may simply be presented with the text "how may I help you?" (maybe with some speech synthesis or an earcon), and then utter their request, which the application analyzes in order to route the user to an appropriate part of the application. This isn't simply selection from a menu, because the list of options may be huge, and the number of ways each option could be expressed by the user is also huge. In any case, the speech UI (grammar) is very different from whatever input elements may or may not be displayed on the screen. In fact, there may not even be any visible non-speech input elements displayed on the page.
Some sophisticated applications will re-use the same utterance in two or more recognition turns in what appears to the user as one turn. For example, an application may ask "how may I help you?", to which the user responds "find me a round trip from New York to San Francisco on Monday morning, returning Friday afternoon". An initial recognition against a broad language model may be sufficient to understand that the user wants the "flight search" portion of the app. Rather than make the user repeat themselves, the application simply re-uses the existing utterance for the flight search recognition.
Automatic detection of speech/non-speech boundaries is needed for a number of valuable user experiences such as "press once to talk" or "hands-free dialog". In press-once-to-talk, the user manually interacts with the app to indicate that it should start listening: for example, they raise the device to their ear, press a button on the keypad, or touch a part of the screen. When they're done talking, the app automatically performs the speech recognition without the user needing to touch the device again. In hands-free dialog, the user can start and stop talking without any manual input to indicate when the application should be listening; the application and/or browser needs to automatically detect when the user has started talking so it can initiate speech recognition. This is particularly useful for in-car use, for 10-foot usage (e.g. the living room), or for people with disabilities.
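A minimal sketch of press-once-to-talk, assuming a recognition interface with endpoint detection and a speechend event (names borrowed from later Web Speech API drafts); talkButton and handleUtterance are hypothetical.

    // Illustrative sketch only: the recognition interface and events are assumed.
    var recognition = new SpeechRecognition();

    document.getElementById('talkButton').onclick = function () {
      recognition.start();   // one manual gesture to begin listening
    };

    recognition.onspeechend = function () {
      recognition.stop();    // endpointing detects silence; no second press needed
    };

    recognition.onresult = function (event) {
      handleUtterance(event.results[0][0].transcript);  // hypothetical app handler
    };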
The application may wish to visually highlight the word or phrase that it is currently synthesizing. Alternatively, the application may wish to coordinate the synthesis with animations of an avatar speaking, or with appropriately timed slide transitions, and thus needs to know where the reading of the synthesized text currently is. In addition, the application may wish to know where in a piece of synthesized text an interruption occurred, and use the temporal feedback from the synthesizer to determine that point.
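A minimal sketch of word-level synchronization, assuming a synthesis interface with boundary events along the lines of later Web Speech API drafts; highlightWordAt and the readout element are hypothetical.

    // Illustrative sketch only: the synthesis interface and boundary event are assumed.
    var text = document.getElementById('readout').textContent;
    var utterance = new SpeechSynthesisUtterance(text);

    utterance.onboundary = function (event) {
      if (event.name === 'word') {
        // charIndex marks where in the text the synthesizer currently is, which
        // is enough to highlight the spoken word or to record where an
        // interruption occurred.
        highlightWordAt(event.charIndex);
      }
    };

    speechSynthesis.speak(utterance);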
The web page when loaded may wish to say a simple phrase of synthesized text such as "hello world".
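A minimal sketch, assuming a speechSynthesis interface of the kind later standardized (not defined by this report):

    // Illustrative sketch only: speechSynthesis is assumed.
    window.addEventListener('load', function () {
      speechSynthesis.speak(new SpeechSynthesisUtterance('hello world'));
    });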
The application can act as a translator between two individuals fluent in different languages. The application can listen to one speaker and understand the utterances in one language, can translate the spoken phrases to a different language, and then can speak the translation to the other individual.
The application reads out subjects and contents of email and also listens for commands, for instance, "archive", "reply: ok, let's meet at 2 pm", "forward to bob", or "read message". Some commands may relate to VCR-like controls of the message being read back, for instance, "pause", "skip forwards", "skip back", or "faster". Some of those controls may include controls related to parts of speech, such as "repeat last sentence" or "next paragraph".
One other important email scenario is that when an email message is received, a summary notification may be raised that displays a small amount of content (for instance the person the email is from and a couple of words of the subject). It is desirable that a speech API be present and listening for the duration of this notification, allowing a user experience of being able to say "Reply to that" or "Read that email message". Note that this recognition UI could not be contingent on the user clicking a button, as that would defeat much of the benefit of this scenario (being able to reply and control the email without using the keyboard or mouse).
This use case covers dialogs that collect multiple pieces of information, in either one turn or sequential turns, in response to synthesized prompts. Such dialogs might be about ordering a pizza or booking a flight, complete with the system repeating back the choices the user made. The dialog may well be represented by a VXML form or application that allows for control of the dialog, and the VXML dialog may be fetched using XMLHttpRequest.
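For the fetching step, a sketch using XMLHttpRequest; the dialog URL and the runDialog interpreter are hypothetical, and how the fetched VXML is executed is left to the application or platform.

    // Illustrative sketch only: the URL and interpreter are hypothetical.
    var xhr = new XMLHttpRequest();
    xhr.open('GET', '/dialogs/order-pizza.vxml');
    xhr.onload = function () {
      // responseXML assumes the dialog is served with an XML media type.
      runDialog(xhr.responseXML);
    };
    xhr.send();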
The ability to mix and integrate input from multiple modalities such as by saying "I want to go from here to there" while tapping two points on a touch screen map.
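A rough sketch of correlating taps with an utterance, assuming a recognition interface of the kind the group discussed; the map element and routeBetween helper are hypothetical.

    // Illustrative sketch only: the recognition interface is assumed.
    var mapElement = document.getElementById('map');
    var taps = [];
    mapElement.addEventListener('click', function (event) {
      taps.push({ x: event.clientX, y: event.clientY });
    });

    var recognition = new SpeechRecognition();
    recognition.onresult = function (event) {
      var transcript = event.results[0][0].transcript;
      if (/from here to there/i.test(transcript) && taps.length >= 2) {
        routeBetween(taps[0], taps[1]);  // correlate the two taps with "here"/"there"
      }
    };
    recognition.start();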
A direction service that speaks turn-by-turn directions. It accepts hands-free spoken instructions like "navigate to [address]", "navigate to [business listing]", or "reroute using [road name]". Input from the location of the user may help the service know when to play the next direction. It is possible that the user is not able to see any output, so the service needs to regularly synthesize phrases like "turn left on [road] in [distance]".
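A minimal sketch combining the Geolocation API with synthesis, assuming a speechSynthesis interface of the kind later standardized; nextManeuver is a hypothetical routing helper.

    // Illustrative sketch only: nextManeuver is hypothetical application code.
    navigator.geolocation.watchPosition(function (position) {
      var step = nextManeuver(position.coords.latitude, position.coords.longitude);
      if (step) {
        // e.g. "turn left on Market Street in 500 feet"
        speechSynthesis.speak(new SpeechSynthesisUtterance(
          'turn ' + step.direction + ' on ' + step.road + ' in ' + step.distance));
      }
    });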
The user combines speech input and output with tactile input and visual output to enable scenarios such as tapping a location on the screen while issuing an in game command like "Freeze Spell". Speech could be used either to initiate the action or as an inventory changing system, all while the normal action of the video game is continuing.
The following individual proposals were sent in to the group to help drive discussion.
This section attempts to capture the major design decisions the group made. In cases where substantial disagreements existed, the relevant alternatives are presented rather than a decision. Note that text only went into this section if it either represented group consensus or an accurate description of the specific alternative, as appropriate.
This is where design decisions regarding control of and communication with remote speech services, including media negotiation and control, will be recorded.
This is where design decisions regarding the script API capabilities and realization will be recorded.
The following glossary provides brief definitions of terms that may not be familiar to readers new to the technology domain of speech processing.
This section holds a non-exhaustive list of topics the group has yet to discuss. It is for working purposes only and will likely be removed when the report is complete.