Copyright © 2007 W3C® (MIT, ERCIM, Keio), All Rights Reserved. W3C liability, trademark and document use rules apply.
In 2005, 2006, and 2007 the W3C held workshops to understand the ways, if any, in which the design of SSML 1.0 limited its usefulness for authors of applications in Asian, Eastern European, and Middle Eastern languages. In 2006 an SSML subgroup of the W3C Voice Browser Working Group was formed to review this input and develop requirements for changes necessary to support those languages. This document contains those requirements.
This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.
This is the 11 June 2007 W3C Working Draft of "Speech Synthesis Markup Language Version 1.1 Requirements".
This document describes the requirements for changes to the SSML 1.0 specification required to fulfill the charter given in [Section 1.2]. This is the second Working Draft. The group does not expect this document to become a W3C Recommendation. Changes since the previous version are listed in Appendix A.
This document has been produced as part of the W3C Voice Browser Activity, following the procedures set out for the W3C Process. The authors of this document are members of the Voice Browser Working Group. You are encouraged to subscribe to the public discussion list <www-voice@w3.org> and to mail us your comments. To subscribe, send an email to <www-voice-request@w3.org> with the word subscribe in the subject line (include the word unsubscribe if you want to unsubscribe). A public archive is available online.
This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.
Publication as a Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.
This document establishes a prioritized list of requirements for speech synthesis markup which any proposed markup language should address. This document addresses both procedure and requirements for the specification development. In addition to general requirements, the requirements are addressed in separate sections on Speech Interface Framework Consistency, Token/Word Boundary, Phonetic Alphabet and Pronunciation Script, Language Category, and Name/Proper Noun Identification Requirements, followed by Future Study and Acknowledgements sections.
As a W3C standard, one of the aims of SSML (see [SSML] for description) is to be suitable and convenient for use by application authors and vendors worldwide. A brief review of the most broadly-spoken world languages [LANGUAGES] shows a number of languages that are in large commercial or emerging markets for speech synthesis technologies but for which there was limited or no participation by either native speakers or experts during the development of SSML 1.0. To determine in what ways, if any, SSML is limited by its design with respect to supporting these languages, the W3C held three workshops on the Internationalization of SSML. The first workshop [WS], in Beijing, PRC, in October 2005, focused primarily on Chinese, Korean, and Japanese languages, and the second [WS2], in Crete, Greece, in May 2006, focused primarily on Arabic, Indian, and Eastern European languages. The third workshop [WS3], in Hyderabad, India, in January 2007, focused heavily on Indian and Middle Eastern languages.
These three workshops resulted in excellent suggestions for changes to SSML, describing the ways in which SSML 1.0 has been extended and enhanced around the world. An encouraging result from the workshops was that many of the problems might be solvable using similar, if not identical, solutions. In fact, it may be possible to increase dramatically the usefulness of SSML for many application authors around the world by making a limited number of carefully-planned changes to SSML 1.0. That is the goal of this effort.
The scope for a W3C recommendation for SSML 1.1 is modifications to SSML 1.0 to
VCR-like controls are out of scope for SSML 1.1. We may discuss <say-as> (see [SAYAS]) issues that are related to the SSML 1.1 work above and collect requirements for the next document that addresses <say-as> values. We will not create specifications for additional <say-as> values but may publish a separate Note containing the <say-as> requirements specifically related to the SSML 1.1 work. We will follow standard W3C procedures.
* provided there is sufficient group expertise and contribution for these languages
The General Requirements in section 2 arose out of SSML-specific and general Voice Browser Working Group discussions. The Speech Interface Framework Consistency Requirements in section 3 were generated by the Voice Browser Working Group. The SSML subgroup developed the charter. The remaining requirements were then developed as follows:
First, the SSML subgroup grouped topics presented and discussed at the workshops (see Section 1.1) into four categories: Short-term, Long-term, Experts needed, and Other SSML work.
The following table shows how the topics were categorized. There is no implied ordering within each column.
| Short-term (group agrees to work on this) | Long-term (revisit after the short-term work to determine whether it belongs in this group) | Experts needed (in order to decide whether to work on this in this subgroup) | Other SSML work (SSML 2.0 or later, <say-as> Note, etc.) |
|---|---|---|---|
| Token/word boundaries | Tones | Providing number, case, gender agreement info | Special words |
| Phonetic alphabets | Expand Part-Of-Speech support | Syllable markup | Tone sandhi |
| Verify that RFC3066 language categories are complete enough that we do not need anything new beyond xml:lang to identify languages and dialects | Text with multiple languages (changing xml:lang without changing voice; separately specifying language of content and language to speak) | Diacritics, SMS text, simplified/alternate text | Enhance prosody rate to include "speech units per time unit", where speech units would be syllable, mora, phoneme, foot, etc. and time units would be seconds, ms, minutes, etc. (would address mora/sec request) |
| Chinese names (say-as requirements) | | Sub-word unit demarcation and annotation | Background sound (may be handled best by VoiceXML 3 work) |
| Ruby | | Transliteration | Expressive elements |
| | | Sentence structure | |
Next, for each topic in the Short-term list, we developed one or more problem statements. Where applicable, the problem statements have been included in this document.
We then generated requirements to address the problem statements.
It is interesting to note that the three Long-term topics have been addressed by the requirements developed while working on the Short-term topics: Tones are addressed via the pronunciation alphabets, Part-Of-Speech support may be at least partially addressed via requirement 4.2.3, and Text with multiple languages is being addressed as part of the language category requirements.
The topics in the remaining two categories (Experts needed and Other SSML work) are listed and briefly described in the Future Study section.
SSML 1.1 should be backwards compatible with SSML 1.0 except where modification is necessary to satisfy other requirements in this document.
SSML 1.1 may use Internationalized Resource Identifiers [RFC3987] instead of URIs.
This section must include requirements that make SSML consistent with the other Speech Interface Framework specifications, including VoiceXML 2.0/2.1, PLS, SRGS, and SISR in both behavior and syntax, where possible.
SSML must support the maxage and maxstale attributes for the <audio> element as supported in VoiceXML 2.1.
SSML lacks these attributes, so it is not clear how SSML enforces (or even has) a caching model for audio resources.
SSML must support the maxage and maxstale attributes for the <lexicon> element.
SSML should provide a mechanism for an author to set default values for the maxage and maxstale attributes.
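For illustration only, the attributes could mirror the VoiceXML 2.1 syntax, with values given in seconds. The document below is a sketch: the maxage/maxstale attributes on <audio> and <lexicon>, and the document-level defaults on <speak>, are proposals rather than existing SSML 1.0 syntax.

  <!-- maxage/maxstale on <speak> are a hypothetical way to set document-wide defaults -->
  <speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
         xml:lang="en-US" maxage="300" maxstale="30">
    <!-- proposed: accept a cached copy up to 60 seconds old, or up to 10 seconds stale -->
    <lexicon uri="http://www.example.com/names.pls" maxage="60" maxstale="10"/>
    <audio src="http://www.example.com/greeting.wav" maxage="60" maxstale="10">
      Welcome!
    </audio>
  </speak>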
SSML should provide error messages that include detail about the failure.
SSML 1.0 defines error behavior [SSML §1.5] only as "Error Results are undefined. A conforming synthesis processor may detect and report an error and may recover from it." Note that in the case of an <audio> element where there is a protocol error fetching the URI resource, or where the resource cannot be played, VoiceXML might log this information in its session variables. The error information likely to be required includes the URI itself, the protocol response code, and a reason (textual description). It is expected that the SSML processor would recover from this error (play fallback content if specified, or ignore the element).
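The recovery behavior described above can be illustrated with SSML 1.0's existing fallback content for <audio>; the error detail itself (URI, response code, reason) would be surfaced through the hosting environment, for example VoiceXML session variables, rather than in the markup:

  <!-- If greeting.wav cannot be fetched or played, the processor recovers by speaking
       the fallback text; the URI, protocol response code, and reason are the detail
       this requirement asks to be reported. -->
  <audio src="http://www.example.com/greeting.wav">
    Welcome to the service.
  </audio>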
The <audio> element should be extended with a type attribute to indicate the media type of the resource referenced by the URI. The attribute may be used by the synthesis processor as a hint when fetching and rendering the resource.
The handling of the requested type versus an authoritative type returned by a protocol would follow the same approach described for the type in <lexicon> [SSML Section 3.1.4]. On a type mismatch, the processor should play the audio if it can.
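A sketch of the proposed attribute, modeled on the existing type attribute of <lexicon>; the attribute on <audio> is not part of SSML 1.0:

  <!-- proposed: declare the expected media type of the referenced resource; on a
       mismatch with the authoritative type, the processor should still play the
       audio if it can -->
  <audio src="http://www.example.com/music.mp3" type="audio/mpeg">
    The music is unavailable.
  </audio>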
SSML should be modified as necessary to operate effectively with the VCR-like controls that VoiceXML is looking to introduce.
3.4.1 SSML 1.1 should provide a mechanism to indicate that only a subset of the entire <speak> content is to be rendered. This mechanism should allow designation of the start and end of the subset based on time offsets from the beginning of the <speak> content, the end of the <speak> content, and marks within the content.
3.4.2 It would be nice if SSML 1.1 provided a mechanism to indicate that only a subset of the content of an <audio> element is to be rendered. This mechanism, if provided, should allow designation of the start and end of the subset based on time offsets from the beginning of the <audio> content, the end of the <audio> content, and marks within the content.
3.4.3 SSML 1.1 should provide a mechanism to adjust the speed of the rendered <speak> content.
3.4.4 It would be nice if SSML 1.1 provided a mechanism to either adjust or set the average pitch of the rendered <speak> content.
3.4.5 SSML 1.1 should provide a mechanism to either adjust or set the volume of the rendered <speak> content.
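None of these requirements mandates a particular syntax. Purely as an illustration, attributes borrowing, for instance, the clipBegin/clipEnd naming used in SMIL might look like the following; every attribute shown on <audio> below is hypothetical for SSML:

  <!-- hypothetical: render only seconds 2.5 through 10 of the clip, faster and louder -->
  <audio src="http://www.example.com/report.wav"
         clipBegin="2.5s" clipEnd="10s" speed="150%" soundLevel="+6dB">
    The report is unavailable.
  </audio>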
Authors must be given explicit control over which <lexicon>-specified lexicons are active for which portions of the document. This will allow explicit activation/deactivation of lexicons.
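One conceivable approach, shown only as a sketch (neither the xml:id usage nor the <lookup> element is defined in SSML 1.0), is to name each lexicon and scope its activation explicitly:

  <lexicon xml:id="pharma" uri="http://www.example.com/drug-names.pls"/>
  <!-- hypothetical scoping element: the "pharma" lexicon is active only inside -->
  <lookup ref="pharma">
    <s>Take one acetaminophen tablet every four hours.</s>
  </lookup>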
It would be nice if SSML were modified to support prefetching of audio as defined by the "fetchhint" attribute of the <audio> tag in VoiceXML 2.0 [VXML2]. The exact mechanism used by the VoiceXML interpreter to instruct the SSML processor to prefetch audio may be out of scope. However, SSML should at a minimum recommend behavior for asserting audio resource freshness at the point of playback. This clarifies how audio resource prefetching and caching behaviors interact.
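In VoiceXML 2.0 this is expressed with the fetchhint attribute of <audio>; if SSML adopted the same syntax it might look like the following (the attribute is not part of SSML 1.0):

  <!-- hypothetical in SSML: ask the processor to fetch the clip as early as possible -->
  <audio src="http://www.example.com/long-prompt.wav" fetchhint="prefetch"/>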
SSML 1.1 must provide a way to uniquely reference <p>, <s>, and the new word-level element (see Section 4) for cross-referencing by external documents.
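For illustration, a generic xml:id attribute would satisfy this requirement; whether SSML 1.1 adopts xml:id or another naming mechanism is left to the specification:

  <p xml:id="para1">
    <s xml:id="s1">An external document could reference this sentence as #s1.</s>
  </p>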
This section must include requirements that address the following problem statement:
All TTS systems make use of word boundaries during synthesis. All Chinese/Thai/Japanese systems today must do additional processing to identify word boundaries, because white space is not normally used as a boundary marker in the written language. Errors in this processing can degrade output quality and even cause misunderstandings. Overall TTS performance for these systems can be improved if document authors can hand-label the word boundaries where errors are expected or found to occur.
SSML 1.1 must provide a mechanism to eliminate word segmentation ambiguities. This is necessary in order to render languages, such as Chinese, Japanese, and Thai, that do not use white space to separate words.
Resulting benefits can include improved cues for prosodic control (e.g., pause) and may assist the synthesis processor in selection of the correct pronunciation for homographs.
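For example, a hypothetical word-level element (named <token> here purely as a placeholder) would let the author resolve a classic Chinese segmentation ambiguity directly:

  <s xml:lang="zh-CN">
    <!-- hypothetical markup: "Nanjing City" + "Yangtze River Bridge",
         ruling out the reading "Nanjing" + "mayor" + "Jiang Daqiao" -->
    <token>南京市</token><token>长江大桥</token>
  </s>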
4.2.1 SSML 1.1 must provide a mechanism for annotating words.
4.2.2 SSML 1.1 must standardize an annotation of the language using mechanisms similar to those used elsewhere in the specification to identify language.
4.2.3 SSML 1.1 must standardize a mechanism to refer to the correct pronunciation in the Pronunciation Lexicon Specification, in particular when there are multiple pronunciations for the same orthography. This will enhance the existing implied correspondence between words and pronunciation lexicons.
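Continuing with the placeholder <token> element above, such annotations might look like the following; the role attribute and its link to a PLS lexeme are hypothetical:

  <!-- hypothetical: mark an embedded word as Japanese -->
  <token xml:lang="ja">東京</token>
  <!-- hypothetical: select the verb pronunciation of "record" from the lexicon -->
  <token role="verb">record</token>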
This section must include requirements that address the following problem statement:
Although IPA (and its textual equivalents) provides a way to write every pronunciation for every language, for some languages there are alternative pronunciation scripts (not necessarily phonetic/phonemic) that are already widely known and used; these scripts may still require some modifications to be useful within SSML. SSML requires support for IPA and permits any string to be used as the value of the "alphabet" attribute in the <phoneme> element. However, TTS vendors for these languages want a standard reference for their pronunciation scripts. This might require extra work to define a standard reference.
5.1.1 SSML 1.1 must enable the use of values for the "alphabet" attribute of the <phoneme> element that are defined in a registry that can be updated independent of SSML. This registry and its registration policy must be defined by the SSML subgroup.
The intent of this change is to encourage the standardization of alternative pronunciation scripts, for example Pinyin for Mandarin, Jyutping for Cantonese, and Ruby for Japanese.
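For example, with a registered Pinyin alphabet an author could write the following; the alphabet value "pinyin" is hypothetical pending the registry, since SSML 1.0 defines only "ipa" and otherwise allows vendor-specific values:

  <!-- hypothetical registered alphabet name "pinyin" -->
  <phoneme alphabet="pinyin" ph="bei3 jing1">北京</phoneme>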
As part of the discussion on the registration policy, the SSML subgroup should consider the following:
5.1.2 The registry named in 5.1.1 should be maintained through IANA.
This section must include requirements that address the following problem statement:
The xml:lang attribute in SSML is the only way to identify the language. It represents both the natural (human) language of the text content and the natural (human) language the synthesis processor is to produce. For languages whose scripts are ideographic rather than pronunciation-based, we are not sure that the permitted values for xml:lang, as specified by RFC3066, are detailed enough to distinguish among languages (and their dialects) that use the same ideographs.
SSML 1.1 must ensure the use of a version of xml:lang that uses the successor specification to RFC3066 [RFC3066] (for example, BCP47 [BCP47]).
This will provide sufficient flexibility to indicate all of the needed languages, scripts, dialects, and their variants.
6.2.1 SSML 1.1 must clearly state that the 'xml:lang' attribute identifies the language of the content.
6.2.2 SSML 1.1 must clearly state that processors are expected to determine how to render the content based on the value of the 'xml:lang' attribute and must document expected rendering behavior for the xml:lang values they support.
6.2.3 SSML 1.1 must specify that selection of xml:lang and voice are independent. It is the responsibility of the TTS vendor to decide and document which languages are supported by which voices and in what way.
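For example, subtags for language, script, and region in the current BCP 47 registry can distinguish languages that share Han ideographs; per requirement 6.2.3, whether the two sentences below are rendered by one voice or two is a processor/vendor decision. The xml:lang placement on <s> is ordinary SSML; the tag values follow BCP 47:

  <s xml:lang="cmn-Hans-CN">你好</s> <!-- Mandarin, Simplified script, mainland China -->
  <s xml:lang="yue-Hant-HK">你好</s> <!-- Cantonese, Traditional script, Hong Kong -->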
This section must include requirements on a future version of <say-as> to support better interpretation of Chinese names and Korean proper nouns.
In some languages, special handling is needed to identify names/proper nouns. For example, in some Asian languages, the pronunciation of characters used in Chinese surnames and Korean proper nouns changes. If the name/proper noun is properly marked, its pronunciation is predictable. Such a requirement is crucial and must be satisfied because, in languages such as Chinese and Korean, there is no obvious cue to distinguish names/proper nouns from other content (e.g., there is no capitalization as used in English), and it is often difficult for the speech synthesis processor to automatically identify all names/proper nouns properly.
It is also important to identify which part of a name is the surname and which part(s) is/are the given name(s) since there might be several patterns of different surname/given name combinations. For example,
A future version of SSML must provide a mechanism to identify content as a proper noun.
A future version of SSML must provide a mechanism to identify content as a name. This might be done by creating a new "name" value for the interpret-as attribute of the <say-as> element, along with appropriate values for the format and detail attributes.
A future version of SSML must provide a mechanism to identify a portion of a name as the surname.
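A sketch of what such markup could look like; all attribute values below are hypothetical and are not defined by the current Say-as Note:

  <!-- hypothetical values: a Chinese personal name whose first character is the
       surname; as a surname, 仇 is read "qiú" rather than the usual "chóu" -->
  <say-as interpret-as="name" format="surname-given">仇远</say-as>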
This section contains issues that were identified during requirements capture but which have not been directly incorporated into the current set of requirements. The descriptions are not intended to be exhaustive but rather to give a brief explanation of the core idea(s) of the topics.
Japanese, Hungarian, and Arabic words all vary by number, gender, case, and/or category. An example difficulty occurs in reading numeric values from news feeds, since the actual spoken numbers may change based on implied context. By providing this context the synthesizer can generate the proper word.
The two main use cases/motivations for this capability are
The current belief is that this markup is not needed in order to accomplish the stated objectives of SSML 1.1. Since markup of syllables, and particularly the use of prosodic markup at the syllable level, challenges the implicit word-level foundation of SSML 1.0, changes of this nature are likely to have far-reaching consequences for the markup language. Unless this is later discovered to be necessary, this work should wait for a fuller rewrite of SSML than is anticipated for SSML 1.1.
There are a number of cases where SSML is used to render other-than-traditional forms of text. The most common of these appears to be mobile text messages. It is fairly common to see significantly abbreviated text (such as "cul8r" for "see you later" in English) and, for non-English languages, text that does not properly use native character sets. Examples include dropped diacritics in Polish (e.g., the word pączek written as paczek) or the use of the three-symbol string '}|{' to represent the Russian letter 'Ж'.
In Chinese, the foundational writing unit is the character, and although there may be many different pronunciations for a given character, each pronunciation is only a single syllable. It is thus common in Chinese synthesis processors to be able to control prosodic information such as contrastive stress at the syllable level.
Hungarian is a highly agglutinative language whose significant morphological variations are represented in the orthography. Thus, contrastive stress may need to be marked at a sub-word level. For example, “Nem a dobozon, hanem a dobozban van a könyv” means “The book is not on the box, but in the box”, with the contrast carried by the case suffixes -on and -ban.
Note that the approaches currently being considered to address the requirements in Section 4 may provide a limited ability to do sub-word prosodic annotation.
Many of the languages on the Indian subcontinent are based on a common set of underlying phonemic units and have writing systems (scripts) that are based on these underlying units. The scripts for these languages may differ substantially from one another, however, and from the historic Indian script specifically designed for writing pronunciations. Additionally, because of the spread of communication systems in which it is easier to write in Latin scripts (or ASCII, in particular) than in native scripts, India has seen a proliferation of ASCII-based writing systems that are also based on the same underlying phonemic units. Unfortunately, these ASCII-based writing systems are not standardized.
The challenge for speech synthesis systems today is that the system will often use several lexicons, each of which uses a different pronunciation writing system. Pronunciations given inline by an author may also be in a different (and potentially non-standard) writing system. This challenge is currently addressed for Indian speech synthesis systems by using transliteration among code pages. Each code page describes how a particular writing system maps into a canonical writing system. It is thus possible for a synthesis processor to know how to convert any text into a representation of pronunciation that can be looked up in a lexicon.
Although the need to use different pronunciation alphabets will be addressed for standard alphabets, i.e., those for the different Indian languages, to address the user-specific ASCII representations a more generic mapping facility might be needed. Such a capability might also address the common issue of how to map mobile phone short message text into the standard grapheme representations used in a lexicon.
Many new values for the "interpret-as" attribute of the <say-as> element have been suggested. Common ones include URI, email address, and postal address. Although clearly useful, these values are similar, if not identical, to ones considered during the development of the Say-as Note [SAYAS]. It is not clear which, if any, of the suggested values are critical, or at least more necessary, for languages other than those for which SSML 1.0 works well today. These suggestions from the workshops may be incorporated into future work on the <say-as> element, which is outside the scope of the SSML 1.1 effort.
When the nominal tones of sequences of syllables in Chinese match certain patterns, the actual spoken tones change in predictable ways. For example, in Mandarin if two tone 3 syllables occur together, the first will actually be pronounced as tone 2 instead of tone 3. Similar, but different, rules apply for Cantonese and for the many other spoken languages that use the written Han characters. This need may be addressed sufficiently by other requirements in this document.
The rate attribute of the <prosody> element in SSML 1.0 only allows for relative changes to the speech rate, not absolute settings. A primary reason for this was lack of agreement on what units would be used to set the rate -- phonemes, syllables, words, etc. With the feedback received so far, it would be possible to enhance the prosody rate to permit absolute values of the form "X speech units per time unit", where speech units could be selected by the author to be syllable, mora, phoneme, foot, etc. and time units could be selected by the author to be seconds, ms, minutes, etc. This is a good example of a feature that should be considered if and when an SSML 2.0 is developed.
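If such an enhancement were adopted, an author might write something like the following; the value syntax is purely illustrative:

  <!-- hypothetical absolute rate: five syllables per second -->
  <prosody rate="5 syllables/s">speech delivered at a fixed syllable rate</prosody>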
There are many requests to permit a separate audio track to be established to provide background speech, music, or other audio. This feature is about audio mixing rather than speech synthesis, so either it should be handled outside of SSML (via SMIL [SMIL2] or via a future version of VoiceXML) or a more thorough analysis of what audio mixing capabilities are desired should be done as part of a future version of SSML.
There are requests for speaking style ("news", "sports", etc.) and emotion portrayal ("angry", "joyful", "sad") that represent high-level requests that result in rather sophisticated speech production changes, and historically there has been insufficient agreement on how these styles would be rendered. However, this is slowly changing -- see, for example, the W3C Emotion Incubator Group [EMOTION]. This category of request most definitely should be considered when developing a future version of SSML.
SSML 1.0 has only two explicit logical structure elements: <paragraph> and <sentence>. In addition, whitespace is used as an implicit word boundary. There have been requests to provide other sub-sentence structure such as phrase markers (and explicit word marking, one of the requirements earlier in this document). The motivations for such features vary slightly but usually center around providing improved prosodic control. This is a good topic to reconsider in a future, possibly completely rewritten, version of SSML.
The editors wish to thank the members of the Voice Browser Working Group involved in this activity (listed in family name alphabetical order):