Copyright © 2007 W3C® (MIT, ERCIM, Keio), All Rights Reserved. W3C liability, trademark and document use rules apply.
In 2005, 2006, and 2007 the W3C held workshops to understand the ways, if any, in which the design of SSML 1.0 limited its usefulness for authors of applications in Asian, Eastern European, and Middle Eastern languages. In 2006 an SSML subgroup of the W3C Voice Browser Working Group was formed to review this input and develop requirements for changes necessary to support those languages. This document contains those requirements.
This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.
This is the 11 June 2007 W3C Working Draft of "Speech Synthesis Markup Language Version 1.1 Requirements".
This document describes the requirements for changes to the SSML 1.0 specification required to fulfill the charter given in [Section 1.2]. This is the second Working Draft. The group does not expect this document to become a W3C Recommendation. Changes since the previous version are listed in Appendix A.
This document has been produced as part of the W3C Voice Browser Activity, following the procedures set out for the W3C Process. The authors of this document are members of the Voice Browser Working Group. You are encouraged to subscribe to the public discussion list <www-voice@w3.org> and to mail us your comments. To subscribe, send an email to <www-voice-request@w3.org> with the word subscribe in the subject line (include the word unsubscribe if you want to unsubscribe). A public archive is available online.
This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.
Publication as a Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.
This document establishes a prioritized list of requirements for speech synthesis markup which any proposed markup language should address. This document addresses both procedure and requirements for the specification development. In addition to general requirements, the requirements are addressed in separate sections on Speech Interface Framework Consistency, Token/Word Boundary, Phonetic Alphabet and Pronunciation Script, Language Category, and Name/Proper Noun Identification Requirements, followed by Future Study and Acknowledgements sections.
As a W3C standard, one of the aims of SSML (see [SSML] for description) is to be suitable and convenient for use by application authors and vendors worldwide. A brief review of the most broadly-spoken world languages [LANGUAGES] shows a number of languages that are in large commercial or emerging markets for speech synthesis technologies but for which there was limited or no participation by either native speakers or experts during the development of SSML 1.0. To determine in what ways, if any, SSML is limited by its design with respect to supporting these languages, the W3C held three workshops on the Internationalization of SSML. The first workshop [WS], in Beijing, PRC, in October 2005, focused primarily on Chinese, Korean, and Japanese languages, and the second [WS2], in Crete, Greece, in May 2006, focused primarily on Arabic, Indian, and Eastern European languages. The third workshop [WS3], in Hyderabad, India, in January 2007, focused heavily on Indian and Middle Eastern languages.
These three workshops resulted in excellent suggestions for changes to SSML, describing the ways in which SSML 1.0 has been extended and enhanced around the world. An encouraging result from the workshops was that many of the problems might be solvable using similar, if not identical, solutions. In fact, it may be possible to increase dramatically the usefulness of SSML for many application authors around the world by making a limited number of carefully-planned changes to SSML 1.0. That is the goal of this effort.
The scope for a W3C recommendation for SSML 1.1 is modifications to SSML 1.0 to
VCR-like controls are out of scope for SSML 1.1. We may discuss <say-as> (see [SAYAS]) issues that are related to the SSML 1.1 work above and collect requirements for the next document that addresses <say-as> values. We will not create specifications for additional <say-as> values but may publish a separate Note containing the <say-as> requirements specifically related to the SSML 1.1 work. We will follow standard W3C procedures.
* provided there is sufficient group expertise and contribution for these languages
The General Requirements in section 2 arose out of SSML-specific and general Voice Browser Working Group discussions. The Speech Interface Framework Consistency Requirements in section 3 were generated by the Voice Browser Working Group. The SSML subgroup developed the charter. The remaining requirements were then developed as follows:
First, the SSML subgroup grouped topics presented and discussed at the workshops (see Section 1.1) into four categories: Short-term, Long-term, Experts needed, and Other SSML work.
The following table shows how the topics were categorized. There is no implied ordering within each column.
| Short-term (group agrees to work on this) | Long-term (revisit after the short-term work to determine whether it belongs in this group) | Experts needed (in order to decide whether to work on this in this subgroup) | Other SSML work (SSML 2.0 or later, <say-as> Note, etc.) |
|---|---|---|---|
| Token/word boundaries | Tones | Providing number, case, gender agreement info | Special words |
| Phonetic alphabets | Expand Part-Of-Speech support | Syllable markup | Tone sandhi |
| Verify that RFC3066 language categories are complete enough that we do not need anything new beyond xml:lang to identify languages and dialects | Text with multiple languages (changing xml:lang without changing voice; separately specifying language of content and language to speak) | Diacritics, SMS text, simplified/alternate text | Enhance prosody rate to include "speech units per time unit", where speech units would be syllable, mora, phoneme, foot, etc. and time units would be seconds, ms, minutes, etc. (would address mora/sec request) |
| Chinese names (say-as requirements) | | Sub-word unit demarcation and annotation | Background sound (may be handled best by VoiceXML 3 work) |
| Ruby | | Transliteration | Expressive elements |
| | | Sentence structure | |
Next, for each topic in the Short-term list, we developed one or more problem statements. Where applicable, the problem statements have been included in this document.
We then generated requirements to address the problem statements.
It is interesting to note that the three Long-term topics have been addressed by the requirements developed while working on the Short-term topics: Tones are addressed via the pronunciation alphabets, Part-Of-Speech support may be at least partially addressed via requirement 4.2.3, and Text with multiple languages is being addressed as part of the language category requirements.
The topics in the remaining two categories (Experts needed and Other SSML work) are listed and briefly described in the Future Study section.
SSML 1.1 should be backwards compatible with SSML 1.0 except where modification is necessary to satisfy other requirements in this document.
SSML 1.1 may use Internationalized Resource Identifiers [RFC3987] instead of URIs.
This section must include requirements that make SSML consistent with the other Speech Interface Framework specifications, including VoiceXML 2.0/2.1, PLS, SRGS, and SISR in both behavior and syntax, where possible.
SSML must support the maxage and maxstale attributes for the <audio> element as supported in VoiceXML 2.1.
SSML lacks these attributes, so it is not clear how SSML enforces (or even has) a caching model for audio resources.
SSML must support the maxage and maxstale attributes for the <lexicon> element.
SSML should provide a mechanism for an author to set default values for the maxage and maxstale attributes.
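For illustration only, the attributes could mirror the VoiceXML 2.1 syntax, with values given in seconds. The document below is a sketch: the maxage/maxstale attributes on <audio> and <lexicon>, and the document-level defaults on <speak>, are proposals rather than existing SSML 1.0 syntax.

  <!-- maxage/maxstale on <speak> are a hypothetical way to set document-wide defaults -->
  <speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
         xml:lang="en-US" maxage="300" maxstale="30">
    <!-- proposed: accept a cached copy up to 60 seconds old, or up to 10 seconds stale -->
    <lexicon uri="http://www.example.com/names.pls" maxage="60" maxstale="10"/>
    <audio src="http://www.example.com/greeting.wav" maxage="60" maxstale="10">
      Welcome!
    </audio>
  </speak>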
SSML should provide error messages that include detail about the failure.
SSML 1.0 defines error behavior [SSML §1.5] only as "Error Results are undefined. A conforming synthesis processor may detect and report an error and may recover from it." Note that in the case of an <audio> element where there is a protocol error fetching the URI resource, or where the resource cannot be played, VoiceXML might log this information in its session variables. The error information likely to be required includes the URI itself, the protocol response code, and a reason (textual description). It is expected that the SSML processor would recover from this error (play fallback content if specified, or ignore the element).
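The recovery behavior described above can be illustrated with SSML 1.0's existing fallback content for <audio>; the error detail itself (URI, response code, reason) would be surfaced through the hosting environment, for example VoiceXML session variables, rather than in the markup:

  <!-- If greeting.wav cannot be fetched or played, the processor recovers by speaking
       the fallback text; the URI, protocol response code, and reason are the detail
       this requirement asks to be reported. -->
  <audio src="http://www.example.com/greeting.wav">
    Welcome to the service.
  </audio>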
The <audio> element should be extended with a type attribute to indicate the media type of the resource referenced by the URI. The attribute may be used by the synthesis processor as a hint when fetching and rendering the resource.
The handling of the requested type versus an authoritative type returned by a protocol would follow the same approach described for the type in <lexicon> [SSML Section 3.1.4]. On a type mismatch, the processor should play the audio if it can.
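A sketch of the proposed attribute, modeled on the existing type attribute of <lexicon>; the attribute on <audio> is not part of SSML 1.0:

  <!-- proposed: declare the expected media type of the referenced resource; on a
       mismatch with the authoritative type, the processor should still play the
       audio if it can -->
  <audio src="http://www.example.com/music.mp3" type="audio/mpeg">
    The music is unavailable.
  </audio>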
SSML should be modified as necessary to operate effectively with the VCR-like controls that VoiceXML is looking to introduce.
3.4.1 SSML 1.1 should provide a mechanism to indicate that only a subset of the entire <speak> content is to be rendered. This mechanism should allow designation of the start and end of the subset based on time offsets from the beginning of the <speak> content, the end of the <speak> content, and marks within the content.
3.4.2 It would be nice if SSML 1.1 provided a mechanism to indicate that only a subset of the content of an <audio> element is to be rendered. This mechanism, if provided, should allow designation of the start and end of the subset based on time offsets from the beginning of the <audio> content, the end of the <audio> content, and marks within the content.
3.4.3 SSML 1.1 should provide a mechanism to adjust the speed of the rendered <speak> content.
3.4.4 It would be nice if SSML 1.1 provided a mechanism to either adjust or set the average pitch of the rendered <speak> content.
3.4.5 SSML 1.1 should provide a mechanism to either adjust or set the volume of the rendered <speak> content.
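None of these requirements mandates a particular syntax. Purely as an illustration, attributes borrowing, for instance, the clipBegin/clipEnd naming used in SMIL might look like the following; every attribute shown on <audio> below is hypothetical for SSML:

  <!-- hypothetical: render only seconds 2.5 through 10 of the clip, faster and louder -->
  <audio src="http://www.example.com/report.wav"
         clipBegin="2.5s" clipEnd="10s" speed="150%" soundLevel="+6dB">
    The report is unavailable.
  </audio>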
Authors must be given explicit control over which <lexicon>-specified lexicons are active for which portions of the document. This will allow explicit activation/deactivation of lexicons.
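One conceivable approach, shown only as a sketch (neither the xml:id usage nor the <lookup> element is defined in SSML 1.0), is to name each lexicon and scope its activation explicitly:

  <lexicon xml:id="pharma" uri="http://www.example.com/drug-names.pls"/>
  <!-- hypothetical scoping element: the "pharma" lexicon is active only inside -->
  <lookup ref="pharma">
    <s>Take one acetaminophen tablet every four hours.</s>
  </lookup>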
It would be nice if SSML were modified to support prefetching of audio as defined by the "fetchhint" attribute of the <audio> tag in VoiceXML 2.0 [VXML2]. The exact mechanism used by the VoiceXML interpreter to instruct the SSML processor to prefetch audio may be out of scope. However, SSML should at a minimum recommend behavior for asserting audio resource freshness at the point of playback. This clarifies how audio resource prefetching and caching behaviors interact.
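In VoiceXML 2.0 this is expressed with the fetchhint attribute of <audio>; if SSML adopted the same syntax it might look like the following (the attribute is not part of SSML 1.0):

  <!-- hypothetical in SSML: ask the processor to fetch the clip as early as possible -->
  <audio src="http://www.example.com/long-prompt.wav" fetchhint="prefetch"/>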
SSML 1.1 must provide a way to uniquely reference <p>, <s>, and the new word-level element (see Section 4) for cross-referencing by external documents.
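For illustration, a generic xml:id attribute would satisfy this requirement; whether SSML 1.1 adopts xml:id or another naming mechanism is left to the specification:

  <p xml:id="para1">
    <s xml:id="s1">An external document could reference this sentence as #s1.</s>
  </p>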
This section must include requirements that address the following problem statement:
All TTS systems make use of word boundaries during synthesis. All Chinese/Thai/Japanese systems today must do additional processing to identify word boundaries, because white space is not normally used as a boundary marker in the written language. Errors in this processing can degrade output quality and even cause misunderstandings. Overall TTS performance for these systems can be improved if document authors can hand-label the word boundaries where errors are expected or found to occur.
SSML 1.1 must provide a mechanism to eliminate word segmentation ambiguities. This is necessary in order to render languages, such as Chinese, Japanese, and Thai, that do not use white space to separate words.
Resulting benefits can include improved cues for prosodic control (e.g., pause) and may assist the synthesis processor in selection of the correct pronunciation for homographs.
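For example, a hypothetical word-level element (named <token> here purely as a placeholder) would let the author resolve a classic Chinese segmentation ambiguity directly:

  <s xml:lang="zh-CN">
    <!-- hypothetical markup: "Nanjing City" + "Yangtze River Bridge",
         ruling out the reading "Nanjing" + "mayor" + "Jiang Daqiao" -->
    <token>南京市</token><token>长江大桥</token>
  </s>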
4.2.1 SSML 1.1 must provide a mechanism for annotating words.
4.2.2 SSML 1.1 must standardize an annotation of the language using mechanisms similar to those used elsewhere in the specification to identify language.
4.2.3 SSML 1.1 must standardize a mechanism to refer to the correct pronunciation in the Pronunciation Lexicon Specification, in particular when there are multiple pronunciations for the same orthography. This will enhance the existing implied correspondence between words and pronunciation lexicons.
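Continuing with the placeholder <token> element above, such annotations might look like the following; the role attribute and its link to a PLS lexeme are hypothetical:

  <!-- hypothetical: mark an embedded word as Japanese -->
  <token xml:lang="ja">東京</token>
  <!-- hypothetical: select the verb pronunciation of "record" from the lexicon -->
  <token role="verb">record</token>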
This section must include requirements that address the following problem statement:
Although IPA (and its textual equivalents) provides a way to write every pronunciation for every language, for some languages there are alternative pronunciation scripts (not necessarily phonetic/phonemic) that are already widely known and used; these scripts may still require some modifications to be useful within SSML. SSML requires support for IPA and permits any string to be used as the value of the "alphabet" attribute in the <phoneme> element. However, TTS vendors for these languages want a standard reference for their pronunciation scripts. This might require extra work to define a standard reference.
5.1.1 SSML 1.1 must enable the use of values for the "alphabet" attribute of the <phoneme> element that are defined in a registry that can be updated independent of SSML. This registry and its registration policy must be defined by the SSML subgroup.
The intent of this change is to encourage the standardization of alternative pronunciation scripts, for example Pinyin for Mandarin, Jyutping for Cantonese, and Ruby for Japanese.
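For example, with a registered Pinyin alphabet an author could write the following; the alphabet value "pinyin" is hypothetical pending the registry, since SSML 1.0 defines only "ipa" and otherwise allows vendor-specific values:

  <!-- hypothetical registered alphabet name "pinyin" -->
  <phoneme alphabet="pinyin" ph="bei3 jing1">北京</phoneme>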
As part of the discussion on the registration policy, the SSML subgroup should consider the following:
5.1.2 The registry named in 5.1.1 should be maintained through IANA.
This section must include requirements that address the following problem statement:
The xml:lang attribute in SSML is the only way to identify the language. It represents both the natural (human) language of the text content and the natural (human) language the synthesis processor is to produce. For languages whose scripts are ideographic rather than pronunciation-based, we are not sure that the permitted values for xml:lang, as specified by RFC3066, are detailed enough to distinguish among languages (and their dialects) that use the same ideographs.
SSML 1.1 must ensure the use of a version of xml:lang that uses the successor specification to RFC3066 [RFC3066] (for example, BCP47 [BCP47]).
This will provide sufficient flexibility to indicate all of the needed languages, scripts, dialects, and their variants.
6.2.1 SSML 1.1 must clearly state that the 'xml:lang' attribute identifies the language of the content.
6.2.2 SSML 1.1 must clearly state that processors are expected to determine how to render the content based on the value of the 'xml:lang' attribute and must document expected rendering behavior for the xml:lang values they support.
6.2.3 SSML 1.1 must specify that selection of xml:lang and voice are independent. It is the responsibility of the TTS vendor to decide and document which languages are supported by which voices and in what way.
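For example, subtags for language, script, and region in the current BCP 47 registry can distinguish languages that share Han ideographs; per requirement 6.2.3, whether the two sentences below are rendered by one voice or two is a processor/vendor decision. The xml:lang placement on <s> is ordinary SSML; the tag values follow BCP 47:

  <s xml:lang="cmn-Hans-CN">你好</s> <!-- Mandarin, Simplified script, mainland China -->
  <s xml:lang="yue-Hant-HK">你好</s> <!-- Cantonese, Traditional script, Hong Kong -->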
This section must include requirements on a future version of <say-as> to support better interpretation of Chinese names and Korean proper nouns.
In some languages, special handling is needed to identify names/proper nouns. For example, in some Asian languages, the pronunciation of characters used in Chinese surnames and Korean proper nouns changes. If the name/proper noun is properly marked, its pronunciation is predictable. Such a requirement is crucial and must be satisfied because, in languages such as Chinese and Korean, there is no obvious cue to distinguish names/proper nouns from other content (e.g., there is no capitalization as used in English), and it is often difficult for the speech synthesis processor to automatically identify all names/proper nouns properly.
It is also important to identify which part of a name is the surname and which part(s) is/are the given name(s) since there might be several patterns of different surname/given name combinations. For example,
A future version of SSML must provide a mechanism to identify content as a proper noun.
A future version of SSML must provide a mechanism to identify content as a name. This might be done by creating a new "name" value for the interpret-as attribute of the <say-as> element, along with appropriate values for the format and detail attributes.
A future version of SSML must provide a mechanism to identify a portion of a name as the surname.
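A sketch of what such markup could look like; all attribute values below are hypothetical and are not defined by the current Say-as Note:

  <!-- hypothetical values: a Chinese personal name whose first character is the
       surname; as a surname, 仇 is read "qiú" rather than the usual "chóu" -->
  <say-as interpret-as="name" format="surname-given">仇远</say-as>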
This section contains issues that were identified during requirements capture but which have not been directly incorporated into the current set of requirements. The descriptions are not intended to be exhaustive but rather to give a brief explanation of the core idea(s) of the topics.
Japanese, Hungarian, and Arabic words all vary by number, gender, case, and/or category. An example difficulty occurs in reading numeric values from news feeds, since the actual spoken numbers may change based on implied context. By providing this context the synthesizer can generate the proper word.
The two main use cases/motivations for this capability are
The current belief is that this markup is not needed in order to accomplish the stated objectives of SSML 1.1. Since markup of syllables, and particularly the use of prosodic markup at the syllable level, challenges the implicit word-level foundation of SSML 1.0, changes of this nature are likely to have far-reaching consequences for the markup language. Unless this is later discovered to be necessary, this work should wait for a fuller rewrite of SSML than is anticipated for SSML 1.1.
There are a number of cases where SSML is used to render other-than-traditional forms of text. The most common of these appears to be mobile text messages. It is fairly common to see significantly abbreviated text (such as "cul8r" for "see you later" in English) and, for non-English languages, text that does not properly use native character sets. Examples include dropped diacritics in Polish (e.g., the word pączek written as paczek) or the use of the three-symbol string '}|{' to represent the Russian letter 'Ж'.
In Chinese, the foundational writing unit is the character, and although there may be many different pronunciations for a given character, each pronunciation is only a single syllable. It is thus common in Chinese synthesis processors to be able to control prosodic information such as contrastive stress at the syllable level.
Hungarian is a highly agglutinative language whose significant morphological variations are represented in the orthography. Thus, contrastive stress may need to be marked at a sub-word level. For example, “Nem a dobozon, hanem a dobozban van a könyv” means “The book is not on the box, but in the box”, with the contrast carried by the case suffixes -on and -ban.
Note that the approaches currently being considered to address the requirements in Section 4 may provide a limited ability to do sub-word prosodic annotation.
Many of the languages on the Indian subcontinent are based on a common set of underlying phonemic units and have writing systems (scripts) that are based on these underlying units. The scripts for these languages may differ substantially from one another, however, and from the historic Indian script specifically designed for writing pronunciations. Additionally, because of the spread of communication systems in which it is easier to write in Latin scripts (or ASCII, in particular) than in native scripts, India has seen a proliferation of ASCII-based writing systems that are also based on the same underlying phonemic units. Unfortunately, these ASCII-based writing systems are not standardized.
The challenge for speech synthesis systems today is that the system will often use several lexicons, each of which uses a different pronunciation writing system. Pronunciations given inline by an author may also be in a different (and potentially non-standard) writing system. This challenge is currently addressed for Indian speech synthesis systems by using transliteration among code pages. Each code page describes how a particular writing system maps into a canonical writing system. It is thus possible for a synthesis processor to know how to convert any text into a representation of pronunciation that can be looked up in a lexicon.
Although the need to use different pronunciation alphabets will be addressed for standard alphabets, i.e., those for the different Indian languages, to address the user-specific ASCII representations a more generic mapping facility might be needed. Such a capability might also address the common issue of how to map mobile phone short message text into the standard grapheme representations used in a lexicon.
Many new values for the "interpret-as" attribute of the <say-as> element have been suggested. Common ones include URI, email address, and postal address. Although clearly useful, these values are similar, if not identical, to ones considered during the development of the Say-as Note [SAYAS]. It is not clear which, if any, of the suggested values are critical, or at least more necessary, for languages other than those for which SSML 1.0 works well today. These suggestions from the workshops may be incorporated into future work on the <say-as> element, which is outside the scope of the SSML 1.1 effort.
When the nominal tones of sequences of syllables in Chinese match certain patterns, the actual spoken tones change in predictable ways. For example, in Mandarin if two tone 3 syllables occur together, the first will actually be pronounced as tone 2 instead of tone 3. Similar, but different, rules apply for Cantonese and for the many other spoken languages that use the written Han characters. This need may be addressed sufficiently by other requirements in this document.
The rate attribute of the <prosody> element in SSML 1.0 only allows for relative changes to the speech rate, not absolute settings. A primary reason for this was lack of agreement on what units would be used to set the rate -- phonemes, syllables, words, etc. With the feedback received so far, it would be possible to enhance the prosody rate to permit absolute values of the form "X speech units per time unit", where speech units could be selected by the author to be syllable, mora, phoneme, foot, etc. and time units could be selected by the author to be seconds, ms, minutes, etc. This is a good example of a feature that should be considered if and when an SSML 2.0 is developed.
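If such an enhancement were adopted, an author might write something like the following; the value syntax is purely illustrative:

  <!-- hypothetical absolute rate: five syllables per second -->
  <prosody rate="5 syllables/s">speech delivered at a fixed syllable rate</prosody>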
There are many requests to permit a separate audio track to be established to provide background speech, music, or other audio. This feature is about audio mixing rather than speech synthesis, so either it should be handled outside of SSML (via SMIL [SMIL2] or via a future version of VoiceXML) or a more thorough analysis of what audio mixing capabilities are desired should be done as part of a future version of SSML.
There are requests for speaking style ("news", "sports", etc.) and emotion portrayal ("angry", "joyful", "sad") that represent high-level requests that result in rather sophisticated speech production changes, and historically there has been insufficient agreement on how these styles would be rendered. However, this is slowly changing -- see, for example, the W3C Emotion Incubator Group [EMOTION]. This category of request most definitely should be considered when developing a future version of SSML.
SSML 1.0 has only two explicit logical structure elements: <paragraph> and <sentence>. In addition, whitespace is used as an implicit word boundary. There have been requests to provide other sub-sentence structure such as phrase markers (and explicit word marking, one of the requirements earlier in this document). The motivations for such features vary slightly but usually center around providing improved prosodic control. This is a good topic to reconsider in a future, possibly completely rewritten, version of SSML.
The editors wish to thank the members of the Voice Browser Working Group involved in this activity (listed in family name alphabetical order):