Speech Synthesis Markup Language Specification for the Speech Interface Framework

W3C Working Draft 3 January 2001

This version:: http://www.w3.org/TR/2001/WD-speech-synthesis-20010103
Latest version:: http://www.w3.org/TR/speech-synthesis
Previous version:: http://www.w3.org/TR/2000/WD-speech-synthesis-20000808
Editors:: Mark R. Walker, Intel; Andrew Hunt, SpeechWorks International

Abstract

The Speech Interface Framework working group has sought to develop standards to enable access to the web using spoken interaction. The Speech Synthesis Markup Language Specification is part of this set of new markup specifications for voice browsers, and is designed to provide a rich, XML-based markup language for assisting the generation of synthetic speech in web and other applications. The essential role of the markup language is to provide authors of synthesizable content a standard way to control aspects of speech such as pronunciation, volume, pitch, rate and etc. across different synthesis-capable platforms.

Status of this Document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. The latest status of this document series is maintained at the W3C.

This is the 3 January 2001 last call Working Draft of the "Speech Synthesis Markup Language Specification". This last call review period ends 31 January 2001. You are encouraged to subscribe to the public discussion list <www-voice@w3.org> and to mail in your comments before the review period ends. To subscribe, send an email to <www-voice-request@w3. org> with the word subscribe in the subject line (include the word unsubscribe if you want to unsubscribe). A public archive is available online.

This specification describes markup for generating synthetic speech via a speech synthesizer, and forms part of the proposals for the W3C Speech Interface Framework. This document has been produced as part of the W3C Voice Browser Activity, following the procedures set out for the W3C Process. The authors of this document are members of the Voice Browser Working Group (W3C Members only).

To help the Voice Browser working group build an implementation report, (as part of advancing the document on the W3C Recommendation Track), you are encouraged to implement this specification and to indicate to W3C which features have been implemented, and any problems that arose.

Publication as a Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite W3C Working Drafts as other than "work in progress". A list of current public W3C Working Drafts can be found at http://www.w3.org/TR/.

1. Introduction
2. Elements and Attributes
- Document Structure, Text Processing and Pronunciation
- 2.1 "speak" Root Element
- 2.2 "xml:lang" Language Attribute
- 2.3 "paragraph" and "sentence"
- 2.4 "sayas" Element
- 2.5 "phoneme" Element
- Prosody and Style
- 2.6 "voice" Element
- 2.7 "emphasis" Element
- 2.8 "break" Element
- 2.9 "prosody" Element
- Other Elements
- 2.10 "audio" Element
- 2.11 "mark" Element
- 2.12 Miscellaneous relevant XML features
3. Future Study
4. Examples
5. Conformance
6. DTD for the Speech Synthesis Markup Language
7. References
8. Acknowledgements

1. Introduction

This W3C Standard is known as the Speech Synthesis Markup Language Specification and is based upon the JSML specification, which is owned by Sun Microsystems, Inc., California, U.S.A.

The Speech Synthesis Markup Language specification is part of a larger set of markup specifications for voice browsers developed through the open processes of the W3C. It is designed to provide a rich, XML-based markup language for assisting the generation of synthetic speech in web and other applications. The essential role of the markup language is to give authors of synthesizable content a standard way to control aspects of speech output such as pronunciation, volume, pitch, rate and etc. across different synthesis-capable platforms.

1.1 Terminology and Design Concepts

There is some variance in the use of terminology in the speech synthesis community. The following definitions establish a common understanding for this document.

Voice Browser	A device which interprets a (voice) markup language and is capable of generating voice output and/or interpreting voice input, and possibly other input/output modalities.
Speech Synthesis	The process of automatic generation of speech output from data input which may include plain text, formatted text or binary objects.
Text-To-Speech	The process of automatic generation of speech output from text or annotated text input.

The design and standardization process has followed from the Speech Synthesis Markup Requirements for Voice Markup Languages published December 23, 1999 by the W3C Voice Browser Working Group.

The following items were the key design criteria.

Consistency: provide predictable control of voice output across platforms and across speech synthesis implementations.
Interoperability: support use along with other W3C specifications including (but not limited to) the Dialog Markup Language, Audio Cascading Style Sheets and SMIL.
Generality: support speech output for a wide range of applications with varied speech content.
Internationalization: Enable speech output in a large number of languages within or across documents.
Generation and Readability: Support automatic generation and hand authoring of documents. The documents should be human-readable.
Implementable: The specification should be implementable with existing, generally available technology and the number of optional features should be minimal.

1.2 Speech Synthesis Processes

A Text-To-Speech (TTS) system that supports the Speech Synthesis Markup Language will be responsible for rendering a document as spoken output and for using the information contained in the markup to render the document as intended by the author.

Document creation: A text document provided as input to the TTS system may be produced automatically, by human authoring, or through a combination of these forms. The Speech Synthesis markup language defines the form of the document.

Document processing: The following are the six major processing steps undertaken by a TTS system to convert marked-up text input into automatically generated voice output. The markup language is designed to be sufficiently rich so as to allow control over each of the steps described below so that the document author (human or machine) can control the final voice output.

XML Parse: An XML parser is used to extract the document tree and content from the incoming text document. The structure, tags and attributes obtained in this step influence each of the following steps.
Structure analysis: The structure of a document influences the way in which a document should be read. For example, there are common speaking patterns associated with paragraphs and sentences.

- Markup support: The "paragraph" and "sentence" elements defined in the TTS markup language explicitly indicate document structures that affect the speech output.

- Non-markup behavior: In documents and parts of documents where these elements are not used, the TTS system is responsible for inferring the structure by automated analysis of the text, often using punctuation and other language-specific data.
Text normalization: All written languages have special constructs that require a conversion of the written form (orthographic form) into the spoken form. Text normalization is an automated process of the TTS system that performs this conversion. For example, for English, when "$200" appears in a document it may be spoken as "two hundred dollars". Similarly, "1/2" may be spoken as "half", "January second", "February first", "one of two" and so on.

- Markup support: The "say-as" element can be used in the input document to explicitly indicate the presence and type of these constructs and to resolve ambiguities. The set of constructs that can be marked includes dates, times, numbers, acronyms, current amounts and more. The set covers many of the common constructs that require special treatment across a wide number of languages but is not and cannot be a complete set.

- Non-markup behavior: For text content that is not marked with the "say-as" element the TTS system is expected to make a reasonable effort to automatically locate and convert these constructs to a speakable form. Because of inherent ambiguities (such as the "1/2" example above) and because of the wide range of possible constructs in any language, this process may introduce errors in the speech output and may cause different systems to render the same document differently.
Text-to-phoneme conversion: Once the system has determined the set of words to be spoken it must convert those words to a string of phonemes. A phoneme is the basic unit of sound in a language. Each language (and sometimes each national or dialect variant of a language) has a specific phoneme set: e.g. most US English dialects have around 45 phonemes. In many languages this conversion is ambiguous since the same written word may have many spoken forms. For example, in English, "read" may be spoken as "reed" (I will read the book) or "red" (I have read the book). Another issue is the handling of words with non-standard spellings or pronunciations. For example, an English TTS system will often have trouble determining how to speak some non-English-origin names; e.g. "Tlalpachicatl" which has a Mexican/Aztec origin.

- Markup support: The "phoneme" element allows a phonemic sequence to be provided for any word or word sequence. This provides the content creator with explicit control over pronunciations. The "say-as" element may also be used to indicate that text is a proper name that may allow a TTS system to apply special rules to determine a pronunciation.

- Non-markup behavior: In the absence of a "phoneme" element the TTS system must apply automated capabilities to determine pronunciations. This is typically achieved by looking up words in a pronunciation dictionary and applying rules to determine other pronunciations. Most TTS systems are expert at performing text-to-phoneme conversions so most words of most documents can be handled automatically.
Prosody analysis: Prosody is the set of features of speech output that includes the pitch (also called intonation or melody), the timing (or rhythm), the pausing, the speaking rate, the emphasis on words and many other features. Producing human-like prosody is important for making speech sound natural and for correctly conveying the meaning of spoken language.
- Markup support: The "emphasis" element, "break" element and "prosody" element may all be used by document creators to guide the TTS system is generating appropriate prosodic features in the speech output. The "lowlevel" element (under Future Study) could provide particularly precise control of the prosodic analysis.

- Non-markup behavior: In the absence of these elements, TTS systems are expert (but not perfect) in automatically generating suitable prosody. This is achieved through analysis of the document structure, sentence syntax, and other information that can be inferred from the text input.
Waveform production: The phonemes and prosodic information are used by the TTS system in the production of the audio waveform. There are many approaches to this processing step so there may be considerable platform-specific variation.

- Markup support: The TTS markup does not provide explicit controls over the generation of waveforms. The "voice" element allows the document creator to request a particular voice or specific voice qualities (e.g. a young male voice). The "audio" element allows for insertion of recorded audio data into the output stream.

1.3 Document Generation, Applications and Contexts

There are many classes of document creator that will produce marked-up documents to be spoken by a TTS system. Not all document creators (including human and machine) have access to information that can be used in all of the elements or in each of the processing steps described in the previous section. The following are some of the common cases.

The document creator has no access to information to mark up the text. All processing steps in the TTS system must be performed fully automatically on raw text. The document requires only the containing "speak" element to indicate the content is to be spoken.
When marked text is generated programmatically the creator may have specific knowledge of the structure and/or special text constructs in some or all of the document. For example, an email reader can mark the location of the time and date of receipt of email. Such applications may use elements that affect structure, text normalization, prosody and possibly text-to-phoneme conversion.
Some document creators make considerable effort to mark as many details of the document to ensure consistent speech quality across platforms and to more precisely specify output qualities. In these cases, the markup may use any or all of the available elements to tightly control the speech output. For example, prompts generated in telephony and voice browser applications may be fine-tuned to maximize the effectiveness of the overall system.
The most advanced document creators may skip the higher-level markup (structure, text normalization, text-to-phoneme conversion, and prosody analysis) and produce low-level TTS markup for segments of documents or for entire documents. This typically requires tools to generate sequences of phonemes, plus pitch and timing information. For instance, tools that do "copy synthesis" or "prosody transplant" try to emulate human speech by copying properties from recordings.

The following are important instances of architectures or designs from which marked-up TTS documents will be generated. The language design is intended to facilitate each of these approaches.

Dialog language: It is a requirement that it should be possible to include documents marked with the speech synthesis markup language into the dialog description document to be produced by the Voice Browser Working Group.
Interoperability with Aural CSS: Any HTML processor that is Aural CSS-enabled can produce Speech Synthesis Markup Language. ACSS is covered in Section 19 of the Cascading Style Sheets, level 2 (CSS2) Specification (12-May-1998). This usage of speech synthesis facilitates improved accessibility to existing HTML and XHTML content.
Application-specific style-sheet processing: As mentioned above, there are classes of application that have knowledge of text content to be spoken and this can be incorporated into the speech synthesis markup to enhance rendering of the document. In many cases, it is expected that the application will use style-sheets to perform transformations of existing XML documents to speech synthesis markup. This is equivalent to the use of ACSS with HTML and once again the speech synthesis markup language is the "final form" representation to be passed to the speech synthesis engine.

2. Elements and Attributes

The following elements are defined in this draft specification.

2.1 "speak" Root Element

The Speech Synthesis Markup Language is an XML application. The root element is "speak".

<?xml version="1.0"?>
<speak>
  ... the body ...
</speak>

2.2 "xml:lang" Attribute: Language

Following the XML convention, languages are indicated by an "xml:lang"attribute on the enclosing element with the value following RFC 1766 to define language codes. Language information is inherited down the document hierarchy, i.e. it has to be given only once if the whole document is in one language, and language information nests, i.e. inner attributes overwrite outer attributes.

<speak xml:lang="en-US">
  <paragraph>I don't speak Japanese.</paragraph>
  <paragraph xml:lang="ja">Nihongo-ga wakarimasen.</paragraph>
</speak>

Usage note 1: The speech output platform determines behavior in the case that a document requires speech output in a language not supported by the speech output platform. This is currently only one of two allowed exceptions to the conformance criteria.

Usage note 2: There may be variation across conformant platforms in the implementation of "xml:lang" for different markup elements. A document author should beware that intra-sentential language changes may not be supported on all platforms.

Usage note 3: A language change often necessitates a change in the voice. Where the platform does not have the same voice in both the enclosing and enclosed languages it should select a new voice with the inherited voice attributes. Any change in voice will reset the prosodic attributes to the default values for the new voice of the enclosed text. Where the "xml:lang" value is the same as the inherited value there is no need for any changes in the voice or prosody.

Usage note 4: All elements should process their contents specific to the enclosing language. For instance, the phoneme, emphasis and break element should each be rendered in a manner that is appropriate to the current language.

Usage note 5: Unsupported languages on a conforming platform could be handled by specifying nothing and relying on platform behavior, issuing an event to the host environment, or by providing substitute text in the ML.

2.3 "paragraph" and "sentence": Text Structure Elements

A "paragraph" element represents the paragraph structure in text. A "sentence" element represents the sentence structure in text. A paragraph contains zero or more sentences.

<paragraph>
  <sentence>This is the first sentence of the paragraph.</sentence>
  <sentence>Here's another sentence.</sentence>
</paragraph>

Usage note 1: For brevity, the markup also supports <p> and <s> as exact equivalents of <paragraph> and <sentence>. (Note: XML requires that the opening and closing elements be identical so <p> text </paragraph> is not legal.). Also note that <s> means "strike-out" in HTML 4.0 and earlier, and in XHTML-1.0-Transitional but not in XHTML-1.0-Strict.

Usage note 2: The use of paragraph and sentence elements is optional. Where text occurs without an enclosing paragraph or sentence elements the speech output system should attempt to determine the structure using language-specific knowledge of the format of plain text.

2.4 "say-as" Element

The "say-as" element indicates the type of text construct contained within the element. This information is used to help specify the pronunciation of the contained text. Defining a comprehensive set of text format types is difficult because of the variety of languages that must be considered and because of the innate flexibility of written languages. The "say-as" element has been specified with a reasonable set of format types. Text substitution may be utilized for unsupported constructs.

The "type" attribute is a required attribute that indicates the contained text construct. The format is a text type optionally followed by a colon and a format. The base set of type values, divided according to broad functionality, is as follows:

Pronunciation Types

acronym: contained text is an acronym. The characters in the contained text string are pronounced as individual characters.

<say-as type="acronym"> USA </say-as>
<!-- U. S. A. -->

Numerical Types

number: contained text contains integers, fractions, floating points, Roman numerals or some other textual format that can be interpreted and spoken as a number in the current language. Format values for numbers are: "ordinal", where the contained text should be interpreted as an ordinal. The content may be a digit sequence or some other textual format that can be interpreted and spoken as an ordinal in the current language; and "digits", where the contained text is to be read as a digit sequence, rather than as a number.

Rocky <say-as type="number"> XIII </say-as>
<!-- Rocky thirteen -->
Pope John the <say-as type="number:ordinal"> VI </say-as>
<!-- Pope John the sixth -->
Deliver to <say-as type="number:digits"> 123 </say-as> Brookwood.
<!-- Deliver to one two three Brookwood-->

Time, Date and Measure Types

date: contained text is a date. Format values for dates are:

"dmy", "mdy", "ymd" (day, month , year), (month, day, year), (year, month, day)

"ym", "my", "md" (year, month), (month, year), (month, day)

"y", "m", "d" (year), (month), (day).
time: contained text is a time of day. Format values for times are:

"hms", "hm", "h" (hours, minutes, seconds), (hours, minutes), (hours).
duration: contained text is a temporal duration. Format values for durations are:

"hms", "hm", "ms", "h", "m", "s" (hours, minutes, seconds), (hours, minutes), (minutes, seconds), (hours), (minutes), (seconds).
currency: contained text is a currency amount.
measure: contained text is a measurement.
telephone: contained text is a telephone number.

<say-as type="date:ymd"> 2000/1/20 </say-as>
<!-- January 20th two thousand -->
Proposals are due in <say-as type="date:my"> 5/2001 </say-as>
<!-- Proposals are due in May two thousand and one -->
The total is <say-as type="currency">$20.45</say-as>
<!-- The total is twenty dollars and forty-five cents -->

When multi-field quantities are specified ("dmy", "my", etc.), it is assumed that the fields are separated by single, non-alphanumeric character.

Address, Name, Net Types

name: contained text is a proper name of a person, company etc.
net: contained text is an internet identifier. Format values for net are: "email", "uri".
address: contained text is a postal address.

<say-as type="net:email"> road.runner@acme.com </say-as>

Substitution

The "sub" attribute is employed to indicate that the specified text replaces the contained text for pronunciation. This allows a document to contain both a spoken and written form.

<say-as sub="World Wide Web Consortium"> W3C </say-as>
<!-- World Wide Web Consortium -->

Usage note 1: The conversion of the various types of text and text markup to spoken forms is language and platform-dependent. For example, <say-as type="date:ymd"> 2000/1/20 </say-as> may be read as "January twentieth two thousand" or as "the twentieth of January two thousand" and so on. The markup examples above are provided for usage illustration purposes only.

Usage note 2: It is assumed that pronunciations generated by the use of explicit text markup always take precedence over pronunciations produced by a lexicon.

2.5 "phoneme" Element

The "phoneme" element provides a phonetic pronunciation for the contained text. The "phoneme" element may be empty. However, it is recommended that the element contain human-readable text that can be used for non-spoken rendering of the document. For example, the content may be displayed visually for users with hearing impairments.

The "alphabet" attribute is an optional attribute that specifies the phonetic alphabet. The "ph" attribute is a required attribute that specifies the phoneme string:

ipa: The specified phonetic string is composed of symbols from the International Phonetic Alphabet (IPA).
worldbet: The specified phonetic string is composed of symbols from the Worldbet (Postscript) phonetic alphabet.
xsampa: The specified phonetic string is composed of symbols from the X-SAMPA phonetic alphabet.

<phoneme alphabet="ipa" ph="t&#x252;m&#x251;to&#x28A;"> tomato </phoneme>
<!-- This is an example of IPA using character entities -->

Usage note 1: Characters composing many of the IPA phonemes are known to display improperly on most platforms. Additional IPA limitations include the fact that IPA is difficult to understand even when using ASCII equivalents, IPA is missing symbols required for many of the world's languages, and IPA editors and fonts containing IPA characters are not widely available.

Usage note 2: Entity definitions may be used for repeated pronunciations. For example:

<!ENTITY uk_tomato "t&#x252;m&#x251;to&#x28A;">
... you say <phoneme ph="&uk_tomato;"> tomato </phoneme> I say...

Usage note 3: In addition to an exhaustive set of vowel and consonant symbols, IPA supports a syllable delimiter, numerous diacritics, stress symbols, lexical tone symbols, intonational markers and more.

2.6 "voice" Element

The "voice" element is a production element that requests a change in speaking voice. Attributes are:

gender: optional attribute indicating the preferred gender of the voice to speak the contained text. Enumerated values are: "male", "female", "neutral".
age: optional attribute indicating the preferred age of the voice to speak the contained text. Acceptable values are of type (integer)
category: optional attribute indicating the preferred age category of the voice to speak the contained text. Enumerated values are: "child" , "teenager" , "adult", "elder".
variant: optional attribute indicating a preferred variant of the other voice characteristics to speak the contained text. (e.g. the second or next male child voice). Acceptable values are of type (integer).
name: optional attribute indicating a platform-specific voice name to speak the contained text. The value may be a space-separated list of names ordered from top preference down. Acceptable values are of the form (voice-name-list).

<voice gender="female" category="child">Mary had a
little lamb,</voice><!-- now request a different female child's voice -->
<voice gender="female" category="child" variant="2">It's fleece
was white as snow.</voice>
<!-- platform-specific voice selection -->
<voice name="Mike">I want to be like Mike.</voice>

Usage note 1: When there is not a voice available that exactly matches the attributes specified in the document, the voice selection algorithm may be platform-specific.

Usage note 2: Voice attributes are inherited down the tree including to within elements that change the language.

<voice gender="female"> 
  Any female voice here.
  <voice category="child"> 
    A female child voice here.
    <paragraph xml:lang="ja"> 
      <!-- A female child voice in Japanese. -->
    </paragraph>
  </voice>
</voice>

Usage note 3: A change in voice resets the prosodic parameters since different voices have different natural pitch and speaking rates. Volume is the only exception. It may be possible to preserve prosodic parameters across a voice change by employing a style sheet. Characteristics specified as "+" or "-" voice attributes with respect to absolute voice attributes would not be preserved.

Usage note 4: The "xml:lang" attribute may be used specially to request usage of a voice with a specific dialect or other variant of the enclosing language.

<voice xml:lang="en-cockney">Try a Cockney voice
(London area).</voice>
<voice xml:lang="en-brooklyn">Try one New York
accent.</voice>

2.7 "emphasis" Element

The "emphasis" element requests that the contained text be spoken with emphasis (also referred to as prominence or stress). The synthesizer determines how to render emphasis since the nature of emphasis differs between languages, dialects or even voices. The attributes are:

level: the "level" attribute indicates the strength of emphasis to be applied. Defined values are "strong", "moderate", "none" and "reduced". The default level is "moderate". The meaning of "strong" and "moderate" emphasis is interpreted according to the language being spoken (languages indicate emphasis using a possible combination of pitch change, timing changes, loudness and other acoustic differences). The "reduced" level is effectively the opposite of emphasizing a word. For example, when the phrase "going to" is reduced it may be spoken as "gonna". The "none" level is used to prevent the speech synthesizer from emphasizing words that it might typically emphasize.

That is a <emphasis> big </emphasis> car!
That is a <emphasis level="strong"> huge </emphasis>
bank account!

2.8 "break" Element

The "break" element is an empty element that controls the pausing or other prosodic boundaries between words. The use of the break element between any pair of words is optional. If the element is not defined, the speech synthesizer is expected to automatically determine a break based on the linguistic context. In practice, the "break" element is most often used to override the typical automatic behavior of a speech synthesizer. The attributes are:

size: the "size" attribute is an optional attribute having one of the following relative values: "none", "small", "medium" (default value), or "large". The value "none" indicates that a normal break boundary should be used. The other three values indicate increasingly large break boundaries between words. The larger boundaries are typically accompanied by pauses.
time: the "time" attribute is an optional attribute indicating the duration of a pause in seconds or milliseconds. It follows the "Times" attribute format from the Cascading Style Sheet Specification. e.g. "250ms", "3s".

Take a deep breath <break/>
then continue. 

Press 1 or wait for the tone. <break time="3s"/>
I didn't hear you!

Usage note 1: Using the "size" attribute is generally preferable to the "time" attribute within normal speech. This is because the speech synthesizer will modify the properties of the break according to the speaking rate, voice and possibly other factors. As an example, a fixed 250ms pause (placed with the "time" attribute) sounds much longer in fast speech than in slow speech.

2.9 "prosody" Element

The "prosody" element permits control of the pitch, speaking rate and volume of the speech output. The attributes are:

pitch: the baseline pitch for the contained text in Hertz, a relative change or values "high", "medium", "low", "default".
contour: sets the actual pitch contour for the contained text. The format is outlined below.
range: the pitch range (variability) for the contained text in Hertz, a relative change or values "high", "medium", "low", "default".
rate: the speaking rate for the contained text, a relative change or values "fast", "medium", "slow", "default".
duration: a value in seconds or milliseconds for the desired time to take to read the element contents. Follows the Times attribute format from the Cascading Style Sheet Specification. e.g. "250ms", "3s".
volume: the volume for the contained text in the range 0.0 to 100.0, a relative change or values "silent", "soft", "medium", "loud" or "default".

Relative values

Relative changes for any of the attributes above are specified as floating-point values: "+10", "-5.5", "+15.2%", "-8.0%". For the pitch and range attributes, relative changes in semitones are permitted: "+5st", "-2st". Since speech synthesizers are not able to apply arbitrary prosodic values, conforming speech synthesis processors may set platform-specific limits on the values. This is the second of only two exceptions allowed in the conformance criteria for an SSML processor.

The price of XYZ is <prosody rate="-10%">
<say-as type="currency">$45</say-as></prosody>

Pitch contour

The pitch contour is defined as a set of targets at specified intervals in the speech output. The algorithm for interpolating between the targets is platform-specific. In each pair of the form (interval,target), the first value is a percentage of the period of the contained text and the second value is the value of the "pitch" attribute (absolute, relative, relative semitone, or descriptive values are all permitted). Interval values outside 0% to 100% are ignored. If a value is not defined for 0% or 100% then the nearest pitch target is copied.

<prosody contour="(0%,+20)(10%,+30%)(40%,+10)">
 good morning
</prosody>

Usage note 1: The descriptive values ("high", "medium" etc.) may be specific to the platform, to user preferences or to the current language and voice. As such, it is generally preferable to use the descriptive values or the relative changes over absolute values.

Usage note 2: The default value of all prosodic attributes is no change. For example, omitting the rate attribute means that the rate is the same within the element as outside.

Usage note 3: The "duration" attribute takes precedence over the "rate" attribute. The "contour" attribute takes precedence over the "pitch" and "range" attributes.

Usage note 4: All prosodic attribute values are indicative: if a speech synthesizer is unable to accurately render a document as specified it will make a best effort (e.g. trying to set the pitch to 1Mhz, or the speaking rate to 1,000,000 words per minute.)

2.10 "audio" Element

The "audio" element supports the insertion of recorded audio files and the insertion of other audio formats in conjunction with synthesized speech output. The audio element may be empty. If the audio element is not empty then the contents should be the marked-up text to be spoken if the audio document is not available. The contents may also be used when rendering the document to non-audible output and for accessibility. The required attribute is "src", which is the URI of a document with an appropriate mime-type.

<!-- Empty element -->
Please say your name after the tone.  <audio src="beep.wav"/>
<!-- Container element with alternative text -->
<audio src="prompt.au">What city do you want to fly from?</audio>

Usage note 1: The "audio" element is not intended to be a complete mechanism for synchronizing synthetic speech output with other audio output or other output media (video etc.). Instead the "audio" element is intended to support the common case of embedding audio files in voice output.

Usage note 2: The alternative text may contain markup. The alternative text may be used when the audio file is not available, when rendering the document as non-audio output, or when the speech synthesizer does not support inclusion of audio files.

2.11 "mark" Element

A "mark" element is an empty element that places a marker into the output stream for asynchronous notification. When audio output of the TTS document reaches the mark, the speech synthesizer issues an event that includes the required "name" attribute of the element. The platform defines the destination of the event. The "mark" element does not affect the speech output process.

Go from <mark name="here"/> here, to <mark name="there"/> there!

Usage note 1: When supported by the implementation, requests can be made to pause and resume at document locations specified by the mark values.

Usage note 2: The mark names are not required to be unique within a document.

2.12 Miscellaneous relevant XML features

If a non-validating XML parser is used, an arbitrary XML element can be included in documents to expose platform-specific capabilities. If a validating XML parser is used, then engine-specific elements can be included if they are defined in an extended schema within the document. These extension elements are processed by engines that understand them and ignored by other engines.

Usage note 1: When engines support non-standard elements and attributes it is good practice for the name to identify the feature as non-standard, for example, by using a "x" prefix or a company name prefix.

3. Future Study

This section is Informative.

3.1 Future Study - Other Phoneme Alphabets

The Voice Browser Working Group is considering the additional support of the UNIPA phonetic alphabet developed by Lernout and Hauspie Speech Products. UNIPA was designed to reflect a one-to-one ASCII representation of existing IPA symbols, greater ease of use and readability, and ease of portability across platforms. Issues with UNIPA surround the fact that the symbols were not specifically designed for use in XML attribute statements. The use of double quotes, ampersand, and less-than characters is currently incompatible with SSML DTD usage.

All of the phoneme alphabets currently supported by SSML suffer from the same defect in that they contain phonemic symbols not specifically designed for expression within XML documents. The design of a new, XML-optimal phoneme alphabet is currently under study.

3.2 Future Study - Audio Element

A future incarnation of the "audio" element could include a "mode" attribute. If equal to "insertion"(the default), the speech output is temporarily paused, the audio is played then speech is resumed. If equal to "background", the audio is played along with speech output. Currently unresolved are the mechanics of how to specify audio playback behaviors like playback termination, etc.

3.3 Future Study - Mark Element

There has been discussion that the "mark" element should be an XML identifier ("id" attribute) with values being unique within the scope of the document. In addition, future study needs to ensure that events generated by a mark element are consistent with existing event models in other specifications (e.g. DOM, SMIL and the dialog markup language).

3.4 Future Study - "lowlevel" Elements: Fine-Grained Acoustic-Prosodic Control

The "lowlevel" element is a container for a sequence of phoneme and pitch controls: "ph" and "f0" elements respectively. The attributes of the "lowlevel" container element are:

Optional "alt" attribute that provides a human-readable string that is equivalent to the contained phonemic sequence.
Optional "pitch" attribute with values of "absolute", "relative" and "percentage" that indicate how to interpret the values on the contained pitch elements.
Tentative: Optional "alphabet" attribute with a default value of "ipa" and alternative values of "sampa" and "worldbet". This indicates which phonetic alphabet is used for the phonetic string values.

The "ph" and "f0" elements may be interleaved or placed in separate sequences (as in the example below).

"ph" Element: Phoneme with Duration

A "lowlevel" element may contain a sequence of zero or more "ph" elements. The "ph" element is empty. The "p" attribute is required and has a value that is a phoneme symbol. The optional "d" attribute is the duration in seconds or milliseconds (seconds as default) for the phoneme. If the "d" attribute is omitted a platform-specific default is used.

<lowlevel alt="hello">
  <ph p="pau" d=".21"/><ph p="h" d=".0949"/><ph p="&" d=".0581"/>
  <ph p="l" d=".0693"/><ph p="ou" d=".2181"/>
</lowlevel>
<!-- This example uses WorldBet phonemes -->

"f0" Element: Timed Pitch Targets

A "lowlevel" element may contain a sequence of zero or more "f0" elements. The "f0" element is empty. The "v" (value) attribute is required and should be in the form of an integer or simple floating point number (no exponentials). The value attribute is interpreted according to the value of the "pitch" attribute of the enclosing "lowlevel" element. The optional "t" attribute indicates the time offset from the preceding "f0" element and has a value of seconds or milliseconds (seconds as default). If the "t" attribute is omitted on the first "f0" element in a "lowlevel" container, the specified "f0" target value is aligned with the start of the first non-silent phoneme.

<lowlevel alt="hello" pitch="absolute">
  <ph p="pau" d=".21"/><ph p="h" d=".0949"/><ph p="&" d=".0581"/>
  <ph p="l" d=".0693"/><ph p="ou" d=".2181"/>
  <!-- This example uses WorldBet phonemes -->

  <f0 v="103.5"/> <f0 v="112.5" t=".075"/>
  <f0 v="113.2" t=".175"/> <f0 v="128.1" t=".28"/>
</lowlevel>

Usage note 1: It is anticipated that low-level markup will be generated by automated tools, so compactness is given priority over readability.

Issues:

There is an unresolved request to require that the "fo" and "ph" elements be interleaved within the "lowlevel" element so that they are in exact temporal order. This change is simple to require but requires that the duration attributes be interpreted consistently. It has been proposed that for the "ph" element the "d" attribute be an offset from the prior "ph" element but that for the "f0" element it should be an offset from the previous "ph" or "f0" element. A diagram would help here.
The attribute names for this element set need to be similar, identical, or somehow consistent with those of the "prosody" element.
Would "pi" or "fr" be preferable to "f0": i.e. pitch or frequency vs. the technical abbreviation for fundamental frequency.
The "phoneme" element and "lowlevel" are inconsistent in that the phone string is an attribute in "phoneme" and part of the content for "lowlevel". Also, the alternative text is the contents of the "phoneme" element but an attribute of "lowlevel". Perhaps these inconsistencies are unavoidable?
This element should track changes in the "phoneme" element. e.g. if "phoneme" adds an "alphabet" attribute that allows the specification of IPA, WorldBet or possibly other phonemic alphabets, then a similar attribute should be added to the "lowlevel" element.

3.5 Future Study - Intonational Controls

The existing specification supports many ways by which a document author can affect the intonational rendering of speech output. In part, this reflects the broad communicate role of intonation in spoken language: it reflects document structure (see the paragraph and sentence elements), prominence (see the emphasis element), and prosodic boundaries (see the break element). Intonation also reflects emotion and many less definable characteristics that are not planned for inclusion in this specification.

The specification could be enhanced to provide specific intonational controls at boundaries and at points of emphasis. In both cases there are existing elements to which intonational attributes could be added. The issues that need to be addressed are:

Determining the form that the attributes should take,
Ensuring that the attributes are applicable to a wide set of languages,
Ensuring that use of the attributes does not require specialized knowledge of intonation theory.

Intonational boundaries: The existing specification allows a document to mark major boundaries and structures using the paragraph and sentence elements and the break element. The break element explicitly marks a boundary whereas boundaries implicitly occur at both the start and end of paragraphs and sentences. For each of these boundary locations we could specify intonational patterns such as a rise, fall, flat, low-rising, high-falling and some more complex patterns. Proposals received to date include use of labeling systems from intonational theory or use of punctuation symbols such as '?', '!' and '.'.

Emphasis tones: The emphasis element can be used to explicitly mark any word or word sequence as emphasized. Each spoken language has patterns by which emphasis is marked intonationally. For example, for English, the more common emphasis tones are high, low, low-rising, and high-downstep. Our challenge is to determine a set of tones that has sufficient coverage of the tones of many spoken languages to be useful, but which does not require extensive theoretical knowledge.

3.6 Future Study - "value" Element

A "value" element has been proposed that permits substitution of a variable into the text stream. The variable's value must be defined separately, either by a "set" element (not yet defined) earlier in the document or in the host environment (e.g. in a voice browser). The value is a plain text string (markup may be ignored).

name: the name of the variable to be inserted in the text stream.
type: same format as the "type" attribute of the "say-as" element allowing the text to be marked as a phone number, date, time etc.

The time is <value name="currentTime"/>.

Issues:

The "value" element is equivalent to the "value" element of the VoiceXML specification. Unlike the Voice Browser which interprets VoiceXML, a speech synthesizer does not typically have persistent variables and would not normally have access to the internal variable of a Voice Browser. One proposal is for the Dialog ML to define a "value" element in its namespace and to convert that element to normal speech synthesis markup before passing the document to a speech synthesis. This is consistent with the spirit of the speech synthesis markup language as a "final form" representation. A downside of this approach is that the DTD for speech synthesis in the Dialog ML would be inconsistent with this specification.
VoiceXML permits values to be stored audio files.

4. Examples

This section is Informative.

The following is an example of reading headers of email messages. The paragraph and sentence elements are used to mark the text structure. The say-as element is used to indicate text constructs such as the time and proper name. The break element is placed before time and has the effect of marking the time as important information for the listener to pay attention to. The prosody element is used to slow the speaking rate of the email subject so that the user has extra time to listen and write down the details.

<?xml version="1.0"?>
<speak>
<paragraph>

<sentence>You have 4 new messages.</sentence>

<sentence>The first is from <say-as 
type="name">Stephanie Williams</say-as>
and arrived at <break/>
<say-as type="time">3:45pm</say-as>.</sentence>

<sentence>The subject is <prosody
rate="-20%">ski trip</prosody></sentence>

</paragraph>
</speak>

The following example combines audio files and different spoken voices to provide information on a collection of music.

<?xml version="1.0"?>
<speak>

<paragraph><voice gender="male">

<sentence>Today we preview the latest romantic music
 from the W3C.</sentence>

<sentence>Hear what the Software Reviews said about
 Tim Lee's newest hit.</sentence>

</voice></paragraph>

<paragraph><voice gender="female">
He sings about issues that touch us all.
</voice></paragraph>

<paragraph><voice gender="male">
Here's a sample.  <audio src="http://www.w3c.org/music.wav">
Would you like to buy it?</voice></paragraph>

</speak>

5. Conformance

This section is Normative.

The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this conformance section are to be interpreted as described in RFC 2119

5.1: Conforming Speech Synthesis Markup Document Fragments

A speech synthesis markup document fragment is a Conforming XML Document Fragment if it adheres to the specification described in this document including the DTD (see Document Type Definition) and also:

(relative to XML) is well-formed.
if all non-speech synthesis namespace elements and attributes and all xmlns attributes which refer to non-speech synthesis namespace elements are removed from the given document, and if an appropriate XML declaration (i.e., <?xml...?>) is included at the top of the document, and if an appropriate document type declaration which points to the speech synthesis DTD is included immediately thereafter, the result is a valid XML document.
conforms to the following W3C Recommendations:
- the XML 1.0 specification (Extensible Markup Language (XML) 1.0).
- (if any namespaces other than speech synthesis markup are used in the document) Namespaces in XML.

The Speech Synthesis Markup Language or these conformance criteria provide no designated size limits on any aspect of speech synthesis markup documents. There are no maximum values on the number of elements, the amount of character data, or the number of characters in attribute values.

5.2: Conforming Stand-Alone Speech Synthesis Markup Language Documents

A file is a Conforming Stand-Alone Speech Synthesis Markup Language Document if:

it is an XML document.
its root element is a 'speak' element.
it conforms to the criteria for Conforming Speech Synthesis Markup Language Fragments.

5.3: Conforming Speech Synthesis Markup Language Processors

A Speech Synthesis Markup Language processor is a program that can parse and process Speech Synthesis Markup Language documents.

In a Conforming Speech Synthesis Markup Language Processor, the XML parser must be able to parse and process all XML constructs defined within XML 1.0 and XML Namespaces.

A Conforming Speech Synthesis Markup Language Processor must correctly understand and apply the command logic defined for each markup element as described by this document. Exceptions to this requirement are allowed when an xml:lang attribute is utilized to specify a language not present on a given platform, and when a non-enumerated attribute value is specified that is out-of-range for the platform. The response of the Conforming Speech Synthesis Markup Language Processor in both cases would be platform-dependent.

A Conforming Speech Synthesis Markup Language Processor should inform its hosting environment if it encounters an element, element attribute, or syntactic combination of elements or attributes that it is unable to support. A Conforming Speech Synthesis Markup Language Processor should also inform its hosting environment if it encounters an illegal speech synthesis document or unknown XML entity reference.

6. DTD for the Speech Synthesis Markup Language

This section is Normative.

<?xml version="1.0" encoding="ISO-8859-1"?>

<!-- Speech Synthesis Markup Language  v0.6 20001114 -->

<!ENTITY % allowed-within-sentence " #PCDATA | say-as | phoneme |
     voice | emphasis | break | prosody | audio | value | mark " >

<!ENTITY % structure "paragraph | p | sentence | s">

<!ENTITY % duration "CDATA">

<!ENTITY % integer "CDATA" >

<!ENTITY % uri "CDATA" >

<!ENTITY % phoneme-string "CDATA" >

<!ENTITY % phoneme-alphabet "CDATA" >

<!-- Definitions of the structural elements. -->
<!-- Currently, these elements support only the xml:lang attribute -->

<!ELEMENT speak (%allowed-within-sentence; | %structure;)*>
<!ATTLIST speak
     xml:lang NMTOKEN #IMPLIED>
         
<!ELEMENT paragraph (%allowed-within-sentence; | sentence | s)*>
<!ATTLIST paragraph
     xml:lang NMTOKEN #IMPLIED>

<!ELEMENT sentence (%allowed-within-sentence;)*>
<!ATTLIST sentence
     xml:lang NMTOKEN #IMPLIED>
    
<!--
 'p' and 's' are exact equivalent forms of 'paragraph' and 'sentence'
-->
     
<!ELEMENT p (%allowed-within-sentence; | sentence | s)*>
<!ATTLIST p
        xml:lang NMTOKEN #IMPLIED >

<!ELEMENT s (%allowed-within-sentence;)*>
<!ATTLIST s
        xml:lang NMTOKEN #IMPLIED >

<!--
 The flexible container elements can occur within paragraph
 and sentence but may also contain these structural elements.
-->

<!ENTITY % voice-name "CDATA">

<!ELEMENT voice (%allowed-within-sentence; | %structure;)*>
<!ATTLIST voice
     gender   (male|female|neutral)                  #IMPLIED
     age      %integer                               #IMPLIED
         category (child|teenager|adult|elder)           #IMPLIED
     variant  %integer                               #IMPLIED
     name     CDATA                                  #IMPLIED >

<!ELEMENT prosody (%allowed-within-sentence; | %structure;)*>
<!ATTLIST prosody
     pitch      CDATA  #IMPLIED
     contour    CDATA  #IMPLIED
     range      CDATA  #IMPLIED
     rate       CDATA  #IMPLIED
     duration   CDATA  #IMPLIED
     volume     CDATA  #IMPLIED >

<!ELEMENT audio (%allowed-within-sentence; | %structure;)*>
<!ATTLIST audio
     src        %uri;                  #IMPLIED >

<!-- These basic container elements can contain any of the -->
<!-- within-sentence elements, but neither sentence or paragraph. -->

<!ELEMENT emphasis (%allowed-within-sentence;)*>
<!ATTLIST emphasis
     level      (strong|moderate|none|reduced)  'moderate' >

<!-- These basic container elements can contain only data -->

<!ENTITY % say-as-types
    "(acronym|number:ordinal|number:digits|
        telephone|
        date:dmy|date:mdy|date:ymd|date:ym|
          date:my|date:md|date:y|date:m|date:d|
        time:hms|time:hm|time:h|
        duration:hms|duration:hm|duration:ms|
          duration:h|duration:m|duration:s|
        currency|measure|name|net|address)">

<!ELEMENT say-as (#PCDATA)>
<!ATTLIST say-as
     type   %say-as-types;   #REQUIRED
         sub    CDATA            #IMPLIED >
         
<!ELEMENT phoneme (#PCDATA)>
<!ATTLIST phoneme
     ph        %phoneme-string;   #REQUIRED
     alphabet  %phoneme-alphabet; #IMPLIED >

<!-- Definitions of the basic empty elements -->

<!ELEMENT break EMPTY>
<!ATTLIST break
     size      (large|medium|small|none)  'medium'
     time      %duration;                 #IMPLIED >

<!ELEMENT mark EMPTY>
<!ATTLIST mark
     name      CDATA   #REQUIRED >

The following is a fragment of a DTD that represents the elements described for Future Study

<!-- Value element -->

<!ELEMENT value EMPTY>
<!ATTLIST value
     name      CDATA   #REQUIRED >

<!-- Low-level elements -->

<!ENTITY % lowlevel-content " ph | f0 " >

<!ENTITY % pitch-types " (absolute|relative|percent) 'absolute' ">

<!ELEMENT lowlevel ( %lowlevel-content; )*>
<!ATTLIST lowlevel
     alt      CDATA              #IMPLIED
     pitch    %pitch-types;      #IMPLIED
     alphabet %phoneme-alphabet; #IMPLIED >

<!ELEMENT ph EMPTY>
<!ATTLIST ph
     p        %phoneme-alphabet;  #REQUIRED
     d        CDATA               #IMPLIED >

<!ELEMENT f0 EMPTY>
<!ATTLIST f0
     v        CDATA               #REQUIRED 
     t        CDATA               #IMPLIED >

7. References

Normative.

Java Speech API Markup Language

(http://java.sun.com/products/java-media/speech/forDevelopers/JSML/index.html)
JSML is an XML specification for controlling text-to-speech engines. Implementations are available from IBM, Lernout & Hauspie and in the Festival speech synthesis platform and in other implementations of the Java Speech API.

S. Bradner, "Key words for use in RFCs to Indicate Requirement Levels", RFC 2119, Harvard University, March 1997

Informative.

SABLE

(http://www.research.att.com/~rws/Sable.v1_0.htm)
SABLE is a markup language for controlling text to speech engines. It has evolved out of work on combining three existing text to speech languages: SSML, STML and JSML. Implementations are available for the Bell Labs synthesizer and in the Festvial speech synthesizer. The following are two of the papers written about SABLE and its applications:

SABLE: A Standard for TTS Markup, Sproat et. al. (http://www.research.att.com/~rws/SABPAP/sabpap.htm)
SABLE: an XML-based Aural Display List For The WWW, Sproat and Raman. (http://www.bell-labs.com/project/tts/csssable.html)

Spoken Text Markup Language

(http://www.cstr.ed.ac.uk/publications/1997/Sproat_1997_a.ps)
STML is an SGML language for controlling text to speech engines developed jointly by Bell Laboratories and by the Centre for Speech Technology Research, Edinburgh University.

Microsoft Speech API Control Codes

(http://www.microsoft.com/iit/)
SAPI defines a set of inline control codes for manipulating speech output by SAPI speech synthesizers.

VoiceXML Prompts

(http://www.voicexml.com/)
The Voice XML specification for dialog systems development includes a set of prompt elements for generating speech synthesis and other audio output that are very similar to elements of JSML and SABLE.

8. Acknowledgements

This document was written with the participation of the members of the W3C Voice Browser Working Group (listed in alphabetical order):

Brian Eberman, SpeechWorks
Jim Larson, Intel
Bruce Lucas, IBM
T.V. Raman, IBM
Dave Raggett, W3C/Openwave
Richard Sproat, AT&T
Kuansan Wang, Microsoft