Copyright © 2009 W3C® (MIT, ERCIM, Keio), All Rights Reserved. W3C liability, trademark and document use rules apply.
The Voice Browser Working Group has sought to develop standards to enable access to the Web using spoken interaction. The Speech Synthesis Markup Language Specification is one of these standards and is designed to provide a rich, XML-based markup language for assisting the generation of synthetic speech in Web and other applications. The essential role of the markup language is to provide authors of synthesizable content a standard way to control aspects of speech such as pronunciation, volume, pitch, rate, etc. across different synthesis-capable platforms.
This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.
This is the 27 August 2009 Candidate Recommendation of "Speech Synthesis Markup Language (SSML) Version 1.1". Changes from the previous Candidate Recommendation can be found in Appendix G.
This document enhances SSML 1.0 [ SSML ] to provide better support for a broader set of natural (human) languages. To determine in what ways, if any, SSML is limited by its design with respect to supporting languages that are in large commercial or emerging markets for speech synthesis technologies but for which there was limited or no participation by either native speakers or experts during the development of SSML 1.0, the W3C held three workshops on the Internationalization of SSML. The first workshop [ WS ], in Beijing, PRC, in October 2005, focused primarily on Chinese, Korean, and Japanese languages, and the second [ WS2 ], in Crete, Greece, in May 2006, focused primarily on Arabic, Indian, and Eastern European languages. The third workshop [ WS3 ], in Hyderabad, India, in January 2007, focused heavily on Indian and Middle Eastern languages. Information collected during these workshops was used to develop a requirements document [ REQS11 ]. Changes from SSML 1.0 are motivated by these requirements.
This document has been produced as part of the Voice Browser Activity . The authors of this document are participants in the Voice Browser Working Group . For more information see the Voice Browser FAQ . The Working Group expects to advance this document to Recommendation status.
This is a W3C Candidate Recommendation for review by W3C Members and other interested parties. W3C publishes a technical report as a Candidate Recommendation to indicate that the document is believed to be stable, and to encourage implementation by the developer community.
The entrance criteria to the Proposed Recommendation phase require at least two independently developed interoperable implementations of each required feature, and at least one or two implementations of each optional feature depending on whether the feature's conformance requirements have an impact on interoperability. Detailed implementation requirements and the invitation for participation in the Implementation Report are provided in the Implementation Report Plan. We expect to meet all requirements of that report within the Candidate Recommendation period closing 27 October 2009. The Voice Browser Working Group will advance SSML 1.1 to Proposed Recommendation no sooner than 27 November 2009.
Although the Working Group has not formally identified any features as being at-risk, as a result of the previous publication, the Working Group now understands that some features may not receive adequate implementation experience. If this occurs, the group may request that the Director permit removal of the following features in a future request to advance to Proposed Recommendation:
clipBegin, clipEnd, repeatCount, and repeatDur (Section 3.3.1.1), the soundLevel attribute (Section 3.3.1.2), and the speed attribute (Section 3.3.1.3).
The group therefore specifically seeks implementation reports from anyone who is concerned about the possible removal of these features.
Comments are welcome on www-voice@w3.org ( archive ). See W3C mailing list and archive usage guidelines . Please check the disposition of comments received during the Last Call period.
Publication as a Candidate Recommendation does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.
This document was produced by a group operating under the 5 February 2004 W3C Patent Policy . W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy .
This W3C specification is known as the Speech Synthesis Markup Language specification (SSML) and is based upon the JSGF and/or JSML specifications, which are owned by Sun Microsystems, Inc., California, U.S.A. The JSML specification can be found at [ JSML ].
SSML is part of a larger set of markup specifications for voice browsers developed through the open processes of the W3C. It is designed to provide a rich, XML-based markup language for assisting the generation of synthetic speech in Web and other applications. The essential role of the markup language is to give authors of synthesizable content a standard way to control aspects of speech output such as pronunciation, volume, pitch, rate, etc. across different synthesis-capable platforms. A related initiative to establish a standard system for marking up text input is SABLE [ SABLE ], which tried to integrate many different XML-based markups for speech synthesis into a new one. The activity carried out in SABLE was also used as the main starting point for defining the Speech Synthesis Markup Requirements for Voice Markup Languages [ REQS ]. Since then, SABLE itself has not undergone any further development.
The intended use of SSML is to improve the quality of synthesized content. Different markup elements impact different stages of the synthesis process (see Section 1.2 ). The markup may be produced either automatically, for instance via XSLT or CSS3 from an XHTML document, or by human authoring. Markup may be present within a complete SSML document (see Section 2.2.2 ) or as part of a fragment (see Section 2.2.1 ) embedded in another language, although no interactions with other languages are specified as part of SSML itself. Most of the markup included in SSML is suitable for use by the majority of content developers; however, some advanced features like phoneme and prosody (e.g. for speech contour design) may require specialized knowledge.
The design and standardization process has followed from the Speech Synthesis Markup Requirements for Voice Markup Languages [ REQS ].
The following items were the key design criteria.
A Text-To-Speech system (a synthesis processor ) that supports SSML will be responsible for rendering a document as spoken output and for using the information contained in the markup to render the document as intended by the author.
Document creation: A text document provided as input to the synthesis processor may be produced automatically, by human authoring, or through a combination of these forms. SSML defines the form of the document.
Document processing: The following are the six major processing steps undertaken by a synthesis processor to convert marked-up text input into automatically generated voice output. The markup language is designed to be sufficiently rich so as to allow control over each of the steps described below so that the document author (human or machine) can control the final voice output. Although each step below is divided into "markup support" and "non-markup behavior", actual behavior is usually a mix of the two and varies depending on the tag. The processor has the ultimate authority to ensure that what it produces is pronounceable (and ideally intelligible). In general the markup provides a way for the author to make prosodic and other information available to the processor, typically information the processor would be unable to acquire on its own. It is then up to the processor to determine whether and in what way to use the information.
XML parse: An XML parser is used to extract the document tree and content from the incoming text document. The structure, tags and attributes obtained in this step influence each of the following steps.
Structure analysis: The structure of a document influences the way in which a document should be read. For example, there are common speaking patterns associated with paragraphs and sentences.
Markup support: The p and s elements defined in SSML explicitly indicate document structures that affect the speech output.
Non-markup behavior: In documents and parts of documents where these elements are not used, the synthesis processor is responsible for inferring the structure by automated analysis of the text, often using punctuation and other language-specific data.
Text normalization: All written languages have special constructs that require a conversion of the written form (orthographic form) into the spoken form. Text normalization is an automated process of the synthesis processor that performs this conversion. For example, for English, when "$200" appears in a document it may be spoken as "two hundred dollars". Similarly, "1/2" may be spoken as "half", "January second", "February first", "one of two" and so on. By the end of this step the text to be spoken has been converted completely into tokens. The exact details of what constitutes a token are language-specific. In English, tokens are usually separated by white space and are typically words. For languages with different tokenization behavior, the term "word" in this specification is intended to mean an appropriately comparable unit. Tokens in SSML cannot span markup tags except within the token and w elements. A simple English example is "cup<break/>board"; outside the token and w elements, the synthesis processor will treat this as the two tokens "cup" and "board" rather than as one token (word) with a pause in the middle. Breaking one token into multiple tokens this way will likely affect how the processor treats it.
Markup support: The say-as element can be used in the input document to explicitly indicate the presence and type of these constructs and to resolve ambiguities. The set of constructs that can be marked has not yet been defined but might include dates, times, numbers, acronyms, currency amounts and more. Note that many acronyms and abbreviations can be handled by the author via direct text replacement or by use of the sub element, e.g. "BBC" can be written as "B B C" and "AAA" can be written as "triple A". These replacement written forms will likely be pronounced as one would want the original acronyms to be pronounced. In the case of Japanese text, if you have a synthesis processor that supports both Kanji and kana, you may be able to use the sub element to identify whether 今日は should be spoken as きょうは ("kyou wa" = "today") or こんにちは ("konnichiwa" = "hello").
Non-markup behavior: For text content that is not marked with the say-as element the synthesis processor is expected to make a reasonable effort to automatically locate and convert these constructs to a speakable form. Because of inherent ambiguities (such as the "1/2" example above) and because of the wide range of possible constructs in any language, this process may introduce errors in the speech output and may cause different processors to render the same document differently.
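For concreteness, here is a minimal illustrative fragment, not drawn from this specification's normative examples, combining the say-as and sub approaches described above. The interpret-as and format values shown ("date", "md") are assumed to be supported by the processor, since the set of such values is not defined by this specification:

<?xml version="1.0"?>
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">
  <!-- Resolve the ambiguous construct "1/2" explicitly as a date -->
  Meet me on <say-as interpret-as="date" format="md">1/2</say-as>.
  <!-- Direct substitution of a speakable form for an acronym -->
  The <sub alias="B B C">BBC</sub> gave it a
  <sub alias="triple A">AAA</sub> rating.
</speak>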
Text-to-phoneme conversion: Once the synthesis processor has determined the set of tokens to be spoken, it must derive pronunciations for each token. Pronunciations may be conveniently described as sequences of phonemes, which are units of sound in a language that serve to distinguish one word from another. Each language (and sometimes each national or dialect variant of a language) has a specific phoneme set: e.g., most US English dialects have around 45 phonemes, Hawai'ian has between 12 and 18 (depending on who you ask), and some languages have more than 100! This conversion is made complex by a number of issues. One issue is that there are differences between written and spoken forms of a language, and these differences can lead to indeterminacy or ambiguity in the pronunciation of written words. For example, compared with their spoken form, words in Hebrew and Arabic are usually written with no vowels, or only a few vowels specified. In many languages the same written word may have many spoken forms. For example, in English, "read" may be spoken as "reed" (I will read the book) or "red" (I have read the book). Both human speakers and synthesis processors can pronounce these words correctly in context but may have difficulty without context (see "Non-markup behavior" below). Another issue is the handling of words with non-standard spellings or pronunciations. For example, an English synthesis processor will often have trouble determining how to speak some non-English-origin names, e.g. "Caius College" (pronounced "keys college") and President Tito (pronounced "sutto"), the president of the Republic of Kiribati (pronounced "kiribass").
Markup support: The phoneme element allows a phonemic sequence to be provided for any token or token sequence. This provides the content creator with explicit control over pronunciations. The say-as element might also be used to indicate that text is a proper name that may allow a synthesis processor to apply special rules to determine a pronunciation. The lexicon and lookup elements can be used to reference external definitions of pronunciations. These elements can be particularly useful for acronyms and abbreviations that the processor is unable to resolve via its own text normalization and that are not addressable via direct text substitution or the sub element (see paragraph 3, above).
Non-markup behavior: In the absence of a phoneme element the synthesis processor MUST apply automated capabilities to determine pronunciations. This is typically achieved by looking up tokens in a pronunciation dictionary (which may be language-dependent) and applying rules to determine other pronunciations. Synthesis processors are designed to perform text-to-phoneme conversions so most words of most documents can be handled automatically. As an alternative to relying upon the processor, authors may choose to perform some conversions themselves prior to encoding in SSML. Written words with indeterminate or ambiguous pronunciations could be replaced by words with an unambiguous pronunciation; for example, in the case of "read", "I will reed the book". Authors should be aware, however, that the resulting SSML document may not be optimal for visual display.
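As an informal sketch of this markup support (the IPA transcriptions shown are illustrative), the phoneme element can disambiguate the two pronunciations of "read" discussed above:

<?xml version="1.0"?>
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">
  I will <phoneme alphabet="ipa" ph="riːd">read</phoneme> the book
  that you have already <phoneme alphabet="ipa" ph="rɛd">read</phoneme>.
</speak>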
Prosody analysis: Prosody is the set of features of speech output that includes the pitch (also called intonation or melody), the timing (or rhythm), the pausing, the speaking rate, the emphasis on words and many other features. Producing human-like prosody is important for making speech sound natural and for correctly conveying the meaning of spoken language.
Markup support: The emphasis element, break element and prosody element may all be used by document creators to guide the synthesis processor in generating appropriate prosodic features in the speech output.
Non-markup behavior: In the absence of these elements, synthesis processors are expert (but not perfect) in automatically generating suitable prosody. This is achieved through analysis of the document structure, sentence syntax, and other information that can be inferred from the text input.
While most of the elements of SSML can be considered high-level in that they provide either content to be spoken or logical descriptions of style, the break and prosody elements mentioned above operate at a later point in the process and thus must coexist both with uses of the emphasis element and with the processor's own determinations of prosodic behavior. Unless specified in the appropriate sections, details of the interactions between the processor's own determinations and those provided by the author at this level are processor-specific. Authors are encouraged not to casually or arbitrarily mix these two levels of control.
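The following minimal fragment (attribute values arbitrary, for illustration only) shows the three prosodic elements together:

<?xml version="1.0"?>
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">
  That is <emphasis level="strong">not</emphasis> what I said.
  <break time="500ms"/>
  <prosody rate="slow" pitch="low">Please listen carefully.</prosody>
</speak>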
Waveform production: The phonemes and prosodic information are used by the synthesis processor in the production of the audio waveform. There are many approaches to this processing step so there may be considerable processor-specific variation.
Markup support: The voice element allows the document creator to request a particular voice or specific voice qualities (e.g. a young male voice). The audio element allows for insertion of recorded audio data into the output stream, with optional control over the duration, sound level and playback speed of the recording. Rendering can be restricted to a subset of the document by using the trimming attributes on the speak element.
Non-markup behavior: The default volume/sound level, speed, and pitch/frequency of both voices and recorded audio in the document are that of the unmodified waveforms, whether they be voices or recordings.
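For illustration only (the voice attribute values and the recording "prompt.wav" are hypothetical), a fragment requesting specific voice qualities and inserting recorded audio with fallback text:

<?xml version="1.0"?>
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">
  <voice gender="male" age="25">Welcome back.</voice>
  <!-- The contained text is rendered only if the recording cannot be played -->
  <audio src="prompt.wav">Welcome to the service.</audio>
</speak>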
There are many classes of document creator that will produce marked-up documents to be spoken by a synthesis processor . Not all document creators (including human and machine) have access to information that can be used in all of the elements or in each of the processing steps described in the previous section . The following are some of the common cases.
The document creator has no access to information to mark up the text. All processing steps in the synthesis processor must be performed fully automatically on raw text . The document requires only the containing speak element to indicate the content is to be spoken.
When marked text is generated programmatically the creator may have specific knowledge of the structure and/or special text constructs in some or all of the document. For example, an email reader can mark the location of the time and date of receipt of email. Such applications may use elements that affect structure, text normalization , prosody and possibly text-to-phoneme conversion.
Some document creators make considerable effort to mark as many details of the document as possible to ensure consistent speech quality across platforms and to more precisely specify output qualities. In these cases, the markup may use any or all of the available elements to tightly control the speech output. For example, prompts generated in telephony and voice browser applications may be fine-tuned to maximize the effectiveness of the overall system.
The most advanced document creators may skip the higher-level markup (structure, text normalization , text-to-phoneme conversion, and prosody analysis) and produce low-level speech synthesis markup for segments of documents or for entire documents. This typically requires tools to generate sequences of phonemes, plus pitch and timing information. For instance, tools that do "copy synthesis" or "prosody transplant" try to emulate human speech by copying properties from recordings.
The following are important instances of architectures or designs from which marked-up synthesis documents will be generated. The language design is intended to facilitate each of these approaches.
Dialog language : It is a requirement that it SHOULD be possible to include documents marked with SSML into the dialog description document to be produced by the Voice Browser Working Group.
Interoperability with aural CSS (ACSS) : Any HTML processor that is aural CSS-enabled can produce SSML. ACSS is covered in Section 19 of the Cascading Style Sheets, level 2 (CSS2) Specification [ CSS2 §19]. This usage of speech synthesis facilitates improved accessibility to existing HTML and XHTML content.
Application-specific style sheet processing : As mentioned above, there are classes of applications that have knowledge of text content to be spoken, and that can be incorporated into the speech synthesis markup to enhance rendering of the document. In many cases, it is expected that the application will use style sheets to perform transformations of existing XML documents to SSML. This is equivalent to the use of ACSS with HTML and once again SSML is the resulting representation to be passed to the synthesis processor . In this context, SSML may be viewed as a superset of ACSS [ CSS2 §19] capabilities, excepting spatial audio.
SSML provides a standard way to specify gross properties of synthetic speech production such as pronunciation, volume, pitch, rate, etc. Exact specification of synthetic speech output behavior across disparate processors, however, is beyond the scope of this document.
Unless otherwise specified, markup values are merely indications rather than absolutes. For example, it is possible for an author to explicitly indicate the duration of a text segment and also indicate an explicit duration for a subset of that text segment. If the two durations result in a text segment that the synthesis processor cannot reasonably render, the processor is permitted to modify the durations as needed to render the text segment.
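As a sketch of the kind of conflict just described (the durations are arbitrary), the inner prosody element below requests more time for a subset than the outer element allows for the whole, so the processor may adjust one or both values:

<?xml version="1.0"?>
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">
  <prosody duration="3s">This whole sentence should take three seconds,
    <prosody duration="10s">yet this part alone claims ten.</prosody>
  </prosody>
</speak>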
anyURI primitive as defined in XML Schema Part 2: Datatypes [SCHEMA2 §3.2.17]. For informational purposes only, [RFC3986] and [RFC2732] may be useful in understanding the structure, format, and use of URIs. Note that IRIs (see [RFC3987]) are permitted within the above definition of URI. Any relative URI reference MUST be resolved according to the rules given in Section 3.1.3.1. In this specification URIs are provided as attributes to elements, for example in the audio and lexicon elements.
A legal stand-alone Speech Synthesis Markup Language document MUST have a legal XML Prolog [ XML 1.0 or XML 1.1 , as appropriate, §2.8].
The XML prolog is followed by the root speak element. See Section 3.1.1 for details on this element.
The speak element MUST designate the SSML namespace. This can be achieved by declaring an xmlns attribute or an attribute with an "xmlns" prefix. See [XMLNS 1.0 or XMLNS 1.1, as appropriate, §2] for details. Note that when the xmlns attribute is used alone, it sets the default namespace for the element on which it appears and for any child elements. The namespace for SSML is defined to be http://www.w3.org/2001/10/synthesis.

It is RECOMMENDED that the speak element also indicate the location of the appropriate SSML schema (see Appendix D) via the xsi:schemaLocation attribute from [SCHEMA1 §2.6.3]. Although such indication is not required, to encourage it this document provides such indication on all of the examples. When this attribute is not given, the Core profile [Section 2.2.5] MUST be assumed.
The following are two examples of legal SSML headers:
<?xml version="1.0"?> <speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.w3.org/2001/10/synthesis " xml:lang="en-US">
<?xml version="1.0"?>
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
xml:lang="en-US">
<?xml version="1.0"?>
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
xml:lang="en-US">
<?xml version="1.0"?>
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
xml:lang="en-US">
The meta , metadata and lexicon elements MUST occur before all other elements and text contained within the root speak element. There are no other ordering constraints on the elements in this specification.
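For illustration (the lexicon URI is hypothetical, and the seeAlso resource reuses a value from a later example), a document satisfying this ordering constraint:

<?xml version="1.0"?>
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">
  <lexicon uri="http://www.example.com/lexicon.pls" xml:id="lex"/>
  <meta name="seeAlso" content="http://example.com/my-ssml-metadata.xml"/>
  Hello world.
</speak>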
A document fragment is a Conforming Core Speech Synthesis Markup Language Fragment if:
- with the exception of xml:lang and xml:base, all non-synthesis namespace elements and attributes and all xmlns attributes which refer to non-synthesis namespace elements are removed from the document,
- and, if the element does not already designate an xmlns attribute, then xmlns="http://www.w3.org/2001/10/synthesis" is added to the element.
A document fragment is a Conforming Extended Speech Synthesis Markup Language Fragment if:
- with the exception of xml:lang and xml:base, all non-synthesis namespace elements and attributes and all xmlns attributes which refer to non-synthesis namespace elements are removed from the document,
- and, if the element does not already designate an xmlns attribute, then xmlns="http://www.w3.org/2001/10/synthesis" is added to the element.
A document is a Conforming Stand-Alone Core Speech Synthesis Markup Language Document if it meets both the following conditions:
A document is a Conforming Stand-Alone Extended Speech Synthesis Markup Language Document if it meets both the following conditions:
The SSML specification and these conformance criteria provide no designated size limits on any aspect of synthesis documents. There are no maximum values on the number of elements, the amount of character data, or the number of characters in attribute values.
The synthesis namespace MAY be used with other XML namespaces as per the appropriate Namespaces in XML Recommendation (1.0 [ XMLNS 1.0 ] or 1.1 [ XMLNS 1.1 ], depending on the version of XML being used). Future work by W3C is expected to address ways to specify conformance for documents involving multiple namespaces. Language-specific (i.e. non-SSML) elements and attributes may be inserted into SSML using an appropriate namespace. However, such content would only be rendered by a synthesis processor that supported the custom markup. Here is an example of how one might insert Ruby [ RUBY ] elements into SSML:
<?xml version="1.0" encoding="UTF-8"?> <speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:xhtml="http://www.w3.org/1999/xhtml" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.w3.org/2001/10/synthesis " xml:lang="ja"> <!-- It's 20 July today. --> <s>今日は七月 <xhtml:ruby> <xhtml:rb>二十日</xhtml:rb> <xhtml:rt role="alphabet:x-JEITA">ハツカ</xhtml:rt> </xhtml:ruby> です。 </s> <!-- It's 20 July today. --> <s>今日は七月 <xhtml:ruby> <xhtml:rb>二十日</xhtml:rb> <xhtml:rt role="alphabet:x-JEITA">ニジューニチ</xhtml:rt> </xhtml:ruby> です。 </s>
<?xml version="1.0" encoding="UTF-8"?>
<speak version="1.1"
xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:xhtml="http://www.w3.org/1999/xhtml"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
xml:lang="ja">
<!-- It's 20 July today. -->
<s>今日は七月
<xhtml:ruby>
<xhtml:rb>二十日</xhtml:rb>
<xhtml:rt role="alphabet:x-JEITA">ハツカ</xhtml:rt>
</xhtml:ruby>
です。
</s>
<!-- It's 20 July today. -->
<s>今日は七月
<xhtml:ruby>
<xhtml:rb>二十日</xhtml:rb>
<xhtml:rt role="alphabet:x-JEITA">ニジューニチ</xhtml:rt>
</xhtml:ruby>
です。
</s>
</speak>
In a Conforming Speech Synthesis Markup Language Processor , the XML parser MUST be able to parse and process all XML constructs defined by XML 1.0 [ XML 1.0 ] and XML 1.1 [ XML 1.1 ] and the corresponding versions of Namespaces in XML (1.0 [ XMLNS 1.0 ] and 1.1 [ XMLNS 1.1 ]). This XML parser is not required to perform validation of an SSML document as per its schema or DTD; this implies that during processing of an SSML document it is OPTIONAL to apply or expand external entity references defined in an external DTD.
A Conforming Speech Synthesis Markup Language Processor MUST meet the following requirements for handling of natural (human) languages:
There is no conformance requirement with respect to performance characteristics of the Speech Synthesis Markup Language Processor. For instance, no statement is required regarding the accuracy, speed or other characteristics of speech produced by the processor. No statement is made regarding the size of input that a Speech Synthesis Markup Language Processor must support.
A Core Speech Synthesis Markup Language processor is a Conforming Speech Synthesis Markup Language Processor that can parse and process Conforming Stand-Alone Core Speech Synthesis Markup Language documents .
A Conforming Core Speech Synthesis Markup Language Processor MUST correctly understand and apply the semantics of the elements and attributes of the Core profile as described by this document.
When a Conforming Core Speech Synthesis Markup Language Processor encounters elements or attributes other than those included in the Core profile it MAY :
An Extended Speech Synthesis Markup Language processor is a Conforming Speech Synthesis Markup Language Processor that can parse and process Conforming Stand-Alone Extended Speech Synthesis Markup Language documents .
A Conforming Extended Speech Synthesis Markup Language Processor MUST correctly understand and apply the semantics of the elements and attributes of the Extended profile as described by this document.
When a Conforming Extended Speech Synthesis Markup Language Processor encounters elements or attributes other than those included in the Extended profile it MAY :
An SSML Profile is a collection of SSML elements and attributes. There are only two profiles defined in this document:
- the Core profile;
- the Extended profile, which adds to the Core profile the clipBegin, clipEnd, repeatCount, repeatDur, soundLevel, and speed attributes on the audio element.
A Conforming User Agent is a Conforming Speech Synthesis Markup Language Processor that is capable of accepting an SSML document as input and producing a spoken output by using the information contained in the markup to render the document as intended by the author. A Conforming User Agent MUST support at least one natural language.
Since the output cannot be guaranteed to be a correct representation of all the markup contained in the input there is no conformance requirement regarding accuracy. A conformance test MAY , however, require some examples of correct synthesis of a reference document to determine conformance.
The Synchronized Multimedia Integration Language (SMIL, pronounced "smile") [ SMIL ] enables simple authoring of interactive audiovisual presentations. SMIL is typically used for "rich media"/multimedia presentations which integrate streaming audio and video with images, text or any other media type. SMIL is an easy-to-learn HTML-like language, and many SMIL presentations are written using a simple text editor. See the SMIL/SSML integration examples in Appendix F .
Aural Cascading Style Sheets [ CSS2 §19] are employed to augment standard visual forms of documents (like HTML) with additional elements that assist in the synthesis of the text into audio. In comparison to SSML, ACSS-generated documents are capable of more complex specifications of the audio sequence, including the designation of 3D location of the audio source. Many of the other ACSS elements overlap SSML functionality, especially in the specification of voice type/quality. SSML may be viewed as a superset of ACSS capabilities, excepting spatial audio.
The Voice Extensible Markup Language [ VXML ] enables Web-based development and content-delivery for interactive voice response applications (see voice browser ). VoiceXML supports speech synthesis , recording and playback of digitized audio, speech recognition, DTMF input, telephony call control, and form-driven mixed initiative dialogs. VoiceXML 2.0 extends SSML for the markup of text to be synthesized. For an example of the integration between VoiceXML and SSML see Appendix F .
The fetching and caching behavior of SSML documents is defined by the environment in which the synthesis processor operates. In a VoiceXML interpreter context for example, the caching policy is determined by the VoiceXML interpreter.
The following elements and attributes are defined in this specification.
The Speech Synthesis Markup Language is an XML application. The root element is speak .
xml:lang is a REQUIRED attribute specifying the language of the root document. xml:base is an OPTIONAL attribute specifying the Base URI of the root document. onlangfailure is an OPTIONAL attribute specifying the desired behavior upon language speaking failure. The version attribute is a REQUIRED attribute that indicates the version of the specification to be used for the document and MUST have the value "1.1".
The trimming attributes are specified in a subsection, below.
Before the speak element is executed, the synthesis processor MUST select a default voice. Note that a language speaking failure (see Section 3.1.13 ) will occur as soon as the first text is encountered if the language of the text is one that the default voice cannot speak. This assumes that the voice has not been changed before encountering the text, of course.
<?xml version="1.0"?> <speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.w3.org/2001/10/synthesis " xml:lang="en-US"> ... the body ... </speak>
<?xml version="1.0"?>
<speak version="1.1"
xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
xml:lang="en-US">
... the body ...
</speak>
The speak element can only contain text to be rendered and the following elements: audio, break, emphasis, lang, lexicon, lookup, mark, meta, metadata, p, phoneme, prosody, say-as, sub, s, token, voice, w.
Trimming attributes define the span of the document to be rendered. Both the start and the end of the span within the speak content can be specified using marks.
The following trimming attributes are defined for speak :
| Name | Required | Type | Default Value | Description |
|---|---|---|---|---|
| startmark | false | xsd:token [SCHEMA2 §3.3.2] | none | The mark used to determine when rendering starts. |
| endmark | false | xsd:token [SCHEMA2 §3.3.2] | none | The mark used to determine when rendering ends. |
The startmark and endmark attributes specify a name that references a marker as assigned by the name attribute of the mark element. Only markers defined once in the document, i.e. that are unique, are permitted as the value of either startmark or endmark. The span of the document rendered is determined as follows:
- If startmark is specified, then rendering starts at the startmark. If startmark is not specified, rendering begins at the beginning of the document.
- If endmark is specified, then rendering ends at the endmark. If the endmark is not specified, rendering ends at the document end.
- If the startmark is after the endmark, then no audio is generated.

It is an error if the value given for either startmark or endmark is not a valid mark in the document.
If no trimming attributes are specified, then the complete document is rendered:
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.w3.org/2001/10/synthesis " xml:lang="en-US"> <audio src="first.wav"/> <mark name="mark1"/> <audio src="middle.wav"/> <mark name="mark2"/> <audio src="last.wav"/> </speak>
<speak version="1.1"
xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
xml:lang="en-US">
<audio src="first.wav"/>
<mark name="mark1"/>
<audio src="middle.wav"/>
<mark name="mark2"/>
<audio src="last.wav"/>
</speak>
here "first.wav", "middle.wav" and "last.wav" are rendered, where the mark "mark2" is the last mark rendered.
The startmark can be used to specify that rendering begins from a specific mark:
<speak startmark="mark1" version="1.1"
xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
xml:lang="en-US">
<audio src="first.wav"/>
<mark name="mark1"/>
<audio src="middle.wav"/>
<mark name="mark2"/>
<audio src="last.wav"/>
</speak>
"middle.wav"
and
"last.wav"
are
rendered,
but
not
"first.wav"
since
it
occurs
before
the
startmark
"mark1".
The end of rendering can be specified using the endmark:
<speak endmark="mark2" version="1.1"
xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
xml:lang="en-US">
<audio src="first.wav"/>
<mark name="mark1"/>
<audio src="middle.wav"/>
<mark name="mark2"/>
<audio src="last.wav"/>
</speak>
where "first.wav" and "middle.wav" are completely rendered but none of "last.wav" is rendered.
Finally, these trimming attributes can be used to control both the start and end of rendering:
<speak startmark="mark1" endmark="mark1" version="1.1" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.w3.org/2001/10/synthesis " xml:lang="en-US"> <audio src="first.wav"/> <mark name="mark1"/> <audio src="middle.wav"/> <mark name="mark2"/> <audio src="last.wav"/> </speak>
<speak startmark="mark1" endmark="mark2"
version="1.1"
xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
xml:lang="en-US">
<audio src="first.wav"/>
<mark name="mark1"/>
<audio src="middle.wav"/>
<mark name="mark2"/>
<audio src="last.wav"/>
</speak>
where only "middle.wav" is played.
xml:lang Attribute

The xml:lang attribute, as defined by XML [XML 1.0 or XML 1.1, as appropriate, §2.12], MAY be used in SSML to indicate the natural language of the written content of the element on which it occurs. BCP47 [BCP47] can help in understanding how to use this attribute.
Language information is inherited down the document hierarchy, i.e. it needs to be given only once if the whole document is in one language, and language information nests, i.e. inner attributes overwrite outer attributes.
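A minimal sketch of this inheritance and nesting (illustrative only):

<?xml version="1.0"?>
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">
  <p>This paragraph inherits en-US from the speak element.</p>
  <!-- The inner xml:lang overrides the document-level en-US -->
  <p xml:lang="fr">Ce paragraphe est en français.</p>
</speak>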
xml:lang is a defined attribute for the speak, lang, desc, p, s, token, and w elements. xml:lang is permitted on p, s, token, and w only because it is common to change the language at those levels.
The synthesis processor SHOULD use the value of the xml:lang attribute to assist it in determining the best way of rendering the content of the element on which it occurs. When the synthesis processor comes across text it does not know how to speak, it is the responsibility of the processor to decide what to do (see the onlangfailure attribute). One of the sources of information it can draw upon to make this decision is the value of the xml:lang attribute. The synthesis processor may also use the value of the xml:lang attribute to help it to determine the language of the content, which may of course affect how the voice will speak the content. For example, "The French word for cat is <lang xml:lang="fr">chat</lang>, not chat."

If the document author requires a new voice that is better adapted to the new language, then the synthesis processor can be explicitly requested to select a new voice by using the voice element. Further information about voice selection appears in Section 3.2.1.
The text normalization processing step may be affected by the enclosing language. This is true for both markup support by the say-as element and non-markup behavior. In the following example the same text "2/1/2000" may be read as "February first two thousand" in the first sentence, following American English pronunciation rules, but as "the second of January two thousand" in the second one, which follows Italian preprocessing rules.
<?xml version="1.0" encoding="ISO-8859-1"?> <speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.w3.org/2001/10/synthesis " xml:lang="en-US"> <s>Today, 2/1/2000.</s> <!-- Today, February first two thousand --> <s xml:lang="it">Un mese fà, 2/1/2000.</s> <!-- Un mese fà, il due gennaio duemila --> <!-- One month ago, the second of January two thousand --> </speak>
<?xml version="1.0" encoding="ISO-8859-1"?>
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
xml:lang="en-US">
<s>Today, 2/1/2000.</s>
<!-- Today, February first two thousand -->
<s xml:lang="it">Un mese fà, 2/1/2000.</s>
<!-- Un mese fà, il due gennaio duemila -->
<!-- One month ago, the second of January two thousand -->
</speak>
xml:base Attribute
Relative URIs are resolved according to a base URI , which may come from a variety of sources. The base URI declaration allows authors to specify a document's base URI explicitly. See Section 3.1.3.1 for details on the resolution of relative URIs.
The base URI declaration is permitted but OPTIONAL. The two elements affected by it are:

- audio: the OPTIONAL src attribute can specify a relative URI.
- lexicon: the uri attribute can specify a relative URI.
The base URI declaration follows [XML-BASE] and is indicated by an xml:base attribute on the root speak element.
<?xml version="1.0"?>
<speak version="1.1" xml:lang="en-US"
xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
xml:base="http://www.example.com/base-file-path">
<?xml version="1.0"?>
<speak version="1.1" xml:lang="en-US"
xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
"
xml:base="http://www.example.com/another-base-file-path">
<?xml version="1.0"?>
<speak version="1.1" xml:lang="en-US"
xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
xml:base="http://www.example.com/another-base-file-path">
User agents MUST calculate the base URI for resolving relative URIs according to [ RFC3986 ]. The following describes how RFC3986 applies to synthesis documents.
User agents MUST calculate the base URI according to the following precedences (highest priority to lowest):
1. The xml:base attribute on the speak element (see Section 3.1.3).
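For illustration (the URIs are hypothetical), with the xml:base declaration below, the relative src value resolves per [RFC3986] to http://www.example.com/sounds/welcome.wav:

<?xml version="1.0"?>
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US"
       xml:base="http://www.example.com/sounds/">
  <audio src="welcome.wav"/>
</speak>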
xml:id Attribute

The xml:id attribute [XML-ID] MAY be used in SSML to give an element an identifier that is unique to the document, allowing the element to be referenced from other documents.
xml:id is a defined attribute for the lexicon, p, s, token, and w elements.
An SSML document MAY reference one or more lexicon documents. A lexicon document is located by a URI with an OPTIONAL media type and is assigned a name that is unique in the SSML document.
Any number of lexicon elements MAY occur as immediate children of the speak element.
The lexicon element MUST have a uri attribute specifying a URI that identifies the location of the lexicon document.

The lexicon element MUST have an xml:id attribute that assigns a name to the lexicon document. The name MUST be unique to the current SSML document. The scope of this name is the current SSML document.

The lexicon element MAY have a type attribute that specifies the media type of the lexicon document. The default value of the type attribute is application/pls+xml, the media type associated with Pronunciation Lexicon Specification [PLS] documents as defined in [RFC4267].

The lexicon element MAY have a fetchtimeout attribute that specifies the timeout for fetches. The value is a Time Designation. The default value is processor-specific.

The lexicon element MAY have a maxage attribute that indicates that the document is willing to use content whose age is no greater than the specified time (cf. 'max-age' in HTTP 1.1 [RFC2616]). The value is an xsd:nonNegativeInteger [SCHEMA2 §3.3.20]. The document is not willing to use stale content, unless maxstale is also provided.

The lexicon element MAY have a maxstale attribute that indicates that the document is willing to use content that has exceeded its expiration time (cf. 'max-stale' in HTTP 1.1 [RFC2616]). The value is an xsd:nonNegativeInteger [SCHEMA2 §3.3.20]. If maxstale is assigned a value, then the document is willing to accept content that has exceeded its expiration time by no more than the specified amount of time.
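An illustrative sketch combining these attributes (the URI and the values are arbitrary): a PLS lexicon that must be fetched within three seconds and whose cached copy may be up to an hour old:

<?xml version="1.0"?>
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">
  <lexicon uri="http://www.example.com/lexicon.pls" xml:id="dict"
           type="application/pls+xml" fetchtimeout="3s" maxage="3600"/>
  ...
</speak>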
The lexicon element is an empty element.
If an error occurs in fetching or parsing a lexicon document, the synthesis processor MUST notify the hosting environment that such an error has occurred. The processor MAY notify the hosting environment immediately with an asynchronous event, or the processor MAY make the error notification through its logging system. The processor SHOULD include information about the error where possible; for example, if the lexicon couldn't be fetched due to an HTTP 404 error, that error code could be included with the notification. After notification, the processor MUST continue processing as if it had loaded an empty valid lexicon.
Note: the description and table that follow use an imaginary vendor-specific lexicon type of x-vnd.example.lexicon. This is intended to represent whatever format is returned/available, as appropriate.
A lexicon resource indicated by a URI reference may be available in one or more media types. The SSML author can specify the preferred media type via the type attribute. When the content represented by a URI is available in many data formats, a synthesis processor MAY use the preferred type to influence which of the multiple formats is used. For instance, on a server implementing HTTP content negotiation, the processor may use the type to order the preferences in the negotiation.
Upon delivery, the resource indicated by a URI reference may be considered in terms of two types. The declared media type is the alleged value for the resource and the actual media type is the true format of its content. The actual type should be the same as the declared type, but this is not always the case (e.g. a misconfigured HTTP server might return text/plain for a document following the vendor-specific x-vnd.example.lexicon format). A specific URI scheme may require that the resource owner always, sometimes, or never return a media type. Whenever a type is returned, it is treated as authoritative. The declared media type is determined by the value returned by the resource owner or, if none is returned, by the preferred media type given in the SSML document.
Three special cases may arise. The declared type may not be supported by the processor; this is an error . The declared type may be supported but the actual type may not match; this is also an error . Finally, no media type may be declared; the behavior depends on the specific URI scheme and the capabilities of the synthesis processor . For instance, HTTP 1.1 allows document introspection (see [ RFC2616 §7.2.1]), the data scheme falls back to a default media type, and local file access defines no guidelines. The following table provides some informative examples:
| | HTTP 1.1 request | HTTP 1.1 request | HTTP 1.1 request | Local file access |
|---|---|---|---|---|
| Media type returned by the resource owner | text/plain | x-vnd.example.lexicon | <none> | <none> |
| Preferred media type from the SSML document | Not applicable; the returned type is authoritative. | Not applicable; the returned type is authoritative. | x-vnd.example.lexicon | application/pls+xml |
| Declared media type | text/plain | x-vnd.example.lexicon | x-vnd.example.lexicon | <none> |
| Behavior for an actual media type of x-vnd.example.lexicon | This MUST be processed as text/plain. This will generate an error if text/plain is not supported or if the document does not follow the expected format. | The declared and actual types match; success if x-vnd.example.lexicon is supported by the synthesis processor; otherwise an error. | The declared and actual types match; success if x-vnd.example.lexicon is supported by the synthesis processor; otherwise an error. | Scheme specific; the synthesis processor might introspect the document to determine the type. |
The lookup element MUST have a ref attribute. The ref attribute specifies a name that references a lexicon document as assigned by the xml:id attribute of the lexicon element.
The referenced lexicon document may contain information (e.g., pronunciation) for tokens that can appear in a text to be rendered. For PLS lexicon documents, the information contained within the PLS document MUST be used by the synthesis processor when rendering tokens that appear within the content of a lookup element. For non-PLS lexicon documents, the information contained within the lexicon document SHOULD be used by the synthesis processor when rendering tokens that appear within the content of a lookup element, although the processor MAY choose not to use the information if it is deemed incompatible with the content of the SSML document. For example, a vendor-specific lexicon may be used only for particular values of the interpret-as attribute of the say-as element, or for a particular set of voices. Vendors SHOULD document the expected behavior of the synthesis processor when SSML content refers to a non-PLS lexicon.
A lookup element MAY contain other lookup elements. When a lookup element contains other lookup elements, the child lookup elements have higher precedence. Precedence means that a token is first looked up in the lexicon with highest precedence. Only if the token is not found in that lexicon is it then looked up in the lexicon with the next lower precedence, and so on until the token is successfully found or until all lexicons have been used for lookup. It is assumed that the synthesis processor already has one or more built-in system lexicons which will be treated as having a lower precedence than those specified using the lexicon and lookup elements. Note that if a token is not within the scope of at least one lookup element, then the token can only be looked up in the built-in system lexicons.
The lookup element can only contain text to be rendered and the following elements: audio, break, emphasis, lang, lookup, mark, p, phoneme, prosody, say-as, sub, s, token, voice, w.
<?xml version="1.0"?> <speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.w3.org/2001/10/synthesis " xml:lang="en-US"> <lexicon uri="http://www.example.com/lexicon.pls" xml:id="pls"/> <lexicon uri="http://www.example.com/strange-words.file" xml:id="sw" type="media-type"/> <lookup ref="pls"> tokens here are looked up in lexicon.pls <lookup ref="sw"> tokens here are looked up first in strange-words.file and then, if not found, in lexicon.pls </lookup> tokens here are looked up in lexicon.pls </lookup> tokens here are not looked up in lexicon documents ... </speak>
<?xml version="1.0"?>
<speak version="1.1"
xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
xml:lang="en-US">
<lexicon uri="http://www.example.com/lexicon.pls"
xml:id="pls"/>
<lexicon uri="http://www.example.com/strange-words.file"
xml:id="sw"
type="media-type"/>
<lookup ref="pls">
tokens here are looked up in lexicon.pls
<lookup ref="sw">
tokens here are looked up first in strange-words.file and then, if not found, in lexicon.pls
</lookup>
tokens here are looked up in lexicon.pls
</lookup>
tokens here are not looked up in lexicon documents
...
</speak>
The metadata and meta elements are containers in which information about the document can be placed. The metadata element provides more general and powerful treatment of metadata information than meta by using a metadata schema.
A meta declaration associates a string to a declared meta property or declares "http-equiv" content. Either a name or http-equiv attribute is REQUIRED. It is an error to provide both name and http-equiv attributes. A content attribute is REQUIRED.
The seeAlso property is the only defined meta property name. It is used to specify a resource that might provide additional metadata information about the content. This property is modeled on the seeAlso property of Resource Description Framework (RDF) Schema Specification 1.0 [RDF-SCHEMA §5.4.1].
The http-equiv attribute has a special significance when documents are retrieved via HTTP. Although the preferred method of providing HTTP header information is by using HTTP header fields, the "http-equiv" content MAY be used in situations where the SSML document author is unable to configure HTTP header fields associated with their document on the origin server, for example, cache control information. Note that HTTP servers and caches are not required to introspect the contents of meta in SSML documents and thereby override the header values they would send otherwise.
Informative: This is an example of how meta elements can be included in an SSML document to specify a resource that provides additional metadata information and also indicate that the document must not be cached.
<?xml version="1.0"?> <speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.w3.org/2001/10/synthesis " xml:lang="en-US"> <meta name="seeAlso" content="http://example.com/my-ssml-metadata.xml"/> <meta http-equiv="Cache-Control" content="no-cache"/> </speak>
<?xml version="1.0"?>
<speak version="1.1"
xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
xml:lang="en-US">
<meta name="seeAlso" content="http://example.com/my-ssml-metadata.xml"/>
<meta http-equiv="Cache-Control" content="no-cache"/>
</speak>
The meta element is an empty element.
The metadata element is a container in which information about the document can be placed using a metadata schema. Although any metadata schema can be used with metadata , it is RECOMMENDED that the XML syntax of the Resource Description Framework (RDF) [ RDF-XMLSYNTAX ] be used in conjunction with the general metadata properties defined in the Dublin Core Metadata Initiative [ DC ].
The Resource Description Format [ RDF ] is a declarative language and provides a standard way for using XML to represent metadata in the form of statements about properties and relationships of items on the Web. Content creators should refer to W3C metadata Recommendations [ RDF-XMLSYNTAX ] and [ RDF-SCHEMA ] when deciding which metadata RDF schema to use in their documents. Content creators should also refer to the Dublin Core Metadata Initiative [ DC ], which is a set of generally applicable core metadata properties (e.g., Title, Creator, Subject, Description, Rights, etc.).
Document properties declared with the metadata element can use any metadata schema.
Informative: This is an example of how metadata can be included in an SSML document using the Dublin Core version 1.0 RDF schema [ DC ] describing general document information such as title, description, date, and so on:
<?xml version="1.0"?> <speak version="1.1 " xmlns="http://www.w3.org/2001/10/synthesis" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.w3.org/2001/10/synthesis " xml:lang="en-US"> <metadata> <rdf:RDF xmlns:rdf = "http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:rdfs = "http://www.w3.org/2000/01/rdf-schema#" xmlns:dc = "http://purl.org/dc/elements/1.1/"> <!-- Metadata about the synthesis document --> <rdf:Description rdf:about="http://www.example.com/meta.ssml" dc:Title="Hamlet-like Soliloquy" dc:Description="Aldine's Soliloquy in the style of Hamlet" dc:Publisher="W3C" dc:Language="en-US" dc:Date="2002-11-29" dc:Rights="Copyright 2002 Aldine Turnbet" dc:Format="application/ssml+xml" > <dc:Creator> <rdf:Seq ID="CreatorsAlphabeticalBySurname"> <rdf:li>William Shakespeare</rdf:li> <rdf:li>Aldine Turnbet</rdf:li> </rdf:Seq> </dc:Creator> </rdf:Description> </rdf:RDF> </metadata> </speak>
<?xml version="1.0"?>
<speak version="1.1"
xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
xml:lang="en-US">
<metadata>
<rdf:RDF
xmlns:rdf = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:rdfs = "http://www.w3.org/2000/01/rdf-schema#"
xmlns:dc = "http://purl.org/dc/elements/1.1/">
<!-- Metadata about the synthesis document -->
<rdf:Description rdf:about="http://www.example.com/meta.ssml"
dc:Title="Hamlet-like Soliloquy"
dc:Description="Aldine's Soliloquy in the style of Hamlet"
dc:Publisher="W3C"
dc:Language="en-US"
dc:Date="2002-11-29"
dc:Rights="Copyright 2002 Aldine Turnbet"
dc:Format="application/ssml+xml" >
<dc:Creator>
<rdf:Seq ID="CreatorsAlphabeticalBySurname">
<rdf:li>William Shakespeare</rdf:li>
<rdf:li>Aldine Turnbet</rdf:li>
</rdf:Seq>
</dc:Creator>
</rdf:Description>
</rdf:RDF>
</metadata>
</speak>
The metadata element can have arbitrary content, although none of the content will be rendered by the synthesis processor .
A p element represents a paragraph. An s element represents a sentence.
xml:lang, xml:id, and onlangfailure are defined attributes on the p and s elements.
<?xml version="1.0"?>
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
xml:lang="en-US">
<p>
<s>This is the first sentence of the paragraph.</s>
<s>Here's another sentence.</s>
</p>
</speak>
The use of p and s elements is OPTIONAL . Where text occurs without an enclosing p or s element the synthesis processor SHOULD attempt to determine the structure using language-specific knowledge of the format of plain text.
The p element can only contain text to be rendered and the following elements: audio, break, emphasis, lang, lookup, mark, phoneme, prosody, say-as, sub, s, token, voice, w.
The s element can only contain text to be rendered and the following elements: audio, break, emphasis, lang, lookup, mark, phoneme, prosody, say-as, sub, token, voice, w.
The token element allows the author to indicate that its content is a token and to eliminate token (word) segmentation ambiguities of the synthesis processor.
The token element is necessary in order to render languages that do not use white space as a token boundary indicator, such as Chinese, Thai, and Japanese; languages that use white space for syllable segmentation, such as Vietnamese; and languages that use white space for other purposes, such as Urdu.
Use of this element can result in improved cues for prosodic control (e.g., pause) and may assist the synthesis processor in selection of the correct pronunciation for homographs. Other elements such as break , mark , and prosody are permitted within token to allow annotation at a sub-token level (e.g., syllable, mora, or whatever units are reasonable for the current language). Synthesis processors are REQUIRED to parse these annotations and MAY render them as they are able.
The text contents of the token element and its subelements are together considered to be one token for lexical lookup purposes.
Thus, "<token><emphasis>hap</emphasis>py</token>" and "<token><emphasis> hap </emphasis> py</token>" would refer to the tokens "happy" and "hap py", respectively. Note that this is different from how text and markup outside a token element are treated (see "Text normalization" in Section 1.2 ).
The use of token elements is OPTIONAL . Where text occurs without an enclosing token element the synthesis processor SHOULD attempt to determine the token segmentation using language-specific knowledge of the format of plain text.
xml:lang is a defined attribute on the token element to identify the written language of the content. xml:id is a defined attribute on the token element. onlangfailure is an OPTIONAL attribute specifying the desired behavior upon language speaking failure. role is an OPTIONAL defined attribute on the token element.
The role attribute takes as its value one or more white space separated QNames (as defined in Section 4 of Namespaces in XML (1.0 [ XMLNS 1.0 ] or 1.1 [ XMLNS 1.1 ], depending on the version of XML being used)). A QName in the attribute content is expanded into an expanded-name using the namespace declarations in scope for the containing token element. Thus, each QName provides a reference to a specific item in the designated namespace. In the second example below, the QName within the role attribute expands to the "VV0" item in the "http://www.example.com/claws7tags" namespace. This mechanism allows for referencing defined taxonomies of word classes, with the expectation that they are documented at the specified namespace URI. The role attribute is intended to be of use in synchronizing with other specifications, for example to describe additional information to help the selection of the most appropriate pronunciation for the contained text inside an external lexicon (see lexicon documents ).
The token element can only contain text to be rendered and the following elements: audio , break , emphasis , mark , phoneme , prosody , say-as , sub , voice .
The token element can only be contained in the following elements: audio , emphasis , lang , lookup , prosody , speak , p , s , voice .
The w element is an alias for the token element.
Here is an example showing the use of the token element.
<?xml version="1.0"?> <speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.w3.org/2001/10/synthesis " xml:lang="zh-CN"> <!-- The Nanjing Changjiang River Bridge --> <token>南京市</token><token>长江大桥</token> <!-- The mayor of Nanjing city, Jiang Daqiao --> 南京市长<w>江大桥</w> <!-- Shanghai is a metropolis --> 上海是个<w>大都会</w> <!-- Most Shanghainese will say something like that --> 上海人<w>大都</w>会那么说
<?xml version="1.0"?>
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
xml:lang="zh-CN">
<!-- The Nanjing Changjiang River Bridge -->
<token>南京市</token><token>长江大桥</token>
<!-- The mayor of Nanjing city, Jiang Daqiao -->
南京市长<w>江大桥</w>
<!-- Shanghai is a metropolis -->
上海是个<w>大都会</w>
<!-- Most Shanghainese will say something like that -->
上海人<w>大都</w>会那么说
</speak>
The next example shows the use of the role attribute. The first document below is a sample lexicon (PLS) for the Chinese word "处". The second references this lexicon and shows how the role attribute may be used to select the appropriate pronunciation of the Chinese word "处" in the dialog.
<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0"
xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
xmlns:claws="http://www.example.com/claws7tags"
alphabet="x-myorganization-pinyin"
xml:lang="zh-CN">
<lexeme role="claws:VV0">
<!-- base form of lexical verb -->
<grapheme>处</grapheme>
<phoneme>chu3</phoneme>
<!-- pinyin string is: "chǔ" in 处罚 处置 -->
</lexeme>
<lexeme role="claws:NN">
<!-- common noun, neutral for number -->
<grapheme>处</grapheme>
<phoneme>chu4</phoneme>
<!-- pinyin string is: "chù" in 处所 妙处 -->
</lexeme>
</lexicon>
<?xml version="1.0" encoding="UTF-8"?> <speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.w3.org/2001/10/synthesis " xmlns:claws="http://www.example.com/claws7tags" xml:lang="zh-CN"> <lexicon uri="http://www.example.com/lexicon.pls" type="application/pls+xml" xml:id="mylex"/> <lookup ref="mylex"> 他这个人很不好相<w role="claws:VV0">处</w>。 此<w role="claws:NN">处</w>不准照相。 </lookup> </speak>
<?xml version="1.0" encoding="UTF-8"?>
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
xmlns:claws="http://www.example.com/claws7tags"
xml:lang="zh-CN">
<lexicon uri="http://www.example.com/lexicon.pls"
type="application/pls+xml"
xml:id="mylex"/>
<lookup ref="mylex">
他这个人很不好相<w role="claws:VV0">处</w>。
此<w role="claws:NN">处</w>不准照相。
</lookup>
</speak>
The say-as element allows the author to indicate information on the type of text construct contained within the element and to help specify the level of detail for rendering the contained text.
Defining a comprehensive set of text format types is difficult because of the variety of languages that have to be considered and because of the innate flexibility of written languages. SSML only specifies the say-as element, its attributes, and their purpose. It does not enumerate the possible values for the attributes. The Working Group expects to produce a separate document that will define standard values and associated normative behavior for these values. Examples given here are only for illustrating the purpose of the element and the attributes.
The say-as element has three attributes: interpret-as, format, and detail. The interpret-as attribute is always REQUIRED; the other two attributes are OPTIONAL. The legal values for the format attribute depend on the value of the interpret-as attribute.
The say-as element can only contain text to be rendered.
interpret-as and format attributes
The interpret-as attribute indicates the content type of the contained text construct. Specifying the content type helps the synthesis processor to distinguish and interpret text constructs that may be rendered in different ways depending on what type of information is intended. In addition, the OPTIONAL format attribute can give further hints on the precise formatting of the contained text for content types that may have ambiguous formats.
When specified, the interpret-as and format values are to be interpreted by the synthesis processor as hints provided by the markup document author to aid text normalization and pronunciation.
In all cases, the text enclosed by any say-as element is intended to be a standard, orthographic form of the language currently in context. A synthesis processor SHOULD be able to support the common, orthographic forms of the specified language for every content type that it supports.
When the value for the interpret-as attribute is unknown or unsupported by a processor, it MUST render the contained text as if no interpret-as value were specified.
When the value for the format attribute is unknown or unsupported by a processor, it MUST render the contained text as if no format value were specified, and SHOULD render it using the interpret-as value that is specified.
When the content of the say-as element contains additional text next to the content that is in the indicated format and interpret-as type, then this additional text MUST be rendered. The processor MAY make the rendering of the additional text dependent on the interpret-as type of the element in which it appears.
When the content of the say-as element contains no content in the indicated interpret-as type or format, the processor MUST render the content either as if the format attribute were not present, or as if the interpret-as attribute were not present, or as if neither the format nor interpret-as attributes were present. The processor SHOULD also notify the environment of the mismatch.
Indicating the content type or format does not necessarily affect the way the information is pronounced. A synthesis processor SHOULD pronounce the contained text in a manner in which such content is normally produced for the language.
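Informative: the following is a sketch of how interpret-as and format might be used together. The values "date" and "mdy" are illustrative assumptions only; this specification does not enumerate legal values for these attributes.
<?xml version="1.0"?>
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US">
  <!-- "date" and "mdy" are illustrative values, not defined by this specification -->
  The meeting is on <say-as interpret-as="date" format="mdy">3/27/2009</say-as>.
</speak>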
detail attribute
The detail attribute is an OPTIONAL attribute that indicates the level of detail to be read aloud or rendered. Every value of the detail attribute MUST render all of the informational content in the contained text; however, specific values for the detail attribute can be used to render content that is not usually informational in running text but may be important to render for specific purposes. For example, a synthesis processor will usually render punctuation through appropriate changes in prosody. Setting a higher level of detail may be used to speak punctuation explicitly, e.g. for reading out coded part numbers or pieces of software code.
The detail attribute can be used for all interpret-as types.
If the detail attribute is not specified, the level of detail that is produced by the synthesis processor depends on the text content and the language.
When the value for the detail attribute is unknown or unsupported by a processor, it MUST render the contained text as if no value were specified for the detail attribute.
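Informative: a minimal sketch of the detail attribute. The values "characters" and "punctuation" are illustrative assumptions; this specification does not define standard interpret-as or detail values.
<?xml version="1.0"?>
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US">
  <!-- "characters" and "punctuation" are illustrative values,
       not defined by this specification -->
  The part number is
  <say-as interpret-as="characters" detail="punctuation">a-1,b</say-as>.
</speak>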
The phoneme element provides a phonemic/phonetic pronunciation for the contained text. The phoneme element MAY be empty. However, it is RECOMMENDED that the element contain human-readable text that can be used for non-spoken rendering of the document. For example, the content may be displayed visually for users with hearing impairments.
The ph attribute is a REQUIRED attribute that specifies the phoneme/phone string.
This element is designed strictly for phonemic and phonetic notations and is intended to be used to provide pronunciations for words or very short phrases. The phonemic/phonetic string does not undergo text normalization and is not treated as a token for lookup in the lexicon (see Section 3.1.5 ), while values in say-as and sub may undergo both. Briefly, phonemic strings consist of phonemes, language-dependent speech units that characterize linguistically significant differences in the language; loosely, phonemes represent all the sounds needed to distinguish one word from another in a given language. On the other hand, phonetic strings consist of phones, speech units that characterize the manner (puff of air, click, vocalized, etc.) and place (front, middle, back, etc.) of articulation within the human vocal tract and are thus independent of language; phones represent realized distinctions in human speech production.
The alphabet attribute is an OPTIONAL attribute that specifies the phonemic/phonetic pronunciation alphabet. A pronunciation alphabet in this context refers to a collection of symbols to represent the sounds of one or more human languages. The only valid values for this attribute are "ipa" (see the next paragraph), values defined in the Pronunciation Alphabet Registry and vendor-defined strings of the form "x-organization" or "x-organization-alphabet". For example, the Japan Electronics and Information Technology Industries Association [ JEITA ] might wish to encourage the use of an alphabet such as "x-JEITA" or "x-JEITA-IT-4002" for their phoneme alphabet [ JEIDAALPHABET ].
Synthesis processors SHOULD support a value for alphabet of "ipa", corresponding to Unicode representations of the phonetic characters developed by the International Phonetic Association [ IPA ]. In addition to an exhaustive set of vowel and consonant symbols, this character set supports a syllable delimiter, numerous diacritics, stress symbols, lexical tone symbols, intonational markers and more. For this alphabet, legal ph values are strings of the values specified in Appendix 2 of [ IPAHNDBK ]; note that an IPA transcription may contain white space characters to assist readability, which have no implications for the pronunciation. Informative tables of the IPA-to-Unicode mappings can be found at [ IPAUNICODE1 ] and [ IPAUNICODE2 ]. Note that not all of the IPA characters are available in Unicode. For processors supporting this alphabet, the processor MUST syntactically accept all legal ph values.
<?xml version="1.0"?>
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
xml:lang="en-US">
<phoneme alphabet="ipa" ph="təmei̥ɾou̥"> tomato </phoneme>
<!-- This is an example of IPA using character entities -->
<!-- Because many platform/browser/text editor combinations do not
correctly cut and paste Unicode text, this example uses the entity
escape versions of the IPA characters. Normally, one would directly
use the UTF-8 representation of these symbols: "təmei̥ɾou̥". -->
</speak>
It is an error if a value for alphabet is specified that is not known or cannot be applied by a synthesis processor. The default behavior when the alphabet attribute is left unspecified is processor-specific.
The type attribute is an OPTIONAL attribute that indicates additional information about how the pronunciation information is to be interpreted. The only allowed values for this attribute are "default", which has no implications, and "ruby", which indicates that the pronunciation information is from ruby text [ RUBY ]. The default value of this attribute is "default".
The phoneme element itself can only contain text (no elements).
Links to the Pronunciation Alphabet Registry can be found on the SSML namespace page at http://www.w3.org/2001/10/synthesis .
The sub element is employed to indicate that the text in the alias attribute value replaces the contained text for pronunciation. This allows a document to contain both a spoken and written form. The REQUIRED alias attribute specifies the string to be spoken instead of the enclosed string. The processor SHOULD apply text normalization to the alias value.
The sub element can only contain text (no elements).
<?xml version="1.0"?> <speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.w3.org/2001/10/synthesis " xml:lang="en-US"> <sub alias="World Wide Web Consortium">W3C</sub> <!-- World Wide Web Consortium --> </speak>
<?xml version="1.0"?>
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
xml:lang="en-US">
<sub alias="World Wide Web Consortium">W3C</sub>
<!-- World Wide Web Consortium -->
</speak>
The lang element is used to specify the natural language of the content.
xml:lang is a REQUIRED attribute specifying the language of the content. onlangfailure is an OPTIONAL attribute specifying the desired behavior upon language speaking failure.
This element MAY be used when there is a change in the natural language. There is no text structure associated with the language change indicated by the lang element. It MAY be used to specify the language of the content at a level other than a paragraph, sentence or word level. When language change is to be associated with text structure, it is RECOMMENDED to use the xml:lang attribute on the respective p, s, token, or w element.
<?xml version="1.0"?>
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
xml:lang="en-US">
The French word for cat is <w xml:lang="fr">chat</w>.
He prefers to eat pasta that is <lang xml:lang="it">al dente</lang>.
</speak>
The lang element can only contain text to be rendered and the following elements: audio, break, emphasis, lang, lookup, mark, p, phoneme, prosody, say-as, sub, s, token, voice, w.
onlangfailure Attribute
The onlangfailure attribute is an OPTIONAL attribute that contains one value from the following enumerated list describing the desired behavior of the synthesis processor upon language speaking failure. A conforming synthesis processor MUST report a language speaking failure in addition to taking the action(s) below.
- "changevoice": if a voice is available that can speak the language, the processor switches to that voice and speaks the content.
- "ignoretext": the processor does not attempt to render the text that is in the failed language.
- "ignorelang": the processor ignores the change in language and speaks as if the content were in the previous language.
- "processorchoice": the processor chooses the behavior (from the list above).
A language speaking failure occurs whenever the synthesis processor decides that the currently-selected voice (see Section 3.2.1 ) cannot speak the declared language of the text. This can occur when the synthesis processor encounters a new xml:lang value or characters or character sequences that the voice does not know how to process.
The value of this attribute is inherited down the document hierarchy, i.e. it needs to be given only once if the desired behavior for the whole document is the same, and settings of this value nest, i.e. inner attributes overwrite outer attributes. The top-level default value for this attribute is "processorchoice". Other languages which embed fragments of SSML (without a speak element) MUST declare the top-level default value for this attribute.
onlangfailure is permitted on all elements which can contain xml:lang, so it is a defined attribute for the speak, lang, desc, p, s, token, and w elements.
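Informative: a minimal sketch of onlangfailure inheritance, assuming the "changevoice" behavior described above. The attribute set on speak is inherited by the nested lang element.
<?xml version="1.0"?>
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US" onlangfailure="changevoice">
  <!-- If the current voice cannot speak French, the processor
       switches to one that can -->
  <lang xml:lang="fr">Bonjour tout le monde.</lang>
</speak>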
The voice element is a production element that requests a change in speaking voice. There are two kinds of attributes for the voice element: those that indicate desired features of a voice and those that control behavior. The voice feature attributes are:
gender: OPTIONAL attribute indicating the preferred gender of the voice to speak the contained text. Enumerated values are: "male", "female", "neutral", or the empty string "".
age: OPTIONAL attribute indicating the preferred age in years (since birth) of the voice to speak the contained text. Acceptable values are of type xsd:nonNegativeInteger [ SCHEMA2 §3.3.20] or the empty string "".
variant: OPTIONAL attribute indicating a preferred variant of the other voice characteristics to speak the contained text (e.g. the second male child voice). Valid values of variant are of type xsd:positiveInteger [ SCHEMA2 §3.3.25] or the empty string "".
name: OPTIONAL attribute indicating a processor-specific voice name to speak the contained text. The value MAY be a space-separated list of names ordered from top preference down or the empty string "". As a result a name MUST NOT contain any white space.
languages: OPTIONAL attribute indicating the list of languages the voice is desired to speak. The value MUST be either the empty string "" or a space-separated list of languages, with OPTIONAL accent indication per language. Each language/accent pair is of the form "language" or "language:accent", where both language and accent MUST be an Extended Language Range [ BCP47, Matching of Language Tags §2.2], except that the values "und" and "zxx" are disallowed. A voice satisfies the languages feature if, for each language/accent pair in the list, the voice is able to speak the language of the pair and, where an accent is given, to speak it with that accent.
For example, a languages value of "en:zh fr:ja" can legally be matched by any voice that can both read English (speaking it with a Chinese accent) and read French (speaking it with a Japanese accent). Thus, a voice that only supports "en-US" with a "zh-yue" accent and "fr-CA" with a "ja" accent would match. As another example, if we have <voice languages="fr:zh"> and there is no voice that supports French with a Chinese accent, then a voice selection failure will occur. Note that if no accent indication is given for a language, then any voice that speaks the language is acceptable, regardless of accent. Also, note that author control over language support during voice selection is independent of any value of xml:lang in the text.
For the feature attributes above, an empty string value indicates that any voice will satisfy the feature. The top-level default value for all feature attributes is "", the empty string.
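Informative: a sketch of the languages feature from the prose example above; this requests a voice that can read English with a Chinese accent and French with a Japanese accent.
<?xml version="1.0"?>
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US">
  <!-- Requests one voice satisfying both language/accent pairs -->
  <voice languages="en:zh fr:ja">
    This text is read by a single voice that satisfies both pairs.
  </voice>
</speak>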
The behavior control attributes of voice are:
required: OPTIONAL attribute that specifies a set of features by their respective attribute names. This set of features is used by the voice selection algorithm described below. Valid values of required are a space-separated list composed of values from the list of feature names: "name", "languages", "gender", "age", "variant", or the empty string "". The default value for this attribute is "languages".
ordering: OPTIONAL attribute that specifies the priority ordering of features. Valid values of ordering are a space-separated list composed of values from the list of feature names: "name", "languages", "gender", "age", "variant", or the empty string "", where features named earlier in the list have higher priority. The default value for this attribute is "languages". Features not listed in the ordering list have equal priority to each other but lower than that of the last feature in the list. Note that if the ordering attribute is set to the empty string then all features have the same priority.
onvoicefailure: OPTIONAL attribute containing one value from the following enumerated list describing the desired behavior of the synthesis processor upon voice selection failure. The default value for this attribute is "priorityselect".
- "priorityselect": the processor selects a voice using all of the feature values by feature priority, as if the required attribute value were the empty string.
- "keepexisting": the voice in effect does not change.
- "processorchoice": the processor chooses the behavior (either of the above).
The following voice selection algorithm MUST be used:
1. All voices available to the processor are identified for which the features specified by the required attribute value are matched. When the value of the required attribute is the empty string "", any and all voices are considered successful matches. If one or more voices are identified, the selection is considered successful; otherwise there is voice selection failure.
2. Upon successful selection, the remaining feature values (those not listed in the required attribute value) are used to choose a voice by feature priority, where the starting candidate set is the set of all voices identified.
3. Upon voice selection failure, the behavior of the synthesis processor is determined by the value of the onvoicefailure attribute.
4. Feature priority is given by the value of the ordering attribute.
5. After processing all features in the ordering list, if multiple voices remain in the candidate set, the synthesis processor MUST use any one of them.
Although each attribute individually is optional, it is an error if no attributes are specified when the voice element is used.
<?xml version="1.0"?> <speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.w3.org/2001/10/synthesis " xml:lang="en-US"> <voice gender="female" languages="en-US" required="languages gender variant">Mary had a little lamb,</voice> <!-- now request a different female child's voice --> <voice gender="female" variant="2"> Its fleece was white as snow. </voice> <!-- processor-specific voice selection --> <voice name="Mike" required="name">I want to be like Mike.</voice> </speak>
<?xml version="1.0"?>
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
xml:lang="en-US">
<voice gender="female" languages="en-US" required="languages gender variant">Mary had a little lamb,</voice>
<!-- now request a different female child's voice -->
<voice gender="female" variant="2">
Its fleece was white as snow.
</voice>
<!-- processor-specific voice selection -->
<voice name="Mike" required="name">I want to be like Mike.</voice>
</speak>
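Informative: a sketch of the behavior control attributes described above. Here only gender is required to match; among the matching voices, age has the highest priority, followed by variant.
<?xml version="1.0"?>
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US">
  <!-- A female voice is required; age, then variant, guide the
       choice among the matching voices -->
  <voice gender="female" age="6" variant="2"
         required="gender" ordering="age variant">
    Mary had a little lamb.
  </voice>
</speak>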
For every voice made available to a synthesis processor, the vendor of the voice must document the following:
Although indication of language (using xml:lang) and selection of voice (using voice) are independent, there is no requirement that a synthesis processor support every possible combination of values of the two. However, a synthesis processor MUST document expected rendering behavior for every possible combination. See the onlangfailure attribute for information on what happens when the processor encounters text content that the voice cannot speak.
voice attributes are inherited down the tree including to within elements that change the language. The defaults described for each attribute only apply at the top (document) level and are overridden by explicit author use of the voice element. In addition, changes in voice are scoped and apply only to the content of the element in which the change occurred. When processing reaches the end of a voice element content, i.e. the closing </voice> tag, the voice in effect before the beginning tag is restored.
Similarly, if a voice is changed by the processor as a result of a language speaking failure, the prior voice is restored when that voice is again able to speak the content. Note that there is always an active voice, since the synthesis processor is required to select a default voice before beginning execution of the document (see section 3.1.1 ).
<?xml version="1.0"?> <speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.w3.org/2001/10/synthesis " xml:lang="en-US"> <voice gender="female" required="languages gender age" languages="en-US ja"> Any female voice here. <voice age="6"> A female child voice here. <lang xml:lang="ja"> <!-- Same female child voice rendering Japanese text. --> </lang> </voice> </voice> </speak>
<?xml version="1.0"?>
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
xml:lang="en-US">
<voice gender="female" required="languages gender age" languages="en-US ja">
Any female voice here.
<voice age="6">
A female child voice here.
<lang xml:lang="ja">
<!-- Same female child voice rendering Japanese text. -->
</lang>
</voice>
</voice>
</speak>
Relative changes in prosodic parameters SHOULD be carried across voice changes. However, different voices have different natural defaults for pitch, speaking rate, etc. because they represent different personalities, so absolute values of the prosodic parameters may vary across changes in the voice.
The quality of the output audio or voice may suffer if a change in voice is requested within a sentence.
The voice element can only contain text to be rendered and the following elements: audio, break, emphasis, lang, lookup, mark, p, phoneme, prosody, say-as, sub, s, token, voice, w.
The emphasis element requests that the contained text be spoken with emphasis (also referred to as prominence or stress). The synthesis processor determines how to render emphasis since the nature of emphasis differs between languages, dialects or even voices. The attributes are:
level: the OPTIONAL level attribute indicates the strength of emphasis to be applied. Defined values are "strong", "moderate", "none" and "reduced". The default level is "moderate". The meaning of "strong" and "moderate" emphasis is interpreted according to the language being spoken (languages indicate emphasis using a possible combination of pitch change, timing changes, loudness and other acoustic differences). The "reduced" level is effectively the opposite of emphasizing a word. For example, when the phrase "going to" is reduced it may be spoken as "gonna". The "none" level is used to prevent the synthesis processor from emphasizing words that it might typically emphasize. The values "none", "moderate", and "strong" are monotonically non-decreasing in strength.
<?xml version="1.0"?>
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
xml:lang="en-US">
That is a <emphasis> big </emphasis> car!
That is a <emphasis level="strong"> huge </emphasis>
bank account!
</speak>
The emphasis element can only contain text to be rendered and the following elements: audio, break, emphasis, lang, lookup, mark, phoneme, prosody, say-as, sub, token, voice, w.
The break element is an empty element that controls the pausing or other prosodic boundaries between tokens. The use of the break element between any pair of tokens is OPTIONAL . If the element is not present between tokens, the synthesis processor is expected to automatically determine a break based on the linguistic context. In practice, the break element is most often used to override the typical automatic behavior of a synthesis processor. The attributes on this element are:
strength: the strength attribute is an OPTIONAL attribute having one of the following values: "none", "x-weak", "weak", "medium" (default value), "strong", or "x-strong". This attribute is used to indicate the strength of the prosodic break in the speech output. The value "none" indicates that no prosodic break boundary should be output, which can be used to prevent a prosodic break which the processor would otherwise produce. The other values indicate monotonically non-decreasing (conceptually increasing) break strength between tokens. The stronger boundaries are typically accompanied by pauses. "x-weak" and "x-strong" are mnemonics for "extra weak" and "extra strong", respectively.
time: the time attribute is an OPTIONAL attribute indicating the duration of a pause to be inserted in the output in seconds or milliseconds. It follows the time value format from the Cascading Style Sheets Level 2 Recommendation [ CSS2 ], e.g. "250ms", "3s".
The strength attribute is used to indicate the prosodic strength of the break. For example, the breaks between paragraphs are typically much stronger than the breaks between words within a sentence. The synthesis processor MAY insert a pause as part of its implementation of the prosodic break. A pause of a specific length can also be inserted by using the time attribute.
If a break element is used with neither strength nor time attributes, a break will be produced by the processor with a prosodic strength greater than that which the processor would otherwise have used if no break element was supplied. If both strength and time attributes are supplied, the processor will insert a break with a duration as specified by the time attribute, with other prosodic changes in the output based on the value of the strength attribute.
<?xml version="1.0"?>
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
xml:lang="en-US">
Take a deep breath <break/>
then continue.
Press 1 or wait for the tone. <break time="3s"/>
I didn't hear you! <break strength="weak"/> Please repeat.
</speak>
The prosody element permits control of the pitch, speaking rate and volume of the speech output. The attributes, all OPTIONAL , are:
pitch: the baseline pitch for the contained text. Although the exact meaning of "baseline pitch" will vary across synthesis processors, increasing/decreasing this value will typically increase/decrease the approximate pitch of the output. Legal values are: a number followed by "Hz", a relative change or "x-low", "low", "medium", "high", "x-high", or "default". Labels "x-low" through "x-high" represent a sequence of monotonically non-decreasing pitch levels.
contour: sets the actual pitch contour for the contained text. The format is specified in Pitch contour below.
range: the pitch range (variability) for the contained text. Although the exact meaning of "pitch range" will vary across synthesis processors, increasing/decreasing this value will typically increase/decrease the dynamic range of the output pitch. Legal values are: a number followed by "Hz", a relative change or "x-low", "low", "medium", "high", "x-high", or "default". Labels "x-low" through "x-high" represent a sequence of monotonically non-decreasing pitch ranges.
rate: a change in the speaking rate for the contained text. Legal values are: a non-negative percentage or "x-slow", "slow", "medium", "fast", "x-fast", or "default". Labels "x-slow" through "x-fast" represent a sequence of monotonically non-decreasing speaking rates. When the value is a non-negative percentage it acts as a multiplier of the default rate. For example, a value of 100% means no change in speaking rate, a value of 200% means a speaking rate twice the default rate, and a value of 50% means a speaking rate of half the default rate. The default rate for a voice depends on the language and dialect and on the personality of the voice. The default rate for a voice SHOULD be such that it is experienced as a normal speaking rate for the voice when reading aloud text. Since voices are processor-specific, the default rate will be as well.
duration: a value in seconds or milliseconds for the desired time to take to read the contained text. Follows the time value format from the Cascading Style Sheet Level 2 Recommendation [ CSS2 ], e.g. "250ms", "3s".
volume: the volume for the contained text. Legal values are: a number preceded by "+" or "-" and immediately followed by "dB"; or "silent", "x-soft", "soft", "medium", "loud", "x-loud", or "default". The default is +0.0dB. Specifying a value of "silent" amounts to specifying minus infinity decibels (dB). Labels "silent" through "x-loud" represent a sequence of monotonically non-decreasing volume levels. When the value is a signed number (dB), it specifies the ratio of the squares of the new signal amplitude (a1) and the current amplitude (a0), and is defined in terms of dB: volume(dB) = 20 log10(a1/a0)
Note that all numerical volume levels (in dB) are relative to the current level and that they are always signed (including zero). Also note that once the current volume level is set to "silent" all child relative changes also result in silence. A child prosody element MAY use the label "default" to reset the current volume level.
So that for a value of +0.0dB there is no change, for +6.0dB the new signal amplitude is approximately twice the current amplitude, and for -6.0dB it is approximately half.
Note that the behavior of this attribute for label values may differ from that of numerical values. Use of a numerical value causes direct modification of the waveform, while use of a label value may result in prosodic modifications that more accurately reflect how a human being would increase or decrease the perceived loudness of his speech, e.g., adjusting frequency and power differently for different sound units.
Although each attribute individually is optional, it is an error if no attributes are specified when the prosody element is used. The " x- foo " attribute value names are intended to be mnemonics for "extra foo ". All units ("Hz", "st") are case-sensitive. Note also that customary pitch levels and standard pitch ranges may vary significantly by language, as may the meanings of the labelled values for pitch targets and ranges.
Here is an example of how to use the volume attribute:
<speak version="1.1"
xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
xml:lang="en-US">
<s>I am speaking this at the default volume for this voice.</s>
<s><prosody volume="+6dB">
I am speaking this at approximately twice the original signal amplitude.
</prosody></s>
<s><prosody volume="-6dB">
I am speaking this at approximately half the original signal amplitude.
</prosody></s>
</speak>
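Informative: a minimal sketch of the nesting behavior described above, in which a child prosody element uses the label "default" to reset the volume inside a "silent" region.
<?xml version="1.0"?>
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US">
  <prosody volume="silent">
    This sentence is silent,
    <!-- "default" resets the current volume level -->
    <prosody volume="default">this one is spoken at the default volume,</prosody>
    and this one is silent again.
  </prosody>
</speak>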
A number is a simple positive floating point value without exponentials. Legal formats are "n", "n.", ".n" and "n.n" where "n" is a sequence of one or more digits.
A non-negative percentage is an unsigned number immediately followed by "%".
Relative changes for the attributes above can be specified as follows. For the pitch and range attributes, relative changes can be given in semitones (a number preceded by "+" or "-" and followed by "st") or in Hertz (a number preceded by "+" or "-" and followed by "Hz"): "+0.5st", "+5st", "-2st", "+10Hz", "-5.5Hz". A semitone is half of a tone (a half step) on the standard diatonic scale.
<?xml version="1.0"?>
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
xml:lang="en-US">
The price of XYZ is <prosody rate="90%">$45</prosody>
</speak>
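Informative: the same pattern with a relative pitch change given in semitones:
<?xml version="1.0"?>
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US">
  <!-- Raises the baseline pitch by two semitones for the contained text -->
  The price of XYZ is <prosody pitch="+2st">$45</prosody>
</speak>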
The pitch contour is defined as a set of white space-separated targets at specified time positions in the speech output. The algorithm for interpolating between the targets is processor-specific. In each pair of the form (time position,target), the first value is a percentage of the period of the contained text (a number followed by "%") and the second value is the value of the pitch attribute (a number followed by "Hz", a relative change, or a label value). Time position values outside 0% to 100% are ignored. If a pitch value is not defined for 0% or 100% then the nearest pitch target is copied. All relative values for the pitch are relative to the pitch value just before the contained text.
<?xml version="1.0"?>
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
xml:lang="en-US">
<prosody contour="(0%,+20Hz) (10%,+30%) (40%,+10Hz)">
good morning
</prosody>
</speak>
The duration attribute takes precedence over the rate attribute. The contour attribute takes precedence over the pitch and range attributes.
The default value of all prosodic attributes is no change. For example, omitting the rate attribute means that the rate is the same within the element as outside.
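Informative: a minimal sketch of the precedence rule above. Because duration takes precedence over rate, the rate value is ignored and the processor targets a total reading time of 3 seconds.
<?xml version="1.0"?>
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US">
  <!-- duration takes precedence; rate="50%" has no effect here -->
  <prosody duration="3s" rate="50%">
    This text is read in approximately three seconds.
  </prosody>
</speak>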
The prosody element can only contain text to be rendered and the following elements: audio, break, emphasis, lang, lookup, mark, p, phoneme, prosody, say-as, sub, s, token, voice, w.
All prosodic attribute values are indicative. If a synthesis processor is unable to accurately render a document as specified (e.g., trying to set the pitch to 1 MHz or the speaking rate to 1,000,000 words per minute), it MUST make a best effort to continue processing by imposing a limit or a substitute for the specified, unsupported value and MAY inform the host environment when such limits are exceeded.
In some cases, synthesis processors MAY elect to ignore a given prosodic markup if the processor determines, for example, that the indicated value is redundant, improper or in error. In particular, concatenative-type synthetic speech systems that employ large acoustic units MAY reject prosody-modifying markup elements if they are redundant with the prosody of a given acoustic unit(s) or would otherwise result in degraded speech quality.
The audio element supports the insertion of recorded audio files (see Appendix A for REQUIRED formats) and the insertion of other audio formats in conjunction with synthesized speech output. The audio element MAY be empty. If the audio element is not empty then the contents should be the marked-up text to be spoken if the audio document is not available. The alternate content MAY include text, speech markup, desc elements, or other audio elements. The alternate content MAY also be used when rendering the document to non-audible output and for accessibility (see the desc element). In addition to the OPTIONAL attributes described in subsections below, audio has the following attributes:
Name | Required | Type | Default Value | Description |
---|---|---|---|---|
src | false | URI | None | The URI of a document with an appropriate media type. If absent, the audio element behaves as if src were present with a legal URI but the document could not be fetched. |
fetchtimeout | false | Time Designation | Processor-specific | The timeout for fetches. |
fetchhint | false | The value "prefetch" or the value "safe" | prefetch | This tells the synthesis processor whether or not it can attempt to optimize rendering by pre-fetching audio. The value is either safe to say that audio is only fetched when it is needed, never before; or prefetch to permit, but not require the processor to pre-fetch the audio. |
maxage | false | xsd:nonNegativeInteger | None | Indicates that the document is willing to use content whose age is no greater than the specified time (cf. 'max-age' in HTTP 1.1 [ RFC2616 ]). The document is not willing to use stale content, unless maxstale is also provided. |
maxstale | false | xsd:nonNegativeInteger | None | Indicates that the document is willing to use content that has exceeded its expiration time (cf. 'max-stale' in HTTP 1.1 [ RFC2616 ]). If maxstale is assigned a value, then the document is willing to accept content that has exceeded its expiration time by no more than the specified amount of time. |
<?xml version="1.0"?> <speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.w3.org/2001/10/synthesis " xml:lang="en-US"> <!-- Empty element --> Please say your name after the tone. <audio src="beep.wav"/> <!-- Container element with alternative text --> <audio src="prompt.au">What city do you want to fly from?</audio> <audio src="welcome.wav"> <emphasis>Welcome</emphasis> to the Voice Portal. </audio> </speak>
<?xml version="1.0"?>
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
xml:lang="en-US">
<!-- Empty element -->
Please say your name after the tone. <audio src="beep.wav"/>
<!-- Container element with alternative text -->
<audio src="prompt.au">What city do you want to fly from?</audio>
<audio src="welcome.wav">
<emphasis>Welcome</emphasis> to the Voice Portal.
</audio>
</speak>
An audio element is successfully rendered by playing the referenced audio source or, when the referenced audio cannot be played, by rendering the alternative content.
When attempting to play the audio source a number of different issues may arise such as mismatched media types or bad header information about the media. In general the synthesis processor makes a best effort to play the referenced media and, when unsuccessful, the processor MUST play the alternative content. Note the processor MUST NOT render both all or part of the referenced media and all or part of the referenced alternative content. If any of the referenced media is processed and rendered then the playback is considered a successful playback within the context of this section. If an error occurs that causes the alternative content to be rendered instead of the referenced media the processor MUST notify the hosting environment that such an error has occurred. The processor MAY notify the hosting environment immediately with an asynchronous event, or the processor MAY notify the hosting environment only at the end of playback when it signals to the hosting environment that it has completed rendering the request, or the processor MAY make the error notification through its logging system. The processor SHOULD include information about the error where possible; for example, if the media resource couldn't be fetched due to an http 404 error, that error code could be included with the notification.
The audio element can only contain text to be rendered and the following elements: audio, break, desc, emphasis, lang, lookup, mark, p, phoneme, prosody, say-as, sub, s, token, voice, w.
Trimming attributes define the span of the audio to be rendered. Both the start and the end of the span within the audio content can be specified using time offsets. The duration of the span, including repetitions, can also be specified with repeat attributes. Synthesis processor support for these attributes is REQUIRED in the Extended profile .
The following trimming attributes are defined for audio :
Name | Required | Type | Default Value | Description |
---|---|---|---|---|
clipBegin | false | Time Designation | 0s | offset from start of media to begin rendering. This offset is measured in normal media playback time from the beginning of the media. |
clipEnd | false | Time Designation | None | offset from start of media to end rendering. This offset is measured in normal media playback time from the beginning of the media. |
repeatCount | false | a positive Real Number | 1 | number of iterations of media to render. A fractional value describes a portion of the rendered media. |
repeatDur | false | Time Designation | None | total duration for repeatedly rendering media. This duration is measured in normal media playback time from the beginning of the media. |
Calculations of rendered durations and interaction with other timing properties follow SMIL 2.1 Computing the active duration, where:
- clipBegin, clipEnd, and repeatDur are a subset of SMIL Clock-values.
- If repeatDur is specified, repeatCount will have no effect, since repeatDur takes precedence over repeatCount in determining the total time for rendering media.
- If clipEnd is after the end of the audio, then rendering ends at the audio end.
- If clipBegin is after clipEnd, no audio will be produced.
Note that not all SMIL 2.1 Timing features are supported.
Real numbers and integers are specified in decimal notation only.
An integer consists of one or more digits "0" to "9".
A real number may be an integer, or it may be zero or more digits followed by a dot (.) followed by one or more digits. Both integers and real numbers may be preceded by a "-" or "+" to indicate the sign.
Time designations consist of a non-negative real number followed by a time unit identifier. The time unit identifiers are:
Examples include: "3s", "850ms", "0.7s", ".5s" and "+1.5s".
In the following example, rendering of the media begins 10 seconds into the audio:
<?xml version="1.0"?> <speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.w3.org/2001/10/synthesis " xml:lang="en-US"> <audio src="radio.wav" clipBegin="10s" /> </speak>
<?xml version="1.0"?>
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
http://www.w3.org/TR/speech-synthesis11/synthesis-extended.xsd"
xml:lang="en-US">
<audio src="radio.wav" clipBegin="10s" />
</speak>
Here the rendering of the media ends after 20 seconds of audio:
<?xml version="1.0"?> <speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.w3.org/2001/10/synthesis " xml:lang="en-US"> <audio src="radio.wav" clipBegin="10s" clipEnd="20s" /> </speak>
<?xml version="1.0"?>
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
http://www.w3.org/TR/speech-synthesis11/synthesis-extended.xsd"
xml:lang="en-US">
<audio src="radio.wav" clipBegin="10s" clipEnd="20s" />
</speak>
Note that if the duration of "radio.wav" is less than 20 seconds, the clipEnd value is ignored, and the rendering end is set equal to the effective end of the media.
In the following example, the duration of the audio is constrained by repeatCount:
<?xml version="1.0"?>
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
http://www.w3.org/TR/speech-synthesis11/synthesis-extended.xsd"
xml:lang="en-US">
<audio src="3second_sound.au" repeatCount="0.5" />
</speak>
Only the first half of the clip will play; the active duration will be 1.5 seconds.
In the following example, the audio will repeat for a total of 7 seconds. It will play fully two times, followed by a fractional part of 2 seconds. This is equivalent to a repeatCount of 2.8.
<?xml version="1.0"?>
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
http://www.w3.org/TR/speech-synthesis11/synthesis-extended.xsd"
xml:lang="en-US">
<audio src="2.5second_music.mp3" repeatDur="7s" />
</speak>
In the following example, the active duration of the audio will be 4 seconds. Playback will start 1 second into the audio (as specified by the clipBegin value) and then play for 1 second (since clipEnd is specified as 2 seconds), and then this span will be repeated so that the total duration is 4 seconds (as specified by repeatDur). Note that the value of repeatDur takes precedence over the value of repeatCount.
<?xml version="1.0"?>
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
"
xml:lang="en-US">
<audio src="2.5second_music.mp3" clipBegin="1s" clipEnd="2s"
repeatCount="5" repeatDur="4s" />
</speak>
<?xml version="1.0"?>
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
http://www.w3.org/TR/speech-synthesis11/synthesis-extended.xsd"
xml:lang="en-US">
<audio src="2.5second_music.mp3" clipBegin="1s" clipEnd="2s"
repeatCount="5" repeatDur="4s" />
</speak>
These attributes can interact with the rendering specified by speak trimming attributes:
<speak version="1.1" startmark="mark1" endmark="mark2" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.w3.org/2001/10/synthesis " xml:lang="en-US"> <audio src="first.wav"/> <mark name="mark1"/> <audio src="15second_music.mp3" clipBegin="2s" clipEnd="7s" /> <mark name="mark2"/> <audio src="last.wav"/> </speak>
<speak version="1.1" startmark="mark1" endmark="mark2"
xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
http://www.w3.org/TR/speech-synthesis11/synthesis-extended.xsd"
xml:lang="en-US">
<audio src="first.wav"/>
<mark name="mark1"/>
<audio src="15second_music.mp3" clipBegin="2s" clipEnd="7s" />
<mark name="mark2"/>
<audio src="last.wav"/>
</speak>
The speak startmark and endmark allow only the "15second_music.mp3" clip to be played. The actual duration of the audio is 5 seconds: the clip begins at 2 seconds into the audio and ends after 7 seconds, hence a duration of 5 seconds.
soundLevel Attribute
The soundLevel attribute specifies the relative volume of the referenced audio. It is inspired by the similarly-named attribute in SMIL 3 [ SMIL3 ]. Synthesis processor support for this attribute is REQUIRED in the Extended profile .
Name | Required | Type | Default Value | Description |
---|---|---|---|---|
soundLevel | false | signed ("+" or "-") CSS2 numbers immediately followed by "dB" | +0.0dB | Decibel values are interpreted as a ratio of the squares of the new signal amplitude (a1) and the current amplitude (a0) and are defined in terms of dB: soundLevel(dB) = 20 log10(a1/a0). A setting of a large negative value effectively plays the media silently. A value of '-6.0dB' will play the media at approximately half the amplitude of its current signal amplitude. Similarly, a value of '+6.0dB' will play the media at approximately twice the amplitude of its current signal amplitude (subject to hardware limitations). The absolute sound level of media perceived is further subject to system volume settings, which cannot be controlled with this attribute. |
Here is an example of how to use the soundLevel attribute:
<speak version="1.1"
xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
http://www.w3.org/TR/speech-synthesis11/synthesis-extended.xsd"
xml:lang="en-US">
<s>This is the original, unmodified waveform:
<audio src="message.wav"/>
</s>
<s>This is the same audio at approximately twice the signal amplitude:
<audio soundLevel="+6dB" src="message.wav"/>
</s>
<s>This is the same audio at approximately half the original signal amplitude:
<audio soundLevel="-6dB" src="message.wav"/>
</s>
</speak>
speed Attribute
The speed attribute controls the playback speed of the referenced audio, to speed up or slow down the effective rate of play relative to the original speed of the waveform. The argument value does not specify an absolute play speed, but rather is relative to the playback speed of the original waveform. Synthesis processor support for this attribute is REQUIRED in the Extended profile .
Name | Required | Type | Default Value | Description |
---|---|---|---|---|
speed | false | x% (where x is a positive real value) | 100%, which corresponds to the speed of an unmodified audio waveform | The speed at which to play the referenced audio, relative to the original speed. The speed is set to the requested percentage of the speed of the original waveform. |
A change in the value of the speed attribute will change the rate at which recorded samples are played back. Note that this will affect the pitch.
Here is an example of how to use the speed attribute:
<speak version="1.1"
xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
http://www.w3.org/TR/speech-synthesis11/synthesis-extended.xsd"
xml:lang="en-US">
<s>This is the original, unmodified waveform:
<audio src="message.wav"/>
</s>
<s>This is the same audio at twice the speed:
<audio speed="200%" src="message.wav"/>
</s>
<s>This is the same audio at half the original speed:
<audio speed="50%" src="message.wav"/>
</s>
</speak>
A mark element is an empty element that places a marker into the text/tag sequence. It has one REQUIRED attribute, name, which is of type xsd:token [ SCHEMA2 §3.3.2]. The mark element can be used to reference a specific location in the text/tag sequence, and can additionally be used to insert a marker into an output stream for asynchronous notification. When processing a mark element, a synthesis processor MUST do one or both of the following:

- inform the hosting environment with the value of the name attribute and with information allowing the platform to retrieve the corresponding position in the rendered output.
- when audio output of the SSML document reaches the mark, issue an event that includes the REQUIRED name attribute of the element. The hosting environment defines the destination of the event.
The mark element does not affect the speech output process.
<?xml version="1.0"?> <speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.w3.org/2001/10/synthesis " xml:lang="en-US"> Go from <mark name="here"/> here, to <mark name="there"/> there! </speak>
<?xml version="1.0"?>
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
xml:lang="en-US">
Go from <mark name="here"/> here, to <mark name="there"/> there!
</speak>
The desc element can only occur within the content of the audio element. When the audio source referenced in audio is not speech, e.g. audio wallpaper or sonicon punctuation, it should contain a desc element whose textual content is a description of the audio source (e.g. "door slamming"). If text-only output is being produced by the synthesis processor, the content of the desc element(s) SHOULD be rendered instead of other alternative content in audio.
The OPTIONAL xml:lang attribute can be used to indicate that the content of the element is in a different language from that of the content surrounding the element. The OPTIONAL onlangfailure attribute can be used to specify the desired behavior upon language speaking failure.
<?xml version="1.0"?>
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
xml:lang="en-US">
<!-- Normal use of <desc> -->
Heads of State often make mistakes when speaking in a foreign language.
One of the most well-known examples is that of John F. Kennedy:
<audio src="ichbineinberliner.wav">If you could hear it, this would be
a recording of John F. Kennedy speaking in Berlin.
<desc>Kennedy's famous German language gaffe</desc>
</audio>
<!-- Suggesting the language of the recording -->
<!-- Although there is no requirement that a recording be in the current language
(since it might even be non-speech such as music), an author might wish to
suggest the language of the recording by marking the entire <audio> element
using <lang>. In this case, the xml:lang attribute on <desc> can be used
to put the description back into the original language. -->
Here's the same thing again but with a different fallback:
<lang xml:lang="de-DE">
<audio src="ichbineinberliner.wav">Ich bin ein Berliner.
<desc xml:lang="en-US">Kennedy's famous German language gaffe</desc>
</audio>
</lang>
</speak>
The desc element can only contain descriptive text.
This document was written with the participation of the following members of the W3C Voice Browser Working Group and other W3C Working Groups (listed in alphabetical order by family name):
The editors also wish to thank the members of the W3C Internationalization Working Group, who have provided significant review and contributions to SSML 1.0 and 1.1.
This appendix is normative.
SSML requires that a platform support the playing of the audio formats specified below.
Audio Format | Media Type |
---|---|
Raw (headerless) 8kHz 8-bit mono mu-law (PCM) single channel. (G.711) | audio/basic (from [ RFC1521 ]) |
Raw (headerless) 8kHz 8-bit mono A-law (PCM) single channel. (G.711) | audio/x-alaw-basic |
WAV (RIFF header) 8kHz 8-bit mono mu-law (PCM) single channel. | audio/x-wav |
WAV (RIFF header) 8kHz 8-bit mono A-law (PCM) single channel. | audio/x-wav |
The 'audio/basic' media type is commonly used with the 'au' header format as well as the headerless 8-bit 8kHz mu-law format. If this media type is specified for playing, the mu-law format MUST be used. For playback with the 'audio/basic' media type, processors MUST support the mu-law format and MAY support the 'au' format.
This appendix is normative.
SSML is an application of XML [ XML 1.0 or XML 1.1 ] and thus supports [ UNICODE ], which defines a standard universal character set.
SSML provides a mechanism for control of the spoken language via the use of the xml:lang attribute. Language changes can occur as frequently as per token (word), although excessive language changes can diminish the output audio quality. SSML also permits finer control over output pronunciations via the lexicon and phoneme elements, features that can help to mitigate poor-quality default lexicons for languages with only minimal commercial support today.
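For instance (an illustrative sketch, not taken from the specification; the IPA transcription shown is invented for illustration), a document can switch language for a single word with the lang element and pin down a pronunciation with the phoneme element:

<?xml version="1.0"?>
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
       http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US">
  <!-- A single foreign token, rendered with French pronunciation rules -->
  We met at a <lang xml:lang="fr-FR">rendez-vous</lang> downtown.
  <!-- An explicit transcription that bypasses the default lexicon -->
  You say <phoneme alphabet="ipa" ph="təˈmɑːtoʊ">tomato</phoneme>.
</speak>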
This appendix is normative.
The media type associated with the Speech Synthesis Markup Language specification is "application/ssml+xml" and the filename suffix is ".ssml" as defined in [ RFC4267 ].
This appendix is normative.
The synthesis schema for the Core profile ( Sec. 2.2.5 ) is located at http://www.w3.org/TR/speech-synthesis11/synthesis.xsd , and the schema for the Extended profile ( Sec. 2.2.5 ) is located at http://www.w3.org/TR/speech-synthesis11/synthesis-extended.xsd .
Note: the synthesis schemas include no-namespace schemas for the Core and Extended profiles, located respectively at http://www.w3.org/TR/speech-synthesis11/synthesis-nonamespace.xsd and http://www.w3.org/TR/speech-synthesis11/synthesis-nonamespace-extended.xsd , which MAY be used as a basis for specifying Speech Synthesis Markup Language Fragments ( Sec. 2.2.1 ) embedded in non-synthesis namespace schemas.
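As a sketch of how such an embedding might look (a hypothetical host schema; every name other than the schema URL is invented for illustration), a non-synthesis schema could pull in the no-namespace Core schema with xsd:include:

<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <!-- Bring in the no-namespace Core profile definitions -->
  <xsd:include
      schemaLocation="http://www.w3.org/TR/speech-synthesis11/synthesis-nonamespace.xsd"/>
  <!-- Hypothetical host element that embeds an SSML fragment; the referenced
       name "speak" is illustrative, consult the included schema for the
       actual global element and type names -->
  <xsd:element name="prompt">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element ref="speak"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>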
This appendix is informative.
The following is an example of reading headers of email messages. The p and s elements are used to mark the text structure. The break element is placed before the time and has the effect of marking the time as important information for the listener to pay attention to. The prosody element is used to slow the speaking rate of the email subject so that the user has extra time to listen and write down the details.
<?xml version="1.0"?> <speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.w3.org/2001/10/synthesis " xml:lang="en-US"> <p> <s>You have 4 new messages.</s> <s>The first is from Stephanie Williams and arrived at <break/> 3:45pm. </s> <s> The subject is <prosody rate="-20%">ski trip</prosody> </s> </p> </speak>
<?xml version="1.0"?>
<speak version="1.1"
xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
xml:lang="en-US">
<p>
<s>You have 4 new messages.</s>
<s>The first is from Stephanie Williams and arrived at <break/> 3:45pm.
</s>
<s>
The subject is <prosody rate="-20%">ski trip</prosody>
</s>
</p>
</speak>
The following example combines audio files and different spoken voices to provide information on a collection of music.
<?xml version="1.0"?>
<speak version="1.1"
xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
xml:lang="en-US">
<p>
<voice gender="male">
<s>Today we preview the latest romantic music from Example.</s>
<s>Hear what the Software Reviews said about Example's newest hit.</s>
</voice>
</p>
<p>
<voice gender="female">
He sings about issues that touch us all.
</voice>
</p>
<p>
<voice gender="male">
Here's a sample. <audio src="http://www.example.com/music.wav"/>
Would you like to buy it?
</voice>
</p>
</speak>
It is often the case that an author wishes to include a bit of foreign text (say, a movie title) in an application without having to switch languages (for example via the lang element). A simple way to do this is shown here. In this example the synthesis processor would render the movie name using the pronunciation rules of the container language ("en-US" in this case), similar to how a reader who doesn't know the foreign language might try to read (and pronounce) it.
<?xml version="1.0" encoding="ISO-8859-1"?>
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
xml:lang="en-US">
The title of the movie is:
"La vita è bella"
(Life is beautiful),
which is directed by Roberto Benigni.
</speak>
With some additional work the output quality can be improved tremendously either by creating a custom pronunciation in an external lexicon (see Section 3.1.5 ) or via the phoneme element as shown in the next example.
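Before that example, here is a sketch of the external-lexicon route (informative; the entry below uses the W3C Pronunciation Lexicon Specification format, and the file name 'movie.pls' is hypothetical). The transcription is the one used for Benigni in the example that follows:

<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0"
         xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
         alphabet="ipa" xml:lang="en-US">
  <!-- One entry; the transcription is copied from the phoneme example below -->
  <lexeme>
    <grapheme>Benigni</grapheme>
    <phoneme>bɛˈniːnji</phoneme>
  </lexeme>
</lexicon>

The SSML document would then reference this lexicon with a lexicon element, e.g. <lexicon uri="movie.pls"/> as a child of speak (see Section 3.1.5).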
It is worth noting that IPA alphabet support is an OPTIONAL feature and that phonemes for an external language may be rendered with some approximation (see Section 3.1.5 for details). The following example only uses phonemes common to US English.
<?xml version="1.0" encoding="ISO-8859-1"?>
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
xml:lang="en-US">
The title of the movie is:
<phoneme alphabet="ipa"
ph="ˈlɑ ˈviːɾə ˈʔeɪ ˈbɛlə">
La vita è bella </phoneme>
<!-- The IPA pronunciation is ˈlɑ ˈviːɾə ˈʔeɪ ˈbɛlə -->
(Life is beautiful),
which is directed by
<phoneme alphabet="ipa"
ph="ɹəˈbɛːɹɾoʊ bɛˈniːnji">
Roberto Benigni </phoneme>
<!-- The IPA pronunciation is ɹəˈbɛːɹɾoʊ bɛˈniːnji -->
<!-- Note that in actual practice an author might change the
encoding to UTF-8 and directly use the Unicode characters in
the document rather than using the escapes as shown.
The escaped values are shown for ease of copying. -->
</speak>
The SMIL language [ SMIL ] is an XML-based multimedia control language. It is especially well suited for describing dynamic media applications that include synthetic speech output.
File 'greetings.ssml' contains the following:
<?xml version="1.0"?>
<speak version="1.1"
xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
xml:lang="en-US">
<s>
<mark name="greetings"/>
<emphasis>Greetings</emphasis> from the <sub alias="World Wide Web Consortium">W3C</sub>!
</s>
</speak>
SMIL Example 1: W3C logo image appears, and then one second later, the speech sequence is rendered. File 'greetings.smil' contains the following:
<smil xmlns="http://www.w3.org/2001/SMIL20/Language">
<head>
<top-layout width="640" height="320">
<region id="whole" width="640" height="320"/>
</top-layout>
</head>
<body>
<par>
<img src="http://www.w3.org/Icons/w3c_home" region="whole" begin="0s"/>
<ref src="greetings.ssml" begin="1s"/>
</par>
</body>
</smil>
SMIL Example 2: W3C logo image appears, then clicking on the image causes it to disappear and the speech sequence to be rendered. File 'greetings.smil' contains the following:
<smil xmlns="http://www.w3.org/2001/SMIL20/Language">
<head>
<top-layout width="640" height="320">
<region id="whole" width="640" height="320"/>
</top-layout>
</head>
<body>
<seq>
<img id="logo" src="http://www.w3.org/Icons/w3c_home" region="whole" begin="0s" end="logo.activateEvent"/>
<ref src="greetings.ssml"/>
</seq>
</body>
</smil>
The following is an example of SSML in VoiceXML (see Section 2.3.3 ) for voice browser applications. It is worth noting that the VoiceXML namespace includes the SSML namespace elements and attributes. See Appendix O of [ VXML ] for details.
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2001/vxml
http://www.w3.org/TR/voicexml20/vxml.xsd">
<form>
<block>
<prompt>
<emphasis>Welcome</emphasis> to the Bird Seed Emporium.
<audio src="rtsp://www.birdsounds.example.com/thrush.wav"/>
We have 250 kilogram drums of thistle seed for
$299.95
plus shipping and handling this month.
<audio src="http://www.birdsounds.example.com/mourningdove.wav"/>
</prompt>
</block>
</form>
</vxml>
This is a consolidated list of all changes since SSML 1.0.
Changes in this draft: