Internationalizing SSML

Kazuyuki Ashimura,
W3C, Team contact for the Voice Browser Working Group

<ashimura@w3.org>

Why Internationalize SSML?

Global users of the Web

The Web is not only for native English speakers but for everyone in the world.
- SSML might be used for services that connect one country with another, such as international calls.
- SSML should provide features for the spoken languages of all countries and regions of the world.

Extension of SSML ability

Enhancements for non-English languages would make SSML more useful in current and emerging markets (e.g. China, Korea, Japan).
- More precise pronunciation identification and prosodic controls are essential for richer speech synthesis.
- Work on non-English speech synthesis, especially for Asian languages, offers many useful suggestions.

Problem to be solved: Pronunciation ambiguity

SSML 1.0 vocabulary provides various ways to eliminate pronunciation ambiguities.
- Word-level, phoneme-level and waveform-level controls
  e.g. The <phoneme> element and the <say-as> element
However, many problems still remain:
- A single character sequence can have several different pronunciations.
  - Text input provides only "What to say" information.
  - Prosody is very important as "How to say" information to solve pronunciation ambiguities.
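As a minimal sketch of the word-level and phoneme-level controls mentioned above, the following Python snippet builds a small SSML 1.0 document with the standard library. The sentence, the IPA string, and the `interpret-as`/`format` values are illustrative assumptions and do not come from the slides; SSML 1.0 itself does not define the set of `interpret-as` values.

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def q(tag: str) -> str:
    """Return the namespace-qualified name of an SSML element."""
    return f"{{{SSML_NS}}}{tag}"


def build_disambiguation_demo() -> str:
    """Build an SSML 1.0 document that pins down an ambiguous word and a date."""
    ET.register_namespace("", SSML_NS)  # serialize SSML as the default namespace
    speak = ET.Element(q("speak"), {"version": "1.0", XML_LANG: "en-US"})
    speak.text = "I "
    # "read" is ambiguous in writing; <phoneme> selects the past-tense reading.
    phoneme = ET.SubElement(speak, q("phoneme"), {"alphabet": "ipa", "ph": "rɛd"})
    phoneme.text = "read"
    phoneme.tail = " it on "
    # Assumed attribute values: they tell the processor to read the digits as a date.
    say_as = ET.SubElement(speak, q("say-as"), {"interpret-as": "date", "format": "mdy"})
    say_as.text = "05/06/2006"
    say_as.tail = "."
    return ET.tostring(speak, encoding="unicode")


if __name__ == "__main__":
    print(build_disambiguation_demo())
```

Both elements change which words or sounds are spoken; neither says anything about pitch, duration, or intonation, which is the gap the following slides address.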

Example of pronunciation ambiguity in Japanese (1)

The same character sequence can have several different meanings depending on its pitch accent.

Note: "'" marks the accent nucleus (the point where a pitch fall is perceived).
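To make the "'" notation concrete, here is a small sketch that turns an accent-marked reading into a high/low pitch pattern. It assumes the commonly described Tokyo-type rule (pitch falls after the accent nucleus; an unaccented word starts low and stays high) and a hypothetical hyphen-separated mora notation such as "ha'-shi". The readings for "hashi" (often glossed as chopsticks, bridge, and edge) are a widely cited example and are not necessarily the one shown on the original slide.

```python
from __future__ import annotations


def parse_accent(marked: str) -> tuple[list[str], int]:
    """Split a reading such as "ha'-shi" into moras and find the accent nucleus.

    Returns the moras and the 1-based nucleus position (0 means unaccented).
    The hyphen-separated input format is an assumption made for this sketch.
    """
    moras = []
    nucleus = 0
    for position, mora in enumerate(marked.split("-"), start=1):
        if mora.endswith("'"):
            nucleus = position
            mora = mora[:-1]
        moras.append(mora)
    return moras, nucleus


def pitch_pattern(mora_count: int, nucleus: int) -> str:
    """Return H/L marks for each mora of the word plus one following particle."""
    pattern = []
    for position in range(1, mora_count + 2):  # +1 for the following particle
        if nucleus == 0:
            high = position > 1            # unaccented: low start, then high throughout
        elif nucleus == 1:
            high = position == 1           # fall right after the first mora
        else:
            high = 1 < position <= nucleus  # rise after mora 1, fall after the nucleus
        pattern.append("H" if high else "L")
    return "".join(pattern)


if __name__ == "__main__":
    for reading in ("ha'-shi", "ha-shi'", "ha-shi"):
        moras, nucleus = parse_accent(reading)
        print(reading, pitch_pattern(len(moras), nucleus))
```

The three readings share the same segments and differ only in the pattern, which is why text alone ("what to say") cannot tell them apart.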

Example of pronunciation ambiguity in Japanese (2)

Sometimes the same character sequence can even have opposite meanings, depending on the combination of duration and intonation.

Controls for prosodic information

To solve the problem of pronunciation ambiguity, SSML needs additional specifications.

In particular, controls for prosodic information are essential for Asian tonal languages.
Such controls can be specified at each step of the TTS process to control each database and/or model (e.g. model selection, model parameters).

Categories of prosodic controls

According to Fujisaki, prosodic information is classified into three categories.
We should therefore consider these three categories when we discuss prosodic controls.

Linguistic Information

Symbolic information represented by a set of discrete symbols and rules for their combination.
It is either represented explicitly in the written language or can be easily and uniquely inferred from context.
It is discrete and categorical, for example, character sequences, parts of speech, accent types, etc.

Paralinguistic Information

Information not inferable from the written counterpart but deliberately added by the speaker to modify or supplement the linguistic information.
It can be both discrete and continuous, for example, duration and speech rate, fundamental frequency transition, spectrum transition, etc.

Nonlinguistic Information

Information concerning factors such as the age, gender, idiosyncrasies, and physical and emotional states of the speaker.
It is directly related to neither linguistic nor paralinguistic information, and is generally not under the control of the speaker.

Possible prosodic controls

There are various prosodic controls that are useful for rendering non-English languages.
Some of them are already included in SSML 1.0; others should be added.
Additional topics and extensions to current SSML will be proposed in this Workshop ;-)

Plain items:	Examples of potential controls borrowed from Fujisaki's definition
Items in angle brackets:	Elements for prosodic controls in SSML 1.0

Category of prosody	Input level: Text Analysis	Input level: Prosody Analysis	Input level: Waveform Production
Linguistic Information	character sequences, part of speech, accent types; <p>, <s>, <say-as>, <sub>, <lexicon>, <phoneme>	?	?
Paralinguistic Information	?	duration and speech rate, fundamental frequency transition, spectrum transition; <prosody>, <emphasis>, <break>	<prosody> (partially)
Nonlinguistic Information	?	?	age, gender, idiosyncrasy, physical and emotional states of the speaker; <voice>, <audio>
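The SSML 1.0 elements listed for paralinguistic and nonlinguistic information can be combined as in the following sketch, again built with the Python standard library. The sentence and all attribute values (gender, age, emphasis level, break time, rate, pitch) are illustrative assumptions chosen from values that SSML 1.0 permits; they are not taken from the slides.

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def q(tag: str) -> str:
    """Return the namespace-qualified name of an SSML element."""
    return f"{{{SSML_NS}}}{tag}"


def build_prosody_demo() -> str:
    """Render one sentence twice, with different prosodic controls each time."""
    ET.register_namespace("", SSML_NS)  # serialize SSML as the default namespace
    speak = ET.Element(q("speak"), {"version": "1.0", XML_LANG: "en-US"})

    # Nonlinguistic information: request a speaker by gender and age.
    voice = ET.SubElement(speak, q("voice"), {"gender": "female", "age": "30"})

    # Paralinguistic information: stress a single word in the first rendering.
    sentence = ET.SubElement(voice, q("s"))
    sentence.text = "That is "
    emphasis = ET.SubElement(sentence, q("emphasis"), {"level": "strong"})
    emphasis.text = "really"
    emphasis.tail = " good."

    # A timed pause between the two renderings.
    ET.SubElement(voice, q("break"), {"time": "500ms"})

    # Same words with slower rate and lower pitch for the second rendering.
    flat = ET.SubElement(voice, q("prosody"), {"rate": "slow", "pitch": "low"})
    flat.text = "That is really good."
    return ET.tostring(speak, encoding="unicode")


if __name__ == "__main__":
    print(build_prosody_demo())
```

How differently the two renderings actually sound depends on the synthesis processor. None of these elements expresses a lexical tone or a pitch accent directly, which corresponds to the "?" cells in the table.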

Let's get started

Goals & Scope of the workshop

To identify and prioritize extensions and additions to SSML that will improve the use of SSML for rendering non-English languages.
The scope of the workshop is not limited to Asian languages.
Suggestions for enhancements to SSML for the support of any non-English language are welcome, especially if they are relevant for multiple languages.

Topics

Diacritics for auto-completion
Representing special word classes
Representing word boundaries
Denoting language and character sets
Tones
Sentence structure
Words with multiple pronunciations and meanings
Text with multiple languages
Expression, speaking style, and focus
Other extensions and/or additions to SSML