Copyright © 2006 W3C® (MIT, ERCIM, Keio), All Rights Reserved. W3C liability, trademark and document use rules apply.
This document defines the syntax for specifying pronunciation lexicons to be used by Automatic Speech Recognition and Speech Synthesis engines in voice browser applications.
This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.
This is the 26 October 2006 W3C Last Call Working Draft of "Pronunciation Lexicon specification (PLS) Version 1.0". The Last Call period ends on 26 November 2006.
This document has been produced as part of the W3C Voice Browser Activity, following the procedures set out for the W3C Process. The authors of this document are members of the Voice Browser Working Group (W3C Members only).
The following is a summary of the major changes since the previous Last Call Working Draft was published.
The Voice Browser Working Group believes that this specification addresses its requirements and all previous Last Call issues (see the Disposition of Comments document).
This is a W3C Last Call Working Draft for review by W3C Members and other interested parties. Last Call means that the Working Group believes that this specification is technically sound and therefore wishes this to be the Last Call for comments. If the feedback is positive, the Working Group plans to submit it for consideration as a W3C Candidate Recommendation. Comments can be sent until 26 November 2006.
This document is for public review. Comments and discussion are welcomed on the public mailing list < www-voice@w3.org >. To subscribe, send an email to <www-voice-request@w3.org> with the word subscribe in the subject line (include the word unsubscribe if you want to unsubscribe). The archive for the list is accessible on-line.
Publication as a Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.
This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.
<lexicon> Element
<meta> Element
<metadata> Element
<lexeme> Element
<grapheme> Element
<phoneme> Element
<alias> Element
<example> Element

This section is informative.
The accurate specification of pronunciation is critical to the success of speech applications. Most Automatic Speech Recognition (ASR) and Text-To-Speech (TTS) engines internally provide extensive high quality lexicons with pronunciation information for many words or phrases. To ensure a maximum coverage of the words or phrases used by an application, application-specific pronunciations may be required. For example, these may be needed for proper nouns such as surnames or business names.
The Pronunciation Lexicon Specification (PLS) is designed to enable interoperable specification of pronunciation information for both ASR and TTS engines within voice browsing applications. The language is intended to be easy to use by developers while supporting the accurate specification of pronunciation information for international use.
The language allows one or more pronunciations for a word or phrase to be specified using a standard pronunciation alphabet or if necessary using vendor specific alphabets. Pronunciations are grouped together into a PLS document which may be referenced from other markup languages, such as the Speech Recognition Grammar Specification [SRGS] and the Speech Synthesis Markup Language [SSML].
In its most general sense, a lexicon is merely a list of words or phrases, possibly containing information associated with and related to the items in the list. This document uses the term "lexicon" in only one specific way, as "pronunciation lexicon". In this particular document, "lexicon" means a mapping between words (or short phrases), their written representations, and their pronunciations suitable for use by an ASR engine or a TTS engine. However, pronunciation lexicons are not limited to voice browsers: they have proven to be effective mechanisms to support accessibility for persons with disabilities as well as greater usability for all users (for instance in screen readers and other user agents, such as multimodal interfaces).
A TTS engine aims to transform input content (either text or markup, such as SSML) into speech. This activity involves several processing steps:
SSML enables a user to control and enhance TTS activity by acting through SSML elements on these levels of processing (see [SSML] for details).
The PLS is the standard format of the documents referenced by the <lexicon> element of SSML (see [SSML], section 3.1.4).
The following is a simple example of an SSML document. It includes an Italian movie title and the name of the director to be read in US English.
<?xml version="1.0" encoding="UTF-8"?> <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US"> The title of the movie is: "La vita è bella" (Life is beautiful), which is directed by Roberto Benigni. </speak>
To ensure that the Italian title and the director's name are pronounced correctly, the author might include the pronunciations inline in the SSML document.
<?xml version="1.0" encoding="UTF-8"?> <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US"> The title of the movie is: <phoneme alphabet="ipa" ph="ˈlɑ ˈviːɾə ˈʔeɪ ˈbɛlə">"La vita è bella"</phoneme> <!-- The IPA pronunciation is: "ˈlɑ ˈviːɾə ˈʔeɪ ˈbɛlə" --> (Life is beautiful), which is directed by <phoneme alphabet="ipa" ph="ɹəˈbɛːɹɾoʊ bɛˈniːnji">Roberto Benigni.</phoneme> <!-- The IPA pronunciation is: "ɹəˈbɛːɹɾoʊ bɛˈniːnji" --> </speak>
With the use of the PLS, all the pronunciations can be factored out into an external PLS document which is referenced by the <lexicon> element of SSML (see [SSML], section 3.1.4).
<?xml version="1.0" encoding="UTF-8"?> <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US"> <lexicon uri="http://www.example.com/movie_lexicon.pls"/> The title of the movie is: "La vita è bella" (Life is beautiful), which is directed by Roberto Benigni. </speak>
The following is an example of the "movie_lexicon.pls" document.
<?xml version="1.0" encoding="UTF-8"?> <lexicon version="1.0" xmlns="http://www.w3.org/2005/01/pronunciation-lexicon" alphabet="ipa" xml:lang="en-US"> <lexeme> <grapheme>La vita è bella</grapheme> <phoneme>ˈlɑ ˈviːɾə ˈʔeɪ ˈbɛlə</phoneme> <!-- IPA string is: "ˈlɑ ˈviːɾə ˈʔeɪ ˈbɛlə" --> </lexeme> <lexeme> <grapheme>Roberto</grapheme> <phoneme>ɹəˈbɛːɹɾoʊ</phoneme> <!-- IPA string is: "ɹəˈbɛːɹɾoʊ" --> </lexeme> <lexeme> <grapheme>Benigni</grapheme> <phoneme>bɛˈniːnji</phoneme> <!-- IPA string is: "bɛˈniːnji" --> </lexeme> </lexicon>
The PLS engine will load the external PLS document and transparently apply the pronunciations during the processing of the SSML document. An application may contain several distinct PLS documents to be used in different points of the application. Section 3.1.4 of [SSML] describes how to use more than one PLS document referenced in a SSML document.
Given that many platform/browser/text editor combinations do not correctly cut and paste Unicode text, IPA symbols may be entered as numeric character references (see Section 4.1 on Character and Entity References of "Extensible Markup Language (XML) 1.0 (Fourth Edition)" [XML]) in the pronunciation. However, the UTF-8 representation of an IPA symbol should always be used in preference to its numeric character reference. In order to overcome potential problems with viewing the UTF-8 representation of IPA symbols in this document, examples of pronunciation are also shown in a comment using numeric character references.
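The conversion from the UTF-8 representation of an IPA pronunciation to numeric character references is mechanical. The following is a minimal Python sketch for tools that need to emit such references; the helper name is hypothetical and not part of this specification:

```python
def to_numeric_refs(ipa: str) -> str:
    """Replace every non-ASCII character in a pronunciation string
    with an XML hexadecimal numeric character reference (&#x...;)."""
    return "".join(
        ch if ord(ch) < 128 else "&#x%04X;" % ord(ch)
        for ch in ipa
    )

# U+02D0 is the IPA length mark in "riːd" (the word "read").
print(to_numeric_refs("riːd"))  # ri&#x02D0;d
```

Since UTF-8 is preferred over numeric character references, a tool would typically apply this only when the output channel is known to mishandle Unicode.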
An ASR engine transforms an audio signal into a recognized sequence of words or a semantic representation of the meaning of the utterance (see Semantic Interpretation for Speech Recognition [SISR] for a standard definition of Semantic Interpretation).
An ASR grammar is used to improve ASR performance by describing the possible words and phrases the ASR might recognize. SRGS is the standard definition of ASR grammars (see [SRGS] for details).
PLS may be used by an ASR processor to allow multiple pronunciations of words and phrases, and also to support text normalization, such as the expansion of acronyms and abbreviations.
This is a very simple SRGS grammar that allows the recognition of sentences like "Boston Massachusetts" or "Miami Florida".
<?xml version="1.0" encoding="UTF-8"?> <grammar xmlns="http://www.w3.org/2001/06/grammar" xml:lang="en-US" version="1.0" root="city_state" mode="voice"> <rule id="city" scope="public"> <one-of> <item>Boston</item> <item>Miami</item> <item>Fargo</item> </one-of> </rule> <rule id="state" scope="public"> <one-of> <item>Florida</item> <item>North Dakota</item> <item>Massachusetts</item> </one-of> </rule> <rule id="city_state" scope="public"> <ruleref uri="#city"/> <ruleref uri="#state"/> </rule> </grammar>
If a pronunciation lexicon is referenced by an SRGS grammar, it can provide multiple pronunciations of the words in the grammar to accommodate different speaking styles. Below is the same grammar with a reference to an external PLS document.
<?xml version="1.0" encoding="UTF-8"?> <grammar xmlns="http://www.w3.org/2001/06/grammar" xml:lang="en-US" version="1.0" root="city_state" mode="voice"> <lexicon uri="http://www.example.com/city_lexicon.pls"/> <rule id="city" scope="public"> <one-of> <item>Boston</item> <item>Miami</item> <item>Fargo</item> </one-of> </rule> <rule id="state" scope="public"> <one-of> <item>Florida</item> <item>North Dakota</item> <item>Massachusetts</item> </one-of> </rule> <rule id="city_state" scope="public"> <ruleref uri="#city"/> <ruleref uri="#state"/> </rule> </grammar>
An SRGS grammar may also reference multiple PLS documents.
A VoiceXML 2.0 application ([VXML]) contains SRGS grammars for ASR and SSML prompts for TTS. The introduction of PLS in both SRGS and SSML will directly impact VoiceXML applications.
The benefits described in Section 1.1 and Section 1.2 are also available in VoiceXML applications. The application may use several contextual PLS documents at different points in the interaction, but may also use the same PLS document both in SRGS, to improve ASR, and in SSML, to improve TTS. This is an example:
<?xml version="1.0" encoding="UTF-8"?> <lexicon version="1.0" xmlns="http://www.w3.org/2005/01/pronunciation-lexicon" alphabet="ipa" xml:lang="en-US"> <lexeme> <grapheme>judgment</grapheme> <grapheme>judgement</grapheme> <phoneme>ˈʤʌʤ.mənt</phoneme> <!-- IPA string is: "ˈʤʌʤ.mənt" --> </lexeme> <lexeme> <grapheme>fiancé</grapheme> <grapheme>fiance</grapheme> <phoneme>fiˈɒns.eɪ</phoneme> <!-- IPA string is: "fiˈɒns.eɪ" --> <phoneme>ˌfiː.ɑːnˈseɪ</phoneme> <!-- IPA string is: "ˌfiː.ɑːnˈseɪ" --> </lexeme> </lexicon>
which can be used to improve TTS in the following SSML document:
<?xml version="1.0" encoding="UTF-8"?> <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US"> <lexicon uri="http://www.example.com/lexicon_defined_above.xml"/> <p> In the judgement of my fiancé, Las Vegas is the best place for a honeymoon. I replied that I preferred Venice and didn't think the Venetian casino was an acceptable compromise.</p> </speak>
but also to improve ASR in the following SRGS grammar:
<?xml version="1.0" encoding="UTF-8"?> <grammar xmlns="http://www.w3.org/2001/06/grammar" xml:lang="en-US" version="1.0" root="movies" mode="voice"> <lexicon uri="http://www.example.com/lexicon_defined_above.xml"/> <rule id="movies" scope="public"> <one-of> <item>Terminator 2: Judgment Day</item> <item>My Big Fat Obnoxious Fiance</item> <item>Pluto's Judgement Day</item> </one-of> </rule> </grammar>
The current specification is focused on the major features described in the requirements document [REQS]. The most complex features have been postponed to a future revision of this specification. For instance, some of the complex features not included are morphological, syntactic and semantic information associated with pronunciations (such as tense, parts of speech, word stems, etc.). Many of these features can be specified using RDF [RDF-XMLSYNTAX] statements that reference lexemes within one or more pronunciation lexicons.
A phonemic/phonetic alphabet is used to specify a pronunciation. An alphabet in this context refers to a collection of symbols to represent the sounds of one or more human languages. In the PLS specification the pronunciation alphabet is specified by the alphabet attribute (see Section 4.1 and Section 4.6 for details on the use of this attribute). The only valid values for the alphabet attribute are "ipa" (see the next paragraph) and vendor-defined strings of the form "x-organization" or "x-organization-alphabet". For example, the Japan Electronics and Information Technology Industries Association [JEITA] might wish to encourage the use of an alphabet such as "x-jeita" or "x-jeita-2000" for their phoneme alphabet [JEIDAALPHABET]. Another example might be "x-sampa" [X-SAMPA], an extension of the SAMPA phonetic alphabet [SAMPA] to cover the entire range of characters in the International Phonetic Alphabet [IPA].

A compliant PLS processor MUST support "ipa" as the value of the alphabet attribute. This means that the PLS processor MUST support the Unicode representations of the phonetic characters developed by the International Phonetic Association [IPA]. In addition to an exhaustive set of vowel and consonant symbols, this character set supports a syllable delimiter, numerous diacritics, stress symbols, lexical tone symbols, intonational markers and more. For this alphabet, legal phonetic/phonemic values are strings of the values specified in Appendix 2 of [IPAHNDBK]. Informative tables of the IPA-to-Unicode mappings can be found at [IPAUNICODE1] and [IPAUNICODE2]. Note that not all of the IPA characters are available in Unicode. For processors supporting this alphabet,
Currently there is no ready way for a blind or partially sighted person to read or interact with a lexicon containing IPA symbols. It is hoped that implementers will provide tools which will enable such an interaction.
This section enumerates the conformance rules of this specification.
All sections in this specification are normative, unless otherwise indicated. The informative parts of this specification are identified by "Informative" labels within sections.
Individual conformance requirements or testable statements are identifiable in the PLS specification through imperative voice statements. The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC2119]. However, for readability, these words do not appear in all uppercase letters in this specification.
The Pronunciation Lexicon markup language consists of the following elements and attributes:
Elements | Attributes | Description
---|---|---
<lexicon> | version, xml:base, xmlns, xml:lang, alphabet | root element for PLS
<meta> | name, http-equiv, content | element containing meta data
<metadata> | | element containing meta data
<lexeme> | xml:id, role | the container element for a single lexical entry
<grapheme> | | contains orthographic information for a lexeme
<phoneme> | prefer, alphabet | contains pronunciation information for a lexeme
<alias> | prefer | contains acronym expansions and word substitutions
<example> | | contains an example of the usage for a lexeme
<lexicon> Element

The root element of the Pronunciation Lexicon markup language is the <lexicon> element. This element is the container for all other elements of the PLS language. A <lexicon> element MUST contain zero or more <meta> elements, followed by an OPTIONAL <metadata> element, followed by zero or more <lexeme> elements.

The <lexicon> element MUST specify an alphabet attribute which indicates the default pronunciation alphabet to be used within the PLS document. The values of the alphabet attribute are described in Section 2, and it MAY be overridden for a given lexeme using the <phoneme> element.

The REQUIRED version attribute indicates the version of the specification to be used for the document and MUST have the value "1.0".
The REQUIRED xml:lang attribute allows identification of the language for which the pronunciation lexicon is relevant. IETF Best Current Practice 47 [BCP47] is the normative reference on the values of the xml:lang attribute. Note that xml:lang specifies a single unique language for the entire PLS document. This does not limit the ability to create multilingual SRGS [SRGS] and SSML [SSML] documents; these documents may reference multiple pronunciation lexicons, possibly written for different languages.
The namespace URI for PLS is "http://www.w3.org/2005/01/pronunciation-lexicon". All PLS markup MUST be associated with the PLS namespace, using a Namespace Declaration as described in [XMLNS]. This can, for instance, be achieved by declaring an xmlns attribute on the <lexicon> element, as the examples in this specification show.

The xml:base attribute allows a base URI to be defined for the PLS document, as specified in XML Base [XML-BASE]. As in the HTML 4.01 specification [HTML], this is a URI that all the relative references within the document take as their base.

Note that in this version of the specification, only the contents of metadata can potentially use relative URIs.
A simple PLS document for the word "tomato" and its pronunciation.
<?xml version="1.0" encoding="UTF-8"?> <lexicon version="1.0" xmlns="http://www.w3.org/2005/01/pronunciation-lexicon" alphabet="ipa" xml:lang="en-US"> <lexeme> <grapheme>tomato</grapheme> <phoneme>təmei̥ɾou̥</phoneme> <!-- IPA string is: "təmei̥ɾou̥" --> </lexeme> </lexicon>
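A PLS document of this shape can be read with any namespace-aware XML parser. The following is a minimal, non-normative sketch using Python's standard library; the XML declaration is omitted for brevity and the variable names are illustrative only:

```python
import xml.etree.ElementTree as ET

# The PLS namespace URI defined by this specification.
PLS = "{http://www.w3.org/2005/01/pronunciation-lexicon}"

# A shortened variant of the "tomato" example (diacritics omitted).
doc = """<lexicon version="1.0"
    xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
    alphabet="ipa" xml:lang="en-US">
  <lexeme>
    <grapheme>tomato</grapheme>
    <phoneme>təmeiɾou</phoneme>
  </lexeme>
</lexicon>"""

root = ET.fromstring(doc)
for lexeme in root.iter(PLS + "lexeme"):
    for g in lexeme.findall(PLS + "grapheme"):
        print("grapheme:", g.text)  # grapheme: tomato
    for p in lexeme.findall(PLS + "phoneme"):
        print("phoneme:", p.text)   # phoneme: təmeiɾou
```

Note that lookups must be namespace-qualified, since all PLS markup is required to be in the PLS namespace.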
<meta> Element

The <metadata> and <meta> elements are containers in which information about the document can be placed. The <metadata> element provides more general and powerful treatment of metadata information than <meta> by using a metadata schema.
A <meta> element associates a string with a declared meta property or declares http-equiv content. Either a name or an http-equiv attribute is REQUIRED. It is an error to provide both name and http-equiv attributes. A content attribute is also REQUIRED.
The only <meta> property defined by this specification is "seeAlso". It is used to specify a resource that might provide additional metadata information about the content. This property is modeled on the "seeAlso" property of "RDF Vocabulary Description Language 1.0: RDF Schema" [RDF-SCHEMA], section 5.4.1.
The http-equiv attribute has a special significance when documents are retrieved via HTTP. Although the preferred method of providing HTTP header information is that of using HTTP header fields, the http-equiv content MAY be used in situations where the PLS document author is unable to configure HTTP header fields associated with their document on the origin server, for example, cache control information. Note that HTTP servers and caches are not required to inspect the contents of <meta> in PLS documents and thereby override the header values they would send otherwise.

The <meta> element is an empty element.
This section is modelled after the <meta> description in the HTML 4.01 Specification [HTML]. Despite the fact that the name/content model is now being replaced by better ways to include metadata (see, for instance, section 20.6 of XHTML 2.0 [XHTML2]), and the fact that the http-equiv directive is no longer recommended in section 3.3 of XHTML Media Types [XHTML-MTYPES], the Working Group has decided to retain this for compatibility with the other specifications of the first version of the Voice Interface Framework (VoiceXML, SSML, SRGS, CCXML). Future versions of the framework will align with more modern metadata schemes.
This is an example of how <meta> elements can be included in a PLS document to specify a resource that provides additional metadata information.
<?xml version="1.0" encoding="UTF-8"?> <lexicon version="1.0" xmlns="http://www.w3.org/2005/01/pronunciation-lexicon" alphabet="ipa" xml:lang="en-US"> <meta http-equiv="Cache-Control" content="no-cache"/> <meta name="seeAlso" content="http://example.com/my-pls-metadata.xml"/> <!-- If lexemes are to be added to this lexicon, they start below --> </lexicon>
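The attribute constraints described above (exactly one of name or http-equiv, and a mandatory content attribute) can be sketched as a simple validity check. This is a hypothetical helper, not defined by the specification:

```python
def meta_is_valid(name=None, http_equiv=None, content=None):
    """Check the <meta> attribute rules: either name or http-equiv is
    REQUIRED (providing both is an error), and content is always REQUIRED."""
    exactly_one = (name is None) != (http_equiv is None)
    return exactly_one and content is not None

print(meta_is_valid(name="seeAlso",
                    content="http://example.com/my-pls-metadata.xml"))  # True
print(meta_is_valid(name="seeAlso", http_equiv="Cache-Control",
                    content="no-cache"))                                # False
```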
<metadata> Element

The <metadata> element is a container in which information about the document can be placed using metadata markup. The behavior of software processing the content of a <metadata> element is not described in this specification. Therefore, software implementing this specification is free to ignore that content.

Although any metadata markup can be used within <metadata>, it is RECOMMENDED that the RDF/XML Syntax [RDF-XMLSYNTAX] be used, in conjunction with the general metadata properties defined by the Dublin Core Metadata Initiative [DC] (e.g., Title, Creator, Subject, Description, Rights, etc.).
This is an example of how metadata can be included in a PLS document using the "Dublin Core Metadata Element Set, Version 1.1" [DC-ES] describing general document information such as title, description, date, and so on:
<?xml version="1.0" encoding="UTF-8"?> <lexicon version="1.0" xmlns="http://www.w3.org/2005/01/pronunciation-lexicon" alphabet="ipa" xml:lang="en-US"> <metadata> <rdf:RDF xmlns:rdf = "http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dc = "http://purl.org/dc/elements/1.1/"> <!-- Metadata about the PLS document --> <rdf:Description rdf:about="" dc:title="Pronunciation lexicon for W3C terms" dc:description="Common pronunciations for many W3C acronyms and abbreviations, i.e. I18N or WAI" dc:publisher="W3C" dc:language="en-US" dc:date="2005-11-29" dc:rights="Copyright 2002 W3C" dc:format="application/pls+xml"> <dc:creator>The W3C Voice Browser Working Group</dc:creator> </rdf:Description> </rdf:RDF> </metadata> <!-- If lexemes are to be added to this lexicon, they start below --> </lexicon>
<lexeme> Element

The <lexeme> element is a container for a lexical entry which MAY include multiple orthographies and multiple pronunciations. The <lexeme> element contains one or more <grapheme> elements, one or more pronunciations (either <phoneme> or <alias> elements, or a combination of both), and zero or more <example> elements. The children of the <lexeme> element can appear in any order, but note that the order will have an impact on the treatment of multiple pronunciations; see Section 4.9.

The <lexeme> element has an OPTIONAL xml:id [XML-ID] attribute, allowing the element to be referenced from other documents (through fragment identifiers or XPointer [XPOINTER], for instance). For example, developers may use external RDF statements [RDF-CONC] to associate metadata (such as part of speech or word relationships) with a lexeme.

The <lexeme> element has an OPTIONAL role attribute which takes as its value one or more white-space separated QNames (as defined in Section 3.2.1.8 of XML Schema Part 2: Datatypes Second Edition [XML-SCHEMA]). The role attribute describes additional information to help the selection of the most appropriate pronunciation for a given orthography. The main use is to differentiate words that have the same spelling but are pronounced in different ways (cf. homographs; see also Section 5.5).
A pronunciation lexicon for the Italian language with two lexemes. One of them is for the loan word "file" which is often used in technical discussions to have the same meaning and pronunciation as in English. This is distinct from the homograph noun "file" which is the plural form of "fila" meaning "queue".
<?xml version="1.0" encoding="UTF-8"?> <lexicon version="1.0" xmlns="http://www.w3.org/2005/01/pronunciation-lexicon" alphabet="ipa" xml:lang="it"> <lexeme> <grapheme>file</grapheme> <phoneme>faɪl</phoneme> <!-- This is the pronunciation of the loan word "file" in Italian. IPA string is: "faɪl" --> </lexeme> <lexeme> <grapheme>EU</grapheme> <alias>Unione Europea</alias> <!-- This is a substitution of the European Union acronym in Italian language. --> </lexeme> </lexicon>
The following is an example of a pronunciation lexicon for the word "read":
<?xml version="1.0" encoding="UTF-8"?> <lexicon version="1.0" xmlns="http://www.w3.org/2005/01/pronunciation-lexicon" xmlns:claws="http://www.example.com/claws7tags" alphabet="ipa" xml:lang="en"> <lexeme role="claws:VVI claws:VV0 claws:NN1"> <!-- verb infinitive, verb present tense, singular noun --> <grapheme>read</grapheme> <phoneme>riːd</phoneme> <!-- IPA string is: "riːd" --> </lexeme> <lexeme role="claws:VVN claws:VVD"> <!-- verb past participle, verb past tense --> <grapheme>read</grapheme> <phoneme>red</phoneme> </lexeme> </lexicon>
Note that the role attribute is based on qualified values (in this example from the UCREL CLAWS7 part-of-speech tagset) to distinguish the verb infinitive, present tense and singular noun pronunciation from the verb past tense and past participle pronunciation of the word "read".

The following is an example document which references the above lexicon and includes an extension element to show how the role attribute may be used to select the appropriate pronunciation of the word "read" in the dialog.
<?xml version="1.0" encoding="UTF-8"?> <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:myssml="http://www.example.com/ssml_extensions" xmlns:claws="http://www.example.com/claws7tags" xml:lang="en"> <lexicon uri="http://www.example.com/lexicon.pls" type="application/pls+xml"/> <voice gender="female" age="3"> Can you <myssml:token role="claws:VVI">read</myssml:token> this book to me? </voice> <voice gender="male" age="43"> I've already <myssml:token role="claws:VVN">read</myssml:token> it three times! </voice> </speak>
The SRGS 1.0 [SRGS] and SSML 1.0 [SSML] specifications do not currently support a selection mechanism based on the role attribute. Future versions of these specifications are expected to allow the selection of appropriate pronunciations on the basis of the role attribute.
<grapheme> Element

A <lexeme> contains at least one <grapheme> element. The <grapheme> element contains text describing the orthography of the <lexeme>. The <grapheme> element MUST NOT be empty, and MUST NOT contain subelements.

In more complex situations there may be alternative textual representations for the same word or phrase; this can arise for a number of reasons, for example:

In order to remove the need for duplication of pronunciation information to cope with the above variations, the <lexeme> element MAY contain more than one <grapheme> element to define the base orthography and any variants. Note that all the pronunciations given within the <lexeme> apply to each and every <grapheme> within the <lexeme>.
An example of a single grapheme and a single pronunciation.
<?xml version="1.0" encoding="UTF-8"?> <lexicon version="1.0" xmlns="http://www.w3.org/2005/01/pronunciation-lexicon" alphabet="ipa" xml:lang="en-US"> <lexeme> <grapheme>Sepulveda</grapheme> <phoneme>səˈpʌlvɪdə</phoneme> <!-- IPA string is: "səˈpʌlvɪdə" --> </lexeme> </lexicon>
Another example with more than one written form for a lexical entry, where the first orthography uses Latin characters for "Romaji" orthography, the second one uses "Kanji" orthography and the third one uses the "Hiragana" orthography:
<?xml version="1.0" encoding="UTF-8"?> <lexicon version="1.0" xmlns="http://www.w3.org/2005/01/pronunciation-lexicon" alphabet="ipa" xml:lang="ja"> <lexeme> <grapheme>nihongo</grapheme> <grapheme>日本語</grapheme> <grapheme>にほんご</grapheme> <phoneme>ɲihoŋo</phoneme> <!-- IPA string is: "ɲihoŋo" --> </lexeme> </lexicon>
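Because every pronunciation in a lexeme applies to each and every grapheme, a processor can expand lexemes into a flat grapheme-to-pronunciations map. The following is a minimal Python sketch using illustrative tuples rather than parsed XML:

```python
def grapheme_map(lexemes):
    """Build a grapheme -> pronunciations map from (graphemes,
    pronunciations) pairs. Every pronunciation of a lexeme applies
    to each and every grapheme of that lexeme."""
    mapping = {}
    for graphemes, pronunciations in lexemes:
        for g in graphemes:
            mapping.setdefault(g, []).extend(pronunciations)
    return mapping

# The multi-orthography Japanese lexeme above: three written forms,
# one pronunciation, shared by all three.
m = grapheme_map([(["nihongo", "日本語", "にほんご"], ["ɲihoŋo"])])
print(m["日本語"])  # ['ɲihoŋo']
```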
<phoneme> Element

A <lexeme> MAY contain one or more <phoneme> elements. The <phoneme> element contains text describing how the <lexeme> is pronounced. The <phoneme> element MUST NOT be empty, and MUST NOT contain subelements.

A <phoneme> element MAY optionally have an alphabet attribute which indicates the pronunciation alphabet that is used for this <phoneme> element only. The legal values for the alphabet attribute are described in Section 2.

The prefer attribute is OPTIONAL and indicates the preferred pronunciation to be used by a speech synthesis engine. The possible values are "true" and "false"; the default value is "false". The prefer mechanism spans both the <phoneme> and <alias> elements; see the examples in Section 4.7. Section 4.9 describes how multiple pronunciations are specified in PLS for ASR and TTS, and Section 4.9.3 gives many examples.
More than one pronunciation per lexical entry:
<?xml version="1.0" encoding="UTF-8"?> <lexicon version="1.0" xmlns="http://www.w3.org/2005/01/pronunciation-lexicon" alphabet="ipa" xml:lang="en-US"> <lexeme> <grapheme>huge</grapheme> <phoneme prefer="true">hjuːʤ</phoneme> <!-- IPA string is: "hjuːʤ" --> <phoneme>juːʤ</phoneme> <!-- IPA string is: "juːʤ" --> </lexeme> </lexicon>
More than one written form and more than one pronunciation:
<?xml version="1.0" encoding="UTF-8"?> <lexicon version="1.0" xmlns="http://www.w3.org/2005/01/pronunciation-lexicon" alphabet="ipa" xml:lang="en-US"> <lexeme> <grapheme>theater</grapheme> <grapheme>theatre</grapheme> <phoneme prefer="true">ˈθɪətər</phoneme> <!-- IPA string is: "ˈθɪətər" --> <phoneme>ˈθiːjətər</phoneme> <!-- IPA string is: "ˈθiːjətər" --> </lexeme> </lexicon>
An example of a <phoneme> that changes the pronunciation alphabet to a proprietary one.
<?xml version="1.0" encoding="UTF-8"?> <lexicon version="1.0" xmlns="http://www.w3.org/2005/01/pronunciation-lexicon" alphabet="ipa" xml:lang="en-US"> <lexeme> <grapheme>color</grapheme> <phoneme>ˈkʌlər</phoneme> <!-- IPA string is: "ˈkʌlər" --> </lexeme> <lexeme> <grapheme>XYZ</grapheme> <phoneme alphabet="x-example-alphabet">XYZ</phoneme> <!-- The above pronunciation is given in a proprietary alphabet called: "x-example-alphabet" --> </lexeme> </lexicon>
<alias> Element

A <lexeme> element MAY contain one or more <alias> elements, which are used to indicate the pronunciation of an acronym or an abbreviated term in terms of other orthographies, or other substitutions as necessary; see the examples below and in Section 4.9.3. The <alias> element MUST NOT be empty, and MUST NOT contain subelements.

In a <lexeme> element, both <alias> elements and <phoneme> elements MAY be present. If authors want explicit control over the pronunciation, they can use the <phoneme> element instead of the <alias> element.

The <alias> element has an OPTIONAL prefer attribute analogous to the prefer attribute of the <phoneme> element; see Section 4.6 for a normative description of the prefer attribute.

Pronunciations of <alias> element contents MUST be generated by the processor without invoking recursion on the <alias> elements of any constituent graphemes.

Acronym expansion using the <alias> element:
<?xml version="1.0" encoding="UTF-8"?> <lexicon version="1.0" xmlns="http://www.w3.org/2005/01/pronunciation-lexicon" alphabet="ipa" xml:lang="en-US"> <lexeme> <grapheme>W3C</grapheme> <alias>World Wide Web Consortium</alias> </lexeme> </lexicon>
The following example illustrates a combination of <alias> and <phoneme> elements. The indicated acronym, "GNU", has only two pronunciations, because recursion of <alias> is not permissible.
<?xml version="1.0" encoding="UTF-8"?> <lexicon version="1.0" xmlns="http://www.w3.org/2005/01/pronunciation-lexicon" alphabet="ipa" xml:lang="en-US"> <lexeme> <grapheme>GNU</grapheme> <alias>GNU is Not Unix</alias> <phoneme>gəˈnuː</phoneme> <!-- IPA string is: "gəˈnuː" --> </lexeme> </lexicon>
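The non-recursion rule above amounts to a single-pass substitution: the text substituted for a grapheme is not itself looked up again. A minimal Python sketch, with hypothetical data structures that are not part of the specification:

```python
def expand(token, aliases):
    """Single-pass alias substitution. The substituted text is NOT
    re-expanded, mirroring the rule that <alias> processing MUST NOT
    recurse on constituent graphemes."""
    return aliases.get(token, token)

aliases = {"GNU": "GNU is Not Unix"}
# The "GNU" inside the expansion is left alone; naive recursion
# would otherwise loop forever on this self-referential acronym.
print(expand("GNU", aliases))  # GNU is Not Unix
```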
<example> Element

The <example> element includes an example sentence that illustrates an occurrence of this lexeme. Because the examples are explicitly marked, automated tools can be used for regression testing and for the generation of pronunciation lexicon documentation.

The <example> element MUST NOT be empty, and MUST NOT contain subelements. Zero, one or many <example> elements MAY be provided for a single <lexeme> element.
<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0"
      xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
      alphabet="ipa" xml:lang="en-US">
  <lexeme>
    <grapheme>lead</grapheme>
    <phoneme>led</phoneme>
    <example>My feet were as heavy as lead.</example>
  </lexeme>
  <lexeme>
    <grapheme>lead</grapheme>
    <phoneme>liːd</phoneme>
    <!-- IPA string is: "liːd" -->
    <example>The guide once again took the lead.</example>
  </lexeme>
</lexicon>
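One way such automated tooling might use the marked examples is sketched below in Python. This is an illustrative sketch, not a tool defined by this specification: the function names (examples, check_examples) and the sanity check applied (every example sentence should mention one of its lexeme's graphemes) are assumptions of the sketch.

```python
import xml.etree.ElementTree as ET

PLS_NS = "{http://www.w3.org/2005/01/pronunciation-lexicon}"

def examples(pls_source):
    """Yield (graphemes, example sentence) pairs from a PLS document string."""
    root = ET.fromstring(pls_source)
    for lexeme in root.findall(PLS_NS + "lexeme"):
        graphemes = [g.text for g in lexeme.findall(PLS_NS + "grapheme")]
        for ex in lexeme.findall(PLS_NS + "example"):
            yield graphemes, ex.text

def check_examples(pls_source):
    """Regression check: every example should mention one of its graphemes."""
    failures = []
    for graphemes, sentence in examples(pls_source):
        if not any(g.lower() in sentence.lower() for g in graphemes):
            failures.append((graphemes, sentence))
    return failures

# Minimal PLS document for demonstration.
SAMPLE = (
    '<?xml version="1.0"?>'
    '<lexicon version="1.0"'
    ' xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"'
    ' alphabet="ipa" xml:lang="en-US">'
    '<lexeme><grapheme>lead</grapheme><phoneme>led</phoneme>'
    '<example>My feet were as heavy as lead.</example></lexeme>'
    '</lexicon>'
)

print(check_examples(SAMPLE))  # → [] (no failing examples)
```

A documentation generator could reuse the same examples iterator to emit a table of lexemes with their illustrative sentences.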
This section describes the treatment of multiple pronunciations specified in a PLS document for ASR and TTS.
If more than one pronunciation for a given <lexeme> is specified (either by <phoneme> elements, <alias> elements, or a combination of both), an ASR processor MUST consider each of them a valid pronunciation for the word. See Example 2 and the following examples in Section 4.9.3.

If more than one <lexeme> contains the same <grapheme>, all of their pronunciations are collected in document order, and an ASR processor MUST consider all of them valid pronunciations for the <grapheme>. See Example 7 and Example 8 in Section 4.9.3.
If more than one pronunciation for a given <lexeme> is specified (either by <phoneme> elements, <alias> elements, or a combination of both), a TTS processor MUST use the first one in document order that has the prefer attribute set to "true". If none of the pronunciations has prefer set to "true", the TTS processor MUST use the first one in document order. See Example 2 and the following examples in Section 4.9.3.

If more than one <lexeme> contains the same <grapheme>, all of their pronunciations are collected in document order, and a TTS processor MUST use the first one in document order that has the prefer attribute set to "true". If none of the pronunciations has prefer set to "true", the TTS processor MUST use the first one in document order. See Example 7 and Example 8 in Section 4.9.3.
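The two selection rules can be condensed into a few lines of code. The Python sketch below is illustrative only: it assumes the pronunciations for a grapheme have already been collected from the matching lexemes into a document-ordered list of (pronunciation, prefer) pairs; the function names are hypothetical.

```python
def asr_pronunciations(entries):
    """ASR view: every collected pronunciation is valid, in document order.

    `entries` is a document-ordered list of (pronunciation, prefer) pairs
    gathered from all <phoneme>/<alias> children of every <lexeme> that
    contains the grapheme.
    """
    return [pron for pron, _prefer in entries]

def tts_pronunciation(entries):
    """TTS view: first pronunciation in document order with prefer="true";
    if none is preferred, the first pronunciation in document order."""
    for pron, prefer in entries:
        if prefer:
            return pron
    return entries[0][0]

# Example 3 from Section 4.9.3: "lead" with a preferred second pronunciation.
entries = [("led", False), ("liːd", True)]
print(asr_pronunciations(entries))  # → ['led', 'liːd']
print(tts_pronunciation(entries))   # → liːd
```

Because ties are broken purely by document order, reordering lexemes or pronunciations in a PLS document can change TTS output even when no prefer attribute changes.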
The following examples are designed to describe and illustrate the most common examples of multiple pronunciations. Both ASR and TTS behavior is described.
In the following example, there is only one pronunciation. It will be used by both ASR and TTS processors.
<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0"
      xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
      alphabet="ipa" xml:lang="en-US">
  <lexeme>
    <grapheme>bead</grapheme>
    <phoneme>biːd</phoneme>
    <!-- IPA string is: "biːd" -->
  </lexeme>
</lexicon>
In the following example, there are two pronunciations. An ASR processor will recognize both pronunciations, whereas a TTS processor will only use the first one (because it is first in document order).
<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0"
      xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
      alphabet="ipa" xml:lang="en-US">
  <lexeme>
    <grapheme>read</grapheme>
    <phoneme>red</phoneme>
    <phoneme>riːd</phoneme>
    <!-- IPA string is: "riːd" -->
  </lexeme>
</lexicon>
In the following example, there are two pronunciations. An ASR processor will recognize both pronunciations, whereas a TTS processor will only use the second one (because it has prefer set to "true").
<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0"
      xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
      alphabet="ipa" xml:lang="en-US">
  <lexeme>
    <grapheme>lead</grapheme>
    <phoneme>led</phoneme>
    <phoneme prefer="true">liːd</phoneme>
    <!-- IPA string is: "liːd" -->
  </lexeme>
</lexicon>
In the following example, "read" has two pronunciations. The first one is specified by means of an alias to "red", which is defined just below it. An ASR processor will recognize both pronunciations, whereas a TTS processor will only use the first one (because it is first in document order). In this example, the alias refers to a lexeme later in the lexicon, but in general, this order is not relevant.
<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0"
      xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
      alphabet="ipa" xml:lang="en-US">
  <lexeme>
    <grapheme>read</grapheme>
    <alias>red</alias>
    <phoneme>riːd</phoneme>
    <!-- IPA string is: "riːd" -->
  </lexeme>
  <lexeme>
    <grapheme>red</grapheme>
    <phoneme>red</phoneme>
  </lexeme>
</lexicon>
In the following example, there are two pronunciations for "lead". Both are given with prefer set to "true". An ASR processor will recognize both pronunciations, whereas a TTS processor will only use the first one (because it is first in document order).
<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0"
      xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
      alphabet="ipa" xml:lang="en-US">
  <lexeme>
    <grapheme>lead</grapheme>
    <alias prefer="true">led</alias>
    <phoneme prefer="true">liːd</phoneme>
    <!-- IPA string is: "liːd" -->
  </lexeme>
  <lexeme>
    <grapheme>led</grapheme>
    <phoneme>led</phoneme>
  </lexeme>
</lexicon>
In the following example, there are two pronunciations. An ASR processor will recognize both pronunciations, whereas a TTS processor will only use the second one (because it has prefer set to "true"). Note that the <alias> entry "led" inside the lexeme for "lead" does not inherit the prefer="true" that is set on the pronunciation of the "led" lexeme.
<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0"
      xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
      alphabet="ipa" xml:lang="en-US">
  <lexeme>
    <grapheme>lead</grapheme>
    <alias>led</alias>
    <phoneme prefer="true">liːd</phoneme>
    <!-- IPA string is: "liːd" -->
  </lexeme>
  <lexeme>
    <grapheme>led</grapheme>
    <phoneme prefer="true">led</phoneme>
  </lexeme>
</lexicon>
In the following example, "lead" has two different entries in the lexicon. An ASR processor will recognize both pronunciations given here, but a TTS processor will only use the "led" pronunciation, because it is the first one in document order.
<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0"
      xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
      alphabet="ipa" xml:lang="en-US">
  <lexeme>
    <grapheme>lead</grapheme>
    <phoneme>led</phoneme>
  </lexeme>
  <lexeme>
    <grapheme>lead</grapheme>
    <phoneme>liːd</phoneme>
    <!-- IPA string is: "liːd" -->
  </lexeme>
</lexicon>
In the following example, there are two pronunciations in each of two different lexeme entries in the same lexicon document. An ASR processor will recognize both pronunciations given here, but a TTS processor will only use the "liːd" pronunciation, because it is the first one in document order that has prefer set to "true".
<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0"
      xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
      alphabet="ipa" xml:lang="en-US">
  <lexeme>
    <grapheme>lead</grapheme>
    <alias>led</alias>
    <phoneme prefer="true">liːd</phoneme>
    <!-- IPA string is: "liːd" -->
  </lexeme>
  <lexeme>
    <grapheme>lead</grapheme>
    <phoneme prefer="true">led</phoneme>
    <phoneme>liːd</phoneme>
    <!-- IPA string is: "liːd" -->
  </lexeme>
</lexicon>
This section is informative.
In its simplest form the Pronunciation Lexicon language allows orthographies (the textual representation) to be associated with pronunciations (the phonetic/phonemic representation). A Pronunciation Lexicon document typically contains multiple entries. So, for example, to specify the pronunciation for proper names, such as "Newton" and "Scahill", the markup will look like the following.
<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0"
      xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
      alphabet="ipa" xml:lang="en-GB">
  <lexeme>
    <grapheme>Newton</grapheme>
    <phoneme>ˈnjuːtən</phoneme>
    <!-- IPA string is: "ˈnjuːtən" -->
  </lexeme>
  <lexeme>
    <grapheme>Scahill</grapheme>
    <phoneme>ˈskɑhɪl</phoneme>
    <!-- IPA string is: "ˈskɑhɪl" -->
  </lexeme>
</lexicon>
Here we see the root element <lexicon>, which contains the two lexemes for the words "Newton" and "Scahill". Each <lexeme> is a composite element consisting of the orthographic and pronunciation representations for the entry. Each of the two <lexeme> elements contains a single <grapheme> element, which includes the orthographic text, and a <phoneme> element, which includes the pronunciation. In this case the alphabet attribute of the <lexicon> element is set to "ipa", so the International Phonetic Alphabet [IPA] has to be used for all the pronunciations.
For ASR systems it is common to rely on multiple pronunciations of the same word or phrase in order to cope with variations of pronunciation within a language. In the Pronunciation Lexicon language, multiple pronunciations are represented by more than one <phoneme> element within the same <lexeme> element. In the following example the word "Newton" has two possible pronunciations.
<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0"
      xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
      alphabet="ipa" xml:lang="en-GB">
  <lexeme>
    <grapheme>Newton</grapheme>
    <phoneme>ˈnjuːtən</phoneme>
    <!-- IPA string is: "ˈnjuːtən" -->
    <phoneme>ˈnuːtən</phoneme>
    <!-- IPA string is: "ˈnuːtən" -->
  </lexeme>
</lexicon>
Where only a single pronunciation needs to be selected from among multiple available pronunciations (for example, where a pronunciation lexicon is also being used by a speech synthesis system), the prefer attribute on the <phoneme> element may be used to indicate the preferred pronunciation.
In some situations there are alternative textual representations for the same word or phrase. This can arise for a number of reasons; see Section 4.5 for details. Because these are representations that have the same meaning (as opposed to homophones), it is recommended that they be represented using a single <lexeme> element that contains multiple graphemes. Here are two simple examples of multiple orthographies: alternative spellings of an English word, and multiple writings of a Japanese word.
<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0"
      xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
      alphabet="ipa" xml:lang="en-US">
  <!-- English entry showing how alternative spellings are handled -->
  <lexeme>
    <grapheme>colour</grapheme>
    <grapheme>color</grapheme>
    <phoneme>ˈkʌlər</phoneme>
    <!-- IPA string is: "ˈkʌlər" -->
  </lexeme>
</lexicon>
<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0"
      xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
      alphabet="ipa" xml:lang="ja">
  <!-- Japanese entry showing how multiple writing systems are handled:
       romaji, kanji and hiragana orthographies -->
  <lexeme>
    <grapheme>nihongo</grapheme>
    <grapheme>日本語</grapheme>
    <grapheme>にほんご</grapheme>
    <phoneme>ɲihoŋo</phoneme>
    <!-- IPA string is: "ɲihoŋo" -->
  </lexeme>
</lexicon>
A different case is that of the English names "Smyth" and "Smith". Here the pronunciations overlap rather than being exactly the same: the two names share one pronunciation, but "Smyth" also has a pronunciation that applies only to itself. Hence this needs to be represented using multiple <lexeme> elements.
<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0"
      xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
      alphabet="ipa" xml:lang="en-US">
  <lexeme>
    <grapheme>Smyth</grapheme>
    <grapheme>Smith</grapheme>
    <phoneme>smɪθ</phoneme>
    <!-- IPA string is: "smɪθ" -->
  </lexeme>
  <lexeme>
    <grapheme>Smyth</grapheme>
    <phoneme>smaɪð</phoneme>
    <!-- IPA string is: "smaɪð" -->
  </lexeme>
</lexicon>
Most languages have homophones, words with the same pronunciation but different meanings (and possibly different spellings), for instance "seed" and "cede". It is recommended that these be represented as different lexemes.
<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0"
      xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
      alphabet="ipa" xml:lang="en-US">
  <lexeme>
    <grapheme>cede</grapheme>
    <phoneme>siːd</phoneme>
    <!-- IPA string is: "siːd" -->
  </lexeme>
  <lexeme>
    <grapheme>seed</grapheme>
    <phoneme>siːd</phoneme>
    <!-- IPA string is: "siːd" -->
  </lexeme>
</lexicon>
Most languages have words with different meanings but the same spelling (and sometimes different pronunciations), called homographs. For example, in English the word "bass" (a fish) and "bass" (in music) have identical spellings but different meanings and pronunciations. It is recommended that these words be represented using separate <lexeme> elements, with the role attribute used to differentiate them (see Section 4.4). However, if a pronunciation lexicon author does not want to distinguish between the two words, they could simply be represented as alternative pronunciations within the same <lexeme> element. In the latter case the TTS processor will not be able to determine when to apply the first or the second transcription. In this example the pronunciations of the homograph "bass" are shown.
<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0"
      xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
      alphabet="ipa" xml:lang="en-US">
  <lexeme>
    <grapheme>bass</grapheme>
    <phoneme>bæs</phoneme>
    <!-- IPA string is: "bæs" -->
    <phoneme>beɪs</phoneme>
    <!-- IPA string is: "beɪs" -->
  </lexeme>
</lexicon>
Note that English contains numerous examples of noun-verb pairs that can be treated either as homographs or as alternative pronunciations, depending on author preference. Two examples are the noun/verb "refuse" and the noun/verb "address".
<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0"
      xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
      xmlns:mypos="http://www.example.com/my_pos_namespace"
      alphabet="ipa" xml:lang="en-US">
  <lexeme role="mypos:verb">
    <grapheme>refuse</grapheme>
    <phoneme>rɪˈfjuːz</phoneme>
    <!-- IPA string is: "rɪˈfjuːz" -->
  </lexeme>
  <lexeme role="mypos:noun">
    <grapheme>refuse</grapheme>
    <phoneme>ˈrefjuːs</phoneme>
    <!-- IPA string is: "ˈrefjuːs" -->
  </lexeme>
</lexicon>
For some words and phrases the pronunciation can be quickly and conveniently expressed as a sequence of other orthographies. The developer is not required to have linguistic knowledge, but instead makes use of the pronunciations that are already expected to be available. To express pronunciations using other orthographies, the <alias> element may be used. This feature can be very useful for dealing with acronym expansion.
<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0"
      xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
      alphabet="ipa" xml:lang="en-US">
  <!-- Acronym expansion -->
  <lexeme>
    <grapheme>W3C</grapheme>
    <alias>World Wide Web Consortium</alias>
  </lexeme>
  <!-- number representation -->
  <lexeme>
    <grapheme>101</grapheme>
    <alias>one hundred and one</alias>
  </lexeme>
  <!-- crude pronunciation mechanism -->
  <lexeme>
    <grapheme>Thailand</grapheme>
    <alias>tie land</alias>
  </lexeme>
  <!-- crude pronunciation mechanism and acronym expansion -->
  <lexeme>
    <grapheme>BBC 1</grapheme>
    <alias>be be sea one</alias>
  </lexeme>
</lexicon>
This specification was written with the help of the following people (listed in alphabetical order):
The editor wishes to thank the first author of this document, Frank Scahill, BT.
This section is normative.
There are two schemas which can be used to validate PLS documents:
- an XML Schema, located at "http://www.w3.org/2006/01/pronunciation-lexicon/pls.xsd"
- a RELAX NG schema, located at "http://www.w3.org/2006/01/pronunciation-lexicon/pls.rng"

This section is normative.
The media type associated with Pronunciation Lexicon Specification documents is "application/pls+xml" and the filename suffix is ".pls", as defined in [RFC4267].
This section is informative.
Speech applications that use a PLS document need a mechanism enabling them
to retrieve appropriate lexical content. In the simplest of cases, an
application will search the PLS document for <grapheme>
elements with content that exactly matches the input and retrieve all
corresponding lexemes. In general, however, the retrieval of
lexical content is not so trivial; it is necessary to define what
constitutes an exact match and which lexemes are to be retrieved when competing
matches can apply.
Here is an example of an approach to retrieve appropriate lexical content.
1. Tokenize the input, matching tokens against <grapheme> content and treating contractions as matchable units; for example, given the input "don't", match a <grapheme> element with content "n't".
2. Retrieve the lexemes corresponding to the <grapheme> element whose content exactly matches the longest possible sequence of consecutive tokens. Thus, a lexeme for "they'll" should have precedence over a lexeme for "they" given the input "they'll".

This outlined approach is designed principally with the needs of English in mind and should be modified to accommodate the particular requirements of other languages.
It is recommended that applications that use a PLS document describe the approach they adopt to retrieving lexical content.
An application that uses the following PLS document:
<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0"
      xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
      alphabet="ipa" xml:lang="en-US">
  <lexeme>
    <grapheme>New York</grapheme>
    <alias>NY</alias>
  </lexeme>
  <lexeme>
    <grapheme>York City</grapheme>
    <alias>YC</alias>
  </lexeme>
</lexicon>
should process "New York City" as "NY City" rather than "New YC" if it uses the above approach.
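The longest-match retrieval behavior described above can be sketched in a few lines. The following Python fragment is an illustrative sketch, not a normative algorithm: the lookup function, the whitespace tokenization, and the grapheme-to-replacement dictionary are all assumptions of the sketch.

```python
def lookup(tokens, graphemes):
    """Greedy longest-match of token sequences against lexicon graphemes.

    Returns the token list with each matched sequence replaced by the
    value stored for that grapheme (here, the alias text).
    """
    out, i = [], 0
    while i < len(tokens):
        # Try the longest candidate sequence starting at position i first.
        for j in range(len(tokens), i, -1):
            candidate = " ".join(tokens[i:j])
            if candidate in graphemes:
                out.append(graphemes[candidate])
                i = j
                break
        else:
            out.append(tokens[i])  # no lexeme matches this token
            i += 1
    return out

lexicon = {"New York": "NY", "York City": "YC"}
print(lookup("New York City".split(), lexicon))  # → ['NY', 'City']
```

Because matching starts from the leftmost token and always prefers the longest candidate, "New York" wins over "York City" for the input "New York City", yielding "NY City" as described above.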