Position Paper of Panasonic Beijing Laboratory for W3C Workshop on Internationalizing the SSML
Hairong Xia
Panasonic Beijing Laboratory
Panasonic Beijing Laboratory (PBL), established in 2001, is an overseas laboratory of Matsushita Electric Industrial Co., Ltd. The goal of PBL is to develop next-generation human-machine interaction technologies. Currently, we focus on R&D in speech and language processing, robot control, and image processing. Past projects include home networks, sensor networks, and digital TV.
As Panasonic's pioneer laboratory in China, PBL participates actively in technology activities and organizations such as the AVS and DTV standardization groups. PBL also maintains good relationships with local universities and institutes by sponsoring national academic conferences, and is willing to cooperate with other companies and organizations to promote technological progress.
The speech synthesis team belongs to the speech technology group in PBL. In 2004, Panasonic initiated a global project on high-performance speech technologies, with members from Japan, the USA (Panasonic Speech Technology Lab), and China (PBL).
PBL has been working on Mandarin speech synthesis for more than two years. So far, we have developed a corpus-based Mandarin TTS system for unrestricted text; according to our evaluation, its quality is comparable to that of some commercial products. Development of a system based on a small database is ongoing.
PBL is particularly interested in the representation of word and phrase boundaries and tone inflection for Chinese, although other topics are attractive to us as well. We are considering the following extensions to the current SSML for Chinese.
3.1 Dialect selection
Explanation:
There are several widely spoken dialects in China, including Szechwanese, Cantonese, and Shanghainese. Although SSML 1.0 already includes the language attribute xml:lang, how to support dialects that have no defined language code remains an open problem. Our suggestion is to add a secondary language attribute, e.g. ssml:lang2="cn-sc" for Szechwanese.
Example:
<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
xml:lang="en-US">
<!--Use Szechwanese to synthesize the following sentence ("Welcome to visit Chengdu."). -->
<p xml:lang="zh-cn" ssml:lang2="cn-sc">欢迎来成都游览。</p>
</speak>
3.2 Pronunciation of character
Explanation:
According to [1], there are about 1036 polyphonic characters (characters with more than one pronunciation) in Chinese, which may cause an overall error rate of about 0.88% in grapheme-to-phoneme conversion. Most synthesizers employ algorithms to reduce this error rate; however, the most reliable solution is to introduce a special tag that tells the system how to read a character. Since SSML 1.0 already includes the "phoneme" element, what remains is to specify the alphabet to be used in the "phoneme" tag for Chinese.
Further issues in Chinese pronunciation are "Er-hua" (rhotacization, 儿化) and "Qingsheng" (the neutral tone, 轻声). An ideal synthesizer might be able to decide automatically when to read characters with Er-hua or Qingsheng, but it is still a good idea to let the script label those characters explicitly (a possible markup is sketched after the example below).
Example:
<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
xml:lang="zh-cn">
<!--Polyphonic character "将". -->
<p>你定购的衣服<phoneme alphabet="?" ph="jiang1">将</phoneme>按照地址送到您的家中。</p>
</speak>
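One possible way to label Er-hua and Qingsheng explicitly is to reuse the "phoneme" element and encode the rhotacized syllable or the neutral tone directly in its "ph" value. The sketch below is only an illustration; the notations "huar1" and "zi0" (0 marking the neutral tone) are assumptions rather than an agreed alphabet.
<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
xml:lang="zh-cn">
<!-- Er-hua: read "花儿" as one rhotacized syllable; Qingsheng: read "子" in the neutral tone.
     The notations "huar1" and "zi0" are illustrative only. -->
<p>她在院子里种了一盆<phoneme alphabet="?" ph="huar1">花儿</phoneme>。</p>
<p>请把<phoneme alphabet="?" ph="zhuo1 zi0">桌子</phoneme>搬过来。</p>
</speak>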
3.3 Sound effect filter
Explanation:
Sound effects may be useful in some circumstances. Some devices or systems may not have enough bandwidth to transmit high-quality voice signals; in this case, a low-pass filter is needed to restrict the bandwidth of the voice. Another use case is to play soft background music while reading mail aloud.
Example:
<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
xml:lang="zh-cn">
<!--Use "some-filter" to render the synthesized sound.
    (The "filter" element below is only an illustration; the exact markup is open.) -->
<p><filter name="some-filter">您有一封新邮件。</filter></p>
</speak>
3.4 Major phrase boundaries
Explanation:
For most SSML script authors, it is very hard to choose the break strength reliably from values such as "x-weak", "weak", and "strong". On the other hand, today's synthesis systems can determine word and minor-phrase boundaries with satisfactory accuracy. It could therefore be better to ask authors only to divide a sentence into major prosodic groups with an "<L3/>" tag. The original labels such as "x-weak" can still be used to suppress a prosodic break that the processor would otherwise produce, and other tags such as "<L0/>", "<L1/>" and "<L2/>" are useful as well.
Example:
<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
xml:lang="zh-cn">
<!--Major phrase label: <L3/>. The positions of <L3/> below are illustrative. -->
<p>你定购的衣服<L3/>将按照地址<L3/>送到您的家中。</p>
</speak>
3.5 Speaking style template
Explanation:
SSML 1.0 provides many elements for adjusting prosody and style. However, it is inconvenient to write this markup repeatedly, especially when certain sentences are always read in the same style. Our suggestion is to allow authors to define style templates with which they can apply a style easily.
Example:
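One possible form, assuming a hypothetical "style-template" element that bundles prosody settings under a name and a "style" attribute that applies them to an element (these names are illustrative only, not part of SSML 1.0):
<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
xml:lang="zh-cn">
<!-- Define a reusable style template (the "style-template" and "style" markup is illustrative only). -->
<style-template name="news-reading">
  <prosody rate="slow" pitch="low" volume="soft"/>
</style-template>
<!-- Apply the template to a paragraph. -->
<p style="news-reading">现在为您播报今天的新闻。</p>
</speak>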
3.6 Element value: macro
Explanation:
SSML 1.0 supports several value types for the say-as element, including "date", "time", "telephone number", "character string", "cardinal number" and "ordinal number". Similar to the C++ language, a macro type would be useful for SSML as well, so that a frequently used piece of text or markup can be defined once and then referenced by name. The macro may be defined statically or dynamically.
Example:
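One possible form, assuming hypothetical "macro" and "use-macro" elements for defining and expanding a named text fragment (these names are illustrative only, not part of SSML 1.0):
<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
xml:lang="zh-cn">
<!-- Define a macro once and expand it by name (the "macro" and "use-macro" markup is illustrative only). -->
<macro name="company">松下电器（中国）有限公司</macro>
<p>欢迎致电<use-macro name="company"/>。</p>
<p><use-macro name="company"/>祝您节日快乐。</p>
</speak>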
3.7 Extension for say-as element: translation
Explanation:
Mixed text that contains several languages is becoming more and more common. A multilingual synthesizer is the best choice for processing this kind of text, but some synthesizers cannot process foreign languages. In this case, a "translation" value for say-as would be useful for translating unknown foreign words into the local language.
Example:
<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
xml:lang="zh-cn">
<!--Translation say-as element. NSDQ -> 纳斯达克 (NASDAQ) -->
<p>现在为你播报最新的<say-as interpret-as="translation">NSDQ</say-as>行情。</p>
</speak>
4. Reference
[1] Zhang Zirong and Chu Min, "A Statistical Approach for Grapheme-to-Phoneme Conversion in Chinese", Microsoft Research Asia website.