Session 10 - Expression, Speaking Style and Focus

Presentations:
Tsinghua University: Toward Synthesizing Expressive Mandarin Speech
France Télécom: Toward Synthesis of Focus in Mandarin TTS System

Discussion:

How is focus different from emphasis? Emphasis is the way that a speaker indicates focus. Not necessarily [France Télécom speaker gives an example where emphasis isn't focus]. Focus is about semantics, emphasis is about rendering.

The real question is how much annotation to put in SSML. We have to know how we would want the TTS engine to render the focus. There are 2 main levels: logical description vs. rendering. It's the same thing in HTML, with H1 for instance.

So should a focus element, or general logical structure markup, be added to SSML? Is something missing in the rendering controls that must be added through focus markup? Focus can be realised by emphasis and pause. Maybe what's missing is more controls like pause, which the current spec doesn't mention for expressing emphasis.

** Conclusion: we note that when we revisit the topic of semantic vs. rendering level, we should consider focus as a topic.

Speaking styles: "news", "story", "sport", etc. Styles could be mapped onto paragraphs and sentences to give more information. Possible attributes would give more information about the way this piece of text needs to be rendered. Maybe there isn't anything missing.

SSML is a crossing of different semantic levels. What matters is the usefulness of adding semantic markup to help the synthesizer speak the text better. Determining important categories is going to be very hard, just like POS.

Opinions differ among vendors regarding the usefulness of adding markup for styles. Some think speech synthesis with style is still too far off and needs more research. Others think it's nearer than that and have implemented some. It could be an optional feature. There is less agreement on this feature than on some others. The question is whether this is the right time to standardize.
So what should we do about expressive elements? Is there enough agreement on how they should be represented, and is there enough agreement on how they should be rendered? E.g. how many basic emotions: can we agree to name 6? Then how do you describe what happens when you use them? Is "anger" OK, or do you need an anger intensity?

There is enough interest. We should revisit this to understand whether it's ready to be standardised. There may be an issue with providing support in all voices in all languages. Optional behaviour, etc.

There are all those levels from semantics to lexical. The problem is whether SSML must remain on the levels it's at now, or whether we want it to change, for new purposes like research.
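
Illustration of the logical vs. rendering levels from the focus discussion: a minimal sketch, not something proposed at the session. It assumes focus is given as a logical annotation (the index of the focused word) and is rendered with the existing SSML 1.0 `<emphasis>` and `<break>` elements. The function name `render_focus`, the choice to put the pause before the focused word, and the 200 ms default are all hypothetical.

```python
import xml.etree.ElementTree as ET


def render_focus(words, focus_index, pause_ms=200):
    """Render a logical focus annotation with SSML rendering elements.

    Assumption (not from the minutes): focus is realised as a short
    pause followed by strong emphasis on the focused word.
    """
    speak = ET.Element("speak", {"version": "1.0", "xml:lang": "en-US"})

    before = " ".join(words[:focus_index])
    after = " ".join(words[focus_index + 1:])

    # Text preceding the focused word becomes the root element's text.
    speak.text = (before + " ") if before else ""

    # Rendering level: a pause, then emphasis on the focused word.
    ET.SubElement(speak, "break", {"time": f"{pause_ms}ms"})
    emphasis = ET.SubElement(speak, "emphasis", {"level": "strong"})
    emphasis.text = words[focus_index]

    # Text following the focused word is the tail of <emphasis>.
    emphasis.tail = (" " + after) if after else ""

    return ET.tostring(speak, encoding="unicode")


if __name__ == "__main__":
    print(render_focus(["I", "said", "Tuesday", "not", "Monday"], 2))
```

The caller only states what is in focus; how it is spoken is decided inside the function. A focus element in SSML would move that decision into the TTS engine, which is the trade-off the discussion leaves open.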