Copyright © 2004 W3C® (MIT, ERCIM, Keio), All Rights Reserved. W3C liability, trademark, document use and software licensing rules apply.
CSS (Cascading Style Sheets) is a language for describing the rendering of HTML and XML documents on screen, on paper, in speech, etc. CSS defines aural properties that give control over rendering XML to speech. This draft describes the text-to-speech properties proposed for CSS level 3. These are designed to match the model described in the Speech Synthesis Markup Language (SSML).
The CSS3 Speech Module is a community effort and if you would like to help with implementation and driving the specification forward along the W3C Recommendation track, please contact the editors.
This section describes the status of this document at the time of its publication. Other documents may supersede it. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.
This document is a draft of one of the "modules" for the upcoming CSS3 specification.
This document is a working draft of the CSS working group which is part of the style activity (see summary). It has been developed in cooperation with the Voice Browser working group.
The CSS working group would like to receive feedback: comments on this draft may be sent to the editors, discussion takes place on the (archived) public mailing list www-style@w3.org (see instructions). W3C Members can also send comments directly to the CSS working group.
This document was produced under the 24 January 2002 CPP as amended by the W3C Patent Policy Transition Procedure. Patent disclosures relevant to CSS may be found on the Working Group's public patent disclosure page.
This CSS3 module depends on the following other CSS3 modules:
It has non-normative (informative) references to the following other CSS3 modules:
The speech rendering of a document, already commonly used by the blind and print-impaired communities, combines speech synthesis and "auditory icons". Often such aural presentation occurs by converting the document to plain text and feeding this to a screen reader — software or hardware that simply reads all the characters on the screen. This results in less effective presentation than would be the case if the document structure were retained. Style sheet properties for text to speech may be used together with visual properties (mixed media) or as an aural alternative to visual presentation.
Besides the obvious accessibility advantages, there are other large markets for listening to information, including in-car use, industrial and medical documentation systems (intranets), home entertainment, and to help users learning to read or who have difficulty reading.
When using voice properties, the canvas consists of a two channel stereo space and a temporal space (you can specify audio cues before and after synthetic speech). The CSS properties also allow authors to vary the characteristics of synthetic speech (voice type, frequency, inflection, etc.).
Examples:
h1, h2, h3, h4, h5, h6 { voice-family: paul; voice-stress: moderate; cue-before: url("ping.au") }
p.heidi { voice-balance: left; voice-family: female }
p.peter { voice-balance: right; voice-family: male }
p.goat { voice-volume: soft }
This will direct the speech synthesizer to speak headers in a voice (a kind of "audio font") called "paul". Before speaking the headers, a sound sample will be played from the given URL. Paragraphs with class "heidi" will appear to come from the left (if the sound system is capable of stereo), and paragraphs of class "peter" from the right. Paragraphs with class "goat" will be played softly.
Name: voice-volume
Value: <number> | <percentage> | silent | x-soft | soft | medium | loud | x-loud | inherit
Initial: medium
Applies to: all elements
Inherited: yes
Percentages: refer to inherited value
Media: speech
The 'voice-volume' property refers to the amplitude of the waveform output by the speech synthesizer. This output may be mixed with other audio sources, influencing the perceived loudness of synthetic speech relative to those sources. Note that 'voice-volume' does not apply to audio cues, for which there is a separate means to set the relative loudness.
Values have the following meanings:
User agents should allow the listener to set the level corresponding to '100'. No single setting is universally applicable: suitable values depend on the equipment in use (speakers, headphones), on the environment (car, home theater, library), and on personal preferences.
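For example, a style sheet might quieten asides while keeping body text at the inherited level (the selectors and class names here are invented for illustration):

```css
p { voice-volume: medium }
div.aside { voice-volume: soft }     /* spoken more quietly than body text */
em.warning { voice-volume: x-loud }  /* spoken at the loudest logical level */
```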
Name: voice-balance
Value: <number> | left | center | right | leftwards | rightwards | inherit
Initial: center
Applies to: all elements
Inherited: yes
Percentages: N/A
Media: speech
The 'voice-balance' property refers to the balance between left and right channels, and presumes a two-channel (stereo) model that is widely supported on consumer audio equipment.
Values have the following meanings:
Many speech synthesizers only support a single channel. The 'voice-balance' property can then be treated as part of a post-synthesis mixing step, in which speech is mixed with other audio sources. Note that, unlike 'voice-volume', 'voice-balance' does apply to cues.
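As a non-normative illustration, a style sheet for rendering a dialogue might place each speaker in the stereo field (the class names are invented for the example):

```css
.narrator { voice-balance: center }
.speaker-a { voice-balance: left }
.speaker-b { voice-balance: right }
.offstage { voice-balance: rightwards } /* shifted further right, relative to the inherited position */
```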
An additional speech property, speak-header, is described in the CSS module covering tables.
Name: speak
Value: none | normal | spell-out | digits | literal-punctuation | no-punctuation | inherit
Initial: normal
Applies to: all elements
Inherited: yes
Percentages: N/A
Media: speech
This property specifies whether text will be rendered aurally and if so, in what manner. The possible values are:
Note the difference between an element whose 'voice-volume' property has a value of 'silent' and an element whose 'speak' property has the value 'none'. The former takes up the same time as if it had been spoken, including any pause before and after the element, but no sound is generated. The latter requires no time and is not rendered (though its descendants may be).
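This distinction can be illustrated as follows (the class names are invented for the example):

```css
.redacted { voice-volume: silent } /* occupies the time it would take to speak, but makes no sound */
.skipped { speak: none }           /* not rendered at all; takes no time */
```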
Speech synthesizers are knowledgeable about what is a number and what isn't. The speak property gives authors the means to control how the synthesizer renders the numbers it discovers in the source text, and may be implemented as a preprocessing step before passing the text to the speech synthesizer.
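For instance, an author might ask for a code to be read digit by digit while ordinary quantities use the synthesizer's normal number rules (class names invented; the renderings in the comments are one plausible outcome, not mandated behavior):

```css
span.zip-code { speak: digits } /* "12345" might be read "one two three four five" */
span.quantity { speak: normal } /* "12345" read as an ordinary number */
```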
Name: pause-before
Value: <time> | none | x-weak | weak | medium | strong | x-strong | inherit
Initial: implementation dependent
Applies to: all elements
Inherited: no
Percentages: N/A
Media: speech
Name: pause-after
Value: <time> | none | x-weak | weak | medium | strong | x-strong | inherit
Initial: implementation dependent
Applies to: all elements
Inherited: no
Percentages: N/A
Media: speech
These properties specify a pause or prosodic boundary to be observed before (or after) speaking an element's content. Values have the following meanings:
The pause is inserted between the element's content and any 'cue-before' or 'cue-after' content. Adjacent pauses should be merged by selecting the strongest named break and the longest absolute time interval. Thus 'strong' is selected when comparing 'strong' and 'weak', while '1s' is selected when comparing '1s' and '250ms'. A combination of a named break and a time duration is treated additively.
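As a sketch of the merging rule (the selectors are invented for the example):

```css
h2 { pause-after: weak }
p.lead { pause-before: strong } /* between an h2 and a following p.lead, the merged pause is 'strong' */
```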
Name: pause
Value: [ <'pause-before'> || <'pause-after'> ] | inherit
Initial: implementation dependent
Applies to: all elements
Inherited: no
Percentages: N/A
Media: speech
The 'pause' property is a shorthand for setting 'pause-before' and 'pause-after'. If two values are given, the first value is 'pause-before' and the second is 'pause-after'. If only one value is given, it applies to both properties.
Examples:
H1 { pause: 20ms } /* pause-before: 20ms; pause-after: 20ms */
H2 { pause: 30ms 40ms } /* pause-before: 30ms; pause-after: 40ms */
H3 { pause-after: 10ms } /* pause-before: unspecified; pause-after: 10ms */
Name: cue-before
Value: <uri> [<number> | <percentage> | silent | x-soft | soft | medium | loud | x-loud] | none | inherit
Initial: none
Applies to: all elements
Inherited: no
Percentages: apply to inherited value for voice-volume
Media: speech
Name: cue-after
Value: <uri> [<number> | <percentage> | silent | x-soft | soft | medium | loud | x-loud] | none | inherit
Initial: none
Applies to: all elements
Inherited: no
Percentages: apply to inherited value for voice-volume
Media: speech
Auditory icons are another way to distinguish semantic elements. Sounds may be played before and/or after the element to delimit it. Values have the following meanings:
Examples:
A {cue-before: url("bell.aiff"); cue-after: url("dong.wav") }
H1 {cue-before: url("pop.au"); cue-after: url("pop.au") }
Name: cue
Value: [ <'cue-before'> || <'cue-after'> ] | inherit
Initial: not defined for shorthand properties
Applies to: all elements
Inherited: no
Percentages: apply to inherited value for voice-volume
Media: speech
The 'cue' property is a shorthand for setting 'cue-before' and 'cue-after'. If two values are given the first value is 'cue-before' and the second is 'cue-after'. If only one value is given, it applies to both properties.
The following two rules are equivalent:
H1 {cue-before: url("pop.au"); cue-after: url("pop.au") }
H1 {cue: url("pop.au") }
If a user agent cannot render an auditory icon (e.g., the user's environment does not permit it), we recommend that it produce an alternative cue (e.g., popping up a warning, emitting a warning sound, etc.).
Please see the sections on the :before and :after pseudo-elements for information on other content generation techniques.
Name: mark-before
Value: <string>
Initial: none
Applies to: all elements
Inherited: no
Percentages: N/A
Media: speech
Name: mark-after
Value: <string>
Initial: none
Applies to: all elements
Inherited: no
Percentages: N/A
Media: speech
The mark properties allow named markers to be attached to the audio stream. For compatibility with SSML, the marker name must conform to the xsd:token datatype as defined in XML Schema. Synthesis processors must do one or both of the following when encountering a mark:
The mark properties have no audible effect on the speech and instead just serve to mark points in the stream.
Values have the following meanings:
Examples:
H1 {mark-before: section}
p {mark-before: attr(id) }
Name: mark
Value: [ <'mark-before'> || <'mark-after'> ]
Initial: not defined for shorthand properties
Applies to: all elements
Inherited: no
Percentages: N/A
Media: speech
The 'mark' property is a shorthand for setting 'mark-before' and 'mark-after'. If two values are given the first value is 'mark-before' and the second is 'mark-after'. If only one value is given, it applies to both properties.
The following two rules are equivalent:
div {mark-before: start; mark-after: end }
div {mark: start end }
Name: voice-family
Value: [[<specific-voice> | [<age>] <generic-voice>] [<number>],]* [<specific-voice> | [<age>] <generic-voice>] [<number>] | inherit
Initial: implementation dependent
Applies to: all elements
Inherited: yes
Percentages: N/A
Media: speech
The value is a comma-separated, prioritized list of voice family names (compare with 'font-family'). Values have the following meanings:
Examples:
h1 { voice-family: announcer, old male }
p.part.romeo { voice-family: romeo, young male }
p.part.juliet { voice-family: juliet, female }
p.part.mercutio { voice-family: male 2 }
p.part.tybalt { voice-family: male 3 }
p.part.nurse { voice-family: child female }
Names of specific voices may be quoted, and indeed must be quoted if any of the words that make up the name do not conform to the syntax rules for identifiers. Any whitespace characters before and after the voice name are ignored. For compatibility with SSML, whitespace characters are not permitted within voice names.
The voice-family property is used to guide the selection of the voice to be used for speech synthesis. The overriding priority is to match the language specified by the xml:lang attribute as per the XML 1.0 specification, and as inherited by nested elements until overridden by a further xml:lang attribute.
If there is no voice available for the requested value of xml:lang, the processor should select a voice that is closest to the requested language (e.g. a variant or dialect of the same language). If multiple such voices are available, the processor should use the voice that best matches the values provided with the 'voice-family' property. It is an error if there are no such matches.
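A non-normative sketch of language-sensitive voice selection using the ':lang()' selector; the specific voice names 'paul' and 'marie' are invented for the example:

```css
:lang(en) { voice-family: paul, male }
:lang(fr) { voice-family: marie, female }
```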
Name: voice-rate
Value: <percentage> | x-slow | slow | medium | fast | x-fast | inherit
Initial: implementation dependent
Applies to: all elements
Inherited: yes
Percentages: refer to default value
Media: speech
This property controls the speaking rate. The default rate for a voice depends on the language and dialect and on the personality of the voice. The default rate should be such that it is experienced as a normal speaking rate for the voice when reading text aloud. Since voices are processor-specific, the default rate will be as well.
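For example (the class names are invented; the percentage is relative to the voice's default rate):

```css
.fine-print { voice-rate: slow }
.recap { voice-rate: 150% } /* half again as fast as the default rate */
```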
Name: voice-pitch
Value: <number> | <percentage> | x-low | low | medium | high | x-high | inherit
Initial: medium
Applies to: all elements
Inherited: yes
Percentages: refer to inherited value
Media: speech
Specifies the average pitch (a frequency) of the speaking voice. The average pitch of a voice depends on the voice family. For example, the average pitch for a standard male voice is around 120Hz, but for a female voice it is around 210Hz.
Values have the following meanings:
ISSUE: should we also allow for relative changes in terms of semitones as permitted by SSML?
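For example (the selectors are chosen for illustration):

```css
h1 { voice-pitch: x-low } /* deep voice for top-level headings */
q { voice-pitch: high }   /* raise the pitch for quoted speech */
```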
Name: voice-pitch-range
Value: <number> | x-low | low | medium | high | x-high | inherit
Initial: implementation dependent
Applies to: all elements
Inherited: yes
Percentages: refer to inherited value
Media: speech
Specifies variation in average pitch. The perceived pitch of a human voice is determined by the fundamental frequency and typically has a value of 120Hz for a male voice and 210Hz for a female voice. Human languages are spoken with varying inflection and pitch; these variations convey additional meaning and emphasis. Thus, a highly animated voice, i.e., one that is heavily inflected, displays a high pitch range. This property specifies the range over which these variations occur, i.e., how much the fundamental frequency may deviate from the average pitch.
Values have the following meanings:
ISSUE: should we also allow for relative changes in terms of semitones as permitted by SSML?
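For example (the class names are invented for illustration):

```css
.legalese { voice-pitch-range: x-low }  /* flat, monotone delivery */
.dialogue { voice-pitch-range: x-high } /* highly animated, heavily inflected delivery */
```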
Name: voice-stress
Value: strong | moderate | none | reduced | inherit
Initial: moderate
Applies to: all elements
Inherited: yes
Percentages: N/A
Media: speech
Indicates the strength of emphasis to be applied. Emphasis is indicated using a combination of pitch change, timing changes, loudness, and other acoustic differences that vary from one language to the next.
Values have the following meanings:
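For example, emphasis markup might be mapped onto synthetic stress as follows (a non-normative sketch):

```css
em { voice-stress: moderate }
strong { voice-stress: strong }
.parenthetical { voice-stress: reduced }
```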
Name: voice-duration
Value: <time>
Initial: implementation dependent
Applies to: all elements
Inherited: no
Percentages: N/A
Media: speech
This property allows authors to specify how long it should take to render an element's content. It overrides the 'voice-rate' property. Values have the following meanings:
Specifies the desired time, in seconds or milliseconds, to take to speak the element's contents, for instance '250ms' or '3s'.
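For example (the class name is invented; the stated duration takes precedence over any inherited 'voice-rate'):

```css
.countdown-digit { voice-duration: 1s } /* each digit takes exactly one second to speak */
```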
Name: phonemes
Value: <string>
Initial: implementation dependent
Applies to: all elements
Inherited: no
Percentages: N/A
Media: speech
This allows authors to specify a phonetic pronunciation for the text contained by the corresponding element. The default alphabet for the pronunciation string is the International Phonetic Alphabet ("ipa"). The phonetic alphabet can be explicitly specified using the @phonetic-alphabet rule, for instance:
Example:
@phonetic-alphabet "ipa";
#tomato { phonemes: "tɒmɑtoʊ" }
This will direct the speech synthesizer to replace the default pronunciation by the corresponding sequence of phonemes in the designated alphabet.
Sometimes, authors will want to specify a mapping from the source text into another string prior to the application of the regular pronunciation rules. This may be used for uncommon acronyms which are unlikely to be recognized by the synthesizer. The 'content' property can be used to replace one string by another. In the following example, the acronym element is rendered using the content of the title attribute instead of the element's content:
Example:
acronym { content: attr(title) }
...
<acronym title="world wide web consortium">W3C</acronym>
This replaces the content of the selected element by the string "world wide web consortium".
Editor's note: The alphabet is specified via an at-rule to avoid problems with inappropriate cascades that could occur if the alphabet were set via a property.
The main changes have been to align the definitions with the latest version of SSML as it approaches W3C Recommendation status. This affects 'voice-volume', 'voice-rate', 'voice-pitch', 'voice-pitch-range', and 'voice-stress', where the enumerated logical values are now defined as monotonically non-decreasing sequences to match SSML. Named relative values such as 'louder' and 'softer' have been dropped, since they are not supported by SSML and cannot be related through percentage changes to the enumeration of logical values.
The cue- properties have been modified to allow the cue volume to be set independently or relative to that of synthetic speech.
The 'mark-before', 'mark-after', and 'mark' properties have been introduced to take advantage of SSML's mark feature, which is often used for rewinding to a marked point in an audio stream.
The interpret-as property has been temporarily dropped until the Voice Browser working group has further progressed work on the SSML <say-as> element.
TBD
The editors would like to thank the members of the W3C Voice Browser and Cascading Style Sheets working groups for their assistance in preparing this new draft. Special thanks to Ellen Eide (IBM) for her detailed comments.
TBD