The workshop brought together people from a broad variety of organizations working on voice technology. The presentations covered a diverse range of approaches and helped to reveal variations in requirements. The workshop participants were broadly split into two groups according to whether they felt voice interaction required its own markup language or whether HTML and style sheets could be adapted to meet this need. There was strong support for W3C to set up an Interest Group to study opportunities for joint work on Web standards for voice interaction.
Dave Raggett/W3C/HP (DR): W3C founded Oct 94 at MIT in Cambridge; INRIA joined the next year; in Aug 96 Keio became the 3rd host. Over 270 diverse members; a vendor-neutral forum for development. The Advisory Committee helps run W3C and meets semi-annually. Tim Berners-Lee, inventor of the Web, directs. Some W3C team members are visiting engineers; Dave Raggett is from HP. Work starts as Working Drafts, then Proposed Recommendation, then Recommendation. Development happens in Working Groups and Interest Groups.
DR: HTML was created by Tim at CERN. Very simple; enabled people to access common docs. Expanded on at NCSA. The HTML 2.0 standard represented the status as of mid-1994. In fall 1995 W3C brought the Web vendors around the table to consolidate the evolving standard into HTML 3.2, then eventually 4.0. Building in other features: accessibility, tables, forms, multimedia. Also Cascading Style Sheets. May 1998 workshop on the future of HTML: need to think beyond desktop systems. Now need to re-formulate HTML within XML. Browsers need to be able to re-purpose content. Work with performance profiles: what a particular device will have to do, in syntax and semantics. Content providers will be able to know what a device supports. The proposal is to have profiles written in RDF (W3C's framework for metadata). Also tied in with the development of Mobile Access, following a workshop in Japan earlier this year. Transformation tools allow content to be re-purposed to match device profiles; the source doesn't even have to be in HTML.
DR: What should W3C's role be in promoting this? Allow higher-quality speech synthesis and more accurate speech recognition. Need ways to combine different media more flexibly, and to ensure accessibility in doing so. Towards the end of the day we should evaluate whether to launch an ongoing Interest Group or Working Group.
Presentation: Mark Hakkinen (MH)/The Productivity Works (PW). Has taken a journey through accessibility and voice browsing. Looked at the requirements of voice browsing for people with visual disabilities, and the HTML requirements. Will go through telephone browsing & other carry-over applications. Several years ago was building an Internet-based information kiosk, exploring issues in Web access for the visually impaired. Took it as a challenge. The established technology for the visually impaired is the screen reader, which renders into speech or Braille output. PW looked at whether the visual display step could be bypassed.
MH: T.V. Raman, at Adobe, developed Emacspeak, which also looks at the underlying HTML: tables, forms. Look more at what is behind the scenes: what does the underlying HTML mean, what is it used for? W3C took on the Web Accessibility Initiative. Some successes in terms of table mark-up that gives more semantic information.
MH: Wanted the non-visual client to be an equal client to visual browsers. Also wanted to exploit ACSS (Aural CSS) for navigation. Defined a tag set specifying how elements would render into speech. Went through the first prototype of PWWebSpeak. Rudimentary: text & speech display were not well synchronized, which is problematic for people with learning disabilities. Later added support for some of the access features in HTML 4.0 such as titles, etc. Some issues with support of accesskey. The first support of voice browsing by phone using PWWebSpeak was pioneered by an organization in Japan in 1997.
MH: Also developed a protocol for the Digital Talking Book standard: hybrid full-text/full-digital-audio content. Can be device-independent: phone, CDs, over the Web. Interoperability testing last week went very well.
MH: Voice browsing is here to stay. What is missing from it so far? Form-filling: how do you know the intent of a form? Are there ways to pre-fill it without needing a dialog? Also, with a voice-browsing toolkit, where does voice browsing end and CTI app development begin? Also: all Web content needs to be authored so it is universally accessible: device-independent, client-independent.
MH: Demo of a voice browser. "Page is ready. Mark Hakkinen. Eight msgs. Want to read NYTimes?" MH is navigating by hand on a speakerphone; 50 different functions to navigate by hand. Has been working with several speech recognition engines to drive these over the phone.
MH: Demo of a SMIL player: a prototype talking-book player. Showing navigation by tree structure, by top-level headings, or by lower-level headings. In a SMIL presentation you can increase the time scale, to speed listening. Can run over telephone browsing; runs over a telephone system in Japan.
Tomasz Imielinski (TI)/Rutgers: Would like to hear more about the application interface.
MH: uses the WebSpeak engine. Can use different layers: a voice recognition layer on top of the underlying object. Skip to an element; navigate hyperlinks.
TI do the keys recognize touch-tone?
MH looking at a variety of interfaces, touch-tone among them.
Michael Wynblatt (MW)/Siemens: how much effort is involved in building a SMIL demonstration like the one shown?
MH: the authoring process is very straightforward. Begin with a document; import it into a recording-studio system. There are parts of the system that do the synchronization. Meant for someone who works in a talking-book recording studio to build these books.
DR have you tried this with streaming audio?
MH not yet tested that.
DR challenge is to predict the time needed to fill the buffer. Also, was that a skilled reader recording it?
MH not always, often untrained volunteer, kind of person who records in talking book studio.
DR if you don't have a pre-recorded audio, how well can you automate it?
MH also looking at speech synthesis in hand-helds.
JT (IBM): Great that the first two presentations deal with accessibility. People with visual disabilities have had to put up with information in visual two-dimensional space for too long. Had the first ? in the 1970s. The first screen reader, in the 1980s, was IBM's; since then the term has become generic. In the 1990s had a screen reader for OS/2. Now have Home Page Reader; Mark & his colleagues are leaders in that field. Blind readers put up with an awful lot: the process of trying to render all content with speech is difficult. Home Page Reader was released in Japan in 1997 by IBM Japan; it surprised the Special Needs organization in the States. Chieko Asakawa is the main designer; she was also involved in the translation of the OS/2 screen reader. Special Needs Systems in IBM Japan is doing Home Page Reader 2. It receives HTML from Netscape and uses IBM's ViaVoice Outloud. How about using a high-quality text-to-speech... the DEC synthesizer of 10 years ago produced about the same quality speech. Delighted with the quality of speech in ViaVoice. Uses the numeric keypad as the input device. Doesn't do anything with CSS, SMIL, DHTML, JavaScript. It's extremely difficult to present content to blind readers. It's designed by blind people for blind people.
JT: There's a stop key, a play key, a fast-forward key; and previous, current, next item. A current-link key. Can enter a URL. Can search on the page or on the net. Lots of settings. Variable speech rate: blind users are used to text-to-speech. ViaVoice goes 340 words per minute; the requirement is to go up to 700 words per minute. People learn to hear at very high rates. There's a page summary: number of titles, tables, forms, and items in length. A "where am I" key uses that context, a coarse way of conveying the structural information.
JT: The hardest part of access: people who browse visually filter a lot of information, including dynamic content; need to be able to match this. If your search engine has 22 matches... Don't have to listen to all the ads. The ACB (American Council of the Blind) Website gives you a "skip over navigation" option at the top of the page. Another feature is that links are spoken in a female voice... can be adjusted. You have to know what the links are as you read, in context; really believe that was key. One of the first requests from the beta was how to turn that off; can now change it to pings. Searching, skipping, jump keys. Fast-forward is a scanning strategy. Then lots of different jump keys; jump between structures.
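The "skip over navigation" idea mentioned above needs nothing beyond plain HTML; a minimal sketch (the anchor name here is invented for illustration):

    <a href="#main">Skip over navigation</a>
    <!-- ... repeated navigation links the listener can skip ... -->
    <a name="main"></a>
    <h1>Main content</h1>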
JT: Several other features. Still, much content is very boring to listen to despite the efforts of the WAI group. The mailer works well with access keys. The Home Page Reader menu provides menus for navigation. HTML 4.0 support: use of headers... decided to put in speaking the headers of table cells, but a problem in the implementation resulted in too much verbiage for some tables.
Andrew Forest, AT&T: can we hear a sample of ViaVoice?
JT: at a break, yes.
Rajeev Argawal, Texas Instruments: given the problem with pressing keys, have you experimented with speech recognition? You'd only need 50 words for commands.
JT: we will; haven't yet.
MI: am using Home Page Reader in Japanese, but the English speech quality is poor.
JT: the Pro-Talker speech engine works poorly in English. Need to be able to support different speech engines to get better-quality sound.
JB: hope that effective speech recognition combined with speech synthesis is available soon; it increases usability for the general population in eyes-busy & hands-busy situations, as well as for people with other disabilities, such as physical disabilities that prevent keyboard use.
[coffee break]
Presentation: Dave Raggett (DR)/W3C/HP. Will present the content of a W3C Note that DR & Or Ben-Natan published earlier. The bigger the market the better: if you can get voice browsers to access as much content as possible... Pouring content from a database. For dealing with today's content, it should be possible to use heuristic techniques. Many pages lack even obvious gestures towards accessibility; it is helpful to have, for instance, alt attributes for image maps.
DR: CSS2 has a number of features for positioning; these will enable phasing out the use of tables for layout. If the data model is more explicit, you can give a good browsing experience aurally. In next-generation HTML, will try to improve forms to support aural presentation. The Mobile Interest Group is also looking at issues involved in repurposing content. CSS2 has a range of features for speech: speech rate, voice family, pitch, spatial direction, volume, stress, richness, and pause and cue properties. These features compare well with the SABLE mark-up language for speech synthesis. Pronunciation hints from authors help. Options for handling selection errors. Prompting. Capability to support pronunciation dictionaries & to provide aural cues.
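For illustration, a minimal aural style sheet using the CSS2 speech properties just listed might look like the following sketch (selectors and values are chosen for illustration; the properties themselves are from the CSS2 specification):

    /* Illustrative aural style sheet using CSS2 properties */
    h1 { voice-family: male; speech-rate: slow; pitch: low; stress: 60 }
    a  { voice-family: female; azimuth: right; cue-before: url("ping.au") }
    em { pitch-range: 70; volume: loud }
    p  { speech-rate: 180; pause-after: 500ms }  /* 180 words per minute */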
DR: How do you give an attribute for a heading? There are new techniques in HTML 4 for this, but most of the content out there is not marked up that way. The selection concept is to add new events to HTML: OnSelectTimeOut, fired if you haven't selected something in a certain timeframe; OnSelectionError, which allows the author to specify options if the wrong option is selected.
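A sketch of how the proposed selection events might look attached to an HTML form control (the event names are from the proposal; the attribute syntax and the say() handler are invented for illustration):

    <!-- Hypothetical: OnSelectTimeOut / OnSelectionError are proposed, not standard HTML -->
    <select name="city"
            onselecttimeout="say('Please choose a destination city.')"
            onselectionerror="say('Sorry, that is not one of the choices.')">
      <option>Boston</option>
      <option>Tokyo</option>
    </select>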
DR: Other approaches include adding an attribute to HTML. The switch element from SMIL ("if you can't use this, use this instead") could be added directly to HTML. When you want to use speech recognition, a grammar is handy for the author; consider development of a speech grammar format for commands or selection acknowledgement.
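A sketch of how a SMIL-style switch might look if carried over into HTML (hypothetical; HTML has no switch element):

    <!-- The first child the browser can render is used; a voice
         browser would fall through to the spoken alternative -->
    <switch>
      <object data="sales-chart.png" type="image/png"></object>
      <p>Sales rose ten percent in the third quarter.</p>
    </switch>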
DR: If this group is interested, W3C can develop a briefing package for a formal activity in this area. An opportunity to drive the next generation of the Web.
Discussion:
___?: Your slide about scanning -- you said digital scanning was easy? Any comments about that?
DR: There is lots of practical experience about that in this room; didn't go into it in depth. Also should look at the DOM and scripting.
GW/General Magic: The Portico service gives voice access to an intelligent agent: business information, intelligent messaging, important public information, executive call handling, research assistant, investment manager, customized newspaper. Allows synchronization of information onto PDAs & makes it available while traveling etc. A toll-free dial-in service. Our architecture is... a bank of PCs, SR servers, text-to-speech servers... expanded view... a dish that downloads.
GW: Demo over the phone to Portico. Going through email; taking commands for playback of email. Then recorded a speech message. Uses an auditory desktop in place of a visual metaphor. Accepts barge-in (interruption). Checking on mail-forwarding capabilities. Also making an appointment. Personable interface. Requested a fax. Demoing call-back of old voice-mail messages. Going to the address book. Narrates menu options. Going to a stock report for General Magic.
GW: Uses it in the car driving to work; works in a noisy environment. This phone had a higher misrecognition rate than usual. The user needs an audio player. Web access is done behind the scenes to get news and stock quotes. Working on building in an explicit portal to the Web. Acquired NetPhonic within the past year. Looking at VoxML, which could make pages more browsable, scopable. Also doing customized interfaces inside General Magic.
Jim Colson, IBM: the slide said synchronize GUI & voice, but we didn't see a GUI in the demo.
GW: yes, we have a GUI at portico.com and we're synchronizing with that, so you can see as you browse visually.
John Burger, MITRE: what vocabulary size for the SR engine?
GW: It is vendor-independent. Perplexity: 15,000 stock quotes; we do it phonetically. Other points in achieving the highest quality of grammar recognition over the phone: a linguist types in phonetic strings w/ co-articulatory phenomena.
JBurger: more about dialog tree?
GW: built a huge dialog tree, w/ the help of linguists, scripters, Hollywood folks...
Rajeev/TI: does it have dynamic grammars? And context-sensitive help?
GW: Yes. Uses a concept of graceful help. Barge-in is a great help, but we want to track how often the user has to interrupt; two options are available. The novice option detects multiple looping and gives a different level of support. An expert level is also available.
JBrewer: What was going on in the email reply demo? That didn't sound like speech-to-text-to-speech; it sounded digitized.
GW: the reply becomes an audio file that is attached to the email message, and the user needs a RealAudio player to access it.
Presentation: David Stallard (DS)/BBN
DS VADAR: Telephone interface to military cargo shipments
DS EMALL Interface: Telephone interface to Defense Logistics Agency's EMALL Website.
DS Talk'n'Travel: phone interface to commercial air travel Websites. free-form language input allowed.
DS: Dialog System Architecture: the dialog manager is written in Java, except for the voice recognizer (Java is too slow). DIABOLIC dialog rule language: a scripting language for Dialog Manager objects. A rule elicits constraints on an attribute; the prompt provides the content of what to say; the action is what to do following assimilation of that constraint. Meaning is represented as a frame structure; language understanding is done based on simple patterns; language generation is also template-driven. Three stages: fetching, parsing, groveling (crawling through the tree structure to find the data that's needed).
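The DIABOLIC notation itself was not shown; purely as an illustration of the rule/prompt/action pattern just described, a rule might be pictured like this (invented XML-style syntax, not the actual language):

    <!-- Invented illustration of the rule/prompt/action pattern -->
    <rule attribute="destination">
      <prompt>What city are you flying to?</prompt>
      <action>lookup-flights(origin, destination)</action>
    </rule>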
DS poses the question: is HTML suitable for voice access? There are a lot of tags that aren't semantic and get in the way; you have to wade through those. You wind up doing more natural language understanding than you should have to, and the page layouts do not facilitate good voice access; you have to repackage the info a lot. Need to find specific info in a table.
DS: What would improve things? XML: having explicit markup of semantic content. The VoxML work at Motorola will help. Our language is more procedural, whereas VoxML is more declarative. Ours is more expressive, but probably harder to use; VoxML may help with client-side processing.
DS: Demo: dialing in; querying re a new TCN; spelling out; tracking a military shipment between air force bases. It goes out and gets the info from the Web, and then it checks to make sure the caller is really done.
Or Ben-Natan (OBN), Microsoft: Want to understand more about VoxML. Why the need to upgrade or replace HTML? You wanted a semantic description of the info to create an automatic dialog, to rephrase questions or present alternatives?
DS: allows reformulating the data; need to be able to get at that flight data.
OBN you hope to get that from VoxML?
DS VoxML would enable you to create scripts that have all the data in it.
DR: couldn't that be supplied by an agent interface? The W3C next-generation HTML work should help.
Michael Wynblatt/Siemens: talk about the trade-offs between this approach and going straight to a database through a phone call-in standard.
Brian Altman, Applied Technologies: meta tags from W3C, and other developments that relate to this: have you looked at any of that?
DS not yet, no.
Tomasz Imielinski/Rutgers: how domain dependent is this? What would be involved in porting it to another domain?
DS less than a month, we've done it before.
Presentation: Rajeev Argawal (RA), Texas Instruments: Voice Browsing the Web for information access.
RA: there will be an explosion in speech-based access systems & we need standards. Functional categories: (1) Web browsing with a speech-enabled interface; (2) limited information access: info from the Web, but designed for speech & scripted somehow; (3) spoken dialog systems: the system monitors everything the client does. Sometimes these categories overlap.
RA: Spoken dialog systems: (1) graph-based: represents the entire dialog interaction; mostly system-initiated. (2) frame-based: the info needed to complete a user query; mostly mixed-initiative; the user can specify slots. (3) plan-based: ... more portable.
RA Texas Instruments developments follow:
RA: Web Browser, Voice Browser: speakable links, bookmarks, browser commands, smart pages; no audio support.
RA: InfoPhone: get info on flights, stocks, weather; voice I/O only, not much display; customizable.
RA: Dialog Manager: the conversation between human & machine takes place at a domain-independent level; mixed-initiative, frame-based; either side can start a sub-dialog.
RA: Remote E-Mail, Voice-Mail: can filter, categorize, navigate email; will develop voice-mail send & receive.
RA: Voice navigation: maps & directions, businesses, etc., coupled with GPS on the client side.
RA: Design issues: need a better UI for voice I/O; portable dialog managers...; for wireless: ...?
RA: need additional handlers for errors: an OnRejectionError for when the SR engine doesn't have enough confidence; an OnHelpRequest; a "speech" media descriptor; maybe look at JSGF (Java Speech Grammar Format).
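For reference, a minimal grammar in the JSGF format mentioned above might look like this (the rule names are invented for illustration):

    #JSGF V1.0;
    grammar navigation;
    // Illustrative command grammar for a voice browser
    public <command> = next | previous | stop | help | read <item>;
    <item> = heading | link | paragraph;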
RA: It helps to have dynamic grammar capability. Extensions will be most effective for Web browsing and limited-information-access apps, not for merely Web-enabling an IVR (Interactive Voice Response) system. IVR systems are graph-based; merely speech-enabling that interface won't be very helpful: system-initiated prompting with the client just responding is frustrating.
RA Frame-based approaches enable people to just say what they want to say.
GW: Excellent presentation. A different perspective on IVR: consider the social aspect of dialogs. The sore point is that IVR takes control away from the user.
___?/Microsystems: most people don't even like using touch-tone systems; there are usability issues with touch-tone.
Ramesh: a fixed grammar does improve accuracy, but with links people may want to add words before or after; should have options such as free text. With LVCSR-based systems, will the words be transcribed automatically? How much flexibility to have?
RA: it's the dynamic grammars that improve the recognition. Smart pages have embedded grammars: the system knows what to expect you to say. For a generic page, there is a way to incorporate LVCSR (large-vocabulary continuous speech recognition): you can just go to the link.
[Break for lunch.]
Philipp Hoschka, W3C
Where should we be moving next?
We need heuristics; liked the link-density metric.
What's next? Hope there will be an Interest Group on voice browsers; we'll participate.
Diversity of opinion on the use of HTML.
HTML is usable for voice browsing, especially 4.0.
Need more information like this: what is the purpose of this element, etc.?
Our audience is non-visual people; don't think that VoxML addresses this.
Would like to see extensions of HTML 4.
There will be an alternative way to access the Web, and it will probably be voice.
Liked the VoxML proposal.
We also need agent controls.
Main concern: writing the HTML.
Hard enough to get people to do something simple like including alt attributes.
Not sure whether they want to learn another language,
or present content in two forms.
I think: use HTML, adapt it, extend it, rather than doing a separate language.
The perceived difficulties could be reduced if the Web designers doing practical work had a better understanding of how to use existing means in HTML to build better Web sites.
Interest in speech output.
There are a number of proposals to control the output of voice synthesizers: CSS, SABLE, etc. It is critical that there be at least one scheme; if there is more than one, we need standard tools for conversion. We don't need half a dozen schemes.
Researchers should be involved, not only the user community of speech synthesizers. Some properties can be quite contentious issues, e.g. pitch range. It's not really clear what the set of tags should be. The best deciders are the people working on the speech synthesizers; we need a lot of interaction with the people who do the technology.
need both extensions to HTML and a new language
W3C could be the forum to coordinate this.
Less change to HTML.
It's not enough to have our own language.
We have heuristics;
reduce ambiguity in the markup language so that heuristics work better.
Make it easier to do interactive presentations;
SMIL could help.
HTML is a de facto standard for defining information.
Heard a lot about HTML not being good enough for IVR;
not convinced.
HTML needs to be improved, but not much is needed.
Are we going to make a new standard for each new device?
Need a frame-based mechanism in the language: specify what kinds of slots a frame has; grammar-based slots.
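As a sketch of what such a frame-based mechanism might look like in markup (entirely hypothetical; no such element set exists in HTML):

    <!-- Hypothetical frame-with-slots markup; the grammar references
         bind each slot to what the recognizer should listen for -->
    <frame name="flight-query">
      <slot name="origin"      grammar="#city"/>
      <slot name="destination" grammar="#city"/>
      <slot name="date"        grammar="#date"
            prompt="When do you want to travel?"/>
    </frame>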
Surprised about the focus on telephone access. In our product, HTML is a good match; we don't want to get rid of the screen for the voice-browsing application. Voice dialog over the telephone without a screen is a challenge.
Whether HTML or a new language: there are Web-aware users and Web-unaware users. The first group knows the Web site and wants to go to the third page. For the Web-unaware (former IVR users), HTML extensions may not be appropriate.
Problem with HTML: my language could be defined as a subset of HTML. My problem: I cannot imagine designing for the audio user and the visual user at the same time; completely different designs.
Only against HTML when talking about Web-unaware users; it makes a lot of sense for handicapped users.
support special interest group
universal design explanation
explanation of WAI
www.w3.org/WAI
brewer@w3.org
Use a universal-design approach to repurpose content for different uses; this may apply to some of the questions discussed today.
Agree with Ken: people won't author content several times.
Dave Raggett: any questions? What should future work address?
Windblad, CNET: on the link density metric and other metrics: we have a paper in IEEE Multimedia.
New language or not: both may be needed. Question: when do you need what? In which context?
Jim Close (?): browsing is not the goal: an important point. Think about the kind of content we're trying to render. The notion of separation of content and presentation makes sense. A new markup language for a new device: yes, see the WAP Forum, so it seems needed. Deliver content; decide about rendering once you know the device capabilities. "Voice browser" seems to cover a lot of different devices. Not sure we are going to find one answer (opinion on the previous questions).
Lucent: a number of people have made a distinction between browsing and IVR; the mechanisms are very similar. Repurposing content: it depends on how hard repurposing the content is. Still a lot of research needs to be done; there is a big advantage in repurposing.
Rutgers: looked at HTML, HTML with CSS, HDML, ...; analyzed the pros and cons. There are compelling arguments for why a different language is necessary in HDML; if these are true, then voice access needs its own markup language even more. A lot of the work presented today didn't target the phone but accessibility, hence the natural orientation to repurpose HTML content. Europe is far ahead of us in the installation of mobile devices; see a fusion of small devices with speech output and screen. There seems to be agreement that a separate language for mobile is needed, see HDML. Why do small devices get their own language, but voice doesn't?
??: the HDML goal was to optimize for low bandwidth (HDML is the basis of WML).
Or Ben-Natan (Microsoft): HDML means a proliferation of standards; you will have five different languages for different devices. Need one language for all devices; don't start a new language every time.
AT&T: when to use HTML, when something else: HTML is an excellent choice for talking books, with SMIL extensions. HTML does not make sense in a transaction-based system. Example: the UPS tracking Website gives no clue on how to create a voice-based site derived from the Website. Imagine reaching this page from an 800 number: there is almost no overlap; completely different designs. Don't believe HTML is a solution for this purpose.
Mark (response): not all people using talking books are experienced users; the elderly and young children use them. An HTML page is useful to a non-Web-aware user if it is well designed. A lot of the extra content on UPS is just bad authoring. What about advertising and voice browsing? We just skip it.
Lucent: distinction between ...: tried using HTML. Could do rendering of the information, but couldn't do what we thought was best. Moved to a dialogue-driven approach; provided better results/a better experience for users. Capitalize on the capabilities of specific user interfaces. Wonder whether RDF/profiling won't bloat the whole system and make it too complicated. ...
Rajeev: HTML or different kinds of system: ... client-server solution ... The Web page designer designed a forms interface; he can leave it at that when going to a voice browser, unless he gets more money to do a voice version.
Agree with Or: just add something to HTML, so that the author can gradually improve the Web page for voice; avoid a different language.
Gragget Systems (?): display size: the difference between DTMF and speech input. Speech input allows many more than 12 choices at a time; you don't have to go into a directed graph. Shouldn't use extensions that force you to use the old way of doing things.
Sun: five info representations: HTML has substantial limitations, but pragmatism says it will be used. CSS: ... grammars need to be provided to the speech synthesizer; more than extracting links is needed; happy to provide a spec. VoxML, ...: three speech-oriented markups. How do you minimize effort for Web designers? Took that to heart. Idea: the same way as CSS/style sheets, provide the navigation info in a separate part: a navigation document; break the HTML document down into navigation parts.
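A sketch of the separate navigation document idea (invented syntax, in the spirit of a style sheet living alongside the HTML page):

    <!-- Hypothetical navigation document accompanying catalog.html -->
    <navigation for="catalog.html">
      <section ref="#products" say="Product list"/>
      <section ref="#search"   say="Search form" grammar="#search-words"/>
      <skip    ref="#ads"/>
    </navigation>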
Ken: we can learn from IVR. ... If you make it a different language, the rest of the world doesn't have access to things that could be useful ...
John Burger, MITRE: participated in the HTML WG. The next version of HTML will likely be a number of XML modules, decomposing HTML: core HTML, tables, forms. Add an additional module, not part of HTML proper; the hard division between HTML and a completely different markup language disappears. Add embellishments using external modules; this should allow much easier repurposing of HTML. Check out the current HTML WG; if you think it's appropriate, please join, we may need the expertise.
Rutgers: does one Web page correspond to one audio site? It's a many-to-many mapping. A language may be able to solve it, but I'm unsure.
Or Ben-Natan: HTML in several subparts: as long as we don't allow overlapping, a great idea. Even if you special-purpose design an audio site, you can still do that in HTML, even if it doesn't give the best visual experience.
Ken Rehor (?), Lucent: PML had success; Or had success. Whether it's extensions or a new language depends on how complicated things get. When we started PML: how can we make access easier for phone users? How can we make creating phone services easier for Web designers? A certain percentage of phone services can be built with HTML. Talked to IVR people ... even General Magic is similar to IVR. Goal: make it easier to build these services. IVR people should be able to do more complicated things; there is a lot of bloat in the language when using HTML only. You can also do some things only in HTML.
??: (General Magic??): the claim that a phone without a display is an anachronism: the numbers don't support this: 800 million phones vs. not many PCs. New on the Web: ... What's different about a GUI: a voice interface requires dialogues because you don't have point-and-click. Didn't hear much discussion of what constitutes a dialogue; we may need a standard for this.
Ramesh Sarukkai: IVR is difficult, tedious: why not adopt a keyword strategy for accessing Web content? Keyword-structuring of Web pages; a dialogue-oriented structure for form filling. Focus on modifications of HTML for speech recognizers: grammars. Cues to the speech recognizer would be good to integrate.
Lucent: advocating less of a new language. The argument is that you need a new language for high-quality performance; not convinced that we've taken enough advantage of what we already have. Should try that first.
Steve Lilley: many of you used HTML for your presentations, not PowerPoint; not so great a display. The same is true for designing a speech app: it is far beyond the capabilities of HTML. You can do something simple, but not too fancy. There are still many applications that HTML can be used for.
Separate the two in the Interest Group: access to the Web, and sophisticated dialogue-based systems.
lucent: need a more natural two-way conversation with systems
Dave: need help with putting together a briefing package.
??: a shameless plug for speech recognition. When you have only audio access, you need to do the best with what you have; the opposite holds for speech recognition.
Who would like to be involved in an IG?
A fair number (80% of about 30 people).
Will set up a mailing list; use this to clarify some of the ideas from this session.
Ken: identify the specific needs, to check whether they should be extensions or a new language: a laundry list.
Lucent: what can't be done as an extension of an existing language? Make a list.
Or Ben-Natan: need a guideline: at which point is HTML not enough? I haven't gotten there yet. Want to solve most of the problems for most of the people, not everything for everyone; there's always C++ for applications.
dave: speech grammar formats needed
Rutgers: the next version of HTML will have XML blocks, so the problem is solved.
Dave: comment on frames versus graphs: can you say more?
Rajeev: graph-based systems always have the limitation that you need to specify the whole state-transition diagram; if something new happens, you don't know how to handle it.
Repurposing content: need engines to convert between old and new content (??).
Markup that you use for synthesis can also be used for recognition.