06:45:28 RRSAgent has joined #voice
06:45:28 logging to https://www.w3.org/2021/10/18-voice-irc
06:45:31 RRSAgent, stay
06:45:34 RRSAgent, make log public
06:45:50 RRSAgent, this meeting spans midnight
14:28:54 kaz has joined #voice
23:45:09 kaz has joined #voice
23:45:45 Zakim has joined #voice
23:45:51 rrsagent, bye
23:45:51 I see no action items
23:45:56 RRSAgent has joined #voice
23:45:56 logging to https://www.w3.org/2021/10/18-voice-irc
23:49:11 meeting: Next Directions for Voice and the Web Breakout
23:55:08 takio has joined #voice
23:56:08 Ben has joined #voice
23:59:16 present+ Kaz_Ashimura__W3C, Bev_Corwin, Francis_Storr, Jennie_Delisi, Masakazu_Kitahara, Takio_Yamaoka__Yahoo_Japan
23:59:30 Jennie has joined #voice
23:59:30 present+ Makoto_Murata__DAISY
23:59:54 present+ Muhammad, Sam_Kanta, Tomoaki_Mizushima__IRI
00:00:01 MURATA_ has joined #voice
00:00:01 zakim, who is on the call?
00:00:02 Present: Kaz_Ashimura__W3C, Bev_Corwin, Francis_Storr, Jennie_Delisi, Masakazu_Kitahara, Takio_Yamaoka__Yahoo_Japan, Makoto_Murata__DAISY, Muhammad, Sam_Kanta,
00:00:04 ... Tomoaki_Mizushima__IRI
00:00:44 BC has joined #voice
00:00:49 Hello
00:01:12 kirkwood has joined #voice
00:01:13 MasakazuKitahara has joined #voice
00:01:17 present+
00:01:22 present+
00:01:29 fantasai has joined #voice
00:01:44 present+
00:02:13 present+
00:03:09 scribenick: fantasai
00:03:14 kaz: Thanks for joining this breakout session
00:03:31 kaz: This is a breakout session on new directions for Voice and Web
00:03:39 kaz: There was a breakout panel during the AC meeting
00:03:51 kaz: discussion about how to improve web speech capabilities in general
00:04:19 kaz: There were several breakout sessions previously (previous TPAC??)
00:04:27 kaz: We want to summarize the situation and figure out how to improve
00:04:37 -> https://www.w3.org/2021/Talks/1018-voice-dd-ka/20211018-voice-breakout-dd-ka.pdf slides
00:04:52 kaz: First, reviewing existing standards and requirements for voice and web
00:05:05 kaz: Then would like to look into the issue of interop among voice agents
00:05:13 kaz: Then think about a potential voice workshop
00:05:29 kaz: If you have any questions please raise your hand in the Zoom chat, or type q+ on IRC
00:05:58 [slide 2]
00:06:07 kaz: Existing mechanisms for speech interfaces
00:06:16 kaz: We used to have markup languages like VoiceXML and SSML
00:06:28 kaz: There was also the CSS Speech module
00:06:38 kaz: And the Web Speech API
00:06:48 kaz: Lastly there's the Specification for Spoken Presentation in HTML (a WD)
00:07:01 kaz: The most popular one is the Web Speech API, but it is not a W3C REC, only a CG report
00:07:04 kaz: so that's a question
00:07:06 [slide 3]
00:07:18 kaz: Voice agents are getting more and more popular, and very useful
00:08:13 ddahl has joined #voice
00:09:05 kaz: Need improved voice agents
00:09:15 Tomoaki_Mizushima has joined #voice
00:09:30 [slide 4]
00:09:35 kaz: Interoperability of voice agents
00:09:40 kaz: some are local voice agents, others on the cloud side
00:09:47 kaz: most are proprietary, and not based on actual standards
00:09:55 kaz: the Web Speech API is very convenient but not a standard yet
00:10:01 kaz: Desktop and mobile apps, various implementations
00:10:08 kaz: how can we get them to interoperate with each other?
00:10:15 kaz: Do we need some standards-based infrastructure?
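[Illustrative sketch: a minimal example of the Web Speech API mentioned above, assuming a browser that exposes speechSynthesis and a (possibly prefixed) SpeechRecognition constructor, e.g. a Chromium-based engine; behavior is implementation-defined since the spec is only a CG report.]

```typescript
// Speech synthesis: widely available via window.speechSynthesis.
const utterance = new SpeechSynthesisUtterance("Welcome to the voice breakout session.");
utterance.lang = "en-US";
utterance.rate = 1.0; // speaking rate, 1 is the default
window.speechSynthesis.speak(utterance);

// Speech recognition: often only exposed behind a vendor prefix,
// which is part of the interoperability problem discussed here.
const SpeechRecognitionCtor =
  (window as any).SpeechRecognition ?? (window as any).webkitSpeechRecognition;
if (SpeechRecognitionCtor) {
  const recognition = new SpeechRecognitionCtor();
  recognition.lang = "en-US";
  recognition.onresult = (event: any) => {
    // The first alternative of the first result is the engine's best transcript.
    console.log("Heard:", event.results[0][0].transcript);
  };
  recognition.start();
}
```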
00:10:26 kaz: The Voice Interaction CG chaired by David has been working on interop issues
00:10:31 kaz: will meet next week during TPAC
00:10:55 [slide 5]
00:11:44 s/David/ddahl/
00:12:00 ddahl: Our CG has been working on voice and web, focusing on interop among intelligent personal assistants right now
00:12:14 ddahl: We've noticed that these assistants (like Siri, Cortana, Alexa, etc.)
00:12:22 ddahl: they really have a lot in common in terms of what they are useful for
00:12:40 i|slide 5|-> https://www.w3.org/community/voiceinteraction/ Voice Interaction CG|
00:12:41 ddahl: Like a web page, their goal is to help users find info, learn things, be entertained, and also get intelligent personal assistance
00:12:56 ddahl: They communicate with servers on the internet, which contribute functionality in service of their goals
00:13:18 ddahl: The two types of interaction are different because a web page is primarily a graphical UI and a PA is primarily voice interaction
00:13:27 s/PA/IPA/
00:13:34 ddahl: But there are some arbitrary differences also
00:13:46 ddahl: a web page is rendered in a browser; an IPA in a proprietary platform
00:13:57 ddahl: but that's an arbitrary architectural difference that devs of IPAs have chosen to use
00:14:02 ddahl: web pages run in any browser
00:14:08 ddahl: but IPAs only run on their own platform
00:14:17 ddahl: If you have an Amazon function, it can't run on the Web, it can't run on your phone
00:14:25 ddahl: it runs only on its own proprietary smart speaker
00:14:38 ddahl: similarly, web pages are found through the very familiar URL mechanism or a search engine
00:14:51 ddahl: an IPA is found through its proprietary platform, however that platform chooses to make it available
00:15:01 ddahl: So finding functionality is purely proprietary
00:15:13 [next slide]
00:15:25 slide depicts diagram of IPA architecture
00:15:34 ddahl: Focus on the three major boxes
00:15:42 ddahl: The first box is the data capture part of the functionality
00:15:49 ddahl: In the case of an IPA, we most typically want to capture speech
00:15:55 ddahl: compared to a web page, where we're capturing user input
00:16:00 s/next slide/slide 6/
00:16:17 ddahl: the function in the middle basically does the intelligent parts of the processing
00:16:24 ddahl: This is analogous to a browser
00:16:32 ddahl: On the right we have the connection to other functionalities
00:16:39 ddahl: other IPAs or other web sites
00:16:44 ddahl: found through a search engine, DNS, or a combination
00:16:58 ddahl: In the rightmost part of this box we find other functionalities
00:17:09 ddahl: e.g. the websites themselves, or in the case of an IPA some other IPA
00:17:14 ddahl: For example, looking for a shopping site
00:17:21 ddahl: we want to find it interoperably from the UI
00:17:25 ddahl: That's the architecture that we're looking at
00:17:28 ddahl: it seems parallel to the Web
00:17:34 ddahl: we'd like to be able to make those alignments possible
00:17:45 ddahl: and use as much of the existing Web infrastructure as possible for IPAs to be interoperable
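[Illustrative sketch: a rough TypeScript rendering of the three boxes ddahl describes on slide 6 — input capture, the intelligent processing in the middle, and discovery of external functionality; the interface names are hypothetical and not anything the Voice Interaction CG has standardized.]

```typescript
// Hypothetical interfaces mirroring the three major boxes of the IPA architecture.

interface InputCapture {
  // e.g. speech for an IPA, or text/pointer events for a web page
  capture(): Promise<string>;
}

interface DialogManager {
  // analogous to the browser: interprets the utterance and decides what to do
  interpret(utterance: string): Promise<{ intent: string; slots: Record<string, string> }>;
}

interface FunctionalityRegistry {
  // analogous to DNS plus search engines: finds a provider for an intent,
  // instead of each proprietary platform doing this its own way
  resolve(intent: string): Promise<URL | null>;
}

async function handleTurn(input: InputCapture, dm: DialogManager, registry: FunctionalityRegistry) {
  const utterance = await input.capture();
  const { intent } = await dm.interpret(utterance);
  const provider = await registry.resolve(intent);
  console.log(provider ? `Dispatching "${intent}" to ${provider}` : `No provider found for "${intent}"`);
}
```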
00:17:54 [next slide]
00:18:05 kaz: There are many issues emerging these days
00:18:29 kaz: So we'd like to organize a dedicated W3C workshop to summarize the current situation, the pain points, and discuss how we could solve and improve the situation
00:18:37 kaz: by providing e.g. a forum for joint discussion by related stakeholders
00:18:46 kaz: I've created a dedicated GH issue in the strategy repo
00:18:56 -> https://github.com/w3c/strategy/issues/221
00:19:07 kaz: Please join the workshop and give your thoughts, pain points, solitions
00:19:12 s/solition/solutions/
00:19:20 kaz: Any questions, comments?
00:19:21 s/next slide/slide 7
00:19:35 Sam has joined #voice
00:19:53 kaz: Murata-san, you were very interested in a11y in general and also the interaction of ruby and speech
00:19:59 kaz: interested in this workshop?
00:20:14 MURATA_: Yes, interested, and wondering what are the existing obstacles to existing specifications?
00:20:19 MURATA_: Why are they not widely used?
00:20:26 kaz: There are various approaches to this
00:20:37 kaz: e.g. the markup-based approach like VoiceXML/SSML
00:20:41 kaz: and the CSS-based approach
00:20:44 kaz: and the JS-based approach
00:20:57 kaz: So we should think about how to integrate all these mechanisms into a common speech platform
00:21:12 kaz: and have content authors and applications able to use various features for controlling speech freely and nicely
00:21:21 kaz: that kind of integration should be one discussion point for the workshop as well
00:21:38 kaz: You have been working on text information. Part of this, the pronunciation specification, should also be included
00:21:41 MURATA_: yes
00:21:55 kaz: any other questions/comments/opinions/ideas?
00:22:01 MURATA_: Let me report one thing about EPUB
00:22:03 q?
00:22:08 MURATA_: EPUB3 has included SSML and PLS
00:22:15 MURATA_: But now EPUB3 is heading for Recommendation
00:22:28 MURATA_: and some in the WG don't want to include features that are not widely implemented
00:22:41 MURATA_: so the WG decided to move SSML and PLS to a separate note, which is maintained by the EPUB WG
00:22:49 MURATA_: But that spec is detached from mainstream EPUB
00:22:55 MURATA_: Not intended to be a Recommendation in the near future
00:23:03 MURATA_: On the other hand, I know some Japanese companies use SSML and PLS
00:23:09 q?
00:23:11 MURATA_: One company uses PLS and a few use SSML
00:23:22 MURATA_: In particular, the biggest textbook publisher in Japan uses SSML
00:23:42 MURATA_: And I hear the cost of an ebook is 3-4 times higher if you try to really incorporate SSML and make everything natural
00:23:59 MURATA_: For textbooks, wrong pronunciation is very problematic, especially for new language learners
00:24:06 MURATA_: It is therefore worth the cost for these cases
00:24:15 MURATA_: But it is not cost-effective for broader materials
00:24:26 MURATA_: So the SSML-based approach can't scale
00:24:31 MURATA_: But I'm more optimistic about PLS
00:24:39 MURATA_: In Japanese manga and novels, character names are unreadable
00:24:46 MURATA_: If you use PLS you only have to describe each name once
00:24:59 MURATA_: Dragon Slayer is very common, but doesn't read well using text to speech
00:25:03 MURATA_: I'm hoping that PLS would make things better
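[Illustrative sketch: the kind of SSML 1.1 and PLS 1.0 markup MURATA_ describes, built as strings in TypeScript; the lexicon URL and the sample entries are made-up examples, and whether a given TTS engine accepts SSML or external PLS lexicons varies by implementation.]

```typescript
// A PLS 1.0 lexicon: each difficult name or term is described once and can then
// be reused across a whole book (MURATA_'s point about character names).
const pls = `<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0"
         xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
         alphabet="ipa" xml:lang="en">
  <lexeme>
    <grapheme>W3C</grapheme>
    <alias>World Wide Web Consortium</alias> <!-- read the abbreviation as its expansion -->
  </lexeme>
</lexicon>`;

// SSML 1.1 referencing that lexicon. Inline overrides like <phoneme> are what
// makes "making everything natural" so labor-intensive for a whole textbook.
const ssml = `<?xml version="1.0" encoding="UTF-8"?>
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en">
  <lexicon uri="https://example.org/names.pls"/> <!-- hypothetical lexicon URL -->
  <p>The breakout was organized with help from the W3C.</p>
  <p>This <phoneme alphabet="ipa" ph="təˈmɑːtoʊ">tomato</phoneme> is spoken as the author intends.</p>
</speak>`;
```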
00:25:20 kaz: As former Team contact for the Voice group, I love SSML 1.1 and PLS 1.0
00:25:31 kaz: I would like to see the potential for improving those specifications further
00:25:45 kaz: Also, there's a possibility that we might want an even newer mechanism to achieve the requirements
00:26:00 kaz: For example, Léonie mentioned during the AC meeting that it is maybe a good time to re-start speech work in W3C
00:26:06 kaz: Personally I would like to say Yes!
00:26:15 kaz: So I think a workshop would be a good starting point for that direction
00:26:25 kaz: Any other viewpoints?
00:26:28 q?
00:26:45 q+ ddahl
00:26:46 ddahl: I want to say something about why these things were not implemented in browsers
00:26:47 ack d
00:26:56 ddahl: Since those early specifications, the technology has gotten much stronger
00:27:05 ddahl: previously, speech recognition did not work well
00:27:11 ddahl: now text to speech works much better also
00:27:25 ddahl: So I think much of this was marginalized; it didn't work well, so people wouldn't use it
00:27:31 ddahl: it was considered to not have anything to do with the Web
00:27:37 ddahl: but now the tech is far better than it was at the time
00:27:43 ddahl: It really does make sense to look at how it is used in the browser
00:27:51 q?
00:27:56 ack f
00:28:12 fantasai: CSS and PLS seem to be very different
00:28:17 ... CSS is about styling
00:28:25 ... not closely tied with each other
00:28:51 ... you definitely can't have only the CSS speech module, but could use it to extend what exists
00:29:02 ... cue sound, etc.
00:29:06 ... shifting volume, etc.
00:29:26 s/sound/sound, pauses/
00:29:29 ... can't change spoken pronunciation itself
00:29:33 q?
00:29:47 ... maybe we need new technology
00:29:55 ... what is missing for that
00:30:00 s/maybe we need/you said maybe we need/
00:30:11 s/for that/that we need to create technology for?
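[Illustrative sketch: the kind of aural styling fantasai refers to, using properties from the CSS Speech module (cues, pauses, volume); support in shipping browsers is essentially absent, and as noted it cannot rewrite the pronunciation of the text itself. Shown here as a stylesheet injected from TypeScript to keep one example language.]

```typescript
// CSS Speech properties style *how* content is spoken, not what the words sound like,
// which is why PLS/SSML-style pronunciation mechanisms are still needed.
const speechStyles = `
  h2 {
    cue-before: url("chime.wav");  /* audio cue played before the heading */
    pause-after: 600ms;            /* pause after speaking the heading */
  }
  .aside {
    voice-volume: soft;            /* lower volume for asides */
    voice-family: female;
  }
  .skip {
    speak: never;                  /* do not speak this element at all */
  }
`;

const style = document.createElement("style");
style.textContent = speechStyles;
document.head.appendChild(style);
```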
00:30:31 kaz: I was thinking about how to integrate various modalities
00:30:35 kaz: that are not interoperable currently
00:30:48 kaz: also how to implement dialog processing for interactive services
00:30:54 kaz: and possible integration with IoT services
00:31:20 kaz: so like 2001: A Space Odyssey, asking for voice as a key for opening the door
00:31:29 q+
00:31:30 kaz: maybe because I'm working on WoT and Smart Cities as well
00:31:43 kaz: my dream is to apply voice technology as part of user interfaces for IoT and smart cities
00:31:52 q?
00:32:36 ????: I have a lot of opinions on what's needed. Used a voice interface for 20+ years
00:32:44 ????: I had to use totally hands free for 3 yrs
00:32:45 s/????:/kim:/
00:32:47 s/????:/kim:/
00:32:56 kim: Now I also use a Wacom tablet
00:33:04 kim: Speech is not really well integrated with other forms of input
00:33:18 kim: If speech was well implemented, many people would use it a little bit. A few people would use it for everything.
00:33:22 kim: There's so much that is not there
00:33:31 kim: You were talking about it being siloed, and that's one of the problems
00:33:39 kim: for example, when you have keyboard shortcuts
00:33:45 kim: Sometimes you can change it, and that's great
00:34:02 kim: But you can only link them to letters now. Would be great to integrate with speech
00:34:06 q?
00:34:07 q+ to talk about chatbots on websites
00:34:12 q?
00:34:14 kim: Instead of thinking of it as another input method, how do you put it alongside?
00:34:25 kim: It should be something with good defaults that works alongside everything else
00:34:32 kim: Getting there more with Siri etc.
00:34:41 kim: If you say "search the web for green apples" it's faster than typing
00:34:50 +1 to Kim Patch - would also see a need for sounds/vocal melodies. Some cannot articulate clear words but can make a melody.
00:34:52 kim: but big gaps, I think because of the underlying technology
00:35:00 kim: But I think speech has a ton of potential
00:35:05 kim: I can show some of it using custom stuff
00:35:10 kim: that really has not been realized
00:35:16 kim: But it's also used some places where it shouldn't be used
00:35:26 kim: Send is a really bad one-word speech command!
00:35:34 kim: I see a lot of stuff being implemented that is not well thought through
00:35:43 kim: It's too bad that more of us don't use a little bit of speech
00:35:55 * kaz sure
00:35:57 kim: Also some problems like e.g. the need to have a good microphone
00:36:08 kim: Engines are getting better, but you have to make sure you didn't record something totally off the wall
00:36:16 q?
00:36:24 ack t
00:36:30 takio: Thanks for the presentation today
00:36:30 q+ Jennie
00:36:40 takio: I'm new around here, not sure about this specification
00:36:50 takio: but I'm concerned about emotional things (?)
00:36:59 takio: e.g. if ...
00:37:09 takio: If laughing or angry, this may be dropped
00:37:24 takio: So I'm concerned about these specifications, whether they take care of emotional expression
00:37:30 takio: Also asking about intermediate formats
00:37:33 takio: e.g. ...
00:37:40 takio: e.g. emotional info is important for that person
00:38:08 kaz: For example, some telecom companies or research companies have been working on extracting emotion info from speech
00:38:18 kaz: and trying to deal with that information once we've extracted some of it
00:38:18 -> https://www.w3.org/TR/emotionml/ EmotionML
00:38:30 kaz: There is a dedicated specification to describe emotional information, named EmotionML
00:38:41 kaz: As Debbie also mentioned, speech tech has improved a lot in the last 10 years
00:38:49 q?
00:38:54 ack d
00:38:54 ddahl, you wanted to talk about chatbots on websites
00:38:55 kaz: We might want to also rethink EmotionML
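[Illustrative sketch: a small EmotionML 1.0 fragment of the kind kaz points to, wrapped in a TypeScript string; the "big6" category set is one of the vocabularies published alongside EmotionML, while the referenced recording URL is a made-up example. How such annotations would travel with speech in an intermediate format, as takio asks, is exactly the open question.]

```typescript
// An EmotionML 1.0 annotation attached to a stretch of recognized speech.
// The "big6" set covers anger, disgust, fear, happiness, sadness, surprise.
const emotionAnnotation = `<?xml version="1.0" encoding="UTF-8"?>
<emotionml version="1.0"
           xmlns="http://www.w3.org/2009/10/emotionml"
           category-set="http://www.w3.org/TR/emotion-voc/xml#big6">
  <emotion>
    <category name="happiness" confidence="0.8"/>
    <!-- hypothetical recording, with a media fragment selecting the utterance -->
    <reference uri="https://example.org/recording.wav#t=2.5,4.0"/>
  </emotion>
</emotionml>`;
```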
00:39:04 ddahl: Something I've been noticing about websites recently
00:39:13 ddahl: complex websites especially tend to have a chatbot
00:39:23 ddahl: Seems like a failure of the website, that users can't find the information they're looking for
00:39:31 ddahl: so they add a chatbot to help find information quickly
00:39:40 ddahl: A very interesting characteristic of voice is that it is semantic
00:39:49 ddahl: It doesn't require the same kind of navigation that you need in a complex website
00:39:55 ddahl: theoretically you ask for what you want and you go there
00:40:06 ddahl: chatbots are normally not voice-enabled, but they are natural-language enabled
00:40:17 ddahl: and that's an area where we can have some synergy between traditional websites and voice interaction
00:40:22 kaz: That's a good use case
00:40:32 kaz: Reminds me of my recent TV
00:40:39 kaz: It has great capabilities, but there are so many menus
00:40:51 kaz: I'm not really sure how to use all these given the complicated menus
00:40:59 kaz: but it has speech recognition, so I can simply talk to that TV
00:41:04 kaz: "I'd like to watch Dragon Slayer"
00:41:20 ddahl: That's an amazing use case, because traditionally TVs and DVRs were held up as examples of poor user interfaces
00:41:31 ddahl: Too difficult to even set the time, without lots of struggle
00:41:44 ddahl: So we need to think about how to cut through layers of menus and navigation with voice and natural language
00:41:56 kaz: These days even TV devices use a web interface for their UI
00:42:03 kaz: the TV menu is a kind of web application
00:42:12 kaz: that implies a speech interface is a good solution
00:42:18 q?
00:42:26 ack j
00:42:38 kim_patch has joined #voice
00:42:41 Jennie: I thought Kim's point about redirecting keyboard shortcuts is excellent
00:42:51 Jennie: Can see a use case for ppl who use speech but have limited use of vocalization
00:43:02 Jennie: If there was a way to program it using a melodic phrase instead of a keyboard shortcut
00:43:10 Jennie: similar to a physical gesture on a mobile device
00:43:26 Jennie: Would be helpful for ppl who are limited, to control devices
00:43:34 Jennie: Using a shortcut or shorthand of a melodic phrase
00:43:42 Jennie: for ppl who are hospitalized or have limited mobility
00:43:50 q+ kim
00:43:53 ack kim
00:44:09 Kim: In early days ...
00:44:20 Kim: But one thing that worked really well was blowing to close the window
00:44:31 Kim: 5-6 years ago someone was experimenting with that in an engine
00:44:48 Kim: I think it would work well both for folks who have difficulty vocalizing, and would be neat for other people as well
00:44:52 Kim: but would have to be easy to do
00:45:25 ddahl: Needs to be easy to do, but would be interesting to adapt
00:45:36 s/ddahl:/Jennie:/
00:45:42 Kim: 10 yrs ago I was working with ppl who are gesture specialists, and trying to get a grant for combined speech + gesture
00:45:57 +1 to Kim P!
00:45:59 Kim: A couple of gestures, a couple of sounds, would add a lot to many use cases
00:46:11 Kim: True mixed input
00:46:20 q+
00:46:22 q?
00:46:25 ack d
00:46:45 ddahl: That was an interesting point about gestures, reminded me of the recently published requirements for natural language interfaces
00:47:04 ddahl: They mentioned sign language interpretation in natural language interfaces
00:47:08 ddahl: that is obviously gesture based
00:47:17 ddahl: research world
00:47:23 ddahl: but thinking about gesture-based input
00:47:27 ddahl: could be personal gestures
00:47:33 ddahl: or formal language gestures, like sign language
00:47:38 ddahl: but that would help a lot of people
00:47:56 q+
00:47:56 Kim: With mixed input, you can do multiple inputs at the same time that don't have to be aware of each other
00:48:06 Kim: When pointing, the computer knows where you're pointing
00:48:09 Kim: Hard for computer
00:48:23 Kim: The computer doesn't have to be aware of this
00:48:28 s/Hard for computer/.../
00:48:29 -> https://www.w3.org/TR/2021/WD-naur-20211012/ Natural Language Interface Accessibility User Requirements
00:48:32 q?
00:48:35 ack j
00:48:45 Jennie: One of the other questions I had, since I'm not as familiar with the specs
00:48:52 Jennie: for touchscreen devices and computers
00:49:09 Jennie: we have ways to control for tremors or repeated actions to choose the right one to respond to
00:49:32 Jennie: Do we have any consideration for that in voice, e.g. stuttering, to control which sounds the voice assistant would listen to?
00:50:00 Afraid I don't, sorry!
00:50:28 ddahl: I don't know of anything like that. Would be very useful
00:50:38 ddahl: Probably some research, especially for stuttering, because it's a very common problem
00:50:43 ddahl: but still in the research world right now
00:50:57 Kim: In the days of Dragon Dictate, you had to pause between words
00:51:07 Kim: People who had serious speech problems, this worked well for them
00:51:24 Kim: and so they stuck with it even as speech input became more natural and looked for phrases
00:51:41 Kim: Speech seems remarkably good at understanding people with a lot of halting, almost better than accents
00:51:51 Kim: I've been surprised how well it deals with stutters
00:52:12 kaz: So probably during the workshop we should cover those cases as well, and what the actual pain points are
00:52:18 q?
00:52:49 Kim: Something else to think about
00:52:54 Kim: There's a time for natural language
00:53:15 Kim: And there's a time where it's a lot more useful to have a good default set of commands, one way to say something (maybe a few) and let the user change anything they want
00:53:30 Kim: Dragon made a mistake, I think, giving 24 different ways to say "go to end of the line"
00:53:40 This is a link to a research paper titled "A DATASET FOR STUTTERING EVENT DETECTION FROM PODCASTS WITH PEOPLE WHO STUTTER". It might be useful reading material on the subject -> https://arxiv.org/pdf/2102.12394.pdf
00:53:41 Kim: If you have good defaults, it's much easier to teach someone
00:54:03 Kim: I think it's really important to think about when natural language is the better UX and when a good default set of commands that can be learned easily and has structure is good
00:54:18 q?
00:54:21 Kim: The type of interaction, and what fits, has to be considered
00:54:33 Jennie: Should we try to list topics for the workshop?
00:54:36 *Thanks for sharing that study Ben
00:54:37 kaz: yes that's a good idea
00:54:52 kaz: Starting with existing standards within W3C first
00:55:12 kaz: Specifications including the natural language interface requirements, recent work as well
00:55:18 ddahl: Some technologies haven't found their way to any specs
00:55:25 ddahl: Like speaker recognition
00:55:28 s/Jennie:/ddahl:/
00:55:38 ddahl: Any value to including that in a standard?
00:56:07 ddahl: What are the pain points in a11y? What would be valuable to do in voice?
00:56:29 ddahl: maybe think about some disabilities that involve voices, either in speaking or hearing
00:56:44 ddahl: what can we do with text to speech that would cover some of the issues around the pronunciation spec
00:56:48 ddahl: and SSML
00:57:04 jamesn has joined #voice
00:57:25 ddahl: I guess EmotionML would be an interesting presentation
00:57:46 ddahl: Looking at emotions being expressed in text or speech would add a lot to the users' perception of what the web page is trying to say
00:58:34 Kim: There was some research at MIT using a common sense database
00:58:51 Kim: They found it increased recognition a certain percent, but people's perception was that it was more than twice as good
00:59:00 Kim: I guess because it took out the most stupid mistakes
00:59:05 Kim: So the user experience was a lot better
00:59:29 kaz: So I will revise the workshop proposal based on the discussion today
00:59:45 kaz: Kim, please give us further comments in the workshop committee
00:59:53 -> https://github.com/w3c/strategy/issues/221 workshop proposal
00:59:56 kaz: it would be great if more participants in this session could join the committee
01:00:08 ashimura@w3.org
01:00:09 kaz: you can directly give your input in GH or contact me at my W3C email address
01:00:24 *Thank you - very interesting!
01:00:31 kaz: OK, time to adjourn
01:00:35 kaz: Thank you everyone!
01:00:37 [adjourned]
01:00:38 Thank you
01:00:46 rrsagent, make log public
01:00:49 I have made the request to generate https://www.w3.org/2021/10/18-voice-minutes.html fantasai
05:12:26 Zakim has left #voice