W3C

– DRAFT –
Web Speech API Improvements

25 September 2024

Attendees

Present
Dom, Eric_Carlson, Evan_Liu, Francois_Daoust, James_Craig, Jer_Noble, Michael_Wilson, Nigel_Megitt, Ningxin, smaug, solis
Regrets
-
Chair
Evan_Liu
Scribe
nigel, tidoust

Meeting minutes

Early introductions (no scribe)

dom: Is there a link to LLMs?

Evan_Liu: We might use LLMs internally at Google, but the API won't change

dom: Signal to abort processing?

Evan_Liu: Probably out of scope.
… We're not talking about changing the core capabilities beyond introducing this new functionality

Offline Speech Recognition

<jcraig> WICG/speech-api#108

[slide]

Evan_Liu: These proposals have been around for a while,
… for on-device speech recognition
… Proposing to support by introducing two new attributes on the SpeechRecognition interface
… localService attribute
… allowCloudFallback attribute
… both Booleans. Names may change.
… The cloud speech-to-text service can support more options than on-device recognition
… Might combine the bools into a 3-value enum
… Idea is to control where the speech recognition is allowed to happen.
… New methods
… allow triggering download of a language pack
… query if on-device speech recognition is available
… Privacy concerns
… Chrome planning to only allow websites to install a language pack if it matches the user's
… primary language, or if not, ask permission
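
[Scribe sketch: a rough illustration of how the two proposed attributes might look from a page, using the provisional names above; whether these stay booleans or become a single enum is still under discussion.]

    // Hypothetical usage of the proposed attributes; names are provisional.
    const recognition = new (window as any).SpeechRecognition();

    // Require recognition to run on the user's device...
    recognition.localService = true;
    // ...and do not fall back to a cloud service if that is not possible.
    recognition.allowCloudFallback = false;

    recognition.lang = 'en-US';
    recognition.onresult = (event: any) => {
      console.log(event.results[0][0].transcript);
    };
    recognition.start();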

Jer: What's the use case for disabling local speech recognition?

Evan_Liu: More about allowing server-based STT, but also users may
… not want to use the CPU on their machine

Jer: When would you ever set localService to false?
… Sounds like only if the website knows it may be doing other CPU-intensive tasks

ningxin: discussion of on/off device use cases, including device capabilities


<smaug> (smaug == Olli Pettay)

smaug: [missed]

michael_wilson: Why not an enum?

Evan_Liu: No reason, seems like it would be simpler based on this discussion

Jer: An enum with 3 values would mean each option is clearer to understand

smaug: the "on" prefix...

Evan_Liu: Good point, maybe it should be "is"

Jer: Seems like a property, so a getter would make sense

solis: On Device might make more sense than Local

Evan_Liu: It's matching existing spec text, I prefer On Device too. Any preferences?

dom: The boundary of a device might evolve over time.
… Trying to communicate the privacy aspect of the choice here.
… Instead of focusing on this abstract boundary, focus on what is driving the developer's choice.
… Not sure it matters, as much as the privacy implications based on which environment is shown.
… Not sure how to formulate this.
… Hybrid models might exist, so this distinction might not be what we need.

tidoust: The notion of "parties" comes to mind, but also not right.
… Could say it's First party for the UA, or 3rd party for someone else.
… Not sure that's the right way either.
… Google has plenty of hats here.

Evan_Liu: installOnDeviceSpeechRecognition() returns a boolean;
… the download could take minutes, so we just return whether the fetch has been initiated.
… Could have event listeners to reveal when the installation is complete
… Or return true when the installation is complete

eric_carlson: Some of the language packs are quite large.
… Do we need to be concerned about allowing a page to use a lot of user data in this way?
… e.g. people on limited data plans

Evan_Liu: Another Chrome criterion is whether the user is on a cellular network or on WiFi/ethernet.
… Could be in the spec, or could be in the browser, depending on what people want.

smaug: This install feels very scary; it needs to be async, and behind permissions.
… May never get past the installation.
… Maybe return a promise

Evan_Liu: This would be async, and would return as soon as the user signals their preference

smaug: User may not say anything

Evan_Liu: Any concerns about that API?

eric_carlson: Should be a promise
… May not ever return, or the download may time out
… To prevent polling, it should resolve once it's been downloaded and is available for use
… Polling is an anti-pattern
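
[Scribe sketch: one possible promise-based shape for the install and query methods, following the suggestion above that install should resolve only once the pack is usable; installOnDeviceSpeechRecognition() is the name from the slides, while the query method name here is hypothetical.]

    // Hypothetical promise-based flow; the resolve-on-completion behaviour
    // follows the discussion above rather than the current slide proposal.
    async function ensureOnDeviceRecognition(lang: string): Promise<boolean> {
      const sr = (window as any).SpeechRecognition;

      // Hypothetical query: is an on-device pack already available for lang?
      if (await sr.onDeviceSpeechRecognitionAvailable(lang)) {
        return true;
      }

      // Trigger the download; under this shape the promise settles only once
      // the pack is installed and usable (it may also reject, or never settle
      // if the user declines or the download times out).
      return sr.installOnDeviceSpeechRecognition(lang);
    }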

nigel: Do you have a scheme in mind for caching if two pages want to fetch the same language?

Evan_Liu: For Chrome, there will always be only one package per language

smaug: There may be a privacy issue here: where a page polls for installed languages. Fingerprinting problem.

Evan_Liu: True. If it requires a pop-up that scares people away, we feel the fingerprinting issue goes away

eric_carlson: Same issue with fonts. Recurring issue.

Jer: Not sure if this is something that could be partitioned
… Always ask the user for every page even if the language pack is already available

eric_carlson: Interesting idea: there was a way to get microphone permissions, which was used
… for fingerprinting.
… We returned a fixed list until the user started to capture, and only then returned the correct list
… Do something similar here: signal only the user's language as available,
… until the API actually asks to download something.
… If the pack is already there, return more quickly.
… The user would get prompted every time, which would be weird.
… If you prompt on the download request and then reveal what's actually available,
… that could be one way to handle it.

dom: If you ignore MediaStreamTrack for the minute, right now the Speech API does go through
… a prompt. Agree that the default language makes sense, but there may be an opportunity
… for bringing something into the prompt.
… It's super privacy invasive to do recognition in the first place

eric_carlson: Could be that microphone permission is enough of a barrier

mjwilson: Is it assumed to only work on the mic, or on other audio sources?

eric_carlson: Wouldn't work for other audio streams

Evan_Liu: Could have auto-download as soon as recognition begins
… Should this be in the spec?

eric_carlson: The spec must have at least recommendations.
… Leave room for UAs to figure out other ways

dom: If you leave privacy mitigations up to implementations and let them have interop impact, then it's a race to the bottom,
… so agree: put something in the spec

jer: Best way would be to only allow user's language STT
… Could imagine foreign language STT and then translation though
… Or could have a list of preferred languages

dom: Except for Duolingo!

jer: that already knows the text though.

dom: hard to separate this from whether it only applies to live capture or not
… as soon as you cut that link it's not an effective mitigation, so need other mitigation for non-live capture

jer: Misunderstood the Duolingo example - the user is speaking and the page wants to know if
… they spoke correctly

nigel: I have used the text-to-speech features of browsers and experienced divergence in performance across browsers, especially in start-up times.
… Any plan to signal "readiness"?
… To avoid missing bits?

Evan_Liu: No real requirement in the spec that things must happen within a specified amount of time.

jcraig: Presumably that's a bug; it's not supposed to drop bits.

Nigel: What about extending to other types of sounds? Sometimes, it's not only speech you're interested in.
… There are machine-based products that recognize sound.

jcraig: My understanding of the proposal is that the main interest is converting speech to text.

MediaStreamTrack support

[slide] GitHub issue #66

Evan_Liu: start() method requests permission to use microphone, and starts using it if allowed
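
[Scribe sketch: a rough illustration of the MediaStreamTrack idea, assuming a hypothetical overload of start() that accepts a track; only the intent (recognising audio that does not come straight from the local microphone) is from the discussion, the signature is not settled.]

    // Hypothetical: caption an existing audio track, e.g. a remote participant
    // in a video conference, instead of capturing the local microphone.
    function captionTrack(track: MediaStreamTrack) {
      const recognition = new (window as any).SpeechRecognition();
      recognition.continuous = true;
      recognition.onresult = (event: any) => {
        const last = event.results[event.results.length - 1];
        console.log('caption:', last[0].transcript);
      };
      // Assumed overload: start(track) uses the given track rather than the mic.
      recognition.start(track);
    }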

jer: If you send noise to a speech recognition API, you can figure out what platform is being used,
… so it's a big entropy change for fingerprinting
… needs to be listed in the spec if we are exposing arbitrary speech recognition

nigel: I haven't understood the use case properly. Why does the page need the API? Why not have the user trigger the microphone and provide a text stream? Why does the page need to know?

Evan_Liu: For video conferencing tools, streams may be coming from elsewhere.

nigel: It does not seem very sustainable to do the work on all clients.


[discussion of use cases] - live translate and dub

jcraig: Sender side could have better sound quality from local mic

dom: Completely muted conference, only sending speech text

jer: Mitigation could be a mandated delay so a website can't immediately get an answer.
… Requiring a minimum of 30s before producing text mitigates this by forcing the website to wait for its fingerprinting data
… One possible mitigation to privacy impacts. Not a solution, just increases the cost for the page.

Evan_Liu: Concern is profiling performance capabilities?

jer: Not perf; each implementation might give a different answer, so you can tell devices apart

eric_carlson: Might get fine grained information about the machine

jer: Could be less risky for Chrome having a large number of users with the same characteristics

mjwilson: This isn't theoretical, it actually happened with Web Audio; it allows fine-grained identification

jer: which is why we bring this up, it's been an active area of attack

Spoken Punctuation Parameter

Evan_Liu: A boolean attribute: if true, the output uses the punctuation symbol; if false, it spells it out, e.g. "comma"
… If you're using it for captioning you might want to spell it out
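
[Scribe sketch: the proposed boolean as just described; the attribute name here is illustrative only, and as discussed below it may end up as an enum instead.]

    // Hypothetical attribute name; behaviour as described in the proposal.
    const recognition = new (window as any).SpeechRecognition();

    // If true, punctuation the user speaks is rendered as the symbol (",");
    // if false, it is spelled out as the word ("comma"), e.g. for captioning.
    recognition.spokenPunctuation = true;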

dom: i18n questions about what constitutes punctuation
… likely to regret a boolean, there will be other choices

Evan_Liu: Could start with an enum to make it more extensible

jcraig: Agree, verbosity of screen readers is use case dependent

Nigel: Does this control unspoken punctuation?

Evan_Liu: No, Google supports it but we haven't had a request for that yet

tidoust: Could be an array

ningxin: For mass education, want to speak a formula, and have that appear.
… Send it to the cloud. Previously used Web Speech API with a different backend, but got a different answer
… Would be helpful to do mathematical representation output

mjwilson: Last year MathML WG had presentation on spoken math, very interesting and deep topic

Remove SpeechGrammar

Evan_Liu: Not implemented or well defined; there have been requests to remove it.

<jcraig> that would have been Neil Soiffer discussing spoken MathML and the MathJax project

Evan_Liu: Seems to be consensus to remove, not controversial.
… It was intended to provide biasing support

Biasing support

Evan_Liu: Add bias to certain phrases, depends on recognition support.
… Chrome's recognition supports this. It's pretty generic, not tied to specific use cases.
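
[Scribe sketch: one possible shape for phrase biasing, with a hypothetical phrases list and per-phrase boost; the minutes only establish that certain phrases can be biased and that support depends on the recogniser. Example terms are taken from the discussion below.]

    // Hypothetical: bias the recogniser towards domain terms it might otherwise
    // mis-hear. Property name, structure and boost values are illustrative only.
    const recognition = new (window as any).SpeechRecognition();
    recognition.lang = 'en-US';
    recognition.phrases = [
      { phrase: 'Ke$ha', boost: 2.0 },
      { phrase: 'LaTeX', boost: 1.5 },
    ];
    recognition.start();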

dom: Guess this would be super useful, would want to include language info and substream tagging
… If the phrase is used in several languages, you need to tag that, so need a different structure

Evan_Liu: Each recogniser only works for a single language

dom: Does the API need to support multiple language audio? Need to surface in the API.

Evan_Liu: Multi-lang recognition in the same phrase not supported

dom: If I put my French name into an English phrase, the pronunciation would change

dom: If I put my Chinese name, that wouldn't work either.
… Fair, needs to be clear what language the stream is in
… Don't know how you take it back

jcraig: Blind colleague laughed because Ke$ha was pronounced "Key dollar har"
… Also other words get pronounced differently by domain experts.
… Need phonetic hinting, perhaps IPA "International Phonetic Alphabet"

<Zakim> jcraig, you wanted to mention musical artists like Ke$ha... technical or domain that don't align well terms (not a perfect example but LaTeX is "la-tek" not "latex")

<Zakim> nigel, you wanted to mention pages

Nigel: There was a case in the news recently where "pagers" got misrecognised as "pages".
… So what jcraig said: need to know the phonetic details
… Other point: need to know how you would layer in speaker recognition, or changes of speaker,
… again for the captioning use case

mjwilson: Also language recognition

Meeting close

Evan_Liu: We're 2 minutes over, let's close

Minutes manually created (not a transcript), formatted by scribe.perl version 229 (Thu Jul 25 08:38:54 2024 UTC).
