Meeting minutes
Early introductions (no scribe)
dom: Is there a link to LLMs?
Evan_Liu: We might use LLMs in Google internally, but the API won't change
dom: Signal to abort processing?
Evan_Liu: Probably out of scope.
… We're not talking about changing the core capabilities beyond introducing this new functionality
Offline Speech Recognition
<jcraig> WICG/
[slide]
Evan_Liu: These proposals have been around for a while,
… for on-device speech recognition
… Proposing to support by introducing two new attributes on the SpeechRecognition interface
… localService attribute
… allowCloudFallback attribute
… both Booleans. Names may change.
… Cloud speech to text service can support more options than on device
… Might combine the bools into a 3-value enum
… Idea is to control where the speech recognition is allowed to happen.
… New methods
… allow triggering download of a language pack
… query if on-device speech recognition is available
… Privacy concerns
… Chrome planning to only allow websites to install a language pack if it matches the user's
… primary language, or if not, ask permission
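[Illustrative sketch of the proposal as presented, for orientation. localService, allowCloudFallback and installOnDeviceSpeechRecognition() are the names from the slides; availableOnDevice() is a hypothetical name for the "query if on-device speech recognition is available" method, and the promise return types anticipate the discussion below rather than the slide.]

    interface SpeechRecognitionProposed {
      lang: string;
      localService: boolean;        // proposed: keep recognition on the device
      allowCloudFallback: boolean;  // proposed: permit a cloud fallback
      availableOnDevice(lang: string): Promise<boolean>;  // name hypothetical
      installOnDeviceSpeechRecognition(lang: string): Promise<boolean>;
      start(): void;
    }

    // Possible usage: insist on on-device recognition, installing a pack if needed.
    async function recognizeLocally(recognition: SpeechRecognitionProposed) {
      recognition.lang = "en-US";
      recognition.localService = true;        // audio never leaves the device
      recognition.allowCloudFallback = false;
      if (!(await recognition.availableOnDevice("en-US"))) {
        await recognition.installOnDeviceSpeechRecognition("en-US");
      }
      recognition.start();
    }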
Jer: What's the use case for disabling local speech recognition?
Evan_Liu: More about allowing server-based STT, but also users may
… not want to use the CPU on their machine
Jer: When would you ever set localService to false?
… Sounds like only if the website knows it may be doing other cpu intensive tasks
ningxin: discussion of on/off device use cases, including device capabilities
<smaug> (smaug == Olli Pettay)
smaug: [missed]
michael_wilson: Why not an enum?
Evan_Liu: No reason, seems like it would be simpler based on this discussion
Jer: An enum with 3 values would mean each option is clearer to understand
smaug: the "on" prefix...
Evan_Liu: Good point, maybe it should be "is"
Jer: Seems like a property, so a getter would make sense
solis: On Device might make more sense than Local
Evan_Liu: It's matching existing spec text, I prefer On Device too. Any preferences?
dom: The boundary of a device might evolve over time.
… Trying to communicate the privacy aspect of the choice here.
… Instead of focusing on this abstract boundary, focus on what is driving the developer's choice.
… Not sure it matters, as much as the privacy implications based on which environment is shown.
… Not sure how to formulate this.
… Hybrid models might exist, so this distinction might not be what we need.
tidoust: The notion of "parties" comes to mind, but also not right.
… Could say it's First party for the UA, or 3rd party for someone else.
… Not sure that's the right way either.
… Google has plenty of hats here.
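[One possible shape for the three-value enum discussed above, replacing the two booleans; the value names are invented for illustration only.]

    type SpeechProcessingPreference =
      | "on-device-only"    // never send audio off the device
      | "prefer-on-device"  // fall back to a cloud service if unavailable
      | "cloud-allowed";    // no restriction on where recognition happens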
Evan_Liu: installOnDeviceSpeechRecognition() - returns a boolean,
… could take minutes to download, so we just return whether the fetch has been initiated.
… Could have event listeners to reveal when the installation is complete
… Or return true when the installation is complete
eric_carlson: Some of the language packs are quite large.
… Do we need to be concerned about allowing a page to use a lot of user data in this way?
… e.g. people on limited data plans
Evan_Liu: Another Chrome criterion is whether the user is on a cellular network or on WiFi/ethernet.
… Could be in the spec, or could be in the browser, depending on what people want.
smaug: This install feels very scary; it needs to be async, and behind permissions.
… May never get past the installation.
… Maybe return a promise
Evan_Liu: This would be async, and would return as soon as the user signals their preference
smaug: User may not say anything
Evan_Liu: Any concerns about that API?
eric_carlson: Should be a promise
… May not ever return, or the download may time out
… To prevent polling, it should resolve once it's been downloaded and is available for use
… Polling is an anti-pattern
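[Sketch of the promise-based shape suggested here: the call resolves only once the pack is downloaded and usable, so the page never has to poll. Names and error behaviour are illustrative, not settled.]

    async function ensureLanguagePack(
      recognition: { installOnDeviceSpeechRecognition(lang: string): Promise<boolean> },
      lang: string
    ): Promise<boolean> {
      try {
        // Resolves once the pack is installed and available for use.
        return await recognition.installOnDeviceSpeechRecognition(lang);
      } catch {
        // e.g. the user declined the prompt or the download timed out.
        return false;
      }
    }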
nigel: Do you have a scheme in mind for caching if two pages want to fetch the same language?
Evan_Liu: For Chrome, there will always be only one package per language
smaug: There may be a privacy issue here: where a page polls for installed languages. Fingerprinting problem.
Evan_Liu: True. If it requires a pop-up that scares people away, we feel the fingerprinting issue goes away
eric_carlson: Same issue with fonts. Recurring issue.
Jer: Not sure if this is something that could be partitioned
… Always ask the user for every page even if the language pack is already available
eric_carlson: Interesting idea: there was a way to get microphone permissions, which was used
… for fingerprinting.
… We returned a fixed list until the user started to capture, and only then returned the correct list
… Do something similar here: signal only the user's language as available,
… until the API actually asks to download something.
… If the pack is already there, return more quickly.
… User would get prompted every time, which would be weird.
… If you prompt on download request and then reveal what's actually available,
… that could be one way to handle it.
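[Rough sketch of the mitigation eric_carlson describes; purely illustrative UA-side logic, not part of the proposal.]

    function reportAvailableLanguages(
      installedPacks: string[],
      primaryLanguage: string,
      downloadPromptTriggered: boolean
    ): string[] {
      if (!downloadPromptTriggered) {
        // Fixed answer: only the user's primary language appears available,
        // regardless of what is actually installed.
        return [primaryLanguage];
      }
      // Once the page has asked to download something (and prompted the user),
      // reveal what is really available.
      return installedPacks;
    }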
dom: If you ignore MediaStreamTrack for the minute, right now the Speech API does go through
… a prompt. Agree that default language makes sense, but there may be an opportunity
… for bringing something into the prompt.
… It's super privacy invasive to do recognition in the first place
eric_carlson: Could be that microphone permission is enough of a barrier
mjwilson: Is it assumed to only work on mic, or other audio resources?
eric_carlson: Wouldn't work for other audio streams
Evan_Liu: Could have auto-download as soon as recognition begins
… Should this be in the spec?
eric_carlson: The spec must have at least recommendations.
… Leave room for UAs to figure out other ways
dom: If you leave privacy mitigations up to implementations and let them have interop impact, then it's a race to the bottom
… so agree put something in the spec
jer: Best way would be to only allow user's language STT
… Could imagine foreign language STT and then translation though
… Or could have a list of preferred language
dom: Except for duolingo!
jer: that already knows the text though.
dom: hard to separate this from whether it only applies to live capture or not
… as soon as you cut that link it's not an effective mitigation, so need other mitigation for non-live capture
jer: Misunderstood the duolingo example - the user is speaking and the page wants to know if
… they spoke correctly
nigel: I have used the text to speech features of browsers and experienced divergence in performance across browsers, especially in startup times.
… Any plan to signal "readiness"?
… To avoid missing bits?
Evan_Liu: No real requirement in the spec that things must happen within a specified amount of time.
jcraig: Presumably, that's a bug, not supposed to drop bits.
Nigel: What about extending to other types of sounds? Sometimes, it's not only speech you're interested in.
… There are machine-based products that recognize sound.
jcraig: My understanding of the proposal is that the main interest is converting speech to text.
MediaStreamTrack support
[slide] github issue #66
Evan_Liu: start() method requests permission to use microphone, and starts using it if allowed
jer: If you send noise to a speech recognition API, you can figure out what platform is being used
… so it's a big entropy change for fingerprinting
… this needs to be listed in the spec as a risk of exposing arbitrary speech recognition
nigel: I haven't understood the use case properly. Why does the page need the API? Why not have the user trigger the microphone and provide a text stream? Why does the page need to know?
Evan_Liu: For video conferencing tools, streams may be coming from elsewhere.
nigel: It does not seem very sustainable to do the work on all clients.
[discussion of use cases] - live translate and dub
jcraig: Sender side could have better sound quality from local mic
dom: Completely muted conference, only sending speech text
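[Hypothetical sketch of the video-conferencing use case; the start(track) overload is an assumption for illustration only, issue #66 just asks for MediaStreamTrack support in general.]

    function captionRemoteAudio(
      recognition: { lang: string; start(track?: MediaStreamTrack): void },
      remoteStream: MediaStream
    ) {
      // Recognize the remote participant's audio rather than the local microphone.
      const [audioTrack] = remoteStream.getAudioTracks();
      recognition.lang = "en-US";
      recognition.start(audioTrack);
    }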
jer: Mitigation could be a mandated delay so a website can't immediately get an answer.
… Requiring a minimum of 30s before producing text mitigates this by forcing the website to wait to get fingerprinting data
… One possible mitigation to privacy impacts. Not a solution, just increases cost for the page.
Evan_Liu: Concern is profiling performance capabilities?
jer: Not perf; each implementation might give a different answer, so you can tell devices apart
eric_carlson: Might get fine grained information about the machine
jer: Could be less risky for Chrome having a large number of users with the same characteristics
mjwilson: This isn't theoretical, it actually happened with Web Audio; it allows fine-grained identification
jer: which is why we bring this up, it's been an active area of attack
Spoken Punctuation Parameter
Evan_Liu: boolean attribute: if true, the output uses punctuation marks; if false, punctuation is spelled out, e.g. "comma"
… If you're using it for captioning you might want to spell it out
dom: i18n questions about what constitutes punctuation
… likely to regret a boolean, there will be other choices
Evan_Liu: Could start with an enum to make it more extensible
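[Illustrative only: one way the punctuation control could be an enum rather than a boolean, as suggested above. The attribute and value names are made up.]

    type PunctuationMode =
      | "symbols"   // the transcript contains "," itself
      | "spoken"    // the transcript contains the word "comma"
      | "default";  // leave the choice to the recognizer

    interface SpeechRecognitionPunctuation {
      punctuationMode: PunctuationMode;
    }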
jcraig: Agree, verbosity of screen readers is use case dependent
Nigel: Does this control unspoken punctuation?
Evan_Liu: No, Google supports it but we haven't had a request for that yet
tidoust: Could be an array
ningxin: For mass education, want to speak a formula, and have that appear.
… Send it to the cloud. Previously used Web Speech API with a different backend, but got a different answer
… Would be helpful to do mathematical representation output
mjwilson: Last year MathML WG had presentation on spoken math, very interesting and deep topic
Remove SpeechGrammar
Evan_Liu: Not implemented, nor well defined; there have been requests to remove it.
<jcraig> that would have been Neil Soiffer discussing spoken MathML and the MathJax project
Evan_Liu: Seems to be consensus to remove, not controversial.
… Intended to do biasing support
Biasing support
Evan_Liu: Add bias to certain phrases, depends on recognition support.
… Chrome's recognition supports this. It's pretty generic, not tied to specific use cases.
dom: Guess this would be super useful, would want to include language info and substream tagging
… If the phrase is used in several languages, you need to tag that, so need a different structure
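[Purely illustrative shape for biasing, combining the boosted-phrases idea with dom's suggestion of tagging a language per phrase; field names are assumptions, not from the proposal.]

    interface BiasPhrase {
      phrase: string;  // text the recognizer should be biased toward
      boost?: number;  // relative weight; semantics are recognizer-specific
      lang?: string;   // BCP 47 tag, if it differs from the stream's language
    }

    interface SpeechRecognitionBiasing {
      phrases: BiasPhrase[];
    }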
Evan_Liu: Each recogniser only works for a single language
dom: Does the API need to support multiple language audio? Need to surface in the API.
Evan_Liu: Multi-lang recognition in the same phrase not supported
dom: If I put my French name into an English phrase, the pronunciation would change
… If I put my Chinese name, that wouldn't work either.
… Fair, needs to be clear what language the stream is in
… Don't know how you take it back
jcraig: Blind colleague laughed because Ke$ha was pronounced "Key dollar har"
… Also other words get pronounced differently by domain experts.
… Need phonetic hinting, perhaps IPA "International Phonetic Alphabet"
<Zakim> jcraig, you wanted to mention musical artists like Ke$ha... technical or domain terms that don't align well (not a perfect example but LaTeX is "la-tek" not "latex")
<Zakim> nigel, you wanted to mention pages
Nigel: There was a case recently in the news where "pagers" got misrecognised as "pages".
… So what jcraig said: need to know the phonetic details
… Other point: need to know how you would layer in speaker recognition, or changes of speaker,
… again for the captioning use case
mjwilson: Also language recognition
Meeting close
Evan_Liu: We're 2 minutes over, let's close