This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019.
The spec doesn't currently specify how a user agent should behave if multiple windows try to speak at the same time. This includes different windows and different tabs, but also different frames within the same page. I'll use "window" for all of these, since each tab, window, and frame has its own DOM Window object, and the speech synthesis object is owned by the window.

Here are some possible ways this could be resolved:

1. All windows could be allowed to speak at once.
2. Only one window could be allowed to access the Speech Synthesis service at a time.
3. All windows share one speech synthesis queue, and windows aren't allowed to affect speech from other windows.
4. All windows share one speech synthesis system, but when one window wants to speak, it must stop all other windows and flush the queue first.
5. All windows share one speech synthesis queue, and any window can stop speech and flush the queue, but not inspect utterances from other windows.

My favorite is the last solution: it allows for the possibility of two windows cooperatively interleaving speech utterances where this would be beneficial to the user. However this is resolved, I think it definitely belongs in the spec; I wouldn't want different user agents to have totally different policies with respect to multiple windows trying to speak.
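For concreteness, here is a minimal sketch of the situation being described, assuming nothing beyond the speak()/cancel() surface already in the draft; the spec currently doesn't say how calls made from different windows interact.

```ts
// Run this in each of two frames on the same page. Each frame has its own
// DOM Window and therefore its own speechSynthesis object.
const u = new SpeechSynthesisUtterance(`Hello from ${window.name || "a frame"}`);
u.onstart = () => console.log("started speaking");
u.onend = () => console.log("finished speaking");
speechSynthesis.speak(u);

// Unspecified today: do the two frames' utterances overlap, interleave, or
// queue behind one another? And would a call like the one below in one frame
// flush the other frame's queue (options 4 and 5 above) or leave it alone
// (option 3)?
// speechSynthesis.cancel();
```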
I'll also add to this that apps outside the browser might be trying to speak, and that would also have an impact on synthesis.
It's also possible that the user is running two different browsers (similar to the apps outside the browser case). Also, there may be a shared synthesizer engine, which may impose limitations.
I guess we just stay in a pending state until our utterance is reached in the external queue and starts speaking? A larger problem is that this behavior could change depending on engine limitations. In other words, engines that rely on a platform speech API or a library with a global synth state would have to wait for the previous window's utterance to finish. But other engines, for example remote services, don't have this limitation and could start speaking immediately on top of another window's output.
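To make the "pending until the engine gets to us" behavior concrete, here is a small sketch; whether `speaking` reflects only this window's output or the engine as a whole is exactly the open question in this bug.

```ts
// The page can only observe, not control, when a shared engine reaches its
// utterance: it sits as pending until the start event fires.
const u = new SpeechSynthesisUtterance("Waiting for the shared engine");
u.onstart = () => console.log("the engine reached our utterance");
u.onend = () => console.log("our utterance finished");
speechSynthesis.speak(u);

// Typically true right after speak(), while the utterance hasn't started yet:
console.log(speechSynthesis.pending);
// Whether this reflects only this window's output or all clients of the
// underlying engine is implementation-dependent today:
console.log(speechSynthesis.speaking);
```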
OK, I propose that we specifically try to minimize the case where two windows speak at the same time, because:

* Many platform implementations will have this limitation anyway
* There aren't any good use-cases that require multiple simultaneous speech

The specific API changes:

SpeechSynthesis.speaking is true if another browser window is speaking or if another application that accesses the same underlying speech system is speaking.

SpeechSynthesis methods cancel(), pause(), and resume() operate on the global speech synthesis system, where possible.

When multiple browser windows have speech queued and no call to cancel() is made, all utterances from all windows should eventually be queued. (The order in which utterances from different windows are chosen from the queue is not specified.)

For security reasons, it should not be possible for one browser window to inspect the speech queue. The only thing it can determine is whether the speech synthesis system is currently speaking.

Finally, we need some way for a window to know when the system is no longer speaking, when that speech was coming from another window.
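A rough sketch of what a cooperating page might do under this proposal; the "no longer speaking" notification is the missing piece, so the `speakingdone` event name below is purely hypothetical and used only for illustration.

```ts
// Sketch only: `speakingdone` stands in for the proposed global
// "system stopped speaking" notification; it does not exist today.
function speakWhenQuiet(text: string) {
  const u = new SpeechSynthesisUtterance(text);
  if (!speechSynthesis.speaking) {
    // Nothing, in any window or app sharing the same engine, is talking.
    speechSynthesis.speak(u);
    return;
  }
  // Something else is speaking; wait for the proposed notification rather
  // than polling speechSynthesis.speaking.
  speechSynthesis.addEventListener(
    "speakingdone" /* hypothetical */,
    () => speechSynthesis.speak(u),
    { once: true }
  );
}
```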
(In reply to comment #4)
> OK, I propose that we specifically try to minimize the case where two
> windows speak at the same time, because:
>
> * Many platform implementations will have this limitation anyway

Not all, as far as I can tell. Here is what I found from a quick glance:
- SAPI on Windows allows you to write audio to a given file handle.
- Apple has "speech channels", where I assume each one is a distinct queue (may be wrong).
- Not sure about Android; I think maybe one queue per Activity? There is a synthesize-to-file option, but it takes a file path, which looks painful.
- Linux has Speech Dispatcher with one global queue, afaict.

There is some hardship in the libraries that I played with, namely pico and espeak. Both have a global state, so you need to finish one synthesis task before starting the next. But at least that is not real-time, so you wouldn't need to wait for the previous utterance to complete playing.

> * There aren't any good use-cases that require multiple simultaneous speech

I don't know what use-cases are considered good or not. But I can imagine an app having more than one simultaneous synthesized stream. I would also expect a web app to have some certainty when calling speak(), but I guess that is what you are trying to do here :)

> The specific API changes:
>
> SpeechSynthesis.speaking is true if another browser window is speaking or if
> another application that accesses the same underlying speech system is
> speaking.
>
> SpeechSynthesis methods cancel(), pause(), and resume() operate on the
> global speech synthesis system, where possible.

It seems scary to me that one app could erase the queue of another app.

> When multiple browser windows have speech queued and no call to cancel() is
> made, all utterances from all windows should eventually be queued. (The
> order in which utterances from different windows are chosen from the queue
> is not specified.)
>
> For security reasons, it should not be possible for one browser window to
> inspect the speech queue. The only thing it can determine is whether the
> speech synthesis system is currently speaking.

This is the case now, I believe. A speech queue could not be examined by its own window either.

> Finally, we need some way for a window to know when the system is no longer
> speaking, when that speech was coming from another window.

Here is an implementation dilemma: a tabbed browser might want to pause background tabs in favor of the selected tab. If pause() is called on the background tab, it locks up the global queue.
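To put that dilemma in concrete form, here is the kind of handler a page (or a browser, internally) might want to run, assuming pause()/resume() act on the global queue as proposed above:

```ts
// With a single global queue and a global pause(), backgrounding one tab
// would stall every other window's (and possibly other apps') utterances too.
document.addEventListener("visibilitychange", () => {
  if (document.hidden) {
    speechSynthesis.pause();  // freezes the shared queue for everyone
  } else {
    speechSynthesis.resume();
  }
});
```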
Let's pick up the thread on this bug again.

I still say that we shouldn't support multiple simultaneous speech streams for now. The main reason for this is that humans can't listen to two utterances at once; that's why I say there are no "good" use-cases. The goal of this API isn't to allow any possible audio effect involving speech, it's to make the common cases of speech feedback easy for web developers, and that means speaking one thing at a time.

We could always extend the API later to allow for multiple speech channels. Everything else we do could still apply to the current channel.

I agree, though, that pause shouldn't lock up the speech queue.

So here's my new proposal:

* There's one global speech queue (later we could extend it to have multiple channels, each with their own queue - on platforms where that's supported - but not for now)
* One window/frame can only cancel, pause or receive events on utterances that it added to the queue
* speechSynthesis.speaking reflects the global state of the speech synthesizer
* New event handlers on speechSynthesis allow a window to monitor the status of speech globally, even when that speech is coming from other tabs - that way apps don't have to poll speechSynthesis.speaking if they want to know when speech stops or starts
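As a sketch of what this would look like from a page's point of view (the global monitoring events are not specified yet; `speechend` below is a hypothetical name used only for illustration):

```ts
// Per-window control: speak, and later cancel, only this window's utterances.
const mine = new SpeechSynthesisUtterance("Queued behind whatever is talking");
mine.onend = () => console.log("my utterance finished");
speechSynthesis.speak(mine);

// Under the proposal this would remove only *this* window's utterances from
// the shared queue; other windows' speech is untouched:
// speechSynthesis.cancel();

// Reflects the synthesizer as a whole, including other tabs and apps:
console.log(speechSynthesis.speaking);

// Hypothetical global monitoring, so apps don't have to poll `speaking`:
speechSynthesis.addEventListener("speechend" /* hypothetical */, () => {
  console.log("nothing is speaking anywhere any more");
});
```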
(In reply to Dominic Mazzoni from comment #6)

I support these changes.

We also want to define what happens if you try to speak when the other client is speaking. Does a new error get returned using an ErrorEvent, or does it just get enqueued...

Do speech jobs from different clients get enqueued interleaved, or does one client get continual access until it has no more speech jobs?
> We also want to define what happens if you try to speak when the other client
> is speaking. Does a new error get returned using an ErrorEvent, or does it
> just get enqueued...

I think it should enqueue. You can always check speechSynthesis.speaking to determine if someone else is speaking. That's also why I think we need an event on speechSynthesis that fires when speech finishes, globally.

> Do speech jobs from different clients get enqueued interleaved, or does one
> client get continual access until it has no more speech jobs?

I was assuming interleaving. It doesn't really make sense to have a shared queue if one window can denial-of-service the other windows. Windows that want to wait until everyone has stopped speaking should have an easy way to do that - the event on speechSynthesis.
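Until such an event exists, the only way for a window to wait its turn is to poll. A throwaway sketch of that polling (interval chosen arbitrarily), which the proposed event on speechSynthesis would make unnecessary:

```ts
// Polling fallback: resolves once nothing is speaking or queued any more.
// Note that whether `pending` covers other windows' queues is itself part of
// what this bug is trying to pin down.
function whenSpeechStops(): Promise<void> {
  return new Promise((resolve) => {
    const timer = setInterval(() => {
      if (!speechSynthesis.speaking && !speechSynthesis.pending) {
        clearInterval(timer);
        resolve();
      }
    }, 250);
  });
}

// Usage: enqueue only once every other client has gone quiet.
whenSpeechStops().then(() => {
  speechSynthesis.speak(new SpeechSynthesisUtterance("My turn"));
});
```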
In trying to keep this simple, how about:

- If multiple windows/frames make speak method requests, they are queued (rather than played simultaneously). An "audio" or "synthesis" busy error should not be fired unless the user-agent is unable to process the request.
- A window/frame can only cancel, pause or receive events on SpeechSynthesisUtterances that it added to the queue.
- speechSynthesis.speaking reflects the global speaking state of all windows/frames.
- The user-agent MAY speak utterances from the active window/frame before utterances from other frames. If the user switches focus from one window/frame to another, the user-agent MAY pause speech from other windows/frames to give precedence to the active one.
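A minimal sketch of what the second bullet implies for a page: track your own SpeechSynthesisUtterance objects, since that is all the window can ever observe or control. The per-window scoping of cancel() is assumed from the rules above, not current behavior everywhere.

```ts
// Each window keeps handles to its own utterances; it never sees other
// windows' queue contents.
class WindowSpeechQueue {
  private active = new Set<SpeechSynthesisUtterance>();

  speak(text: string): SpeechSynthesisUtterance {
    const u = new SpeechSynthesisUtterance(text);
    u.onend = () => this.active.delete(u);
    u.onerror = () => this.active.delete(u);
    this.active.add(u);
    speechSynthesis.speak(u);
    return u;
  }

  // Assumes the proposed scoping: cancel() drops only this window's
  // utterances from the shared queue.
  cancelMine(): void {
    this.active.clear();
    speechSynthesis.cancel();
  }
}
```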
Right now, there is an inconsistency between Chrome's and Safari's implementations. Chrome has a global speech queue, while Safari supports concurrent speech. Safari could afford to do this because the Apple speech service, on both iOS and OS X, allows multiple speech channels. Chrome needs to work on all platforms, so it cannot consistently provide concurrent speech. Firefox will have the same dilemma.

So I have two proposals:

1. If *all* the voices that a browser provides support concurrency, then the browser should allow speech to be simultaneous. If *any* voice or set of voices conforms to a global queue, *all* speech synthesis is scheduled with a global queue. Maybe there could be a field for introspection to know what mode the browser is in, or the developer could find out the hard way.

2. In the case of non-concurrent speech, which I understand is a necessary evil, I wrote up a table with cases and resolutions based on the input from bug 21110. I want to compile this eventually into a conformance test. Any input on the behavior, or additional cases/dilemmas, would be greatly appreciated. I would like to see this stuff sorted out in the spec as well.

Here is the draft: https://docs.google.com/spreadsheets/d/1QojxT6JdNW7lyhRRMjt44ipthhpfGVI_UGftQoJvbbw/edit?usp=sharing
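For the introspection idea in proposal 1, a sketch of what it could look like; `concurrent` is a hypothetical attribute name used only for illustration and exists nowhere today. The point is that the speak() call is identical either way; only the scheduling differs.

```ts
// Hypothetical read-only flag: true only if every provided voice supports
// concurrent synthesis, false if any voice forces a global queue.
const synth = speechSynthesis as SpeechSynthesis & { concurrent?: boolean };

if (synth.concurrent) {
  // May overlap speech already coming from another window or app.
  synth.speak(new SpeechSynthesisUtterance("Turn left in 200 metres"));
} else {
  // The same call, but it enqueues behind whatever another window is saying.
  synth.speak(new SpeechSynthesisUtterance("Turn left in 200 metres"));
}
```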
I thought Chris was supportive of not supporting concurrency. Could we agree on that and eventually have that be the implementation in all browsers, rather than having two possible implementations? Your conformance test sounds like a good idea. In cell J5, did you mean "W1 calls speak()..."? (You wrote W2).
(In reply to Dominic Mazzoni from comment #11)
> I thought Chris was supportive of not supporting concurrency. Could we agree
> on that and eventually have that be the implementation in all browsers,
> rather than having two possible implementations?

I would prefer that we allow concurrency whenever possible. At this stage, that would mean keeping the spec as-is, and having a clearly defined contingency for single-queue platforms. That way, when concurrency becomes possible on more platforms, browsers could move quickly to support it without being blocked by further spec design.

I think non-concurrent implementations are very problematic, for reasons already mentioned, but here they are again:

1. They allow windows to share state, which could be a (minor) security concern.

2. When a web app calls speak(), the result is entirely undetermined. The app could start speaking immediately (if there are no other speaking windows), in 30 seconds (after another window finishes speaking), or never (if another window paused mid-utterance and never resumed).

3. User experience: the argument for a single queue is that humans can't focus on more than one speaker at a time. But the alternative of queued speech that is spoken out of context of the app's actual state is worse, IMHO. Imagine any combination of voice assistant, navigation app, ebook reader and screen reader. A user can't listen to music and watch TV at the same time, yet access to the audio device is not queued. The user ultimately makes the choice of what they can and cannot run concurrently. We should give them the same choice with speech. For added polish, platforms/browsers could perform "ducking" when an utterance is spoken while other speech output is already happening.

4. While we can flesh out the details of how a browser behaves on a non-concurrent platform, we have to remember that the browser is just one speech client that is potentially competing for a place in the queue with other speaking apps. The scheduling of this queue will depend on the speech service on each platform, so this would introduce further uncertainty and platform inconsistencies into an already fragile setup.

With all that said, I understand that a single queue is the reality on most platforms now, and we will need to work within those constraints. But I would like to see the Web Speech API remain future-proof and allow concurrent speech when it is available. Hopefully platforms will soon allow this.

> Your conformance test sounds like a good idea. In cell J5, did you mean "W1
> calls speak()..."? (You wrote W2).

Fixed, thanks.
We should aim for supporting concurrency - TTS isn't any different from playing audio from multiple sources. It is a limitation on some platforms if they don't support concurrent handling here, but let's not limit the web platform's capabilities because of that.
It looks like we could do concurrent speech on Windows as well with the Windows 8 API: https://msdn.microsoft.com/en-us/library/windows/apps/windows.media.speechsynthesis.aspx There might be some caveats there because it may break compatibility with legacy SAPI voices, since it only supports "Microsoft signed voices", and at least for 8.x apps they need to be "Windows Store apps"; I don't yet know what that means. But if this is true, and if we could use that instead of SAPI on Windows, then the only two holdouts for non-concurrent implementations would be Android and desktop Linux.
Moved to https://github.com/w3c/speech-api/issues/47 to discontinue use of Bugzilla for Speech API.