W3C

– DRAFT –
VTT-based Audio Descriptions for Media Accessibility - TPAC 2022 breakout

14 September 2022

Attendees

Present
dsinger, Gary_Katsevman, JenniferS, JohnRochford, Léonie (tink), mbgower, Nigel_Megitt, shadi, Travis, vivien, yonet_, zcorpan
Regrets
-
Chair
Eric_Carlson, James_Craig
Scribe
cyril_

Meeting minutes

jcraig: defining audio descriptions, similar to how closed captions work for the deaf, but for the blind
… AD are alternate or additive audio tracks
… for low vision users
… old time radio shows used to do that
… Apple has experience in this area
… [screen shot of all the audio choices and caption choices]
… We’ve got some experience in this area. Apple isn’t just a provider of the playback hardware and software, it’s also a major content distributor, and Apple TV+ original content leads the industry in support for international audio description and caption choices. This is a screen shot of the audio and caption or subtitle choices of Ted Lasso Season 1 Episode 1. There are 10 spoken languages, and 10 additional AD versions in each of those languages. 20 Audio Tracks in total. There are also 40 caption (SDH VTT) tracks including those 10 + 30 other languages.

<jcraig> 1. Recorded audio: full mix (composited in studio) or dry mix (composited on device; new possibilities may be enabled by Dolby AC4 “accessibility descriptor”)

jcraig: among the types of AD
… there is a new type using Dolby AC4 with tagging: accessibility descriptors

<JohnRochford> James/Apple should be proud of the 20+ languages, with audio description for each, for the Ted Lasso production on Apple TV.

jcraig: could allow repositioning of the audio description to the headphones or near the ears
… use cases are posted on the Dolby site

<jcraig> 2. Generated text-to-speech-based audio. (Will mention pros/cons later)

jcraig: we are going to talk about the 2nd type, which is text to speech
… we will talk about pros/cons

<jcraig> 3. Text-to-braille descriptions w/ or w/o audio. No known implementations, but we hope to change that. Wanted to call this out b/c the “audio descriptions” doesn’t necessarily mean the end format will be audio.

jcraig: another type is text to braille
… it could be used for people who are deaf and blind as well
… who is familiar with Extended Descriptions
… they pause the media playback
… it can be awkward and we are looking for feedback
… there are cases where it is necessary and some others where it's not
… 2 implementations in Chromium and WebKit

<jcraig> Descriptions that pause the media playback. Appropriate in some circumstances, and less desirable in others…. For example in most entertainment contexts (e.g. Movies and TV shows) extended descriptions are often considered undesirable. In other contexts, the utility of extended descriptions may be necessary. We’ll talk more about this in the demo.

eric: A WebVTT track has an attribute that describes the kind of cues in it
… it has always had a "descriptions" kind, but it was never used
… the concept is simple
… in the web engine we enable the text track, and when we would render the cues to the screen
… we instead pass the text to the speech synthesis engine
… and like other kinds of cues, in the caption files there is a start time and an end time
… in this experimental implementation that I added to WebKit
… there are 2 modes
… a cue would start at its start time
… but if the utterance runs past the cue's end time, we don't stop the speech
… it may overlap with the audio
… if another cue starts, the previous cue is stopped
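
A page-level sketch of the behaviour Eric describes, using only the standard TextTrack and Web Speech APIs (TypeScript), might look like the following. This is an approximation for illustration only, not the WebKit implementation, which lives in the engine rather than in page script:

  // Speak cues from a kind="descriptions" text track; a newly starting cue
  // stops the previous one, and speech is allowed to run past the cue's end
  // time and overlap the audio (the non-extended mode described above).
  const video = document.querySelector('video') as HTMLVideoElement;
  const descTrack = Array.from(video.textTracks).find(t => t.kind === 'descriptions');

  if (descTrack) {
    descTrack.mode = 'hidden';            // receive cue events without rendering text
    descTrack.oncuechange = () => {
      const cue = descTrack.activeCues?.[0] as VTTCue | undefined;
      if (!cue) return;
      speechSynthesis.cancel();           // stop the previous description, if any
      speechSynthesis.speak(new SpeechSynthesisUtterance(cue.text));
    };
  }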

cyril_: is the demo public?

eric: no, it's not. I don't own the content
… the descriptions were carefully authored to fit the gap

[demo]
… second demo of a lecturer drawing on a chalkboard
… we have not yet enabled Extended Descriptions
… so you'll hear overlap
… obviously not a good user experience
… we duck the volume of the video a bit to let the description be louder
… let's turn on Extended Descriptions now
… [video is paused until the description is fully spoken]
… that makes it more understandable
… however in a movie trailer, where the descriptions were not authored carefully enough
… in that case, it can be really disruptive to the experience
… one of the questions that we have is: how can we let the user control that?
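
As a sketch of the extended-descriptions behaviour shown in the demo (again page-level TypeScript for illustration, not the actual WebKit code; the duck level of 0.5 is an arbitrary assumption):

  // Duck the video's audio while a description speaks, and pause the video if
  // the utterance has not finished by the cue's end time; resume when it ends.
  function speakExtended(video: HTMLVideoElement, cue: VTTCue): void {
    const originalVolume = video.volume;
    const utterance = new SpeechSynthesisUtterance(cue.text);
    video.volume = originalVolume * 0.5;            // duck under the description

    const guard = () => {
      if (video.currentTime >= cue.endTime && speechSynthesis.speaking) {
        video.pause();                              // wait for the description
      }
    };
    video.addEventListener('timeupdate', guard);

    utterance.onend = () => {
      video.removeEventListener('timeupdate', guard);
      video.volume = originalVolume;                // restore the programme audio
      if (video.paused) { void video.play(); }      // resume playback
    };

    speechSynthesis.speak(utterance);
  }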

<JohnRochford> That's a much-better user experience.

<JohnRochford> The one with the professor documenting a formula on a chalkboard along with extended audio description that pauses the lecture.

jcraig: is this just a content problem or do we need more user control at playback?
… the lectures and technical discussions work well but entertainment media does not
… turning a 2h movie into 3

JohnRochford: I'm legally blind and love audio descriptions
… maybe we don't want to see a 3h movie, but I would want a lecture as was demonstrated

jcraig: VTT already works in the browsers
… it was not as challenging as expected
… another pro is that it's easy to author (but that's also a con)
… it's easily internationalized
… VTT could be used to augment voiced audio descriptions
… for hybrid voiced and brailled
… it's much more scalable and cheaper but quality control is a concern
… another pro is that most streamed media use manifest formats
… there is a content negotiation happening
… you're only going to get the bits you need
… whereas if it's packaged (EPUB, ...), all the audio tracks increase the file size
… with this approach the assistive technology data does not increase the file size
… to the cons, generated text to speech does not sound as good as studio recorded
… most of the blind community prefers the recorded version if available
… you can screw up internationalization by just translating and keeping the same timing
… some content providers provide low quality tts descriptions
… it's difficult to know how long an utterance will take
… different users have different preferences for speed
… so sometimes timing might not work
… should we clip, compress, etc..
… what happens if it's authored the wrong way?
… this does not yet work with spoken captions
… Apple shipped that on tvOS and iOS several years ago
… where, for deaf/blind users in a country where the content's language is not their primary language
… you could ask the system to speak the subtitles

nigel: it's commonly used in Europe, called spoken subtitles

jcraig: in that case we have issues combining spoken captions and text based descriptions
… the most important con is that there is no way to support Extended Descriptions in live streams
… you could author a VTT file that would pause the live stream
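
On the point above that it is difficult to know how long an utterance will take, a rough heuristic can at least flag cues that are unlikely to fit their window. The 170 words-per-minute baseline is an assumption; real synthesis engines vary by voice, language, and the user's rate preference:

  // Estimate whether a description cue's text can be spoken within its window.
  function fitsWindow(cue: VTTCue, userRate = 1.0, baselineWpm = 170): boolean {
    const words = cue.text.trim().split(/\s+/).length;
    const estimatedSeconds = (words / (baselineWpm * userRate)) * 60;
    return estimatedSeconds <= cue.endTime - cue.startTime;
  }

A player could then decide per cue whether to speak normally, raise the rate slightly, or (for extended descriptions) pause the media until the utterance ends.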

eric: in the current implementation, only one description track can be active at a time
… it's technically possible to have more than one
… this way was more straightforward to implement
… we are looking for feedback

JohnRochford: the first 2 videos that you showed, were they exemplifying the same thing?

eric: the first 2 were to demonstrate that it can work well, with 2 pieces of content that need no pausing
… the first showing of the lecture was to show that standard descriptions were not working
… and the second showing was to demonstrate how they could work

JohnRochford: I have use cases for multiple tracks
… for example to learn a language

eric: that's a good point
… do you know what we would want to do in the case where both tracks have cues that should be active at the same time?
… speak one utterance and then the next one

JohnRochford: it would seem to me that it could be a user preference
… simultaneous or sequential
… a lot of immigrants in the US watch Sesame Street
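
A sketch of the sequential option discussed here: speechSynthesis.speak() appends utterances to a queue, so speaking cues from two active description tracks one after the other falls out naturally, whereas truly simultaneous speech would need capabilities the Web Speech API does not currently provide:

  // Speak the currently active cues from multiple description tracks in order.
  function speakActiveCues(cues: VTTCue[]): void {
    for (const cue of cues) {
      // Each call is queued, so the utterances are spoken sequentially.
      speechSynthesis.speak(new SpeechSynthesisUtterance(cue.text));
    }
  }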

<Zakim> nigel, you wanted to ask how you decide when and by how much to duck the video audio

<Travis> Is this the proposal: https://github.com/WebKit/explainers/tree/main/texttracks ?

nigel: you said that you duck the video/audio
… that's usually done as part of the audio recording process
… so how do you decide in the implementation

eric: this is an experiment
… it's hot off the press, so we haven't spent a lot of time on it

jcraig: we have a user preference for audio ducking for screen reader
… we would probably implement it in a similar way, not specific to Safari

nigel: the use case for setting the amount by which you duck during authoring is that the program audio loudness varies
… you don't want to duck by the same amount

jcraig: we could have hints on the ducking
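
A sketch of what an authored ducking hint might look like. No such hint exists in WebVTT today, so duckLevel below is a hypothetical per-cue value (0 silences the programme audio, 1 leaves it untouched), with a user or system preference as the fallback:

  interface DescriptionCue {
    cue: VTTCue;
    duckLevel?: number;   // hypothetical authored hint, not part of WebVTT
  }

  // Duck the media volume for one description; returns a function that restores it.
  function applyDucking(video: HTMLVideoElement, d: DescriptionCue, userDefault = 0.25): () => void {
    const original = video.volume;
    video.volume = original * (d.duckLevel ?? userDefault);
    return () => { video.volume = original; };
  }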


gkatsev: I have a couple of comments
… 1. as a maintainer of video.js, we've had support for that: exposing descriptions to screen readers, and using text to speech
… Owen Edwards and I worked on that

<gkatsev> https://github.com/OwenEdwards/videojs-speak-descriptions-track

gkatsev: as a video.js plugin
… we ducked the audio to a quarter of what it was
… but if the description would run longer, we would pause

jcraig: is that using the WebSpeech API?

gkatsev: yes
… There is an Elephant's Dream track that you can use publicly
… and this is awesome that you're doing that

jasonjgw: you could use the maximum speech rate of the user
… you could also dynamically vary the speech rate up to the max user level
… you could also pause if this does not work

jcraig: we did talk about compression
… that is, changing the speech rate
… but what do you mean by the user's highest acceptable speech rate?

jasonjgw: the user is going to have a comfort range
… and you don't want to exceed that

jcraig: so you're proposing to expose a new user preference
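
A sketch of jasonjgw's suggestion: raise the speech rate just enough to fit the gap, never beyond the user's ceiling, and fall back to pausing when even that is not enough. The duration estimate reuses the words-per-minute heuristic sketched earlier, so it is only an approximation:

  function chooseRate(cue: VTTCue, userMaxRate: number, baselineWpm = 170): { rate: number; mustPause: boolean } {
    const words = cue.text.trim().split(/\s+/).length;
    const secondsAtNormalRate = (words / baselineWpm) * 60;
    const cueWindow = cue.endTime - cue.startTime;
    const neededRate = secondsAtNormalRate / cueWindow;
    if (neededRate <= 1) return { rate: 1, mustPause: false };
    if (neededRate <= userMaxRate) return { rate: neededRate, mustPause: false };
    return { rate: userMaxRate, mustPause: true };    // clamp the rate and pause the media
  }

The chosen rate can then be assigned to SpeechSynthesisUtterance.rate before speaking.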

jasonjgw: my second question: I've started to look at spatial audio
… and it could help with having 2 audio tracks

<jcraig> great idea

jasonjgw: reading described video in braille in real time could be quite a challenging process
… I'm wondering what the user interface could be?

jcraig: there are braille displays that are more like a full page of braille than a line of braille
… we could have a scrolled buffer that's moving up
… it might depend on the content type

jasonjgw: the 2D braille displays are interesting
… they might speed up the reading process
… but it might not be enough

tink: 1. what type of AD might be wanted
… the context will answer
… with entertainment, we tend to leave the AD on
… but if I'm watching on my own, I might want Extended AD
… with lectures, XAD would be extremely useful

jcraig: do we need a different "kind" value for XAD?
… you just mentioned different contexts

tink: the question was more about the mechanics of it
… isn't the available time just the time between the caption cues?

jcraig: it depends on what audio happens in between

<mbgower> take it from the sound effects track?

jcraig: let's say we're about to overrun and pause
… through machine learning speech recognition or caption data, we could detect what audio is happening at that point

tink: we could also influence the authoring

dsinger: what is the interval of the video to which this description applies
… then it could be a decision of the user to speech over, pause or change reading rate
… but you want to preserve correct timing, so that if the user seeks it works

<Zakim> dsinger, you wanted to say that the time expressed in the cues should not be the time needed, but the time it applies to

jcraig: there are creative uses of audio descriptions; one of the most impressive is being able to listen to the credits at the end of the movie
… don't always assume that the timing of an audio description is accurate; there is a creative aspect

zcorpan: I'm happy to see this implemented
… you said earlier that in the current implementation, only one track is used
… does it apply to AD tracks?

<jcraig> only one description track at a time

eric: only one AD track at a time; you can also have subtitles/captions if you want

zcorpan: a use case I envisage is for deaf/blind users with a braille output

JenniferS: it would be very good if documentation of the various use cases could be created, so that ideation of the solution could progress
… a lot of people don't do braille

<Zakim> nigel, you wanted to react to JenniferS to mention ADPT's expression of requirements and workflow

nigel: there is a thing called the Audio Description Community Group
… we created a document with requirements for authoring AD

cyril_: I want to mention the work of the Timed Text WG on the DAPT work https://w3c.github.io/dapt-reqs/#define-audio-mixing-instructions-ad-process-step-4 a profile of TTML for authoring/exchanging AD

<nigel> ADPT Requirements

nigel: the BBC has a lot of AD
… it's created for Broadcast
… so we could not do XAD

JohnRochford: I have a demo that I'll post
… with applicability to sign language
… as an extension of what you're doing
… have an inset, picture-in-picture ASL

<JohnRochford> http://bit.ly/ASLdemo Users can watch talking-head videos and/or ASL interpreters and/or view closed captioning.

JohnRochford: the deaf community told us that closed captions are not enough because they don't read them well

<jcraig> Chrome implementation. https://chromium-review.googlesource.com/c/chromium/src/+/3810947

<zcorpan> jcraig: can you (or someone else) file a whatwg/html issue about a new <track kind> value?

<zcorpan> descriptions vs extended-descriptions
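
For illustration, the markup difference under discussion, in a short TypeScript sketch: "descriptions" is a valid kind value today, while "extended-descriptions" is only a possible future value (hence the suggestion to file a whatwg/html issue):

  const track = document.createElement('track');
  track.src = 'descriptions.en.vtt';   // hypothetical file name
  track.srclang = 'en';
  track.kind = 'descriptions';         // valid today: cues can be spoken, no pausing implied
  // A future 'extended-descriptions' value could signal that pausing the media
  // until each description finishes is acceptable for this content.
  document.querySelector('video')?.appendChild(track);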

<nigel> Adhere demo of AD in TTML2

<ericc> WebKit patch adding support for standard descriptions: https://github.com/WebKit/WebKit/pull/3486

<ericc> WebKit patch adding support for extended descriptions: https://github.com/WebKit/WebKit/pull/4129

Minutes manually created (not a transcript), formatted by scribe.perl version 192 (Tue Jun 28 16:55:30 2022 UTC).


Maybe present: cyril_, eric, gkatsev, jasonjgw, jcraig, nigel, tink