In general usage, one meaning of the word script is
the written text of a film, television programme, play etc.
A script can be either a record of the completed production,
also known as a transcript,
or a plan for a production yet to be created.
In this document, we use domain-specific terms, and define more specifically that:
- a transcript is the text representation of pre-existing media in another form,
for example the dialogue in a video;
- a script is a text representation of the intended content of media prior to its creation,
for example to guide an actor in recording an audio track.
The term DAPT script is used generically to refer to both transcripts and scripts,
and is a point of conformance to the formal requirements of this specification.
DAPT Scripts consist of timed text and associated metadata,
such as the character speaking.
In dubbing workflows, a transcript is generated and translated to create a script.
In audio description workflows, a transcript describes the video image,
and is then used directly as a script for recording an audio equivalent.
DAPT is a TTML-based format for the exchange of transcripts and scripts
(i.e. DAPT Scripts)
among authoring, prompting and playback tools in the localization and audio description pipelines.
A DAPT document is a serializable form of a DAPT Script designed to carry pertinent information for dubbing or audio description
such as type of DAPT script, dialogue, descriptions, timing, metadata, original language transcribed text, translated text, language information, and audio mixing instructions,
and to be extensible to allow user-defined annotations or additional future features.
This specification defines the data model for DAPT scripts and
its representation as a [TTML2] document (see 4. DAPT Data Model and corresponding TTML syntax)
with some constraints and restrictions (see 5. Constraints).
A DAPT script is expected to be used to make audiovisual media accessible
or localized for users who cannot understand it in its original form,
and to be used as part of the solution for meeting user needs
involving transcripts, including accessibility needs described in [media-accessibility-reqs],
as well as supporting users who need dialogue translated into a different language via dubbing.
The authoring workflow for both dubbing and audio description involves similar stages
that share common requirements, as described in [DAPT-REQS].
In both cases, the author reviews the content and
writes down what is happening, either in the dialogue or in the video image,
alongside the time when it happens.
Further transformation processes can change the text to a different language and
adjust the wording to fit precise timing constraints.
Then there is a stage in which an audio rendering of the script is generated,
for eventual mixing into the programme audio.
That mixing can occur prior to distribution,
or in the client directly.
The dubbing process, which consists of creating a dubbing script,
is a complex, multi-step process involving:
- Transcribing and timing the dialogue in the original language from a completed programme to create a transcript;
- Notating dialogue with character information and other annotations;
- Generating localization notes to guide further adaptation;
- Translating the dialogue to a target language script;
- Adapting the translation for dubbing,
for example to match the actor's lip movements in the case of dubs.
A dubbing script is a transcript or script
(depending on workflow stage) used for
recording translated dialogue to be mixed with the non-dialogue programme audio,
to generate a localized version of the programme in a different language,
known as a dubbed version, or dub for short.
Dubbing scripts can be useful as a starting point for creation of subtitles or closed captions in alternate languages.
This specification is designed to facilitate the addition of, and conversion to,
subtitle and caption documents in other profiles of TTML, such as [TTML-IMSC1.2],
for example by permitting subtitle styling syntax to be carried in DAPT documents.
Alternatively, styling can be applied to assist voice artists when recording scripted dialogue.
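For instance, a DAPT document might carry [TTML2] styling compatible with subtitle profiles; the sketch below uses a hypothetical colour style to help a voice artist spot one character's lines (the style id and colour are illustrative, not defined by this specification):

```xml
<tt xmlns="http://www.w3.org/ns/ttml"
    xmlns:tts="http://www.w3.org/ns/ttml#styling"
    ...>
  <head>
    <styling>
      <!-- illustrative style: highlight one character's lines for the voice artist -->
      <style xml:id="highlight_character_1" tts:color="yellow"/>
    </styling>
  </head>
  <body>
    <div begin="10s" end="13s">
      <p style="highlight_character_1">And thanks to that, we're gonna get rich.</p>
    </div>
  </body>
</tt>
```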
Creating audio description content is also a multi-stage process.
An audio description,
also known as video description
or, in [media-accessibility-reqs], as described video,
is an audio service
to assist viewers who cannot fully see a visual presentation to understand the content.
It is the result of mixing the audio rendition of one or more descriptions
with the audio associated with the programme prior to any mixing with audio description
(sometimes referred to as main programme audio),
at moments when this does not clash with dialogue, to deliver an audio description mixed audio track.
A description is a set of words that describes an aspect of the programme presentation,
suitable for rendering into audio by means of vocalisation and recording
or used as a text alternative source for text to speech translation, as defined in [WCAG22].
More information about what audio description is and how it works can be found at [BBC-WHP051].
Writing the audio description script typically involves:
- watching the video content of the programme,
or series of programmes,
- identifying the key moments during which there is an opportunity to speak descriptions,
- writing the description text to explain the important visible parts of the programme at that time,
- creating an audio version of the descriptions, either by recording a human actor or using text to speech,
- defining mixing instructions (applied using [TTML2] audio styling) for combining the audio with the programme audio.
The audio mixing can occur prior to distribution of the media,
or in the client player.
If the audio description script is delivered to the player,
the text can be used to provide an alternative rendering,
for example on a Braille display,
or using the user's configured screen reader.
DAPT Scripts can be useful in other workflows and scenarios.
For example, Original language transcripts could be used as:
- the output format of a speech to text system, even if not intended for translation,
or for the production of subtitles or captions;
- a document known in the broadcasting industry as a "post production script",
used primarily for preview, editorial review and sales purposes;
Both Original language transcripts and Translated transcripts could be used as:
- an accessible transcript presented alongside audio or video in a web page or application;
in this usage, the timings could be retained and used for synchronisation with,
or navigation within, the media
or discarded to present a plain text version of the entire timeline.
The top level structure of a document is as follows:
- The <tt> root element in the namespace http://www.w3.org/ns/ttml
indicates that this is a TTML document,
and the ttp:contentProfiles attribute indicates that it adheres to
the DAPT content profile defined in this specification.
- The daptm:workflowType attribute indicates the type of workflow.
- The daptm:scriptType attribute indicates the type of transcript or script,
but in this empty example it is not relevant, since only the structure of the document is shown.
The structure is applicable to all types of DAPT scripts, dubbing or audio description.
<tt xmlns="http://www.w3.org/ns/ttml"
xmlns:ttp="http://www.w3.org/ns/ttml#parameter"
xmlns:daptm="http://www.w3.org/ns/ttml/profile/dapt#metadata"
xml:lang="en"
ttp:contentProfiles="http://www.w3.org/ns/ttml/profile/dapt1.0/content"
daptm:workflowType="dubbing"
daptm:scriptType="originalTranscript">
<head>
<metadata>
</metadata>
<styling>
</styling>
<layout>
</layout>
</head>
<body>
</body>
</tt>
The following examples correspond to the timed text transcripts and scripts produced
at each stage of the workflow described in [DAPT-REQS].
The first example shows an early stage transcript in which timed opportunities for descriptions
or transcriptions have been identified but no text has been written:
...
<body>
<div xml:id="d1" begin="10s" end="13s">
</div>
<div xml:id="d2" begin="18s" end="20s">
</div>
</body>
...
The following examples demonstrate different uses in dubbing and audio description workflows.
When descriptions are added, this becomes a Pre-Recording Script:
<tt xmlns="http://www.w3.org/ns/ttml"
xmlns:ttp="http://www.w3.org/ns/ttml#parameter"
xmlns:daptm="http://www.w3.org/ns/ttml/profile/dapt#metadata"
xmlns:xml="http://www.w3.org/XML/1998/namespace"
ttp:contentProfiles="http://www.w3.org/ns/ttml/profile/dapt1.0/content"
daptm:workflowType="audioDescription"
daptm:scriptType="preRecording"
xml:lang="en">
<body>
<div begin="10s" end="13s">
<p daptm:langSrc="original">
A woman climbs into a small sailing boat.
</p>
</div>
<div begin="18s" end="20s">
<p daptm:langSrc="original">
The woman pulls the tiller and the boat turns.
</p>
</div>
</body>
</tt>
After creating audio recordings, if not using text to speech, instructions for playback
mixing can be inserted. For example, the gain of "received" audio can be changed before mixing in
the audio played from inside the span, smoothly
animating the value on the way in and returning it on the way out:
<tt ...
daptm:workflowType="audioDescription"
daptm:scriptType="asRecorded"
xml:lang="en">
...
<div begin="25s" end="28s">
<p daptm:langSrc="original">
<animate begin="0.0s" end="0.3s" tta:gain="1;0.39" fill="freeze"/>
<animate begin="2.7s" end="3s" tta:gain="0.39;1"/>
<span begin="0.3s" end="2.7s">
<audio src="clip3.wav"/>
The sails billow in the wind.</span>
</p>
</div>
...
In the above example, the <div> element's begin time becomes the "syncbase"
for its descendants,
so the times on the <animate> and <span> elements are relative to 25s here.
The first <animate> element drops the gain from 1
to 0.39 over 0.3s, freezing that value after it ends,
and the second one raises it back in the
final 0.3s of this description. The <span> is
timed to begin only after the first audio dip has finished.
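To make the timing concrete, the fragment above can be annotated with the resolved media times (the comments are added for illustration only and are not part of the format):

```xml
<div begin="25s" end="28s">                             <!-- active 25s to 28s -->
  <p daptm:langSrc="original">
    <animate begin="0.0s" end="0.3s" tta:gain="1;0.39"
             fill="freeze"/>                            <!-- 25.0s to 25.3s: dip gain -->
    <animate begin="2.7s" end="3s" tta:gain="0.39;1"/>  <!-- 27.7s to 28.0s: restore gain -->
    <span begin="0.3s" end="2.7s">                      <!-- 25.3s to 27.7s: play description -->
      <audio src="clip3.wav"/>
      The sails billow in the wind.</span>
  </p>
</div>
```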
If the audio recording is long and just a snippet needs to be played,
that can be done using clipBegin
and clipEnd.
If we just want to play the part of the audio file from 5s to
8s, it would look like:
...
<span>
<audio src="long_audio.wav" clipBegin="5s" clipEnd="8s"/>
A woman climbs into a small sailing boat.</span>
...
Or audio attributes can be added to trigger the text to be spoken:
...
<div begin="18s" end="20s">
<p daptm:langSrc="original">
<span tta:speak="normal">
The woman pulls the tiller and the boat turns.</span>
</p>
</div>
...
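Other [TTML2] audio style attributes can be combined with tta:speak; as a sketch, and assuming client support, tta:pan could position the synthesised voice in the stereo image (the value shown is illustrative):

```xml
...
<div begin="18s" end="20s">
  <p daptm:langSrc="original">
    <!-- tta:pan ranges from -1 (full left) to 1 (full right) -->
    <span tta:speak="normal" tta:pan="-0.5">
      The woman pulls the tiller and the boat turns.</span>
  </p>
</div>
...
```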
It is also possible to embed the audio directly,
so that a single document contains the script and
recorded audio together:
...
<div begin="25s" end="28s">
<p daptm:langSrc="original">
<animate begin="0.0s" end="0.3s" tta:gain="1;0.39" fill="freeze"/>
<animate begin="2.7s" end="3s" tta:gain="0.39;1"/>
<span begin="0.3s" end="2.7s">
<audio><source><data type="audio/wave">
[base64-encoded audio data]
</data></source></audio>
The sails billow in the wind.</span>
</p>
</div>
...
From the basic structure of Example 1, a transcription of the audio programme produces an original language dubbing script,
which can look as follows. No specific style or layout is defined; here the focus is on the transcription of the dialogue.
Characters are identified within the <metadata> element.
<tt xmlns="http://www.w3.org/ns/ttml"
xmlns:ttp="http://www.w3.org/ns/ttml#parameter"
xmlns:ttm="http://www.w3.org/ns/ttml#metadata"
xmlns:daptm="http://www.w3.org/ns/ttml/profile/dapt#metadata"
xml:lang="fr"
ttp:contentProfiles="http://www.w3.org/ns/ttml/profile/dapt1.0/content"
daptm:workflowType="dubbing"
daptm:scriptType="originalTranscript">
<head>
<metadata>
<ttm:agent type="character" xml:id="character_1">
<ttm:name type="alias">ASSANE</ttm:name>
</ttm:agent>
</metadata>
</head>
<body>
<div begin="10s" end="13s">
<p daptm:langSrc="original" ttm:agent="character_1">
<span>Et c'est grâce à ça qu'on va devenir riches.</span>
</p>
</div>
</body>
</tt>
After the text is translated, the document is modified to include the translated text;
in this case the original text is preserved as well. The main document language is changed to indicate
that the focus is on the translated language:
<tt xmlns="http://www.w3.org/ns/ttml"
xmlns:ttp="http://www.w3.org/ns/ttml#parameter"
xmlns:ttm="http://www.w3.org/ns/ttml#metadata"
xmlns:daptm="http://www.w3.org/ns/ttml/profile/dapt#metadata"
xml:lang="en"
ttp:contentProfiles="http://www.w3.org/ns/ttml/profile/dapt1.0/content"
daptm:workflowType="dubbing"
daptm:scriptType="translatedTranscript">
<head>
<metadata>
<ttm:agent type="character" xml:id="character_1">
<ttm:name type="alias">ASSANE</ttm:name>
</ttm:agent>
</metadata>
</head>
<body>
<div begin="10s" end="13s" ttm:agent="character_1">
<p xml:lang="fr" daptm:langSrc="original">
<span>Et c'est grâce à ça qu'on va devenir riches.</span>
</p>
<p xml:lang="en" daptm:langSrc="translation">
<span>And thanks to that, we're gonna get rich.</span>
</p>
</div>
</body>
</tt>
The process of adaptation, before recording, could adjust the wording and/or add further timing to assist in the recording.
The daptm:scriptType
attribute is also modified, as in the following example:
<tt xmlns="http://www.w3.org/ns/ttml"
xmlns:ttp="http://www.w3.org/ns/ttml#parameter"
xmlns:ttm="http://www.w3.org/ns/ttml#metadata"
xmlns:daptm="http://www.w3.org/ns/ttml/profile/dapt#metadata"
xml:lang="en"
ttp:contentProfiles="http://www.w3.org/ns/ttml/profile/dapt1.0/content"
daptm:workflowType="dubbing"
daptm:scriptType="preRecording">
<head>
<metadata>
<ttm:agent type="character" xml:id="character_1">
<ttm:name type="alias">ASSANE</ttm:name>
</ttm:agent>
</metadata>
</head>
<body>
<div begin="10s" end="13s" ttm:agent="character_1" daptm:onScreen="ON_OFF">
<p xml:lang="fr" daptm:langSrc="original">
<span>Et c'est grâce à ça qu'on va devenir riches.</span>
</p>
<p xml:lang="en" daptm:langSrc="translation">
<span begin="0s">And thanks to that,</span><span begin="1.5s"> we're gonna get rich.</span>
</p>
</div>
</body>
</tt>