In order to enable the search and retrieval of video from large archives, we need a representation of video content. Although some aspects of video can be automatically parsed, a detailed representation requires that video be annotated. We discuss the design criteria for a video annotation language with special attention to the issue of creating a global, reusable video archive. We outline in detail the iconic visual language we have developed and a stream-based representation of video data.
Our prototype system, Media Streams, enables users to create multi-layered, iconic annotations of video content. Within Media Streams, the organisation and categories of the Director's Workshop allow users to browse and compound over 2500 iconic primitives by means of a cascading hierarchical structure which supports compounding icons across branches of the hierarchy. Icon Palettes enable users to group related sets of iconic descriptors, use these descriptors to annotate video content, and reuse descriptive effort. Media Time Lines enable users to visualise and browse the structure of video content and its annotations. Special attention is given to the problems of creating a representation of action for video and of describing transitions in video.
1 Introduction: the need for video annotation
1.1 Video annotation today
1.2 Video annotation tomorrow
2 Design criteria for video annotation languages
3 Representing video
3.1 Streams vs. clips
3.2 Categories for media annotation
3.3 Video syntax and semantics
4 Media Streams: an overview
5 Why icons?
6 Director's Workshop
6.1 A language for action
6.2 Character actions and object actions
6.3 Characters and objects
6.4 Relative positions
6.5 Mise-en-scene: time, space, and weather
6.6 Cinematography
6.7 Recording medium
6.8 Screen positions
6.9 Thoughts
6.10 Transitions
6.11 Extensibility of the icon language
7 Media Time Lines
8 Conclusions and future work
Acknowledgements
References
The central problem in the creation of robust and extensible systems for manipulating video information lies in representing and visualising video content. Currently, content providers possess large archives of film and video for which they lack sufficient tools for search and retrieval. For the types of applications that will be developed in the near future (interactive television, personalised news, video on demand, etc.) these archives will remain a largely untapped resource, unless we are able to access their contents. Without a way of accessing video information in terms of its content, a thousand hours of video is less useful than one. With one hour of video, its content can be stored in human memory, but as we move up in orders of magnitude, we need to find ways of creating machine-readable and human-usable representations of video content. It is not simply a matter of cataloguing reels or tapes, but of representing and manipulating the content of video at multiple levels of granularity and with greater descriptive richness. This paper attempts to address that challenge.
Given the current state of the art in machine vision and image processing, we cannot now, and probably will not be able to for a long time, have machines 'watch' and understand the content of digital video archives for us. Unlike text, for which we have developed sophisticated parsing technologies, and which is accessible to processing in various structured forms (ASCII, RTF, PostScript), video is still largely opaque. We are currently able to automatically analyse scene breaks, pauses in the audio, and camera pans and zooms (41, 21, 31, 33, 34, 38, 39), yet this information alone does not enable the creation of a sufficiently detailed representation of video content to support content-based retrieval and repurposing.
In the near term, it is computer-supported human annotation that will enable video to become a rich, structured data type. At this juncture, the key challenge is to develop solutions for people who already devote time and money to annotating video, because they will help create the necessary infrastructure (both economically and in terms of the content itself) to support the ubiquitous use and reuse of video information. Today, simple queries often take tens of hours and cost thousands of dollars. If recorded reusable video is going to become a ubiquitous medium of daily communication, we will need to develop technologies which will change the current economics of annotation and retrieval.
In developing a structured representation of video content for use in annotation and retrieval of video from large archives, it is important to understand the current state of video annotation and to create specifications for how future annotation systems should be able to perform. Consequently, we can posit a hierarchy of the efficacy of annotations:
At worst, only Pat should be able to use Pat's annotations.
Slightly better, Chris should be able to use Pat's annotations.
Even better, Chris' computer should be able to use Pat's annotations.
At best, Chris' computer and Chris should be able to use Pat's and Pat's computer's annotations.
In the main, video has been archived and retrieved as if it were a non-temporal data type which could be adequately represented by keywords. A good example of this approach can be seen in Apple Computer's Visual Almanac, which describes and accesses the contents of its video and image archive by use of 'keywords' and 'image keys' (4). This technique succeeds in retrieving matches for fairly underspecified searches, but it lacks the granularity and descriptive richness necessary for computer-assisted and automatic video retrieval and repurposing. The keyword approach is inadequate for representing video content: keywords cannot capture the temporal structure of video, the relations among descriptions, or the multiple levels of granularity at which content needs to be described.
Today, in organisations and companies around the world whose business it is to annotate, archive, and retrieve video information, the structure of the data is, by and large, represented in the memories of the human beings whose job it is to handle it. Even in situations in which keyword-based computer annotation systems are 'used,' short-term and long-term human memory are the real repositories of information about the content of video data. 'Joe and Jane in the basement' are the real indexing and retrieval mechanisms in almost all video archives. Human memory is very good at retrieving video because of its associative and analogical capabilities; it has memory structures which any computerised retrieval system would want to emulate. Nevertheless, there are significant problems in sharing the contents of one human memory with others and in transferring the contents of one human memory to another. There are also severe limitations in the storage capacity and speed of human memory that aren't acceptable if we are going to scale up to a global media archive in which video is accessed and manipulated by millions of people every day.
We need to create a language for the representation of video content which enables us to combine automatic, semi-automatic, and human annotation so as to be able to make use of today's annotation effort long into the future.
In the near future, we can imagine a world in which video annotation, search, and retrieval are conducted not just by professionals for professionals, but by anyone interested in repurposing footage. In a world where digital media are produced anywhere by anyone and are accessible to anyone anywhere, video will need to accrete layers of content annotations as it moves around the globe throughout its life cycle of use and reuse. In the future, annotation, both automatic and semi-automatic, will need to be fully integrated into the production, archiving, retrieval, and reuse of video and audio data. In production, cameras will encode and interpret detailed information about where, when, and how they are recording and attach that information to the digital data stream: global satellite locators will indicate altitude, longitude, and latitude; time will be stamped into the bit stream; and other types of sensing data - temperature, humidity, wind - as well as how the camera moves (pans, zooms, etc.) and how far away the camera is from its subjects (range data, for example) will all provide useful layers of annotation of the stream of video and audio data which the camera produces. Still, there will be many other annotations of a more semantic nature which these cameras won't be able to encode automatically, and for which we will want formats so that humans working with machines can easily annotate video content. In a sense, the challenge is to develop a language of description which both humans and computers can read and write, and which will enable the integrated description and creation of video data. Such a language would satisfy the fourth desideratum of video annotation (Chris' computer and Chris should be able to use Pat's and Pat's computer's annotations).
By having a structured representation of video content - meaningful bits about the bits - future annotation and retrieval technology will enable users to mix video streams according to their contents and to manipulate video at various levels of granularity. With this kind of representation, annotation, and retrieval technology we will create tools which enable users to operate on higher level content structures of video data instead of being stuck with just bits, pixels, frames, or clips.
A language for video annotation needs to support the visualisation and browsing of the structure of video content as well as search and retrieval of video content. There has been some excellent work in visualising and browsing video data (37, 40, 31, 33, 21) with which our work has affinity. The limitations of these systems lie in their scalability and, relatedly, in their lack of a developed video annotation language. As visualisation and browsing interfaces come to accommodate larger and larger video databases, they need to work with video according to its content as well as its structure, and hence annotation and retrieval become necessary components of the system.
A video annotation language needs to create representations that are durable and sharable. The knowledge encoded in the annotation language needs to extend in time longer than one person's memory or even a collective memory, and needs to extend in space across continents and cultures. Today, and increasingly, content providers have global reach. German news teams may shoot footage in Brazil for South Korean television, footage which is then accessed by American documentary film makers, perhaps ten years later. We need a global media archiving system that can be added to and accessed by people who do not share a common language, and whose contents are not known only to a few people working in the basements of news reporting and film production facilities. Visual languages may enable the design of an annotation language with which we can create a truly global media resource. Unlike other visual languages that are used internationally (e.g., for traffic signage, operating instructions on machines, etc. (18)), a visual language for video annotation can take advantage of the affordances of the computer medium. We can develop visual languages for video that utilise colour, animation, variable resolution, and sound in order to create durable and sharable representations of video content.
In designing a visual language for video content we must think about the structure of what is being represented. A video camera produces a temporal stream of image and sound data represented as a sequence of frames played back at a certain rate - normally 30 frames per second. Traditionally, this stream of frames is segmented into units called clips. Current tools for annotating video content used in film production, television production, and multimedia, add descriptors (often keywords) to clips. There is a significant problem with this approach. By taking an incoming video stream, segmenting it into various clips, and then annotating the content of those clips, we create a fixed segmentation of the content of the video stream. Imagine a camera recording a sequence of 100 frames.
Stream of 100 frames of video
Traditionally, one or more parts of the stream of frames would be segmented into clips which would then be annotated by attaching descriptors. The clip is a fixed segmentation of the video stream that separates the video from its context of origin and encodes a particular chunking of the original data.
A 'clip' from frame 47 to frame 68 with descriptors
In our representation, the stream of frames is left intact and is annotated by multi-layered annotations with precise time indexes (beginning and ending points in the video stream). Annotations could be made within any of the various categories for media annotation discussed below (e.g., characters, spatial location, camera motion, dialogue, etc.) or contain any data the user may wish. The result is that this representation makes annotation pay off - the richer the annotation, the more numerous the possible segmentations of the video stream. Clips change from being fixed segmentations of the video stream, to being the results of retrieval queries based on annotations of the video stream. In short, in addressing the challenges of representing video for large archives what we need are representations which make clips, not representations of clips.
The stream of 100 frames of video with 6 annotations resulting in 66 possible segmentations of the stream (i.e., 'clips')
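To make the stream-based approach concrete, the following Python sketch (illustrative only; Media Streams itself is implemented in Macintosh Common Lisp and FRAMER, and all class and function names here are ours) shows time-indexed annotations layered over an intact frame stream, with 'clips' produced as the results of retrieval queries. The final lines check the arithmetic behind the '66 possible segmentations' figure, reading it as the number of ways to choose two of the twelve annotation boundary points.

```python
from dataclasses import dataclass
from math import comb

@dataclass
class Annotation:
    category: str      # e.g. 'character', 'spatial location', 'camera motion'
    descriptor: str    # e.g. an iconic descriptor such as 'adult male'
    start: int         # first frame covered by the annotation
    end: int           # last frame covered by the annotation (inclusive)

# The video stream itself stays intact; annotations are layered on top of it.
annotations = [
    Annotation('character', 'adult male', 0, 47),
    Annotation('character action', 'walking', 10, 47),
    Annotation('spatial location', 'city street', 0, 68),
    Annotation('camera motion', 'pan right', 20, 35),
    Annotation('time', 'night', 0, 99),
    Annotation('weather', 'rainy', 0, 99),
]

def make_clip(category, descriptor):
    """A 'clip' is not stored: it is the result of a retrieval query."""
    return [(a.start, a.end) for a in annotations
            if a.category == category and a.descriptor == descriptor]

print(make_clip('camera motion', 'pan right'))   # -> [(20, 35)]

# Each annotation contributes a begin point and an end point; any pair of the
# resulting 12 points can bound a retrievable segment: C(12, 2) = 66 'clips'.
print(comb(12, 2))                               # -> 66
```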
A central question in our research is the development of a minimal representation of video content. This has resulted in the development of a set of categories for, and a way of thinking about, describing video content. Let us build up these categories from examining the qualities of video as a medium. One of the principal things that makes video unique is that it is a temporal medium. Any language for annotating the content of video must have a way of talking about temporal events - the actions of humans and objects in space over time. Therefore, we also need a way of talking about the characters and objects involved in actions as well as their setting, that is, the spatial location, temporal location, and weather/lighting conditions. The objects and characters involved in actions in particular settings also have significant positions in space relative to one another (beneath, above, inside, outside, etc.).
These categories - actions, characters, objects, locations, times, and weather - would be nearly sufficient for talking about actions in the world, but video is a recording of actions in the world by a camera, and any representation of video content must address further specific properties. First, we need ways of talking about cinematographic properties, the movement and framing of the camera recording events in the world. We also need to describe the properties of the recording medium itself (film or video, colour or black & white, graininess, etc.). Furthermore, in video, viewers see events depicted on screens, and therefore, in addition to relative positions in space, screen objects have a position in the two-dimensional grid of the frame and in the various layered vertical planes of the screen depth. Finally, video recordings of events can be manipulated as objects and rearranged. We create transitions in video in ways not possible in the real world. Therefore, cinematic transitions must also be represented in an annotation language for video content.
These categories are not sufficient for media annotation (the range of potential things one can say is unbounded), but we believe they are necessary categories for media annotation if it is to support retrieval and reuse of particular segments of video data from an annotated stream.
These minimal annotation categories attempt to represent information about media content that can function as a substrate on which richer, more specialised descriptions and retrieval tools can be built.
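Read as a schema, the categories enumerated above can be laid out as follows. This is an illustrative Python sketch, not the FRAMER representation used by Media Streams; the grouping into 'world', 'recording', and 'editing' simply reflects the distinctions drawn in the text between actions in the world, properties of the recording, and cinematic rearrangement.

```python
# Necessary (not sufficient) categories for annotating a video stream.
ANNOTATION_CATEGORIES = {
    'world': [
        'actions', 'characters', 'objects',
        'relative positions', 'spatial location',
        'temporal location', 'weather/lighting',
    ],
    'recording': [
        'cinematography',      # camera movement and framing
        'recording medium',    # film/video, colour/black & white, graininess
        'screen positions',    # 2-D frame position and layered screen depth
    ],
    'editing': [
        'transitions',         # temporal, spatial, and visual transitions
    ],
}

def group_of(category: str) -> str:
    """Return which group an annotation category belongs to."""
    for group, members in ANNOTATION_CATEGORIES.items():
        if category in members:
            return group
    raise ValueError(f'unknown annotation category: {category}')

print(group_of('cinematography'))   # -> 'recording'
```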
In attempting to create a representation of video content, an understanding of the semantics and syntax of video information is a primary concern. Video has a radically different semantic and syntactic structure than text, and attempts to represent video and index it in ways similar to text will suffer serious problems.
First of all, it is important to realise that video images have very little intrinsic semantics. Syntax is highly determinative of their semantics, as evidenced by the Kuleshov Effect (30). The Kuleshov Effect is named after Lev Kuleshov, a Soviet cinematographer whose work at the beginning of the century deeply influenced the Soviet montage school and all later Soviet cinema (19, 20). Kuleshov was an engineer by training who, having worked on only one film, ended up heading the Soviet film school after the October Revolution. He was fascinated by the ability of cinema to create artificial spaces and objects through montage (editing), by virtue of the associations viewers form when watching a sequence of shots - associations which would not arise if the shots were seen out of sequence. In the classic example, Kuleshov showed the following sequences to an audience:
the face of the actor - a bowl of soup - go to black
the same face of the actor - a coffin - go to black
the same face of the actor - a field of flowers - go to black.
Although the footage of the actor's face was identical in each sequence, audiences read a different emotion into it - hunger, grief, delight - depending on the image that followed.
The syntax of video sequences determines the semantics of video data to such a degree that any attempts to create context-free semantic annotations for video must be carefully scrutinised so as to determine which components are context-dependent and which preserve their basic semantics through recombination and repurposing. Any indexing or representational scheme for the content of video information needs to be able to facilitate our understanding of how the semantics of video changes when it is resequenced into new syntactic structures. Therefore, the challenge is twofold: to develop a representation of those salient features of video which, when combined syntactically, create new meanings; and to represent those features which do not radically change when recontextualised.
Over the past two years, members of the MIT Media Laboratory's Learning and Common Sense Section (Marc Davis with the assistance of Brian Williams and Golan Levin under the direction of Prof. Kenneth Haase) have been building a prototype for the annotation and retrieval of video information. This system is called Media Streams. Media Streams has developed into a working system that will soon be used by other researchers at the Media Lab and in various projects in which content-annotated temporal media are required. Media Streams is written in Macintosh Common Lisp (2) and FRAMER (25, 24), a persistent framework for media annotation and description that supports cross-platform knowledge representation and database functionality. Media Streams has its own Lisp interface to Apple's QuickTime digital video system software (3). Media Streams is being developed on an Apple Macintosh Quadra 950 with three high resolution colour displays.
The system has three main interface components: the Director's Workshop (see figure 1); icon palettes (see figure 2); and media time lines (see figure 3). The process of annotating video in Media Streams using these components involves a few simple steps:
1) Using the Director's Workshop, the user creates iconic descriptors by cascading through the hierarchies of iconic primitives and compounding them.
2) As the user creates iconic descriptors, they accumulate on one or more icon palettes. This process effectively groups related iconic descriptors. The user builds up icon palettes for various types of default scenes in which iconic descriptors are likely to co-occur; for example, an icon palette for 'treaty signings' would contain icons for certain dignitaries, a treaty, journalists, the action of writing, a stateroom, etc.
3) By dragging iconic descriptors from icon palettes and dropping them onto a media time line, the user annotates the temporal media represented in the media time line. Once dropped onto a media time line, an iconic description extends from its insertion point in the video stream to either a scene break or the end of the video stream. In addition to dropping individual icons onto the media time line, the user can construct compound icon sentences by dropping certain 'glommable' icons onto the Media Time Line; when completed, these compounds are added to the relevant Icon Palette and may themselves be used as primitives. For example, the user initially builds up the compound icon sentence for 'Arnold, an adult male, wears a jacket' by successively dropping its component icons onto the media time line, and then has the compound icon on an icon palette to use in later annotation. By annotating various aspects of the video stream (time, space, characters, character actions, camera motions, etc.), the user constructs a multi-layered, temporally indexed representation of video content.
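As a rough illustration of how 'glommable' icons compose into an icon sentence such as 'Arnold, an adult male, wears a jacket' (the subject-action-object form discussed below), here is a hypothetical Python sketch; the actual compounding grammar and icon set are those of the Director's Workshop, not this code.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Icon:
    name: str          # e.g. 'Arnold (adult male)', 'wears', 'jacket'
    role: str          # 'subject', 'action', or 'object'

@dataclass(frozen=True)
class IconSentence:
    subject: Icon
    action: Icon
    obj: Optional[Icon] = None   # subject-action sentences have no object

    def label(self) -> str:
        parts = [self.subject.name, self.action.name]
        if self.obj:
            parts.append(self.obj.name)
        return '-'.join(parts)

# Successively 'glomming' three icons yields a compound icon sentence,
# which can then be placed on an icon palette and reused as a primitive.
arnold = Icon('Arnold (adult male)', 'subject')
wears = Icon('wears', 'action')
jacket = Icon('jacket', 'object')

compound = IconSentence(arnold, wears, jacket)
print(compound.label())   # -> 'Arnold (adult male)-wears-jacket'
```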
There have been serious efforts to create iconic languages to facilitate global communication (7) and provide international standard symbols for specific domains (18). We developed Media Streams' iconic visual language to meet the needs of annotating video content in large archives. It seeks to enable quick recognition and browsing of content annotations, their use across languages and cultures, and the visualisation of the dense, multi-layered structure of video content.
The iconic language gains expressive power and range from the compounding of primitives, and it has set grammars of combination for various categories of icons. In Korfhage's sense, Media Streams is an iconic language as opposed to being merely an iconography (28). Similar to other syntaxes for iconic sentences (13, 35), icon sentences for actions have the form subject-action, subject-action-object, or subject-action-direction, while those for relative positions have the form subject-relative position-object. Icon sentences for screen positions are of the form subject-screen position, while cinematographic properties are of the form camera-movement-object (analogous to subject-action-object), as in 'the camera-is tracking-Steve' or 'the camera-zooms in on-Sally.'
It is also important to note that the icon hierarchy of the Director's Workshop is structured not as a tree, but as a graph. The same iconic primitives can often be reached by multiple paths. The system encodes the paths users take to get to these primitives; this enriches the representation of the compounds which are constructed out of these primitives. This is especially useful in the organisation of object icons, in which, for example, the icon for 'blow-dryer' may be reached under 'hand-held device,' 'heat-producing device,' or 'personal device.' These paths are also very important in retrieval, because they can guide generalisation and specialisation of search criteria by functioning as a semantic net of hierarchically organised classes, subclasses, and instances.
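The point that the icon hierarchy is a graph rather than a tree, and that the paths taken to a primitive both enrich its compounds and guide generalisation during retrieval, can be sketched as follows. The structure and category names below are hypothetical, not the system's actual icon graph.

```python
# Parent categories per primitive: the same primitive ('blow-dryer') is
# reachable by multiple paths, so the structure is a graph, not a tree.
parents = {
    'blow-dryer': ['hand-held device', 'heat-producing device', 'personal device'],
    'hand-held device': ['device'],
    'heat-producing device': ['device'],
    'personal device': ['device'],
    'device': ['object'],
    'object': [],
}

def ancestors(node):
    """All more general categories reachable from a primitive."""
    seen = set()
    frontier = [node]
    while frontier:
        current = frontier.pop()
        for parent in parents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                frontier.append(parent)
    return seen

# Generalising a query: a search for 'blow-dryer' can widen to any of the
# classes along the paths the annotator used to reach the primitive.
print(sorted(ancestors('blow-dryer')))
# -> ['device', 'hand-held device', 'heat-producing device', 'object', 'personal device']
```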
The central problem of a descriptive language for temporal media is the representation of dynamic events. For video in particular, the challenge is to come up with techniques for representing and visualising the complex structure of the actions of characters, objects, and cameras. There exists significant work on normalising temporal events in order to support inferencing about their interrelationships (1) and on facilitating the compression and retrieval of image sequences by indexing temporal and spatial changes (5, 16, 17). Our work creates a representation of cinematic action to which these and other techniques could usefully be applied. Even if we had robust machine vision, temporal and spatial logics would still require a representation of the video content, because such a representation determines the units these formalisations operate on for indexing, compression, retrieval, and inferencing.
A representation of cinematic action for video retrieval and repurposing needs to focus on the granularity, reusability, and semantics of its units. In representing the action of bodies in space, the representation needs to support the hierarchical decomposition of its units both spatially and temporally. Spatial decomposition is supported by a representation that hierarchically orders the bodies, and their parts, which participate in an action. For example, in a complex action like driving an automobile, the arms, head, eyes, and legs all function independently. Temporal decomposition is enabled by a hierarchical organisation of units, such that longer sequences of action can be broken down into their temporal subabstractions all the way down to their atomic units. In (29), Lenat points out that a representation of events needs more than purely temporal structure: it should include semantically relevant atomic units organised into various temporal patterns (repeated cycles, scripts, etc.). For example, the atomic unit of 'walking' would be 'taking a step,' which repeats cyclically. An atomic unit of 'opening a jar' would be 'turning the lid' (which itself could theoretically be broken down into smaller units - but much of the challenge of representing action is knowing what levels of granularity are useful).
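A hedged sketch of the kind of temporal decomposition described here: a longer action is broken into sub-units down to atomic units, some of which repeat cyclically. The unit names and the chosen granularity are ours, for illustration only.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ActionUnit:
    name: str
    cyclic: bool = False                 # e.g. 'taking a step' repeats cyclically
    subunits: List['ActionUnit'] = field(default_factory=list)

    def atomic_units(self) -> List[str]:
        """Flatten the hierarchy down to its atomic units."""
        if not self.subunits:
            return [self.name]
        atoms = []
        for unit in self.subunits:
            atoms.extend(unit.atomic_units())
        return atoms

walking = ActionUnit('walking', subunits=[ActionUnit('taking a step', cyclic=True)])
opening_jar = ActionUnit('opening a jar', subunits=[
    ActionUnit('gripping the lid'),        # illustrative sub-unit
    ActionUnit('turning the lid', cyclic=True),
])

print(walking.atomic_units())       # -> ['taking a step']
print(opening_jar.atomic_units())   # -> ['gripping the lid', 'turning the lid']
```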
Our approach tries to address these issues in multiple ways with special attention paid to the problems of representing human action as it appears in video. It is important to note in this regard - and this holds true for all aspects of representing the content of video - that unlike the project of traditional knowledge representation which seeks to represent the world, our project is to represent a representation of the world. This distinction has significant consequences for the representation of human action in video. In video, actions and their units do not have a fixed semantics, because their meaning can shift as the video is recut and inserted into new sequences (30, 27). For example, a shot of two people shaking hands, if positioned at the beginning of a sequence depicting a business meeting, could represent 'greeting,' if positioned at the end, the same shot could represent 'agreeing.' Video brings to our attention the effects of context and order on the meaning of represented action. In addition, the prospect of annotating video for a global media archive brings forward an issue which traditional knowledge representation has largely ignored: cultural variance. The shot of two people shaking hands may signify greeting or agreeing in some cultures, but in others it does not. How are we to annotate shots of people bowing, shaking hands, waving hello and good-bye? The list goes on. In order to address the representational challenges of action in video we do not explicitly annotate actions according to their particular semantics in a given video stream (a shot of two people shaking hands is not annotated as 'greeting' or alternately as 'agreeing'), but rather according to the motion of objects and people in space. We annotate using physically-based description in order to support the reuse of annotated video in different contexts - be they cinematic or cultural ones. We create analogy mappings between these physically-based annotations in their concrete contexts in order to represent their contextual synonymy or lack thereof.
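To illustrate the distinction between physically-based annotation and contextual interpretation, here is a hypothetical sketch: the stream is annotated only with the motion ('two people shaking hands'), and separate analogy mappings record what that motion signifies in particular cinematic or cultural contexts. The mapping keys and readings are examples drawn from the text, not an exhaustive account.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PhysicalAnnotation:
    description: str    # physically-based: motion of people/objects in space
    start: int
    end: int

handshake = PhysicalAnnotation('two people shaking hands', 120, 180)

# Contextual interpretations are kept out of the stream annotation itself and
# recorded as mappings from (physical description, context) to a reading.
analogy_mappings = {
    ('two people shaking hands', 'start of business-meeting sequence'): 'greeting',
    ('two people shaking hands', 'end of business-meeting sequence'): 'agreeing',
    # In some cultures the same motion carries neither reading; the mapping
    # is simply absent rather than the annotation being changed.
}

def interpret(annotation: PhysicalAnnotation, context: str):
    return analogy_mappings.get((annotation.description, context),
                                'no conventional reading in this context')

print(interpret(handshake, 'end of business-meeting sequence'))
# -> 'agreeing'
print(interpret(handshake, 'formal greeting in another culture'))
# -> 'no conventional reading in this context'
```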
Object actions are subdivided horizontally into actions involving a single object, two objects, or groups of objects (see figure 5). Each of these is divided according to object motions and object state changes. For example, the action of a ball rolling is an object motion; the action of a ball burning is an object state change.
We represent actions for characters and objects separately in the Director's Workshop because of the unique actions afforded by the human form. Our icons for action are animated, which takes advantage of the affordances of iconography in the computer medium as opposed to those of traditional graphic arts.
Objects are subdivided vertically into various types of objects and number of objects.
Space is subdivided vertically into geographical space (land, sea, air, and outer space), functional space (buildings, public outdoor spaces, wilderness, and vehicles), and topological space (inside, outside, above, behind, underneath, etc.).
Weather is subdivided vertically into moisture (clear, partly sunny, partly cloudy, overcast, rainy, and snowy) and wind (no wind, slight wind, moderate wind, and heavy wind) (see figure 9). Temperature is not something that can be directly seen. A video of a cold clear day may look exactly like a video of a hot clear day. It is the presence of snow or ice that indirectly indicates the temperature.
We use these icons to represent two very different types of time, space, and weather on a Media Time Line: the actual time, space, and weather of the recorded video and the visually inferable time, space, and weather of the video. The difference can be made clear with the following example. Imagine a shot of a dark alley in Paris that looks like a generic dark alley of any industrialised city in Europe (it has no distinguishing signs in the video image which would identify it as a Parisian dark alley). The actual recorded time, space, and weather for this shot differ from its visually inferable time, space, and weather. This distinction is vital to any representation for reusable archives of video data, because it captures both the scope within which a piece of video can be reused and the representativeness of a piece of video, i.e., the fact that some shots are more representative of their actual recorded time, space, and weather than others.
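A minimal sketch, assuming illustrative field names, of keeping both the actual and the visually inferable setting for a shot, and of using the inferable setting to decide whether a shot can be reused for a wanted setting:

```python
from dataclasses import dataclass

@dataclass
class Setting:
    time: str
    space: str
    weather: str

@dataclass
class ShotSetting:
    actual: Setting        # where/when the footage was really recorded
    inferable: Setting     # what a viewer can read off the image itself

    def reusable_as(self, wanted: Setting) -> bool:
        """A shot can stand in for any setting consistent with what is visible."""
        return (self.inferable.time == wanted.time and
                self.inferable.space == wanted.space and
                self.inferable.weather == wanted.weather)

paris_alley = ShotSetting(
    actual=Setting('1993-06-04 23:10', 'alley, Paris', 'clear'),
    inferable=Setting('night', 'dark alley, industrialised European city', 'clear'),
)

wanted = Setting('night', 'dark alley, industrialised European city', 'clear')
print(paris_alley.reusable_as(wanted))   # -> True
```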
Consider, for example, a simple two-shot elevator sequence:
Shot 1: Person enters elevator, elevator doors close
Shot 2: Elevator doors open, person exits elevator
The viewer infers that a certain amount of time has passed and that a certain type of spatial translation has occurred. Noel Burch has developed a systematic categorisation of spatio-temporal transitions between shots in cinema (11). He divides temporal transitions into continuous, forward ellipses in time of a determinate length, forward ellipses of an indeterminate length, and the corresponding transitions in which there is a temporal reversal. Spatial transitions are divided into continuous, transitions in which spatial proximity is determinate, and transitions in which spatial proximity is indeterminate. Burch's categorisation scheme was used by Gilles Bloch in his groundbreaking work in the automatic construction of cinematic narratives (8). We adopt and extend Burch's categorisation of shot transitions by adding 'temporal overlaps' as a type of temporal transition and the category of 'visual transitions' for describing transition effects which unlike traditional cuts, can themselves have a duration (icons for transition effects which have durations are animated icons). In the Director's Workshop, we horizontally subdivide transitions between shots according to temporal transitions, spatial transitions, and visual transitions (cuts, wipes, fades, etc.) (see figure 12).
When a transition icon is dropped on the Media Time Line, Media Streams creates a compound icon in which the first icon is an icon-sized (32 x 32 pixels, 24 bits deep) QuickTime Movie containing the first shot, the second icon is the transition icon, and the third icon is an icon-sized QuickTime Movie containing the shot after the transition. So returning to our example of the two-shot elevator sequence, the compound icons would be as follows:
Temporal transition (forward temporal ellipsis of a determinate length)
Spatial transition (spatial translation of a determinate proximity)
Visual transition (simple cut with no duration)
We intend to use transition icons to improve Media Streams' knowledge about the world and to facilitate new forms of analogical retrieval. A search using the icons above would enable the user to find a 'matching' shot in the following way. The user could begin with a shot of a person getting into an automobile and use one or more of the transition icons as analogical search guides in order to retrieve a shot of the person exiting the automobile in a nearby location. The query expresses the idea 'find me a Shot B which has a similar relation to Shot A as Shot D has to Shot C.'
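The retrieval idea can be sketched as follows: each transition is recorded as a (shot, transition categories, shot) record, and an analogical query looks for a Shot B whose relation to a given Shot A matches the relation observed between an example pair. The names and the matching criterion (identical temporal, spatial, and visual categories) are illustrative assumptions, not the system's actual matcher.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass(frozen=True)
class Transition:
    shot_before: str
    shot_after: str
    temporal: str     # e.g. 'forward ellipsis, determinate length'
    spatial: str      # e.g. 'translation, determinate proximity'
    visual: str       # e.g. 'cut'

archive = [
    Transition('person enters elevator', 'person exits elevator',
               'forward ellipsis, determinate length',
               'translation, determinate proximity', 'cut'),
    Transition('person gets into automobile', 'person exits automobile nearby',
               'forward ellipsis, determinate length',
               'translation, determinate proximity', 'cut'),
]

def analogical_match(shot_a: str, example: Transition,
                     transitions: List[Transition]) -> Optional[str]:
    """Find a Shot B relating to shot_a as example.shot_after relates to
    example.shot_before, i.e. via the same transition categories."""
    for t in transitions:
        if (t.shot_before == shot_a and
                (t.temporal, t.spatial, t.visual) ==
                (example.temporal, example.spatial, example.visual)):
            return t.shot_after
    return None

elevator = archive[0]
print(analogical_match('person gets into automobile', elevator, archive))
# -> 'person exits automobile nearby'
```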
Users can also create new icons for character and object actions by means of the Animated Icon Editor (see figure 14). This editor allows users to define new icons as subsets or mixtures of existing animated icons. This is very useful in conjunction with our complete body model, because a very wide range of possible human motions can be described as subsets or mixtures of existing icons.
Applying the results of work on automatic icon incorporation would be a fruitful path of exploration (22). Already in our icon language, there are many iconic descriptors which we designed using the principle of incorporation (by which individual iconic elements are combined to form new icons). Creating tools to allow users to automatically extend the language in this way is a logical extension of our work in this area.
The Media Time Line is the core browser and viewer of Media Streams (see figure 3). It enables users to visualise video at multiple time scales simultaneously and to read and write multi-layered iconic annotations, and it provides one consistent interface for annotation, browsing, query, and editing of video and audio data.
Since video is a temporal medium, the first challenge for representing and annotating its content is to visualise its content and structure. In the Media Time Line we represent video at multiple time scales simultaneously by trading off temporal and spatial resolution in order to visualise both the content and the dynamics of the video data. We create a sequence of thumbnails of the video stream by subsampling it at one frame per second. For longer movies, we also sample a frame every minute. The spatial resolution of each thumbnail enables the user to visually inspect its contents. However, the temporal resolution is less informative, in that the sequence is subsampled at only one frame per second.
In order to overcome the lack of temporal resolution, we extend a technique pioneered by Ron MacNeil of the Visible Language Workshop of the MIT Media Laboratory (31) and used in the work of Mills and Cohen at Apple Computer's Advanced Technology Group (33). We create a videogram. A videogram is made by grabbing a centre strip from every video frame and concatenating these strips together. Underneath the subsampled thumbnail frames of the video is the videogram, in which the concatenated strips provide fine temporal resolution of the dynamics of the content while sacrificing spatial resolution. Because camera operators often strive to keep significant information within the centre of the frame, a salient trace of spatial resolution is preserved.
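A hedged sketch of the thumbnail and videogram construction described above, using NumPy arrays to stand in for decoded frames (the actual implementation operates on QuickTime movies from Common Lisp; strip width and frame rate here are assumptions):

```python
import numpy as np

def thumbnails(frames, frame_rate=30):
    """Subsample the stream at roughly one frame per second."""
    return frames[::frame_rate]

def videogram(frames, strip_width=4):
    """Concatenate a vertical centre strip from every frame: fine temporal
    resolution at the cost of spatial resolution."""
    centre = frames.shape[2] // 2
    half = strip_width // 2
    strips = frames[:, :, centre - half:centre + half, :]   # (T, H, w, C)
    return np.concatenate(list(strips), axis=1)             # (H, T*w, C)

# 300 synthetic frames of 120x160 RGB video (10 seconds at 30 fps).
frames = np.random.randint(0, 256, size=(300, 120, 160, 3), dtype=np.uint8)
print(thumbnails(frames).shape)   # -> (10, 120, 160, 3)
print(videogram(frames).shape)    # -> (120, 1200, 3)
```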
In a videogram, a still image has an unusual salience: if a camera pans across a scene and a centre strip is taken from each video frame, a still will be recreated which is coherently deformed by the pace and direction of the camera motion and/or the pace and direction of any moving objects within the frame. Our contribution is that by simultaneously presenting two different but co-ordinated views of video data - the thumbnails, with good spatial resolution and poor temporal resolution, and the videogram, with poor spatial resolution but good temporal resolution - the system enables the viewer to use both representations together in order to visualise the structure of the video information (see figure 15). This idea of playing spatial and temporal resolutions off one another is also utilised in Laura Teodosio's work on 'salient stills' (36) and holds promise as a general guideline for creating new visualisations of video data. An example of this spatial/temporal trade-off can be seen in the figure below, in which the movement of Arnold through the frame is visible in the right-hand side of the videogram, and the thumbnail above shows that the swath of extended face corresponds to the central figure.
With a little practice, users can learn to read this representation and quickly scan the dynamics of video content. Scene breaks are clearly visible, as are camera pans, zooms, tracking, and the difference between handheld and tripod-recorded video footage. The deformation of the still image in the videogram provides a signature of camera and/or object motion, as in the example above.
Audio data in the Media Time Line is represented by a waveform depicting amplitude, as well as by pause bars depicting significant breaks in the audio. Currently, our pause-detection algorithm uses a fixed threshold, which works fairly well for many videos, but a more robust algorithm is needed. Significant work has been done by Barry Arons on pause detection and on audio and speech parsing in general (6); we hope to incorporate these results into our system. Arons' work uses dynamic thresholding and windowing techniques to facilitate better detection of pauses in speech and the separation of speech from background noise in unstructured audio recordings. Similarly, work by Michael Hawley on specialised audio parsers for musical events in the audio track could be applied to automatically parsing the structure, and enriching the representation, of audio data (26).
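The current fixed-threshold pause detection can be sketched as a simple pass over the amplitude envelope. The threshold and window length below are arbitrary illustrative values; dynamic thresholding of the kind Arons describes would replace them.

```python
import numpy as np

def detect_pauses(samples, sample_rate, threshold=0.02, min_pause=0.3):
    """Return (start_sec, end_sec) spans where the RMS amplitude stays below a
    fixed threshold for at least min_pause seconds."""
    window = int(0.05 * sample_rate)                    # 50 ms analysis windows
    n = len(samples) // window
    rms = np.sqrt(np.mean(samples[:n * window].reshape(n, window) ** 2, axis=1))
    quiet = rms < threshold

    pauses, start = [], None
    for i, q in enumerate(np.append(quiet, False)):     # sentinel closes last run
        if q and start is None:
            start = i
        elif not q and start is not None:
            if (i - start) * 0.05 >= min_pause:
                pauses.append((start * 0.05, i * 0.05))
            start = None
    return pauses

# One second of tone, half a second of silence, one second of tone.
rate = 8000
signal = np.concatenate([np.sin(np.linspace(0, 2000, rate)),
                         np.zeros(rate // 2),
                         np.sin(np.linspace(0, 2000, rate))])
print(detect_pauses(signal, rate))   # -> roughly [(1.0, 1.5)]
```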
In annotating the presence or absence of audio events within the data stream, our representation makes use of the fact that in thinking about audio, one thinks about the source that produced it. Icons for different objects and characters are compounded with the icon for the action of producing the heard sound in order to annotate audio events. This approach corresponds to Christian Metz's notion of 'aural objects' (32).
Annotation of video content in a Media Time Line is a simple process of dropping down iconic descriptors from the Icon Space onto the Media Time Line. Frame regions are then created which may extend to the end of the current scene or to the end of the entire movie. The select bar specifies the current position in a movie and displays the icons that are valid at that point in time. Icons are 'good-till-cancelled' when they are dropped onto the Media Time Line. The user can specify the end points of frame regions by dragging off an icon and can adjust the starting and ending points of frame regions by means of dragging the cursor. A description is built up by dropping down icons for the various categories of video annotation. The granularity and specificity of the annotation are user determined.
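The 'good-till-cancelled' behaviour described above can be sketched as follows: a dropped icon opens a frame region that runs to the next scene break, or to the end of the movie if no break follows, unless the user ends it earlier. The data structures and function names are illustrative, not those of Media Streams.

```python
from bisect import bisect_right
from dataclasses import dataclass

@dataclass
class FrameRegion:
    icon: str
    start: int
    end: int

def drop_icon(icon, frame, scene_breaks, movie_end):
    """Dropped icons are 'good till cancelled': the region runs from the drop
    point to the next scene break, or to the end of the movie if none follows."""
    later_breaks = scene_breaks[bisect_right(scene_breaks, frame):]
    end = later_breaks[0] if later_breaks else movie_end
    return FrameRegion(icon, frame, end)

def cancel(region, frame):
    """Dragging off an icon ends its frame region at the given frame."""
    region.end = min(region.end, frame)
    return region

scene_breaks = [240, 610, 950]
region = drop_icon('adult male wears jacket', 300, scene_breaks, movie_end=1200)
print(region)                # -> FrameRegion(icon='adult male wears jacket', start=300, end=610)
print(cancel(region, 480))   # -> FrameRegion(..., start=300, end=480)
```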
Media Streams is about to be subjected to some rigorous real-world tests. In addition to several internal projects at the MIT Media Laboratory which will be building other systems on top of Media Streams, external projects involving large archives of news footage will be exploring using Media Streams for video annotation and retrieval. Clearly these internal and external projects will teach us much about the claim made in this paper: that an iconic visual language for video annotation and retrieval can support the creation of a stream-based, reusable, global archive of digital video. We believe that this goal articulates an important challenge and opportunity for visual languages in the 1990's (23).
The research discussed above was conducted at the MIT Media Laboratory and Interval Research Corporation. The support of the Laboratory and its sponsors is gratefully acknowledged. I want to thank Brian Williams and Golan Levin for their continually creative and Herculean efforts and my advisor, Prof. Kenneth Haase, for his insight, inspiration, and support. Thanks also to Warren Sack, Axil Comras, and Wendy Buffett for editorial and moral support.
[1] Allen, J F. Maintaining knowledge about temporal intervals. In: Brachman, R J, Levesque, H J (eds.), Readings in knowledge representation. San Mateo, CA, Morgan Kaufmann, 1985, 510-521.
[2] Apple Computer, Macintosh Common Lisp reference. Cupertino, CA, Apple Computer, 1993.
[3] Apple Computer, QuickTime developer's guide. Cupertino, CA, Apple Computer, 1993.
[4] Apple Multimedia Lab, The visual almanac. San Francisco, CA, Apple Computer, 1989.
[5] Arndt, T, Chang, S-K. Image sequence compression by iconic indexing. In: Proceedings of 1989 IEEE workshop on visual languages. Rome, IEEE Computer Society Press 1989, 177-182.
[6] Arons, B. SpeechSkimmer: interactively skimming recorded speech. Forthcoming in: Proceedings of UIST'93 ACM symposium on user interface software technology. Atlanta, ACM Press, 1993.
[7] Bliss, C K. Semantography-blissymbolics. 3rd ed. Sydney, NSW, Semantography-Blissymbolics Publications, 1978.
[8] Bloch, G R. From concepts to film sequences. Unpublished paper. Yale University Department of Computer Science, 1987.
[9] Bordwell, D. Narration in the fiction film. Madison, University of Wisconsin Press, 1985.
[10] Bordwell, D, Thompson, K. Film art: an introduction. 3rd ed. New York, McGraw-Hill, 1990.
[11] Burch, N. Theory of film practice. Princeton, Princeton University Press, 1969.
[12] Chang, S-K. Visual languages and iconic languages. In: Chang, S-K, Ichikawa, T, Ligomenides, P A (eds.) Visual languages. New York, Plenum Press, 1986, 1-7.
[13] Chang, S K et al. A methodology for iconic language design with application to augmentative communication. In: Proceedings of 1992 IEEE workshop on visual languages. Seattle, Washington, IEEE Computer Society Press, 1992, 110-116.
[14] Davis, M. Director's Workshop: semantic video logging with intelligent icons. In: Proceedings of AAAI-91 workshop on intelligent multimedia interfaces. Anaheim, CA, AAAI Press, 1991, 122-132.
[15] Davis, M. Media Streams: an iconic visual language for video annotation. In: Proceedings of 1993 IEEE symposium on visual languages. Bergen, Norway, IEEE Computer Society Press, 1993, 196-202.
[16] Del Bimbo, A, Vicario, E, Zingoni, D. A Spatio-temporal logic for image sequence coding and retrieval. In: Proceedings of 1992 IEEE workshop on visual languages. Seattle, Washington, IEEE Computer Society Press, 1992, 228-230.
[17] Del Bimbo, A, Vicario, E, Zingoni, D. Sequence retrieval by contents through spatio temporal indexing. In: Proceedings of 1993 IEEE symposium on visual languages. Bergen, Norway, IEEE Computer Society Press, 1993, 88-92.
[18] Dreyfuss, H. Symbol sourcebook: an authoritative guide to international graphic symbols. New York, McGraw-Hill, 1972.
[19] Eisenstein, S M. The film sense. San Diego, Harcourt Brace Jovanovich, 1947.
[20] Eisenstein, S M. Film form: essays in film theory. San Diego, Harcourt Brace Jovanovich, 1949.
[21] Elliott, E L. WATCH - GRAB - ARRANGE - SEE: thinking with motion images via streams and collages. M. S. V. S. thesis. Massachusetts Institute of Technology Media Laboratory, 1993.
[22] Fuji, H, Korfhage, R R. Features and a model for icon morphological transformation. In: Proceedings of 1991 IEEE workshop on visual languages. Kobe, Japan, IEEE Computer Society Press, 1991, 240-245.
[23] Glinert, E, Blattner, M M, Frerking, C J. Visual tools and languages: directions for the 90's. In: Proceedings of 1991 IEEE workshop on visual languages. Kobe, Japan, IEEE Computer Society Press, 1991, 89-95.
[24] Haase, K. FRAMER: a persistent portable representation library. Internal document. Cambridge, MA, MIT Media Laboratory, 1993.
[25] Haase, K, Sack, W. FRAMER manual. Internal document. Cambridge, MA, MIT Media Laboratory, 1993.
[26] Hawley, M. Structure out of sound. Ph. D. Thesis. Massachusetts Institute of Technology, 1993.
[27] Isenhour, J P. The effects of context and order in film editing. AV Communications Review, 23(1), 69-80, 1975.
[28] Korfhage, R R, Korfhage, M A. Criteria for iconic languages. In: Chang, S-K, Ichikawa, T, Ligomenides, P A (eds.) Visual languages. New York, Plenum Press, 1986, 207-231.
[29] Lenat, D B, Guha, R V. Building large knowledge-based systems: representation and inference in the Cyc project. Reading, MA, Addison-Wesley, 1990.
[30] Levaco, R. (ed.) Kuleshov on film: writings by Lev Kuleshov. Berkeley, CA, University of California Press, 1974.
[31] MacNeil, R. Generating multimedia presentations automatically using TYRO: the constraint, case-based designer's apprentice. In: Proceedings of 1991 IEEE workshop on visual languages. Kobe, Japan, IEEE Computer Society Press, 1991, 74-79.
[32] Metz, C. Aural objects. Yale French Studies, 60, 1980, 24-32.
[33] Mills, M, Cohen, J, Wong, Y Y. A magnifier tool for video data. In: Proceedings of CHI'92, Monterey, CA, ACM Press, 1992, 93-98.
[34] Otsuji, K, Tonomura, Y, Ohba, Y. Video browsing using brightness data. SPIE visual communications and image processing '91: image processing. SPIE 1606, 1991, 980-989.
[35] Tanimoto, S L, Runyan, M S. PLAY: an iconic programming system for children. In: Chang, S-K, Ichikawa, T, Ligomenides, P A (eds.) Visual languages. New York, Plenum Press, 1986, 191-205.
[36] Teodosio, L. Salient stills. M. S. V. S. Thesis. Massachusetts Institute of Technology Media Laboratory, 1992.
[37] Tonomura, Y, Abe, S. Content oriented visual interface using video icons for visual database systems. In: Proceedings of 1989 IEEE workshop on visual languages. Rome, IEEE Computer Society Press, 1989, 68-73.
[38] Tonomura, Y et al. VideoMAP and VideoSpaceIcon: tools for anatomizing content. In: Proceedings of INTERCHI'93 conference on human factors in computing systems. Amsterdam, ACM Press, 1993, 131-136.
[39] Ueda, H et al. Automatic structure visualization for video editing. In: Proceedings of INTERCHI'93 conference on human factors in computing systems. Amsterdam, ACM Press, 1993, 137-141.
[40] Ueda, H, Miyatake, T, Yoshizawa, S. IMPACT: an interactive natural-motion-picture dedicated multimedia authoring system. In: Proceedings of CHI '91. New Orleans, Louisiana, ACM Press, 1991, 343-350.
[41] Zhang, H, Kankanhalli, A, Smoliar, S W. Automatic partitioning of full-motion video. Multimedia Systems, 1, 1993, 10-28.