A position paper for the W3C Video on the Web Workshop
12-13 December 2007, San Jose, California and Brussels, Belgium
In the Media Solutions Group, we see video as a necessary part of a true Web experience, but recognize that the Web allows for a much richer experience than simply watching television. Furthermore, the proximity of a keyboard, mouse, and other input devices permits a far richer interaction with the video content than is possible while sitting back on a couch with a remote in hand.
To this end, we believe that enabling a truly interactive experience for the end-user requires more descriptive data about the content they are watching, regardless of the format and location in which they view that content. Lastly, the rise of Web 2.0 technologies sets the expectation that end-users will be able to contribute, comment on, and even correct the metadata associated with content.
The need for metadata is well recognized: W3C metadata work has covered areas including PICS, and the industry has produced dozens of standards for metadata associated with video content. In the realm specific to video, we can divide this metadata into four basic categories:
The first two categories are well represented by many metadata standards, including Apple iTunes XML, Yahoo MediaRSS, and even the CableLabs VOD Metadata Content Specification 2.0. The MPEG-7 standard provides an XML form for the last two areas, and even the W3C SMIL standard provides a framework for synchronizing information with the presentation of content. There are also several next-generation video-on-the-Web companies experimenting with video hotspot technology. While these standards serve different purposes, none has risen to the top as a common mechanism for holding interactive information about video content on the Web.
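To make the flavour of these descriptive standards concrete, the following is a minimal sketch in Python. The `media:` namespace URI is Yahoo MediaRSS's real namespace, but the feed fragment, its item, and all its values are invented for illustration:

```python
import xml.etree.ElementTree as ET

# A hand-written fragment in the style of Yahoo MediaRSS. The namespace URI is
# MediaRSS's actual namespace; the item and its values are invented examples.
FEED = """<rss version="2.0" xmlns:media="http://search.yahoo.com/mrss/">
  <channel>
    <item>
      <media:content url="http://example.com/clip.mp4" duration="120"/>
      <media:title>Sample clip</media:title>
      <media:description>A short demonstration video.</media:description>
      <media:keywords>demo, metadata</media:keywords>
    </item>
  </channel>
</rss>"""

NS = {"media": "http://search.yahoo.com/mrss/"}

root = ET.fromstring(FEED)
item = root.find(".//item")
title = item.find("media:title", NS).text
keywords = [k.strip() for k in item.find("media:keywords", NS).text.split(",")]
print(title, keywords)
```

Note that such a feed describes the asset as a whole (title, description, keywords); it says nothing about what happens at a given moment inside the video.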
There is much growth on the Web in timed-metadata usage and implementation, from Adobe's captioning support in Flash CS3 to web-based scene detection (and, thus, chapterization), deep links in Google Video, and many others. Increasingly, end users are allowed to contribute their own meta-information in the form of timed comments or tags. However, as above, with the exception of captions, no standards exist for this burgeoning source of video metadata.
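In the absence of a standard, each implementation invents its own shape for timed metadata. A minimal sketch of the idea — entries anchored to a time range, queried by playback position — might look like this in Python (the field layout and sample values are hypothetical, not any existing schema):

```python
# Hypothetical in-memory form of timed metadata: (start_seconds, end_seconds, text).
# No common schema exists for this -- the point of the paragraph above -- so these
# fields and values are invented for illustration.
timed_comments = [
    (0.0, 4.5, "opening titles"),
    (4.5, 30.0, "interview segment"),
    (30.0, 61.2, "product demo"),
]

def comments_at(t):
    """Return all metadata entries active at playback time t (in seconds)."""
    return [text for start, end, text in timed_comments if start <= t < end]

print(comments_at(12.0))  # -> ['interview segment']
```

A standard would pin down exactly this kind of structure — time anchoring, entry types, and authorship — so that user-contributed annotations could travel with the content across players and sites.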
While standardizing the representation of the metadata is important, it is just as important to recognize where the metadata is being generated.
It is this last category which is key to deriving the true value of a Web-based video experience. When users can contribute their knowledge to the content, engagement is higher and the experience becomes much more personalized.
Having rich metadata for video content provides many opportunities on which the Web and other video applications can capitalize. The first obvious uses, driving many web companies today, are program guides complete with detailed program descriptions, as well as mechanisms to search content in depth -- finding not just the top-level information such as one finds on IMDb, but even dialog and scenes contained within the video.
More sophisticated uses of this metadata allow for better recommendations of content, and for dynamic advertising, where appropriate ads are displayed based on the actual content of the video itself, or where merchandising opportunities are based on what has appeared in a video the user has watched.
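The advertising case above reduces to matching an ad's targeting terms against the metadata of the scene currently playing. A toy sketch, with entirely hypothetical data and a deliberately naive keyword-overlap score:

```python
# Toy sketch of content-based ad selection: match each ad's keywords against
# the tags attached to the scene currently playing. All names and data here
# are hypothetical; a real system would use far richer matching.
scene_tags = {"kitchen", "cooking", "pasta"}

ads = [
    {"id": "ad-1", "keywords": {"cars", "driving"}},
    {"id": "ad-2", "keywords": {"cooking", "cookware"}},
]

def best_ad(tags, candidates):
    """Pick the ad whose keywords overlap the scene tags most, or None."""
    scored = [(len(tags & ad["keywords"]), ad["id"]) for ad in candidates]
    score, ad_id = max(scored)
    return ad_id if score > 0 else None

print(best_ad(scene_tags, ads))  # -> ad-2, which overlaps on "cooking"
```

The point is not the matching algorithm but its input: none of this is possible unless scene-level metadata exists in a form applications can consume.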
Unfortunately, there are several impediments to gathering and fully utilizing metadata for a next-generation Web application.
This is not a new area for the W3C: it has already standardized metadata around images, covering hotspots, alternate text, and layout information embedded in the XHTML of a web page.
By adopting the same approach -- separating the video from its metadata, elevating video to a first-class object, and providing the primitives for applications to truly interact with video the way the W3C has done for static images -- we believe the Web community will be able to focus on growing the real value of video in its applications.