Copyright © 2011-2014 W3C® (MIT, ERCIM, Keio, Beihang), All Rights Reserved. W3C liability, trademark and document use rules apply.
HTML microdata [MICRODATA] is an extension to HTML used to embed machine-readable data into HTML documents. Whereas the microdata specification describes a means of markup, the output format is JSON. This specification describes processing rules that may be used to extract RDF [RDF11-CONCEPTS] from an HTML document containing microdata.
This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.
This document is an experimental work in progress. The concepts described herein are intended to provide guidance for a possible future Working Group chartered to provide a Recommendation for this transformation. As a consequence, implementers of this specification, either producers or consumers, should note that it may change prior to any possible publication as a Recommendation.
This Working Draft is an update of the W3C Interest Group Note, published in October 2012. This update simplifies processing using the following mechanisms:
md:item
RDF Collection to reconstruct the order that items appear in the DOM. This has also proven to
not be useful and has been dropped.
(see issue 6)
@content
attribute
of the meta
element.
(see issue 7)
@value
attribute
of the data
or meter
elements. If this value has
numeric form, it will produce a datatyped literal using the appropriate datatype
from [XMLSCHEMA11-2]
(see issue 8
and issue 9)
propertyURI
registry
setting. This setting could previously have taken either the vocabulary
or
contextual
settings. As contextual
was never used, and usage
in the wild favors the vocabulary
setting, support for contextual
has been eliminated, and consequently support for the propertyURI
element
within the registry.
This issue remains open pending community review; specifically, anyone depending on this
feature should provide feedback as requested below.
(see issue 10)
multipleValues
registry setting were set
to list
. Although the previous registry did have such a setting for some
schema.org values, this is not honored by most search engines, and so has been dropped,
and consequently support for the multipleValues
element within the registry.
This issue remains open pending community review; specifically, anyone depending on this
feature should provide feedback as requested below.
(see issue 10)
vocab_expansion
option. Support for Vocabulary Expansion has been substantially
simplified, and is no longer under control of an option.
This issue remains open pending community review; specifically, anyone depending on this
feature should provide feedback as requested below.
(see issue 10)
The intention is to publish this draft as a new version of the Interest Group Note after gathering and incorporating community input.
This document was published by the Semantic Web Interest Group as an Interest Group Note. If you wish to make comments regarding this document, please send them to semantic-web@w3.org (subscribe, archives). All comments are welcome.
Publication as an Interest Group Note does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.
The disclosure obligations of the Participants of this group are described in the charter.
This document is governed by the 1 August 2014 W3C Process Document.
This section is non-normative.
This document describes a means of transforming HTML containing microdata into RDF. HTML Microdata [MICRODATA] is an extension to HTML used to embed machine-readable data into HTML documents. This specification describes transformation directly to RDF [RDF11-CONCEPTS].
There are a variety of ways in which a mapping from microdata to RDF might be configured to give a result that is closer to the required result for a particular vocabulary. This specification defines terms that can be used as hooks for vocabulary-specific behavior, which could be defined within a registry or on an implementation-defined basis.
For background on the trade-offs between these options, see http://www.w3.org/wiki/Mapping_Microdata_to_RDF and GitHub Issues.
This section is non-normative.
Microdata [MICRODATA] is a way of embedding data in HTML documents using attributes. The HTML DOM is extended to provide an API for accessing microdata information, and the microdata specification defines how to generate a JSON representation from microdata markup.
Mapping microdata to RDF enables consumers to merge data expressed in other RDF-based formats with microdata. It facilitates the use of RDF vocabularies within microdata, and enables microdata to be used with the full RDF toolchain. Some use cases for this mapping are described in Section 1.2 below.
Microdata's data model does not align neatly with RDF.
http://example.org/Cat
can have
both the property color
and the property http://example.org/color
,
and these properties are semantically distinct under microdata. In
RDF, all properties have IRIs.
@lang
attributes could
be used to provide datatype and language information for RDF data, this
would be contrary to the microdata specification.
Thus, in some places the needs of RDF consumers violate requirements of the microdata specification. This specification highlights where such violations occur and the reasons for them.
This specification allows for vocabulary-specific rules that affect the generation of property URIs and value serializations. This is facilitated by a registry that associates URIs with specific rules, based on matching itemtype values against registered URI prefixes to determine a vocabulary and potentially vocabulary-specific processing rules.
This specification also assumes that consumers of RDF generated from microdata may have to process the results in order to, for example, assign appropriate datatypes to property values.
This section is non-normative.
During the period of the task force, a number of use cases were put forth for the use of microdata in generating RDF:
rdf:List
values; when they take
multiple values they are unordered. The rdfs:range
of a GoodRelations
property indicates the datatype of the expected value, and GoodRelations
processors will expect values to be cast to that type. Language
information from the HTML needs to be captured as it is common that
multiple values will be used to specify the same information in different
languages.
http://schema.org/musicGroupMember
, and an author might express more detail through an ad-hoc
sub-property musicGroupMember/leadVocalist, having the URI
http://schema.org/musicGroupMember/leadVocalist
.
This section is non-normative.
Decisions or open issues in the specification are tracked on the GitHub Issue Tracker. These include the following:
Experimental support for itemprop-reverse. This attribute is not part of [MICRODATA] and is included as an experimental feature. Specific feedback from the community is requested. Based on adoption, the attribute may be considered for inclusion in forthcoming versions of [MICRODATA] and this note.
The purpose of this specification is to provide input to a future working group that can make decisions about the need for a registry and the details of processing. Among the options investigated by the Task Force are the following:
http://www.w3.org/ns/md#item
mapping at all.
rdf:Seq
, or place all values,
whether or not multiple, into some form of collection.
The microdata specification [MICRODATA] defines a number of attributes and the way in which those attributes are to be interpreted. The microdata DOM API provides methods and attributes for retrieving microdata from the HTML DOM.
For reference, attributes used for specifying and retrieving HTML microdata are referenced here:
element.itemType
on the element.
The item type is also used to resolve non-URL names to absolute URLs.
Available through the
Microdata DOM API as
element.itemType
.
(See itemtype
in [MICRODATA]).
In RDF, it is common for people to shorten vocabulary terms via abbreviated URIs that use a 'prefix' and a 'reference'. Throughout this document, assume that the following vocabulary prefixes have been defined:
dc: | http://purl.org/dc/terms/ |
md: | http://www.w3.org/ns/md# |
rdf: | http://www.w3.org/1999/02/22-rdf-syntax-ns# |
rdfs: | http://www.w3.org/2000/01/rdf-schema# |
rdfa: | http://www.w3.org/ns/rdfa# |
xsd: | http://www.w3.org/2001/XMLSchema# |
This section is non-normative.
In a perfect world, all processors would be able to generate the same output for a given input without regard to the requirements of a particular vocabulary. However, microdata doesn't provide sufficient syntactic help in making these decisions. Different vocabularies have different needs.
The registry is located at the namespace defined for microdata: http://www.w3.org/ns/md in a variety of formats. Under control of a runtime option, a processor should use this registry, or another provided by reference, to affect processing.
The registry associates a URI prefix with one or more key-value pairs denoting processor behavior. A hypothetical JSON representation of such a registry might be the following:
{ "http://schema.org/": { "properties": { "additionalType": {"subPropertyOf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"} } }, "http://microformats.org/profile/hcard": {} }
This structure associates mappings for two URIs: http://schema.org/
and
http://microformats.org/profile/hcard
. Items having an item type with a URI prefix from this registry
use the rules described for that prefix within the scope of that
item type. For http://schema.org/
, this mapping currently defines a single property: additionalType
with a value to indicate specific behavior. It also allows overrides
on a per-property basis; the item properties
key associates an individual name
with overrides for default behavior.
The interpretation of these
rules is defined in the following sections. If an item has no current type or the
registry contains no URI prefix matching current type, a conforming
processor MUST use the default values defined for these rules.
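The prefix matching described above can be sketched in Python. This is a minimal illustration, not part of the specification: the function names find_vocabulary and property_rules are hypothetical, the registry is held as an in-memory dictionary mirroring the JSON shown earlier, and preferring the longest matching prefix is an assumption this document does not state.

```python
from typing import Optional

# In-memory copy of the registry structure shown above (illustrative only).
REGISTRY = {
    "http://schema.org/": {
        "properties": {
            "additionalType": {
                "subPropertyOf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
            }
        }
    },
    "http://microformats.org/profile/hcard": {},
}


def find_vocabulary(item_type: Optional[str], registry: dict = REGISTRY) -> Optional[str]:
    """Return the registered URI prefix that item_type starts with, or None."""
    if not item_type:
        return None
    matches = [prefix for prefix in registry if item_type.startswith(prefix)]
    # Assumption: if several prefixes match, the longest one wins.
    return max(matches, key=len) if matches else None


def property_rules(item_type: Optional[str], name: str, registry: dict = REGISTRY) -> dict:
    """Return per-property overrides for name, or {} when defaults apply."""
    prefix = find_vocabulary(item_type, registry)
    if prefix is None:
        return {}
    return registry[prefix].get("properties", {}).get(name, {})
```

An empty dictionary here stands for "use the default values", which is the behavior required when no registered prefix matches the current type.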
This section is non-normative.
For names which are not absolute URLs, this section defines the algorithm for generating an absolute URL given an evaluation context including a current type and current vocabulary.
The procedure for generating property URIs is defined in Generate Predicate URI.
The URI generation scheme appends names that are not absolute URLs to the URI prefix. When generating property URIs, if the URI prefix does not end with a '/' or '#', a '#' is appended to the URI prefix. (See Step 4 in Generate Predicate URI.)
URI creation uses a base URL with query parameters to indicate the in-scope type and name list. Consider the following example:
<span itemscope itemtype="http://microformats.org/profile/hcard"> <span itemprop="n" itemscope> <span itemprop="given-name"> Princeton </span> </span> </span>
Given the URI prefix http://microformats.org/profile/hcard
, this
would generate http://microformats.org/profile/hcard#n
and
http://microformats.org/profile/hcard#given-name
. Note that the '#' is automatically
added as a separator.
Looking at another example:
<div itemscope itemtype="http://schema.org/Person"> <h2 itemprop="name">Jeni</h2> </div>
Given the URI prefix http://schema.org/
,
this would generate http://schema.org/name
. Note that if the itemtype
were http://schema.org/Person/Teacher
, this would generate the same property URI.
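The two examples above can be reproduced with a short Python sketch. It is illustrative only: property_uri and is_absolute_url are hypothetical names, and treating any name that parses with a URL scheme as an absolute URL is a simplifying assumption.

```python
from urllib.parse import urlsplit


def is_absolute_url(name: str) -> bool:
    """Simplified check (assumption): a name with a scheme is an absolute URL."""
    return bool(urlsplit(name).scheme)


def property_uri(uri_prefix: str, name: str) -> str:
    """Append name to uri_prefix, adding '#' when the prefix lacks a separator."""
    if is_absolute_url(name):
        # Names that are already absolute URLs are used unchanged.
        return name
    if not uri_prefix.endswith(("/", "#")):
        uri_prefix += "#"
    return uri_prefix + name
```

For instance, property_uri("http://microformats.org/profile/hcard", "given-name") inserts the '#' separator, while property_uri("http://schema.org/", "name") appends the name directly.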
If the registry contains no match for current type, implementations MUST act as if there is a URI prefix made from the first itemtype value by stripping either the fragment content or, if the value has no fragment, the last path segment (see [RFC3986]).
The vocabulary URI prefix is made from the first itemtype value by stripping either the fragment content or last path segment, if the value has no fragment (See [RFC3986]).
Deconstructing the itemtype URL to create or identify a vocabulary URI is a violation of the microdata specification which is necessary to support the use of existing vocabularies designed for use with RDF, and shared or inherited properties within all vocabularies.
<div itemscope itemtype="http://example.org/Book"> <h2 itemprop="title">Just a Geek</h2> </div>
In this example, assuming no matching entry in the registry,
the URI prefix is constructed by removing the
last path segment, leaving the URI
http://example.org/
. The resulting property URI would be
http://example.org/title
.
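The fallback derivation of a URI prefix from an itemtype value might look like the following Python sketch. The function name default_uri_prefix is hypothetical, and dropping any query component when the last path segment is removed is an assumption not addressed in this document.

```python
from urllib.parse import urlsplit, urlunsplit


def default_uri_prefix(item_type: str) -> str:
    """Derive a vocabulary URI prefix from the first itemtype value."""
    if "#" in item_type:
        # Strip the fragment content but keep the '#' separator.
        return item_type.split("#", 1)[0] + "#"
    parts = urlsplit(item_type)
    # Remove the last path segment, keeping the trailing '/'.
    path = parts.path.rsplit("/", 1)[0] + "/"
    # Assumption: any query component is discarded.
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))
```

With this sketch, http://example.org/Book yields http://example.org/, and http://purl.org/vocab/frbr/core#Work yields http://purl.org/vocab/frbr/core#.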
If there is no in-scope itemtype, property URIs are generated using the base URI of the document and the name as a fragment. Consider the following example:
<div itemscope> <p itemscope itemprop='bar'> <span itemprop='baz'>Baz</span> </p> </div>
If the document is located at http://example/author
,
the name bar generates the URI
http://example/author#bar
.
However, the name baz is included in an untyped item.
The inherited property URI is used to create a new property URI:
http://example/author#baz
.
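For untyped items, the construction amounts to replacing any fragment of the document base with the property name. A minimal Python sketch, using the hypothetical function name untyped_property_uri and assuming the name is not itself an absolute URL:

```python
from urllib.parse import urldefrag


def untyped_property_uri(document_base: str, name: str) -> str:
    """Build a property URI from the document base URI and a name."""
    # Drop any fragment already present on the base, then use name as fragment.
    base, _ = urldefrag(document_base)
    return f"{base}#{name}"
```

Both bar and baz in the example above resolve against the same document base, so nesting depth does not affect the generated URI.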
This scheme is compatible with the needs of other RDF serialization formats such as RDF/XML [RDF-SYNTAX-GRAMMAR], which rely on QNames for expressing properties. For example, the generated property URIs can be split as follows:
<rdf:Description xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:base="http://example/author#" rdf:type="http://microformats.org/profile/hcard"> <base:bar> <rdf:Description> <base:baz>Baz</base:baz> </rdf:Description> </base:bar> </rdf:Description>
This section is non-normative.
In microdata, all values are strings. In RDF, values may be resources or may be typed with an appropriate datatype.
In some cases, the type of a microdata value can be determined from the element on which it is specified. In particular:
The time element provides dates, times and durations.
The data and meter elements provide doubles and integers.
Microdata requires that all values of itemtype come from the same vocabulary. This is required as itemprop values are resolved relative to that vocabulary. However, it is often useful to define an item to have types from multiple different vocabularies.
Vocabulary expansion uses simple rules to generate additional triples based on
rules and property relationships described in the registry.
Within the registry, a property definition may have either equivalentProperty
or subPropertyOf
keys having an IRI value (or array of IRI values)
of the associated property. Such an
entry causes the processor to generate triples associating the source
property IRI with the target property IRI using either
rdfs:subPropertyOf
or
owl:equivalentProperty
predicates.
For example, the registry definition for the additionalType property within schema.org defines additionalType to have an rdfs:subPropertyOf relationship with rdf:type.
<div itemscope itemtype="http://schema.org/Person"> <link itemprop="additionalType" href="http://xmlns.com/foaf/0.1/Person"/> <a itemprop="email http://xmlns.com/foaf/0.1/mbox" href="mailto:mail@gmail.com"> mail@gmail.com </a> </div>
The registry rule described above causes the processor to emit an extra triple when it encounters the additionalType itemprop. Without vocabulary expansion, the markup in the previous example produces the following Turtle:
@prefix foaf: <http://xmlns.com/foaf/0.1/> . @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> . @prefix rdfa: <http://www.w3.org/ns/rdfa#> . @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> . @prefix schema: <http://schema.org/> . [ a schema:Person; schema:additionalType foaf:Person; schema:email <mailto:mail@gmail.com>; foaf:mbox <mailto:mail@gmail.com> ] .
After performing vocabulary expansion, an additional rdf:type
triple is generated:
@prefix foaf: <http://xmlns.com/foaf/0.1/> . @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> . @prefix rdfa: <http://www.w3.org/ns/rdfa#> . @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> . @prefix schema: <http://schema.org/> . <> rdfa:usesVocabulary schema: . [ a schema:Person, foaf:Person; schema:additionalType foaf:Person; schema:email <mailto:mail@gmail.com>; foaf:mbox <mailto:mail@gmail.com> ] .
The owl:equivalentProperty rule is more powerful than rdfs:subPropertyOf, in that if any equivalent property matches, then the source property would also cause a triple to be generated. For example, if the registry stated that name was equivalent to rdfs:label, then any use of name in an itemprop would cause a triple using rdfs:label to be emitted, as with rdfs:subPropertyOf. However, logically, any use of label where the current vocabulary were rdfs: could also cause a triple using schema:name to be emitted. To simplify processing, this specification requires that all values of an owl:equivalentProperty registry entry have their own rules with those values as keys within the properties section of their respective vocabularies.
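The net effect of vocabulary expansion on the generated graph can be sketched in Python. This is a simplification under stated assumptions: triples are plain tuples of strings, the rules are flattened into a hypothetical mapping from source property IRI to target property IRIs, and the sketch adds the expanded triples directly rather than emitting rdfs:subPropertyOf or owl:equivalentProperty statements and applying entailment.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Hypothetical flattened rules: source property IRI -> target property IRIs.
EXPANSION_RULES: Dict[str, List[str]] = {
    "http://schema.org/additionalType": [RDF_TYPE],
}


def expand(triples: List[Triple], rules: Dict[str, List[str]] = EXPANSION_RULES) -> List[Triple]:
    """Return triples plus one extra triple per matching expansion rule."""
    expanded = list(triples)
    seen = set(triples)
    for subject, predicate, obj in triples:
        for target in rules.get(predicate, []):
            extra = (subject, target, obj)
            if extra not in seen:  # avoid emitting duplicates
                seen.add(extra)
                expanded.append(extra)
    return expanded
```

Applied to the schema:additionalType triple from the example above, this yields the additional rdf:type triple with foaf:Person as its object. Because each equivalent property is required to carry its own rule, the same one-directional lookup serves both kinds of entry.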
The external registry may be controlled by the registry option passed to the microdata processor. If specified, the registry must be loaded from the location indicated as the option value. Otherwise, the processor MUST load the default registry from http://www.w3.org/ns/md.
Setting registry
is performed in a processor-specific way.
When accessed as a web service using HTTP GET, POST or similar action, processors SHOULD use the registry query parameter. The acceptable value for registry is a URI-encoded URL.
Web service processors SHOULD return the resulting RDF graph using a requested format specified by
HTTP Content Negotiation for an acceptable content type. Web service processors MUST support [N-TRIPLES].
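As an illustration of passing a URI-encoded registry location to a web service processor, the following Python sketch builds a request URL. The endpoint and the url parameter naming the document to process are hypothetical; only the registry parameter name comes from this document.

```python
from urllib.parse import urlencode


def service_url(endpoint: str, document_url: str, registry_url: str) -> str:
    """Build a GET URL for a hypothetical microdata-to-RDF web service."""
    # urlencode percent-encodes reserved characters such as ':' and '/'.
    query = urlencode({"url": document_url, "registry": registry_url})
    return f"{endpoint}?{query}"
```

A client would typically also send an Accept header naming the desired RDF serialization, since the response format is chosen through content negotiation.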
Transformation of Microdata to RDF makes use of general processing rules described in [MICRODATA] for the treatment of items.
document.getItems
method.
element.properties
attribute.
a
, area
, audio
,
embed
, iframe
, img
, link
, object
,
source
, track
or video
)
element.itemValue
.
(See relevant attribute descriptions in [HTML5]).
meter
or data
element.
element.itemValue
.
http://www.w3.org/2001/XMLSchema#integer
.
http://www.w3.org/2001/XMLSchema#double
.
meta
element with a @content
attribute.
@content
attribute
with language information set from the
language
of the property element.
Otherwise, the value is a simple literal created from the value of the @content
attribute.
time
element.
element.itemValue
.
http://www.w3.org/2001/XMLSchema#date
.
http://www.w3.org/2001/XMLSchema#time
.
http://www.w3.org/2001/XMLSchema#dateTime
.
http://www.w3.org/2001/XMLSchema#gYearMonth
.
http://www.w3.org/2001/XMLSchema#gYear
.
http://www.w3.org/2001/XMLSchema#duration
.
The HTML valid yearless date string is similar to xsd:gMonthDay, but the lexical forms differ, so it is not included in this conversion.
See
The time
element
in [HTML5].
See
The lang
and xml:lang
attributes
in [HTML5] for determining the language of a node.
document.getItems
.
(See top-level microdata item in [MICRODATA]).
The HTML5/microdata content model for @href
, @src
,
@data
, itemtype and itemprop and itemid is that of
a URL, not a URI or IRI.
A proposed mechanism for specifying the range of property values to be URI reference or IRI could
allow these to be specified as subject or object using a @content
attribute.
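The datatype selection described in this section for data, meter and time values can be sketched in Python. This is a hedged illustration: typed_literal is a hypothetical name, and the regular expressions are simplified lexical checks that do not validate calendar ranges or cover every form permitted by [HTML5] and [XMLSCHEMA11-2].

```python
import re
from typing import Optional, Tuple

XSD = "http://www.w3.org/2001/XMLSchema#"

# Simplified lexical shapes (assumptions); they check form, not value ranges.
_DATE = r"\d{4}-\d{2}-\d{2}"
_CLOCK = r"\d{2}:\d{2}:\d{2}(\.\d+)?(Z|[+-]\d{2}:\d{2})?"
_TIME_FORMS = [
    ("date", _DATE),
    ("time", _CLOCK),
    ("dateTime", _DATE + "T" + _CLOCK),
    ("gYearMonth", r"\d{4}-\d{2}"),
    ("gYear", r"\d{4}"),
    ("duration",
     r"-?P(?=\d|T\d)(\d+Y)?(\d+M)?(\d+D)?(T(?=\d)(\d+H)?(\d+M)?(\d+(\.\d+)?S)?)?"),
]
_INTEGER = r"[+-]?\d+"
_DOUBLE = r"[+-]?(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?"


def typed_literal(element: str, value: str) -> Tuple[str, Optional[str]]:
    """Return (value, datatype IRI), or (value, None) for a plain literal."""
    if element in ("data", "meter"):
        if re.fullmatch(_INTEGER, value):
            return value, XSD + "integer"
        if re.fullmatch(_DOUBLE, value):
            return value, XSD + "double"
    elif element == "time":
        for local_name, pattern in _TIME_FORMS:
            if re.fullmatch(pattern, value):
                return value, XSD + local_name
    # No recognized lexical form: fall back to a plain literal.
    return value, None
```

A yearless date such as 03-18 matches none of the patterns and therefore stays a plain literal, consistent with the note above about xsd:gMonthDay.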
An HTML document containing microdata MAY be converted to any other RDF-compatible document format using the algorithm specified in this section.
A conforming microdata processor implementing RDF conversion MUST implement a processing algorithm that results in the equivalent triples to those that the following algorithm generates:
When the user agent is to Generate triples for an item item, given evaluation context, it must run the following steps:
This algorithm has undergone substantial change from the original microdata specification [MICRODATA].
element.itemType
of the element defining the item.
http://www.w3.org/1999/02/22-rdf-syntax-ns#type
element.itemType
of the element defining the item.
subPropertyOf
or
equivalentProperty
, for each such value equiv, generate the following triple:
Predicate URI generation makes use of current type and current vocabulary from an evaluation context context along with name.
http://example.org/doc
and an itemprop
of 'title', a URI will be constructed to be
http://example.org/doc#title
.
This section is non-normative.
The WebSchemas community has
proposed the use of a new Microdata attribute:
itemprop-reverse. Although not present in [MICRODATA] at this
time, the attribute can be very useful in many markup examples where items
are related using the reverse of a common property; this saves creating new properties
which exist solely for the purpose of describing such reverse relationships. Evidence
for the utility of such a feature can be seen in the
RDFa @rev
attribute [RDFA-CORE]
and the JSON-LD @reverse
property [JSON-LD].
See issue 5 for further reference.
This feature adds the following attribute:
The Algorithm is extended accordingly:
The Triples generation algorithm is extended with the following step to take place immediately after Step 9:
Simple use of itemprop-reverse:
<div itemscope itemtype="http://schema.org/Person"> <span itemprop="name">William Shakespeare</span> <link itemprop-reverse="creator" href="http://www.freebase.com/m/0yq9mqd"> </div>
Results in the following Turtle:
@prefix schema: <http://schema.org/> . <http://www.freebase.com/m/0yq9mqd> schema:creator [ a schema:Person; schema:name "William Shakespeare" ] .
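The subject and object swap performed for itemprop-reverse can be sketched in Python. The names Literal and make_triple are hypothetical, nodes are represented as plain strings, and skipping literal values is an assumption that follows from RDF not permitting literals in subject position.

```python
from dataclasses import dataclass
from typing import Optional, Tuple, Union


@dataclass(frozen=True)
class Literal:
    """A plain RDF literal; IRIs and blank node labels are plain strings."""
    text: str


Value = Union[str, Literal]


def make_triple(subject: str, predicate: str, value: Value,
                reverse: bool = False) -> Optional[Tuple[Value, str, Value]]:
    """Build a triple, swapping subject and object for reverse properties."""
    if not reverse:
        return (subject, predicate, value)
    if isinstance(value, Literal):
        # A literal cannot be the subject of a triple, so nothing is emitted.
        return None
    return (value, predicate, subject)
```

In the Shakespeare example, the item's blank node becomes the object of schema:creator and the linked Freebase URL becomes the subject.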
This section is non-normative.
A test suite [MICRODATA-RDF-TESTS] is under development to help processor developers verify conformance to this specification.
This section is non-normative.
The microdata example below expresses book information as an FRBR Work item.
<dl itemscope itemtype="http://purl.org/vocab/frbr/core#Work" itemid="http://books.example.com/works/45U8QJGZSQKDH8N" lang="en"> <dt>Title</dt> <dd><cite itemprop="http://purl.org/dc/terms/title">Just a Geek</cite></dd> <dt>By</dt> <dd><span itemprop="http://purl.org/dc/terms/creator">Wil Wheaton</span></dd> <dt>Format</dt> <dd itemprop="http://purl.org/vocab/frbr/core#realization" itemscope itemtype="http://purl.org/vocab/frbr/core#Expression" itemid="http://books.example.com/products/9780596007683.BOOK"> <link itemprop="http://purl.org/dc/terms/type" href="http://books.example.com/product-types/BOOK"> Print </dd> <dd itemprop="http://purl.org/vocab/frbr/core#realization" itemscope itemtype="http://purl.org/vocab/frbr/core#Expression" itemid="http://books.example.com/products/9780596802189.EBOOK"> <link itemprop="http://purl.org/dc/terms/type" href="http://books.example.com/product-types/EBOOK"> Ebook </dd> </dl>
Assuming that the registry contains an entry for http://purl.org/vocab/frbr/core#
this is equivalent to the following Turtle:
@prefix dc: <http://purl.org/dc/terms/> . @prefix frbr: <http://purl.org/vocab/frbr/core#> . @prefix rdfa: <http://www.w3.org/ns/rdfa#> . <> rdfa:usesVocabulary frbr: . <http://books.example.com/works/45U8QJGZSQKDH8N> a frbr:Work ; dc:creator "Wil Wheaton"@en ; dc:title "Just a Geek"@en ; frbr:realization <http://books.example.com/products/9780596007683.BOOK>, <http://books.example.com/products/9780596802189.EBOOK> . <http://books.example.com/products/9780596007683.BOOK> a frbr:Expression ; dc:type <http://books.example.com/product-types/BOOK> . <http://books.example.com/products/9780596802189.EBOOK> a frbr:Expression ; dc:type <http://books.example.com/product-types/EBOOK> .
The following snippet of HTML has microdata for two people with the same address. This illustrates two items referencing a third item, and how only a single RDF resource definition is created for that third item.
<p> Both <span itemscope itemtype="http://microformats.org/profile/hcard" itemref="home"> <span itemprop="fn" ><span itemprop="n" itemscope ><span itemprop="given-name">Princeton</span></span></span> </span> and <span itemscope itemtype="http://microformats.org/profile/hcard" itemref="home"> <span itemprop="fn" ><span itemprop="n" itemscope ><span itemprop="given-name">Trekkie</span></span></span> </span> live at <span id="home" itemprop="adr" itemscope> <span itemprop="street-address">Avenue Q</span>. </span> </p>
Assuming that the registry contains an entry for http://microformats.org/profile/hcard
it generates these triples expressed in Turtle:
@prefix hcard: <http://microformats.org/profile/hcard#> . @prefix rdfa: <http://www.w3.org/ns/rdfa#> . [ a <http://microformats.org/profile/hcard>; hcard:fn "Princeton"; hcard:n [ hcard:given-name "Princeton" ]; hcard:adr _:a ] . [ a <http://microformats.org/profile/hcard>; hcard:fn "Trekkie"; hcard:n [ hcard:given-name "Trekkie" ]; hcard:adr _:a ] . _:a hcard:street-address "Avenue Q" .
The following snippet of HTML has microdata for a playlist
and illustrates the use of the schema:additionalType
property to relate recordings to the Music Ontology:
<div itemscope itemtype="http://schema.org/MusicPlaylist"> <span itemprop="name">Classic Rock Playlist</span> <meta itemprop="numTracks" content="2"/> <p>Including works by <span itemprop="byArtist">Lynard Skynard</span> and <span itemprop="byArtist">AC/DC</span></p>. <div itemprop="tracks" itemscope itemtype="http://schema.org/MusicRecording"> <link itemprop="additionalType" href="http://purl.org/ontology/mo/MusicalManifestation"/> 1.<span itemprop="name">Sweet Home Alabama</span> - <span itemprop="byArtist">Lynard Skynard</span> <link href="sweet-home-alabama" itemprop="url" /> </div> <div itemprop="tracks" itemscope itemtype="http://schema.org/MusicRecording"> <link itemprop="additionalType" href="http://purl.org/ontology/mo/MusicalManifestation"/> 2.<span itemprop="name">Shook you all Night Long</span> - <span itemprop="byArtist">AC/DC</span> <link href="shook-you-all-night-long" itemprop="url" /> </div> </div>
Assuming that the registry contains an entry for http://schema.org/
it generates these triples expressed in Turtle:
@prefix mo: <http://purl.org/ontology/mo/> . @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> . @prefix rdfa: <http://www.w3.org/ns/rdfa#> . @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> . @prefix schema: <http://schema.org/> . [ a schema:MusicPlaylist; schema:name "Classic Rock Playlist"; schema:byArtist "Lynard Skynard", "AC/DC"; schema:numTracks "2"; schema:tracks [ a schema:MusicRecording, mo:MusicalManifestation; schema:additionalType mo:MusicalManifestation; schema:byArtist "Lynard Skynard"; schema:name "Sweet Home Alabama"; schema:url <sweet-home-alabama> ], [ a schema:MusicRecording, mo:MusicalManifestation; schema:additionalType mo:MusicalManifestation; schema:byArtist "AC/DC"; schema:name "Shook you all Night Long"; schema:url <shook-you-all-night-long> ] ] .
This section is non-normative.
The following is the default registry in JSON format, as of the time of publication.
{ "http://schema.org/": { "properties": { "additionalType": {"subPropertyOf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"} } }, "http://microformats.org/profile/hcard": {} }
This section is non-normative.
Thanks to Richard Cyganiak for property URI and vocabulary terminology and the general excellent consideration of practical problems in generating RDF from microdata.