Copyright © 2005 W3C® (MIT, ERCIM, Keio), All Rights Reserved. W3C liability, trademark and document use rules apply.
GRDDL is a mechanism for Gleaning Resource Descriptions from Dialects of Languages. The GRDDL specification introduces markup for declaring that an XML document includes gleanable data and for linking to an algorithm, typically represented in XSLT, for gleaning the RDF data from the document.
The markup includes a namespace-qualified attribute for use in general-purpose XML documents and a profile-qualified link relationship for use in valid XHTML documents. The GRDDL mechanism also allows an XML namespace document (or XHTML profile document) to declare that every document associated with that namespace (or profile) includes gleanable data and for linking to an algorithm for gleaning the data.
A corresponding GRDDL specification provides complete technical details. A GRDDL Primer demonstrates the mechanism on XHTML documents which include widely-deployed dialects, more recently known as microformats.
This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.
This is a First Public Working Draft of GRDDL Use Cases: Scenarios of extracting RDF data from XML documents. A log of changes is maintained for the convenience of editors and reviewers.
The GRDDL design was first released as a W3C technical report in April 2004. This document was developed by the GRDDL Working Group, which was chartered in July 2006 to review the specification and develop use cases and tutorial material. The Working Group expects to advance GRDDL to Recommendation Status, though these use cases may end up as a separate Working Group Note.
GRDDL is intended to contribute to addressing Web Architecture issues such as RDFinXHTML-35 and namespaceDocument-8 as well as issues postponed by the RDF Core working group such as rdfms-validating-embedded-rdf and faq-html-compliance.
Please send comments about this document to public-grddl-comments@w3.org (with public archive).
Publication as a Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.
This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.
There are many dialects in practice among the many XML documents on the web. There are dialects of XHTML, XML and RDF that are used to represent everything from poetry to prose, purchase orders to invoices, spreadsheets to databases, schemas to scripts, and linked lists to ontologies. Some are more formally defined and others exhibit more loosely coupled semantics. Recently, two progressive encoding techniques have emerged to overlay additional semantics onto valid XHTML documents: RDFa and microformats offer simple, open data formats built upon existing and widely adopted standards.
While this breadth of expression is quite liberating, inspiring new dialects to codify both common and customized meanings, it can prove to be a barrier to understanding across different domains or fields. How, for example, does software discover the author of a poem, a spreadsheet and an ontology? And how can software determine whether authors of each are in fact the same person?
Any number of those XML documents on the web may contain data whose value would increase dramatically if they were accessible to systems which might not directly support such a wide variety of dialects but which do support RDF.
The Resource Description Framework[RDFC04] provides a standard for making statements about resources in the form of a subject-predicate-object expression. One way to represent the fact "The Stand's author is Stephen King" in RDF would be as a triple whose subject is "The Stand," whose predicate is "has the author," and whose object is "Stephen King." The predicate "has the author" expresses a relationship between the subject (The Stand) and the object (Stephen King). Using URIs to uniquely identify the book, the author and even the relationship would facilitate software design because not everyone knows Stephen King or even spells his name consistently.
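As a minimal sketch of the statement above, a triple can be modelled as three URIs. The book and author URIs below are hypothetical placeholders under example.org; only the Dublin Core "creator" property URI is a published identifier, and choosing it for "has the author" is an assumption made for illustration.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# Hypothetical URIs for the book and the author (illustration only).
THE_STAND = "http://example.org/books/the-stand"
STEPHEN_KING = "http://example.org/people/stephen-king"
# Dublin Core "creator" stands in for "has the author" (an assumption).
DC_CREATOR = "http://purl.org/dc/elements/1.1/creator"

statement = Triple(THE_STAND, DC_CREATOR, STEPHEN_KING)


def to_ntriples(t: Triple) -> str:
    """Serialize a triple whose three parts are all URIs as one N-Triples line."""
    return f"<{t.subject}> <{t.predicate}> <{t.obj}> ."


print(to_ntriples(statement))
```

Because every part is a URI, two programs that disagree on how to spell the author's name can still agree that they are talking about the same resource.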
RDF includes an XML concrete syntax and an abstract syntax. Software tools that use the Resource Description Framework naturally work with documents whose data is encoded using RDF/XML.
GRDDL is a mechanism for Gleaning Resource Descriptions from Dialects of Languages; that is, for extracting RDF data from XML documents by way of transformation algorithms, typically represented in XSLT.
For example, Dublin Core meta-data can be written in an HTML dialect[RFC2731] that has a clear correspondence to an encoding in RDF/XML[DCRDF]. The following HTML and RDF excerpts illustrate the correspondence:
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>Some Document</title>
    <meta name="DC.Subject" content="ADAM; Simple Search; Index+; prototype" />
    ...
  </head>
  ...
</html>
<rdf:RDF xmlns:dc="http://purl.org/dc/elements/1.1/"
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about="">
    <dc:subject>ADAM; Simple Search; Index+; prototype</dc:subject>
  </rdf:Description>
</rdf:RDF>
The transformation algorithm used to derive the RDF from the HTML is expressed as an XSLT transformation, dc-extract.xsl.
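The correspondence between the two excerpts can also be sketched procedurally. The following Python fragment is not dc-extract.xsl; it is an illustrative approximation that assumes every meta element whose name starts with "DC." maps to the Dublin Core property of the same (lower-cased) name, and it returns plain tuples rather than RDF/XML. The function name glean_dc is hypothetical.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
DC_NS = "http://purl.org/dc/elements/1.1/"

SOURCE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>Some Document</title>
    <meta name="DC.Subject" content="ADAM; Simple Search; Index+; prototype" />
  </head>
  <body />
</html>"""


def glean_dc(xhtml: str, about: str = "") -> list[tuple[str, str, str]]:
    """Return (subject, property URI, literal) triples for each DC.* meta element."""
    root = ET.fromstring(xhtml)
    triples = []
    for meta in root.iter(f"{{{XHTML_NS}}}meta"):
        name = meta.get("name", "")
        if name.startswith("DC."):
            # Assumption: "DC.Subject" corresponds to the dc:subject property.
            prop = DC_NS + name[3:].lower()
            triples.append((about, prop, meta.get("content", "")))
    return triples


for triple in glean_dc(SOURCE):
    print(triple)
```

The empty subject mirrors rdf:about="" in the RDF excerpt, that is, the statements are about the source document itself.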
This document collects a number of motivating use cases together with their goals and requirements for extracting RDF data from XML documents. These use cases also illustrate how XML and XHTML documents can be decorated with microformats, Embedded RDF or RDFa statements to support GRDDL transformations in charge of extracting valuable data that can then be used to automate a variety of tasks.
The companion GRDDL Working Draft is a concise technical specification of the GRDDL mechanism and its XML syntax. It specifies the GRDDL syntax to use in valid XHTML and well-formed XML documents, as well as how to encode GRDDL into namespaces and HTML profiles.
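To illustrate the two kinds of markup, here is a rough sketch of how an agent might discover transformation links. It assumes the namespace http://www.w3.org/2003/g/data-view# with a transformation attribute on the root element for general XML, and the profile http://www.w3.org/2003/g/data-view with link rel="transformation" for XHTML; the GRDDL specification remains the authority on the exact rules, and this sketch ignores cases such as a elements in the body. The function name and the example.org stylesheet URIs are hypothetical.

```python
import xml.etree.ElementTree as ET

GRDDL_NS = "http://www.w3.org/2003/g/data-view#"
GRDDL_PROFILE = "http://www.w3.org/2003/g/data-view"
XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hypothetical source documents; the stylesheet URIs are placeholders.
XML_DOC = (
    '<po xmlns="http://example.org/po" '
    'xmlns:grddl="http://www.w3.org/2003/g/data-view#" '
    'grddl:transformation="http://example.org/po2rdf.xsl"/>'
)
XHTML_DOC = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    '<head profile="http://www.w3.org/2003/g/data-view"><title>t</title>'
    '<link rel="transformation" href="http://example.org/dc-extract.xsl"/>'
    '</head><body/></html>'
)


def find_transformations(xml_text: str) -> list[str]:
    """Collect transformation URIs declared by a source document."""
    root = ET.fromstring(xml_text)
    # General XML: space-separated URIs in an attribute on the root element.
    found = root.get(f"{{{GRDDL_NS}}}transformation", "").split()
    # XHTML: link relationships count only when the head names the profile.
    head = root.find(f"{{{XHTML_NS}}}head")
    if head is not None and GRDDL_PROFILE in head.get("profile", "").split():
        for link in head.iter(f"{{{XHTML_NS}}}link"):
            href = link.get("href")
            if href and "transformation" in link.get("rel", "").split():
                found.append(href)
    return found


print(find_transformations(XML_DOC))
print(find_transformations(XHTML_DOC))
```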
The companion document, the GRDDL Primer Working Draft, is a progressive tutorial on the GRDDL mechanism with illustrated examples taken from the GRDDL Use Cases Working Draft.
The seven use cases detailed below could be summarized as:
Jane is trying to coordinate a meeting with her friends Robin, David and Kate. They each live in separate cities but often bump into each other at different conferences throughout the year. Jane wants to find a time when all of her friends are in the same city.
Despite their different formats, the calendars of all four friends can be used as source documents and converted to RDF. Once expressed as RDF the data can be merged and queried using tools such as the SPARQL query language.
Jane uses a GRDDL-aware agent to automatically extract data from each page, load this data in an RDF store and combine it in a single model. She then writes a query to filter the events down to those dates when all four friends are in the same city.
Jane is delighted to find that all four of them will be at conferences in LA at the beginning of September and she immediately starts looking for restaurants to book for their night out.
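The merge-and-filter step of this scenario can be sketched without an RDF store. The fragment below stands in for the SPARQL query: it assumes each calendar has already been gleaned into (person, city, day) records, and all names, cities and dates in EVENTS are made-up illustration data.

```python
from collections import defaultdict
from datetime import date

# Hypothetical gleaned events: (person, city, day). Values are illustrative.
EVENTS = [
    ("Jane", "LA", date(2006, 9, 4)),
    ("Robin", "LA", date(2006, 9, 4)),
    ("David", "LA", date(2006, 9, 4)),
    ("Kate", "LA", date(2006, 9, 4)),
    ("Jane", "Boston", date(2006, 10, 2)),
    ("Kate", "Boston", date(2006, 10, 2)),
]


def common_meetings(events, people):
    """Return sorted (city, day) pairs at which every person in `people` is present."""
    present = defaultdict(set)
    for person, city, day in events:
        present[(city, day)].add(person)
    return sorted(key for key, who in present.items() if people <= who)


friends = {"Jane", "Robin", "David", "Kate"}
print(common_meetings(EVENTS, friends))
```

Once the four calendars share one data model, the query no longer needs to know which format each friend originally used.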
Browsing the calendar of her friends, Jane noticed various conferences, talks, and other gatherings of social groups in her area. These groups publish their calendars in various HTML-based formats: microformats, eRDF, RDFa, or some home-grown way to express calendar information.
These calendars are source documents and thus Jane could easily add all of these events to her own calendar. However, Jane does not want to add all these events to her calendar. She wants to pick and choose which events to attend. She wants to browse this list of events and each time she finds an event she is interested in, she wants to be able to select it and copy-paste it to her calendar.
To enable this copy-paste, Jane's browser includes a GRDDL-aware agent and supports a default RDF-in-HTML embedding scheme called RDFa. The GRDDL transformation specified in the page indicates how to transform this XHTML into XHTML+RDFa, while preserving the style and layout of the page.
Thus, Jane's RDFa-aware browser can perform the transform even before rendering the XHTML. The rendered XHTML+RDFa provides copy-paste functionality via right-clicking on an event directly in the rendered page.
See also: put projects and prototypes here.
Kayode, a developer for a clinical research data management system, uses XML as the main representation format for their computer-based patient record. He edits the XML remotely via forms and submits the XML document over HTTP to a unique URI for each such record.
He wants to use a content management system which includes a mechanism to automatically replicate an XML document into equivalent, named RDF graphs for persistence in synchrony with any changes to the document.
The expense of dual representation as single-purpose XML vocabulary and RDF includes space and synchrony problems, but the primary value is being able to query both as XML and as RDF. The corresponding XML documents can be transformed into other non-RDF formats, evaluated by XPath and XPointer expressions, cross-linked by XLink or XInclude, and structurally validated by RELAX NG (or XML Schema). Kayode has found RDF queries more amenable for investigative querying, since they allow him to ask speculative questions using standard healthcare ontologies for patient records, such as the HL7 OWL ontology.
Kayode realizes a GRDDL approach alleviates this expense by allowing a computer-based patient record or any XML-based collection of clinical research data to be queried semantically by associating a GRDDL profile to the specific XML vocabulary.
Using RDF helps manage research projects assigned to residents. Kayode finds RDF especially helpful while trying to determine initial search criteria for a patient population relevant to a particular study. Each study has a set of classifications specific to the study that they express in an ontology or using rules.
Kayode designs a web-based user interface that works with a GRDDL-aware agent which picks computer-based patient records from a remote server. Each is a source document associated with transforms that extract clinical data as RDF expressed in a universally supported vocabulary for a computer-based patient record.
The resident physicians then ask speculative questions of the resulting RDF graph or apply the study-specific rules on the resulting RDF to classify the data according to their domain of interest, such as specific diagnoses and pathological observations.
For Kayode, having an RDF representation of the clinical data provides him advantages over just using a single-purpose XML vocabulary, in particular an additional level of interpretation and ability to integrate data from diverse sources. The inherent difficulties of using multiple XML vocabularies over domains such as clinical data make the mapping to a unified ontology even more valuable.
See also: GALEN / Open GALEN, 4Suite, HCLSIG HL7 OWL Ontology
Stephan wishes to buy a guitar, so he decides to check reviews. There are various special interest publications online which feature musical instrument reviews. There are also blogs which contain reviews by individuals. Among the reviewers there may be friends of Stephan and people whose opinion Stephan values (e.g. well-known musicians and people whose reviews Stephan has found useful in the past). There may also be reviews purposely planted by instrument manufacturers which offer very biased views.
Stephan visits a site offering a review service and enters his preference for guitar reviews which gave a high rating for the instrument. This initial request is answered with a list of all the relevant review titles/summaries together with information about the reviewers.
From this list Stephan chooses only the reviewers he trusts, and on submitting these preferences is finally presented with a set of full reviews which match his criteria.
Reviews published using hReview microformats can be discovered using existing search services. These source documents can be consumed by a GRDDL-aware agent to extract the RDF which is then aggregated together in a store. Information about the reviewers can also be aggregated from various sources including hCard and XFN microformats and autodiscovered FOAF profiles possibly harvested through links in Stephan's own profile. The filtering may be achieved by running SPARQL queries against the aggregated data, presented to the user through regular HTML form interfaces.
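The two-stage filtering in this scenario (first by rating, then by trusted reviewer) can be sketched over already-aggregated data. The Review record, its field names and the sample reviews below are hypothetical; a real deployment would run SPARQL queries against the RDF gleaned from hReview, hCard, XFN and FOAF sources.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Review:
    reviewer: str
    item: str
    rating: float
    summary: str


# Hypothetical aggregated reviews (illustration only).
REVIEWS = [
    Review("alice", "Guitar A", 4.5, "Warm tone"),
    Review("bob", "Guitar A", 5.0, "Best ever"),
    Review("carol", "Guitar B", 2.0, "Buzzing frets"),
]


def shortlist(reviews, min_rating):
    """First request: keep reviews that rate the instrument highly."""
    return [r for r in reviews if r.rating >= min_rating]


def from_trusted(reviews, trusted):
    """Second request: keep only reviews written by reviewers Stephan trusts."""
    return [r for r in reviews if r.reviewer in trusted]


candidates = shortlist(REVIEWS, 4.0)
print([r.reviewer for r in candidates])
print(from_trusted(candidates, {"alice"}))
```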
See also: put projects and prototypes here.
The Company DC4Plus uses its web site to publish its catalogue of products and services as well as a number of digital documents both on their public web site (white papers, user guides and technical manuals of products and brochures) and on their intranet (internal reports and administrative forms). Product after product, DC4Plus is growing a digital library as part of its web site.
Adeline is an IT manager at DC4Plus. She is concerned by the tension between, on one hand, the natural heterogeneity and distribution of all these electronic documents and, on the other hand, the need to have an integrated and unified view of all these productions. She believes there is a need to automate the detection, indexing and search capabilities for these documents. Moreover, several corporate documents follow a standard process before being published, and there is a growing demand from users and managers to be able to automate this process and follow the status of each document.
Adeline first focuses on the Technical Reports published by the different divisions of DC4Plus. These reports are published following a well-defined process. She proposes a system that relies on Semantic Web technologies to allow her company to streamline the publication paper trail of Technical Reports, to maintain an RDF-formalized index of these specifications and to create a number of tools using this newly available data.
Adeline's implementation of this vision at DC4Plus can be given in five steps:
This system relies on shared templates for publishing documents and including RDFa annotations to mark important data. A GRDDL-aware agent extracts this metadata as RDF. By crawling the published reports and applying the associated GRDDL transformations to them, a complete and up-to-date RDF index is built from resources distributed over the organization's website. This RDF index is then used to create a central yet flexible authoritative repository.
Adeline believes that this scenario can be generalized to any organization interested in maintaining a portal to a digital library with customized indexes, dedicated search forms and navigation widgets. In particular, she appreciates that in such an architecture the simple fact that the XHTML documents are put online following official templates allows GRDDL-aware agents to extract corresponding RDF annotations that can then be used to generate portals, feed workflow engines and run queries directly against the site.
See also: Automating the publication of Technical Reports
The Technical University of Marcilly (TMU) decided to use wikis to foster knowledge exchanges between lecturers and students. They tested several wikis over the years and they want to experiment with novel ways of structuring the wiki to improve navigation and retrieval and they also want to make it easier to reuse learning objects in different contexts. Ideally TMU wants the information structuring the wiki to be:
In this context TMU uses metadata embedded in the wikipages to:
Let us consider the case of Michel, a lecturer in engines and thermodynamics. He used the wiki to publish the handouts of his course. He initially tagged each handout with the main concepts it introduces (e.g. "RenewableEnergies", "Ethanol", "Diesel"). In addition, Michel automatically typed each section of the document using predefined styles (e.g. definition, formula, example). The next practical session will involve knowledge on classical Diesel engines and Ethanol-based engines. In order to generate a mnemonic card for this session Michel runs a query to extract definitions and formulas of the courses tagged with "Diesel" or "Ethanol". He also uses these tags to generate dynamic "see also" sections at the end of his sections suggesting other sections to read.
Students edit the online handouts to add pointers, to insert comments on parts they found difficult to understand, and to recall pieces of previous courses useful for understanding a new course. Students also tagged the pages with their own tags to organize their reading and bookmark important parts for them; they use tags to create transversal thematic tracks (e.g. "LiquidFlow"), to give feedback on the content (e.g. "Difficult"), and to prioritise reading (e.g. "NiceToKnow", "Vital"). These tags allow them to have transversal navigation and reorganize the content depending on the task they are doing (e.g. preparing an exam, writing a report, running an experiment). These tags are also used by Michel to evaluate the understanding and the shortcomings of his course.
Finally the mass of the course material and tags is such that it needs to be reorganised. Using the tag editor Michel groups "Ethanol" and "Methanol" as sub-tags of a new tag he calls "Alcohol". Doing so, the pages tagged with "Ethanol" or "Methanol" are grouped and accessible through "Alcohol". He repeats this with other tags (e.g. "Alcohol" and "Hydrogen" become sub-tags of "NewEngineEnergy"). This reorganizes the wiki seamlessly: for example, the navigation suggestions in the pages automatically propose narrower, broader and sibling tags, so that when viewing a page tagged with "Ethanol", the system suggests other pages tagged with "Methanol". Later, when a student posts his report on an engine using "CopraOil", his new tag can be placed under the existing one, "NewEngineEnergy"; he or anyone else can do it and the result will immediately benefit the whole community of users. Using these tags and their organization, thematic indexes are dynamically generated for the materials of the course and automatically updated.
From the technical standpoint, TMU designed a wiki that stores its pages directly in XHTML, and RDF annotations are used to represent the wiki structure and annotate the wiki pages and the objects they contain (images, uploaded files, etc.). The RDF structure allows refactoring the wiki structure by editing the RDF annotations and the RDFS schemas they are based on. RDF annotations are embedded in the wiki pages themselves using RDFa and microformats. Some of the learning objects can be saved in XML formats, and an XSLT stylesheet exploits the styles used for the session to tag the different parts (e.g. definition, exercise, example); these annotations can then be used to generate new views on this resource (e.g. list of definitions, hypertext support for practical sessions).
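The tag reorganisation described above amounts to maintaining a "broader tag" relation and deriving narrower and sibling tags from it. The sketch below assumes a simple dictionary in place of the RDFS-based annotations; the page names in PAGES and all function names are hypothetical.

```python
# Each tag maps to its broader tag, as set up by Michel in the tag editor.
BROADER = {
    "Ethanol": "Alcohol",
    "Methanol": "Alcohol",
    "Alcohol": "NewEngineEnergy",
    "Hydrogen": "NewEngineEnergy",
}

# Hypothetical wiki pages and the tags attached to them.
PAGES = {
    "EthanolEngines": {"Ethanol"},
    "MethanolSafety": {"Methanol"},
    "FuelCells": {"Hydrogen"},
}


def narrower(tag, broader):
    """Direct sub-tags of `tag`."""
    return sorted(t for t, b in broader.items() if b == tag)


def siblings(tag, broader):
    """Other tags that share the same broader tag."""
    parent = broader.get(tag)
    if parent is None:
        return []
    return sorted(t for t, b in broader.items() if b == parent and t != tag)


def descendants(tag, broader):
    """All tags below `tag`, at any depth."""
    found = set()
    for child in narrower(tag, broader):
        found.add(child)
        found |= descendants(child, broader)
    return found


def pages_under(tag, broader, pages):
    """Pages tagged with `tag` or with any tag below it."""
    wanted = {tag} | descendants(tag, broader)
    return sorted(p for p, tags in pages.items() if tags & wanted)


print(siblings("Ethanol", BROADER))
print(pages_under("Alcohol", BROADER, PAGES))
```

Placing a new tag such as "CopraOil" under "NewEngineEnergy" is a single new entry in the relation, after which every derived view picks it up.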
The embedded RDF is extracted by a GRDDL-aware agent using GRDDL transformations available online as XSLT stylesheets to provide semantic annotations directly to the application that needs to extract the embedded metadata:
See also: Sweet Wiki, Semantic Wikis
Voltaire's blog is pretty popular and encompasses many major areas of interest, one of which is bird watching. Voltaire has so many areas of interest and spends so much time watching birds that he doesn't want to surf the net and find each and every site he might want to syndicate. Rather than 'manually' subscribing to third-party blogs that are appropriate to the themes he covers, he wants to reverse the subscription model to be push-based, i.e. people who want their blogs to be included can push the appropriate entries to his blog; his blog becomes somewhat of a magnet for similar entries of interest.
Voltaire has set up a weblog engine that utilizes XForms for editing entries remotely using the Atom Publishing Protocol. Voltaire has found the use of XForms for authoring fragments of Atom quite useful for a variety of reasons. In particular, the Atom Publishing Protocol uses HTTP and a single-purpose XML vocabulary as its primary remote messaging mechanism, which allows Voltaire to easily author various XForms documents that use XForms submission elements to dispatch operations on web resources.
As a result, the XForms for dispatching these operations each contain a rather rich set of information about transport-level services in the form of service URIs, media-types and HTTP methods. These are completely encapsulated in an XForms submission element. It so happens that there is an RDF vocabulary for expressing transport metadata called RDF Forms.
Somewhere else on the planet, the professional ornithologist Johan Bos, who recently spotted a red kite (Milvus milvus) far from its breeding ground in central Wales, is planning to post blog entries about his observations. To make his results visible he wants his entries to be included in Voltaire's blog.
Voltaire's site provides a general GRDDL transformation that extracts an RDF Form graph from the XForms submission elements employed in the various web forms for editing, deleting, and updating Atom entries on his weblog. Such a transformation can uniformly extract an RDF description of the transport mechanisms for a software agent to interpret. Johan's client can automatically retrieve an Introspection Document (via the Atom Publishing Protocol), update existing entries using the identified service URIs, and perform other such services.
Thus Johan's client relies on a GRDDL-aware agent to periodically extract the service URIs, transform the content at these URIs to Atom/OWL and query the resulting RDF to determine if the topics match. Doing so, he will replicate his entries at the matching URIs by POSTing them there.
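The extraction of transport metadata from XForms submission elements can be sketched as follows. This is not the GRDDL transformation Voltaire's site provides and it does not emit the RDF Forms vocabulary; it only collects the service URI and HTTP method of each submission into plain dictionaries. The form content and the example.org URIs are hypothetical, and the sketch assumes the XForms namespace http://www.w3.org/2002/xforms with the action (or resource) and method attributes.

```python
import xml.etree.ElementTree as ET

XFORMS_NS = "http://www.w3.org/2002/xforms"

# Hypothetical editing form; the service URIs are placeholders.
FORM = """<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:xf="http://www.w3.org/2002/xforms">
  <head>
    <xf:model>
      <xf:submission id="create" action="http://example.org/blog/entries" method="post"/>
      <xf:submission id="update" action="http://example.org/blog/entries/1" method="put"/>
    </xf:model>
  </head>
  <body/>
</html>"""


def extract_services(xhtml: str) -> list[dict]:
    """Describe each XForms submission as a service URI plus an HTTP method."""
    root = ET.fromstring(xhtml)
    services = []
    for sub in root.iter(f"{{{XFORMS_NS}}}submission"):
        services.append({
            "id": sub.get("id"),
            # Prefer "resource" when present, otherwise fall back to "action".
            "uri": sub.get("resource") or sub.get("action"),
            "method": (sub.get("method") or "").upper(),
        })
    return services


for service in extract_services(FORM):
    print(service)
```

A client such as Johan's could then select the entry whose method is POST and whose topic matches, and send its own entries to that URI.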
Voltaire does not need to manage the subscriptions; all he might want to do is grant accounts to contributors such as Johan for HTTP-level authentication (as a deterrent for spam, since reversing the subscription model in this way opens up Voltaire's system to lots of spam).
See also: XForms 1.1 specification, Atom Publishing Format and Protocol (atompub).
The Open Archives Initiative (OAI) publishes an XML schema that universities can use to publish their archived documents. They include guidelines for expressing the rights of these documents, including the possibility of referencing a license, such as a Creative Commons license.
More than 800 universities implement this schema. Creative Commons would like to deploy tools, like the MozCC browser extension, which provides a convenient way to examine licenses embedded in web pages and interpret them.
It is unreasonable to expect to interpret everyone's favorite XML schema, yet communities like the OAI would like to be able to include licensing information in their XML schema.
On the other hand, Creative Commons would like to be able to make a generic recommendation to anyone with XML instance documents, allowing them to do what they want with their XML schemata, as long as they include a transformation of the instance documents to RDF.
Since the XML instance documents are often distributed, as in the OAI case, the XML schema itself could embed RDF descriptions identifying a transform to apply to all its instance documents. So doing, for each source document, the transformation is indirectly referenced by the XML Schema it follows.
The XML schema is served from the namespace location and is a source document which includes descriptions associating a GRDDL transform with its instances. Thus it serves a dual purpose for its instances: validation and identifying transforms to glean meaning.
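The indirection described here, from instance document to namespace to transformation, can be sketched as a lookup. In the fragment below a dictionary stands in for the result of retrieving and gleaning the namespace document; the namespace URI, the stylesheet URI and the function names are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical outcome of gleaning namespace documents:
# namespace URI -> transformations that apply to every instance document.
NAMESPACE_TRANSFORMATIONS = {
    "http://example.org/archive-schema/": ["http://example.org/archive2rdf.xsl"],
}

# Hypothetical instance document that carries no transformation link itself.
INSTANCE = (
    '<record xmlns="http://example.org/archive-schema/">'
    "<rights>http://example.org/some-license</rights>"
    "</record>"
)


def root_namespace(xml_text: str) -> str:
    """Return the namespace URI of the root element, or '' if it has none."""
    tag = ET.fromstring(xml_text).tag
    return tag[1:].split("}", 1)[0] if tag.startswith("{") else ""


def transformations_for(xml_text: str, registry=NAMESPACE_TRANSFORMATIONS) -> list[str]:
    """Find transformations referenced indirectly through the root namespace."""
    return list(registry.get(root_namespace(xml_text), []))


print(transformations_for(INSTANCE))
```

The instance documents stay untouched; only the schema at the namespace location needs to declare the transform.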
See also: Open Archives Initiative, Creative Commons, MozCC.
The editors gratefully acknowledge the contributions of the following Working Group members:
This document is a product of the GRDDL Working Group.
Changes since the 27 Sep WG decision to publish include:
$Log: Overview.html,v $
Revision 1.4 2018/10/09 13:28:32 denis
  fix validation of xhtml documents
Revision 1.3 2017/10/02 10:32:21 denis
  add fixup.js to old specs
Revision 1.2 2006/10/03 16:16:39 connolly
  removed editor's draft blurb from status section
Revision 1.1 2006/10/03 16:06:55 jean-gui
  /TR/2006/WD-grddl-scenarios-20061002/
Revision 1.59 2006/10/03 08:08:33 jigsaw
  (fgandon) Changed through Jigsaw.
Revision 1.58 2006/10/03 07:57:19 jigsaw
  (fgandon) Changed through Jigsaw. (fixed fragments and broken links)
Revision 1.57 2006/10/02 23:00:31 connolly
  s/Contents/Table of Contents/
Revision 1.56 2006/10/02 22:54:06 connolly
  trim changelog at 27Sep
Revision 1.55 2006/10/02 22:51:54 connolly
  turn comments mailbox into a link
Revision 1.54 2006/09/30 00:51:52 connolly
  fix link to primer (well, where primer will be)
Revision 1.53 2006/09/30 00:50:25 connolly
  remove extra W3C icon
Revision 1.52 2006/09/30 00:49:51 connolly
  minor cite markup fix
Revision 1.51 2006/09/30 00:45:44 connolly
  clobber edits since 1.46
Revision 1.46 2006/09/29 22:30:44 connolly
  - toward ,pubrules happiness:
  - XHTML 1.*0* doctype is required. go figure.
  - tweaked W3C icon markup (don't really understand why it needed tweaking)
  - added WD subtitle, stylesheet (will remove again soon)
  - this version, latest version links
  - added custom para to Status section
Revision 1.45 2006/09/29 21:53:43 connolly
  validation fix: no name attr on a element of RDFC04 bib entry
Revision 1.44 2006/09/29 21:44:14 connolly
  fix reference from intro to GRDDL spec
Revision 1.43 2006/09/29 21:39:05 connolly
  moved contributors to acks section
Revision 1.42 2006/09/29 19:23:32 connolly
  more intro clean-up
Revision 1.41 2006/09/29 19:07:02 connolly
  - started drafting status section for publication
  - replaced 199 RDF citation with 2004 Recommendation
  - replaced RDFa editor's draft citation with publishd WD citation
  - replaced GRDDL editor's draft citation with published WD citation
  - another take at the introduction, aiming toward a shared intro across the 3 GRDDL drafts
Revision 1.40 2006/09/29 17:42:30 connolly
  revised abstract
Revision 1.39 2006/09/29 17:34:18 connolly
  updated h1 to match title
  reduced resolution of date in subtitle
Revision 1.38 2006/09/29 17:32:43 connolly
  moved summary of use cases to end of introduction, after a coule more introductory paragraphs
  added xmlns decl
Revision 1.37 2006/09/29 17:19:51 connolly
  title update
Revision 1.36 2006/09/28 09:15:07 jigsaw
  (fgandon) Changed through Jigsaw.
Revision 1.34 2006/09/28 08:58:20 jigsaw
  (fgandon) Changed through Jigsaw. (included reviews from Murray, Chimezie and Danny)
Revision 1.33 2006/09/21 16:03:53 jigsaw
  (fgandon) Changed through Jigsaw.