Status: Team confidential – under team review

Team Comment on Website Parse Template Submission

W3C is pleased to receive the Website Parse Template Submission from OMFICA.

Website Parse Template

The submission defines an XML-based format that describes the semantic structure shared by a set of HTML Web pages, together with rules to extract semantically rich data from portions of these pages. The format is intended to minimize the potential discrepancies between the actual content of an HTML page and an RDF graph representation of the same content, so that Web crawlers typically used by search engines may use the RDF graph to index the page appropriately. The submission uses the format to describe HTML pages, but it may be directly extended for use in general-purpose XML documents.

The submission also defines the WPT Ontology language that provides a very minimalist vocabulary definition language.

Consistency between structured content intended for humans and the authoritative meaning (or semantics) of that content intended for machines is ensured by the fact that the semantics are directly extracted from the content. The extraction rules bind sections of the page identified by XPath expressions to machine-readable content descriptions based on ontologies. Since the extraction rules become the potential source of errors in this paradigm, they would be of little value if they had to be defined separately for every page. The submission thus encourages the re-use of extraction rules across a set of URIs identified by a regular expression.
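The binding of XPath expressions to ontology terms, re-used across all URIs that match a pattern, can be illustrated with a minimal sketch. It does not use the WPT syntax itself: the rule structure, the example.org URIs, and the "ex:" term names are all assumptions made for illustration, and the page is assumed to be well-formed XML.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical rule set: one URI pattern plus XPath-to-term bindings.
# The "ex:" terms are illustrative names, not part of any WPT vocabulary.
RULES = {
    "url_pattern": re.compile(r"^http://example\.org/products/\d+$"),
    "bindings": [
        (".//h1[@class='title']", "ex:title"),
        (".//span[@class='price']", "ex:price"),
    ],
}

# A well-formed sample page (assumed; real HTML may need a tolerant parser).
PAGE = """<html><body>
<h1 class="title">Blue Widget</h1>
<p>Price: <span class="price">9.99</span></p>
</body></html>"""


def extract(url, markup, rules):
    """Return (subject, term, value) triples for a page covered by the rules."""
    if not rules["url_pattern"].match(url):
        return []  # the rule set does not apply to this URI
    root = ET.fromstring(markup)
    triples = []
    for xpath, term in rules["bindings"]:
        for node in root.findall(xpath):
            triples.append((url, term, (node.text or "").strip()))
    return triples


triples = extract("http://example.org/products/42", PAGE, RULES)
```

Because the rules are keyed on a URI pattern rather than a single address, the same bindings apply unchanged to every product page, which is the re-use the submission encourages.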

Analysis

The declarative approach used by the submission to define the semantic extraction rules follows the usual approach of W3C Web standards. While reviewing the submission, we wondered whether a similar goal could be achieved using existing Semantic Web standards in addition to XPath and OWL. For example, RDFa permits document authors to embed semantic information, including data about the document content, within the document. WPT addresses a use case not addressed by GRDDL: the case where the original document author did not explicitly include a link to the extraction rules. However, WPT does not address how the document author may indicate explicitly that the extraction rules do in fact represent the intended semantics of the document. GRDDL makes this declaration explicit.

The notion of blocks introduced in section 2.1 of the submission could perhaps be transposed to the XSLT notion of template rules. XSLT stylesheets may be used to parse an XML document and generate an RDF graph represented in some XML serialization of RDF. This could serve as the basis for a potential alternative approach, as generalized by the GRDDL specification. The RDF triples generated by the XSLT stylesheet would be based on an ontology that could be defined using the OWL Lite profile for simplicity.

Having the extraction defined is not enough, though. There would need to be a way to assert that the XSLT stylesheet can be used on the resource to generate an RDF graph and that it may be applied to a whole set of resources. GRDDL addresses the first part of the problem. The stylesheet transformation link could then be inserted in a POWDER document, as defined in the ongoing work on the POWDER (Protocol for Web Description Resources) suite of specifications. POWDER provides a general mechanism to publish RDF statements about individual documents or groups of documents in a "description resource" that may be provided by a third party. The POWDER mechanism for defining groups of documents is a superset of the URL pattern descriptors described in section 2.2 of the Submission. A POWDER description resource could then provide a link to a WPT parse template for that group of documents.

We also note that RDFa reduces potential mismatches between the structured content and the data produced by the extraction rules. RDFa defines a set of attributes that can be used to express RDF statements in any markup language, re-using the actual content whenever possible. RDFa in XHTML: Syntax and Processing specifies how to use RDFa with XHTML content.
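As a rough illustration of how RDFa re-uses the visible content, the sketch below reads statements from "about", "property" and "content" attributes. It is a deliberate simplification and not a conforming RDFa processor: prefixes are left unexpanded, only these three attributes are handled, and the sample markup and example.org URIs are assumptions.

```python
import xml.etree.ElementTree as ET

# Assumed sample markup; the report URI and values are illustrative only.
DOC = """<div xmlns:dc="http://purl.org/dc/elements/1.1/" about="http://example.org/report">
<h1 property="dc:title">Annual Report</h1>
<span property="dc:date" content="2008-06-01">1 June 2008</span>
</div>"""


def walk(node, subject, out):
    """Collect (subject, property, value) statements from a subtree."""
    # An "about" attribute changes the subject for this element and its children.
    subject = node.get("about", subject)
    prop = node.get("property")
    if prop is not None:
        # Prefer the machine-readable "content" value; otherwise re-use
        # the text that is already displayed to human readers.
        value = node.get("content")
        if value is None:
            value = "".join(node.itertext()).strip()
        out.append((subject, prop, value))
    for child in node:
        walk(child, subject, out)
    return out


statements = walk(ET.fromstring(DOC), "http://example.org/", [])
```

The title statement takes its value straight from the displayed heading, so the human-readable and machine-readable versions cannot drift apart; only the date needs a separate "content" value.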

The WPT ontology language presented in section 2.3, though simpler in syntax than RDF Schema and OWL, appears to have no facility for describing the meanings of the terms named in a WPT ontology, nor any additional characteristics of those terms. We were not able to find a document at the WPT namespace URI, so we could not determine whether WPT ontology terms are grounded in URI space, nor assess the degree to which WPT contributes to a self-describing Web.

Relationships to W3C Activities

The extraction of semantic meaning from structured content is directly relevant to the W3C Semantic Web Activity and specifically GRDDL and RDFa.

Next Steps

W3C has no plans at present to take up work based on this Submission. We encourage the community interested in this area to investigate augmentation of WPT with a combination of RDFa, GRDDL, and POWDER.

Author: Francois Daoust