This document describes experimental work in progress at HP Labs - Bristol on formal techniques for describing combinations of modular tagsets for documents written in XML. The motivation is provided by the increasing diversity of web browsers, which now run on desktops, televisions, handhelds and cellphones, and include voice browsers.
The goal is to provide a means for documents to be described in terms of an algebra operating over modules, which in turn are described as collections of assertions. It is hoped that this work will provide an interesting comparison with traditional approaches based upon Document Type Definitions, and with more recent approaches, such as the drafts published by the W3C XML Schemas working group.
XML documents are composed principally of elements, attributes and text. The permitted arrangement of elements and their associated attributes varies according to the purpose. This specification provides a basis for defining a group of documents sharing a common syntax for elements and attributes.
The approach goes well beyond what can be represented with XML document type definitions (DTDs), providing much more precise definitions of attribute values and of linked data formats such as image formats, style sheets and scripts.
A schema specifies an unordered collection of modules, for example, headings, lists, tables and graphics. Each module is defined as an unordered collection of logical assertions. The underlying theoretical framework is founded upon sub-tree matching, sets and a simple mechanism for overriding inherited properties.
Elements are characterised by the element name, the permitted attributes, the context in which the element may appear, and the content the element can contain, if any.
Documents commonly include a sequence of elements, all belonging to a given set. These sets can be given names. In this specification such names all start with a '%' character. You can define tags as appearing in a given set, or alternatively explicitly enumerate the tags in the set. Either way or any combination is fine. If sets are defined in terms of other sets, cyclic dependencies need to be guarded against.
Each attribute has a name, a data type and a default value (possibly implicit). The type and default value may vary according to which element the attribute applies to. Like tags, attributes may belong to sets. If several elements are associated with the same sets of attributes, it is convenient to name the sets.
The ability to make generalizations with exceptions can be applied to content models, attributes, data types and default values. Inheritance raises the possibility of conflicts, but these can be avoided to some extent by specificity rules.
The content model and attributes permitted may depend on the context. A general mechanism is proposed based on pattern matching against the tree structure of elements and attributes in a document. The approach can be thought of as a generalization of regular expressions. A conditional assertion only applies if a pattern match is found for the associated element or attribute.
Some assertions only apply in a particular context. This can be specified in terms of a condition describing a pattern matching the document markup. Patterns are composed from a number of sub-patterns:
Pattern | Meaning |
---|---|
p | element p |
* | any element |
a/b | a is the direct parent of b |
a/../b | a is an ancestor of b |
^a | a is the first element in the content |
a$ | a is the last element in the content |
p#3 | p is the third element in the content |
a,b | a immediately followed by b |
a,..,b | a followed sooner or later by b |
a\|b | a or b |
%p | a named enumeration equivalent to an or group |
~a | not a |
%p~a | anything in %p except a |
[a] | attribute a |
p[a] | element p has attribute a |
(pattern) | brackets used for grouping |
p! | pattern is anchored on element p |
! | pattern anchor (equivalent to *!) |
Some examples:
The anchor is used to define which element or attribute in the matched sub-tree the assertion applies to, e.g.
The pattern may contain more than one such anchor. The expression language could easily be extended for rich matches over attribute values. Another enhancement would be a means to specify cardinality constraints, such as that there may not be more than one form field within a label.
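As a rough illustration of sub-tree matching, the following Python sketch handles only three of the sub-patterns listed above: element names, `*`, and the parent and ancestor forms `a/b` and `a/../b`. The function names `match_path` and `find` are hypothetical, and the sketch assumes that the last step of the pattern is the anchor and that `a/../b` also covers the case where a is the direct parent of b. It is not taken from any existing implementation.

```python
import xml.etree.ElementTree as ET
from typing import List

def match_path(chain: List[str], steps: List[str]) -> bool:
    """Match pattern steps against the tail of an ancestor chain, right to left."""
    if not steps:
        return True
    if steps[-1] == "..":
        # Assumption: '..' stands for zero or more intermediate elements.
        return any(match_path(chain[:i], steps[:-1])
                   for i in range(len(chain), -1, -1))
    if chain and steps[-1] in ("*", chain[-1]):
        return match_path(chain[:-1], steps[:-1])
    return False

def find(root: ET.Element, pattern: str) -> List[ET.Element]:
    """Return the elements at which the path pattern matches (last step = anchor)."""
    steps = pattern.split("/")
    hits: List[ET.Element] = []

    def walk(node: ET.Element, chain: List[str]) -> None:
        chain = chain + [node.tag]          # tag names from the root down to node
        if match_path(chain, steps):
            hits.append(node)
        for child in node:
            walk(child, chain)

    walk(root, [])
    return hits

doc = ET.fromstring("<body><ul><li><p><em/></p></li></ul><p/></body>")
print(len(find(doc, "li/p")))      # 1: the p whose direct parent is li
print(len(find(doc, "ul/../em")))  # 1: the em that has ul as an ancestor
print(len(find(doc, "ul/p")))      # 0: no p is a direct child of ul
print(len(find(doc, "*/p")))       # 2: both p elements have some parent
```

A conditional assertion could then be applied only to the elements returned by `find` for its condition.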
The following sections cover some of the kinds of assertions and how they can be used in an example module. This is work in progress and is expected to be considerably expanded upon in future revisions to this spec.
This is used to import assertions from other modules. There are two properties:
src
: a web address referencing a module
(required)
name
: a short name used within the current module
to disambiguate name clashes across modules (optional)
Example:
<import src="linking.xml"/>
<import src="inline-phrasal.xml"/>
<import src="inline-presentational.xml"/>
<import src="inline-structural.xml"/>
<import src="block-phrasal.xml"/>
<import src="block-presentational.xml"/>
<import src="block-structural.xml"/>
<import src="applet.xml"/>
<import src="image.xml"/>
The above imports assertions from a number of modules. The ordering of the assertions has no consequence. This is true for all kinds of assertions and makes them easy to use.
Used to assert facts about XML elements. There are several properties, all of which are optional with the exception of name.
name
: the tag name (required)
context
: list of tags or tag groups in whose
content this tag can appear. Group names always start with a '%'
sign, e.g. "%inline" (optional)
condition
: a tree expression constraining
the applicability of the assertion (optional)
attributes
: list of attribute names or attribute
group names (optional)
content
: a grammar constraining the element's
content. The grammar is defined in the same way as for XML 1.0
DTDs with the exception that tag groups are used in place of XML
parameter entities (optional)
Example:
<tag name="dir" context="%list"/>
<tag name="menu" context="%list"/>
<tag name="ul" context="%list"/>
<tag name="ol" context="%list"/>
<tag name="dl"/>
<tag name="dt" content="%inline"/>
<tag name="dd" content="%flow"/>
<tag name="li" content="%flow"/>
The first four assertions state that dir, menu, ul and ol all appear in the context of the tag group "%list". The next assertion says that there is a tag called "dl" but tells us nothing else about it. The last three assertions specify the content model for dt, dd and li in terms of named tag groups. There is no need for these groups to have been defined beforehand.
If the content model is defined as "%foo" and "%foo" is the tag group {a, b, c}, then this is equivalent to writing "(a|b|c)*". You can include named tag groups in content models, e.g. "%foo,d" expands to "(a|b|c)*,d".
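The expansion rule just described can be sketched in a few lines of Python. The function name `expand_content_model` and the dictionary representation of tag groups are assumptions made for this illustration. The sketch also assumes that every referenced group has been fully resolved to a flat list of tags before expansion.

```python
import re
from typing import Dict, List

def expand_content_model(model: str, groups: Dict[str, List[str]]) -> str:
    """Replace each %name in a content model with (t1|t2|...)*."""
    def replace(match: "re.Match") -> str:
        members = groups[match.group(0)]   # KeyError if the group is unknown
        return "(" + "|".join(members) + ")*"
    # A group reference is '%' followed by a name; ',' and '|' end the name.
    return re.sub(r"%[A-Za-z][\w.-]*", replace, model)

groups = {"%foo": ["a", "b", "c"]}
print(expand_content_model("%foo", groups))    # (a|b|c)*
print(expand_content_model("%foo,d", groups))  # (a|b|c)*,d
```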
Used to assert properties of attributes. There are several properties, all of which are optional with the exception of name.
name
: the attribute name (required)
context
: a list of names of attributes or
attribute groups (optional)
condition
: a tree expression constraining
the applicability of the assertion (optional)
type
: default data type (optional)
default
: default value (optional)
Example:
<attribute name="href" type="URI"/>
<attribute name="src" type="URI"/>
<attribute name="id" context="%common" type="ID"/>
<attribute name="class" context="%common"/>
<attribute name="style" context="%common"/>
<attribute name="lang" context="%i18n"/>
<attribute name="dir" context="%i18n"/>
The first two assertions state that the data type for attribute values for href and src is "URI". Unless otherwise specified, the "URI" type is mapped to CDATA. The next one says that id is part of the attribute group "%common" and has a data type of "ID". The remaining assertions state that class and style are in the attribute group named "%common" and lang and dir in "%i18n".
Used to assert properties of named contexts. A context represents a group of tags or attributes. You can also define contexts hierarchically in terms of other contexts. There are several properties, all of which are optional with the exception of name. Note that context names always begin with a '%' sign, which makes them easy to distinguish from names for tags or attributes:
name
: the context name (required)
condition
: a tree expression constraining
the applicability of the assertion (optional)
tags
: a list of names of tags or tag groups
(optional)
attributes
: a list of names of attributes or
attribute groups (optional)
content
: default content model for all tags that
belong to this context (optional)
type
: default data type for all attributes that
belong to this context (optional)
default
: default value for all attributes that
belong to this context (optional)
Dependencies between contexts define an acyclic graph — it is an error for there to be a cycle (this is something that is easily detected by software).
Example:
<context name="%block" tags="%heading"/>
<context name="%heading" content="%inline"/>
<context name="%inline" content="%inline"/>
<context name="%object-content" tags="param %flow"/>
<context name="%numeric" type="NUMBER"/>
The first assertion says that all tags belonging to the group "%heading" are also in the group "%block". The next two say that headings and inline elements default to having the inline content model. The next one defines the "%object-content" group as the union of param and the "%flow" tag group. The last assertion says that all attributes in the "%numeric" group have the data type "NUMBER" (mapped to CDATA unless otherwise specified).
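The requirement that dependencies between contexts form an acyclic graph can be checked with a standard depth-first search. The sketch below is one possible way to do it. The function name `find_cycle` and the `includes` mapping (group name to the set of groups it is defined in terms of) are assumptions for this illustration.

```python
from typing import Dict, List, Optional, Set

def find_cycle(includes: Dict[str, Set[str]]) -> Optional[List[str]]:
    """Return one cycle of group names, or None if the graph is acyclic."""
    WHITE, GREY, BLACK = 0, 1, 2           # unvisited, on current path, finished
    state: Dict[str, int] = {}
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        state[node] = GREY
        path.append(node)
        for child in sorted(includes.get(node, ())):
            colour = state.get(child, WHITE)
            if colour == GREY:             # back edge: child is on the current path
                return path[path.index(child):] + [child]
            if colour == WHITE:
                cycle = visit(child)
                if cycle:
                    return cycle
        path.pop()
        state[node] = BLACK
        return None

    for node in sorted(includes):
        if state.get(node, WHITE) == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None

ok = {"%block": {"%heading"}, "%flow": {"%block", "%inline"},
      "%object-content": {"%flow"}}
bad = {"%a": {"%b"}, "%b": {"%a"}}
print(find_cycle(ok))   # None
print(find_cycle(bad))  # ['%a', '%b', '%a']
```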
Assertions make light work of generalizations. Consider:
<context name="%flow" tags="%block %inline" attributes="%common %i18n"/>
This says that the tag group "%flow" includes all tags in the "%block" and "%inline" groups. It further says that all tags in the "%flow" group have all the attributes in the groups "%common" and "%i18n".
The order of assertions makes no difference and it doesn't matter if you repeat the same information one or more times. You can also split assertions up into several simpler ones, for instance, the above example can be restated as:
<context name="%flow" tags="%inline"/>
<context name="%flow" tags="%block"/>
<context name="%flow" attributes="%common"/>
<context name="%flow" attributes="%i18n"/>
These properties make it easy to develop modules that can be combined together without having to worry about duplications or in which order the modules are imported.
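One way to see why order and repetition do not matter is to treat each context as a pair of sets and each assertion as a set union. The Python sketch below merges context assertions in this way. The function name `merge_contexts` and the wrapping `<module>` element are assumptions introduced only so that the fragment parses as a single XML document.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def merge_contexts(xml_text: str) -> dict:
    """Merge <context> assertions into {name: {"tags": set, "attributes": set}}."""
    root = ET.fromstring("<module>" + xml_text + "</module>")
    merged = defaultdict(lambda: {"tags": set(), "attributes": set()})
    for node in root.iter("context"):
        entry = merged[node.get("name")]
        for prop in ("tags", "attributes"):
            # Set union is commutative and idempotent, so order and
            # duplicates have no effect on the result.
            entry[prop].update(node.get(prop, "").split())
    return dict(merged)

combined = '<context name="%flow" tags="%block %inline" attributes="%common %i18n"/>'
split = """
<context name="%flow" attributes="%i18n"/>
<context name="%flow" tags="%inline"/>
<context name="%flow" tags="%block"/>
<context name="%flow" attributes="%common"/>
<context name="%flow" attributes="%i18n"/>
"""
print(merge_contexts(combined) == merge_contexts(split))  # True
```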
Sometimes you want to make an exception to a generalization. You can override inherited values for the following properties:
element content models
data types for attributes
default values for attributes
The content model for a given tag is established by first checking whether it has been defined explicitly.
All li elements have the content model %flow.
<tag name="li" content="%flow"/>
If not, check if the content model has been defined for the tag groups which include this tag. This continues recursively for groups which include these groups and so on.
abbr, acronym and cite all inherit the inline content model:
<context name="%inline" content="%inline"/>
<tag name="abbr" context="%inline"/>
<tag name="acronym" context="%inline"/>
<tag name="cite" context="%inline"/>
If no such value can be found, the content model is taken to be EMPTY.
br defaults to the content model EMPTY:
<tag name="br"/>
But if we include br in %inline, we need to override the inherited value by an explicit definition:
<tag name="br" context="%inline" content="EMPTY"/>
This also makes it possible to define different content models according to the context in which an element appears, something that is currently not possible with XML 1.0 document type definitions, although it is allowable for well-formed XML.
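The lookup procedure for content models (explicit definition first, then enclosing tag groups, then EMPTY) can be sketched as follows. The function name `content_model` and the two dictionaries are assumptions for this illustration. The sketch also assumes that nearer groups take precedence over more distant ones and that ties are broken alphabetically, which the text above does not specify.

```python
from typing import Dict, Set

def content_model(tag: str, content: Dict[str, str],
                  member_of: Dict[str, Set[str]]) -> str:
    """content: name -> explicitly asserted content model.
    member_of: tag or group name -> groups that include it."""
    frontier, seen = [tag], {tag}
    while frontier:
        # An explicit value at the current level wins.
        for name in frontier:
            if name in content:
                return content[name]
        # Otherwise move out to the groups that include this level.
        next_frontier = []
        for name in frontier:
            for group in sorted(member_of.get(name, ())):
                if group not in seen:
                    seen.add(group)
                    next_frontier.append(group)
        frontier = next_frontier
    return "EMPTY"                          # nothing found anywhere

content = {"li": "%flow", "%inline": "%inline"}
member_of = {"abbr": {"%inline"}}
print(content_model("li", content, member_of))    # %flow
print(content_model("abbr", content, member_of))  # %inline
print(content_model("br", content, member_of))    # EMPTY

# Once br joins %inline it inherits the inline model unless overridden.
member_of["br"] = {"%inline"}
print(content_model("br", content, member_of))    # %inline
content["br"] = "EMPTY"
print(content_model("br", content, member_of))    # EMPTY
```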
The set of attributes for a given tag is the union of the attributes associated explicitly with the tag and those associated with any tag group in which it appears (recursively through nested groups):
id, class, style, lang and dir can be used with all headings, block-level and inline elements:
<tag name="h1" context="%heading"/>
<context name="%block" tags="%heading"/>
<context name="%flow" tags="%block %inline"/>
<context name="%flow" attributes="%common %i18n"/>
<attribute name="id" context="%common"/>
<attribute name="class" context="%common"/>
<attribute name="style" context="%common"/>
<attribute name="lang" context="%i18n"/>
<attribute name="dir" context="%i18n"/>
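To make the union rule concrete, the following Python sketch computes the attributes of h1 from data that mirrors the assertions above. The function name `attributes_for` and the three dictionaries are assumptions for this illustration, and building them from the XML assertions is left out.

```python
from typing import Dict, Set

def attributes_for(tag: str, attrs_of: Dict[str, Set[str]],
                   member_of: Dict[str, Set[str]],
                   attr_groups: Dict[str, Set[str]]) -> Set[str]:
    # 1. Collect the tag and every tag group that includes it, transitively.
    names: Set[str] = set()
    stack = [tag]
    while stack:
        name = stack.pop()
        if name not in names:
            names.add(name)
            stack.extend(member_of.get(name, ()))
    # 2. Union the attribute references asserted on any of those names.
    refs: Set[str] = set()
    for name in names:
        refs |= attrs_of.get(name, set())
    # 3. Expand attribute groups (names starting with '%') into attributes.
    result: Set[str] = set()
    seen: Set[str] = set()
    stack = list(refs)
    while stack:
        ref = stack.pop()
        if ref in seen:
            continue
        seen.add(ref)
        if ref.startswith("%"):
            stack.extend(attr_groups.get(ref, ()))
        else:
            result.add(ref)
    return result

member_of = {"h1": {"%heading"}, "%heading": {"%block"},
             "%block": {"%flow"}, "%inline": {"%flow"}}
attrs_of = {"%flow": {"%common", "%i18n"}}
attr_groups = {"%common": {"id", "class", "style"}, "%i18n": {"lang", "dir"}}
print(sorted(attributes_for("h1", attrs_of, member_of, attr_groups)))
# ['class', 'dir', 'id', 'lang', 'style']
```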
The default data type for attributes is CDATA. You can override this with an assertion for an attribute or attribute group.
The attribute "id" is defined to be of type "ID" rather than of type "CDATA":
<attribute name="id" type="ID"/>
The default attribute value is defined as "#IMPLICIT", but you can easily override this as needed.
This sets the default value for the "start" attribute (used for ul and ol elements) to "1":
<attribute name="start" default="1"/>
Work is underway to develop assertions that bind attribute data types to external specifications or to lexical grammars. This will make it practical, for example, to verify that an href attribute conforms to RFC2038 and that it is a valid, i.e. unbroken, link. For images, you will be able to verify that the linked image is of a permitted image type and encoding.
I plan to expand this section, but have run out of time right now. My rudimentary understanding is that to transform modules into RDF, you would have to rewrite all assertions as binary relations, and to treat named contexts, tags and attributes etc. as RDF entities with their own URI. Needless to say the result would be much more verbose and harder to read.
Assertions use regular XML syntax, i.e. well-formed XML, and don't need the special syntax reserved for DTDs. This makes it easy to add new kinds of assertions to document profiles without being forced to go back and change the XML language itself. This flexibility is expected to be critical to commercial applications.
DTDs force you to do work that could easily be done by computer. For instance, entities must be placed in the appropriate order so that any entities they depend on have been defined beforehand. This makes it much harder to combine modular definitions when using DTDs.
Many years of experience with DTDs have shown that they lack the ability to say that this tag belongs to that context. Instead you are forced to enumerate the tags in an entity definition or explicit content model. For example, contrast:
<!ENTITY % inline "em | strong | a | img | br">
versus:
<tag name="em" context="%inline"/>
<tag name="strong" context="%inline"/>
<tag name="a" context="%inline"/>
<tag name="img" context="%inline"/>
<tag name="br" context="%inline"/>
The latter allows you to add new inline elements by importing a new module. For instance, a single import assertion can add the presentational elements defined in a new module:
<tag name="i" context="%inline"/>
<tag name="b" context="%inline"/>
<tag name="tt" context="%inline"/>
<tag name="u" context="%inline"/>
<tag name="s" context="%inline"/>
Of course, you can work around this in DTDs, for instance by including an extension entity in the content list and overriding it in a module. However, each module then has to supply further such entities for additional modules to exploit, creating a web of dependencies between the modules on these entity names.
DTDs don't support the ability to specify exceptions to inherited properties. It is possible to work around this with careful use of parameter entities, but it's tricky.
DTDs don't support rich data types. This is a major limitation on the usefulness of using DTDs to validate documents.
Finally, when using assertions, you don't have to throw away all the tools based upon validating DTDs. Such tools are likely to be with us for some time to come. The next section shows how you can automatically generate a composite DTD from the modules.
The assertions can be used to automatically generate an XML 1.0 DTD. You can obtain free Open Source software for this from HP Labs, see [DTDGEN].
DTDGEN is being used as a testbed for these ideas. Here is a sample module based upon a subset of XHTML 1.0 (note this is for explanatory purposes only).
<!-- define example module -->
<context name="%inline" content="%inline"/>
<context name="%inline" tags="#PCDATA"/>
<tag name="em" context="%inline"/>
<tag name="strong" context="%inline"/>
<tag name="a" context="%inline" attributes="name href"/>
<tag name="img" context="%inline" content="EMPTY"/>
<tag name="img" attributes="alt src"/>
<tag name="br" context="%inline" content="EMPTY"/>
<tag name="h1" context="%heading"/>
<tag name="h2" context="%heading"/>
<tag name="h3" context="%heading"/>
<context name="%heading" context="%block" content="%inline"/>
<tag name="p" context="%block" content="%inline"/>
<tag name="div" context="%block" content="li*"/>
<tag name="ul" context="%block" content="li*"/>
<tag name="ol" context="%block" content="li*"/>
<tag name="dl" context="%block" content="(dt|dd)*"/>
<tag name="li" content="%flow"/>
<tag name="dt" content="%inline"/>
<tag name="dd" content="%flow"/>
<tag name="html" content="head,body"/>
<tag name="head" content="(%head, (title, %head)?)"/>
<tag name="body" content="%flow"/>
<tag name="meta" context="%head" attributes="name content http-equiv"/>
<tag name="link" context="%head" attributes="rel rev href"/>
<context name="%flow" tags="%block %inline"/>
<context name="%flow" attributes="%common"/>
<attribute name="id" context="%common" type="ID"/>
<attribute name="class" context="%common"/>
<attribute name="src" type="URI"/>
<attribute name="href" type="URI"/>
<attribute name="name"/>
Here is the DTD it creates:
<!-- DTD automatically generated by DTDGEN <dsr@w3.org> -->
<!-- element group entities -->
<!ENTITY % block "p | div | ul | ol | dl">
<!ENTITY % inline "#PCDATA | em | strong | a | img | br">
<!ENTITY % flow "%block; | %inline;">
<!ENTITY % head "meta | link">
<!ENTITY % heading "h1 | h2 | h3">
<!-- attribute group entities -->
<!ENTITY % common.attrs " id CDATA #IMPLIED class CDATA #IMPLIED">
<!ENTITY % flow.attrs " %common.attrs;">
<!-- named data types -->
<!ENTITY % URI "CDATA">
<!-- element declarations -->
<!ELEMENT a "(%inline;)*">
<!ATTLIST a name CDATA #IMPLIED href CDATA #IMPLIED %flow.attrs;>
<!ELEMENT body "(%flow;)*">
<!ELEMENT br EMPTY>
<!ATTLIST br %flow.attrs;>
<!ELEMENT dd "(%flow;)*">
<!ELEMENT div ""li"*">
<!ATTLIST div %flow.attrs;>
<!ELEMENT dl "("dt" | "dd")*">
<!ATTLIST dl %flow.attrs;>
<!ELEMENT dt "(%inline;)*">
<!ELEMENT em "(%inline;)*">
<!ATTLIST em %flow.attrs;>
<!ELEMENT h1 "(%inline;)*">
<!ELEMENT h2 "(%inline;)*">
<!ELEMENT h3 "(%inline;)*">
<!ELEMENT head "((%head;)*, ("title", (%head;)*)?)">
<!ELEMENT html ""head", "body"">
<!ELEMENT img EMPTY>
<!ATTLIST img alt CDATA #IMPLIED src CDATA #IMPLIED %flow.attrs;>
<!ELEMENT li "(%flow;)*">
<!ELEMENT link EMPTY>
<!ATTLIST link rel CDATA #IMPLIED rev CDATA #IMPLIED href CDATA #IMPLIED>
<!ELEMENT meta EMPTY>
<!ATTLIST meta name CDATA #IMPLIED content CDATA #IMPLIED http-equiv CDATA #IMPLIED>
<!ELEMENT ol ""li"*">
<!ATTLIST ol %flow.attrs;>
<!ELEMENT p "(%inline;)*">
<!ATTLIST p %flow.attrs;>
<!ELEMENT strong "(%inline;)*">
<!ATTLIST strong %flow.attrs;>
<!ELEMENT title EMPTY>
<!ELEMENT ul ""li"*">
<!ATTLIST ul %flow.attrs;>
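To give a feel for how element group entities like those above might be derived from tag assertions, here is a minimal Python sketch. It is not DTDGEN's actual implementation. The function name `group_entities` and the wrapping `<module>` element are assumptions, and the sketch handles only the context property of tag assertions, ignoring context assertions, attributes and content models.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

def group_entities(xml_text: str) -> List[str]:
    """Emit one parameter entity per tag group named in a context property."""
    root = ET.fromstring("<module>" + xml_text + "</module>")
    groups: Dict[str, List[str]] = {}
    for tag in root.iter("tag"):
        for group in tag.get("context", "").split():
            members = groups.setdefault(group, [])
            if tag.get("name") not in members:   # repeated assertions are harmless
                members.append(tag.get("name"))
    return ['<!ENTITY % {} "{}">'.format(name.lstrip("%"), " | ".join(members))
            for name, members in sorted(groups.items())]

inline_tags = """
<tag name="em" context="%inline"/>
<tag name="strong" context="%inline"/>
<tag name="a" context="%inline"/>
<tag name="img" context="%inline"/>
<tag name="br" context="%inline"/>
"""
for line in group_entities(inline_tags):
    print(line)   # <!ENTITY % inline "em | strong | a | img | br">
```

Because the entity is computed from whatever tag assertions are present, importing a further module that adds tags to "%inline" would extend the generated entity without any change to the existing modules.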