Copyright ©1999 W3C® (MIT, INRIA, Keio), All Rights Reserved. W3C liability, trademark, document use and software licensing rules apply.
This specification describes an abstract data set containing the information available from an XML document.
The XML Core Working Group, with this 1999 December 20 Infoset Last Call working draft, invites comment on this specification. The Last Call period begins 20 December 1999 and ends 31 January 2000.
The W3C Membership and other interested parties are invited to review the specification and report implementation experience. Please send comments to www-xml-infoset-comments@w3.org (archive).
For background on this work, please see the XML Activity Statement. While we welcome implementation experience reports, the XML Core Working Group will not allow early implementation to constrain its ability to make changes to this specification prior to final release.
See XML Information Set Requirements for the specific requirements that informed development of this specification.
A list of current W3C Recommendations and other technical documents can be found at http://www.w3.org/TR.
This document specifies an abstract data set called the XML information set (Infoset), a description of the information available in a well-formed XML document [XML].
Documents that do not conform to [Namespaces], although technically well-formed XML 1.0, are not considered to have meaningful information sets. This essentially bars documents whose element or attribute names contain colons used in ways other than those prescribed by [Namespaces]. An XML document need not be valid in order to have an information set.
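As a non-normative illustration, the colon restriction can be approximated by a small check on element and attribute names. This is a rough sketch only: the function name is hypothetical, and the check covers the placement of colons, not the full set of name characters required by [Namespaces].

```python
def conforms_to_namespaces(name: str) -> bool:
    """Return True if a name uses a colon only as a single prefix separator.

    Simplified assumption: only colon placement is checked, not the
    complete name-character rules.
    """
    if name.count(":") > 1:
        return False
    prefix, sep, local = name.partition(":")
    # Without a colon the whole name lands in `prefix`; with a colon,
    # both the prefix and the local part must be non-empty.
    return bool(prefix) and (not sep or bool(local))
```

Under this sketch, names such as msg:message and message pass, while a:b:c, :a and a: are rejected.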
An XML document's information set consists of two or more information items (the information set for any well-formed XML document will contain at least the document information item and one element information item). An information item is an abstract representation of some component of an XML document: each information item has a set of associated properties, some of which are core, and some of which are peripheral.
In earlier drafts, the term "required" was used rather than "core", and the term "optional" rather than "peripheral". The editor has made this change because "required" and "optional" suggest the behavior of an application rather than the status of part of a data structure.
For any given XML document, there are a number of corresponding information sets: a unique minimal information set consisting of the core properties of the core items and nothing else, a unique maximal information set consisting of all the core and all the peripheral items with all the peripheral properties, and one for every combination of present/absent peripheral items and properties in between. The in-between information sets must be fully consistent with the maximal information set.
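The family of information sets described above can be pictured as every subset of the peripheral features, with the empty subset corresponding to the minimal information set and the full subset to the maximal one. The following non-normative sketch assumes a hypothetical, much shortened list of peripheral features purely for illustration.

```python
from itertools import combinations

# Hypothetical, shortened list of peripheral features (illustration only).
PERIPHERAL = ["comments", "CDATA markers", "entity markers"]

def information_set_variants(peripheral):
    """Enumerate every combination of present/absent peripheral features."""
    variants = []
    for size in range(len(peripheral) + 1):
        for chosen in combinations(peripheral, size):
            variants.append(frozenset(chosen))
    return variants

variants = information_set_variants(PERIPHERAL)
minimal = variants[0]    # no peripheral features at all
maximal = variants[-1]   # every peripheral feature present
```

With three peripheral features the sketch yields eight variants; each additional feature doubles that number.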
All information sets are understood to describe the XML document with all entity references already expanded; that is, represented by the information items corresponding to their replacement text. In the case that an entity reference cannot be expanded, because an XML processor has not read its declaration or its value, explicit provision is made for representing such a reference in the information set.
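As a non-normative illustration of expansion, the following sketch uses Python's xml.etree.ElementTree module (an assumption of this example, not something this specification requires) to parse a document whose content refers to an internal general entity. The application sees the replacement text, not the reference.

```python
import xml.etree.ElementTree as ET

DOC = """<!DOCTYPE note [
  <!ENTITY who "world">
]>
<note>Hello, &who;!</note>"""

root = ET.fromstring(DOC)
# The reference &who; has been replaced by the entity's replacement text.
expanded_text = root.text
```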
The XML information set does not require or favor a specific interface or class of interfaces. This specification presents the information set as a tree for the sake of clarity and simplicity, but there is no requirement that the XML information set be made available through a tree structure; other types of interfaces, including (but not limited to) event-based and query-based interfaces are also capable of providing information conforming to the information set. As long as the information in the information set is made available to XML applications in one way or another, the requirements of this document are satisfied.
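The independence from any particular interface can be illustrated, non-normatively, by reading the same element names through a tree interface and through an event interface. The sketch below assumes Python's standard xml.etree.ElementTree and xml.sax modules; the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET
import xml.sax

DOC = b"<a><b/><c><d/></c></a>"

def names_from_tree(data):
    """Walk a tree representation in document order."""
    return [element.tag for element in ET.fromstring(data).iter()]

class NameCollector(xml.sax.ContentHandler):
    """Collect element names from start-element events."""

    def __init__(self):
        super().__init__()
        self.names = []

    def startElement(self, name, attrs):
        self.names.append(name)

def names_from_events(data):
    """Read the same information from an event stream."""
    handler = NameCollector()
    xml.sax.parseString(data, handler)
    return handler.names
```

Both functions report the element names in the same order, although neither interface resembles the other.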
Note: In this document, the words "must", "should", and "may" assume the meanings specified in RFC 2119 [RFC2119], except that the words do not appear in upper case.
Note: To the best of the editors' knowledge and belief, the information set scheme described in this document satisfies the requirements of the XPointer-Information Set Liaison Statement [XPointer-Liaison].
Note: To the best of the editors' knowledge and belief, the interface specified by the Document Object Model, Level 1 Core Recommendation [DOM] conforms to the XML Information Set as currently specified.
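The XML information set can contain fifteen different types of information items: document, element, attribute, processing instruction, reference to skipped entity, character, comment, document type declaration, entity, notation, entity start marker, entity end marker, CDATA start marker, CDATA end marker, and namespace declaration.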
Every information item has properties, some of which are core and some of which are peripheral. Note that peripheral information items can, and do, have core properties. For ease of reference, each property is given a name, indicated [thus].
XML Definition: document (Section 2, Documents)
XML Syntax: [1] Document (Section 2.1, Well-Formed XML Documents)
There is always one document information item in the information set, and all other information items are related to the document information item, either directly or indirectly.
The document information item must have the following properties available in some form:
The document information item may also have the following properties available in some form:
XML Definition: element (Section 3, Logical Structures)
XML Syntax: [39] Element (Section 3, Logical Structures)
There is one element information item for each element appearing in the XML document. Exactly one of the element information items corresponds to the document element (the root of the element tree), and all other element information items are contained within the document element, either directly or indirectly.
An element information item must have the following properties available in some form:
An element information item may also have the following properties available in some form:
xmlns="", which does not declare a namespace but rather undeclares the default namespace.
XML Definition: attribute (Section 3.1, Start-Tags, End-Tags, and Empty-Element Tags)
XML Syntax: [41] Attribute (Section 3.1, Start-Tags, End-Tags, and Empty-Element Tags)
There is one attribute information item for each attribute (specified or defaulted) for each element in the document instance. Namespace declarations are represented using namespace declaration information items, not attribute information items.
Attributes declared in the DTD with a default value of #IMPLIED and not specified in the element's start tag are not represented by attribute information items.
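The difference between a defaulted attribute and an #IMPLIED attribute can be seen, non-normatively, with a parser that reads the internal DTD subset. The sketch assumes Python's xml.etree.ElementTree; the element and attribute names are hypothetical.

```python
import xml.etree.ElementTree as ET

DOC = """<!DOCTYPE item [
  <!ATTLIST item
    status CDATA "new"
    note   CDATA #IMPLIED>
]>
<item/>"""

# `status` is supplied from its declared default; `note` is #IMPLIED and
# unspecified, so nothing is reported for it.
attributes = dict(ET.fromstring(DOC).attrib)
```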
An attribute information item must have the following properties available in some form:
In addition, for each attribute information item, the following property may be available in some form:
XML Definition: processing instruction (Section 2.6, Processing Instructions)
XML Syntax: [16] PI (Section 2.6, Processing Instructions)
There is one processing instruction information item for every processing instruction in the document. The XML declaration and text declarations for external parsed entities are not considered processing instructions.
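As a non-normative illustration, the sketch below collects processing instruction targets from a small document using Python's xml.dom.minidom (an assumption of this example). The XML declaration at the start of the document does not appear among them.

```python
from xml.dom import minidom, Node

DOC = (b'<?xml version="1.0"?>'
       b'<?xml-stylesheet href="a.css"?>'
       b'<doc><?app do-it?></doc>')

def pi_targets(node):
    """Return the targets of all processing instructions under `node`."""
    targets = []
    for child in node.childNodes:
        if child.nodeType == Node.PROCESSING_INSTRUCTION_NODE:
            targets.append(child.target)
        targets.extend(pi_targets(child))
    return targets

targets = pi_targets(minidom.parseString(DOC))
```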
A processing instruction information item must have the following properties available in some form:
A processing instruction information item may also have the following properties available in some form:
XML Definition: Section 4.4.3, Included If Validating
There is one reference to skipped entity information item for each reference to an entity not included by a non-validating XML processor because the XML processor does not include external parsed entities.
A validating XML processor will never generate reference to skipped entity information items for a valid XML document.
A reference to skipped entity information item must have the following information available in some form:
A reference to skipped entity information item may also have the following properties available in some form:
XML Definition: characters (Section 2.2, Characters)
XML Syntax: [2] Char (Section 2.2, Characters)
There is one character information item for each non-markup character that appears within the document element, either literally, as a character reference, or within a CDATA section. There is also one character information item for each character that appears in a normalized attribute value.
Note, however, that a CR (#xD) character that is followed by an LF (#xA) character is not represented by any information item. Furthermore, a CR character that is not followed by an LF character is treated as an LF character. These rules do not apply to CR characters created by character references such as &#xD; or &#13;.
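The line-end rules can be observed, non-normatively, with a conforming parser. The sketch below assumes Python's xml.etree.ElementTree and feeds it a literal CR LF pair, a lone literal CR, and a CR produced by a character reference.

```python
import xml.etree.ElementTree as ET

# Literal CR LF, then a lone literal CR, then a CR written as &#xD;.
DOC = b"<t>a\r\nb\rc&#xD;d</t>"

text = ET.fromstring(DOC).text
# Literal CR LF and lone CR both surface as a single LF;
# the CR created by the character reference survives as CR.
```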
Each character is a logically-separate information item, but XML applications are free to chunk characters into larger groups as necessary or desirable.
A character information item must have the following properties available in some form:
A character information item may also have the following properties available in some form:
XML Definition: comment (Section 2.5, Comments)
XML Syntax: [15] Comment (Section 2.5, Comments)
The peripheral comment information item corresponds to a single XML comment in the original document.
If a comment information item is included, the following properties must be available:
XML Definition: document type declaration (Section 2.8, Prolog and Document Type Declaration)
XML Syntax: [28] doctypedecl (Section 2.8, Prolog and Document Type Declaration)
If the XML document has a document type declaration, then the information set may contain a single document type declaration information item. Note that although entities and notations are logically part of the document type declaration, they are provided as properties of the document information item, because XML processors must provide information on them.
A document type declaration information item may have the following properties available in some form:
XML Definition: entity (Section 4, Physical Structures)
XML Syntax: [70] EntityDecl (Section 4.2, Entity Declarations)
Entity information items are peripheral, except for information items representing unparsed external entities, which are core information items.
There is at most one entity information item for each general entity, internal or external, declared in the DTD: when the same entity is declared more than once, only the first declaration is used. Parameter entities are not represented by entity information items. There is also at most one entity information item for the document entity, and at most one for the DTD external subset (if there is one). It is perfectly all right for an XML processor to report some entities and not others.
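The rule that only the first declaration of an entity is used can be illustrated, non-normatively, as follows. The sketch assumes Python's xml.etree.ElementTree; the entity name is hypothetical.

```python
import xml.etree.ElementTree as ET

DOC = """<!DOCTYPE d [
  <!ENTITY e "first">
  <!ENTITY e "second">
]>
<d>&e;</d>"""

# The second declaration of `e` is ignored, so the reference expands
# to the replacement text of the first one.
text = ET.fromstring(DOC).text
```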
The entity information item, if included, must have the following information available in some form:
An entity information item may also have the following information available in some form:
XML Definition: notation (Section 4.7, Notation Declarations)
XML Syntax: [82] NotationDecl (Section 4.7, Notation Declarations)
There is one notation information item for each notation declared in the DTD.
A notation information item must have the following properties available:
XML Definition: entity reference (Section 4.1, Character and Entity References)
XML Syntax: [68] EntityRef (Section 4.1, Character and Entity References)
Entity start marker information items are a peripheral part of the information set. They are inserted to mark the place where text included from a general entity (as a consequence of an entity reference) begins. They appear as children of an element or attribute information item.
Entity start marker information items are not used in connection with parameter entity references in the DTD.
An entity start marker information item, if present, must have the following properties available in some form:
XML Definition: entity reference (Section 4.1, Character and Entity References)
XML Syntax: [68] EntityRef (Section 4.1, Character and Entity References)
Entity end marker information items are a peripheral part of the information set. They are inserted to mark the place where text included from a general entity (as a consequence of an entity reference) concludes. They appear as children of an element or attribute information item.
Entity end marker information items are not used in connection with parameter entity references in the DTD.
An entity end marker information item, if present, must have the following properties available in some form:
XML Definition: CDATA sections (Section 2.7, CDATA Sections)
XML Syntax: [18] CDSect (Section 2.7, CDATA Sections)
CDATA start marker information items are a peripheral part of the information set. They are inserted to mark the place where text embedded in a CDATA section begins. They appear as children of an element information item.
CDATA start marker information items have no properties.
XML Definition: CDATA sections (Section 2.7, CDATA Sections)
XML Syntax: [18] CDSect (Section 2.7, CDATA Sections)
CDATA end marker information items are a peripheral part of the information set. They are inserted to mark the place where text embedded in a CDATA section concludes. They appear as children of an element information item.
CDATA end marker information items have no properties.
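Because CDATA markers are peripheral, two processors may legitimately differ in whether the section boundaries are visible. As a non-normative illustration, the sketch below compares two Python standard library interfaces (an assumption of this example): xml.etree.ElementTree reports only the characters, while xml.dom.minidom preserves the boundary by using a distinct node type.

```python
import xml.etree.ElementTree as ET
from xml.dom import minidom, Node

DOC = b"<d>a<![CDATA[<b>]]>c</d>"

# ElementTree: the characters are present, the CDATA boundaries are not.
flat_text = ET.fromstring(DOC).text

# minidom: the section is kept apart from the surrounding text.
children = minidom.parseString(DOC).documentElement.childNodes
node_types = [child.nodeType for child in children]
has_cdata_boundary = Node.CDATA_SECTION_NODE in node_types
```

In both cases the character information is identical; only the peripheral boundary information differs.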
XML Definition: attribute (Section 3.1, Start-Tags, End-Tags, and Empty-Element Tags)
XML Syntax: [41] Attribute (Section 3.1, Start-Tags, End-Tags, and Empty-Element Tags)
There is one namespace declaration information item for each namespace declaration (specified or defaulted) for each element in the document instance. Namespace declarations are syntactically like attribute declarations of attributes whose names begin with the string xmlns.
Namespace declarations declared in the DTD with a default value of #IMPLIED and not specified in the element's start tag are not represented by information items.
Note that the last two properties present the same underlying information in overlapping ways. XML processors may report either one or both, but must report at least one.
A namespace declaration information item must have the following properties available in some form:
xmlns: prefix. If the attribute name is simply xmlns, this property is a null string.
Consider the following example XML document:
<?xml version="1.0"?>
<msg:message dc:date="19990421"
             xmlns:dc="http://purl.org/metadata/dublin_core#"
             xmlns:msg="http://www.message.net/"
>Phone home!</msg:message>
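As a non-normative illustration, the example document can be read with a namespace-aware parser to recover the namespace URI and local part of each name. The sketch assumes Python's xml.etree.ElementTree, which encodes qualified names as {uri}local; the split_name helper is hypothetical.

```python
import xml.etree.ElementTree as ET

DOC = """<?xml version="1.0"?>
<msg:message dc:date="19990421"
    xmlns:dc="http://purl.org/metadata/dublin_core#"
    xmlns:msg="http://www.message.net/"
>Phone home!</msg:message>"""

def split_name(qualified):
    """Split ElementTree's '{uri}local' form into (uri, local)."""
    if qualified.startswith("{"):
        uri, _, local = qualified[1:].partition("}")
        return uri, local
    return None, qualified

root = ET.fromstring(DOC)
element_name = split_name(root.tag)
# Namespace declarations are not reported among the attributes.
attribute_names = [split_name(key) for key in root.attrib]
characters = list(root.text)
```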
The Information Set for this XML document will contain at least the following items in some form:
"http://www.message.net/" and the local part "message".
"date".
An XML processor conforms to the XML Information Set if it provides all the core information items and all their core properties corresponding to that part of the document that the processor has actually read. For instance, attributes are core information items; therefore, an XML processor that does not report the existence of attributes, as well as their names and values (which are core properties of attributes), does not conform to the XML Information Set.
Some information items are peripheral, and some core information items have peripheral information associated with them. If an XML processor reports an information item, then it must supply at least the core properties defined by the XML Information Set in order to conform. For instance, if an XML processor chooses to supply entity information items, which are peripheral, then it is also required to supply names for the entities, since the XML Information Set specifies that the name of an entity information item is a core property. However, since entity information items are peripheral, an XML processor which does not supply them at all also conforms to the XML Information Set.
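The conformance rule for reported items can be sketched, non-normatively, as a set difference between the core properties of an item type and the properties a processor actually supplies. The table below is deliberately incomplete and uses simplified, hypothetical property names; it covers only the two cases mentioned in the text above.

```python
# Incomplete, illustrative table: only the cases named in the prose.
CORE_PROPERTIES = {
    "attribute": {"name", "value"},  # core item with core name and value
    "entity": {"name"},              # peripheral item whose name is core
}

def missing_core_properties(item_type, reported):
    """Return the core properties that a reported item fails to supply."""
    return CORE_PROPERTIES.get(item_type, set()) - set(reported)
```

A processor that reports an entity item without a name would yield {"name"} here and so would not conform, whereas a processor that reports no entity items at all never reaches this check.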
The XML 1.0 Recommendation [XML] explicitly allows non-validating XML processors to omit parsing the external DTD subset and external entities (both parsed general entities and parameter entities). As a result, it is possible that a non-validating XML processor will omit reading attribute and entity declarations or actual markup that will affect the quantity and quality of information included in the information set. Validating XML processors must report all core information; non-validating XML processors may omit core information that appears outside of the top-level document entity (either in the external DTD subset or in an external text entity) if they do not read the other entities.
XML processors may provide additional information not found in the XML Information Set. For instance, the XML Information Set excludes whitespace that occurs between attributes, but an XML processor that provides this information still conforms, as long as it also provides the information that is required for conformance to the XML Information Set.
The following information is not represented in the current version of the XML Information Set:
<foo/> and <foo></foo>.
Furthermore, the XML Infoset does not provide any method of assigning a single series of numbers to all child nodes of an element or of the document that is guaranteed to be reliable regardless of the underlying XML processor. Although such a method would be desirable, it is considered unachievable for XML, due to the difficulties produced by references to skipped entities, non-validating processors, and peripheral information items.
In other words, there is no reliable way to specify something like "the second child of this element" without restricting both the type of XML processor and the types of children being counted.
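The difficulty can be illustrated, non-normatively, with two Python standard library interfaces (an assumption of this example) that differ in whether they report comments, which are peripheral. The "second child" of the same element is a different item in each.

```python
import xml.etree.ElementTree as ET
from xml.dom import minidom

DOC = b"<list><!-- note --><a/><b/></list>"

# minidom reports the comment, so the children are: comment, a, b.
dom_children = minidom.parseString(DOC).documentElement.childNodes
dom_second = dom_children[1].nodeName

# ElementTree omits comments by default, so the children are: a, b.
et_second = list(ET.fromstring(DOC))[1].tag
```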
[DOM] http://www.w3.org/TR/REC-DOM-Level-1/
[Namespaces] http://www.w3.org/TR/REC-xml-names
[RFC2119] http://www.isi.edu/in-notes/rfc2119.txt
[RFC2396] http://www.isi.edu/in-notes/rfc2396.txt
[XML] http://www.w3.org/TR/REC-xml
[XPointer-Liaison] http://www.w3.org/TR/NOTE-xptr-infoset-liaison
Although the XML 1.0 Recommendation [XML] is primarily concerned with XML syntax, it also includes some specific reporting requirements for XML processors.
The reporting requirements include errors, which are outside the scope of this specification, and document information; all of the XML 1.0 requirements for document information reporting have been integrated into the XML information set specification (numbers in parentheses refer to sections of the Recommendation):
The following RDF Schema provides a formal characterization of the Infoset. In case of disagreement between this schema and the prose in this document, the prose should be taken as normative.
<?xml version='1.0' standalone='yes'?>
<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'
         xmlns:rdfs='http://www.w3.org/TR/1999/PR-rdf-schema-19990303#'
         xmlns='http://www.w3.org/1999/WD-infoset-19991201#'>

  <!--Enumeration classes and their members-->
  <rdfs:Class id='AttrType'/>
  <AttrType id='AttrType.ID'/>
  <AttrType id='AttrType.IDREF'/>
  <AttrType id='AttrType.IDREFS'/>
  <AttrType id='AttrType.ENTITY'/>
  <AttrType id='AttrType.ENTITIES'/>
  <AttrType id='AttrType.NMTOKEN'/>
  <AttrType id='AttrType.NMTOKENS'/>
  <AttrType id='AttrType.NOTATION'/>
  <AttrType id='AttrType.CDATA'/>
  <AttrType id='AttrType.ENUMERATED'/>
  <rdfs:Class id='Boolean'/>
  <Boolean id='Boolean.true'/>
  <Boolean id='Boolean.false'/>
  <rdfs:Class id='EntityType'/>
  <EntityType id='EntityType.InternalGeneral'/>
  <EntityType id='EntityType.ExternalGeneral'/>
  <EntityType id='EntityType.Unparsed'/>
  <EntityType id='EntityType.DocumentEntity'/>
  <EntityType id='EntityType.ExternalDTDSubset'/>
  <rdfs:Class id='Integer' rdfs:subClassOf='http://www.w3.org/TR/1999/PR-rdf-schema-19990303#Literal'/>
  <rdfs:Class id='StandaloneType'/>
  <StandaloneType id='StandaloneType.yes'/>
  <StandaloneType id='StandaloneType.no'/>
  <StandaloneType id='StandaloneType.notSpecified'/>

  <!--Info item classes in document order-->
  <rdfs:Class id='InfoItem'/>
  <rdfs:Class id='Document' rdfs:subClassOf='#InfoItem'/>
  <rdfs:Class id='Element' rdfs:subClassOf='#InfoItem'/>
  <rdfs:Class id='Attribute' rdfs:subClassOf='#InfoItem'/>
  <rdfs:Class id='ProcessingInstruction' rdfs:subClassOf='#InfoItem'/>
  <rdfs:Class id='Character' rdfs:subClassOf='#InfoItem'/>
  <rdfs:Class id='ReferenceToSkippedEntity' rdfs:subClassOf='#InfoItem'/>
  <rdfs:Class id='Comment' rdfs:subClassOf='#InfoItem'/>
  <rdfs:Class id='DocumentTypeDeclaration' rdfs:subClassOf='#InfoItem'/>
  <rdfs:Class id='Entity' rdfs:subClassOf='#InfoItem'/>
  <rdfs:Class id='Notation' rdfs:subClassOf='#InfoItem'/>
  <rdfs:Class id='EntityStartMarker' rdfs:subClassOf='#InfoItem'/>
  <rdfs:Class id='EntityEndMarker' rdfs:subClassOf='#InfoItem'/>
  <rdfs:Class id='CDATAStartMarker' rdfs:subClassOf='#InfoItem'/>
  <rdfs:Class id='CDATAEndMarker' rdfs:subClassOf='#InfoItem'/>
  <rdfs:Class id='Namespace' rdfs:subClassOf='#InfoItem'/>

  <!--Set containers-->
  <rdfs:Class id='InfoItemSet' rdfs:subClassOf='http://www.w3.org/1999/02/22-rdf-syntax-ns#Bag'/>
  <rdfs:Class id='AttributeSet' rdfs:subClassOf='#InfoItemSet'/>
  <rdfs:Class id='EntitySet' rdfs:subClassOf='#InfoItemSet'/>
  <rdfs:Class id='NamespaceSet' rdfs:subClassOf='#InfoItemSet'/>
  <rdfs:Class id='NotationSet' rdfs:subClassOf='#InfoItemSet'/>

  <!--Sequence container-->
  <rdfs:Class id='InfoItemSeq' rdfs:subClassOf='http://www.w3.org/1999/02/22-rdf-syntax-ns#Seq'/>

  <!--Info item properties-->
  <rdfs:Property id='attributes'>
    <rdfs:domain resource='#Element'/>
    <rdfs:range resource='#AttributeSet'/>
  </rdfs:Property>
  <rdfs:Property id='attributeType'>
    <rdfs:domain resource='#Attribute'/>
    <rdfs:range resource='#AttrType'/>
  </rdfs:Property>
  <rdfs:Property id='baseURI'>
    <rdfs:domain resource='#Document'/>
    <rdfs:domain resource='#Element'/>
    <rdfs:domain resource='#ProcessingInstruction'/>
    <rdfs:domain resource='#Entity'/>
    <rdfs:domain resource='#Notation'/>
    <rdfs:range resource='http://www.w3.org/TR/1999/PR-rdf-schema-19990303#Literal'/>
  </rdfs:Property>
  <rdfs:Property id='characterCode'>
    <rdfs:domain resource='#Character'/>
    <rdfs:range resource='#Integer'/>
  </rdfs:Property>
  <rdfs:Property id='charset'>
    <rdfs:domain resource='#Entity'/>
    <rdfs:range resource='http://www.w3.org/TR/1999/PR-rdf-schema-19990303#Literal'/>
  </rdfs:Property>
  <rdfs:Property id='children'>
    <rdfs:domain resource='#Document'/>
    <rdfs:domain resource='#Element'/>
    <rdfs:domain resource='#Attribute'/>
    <rdfs:domain resource='#DocumentTypeDeclaration'/>
    <rdfs:domain resource='#Namespace'/>
    <rdfs:range resource='#InfoItemSeq'/>
  </rdfs:Property>
  <rdfs:Property id='content'>
    <rdfs:domain resource='#ProcessingInstruction'/>
    <rdfs:domain resource='#Comment'/>
    <rdfs:domain resource='#Entity'/>
    <rdfs:range resource='http://www.w3.org/TR/1999/PR-rdf-schema-19990303#Literal'/>
  </rdfs:Property>
  <rdfs:Property id='declaredNamespaces'>
    <rdfs:domain resource='#Element'/>
    <rdfs:range resource='#NamespaceSet'/>
  </rdfs:Property>
  <rdfs:Property id='default'>
    <rdfs:domain resource='#Attribute'/>
    <rdfs:range resource='#Boolean'/>
  </rdfs:Property>
  <rdfs:Property id='elementContentWhitespace'>
    <rdfs:domain resource='#Character'/>
    <rdfs:range resource='#Boolean'/>
  </rdfs:Property>
  <rdfs:Property id='entity'>
    <rdfs:domain resource='#EntityStartMarker'/>
    <rdfs:domain resource='#EntityEndMarker'/>
    <rdfs:range resource='#Entity'/>
  </rdfs:Property>
  <rdfs:Property id='entities'>
    <rdfs:domain resource='#Document'/>
    <rdfs:range resource='#EntitySet'/>
  </rdfs:Property>
  <rdfs:Property id='entityType'>
    <rdfs:domain resource='#Attribute'/>
    <rdfs:range resource='#AttrType'/>
  </rdfs:Property>
  <rdfs:Property id='externalDTD'>
    <rdfs:domain resource='#DocumentTypeDeclaration'/>
    <rdfs:range resource='#Entity'/>
  </rdfs:Property>
  <rdfs:Property id='inScopeNamespaces'>
    <rdfs:domain resource='#Element'/>
    <rdfs:range resource='#NamespaceSet'/>
  </rdfs:Property>
  <rdfs:Property id='localName'>
    <rdfs:domain resource='#Element'/>
    <rdfs:domain resource='#Attribute'/>
    <rdfs:range resource='http://www.w3.org/TR/1999/PR-rdf-schema-19990303#Literal'/>
  </rdfs:Property>
  <rdfs:Property id='name'>
    <rdfs:domain resource='#ReferenceToSkippedEntity'/>
    <rdfs:domain resource='#Entity'/>
    <rdfs:domain resource='#Notation'/>
    <rdfs:range resource='http://www.w3.org/TR/1999/PR-rdf-schema-19990303#Literal'/>
  </rdfs:Property>
  <rdfs:Property id='namespaceURI'>
    <rdfs:domain resource='#Element'/>
    <rdfs:domain resource='#Attribute'/>
    <rdfs:domain resource='#Namespace'/>
    <rdfs:range resource='http://www.w3.org/TR/1999/PR-rdf-schema-19990303#Literal'/>
  </rdfs:Property>
  <rdfs:Property id='notation'>
    <rdfs:domain resource='#Entity'/>
    <rdfs:range resource='#Notation'/>
  </rdfs:Property>
  <rdfs:Property id='notations'>
    <rdfs:domain resource='#Document'/>
    <rdfs:range resource='#NotationSet'/>
  </rdfs:Property>
  <rdfs:Property id='predefinedEntity'>
    <rdfs:domain resource='#Character'/>
    <rdfs:range resource='#Boolean'/>
  </rdfs:Property>
  <rdfs:Property id='prefix'>
    <rdfs:domain resource='#Namespace'/>
    <rdfs:range resource='http://www.w3.org/TR/1999/PR-rdf-schema-19990303#Literal'/>
  </rdfs:Property>
  <rdfs:Property id='publicIdentifier'>
    <rdfs:domain resource='#Entity'/>
    <rdfs:domain resource='#Notation'/>
    <rdfs:range resource='http://www.w3.org/TR/1999/PR-rdf-schema-19990303#Literal'/>
  </rdfs:Property>
  <rdfs:Property id='referent'>
    <rdfs:domain resource='#ReferenceToSkippedEntity'/>
    <rdfs:range resource='#Entity'/>
  </rdfs:Property>
  <rdfs:Property id='specified'>
    <rdfs:domain resource='#Attribute'/>
    <rdfs:range resource='#Boolean'/>
  </rdfs:Property>
  <rdfs:Property id='standalone'>
    <rdfs:domain resource='#Entity'/>
    <rdfs:range resource='#StandaloneType'/>
  </rdfs:Property>
  <rdfs:Property id='systemIdentifier'>
    <rdfs:domain resource='#Entity'/>
    <rdfs:domain resource='#Notation'/>
    <rdfs:range resource='http://www.w3.org/TR/1999/PR-rdf-schema-19990303#Literal'/>
  </rdfs:Property>
  <rdfs:Property id='target'>
    <rdfs:domain resource='#ProcessingInstruction'/>
    <rdfs:range resource='http://www.w3.org/TR/1999/PR-rdf-schema-19990303#Literal'/>
  </rdfs:Property>
</rdf:RDF>