DOM Parsing and Serialization

Abstract

This specification defines various APIs for programmatic access to HTML and generic XML parsers by web applications for use in parsing and serializing DOM nodes.

1. Conformance

As well as sections marked as non-normative, all authoring guidelines, diagrams, examples, and notes in this specification are non-normative. Everything else in this specification is normative.

The key words must, must not, required, should, should not, recommended, may, and optional in this specification are to be interpreted as described in [RFC2119].

Requirements phrased in the imperative as part of algorithms (such as "strip any leading space characters" or "return false and terminate these steps") are to be interpreted with the meaning of the key word ("must", "should", "may", etc.) used in introducing the algorithm.

Conformance requirements phrased as algorithms or specific steps may be implemented in any manner, so long as the end result is equivalent. (In particular, the algorithms defined in this specification are intended to be easy to follow, and not intended to be performant.)

User agents may impose implementation-specific limits on otherwise unconstrained inputs, e.g. to prevent denial of service attacks, to guard against running out of memory, or to work around platform-specific limitations.

When a method or an attribute is said to call another method or attribute, the user agent must invoke its internal API for that attribute or method so that e.g. the author can't change the behavior by overriding attributes or methods with custom properties or functions in ECMAScript.

Unless otherwise stated, string comparisons are done in a case-sensitive manner.

If an algorithm calls into another algorithm, any exception that is thrown by the latter (unless it is explicitly caught), must cause the former to terminate, and the exception to be propagated up to its caller.

1.1 Dependencies

The IDL fragments in this specification must be interpreted as required for conforming IDL fragments, as described in the Web IDL specification. [WEBIDL]

Some of the terms used in this specification are defined in [DOM4], [HTML5], and [XML10].

1.2 Extensibility

Vendor-specific proprietary extensions to this specification are strongly discouraged. Authors must not use such extensions, as doing so reduces interoperability and fragments the user base, allowing only users of specific user agents to access the content in question.

If vendor-specific extensions are needed, the members should be prefixed by vendor-specific strings to prevent clashes with future versions of this specification. Extensions must be defined so that the use of extensions neither contradicts nor causes the non-conformance of functionality defined in the specification.

When vendor-neutral extensions to this specification are needed, either this specification can be updated accordingly, or an extension specification can be written that overrides the requirements in this specification. When someone applying this specification to their activities decides that they will recognise the requirements of such an extension specification, it becomes an applicable specification for the purposes of conformance requirements in this specification.

3. Parsing and serializing Nodes

3.1 Parsing

The following steps form the fragment parsing algorithm, whose arguments are a markup string and a context element.

1. If the context element's node document is an HTML document, let algorithm be the HTML fragment parsing algorithm. If it is an XML document, let algorithm be the XML fragment parsing algorithm.
2. Invoke algorithm with markup as the input, and context element as the context element.
3. Let new children be the nodes returned.
4. Let fragment be a new DocumentFragment whose node document is context element's node document.
5. Append each node in new children to fragment (in order).
  Note
  This ensures the node document for the new nodes is correct.
6. Return fragment.
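
This example is non-normative. The fragment parsing algorithm can be modelled outside a user agent, as in the Python sketch below. The `Document`, `Node`, and `DocumentFragment` classes and the two stand-in parsers are hypothetical names introduced only for illustration; the real parsing algorithms are those referenced from [HTML5] and [XML10].

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Document:
    is_html: bool  # True for an HTML document, False for an XML document


@dataclass
class Node:
    name: str
    node_document: Optional[Document] = None


@dataclass
class DocumentFragment:
    node_document: Document
    children: List[Node] = field(default_factory=list)


# A parser takes (markup, context element) and returns the parsed nodes.
Parser = Callable[[str, Node], List[Node]]


def parse_fragment(markup: str, context: Node,
                   html_parser: Parser, xml_parser: Parser) -> DocumentFragment:
    # Pick the algorithm from the context element's node document.
    algorithm = html_parser if context.node_document.is_html else xml_parser
    # Invoke it and collect the returned nodes.
    new_children = algorithm(markup, context)
    # The fragment belongs to the context element's node document.
    fragment = DocumentFragment(node_document=context.node_document)
    # Appending gives each new node the fragment's node document.
    for child in new_children:
        child.node_document = fragment.node_document
        fragment.children.append(child)
    return fragment


# Stand-in parsers (hypothetical): each returns one node naming the parser used.
def fake_html_parser(markup: str, context: Node) -> List[Node]:
    return [Node("html-parsed:" + markup)]


def fake_xml_parser(markup: str, context: Node) -> List[Node]:
    return [Node("xml-parsed:" + markup)]
```

Calling `parse_fragment` with a context node whose document has `is_html=True` routes the markup to the HTML stand-in; any other document routes it to the XML stand-in.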

3.2 Serializing

To serialize a Node node, the user agent must run the following steps:

Let document be node's node document.
If document is an HTML document, return an HTML serialization of node.
Otherwise, document is an XML document. Return an XML serialization of node.

To produce an HTML serialization of a Node node, the user agent must run the appropriate steps, depending on node's interface:

Element
Document
DocumentFragment
  Run the HTML fragment serialization algorithm on node. Return the string this produced.
Comment
Text
DocumentType
ProcessingInstruction
  Issue 2
  Define how these are serialized.

To produce an XML serialization of a Node node, the user agent must run the appropriate steps, depending on node's interface:

Element

Return the concatenation of the following strings:

"<" (U+003C LESS-THAN SIGN);
the value of node's tagName attribute;
Issue 3
escaping / throwing
the XML serialization of node's attributes;
">" (U+003E GREATER-THAN SIGN);
the serialization of node's children, in order;
"</" (U+003C LESS-THAN SIGN, U+002F SOLIDUS);
the value of node's tagName attribute;
">" (U+003E GREATER-THAN SIGN).

Document

Run the XML fragment serialization algorithm on node. Return the string this produced.

Comment

Let markup be the concatenation of "<!--", node's data, and "-->".

If markup matches the Comment production, return markup. Otherwise, throw a DOMException with name InvalidStateError.

Text

Let data be node's data.

If node has its serialize as CDATA flag set, run the following steps:

If data doesn't match the CData production, throw a DOMException with name InvalidStateError and terminate the entire algorithm.
Let markup be the concatenation of "<![CDATA[", data, and "]]>".
Return markup.

Otherwise, return data.

DocumentFragment

Let markup be the empty string.

For each child of node, in order, produce an XML serialization of the child and concatenate the result to markup.

Return markup.

DocumentType

ProcessingInstruction

Issue 4

TODO

The XML serialization of the attributes of an element element is the result of the following algorithm:

Let result be the empty string.
For each attribute attr in element's attributes, in order, append the following strings to result:
1. " " (U+0020 SPACE);
2. attr's name;
  Issue 5
  escaping / throwing
3. "="" (U+003D EQUALS SIGN, U+0022 QUOTATION MARK);
4. attr's value;
  Issue 6
  escaping / throwing
5. """ (U+0022 QUOTATION MARK).
Return result.
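
This example is non-normative. The XML serialization steps for Element, Comment, Text, and DocumentFragment nodes, together with the attribute algorithm above, can be sketched in Python as follows. The classes are hypothetical models, not DOM interfaces. The checks against the Comment and CData productions are simplified to the delimiter rules only (the Char range is not validated), and no escaping is performed because Issues 3, 5 and 6 leave that behaviour open.

```python
from dataclasses import dataclass, field
from typing import List, Tuple, Union


class InvalidStateError(Exception):
    """Stands in for a DOMException with name InvalidStateError."""


@dataclass
class Text:
    data: str
    serialize_as_cdata: bool = False


@dataclass
class Comment:
    data: str


@dataclass
class Element:
    tag_name: str
    attributes: List[Tuple[str, str]] = field(default_factory=list)
    children: List["XmlNode"] = field(default_factory=list)


@dataclass
class DocumentFragment:
    children: List["XmlNode"] = field(default_factory=list)


XmlNode = Union[Text, Comment, Element, DocumentFragment]


def serialize_attributes(element: Element) -> str:
    result = ""
    for name, value in element.attributes:
        # Space, name, equals sign and quote, value, closing quote.
        result += " " + name + '="' + value + '"'
    return result


def serialize_xml(node: XmlNode) -> str:
    if isinstance(node, Element):
        inner = "".join(serialize_xml(child) for child in node.children)
        return ("<" + node.tag_name + serialize_attributes(node) + ">"
                + inner + "</" + node.tag_name + ">")
    if isinstance(node, Comment):
        # Simplified Comment production check: no "--" and no trailing "-".
        if "--" in node.data or node.data.endswith("-"):
            raise InvalidStateError("comment data cannot be serialized")
        return "<!--" + node.data + "-->"
    if isinstance(node, Text):
        if node.serialize_as_cdata:
            # Simplified CData production check: "]]>" may not appear.
            if "]]>" in node.data:
                raise InvalidStateError("text cannot be serialized as CDATA")
            return "<![CDATA[" + node.data + "]]>"
        return node.data
    if isinstance(node, DocumentFragment):
        return "".join(serialize_xml(child) for child in node.children)
    # DocumentType and ProcessingInstruction are not yet defined (Issue 4).
    raise NotImplementedError("serialization not defined for this node type")
```

Note that, following the steps as written, an element without children serializes with an explicit end tag (for example `<a></a>`) rather than as an empty-element tag.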

4. The `DOMParser` interface

enum SupportedType {
    "text/html",
    "text/xml",
    "application/xml",
    "application/xhtml+xml",
    "image/svg+xml"
};

The DOMParser() constructor must return a new DOMParser object.

[Constructor]
interface DOMParser {
    Document parseFromString (DOMString str, SupportedType type);
};

4.1 Methods

parseFromString

The parseFromString(str, type) method must run these steps, depending on type:

"text/html"

Parse str with an HTML parser, and return the newly created document.

The scripting flag must be set to "disabled".

Note

meta elements are not taken into account for the encoding used, as a Unicode stream is passed into the parser.

Note

script elements get marked unexecutable and the contents of noscript get parsed as markup.

"text/xml"

"application/xml"

"application/xhtml+xml"

"image/svg+xml"

Parse str with a namespace-enabled XML parser.
If the previous step didn't return an error, return the newly created document and terminate these steps.
Let document be a newly-created XMLDocument.
Let root be a new Element, with its local name set to "parsererror" and its namespace set to "http://www.mozilla.org/newlayout/xml/parsererror.xml".
At this point user agents may append nodes to root, for example to describe the nature of the error.
Append root to document.
Return document.
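
This example is non-normative. The error-handling steps above can be approximated outside a user agent. The sketch below uses Python's standard `xml.dom.minidom` module as a stand-in for the namespace-enabled XML parser; it is not the `DOMParser` interface, and the function name `parse_xml_from_string` is an assumption made for illustration.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

PARSERERROR_NS = "http://www.mozilla.org/newlayout/xml/parsererror.xml"


def parse_xml_from_string(source: str) -> minidom.Document:
    try:
        # A well-formed input yields its document directly.
        return minidom.parseString(source)
    except ExpatError as error:
        # On failure, build a fresh document whose root is <parsererror>
        # in the namespace named by the steps above.
        implementation = minidom.getDOMImplementation()
        document = implementation.createDocument(PARSERERROR_NS, "parsererror", None)
        root = document.documentElement
        # Optional: describe the nature of the error inside the root element.
        root.appendChild(document.createTextNode(str(error)))
        return document
```

Callers can therefore detect a failed parse by inspecting the root element of the returned document instead of catching an exception.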

In any case, the returned document's content type must be the type argument.

Issue 7

It is currently unclear what the URL of the returned document should be.

Results for a test case:

Property	Gecko	Opera	Chrome
document.location	null
document.URL	unsupported	unsupported	""
document.documentURI	Page URL	null	null

Anne van Kesteren suggests using the default, about:blank.

Note

The returned document's encoding is the default, UTF-8.

Parameter	Type	Nullable	Optional	Description
str	`DOMString`	✘	✘
type	`SupportedType`	✘	✘

Return type: Document


6. Extensions to the `Element` interface

enum insertAdjacentHTMLPosition {
    "beforebegin",
    "afterbegin",
    "beforeend",
    "afterend"
};

partial interface Element {
             attribute DOMString innerHTML;
             attribute DOMString outerHTML;
    void insertAdjacentHTML (insertAdjacentHTMLPosition position, DOMString text);
};

6.1 Attributes

innerHTML of type DOMString

The innerHTML IDL attribute represents the markup of the Element's contents.

element . innerHTML [ = value ]

Returns a fragment of HTML or XML that represents the element's contents.

Can be set, to replace the contents of the element with nodes parsed from the given string.

In the case of an XML document, will throw a DOMException with name InvalidStateError if the Element cannot be serialized to XML, and a DOMException with name SyntaxError if the given string is not well-formed.

On getting, if the context object's node document is an HTML document, then the attribute must return the result of running the HTML fragment serialization algorithm on the context object; otherwise, the context object's node document is an XML document, and the attribute must return the result of running the XML fragment serialization algorithm on the context object instead (this might throw an exception instead of returning a string).

On setting, these steps must be run:

Let fragment be the result of invoking the fragment parsing algorithm with the new value as markup, and the context object as the context element.
Replace all with fragment within the context object.

outerHTML of type DOMString

The outerHTML IDL attribute represents the markup of the Element and its contents.

element . outerHTML [ = value ]

Returns a fragment of HTML or XML that represents the element and its contents.

Can be set, to replace the element with nodes parsed from the given string.

In the case of an XML document, will throw a DOMException with name InvalidStateError if the element cannot be serialized to XML, and a DOMException with name SyntaxError if the given string is not well-formed.

Throws a DOMException with name NoModificationAllowedError if the parent of the element is the Document node.

On getting, if the context object's node document is an HTML document, then the attribute must return the result of running the HTML fragment serialization algorithm on a fictional node whose only child is context object; otherwise, the context object's node document is an XML document, and the attribute must return the result of running the XML fragment serialization algorithm on that fictional node instead (this might throw an exception instead of returning a string).

On setting, the following steps must be run:

Let parent be the context object's parent.
If parent is null, terminate these steps. There would be no way to obtain a reference to the nodes created even if the remaining steps were run.
If parent is a Document, throw a DOMException with name NoModificationAllowedError and terminate these steps.
If parent is a DocumentFragment, let parent be a new Element with
- body as its local name,
- the HTML namespace as its namespace, and
- the context object's node document as its node document.
Let fragment be the result of invoking the fragment parsing algorithm with the new value as markup, and parent as the context element.
Replace the context object with fragment within the context object's parent.
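
This example is non-normative. The outerHTML setter steps can be modelled on a simple tree, as in the Python sketch below. The `Node` class, its `kind` strings, and the `parse_fragment` callable are hypothetical; the callable stands in for the fragment parsing algorithm and is assumed to return a list of new nodes.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


class NoModificationAllowedError(Exception):
    """Stands in for a DOMException with name NoModificationAllowedError."""


@dataclass(eq=False)  # compare nodes by identity, as DOM nodes are
class Node:
    kind: str  # "document", "fragment", "element", or "text"
    name: str = ""
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)


def set_outer_html(target: Node, markup: str,
                   parse_fragment: Callable[[str, Node], List[Node]]) -> None:
    parent = target.parent
    if parent is None:
        return  # nothing could reference the new nodes, so do nothing
    if parent.kind == "document":
        raise NoModificationAllowedError("parent is a Document")
    # A DocumentFragment cannot be a context element; use a body element.
    context = Node("element", "body") if parent.kind == "fragment" else parent
    new_nodes = parse_fragment(markup, context)
    # Replace the target with the parsed nodes inside its actual parent.
    index = parent.children.index(target)
    parent.children[index:index + 1] = new_nodes
    for node in new_nodes:
        node.parent = parent
    target.parent = None
```

The sketch keeps the substitute body element in a separate `context` variable so that the replacement still happens inside the node's actual parent.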

6.2 Methods

insertAdjacentHTML

element . insertAdjacentHTML(position, text)

Parses the given string text as HTML or XML and inserts the resulting nodes into the tree in the position given by the position argument, as follows:

"beforebegin": Before the element itself.
"afterbegin": Just inside the element, before its first child.
"beforeend": Just inside the element, after its last child.
"afterend": After the element itself.

Throws a TypeError exception if the position argument has an invalid value.

In XML documents, throws a DOMException with name SyntaxError if the given string is not well-formed.

Throws a DOMException with name NoModificationAllowedError if the given position isn't possible (e.g. inserting elements after the root element of a Document).

The insertAdjacentHTML(position, text) method must run these steps:

Use the first matching item from this list:

If position is an ASCII case-insensitive match for the string "beforebegin"
If position is an ASCII case-insensitive match for the string "afterend"
  Let context be the context object's parent.
  If context is null or a document, throw a DOMException with name NoModificationAllowedError and terminate these steps.
If position is an ASCII case-insensitive match for the string "afterbegin"
If position is an ASCII case-insensitive match for the string "beforeend"
  Let context be the context object.
If context is not an Element or the following are all true:
- context's node document is an HTML document,
- context's local name is "html", and
- context's namespace is the HTML namespace;
let context be a new Element with
- body as its local name,
- the HTML namespace as its namespace, and
- the context object's node document as its node document.
Let fragment be the result of invoking the fragment parsing algorithm with text as markup, and context as the context element.
Use the first matching item from this list:

If position is an ASCII case-insensitive match for the string "beforebegin"
  Insert fragment into the context object's parent before the context object.
If position is an ASCII case-insensitive match for the string "afterbegin"
  Insert fragment into the context object before its first child.
If position is an ASCII case-insensitive match for the string "beforeend"
  Append fragment to the context object.
If position is an ASCII case-insensitive match for the string "afterend"
  Insert fragment into the context object's parent before the context object's next sibling.
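
This example is non-normative. The four insertion positions can be illustrated with a small tree model in Python. The `Node` class and the `parse_fragment` callable are hypothetical stand-ins; the model has no namespaces, so every element is assumed to be in the HTML namespace, and Python's built-in `TypeError` stands in for the TypeError exception.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


class NoModificationAllowedError(Exception):
    """Stands in for a DOMException with name NoModificationAllowedError."""


@dataclass(eq=False)  # compare nodes by identity
class Node:
    kind: str  # "document", "fragment", "element", or "text"
    name: str = ""
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)


def ascii_lower(value: str) -> str:
    # Lowercase only A-Z, for an ASCII case-insensitive comparison.
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in value)


def insert_adjacent(target: Node, position: str, text: str,
                    parse_fragment: Callable[[str, Node], List[Node]],
                    html_document: bool = True) -> None:
    position = ascii_lower(position)
    if position not in ("beforebegin", "afterbegin", "beforeend", "afterend"):
        raise TypeError("invalid position: " + position)
    outside = position in ("beforebegin", "afterend")
    # Choose the context: the parent for outside positions, else the target.
    context = target.parent if outside else target
    if outside and (context is None or context.kind == "document"):
        raise NoModificationAllowedError(position)
    # Substitute a body element when the context is unsuitable.
    if context.kind != "element" or (html_document and context.name == "html"):
        context = Node("element", "body")
    new_nodes = parse_fragment(text, context)
    if outside:
        owner = target.parent
        index = owner.children.index(target) + (1 if position == "afterend" else 0)
    else:
        owner = target
        index = 0 if position == "afterbegin" else len(owner.children)
    owner.children[index:index] = new_nodes
    for node in new_nodes:
        node.parent = owner
```

Inserting at "afterend" places the nodes at the index just past the target, which is equivalent to inserting before the target's next sibling.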

Parameter	Type	Nullable	Optional	Description
position	`insertAdjacentHTMLPosition`	✘	✘
text	`DOMString`	✘	✘

Return type: void

8. Extensions to the `Range` interface

partial interface Range {
    DocumentFragment createContextualFragment (DOMString fragment);
};

8.1 Methods

createContextualFragment

fragment = range . createContextualFragment(fragment): Returns a DocumentFragment, created from the markup string given.

The createContextualFragment(fragment) method must run these steps:

If the context object's detached flag is set, throw a DOMException with name InvalidStateError and terminate these steps.
Let node be the context object's start node.
Let element be as follows, depending on node's interface:

Document
DocumentFragment
  null
Element
  node
Text
Comment
  node's parent element
DocumentType
ProcessingInstruction
  [DOM4] prevents this case.
If either element is null or the following are all true:
- element's node document is an HTML document,
- element's local name is "html", and
- element's namespace is the HTML namespace;
let element be a new element with
- "body" as its local name,
- the HTML namespace as its namespace, and
- the context object's node document as its node document.
Let fragment node be the result of invoking the fragment parsing algorithm with fragment as markup, and element as the context element.
Unmark all scripts in fragment node as "already started".
Return fragment node.
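
This example is non-normative. The selection of the context element in createContextualFragment can be sketched in Python as follows. The `Node` class and the function name `context_element_for` are hypothetical; the model has no namespaces, so every element is assumed to be in the HTML namespace.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)  # compare nodes by identity
class Node:
    kind: str  # "document", "fragment", "element", "text", or "comment"
    name: str = ""
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)


def context_element_for(start_node: Node, html_document: bool = True) -> Node:
    if start_node.kind in ("document", "fragment"):
        element = None
    elif start_node.kind == "element":
        element = start_node
    elif start_node.kind in ("text", "comment"):
        # The parent element is the parent only if that parent is an element.
        parent = start_node.parent
        element = parent if parent is not None and parent.kind == "element" else None
    else:
        # A range cannot start in a DocumentType or ProcessingInstruction here.
        raise ValueError("unexpected start node kind: " + start_node.kind)
    # Fall back to a new body element when no usable element was found,
    # or when the element is the html element of an HTML document.
    if element is None or (html_document and element.name == "html"):
        element = Node("element", "body")
    return element
```

The element returned here is the one that would be passed as the context element to the fragment parsing algorithm.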

Parameter	Type	Nullable	Optional	Description
fragment	`DOMString`	✘	✘

Return type: DocumentFragment

DOM Parsing and Serialization

W3C Working Draft 20 September 2012

Abstract

Status of This Document

Table of Contents

Issues

1. Conformance

1.1 Dependencies

1.2 Extensibility

2. Terminology

3. Parsing and serializing Nodes

3.1 Parsing

3.2 Serializing

4. The `DOMParser` interface

4.1 Methods

5. The `XMLSerializer` interface

5.1 Methods

6. Extensions to the `Element` interface

6.1 Attributes

6.2 Methods

7. Extensions to the `Text` interface

7.1 Attributes

8. Extensions to the `Range` interface

8.1 Methods

A. Acknowledgements

B. References

B.1 Normative references

B.2 Informative references
