This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Bug 11755 - The introduction should be clearer about use cases best addressed by polyglot markup
Status: RESOLVED FIXED
Alias: None
Product: HTML WG
Classification: Unclassified
Component: LC1 HTML/XHTML Compatibility Authoring Guide (ed: Eliot Graff)
Version: unspecified
Hardware: All All
Importance: P2 normal
Target Milestone: ---
Assignee: Eliot Graff
QA Contact: HTML WG Bugzilla archive list
Reported: 2011-01-14 09:11 UTC by Henri Sivonen
Modified: 2011-08-04 05:07 UTC

Description Henri Sivonen 2011-01-14 09:11:29 UTC
The draft says:
"It is often valuable to be able to serve HTML5 documents that are also well formed XML documents. An author may, for example, use XML tools to generate a document, and they and others may process the document using XML tools. These documents are served as text/html."

The quoted part has four problems:

1) It claims "often valuable" in the passive voice without substantiating the claim beyond what is said in the next sentence, and the next sentence is not on very strong ground, as shown below.

2) If an author uses XML tools to generate the document, using a generic XML serializer is not OK, because a generic serializer might do whatever is OK in application/xhtml+xml but not necessarily in text/html. As a trivial example, a generic XML serializer would likely serialize a script element pointing to an external script as <script src="foo.js"/>, which would be very wrong in text/html. Thus, the author needs a text/html-aware serializer anyway to be able to successfully use the output as text/html: either a polyglot serializer or a text/html-only serializer. Once a text/html-aware serializer is needed instead of a generic XML serializer, it isn't necessary to make the serializer polyglot if the goal is simply to produce text/html content using otherwise XML tools. Monoglot serializers for either text/html or for XML can serialize the text content of the style and script elements with relative ease. However, a strictly polyglot serializer can't support inline scripts and styles in the general case. (The serializer would either have to relax DOM sameness by generating /* <![CDATA[ */ at the start of the text content and /* ]]> */ at the end of the text content or ban the characters <, > and & in the script or style sheet, which would be a drastic restriction.) Using a monoglot serializer avoids this problem, so polyglot isn't a good solution for creating text/html content from an XML tool (such as an XSLT processor).
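
The serializer difference described in point 2 can be reproduced with a minimal sketch. It assumes Python's standard-library ElementTree as a stand-in for "a generic XML serializer"; its "html" output method is only a simple HTML-aware mode, not a full HTML5 serializer, and the element contents are made up for illustration.

```python
import xml.etree.ElementTree as ET


def serialize(elem, method):
    """Serialize one element with the given ElementTree output method."""
    return ET.tostring(elem, encoding="unicode", method=method)


# A script element pointing to an external script, as in the example above.
external = ET.Element("script", {"src": "foo.js"})

# An inline script whose text contains '<' and '&' (illustrative content).
inline = ET.Element("script")
inline.text = "if (a < b && c) run();"

# Generic XML output: the empty element is written as a self-closing tag,
# and the special characters in the script text are escaped.
external_xml = serialize(external, "xml")
inline_xml = serialize(inline, "xml")

# HTML-aware output: an explicit end tag is written, and script text is
# emitted without escaping.
external_html = serialize(external, "html")
inline_html = serialize(inline, "html")

for label, text in [
    ("external, xml ", external_xml),
    ("external, html", external_html),
    ("inline, xml   ", inline_xml),
    ("inline, html  ", inline_html),
]:
    print(label, "->", text)
```

Neither output of the inline case is polyglot: the XML form relies on entity escaping that a text/html parser does not undo inside script, and the HTML form is not well-formed XML.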

3) Polyglot isn't a very effective way of allowing others to process the document using XML tools, either. For someone else to be able to consume text/html content using an XML parser, every document (s)he wants to consume has to be polyglot. If the content to be consumed is Web content in general, there's no way to force all of it to be polyglot. From the point of view of the content consumer, it is easier to consume text/html content with an HTML parser that exposes the same APIs to the rest of the app that an XML parser would expose than to make agreements with document authors to get them to write polyglot markup. Once the consumer includes an HTML parser in the app, there's no longer value in any of the consumed docs being polyglot. Thus, from the point of view of a would-be polyglot author, making a document polyglot won't be of value if someone else whose document needs to be consumed by the same consumer makes a monoglot document. The would-be polyglot author might as well be the first one to make a monoglot document that forces the consumer to deal. Thus, getting authors to use polyglot markup isn't as good a solution to consuming text/html content with XML tools as putting an HTML parser at the start of the pipeline is.
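
The idea in point 3 of putting an HTML parser at the start of the pipeline can be sketched as follows. This is only an illustration under stated assumptions: it uses Python's standard-library html.parser, which is a tokenizer and not an HTML5-conformant parser, and the class name EtreeHTMLParser, the helper parse_html, and the sample document are hypothetical. The sketch only handles well-nested input; a real pipeline would use a conformant HTML parser.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Void elements never take an end tag in text/html.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class EtreeHTMLParser(HTMLParser):
    """Hypothetical adapter: feeds HTML tokens into an ElementTree builder."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.builder = ET.TreeBuilder()

    def handle_starttag(self, tag, attrs):
        # Valueless attributes arrive as None; store them as empty strings.
        self.builder.start(tag, {k: (v or "") for k, v in attrs})
        if tag in VOID:
            self.builder.end(tag)  # close void elements immediately

    def handle_endtag(self, tag):
        if tag not in VOID:
            self.builder.end(tag)

    def handle_data(self, data):
        self.builder.data(data)


def parse_html(text):
    """Return an ElementTree root, the same type an XML parser would return."""
    parser = EtreeHTMLParser()
    parser.feed(text)
    parser.close()
    return parser.builder.close()


# Monoglot text/html: <br> and <img> are not closed, so this is not XML.
doc = ('<html><head><title>t</title></head>'
       '<body><p>Hello<br>world</p><img src="a.png"></body></html>')

root = parse_html(doc)
print(root.find("body/p").text, root.find(".//img").get("src"))

try:
    ET.fromstring(doc)
    xml_ok = True
except ET.ParseError:
    xml_ok = False
print("parses as XML:", xml_ok)
```

The rest of the application works against the ElementTree API either way, so it does not matter to the consumer whether the input document was polyglot.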

4) The last quoted sentence says the documents are served as text/html but doesn't say why. A polyglot document is by definition a document that also works as application/xhtml+xml. The main reason not to serve such documents as application/xhtml+xml only is catering to the user base of IE versions earlier than IE9. It would be a shame to end up in a situation where authors keep addressing a transient problem even when the problem is gone (when IE6 through IE8 users no longer form a substantial audience). Once the author no longer wishes to address the IE6 through IE8 audience, the author could use a monoglot XML-only serializer for point #2 above.

Please either substantiate "often valuable" better or remove the claim. Please replace the stated use cases with use cases for which using polyglot markup is indeed the best known solution or, alternatively, please at least mention the alternative solutions I outlined in points #2 and #3 above. Please mention that the reason for serving content as text/html when it would work as application/xhtml+xml is a transient reason.
Comment 1 Eliot Graff 2011-02-12 01:04:54 UTC
The 11 February Editor's Draft has the following expanded Introduction:

]]
It is often valuable to be able to serve HTML5 documents that are also well formed XML documents. An author may, for example, use XML tools to generate a document, and they and others may process the document using XML tools. The language used to create documents that can be parsed by both HTML and XML parsers is called polyglot markup. Polyglot markup is the overlap language of documents that are both HTML5 documents and XML documents. It is recommended that these documents be served as either text/html (if the content is transmitted to an HTML-aware user agent) or application/xhtml+xml (if the content is transmitted to an XHTML-aware user agent). Other permissible MIME types are text/xml, application/xml, and any MIME type whose subtype ends with the four characters "+xml". [XML-MT] 

Polyglot markup: 
is valid HTML5. [HTML5]
is well-formed XML. [XML10]
results in identical DOMs (with the exception of the xmlns attribute on the root element) when processed as HTML and when processed as XML.

Polyglot markup is not constrained: 
to be valid XML. [XML10]
by conformance to any XML DTD.

 All web content need not be authored in polyglot markup. Polyglot markup is ideal for publishing when there's a strong desire to serve both HTML and XML tool chains without simultaneously having to maintain dual copies of the content: one in HTML and a second in XHTML. In addition, a single polyglot markup output requires less infrastructure to produce than to produce both HTML and XHTML output for the same content. Polyglot markup is also beneficial when lightweight processes, such as quick testing or even hand-authoring, are applied to content intended to be published both as HTML and XHTML, especially if that content is not sent through a tool chain. 
[[

I believe this satisfies the request for more concrete use cases and advice in #1. I take it that this is the main thrust of this bug and will resolve it as fixed. If you wish to follow up on the other points, please file separate bugs, so we can more easily track them, OK?

For #2, if you had an XML serializer that was outputting polyglot, tool chains can then become aware of polyglot and change to output polyglot. The benefit is that you don't need an explicit decision ahead of time as to whether you serve to XML or HTML. 

For #3, arbitrary input could be either XML or HTML. For input, polyglot may not be the best case. For input for things you have no control over, you're right, and so I mention in all cases the benefit for serving content. 

For #4, that was an omission. Thanks for the catch. I've changed the introduction to contain the following:

]]
It is recommended that these documents be served as either text/html (if the content is transmitted to an HTML-aware user agent) or application/xhtml+xml (if the content is transmitted to an XHTML-aware user agent). Other permissible MIME types are text/xml, application/xml, and any MIME type whose subtype ends with the four characters "+xml". [XML-MT] 
[[
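
The MIME type rule in the quoted text can be expressed as a small check. This is a minimal sketch: the function name is hypothetical, and it tests only the media type string as the quoted sentence describes it, not whether the document itself is polyglot.

```python
def is_permissible_polyglot_mime(mime):
    """Check a media type against the list in the quoted introduction.

    Accepts text/html, text/xml, application/xml, and any type whose
    subtype ends with the four characters "+xml".
    """
    # Drop parameters such as "; charset=utf-8" and normalize case.
    essence = mime.split(";", 1)[0].strip().lower()
    if essence in {"text/html", "text/xml", "application/xml"}:
        return True
    type_, sep, subtype = essence.partition("/")
    return bool(type_) and bool(sep) and subtype.endswith("+xml")


for candidate in ["text/html; charset=utf-8", "application/xhtml+xml",
                  "image/svg+xml", "text/plain", "application/json"]:
    print(candidate, "->", is_permissible_polyglot_mime(candidate))
```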

Thanks, once more, for all the great work.

Eliot
Comment 2 Michael[tm] Smith 2011-08-04 05:06:53 UTC
mass-move component to LC1
Comment 3 Michael[tm] Smith 2011-08-04 05:07:16 UTC
mass-move component to LC1