This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.
See the thread "During HTML parsing, are *all* named character references replaced by their corresponding glyph?", and in particular this answer from Michael: http://www.w3.org/mid/20130624113437.GB37583@sideshowbarker

What Michael said is easy to forget. Thus, I think this subject needs a little more description in Polyglot Markup. Right now, only <script> and <style> are covered, and also <noscript>.

I would propose to:

a) Add a section that describes the general issue of content that, unlike in XML, is treated as text by the HTML parser. Motivation: this is an important and general gotcha and difference, both within pure HTML and especially when creating polyglots.

b) In practice, this means listing all the elements that themselves, or whose children, are treated as text by HTML parsers. This includes all elements that begin with the string "<no", such as <noscript> and <noframes>, as well as <script>, <style>, <xmp>, <iframe> and perhaps some more (?). NB: It may also make sense to mention, in a note, that the "sane" elements, such as <object>, <video> etc., are not treated that way.

c) The section should give the various usage rules under this heading: some elements are forbidden, while others have special rules for polyglots. (Thus, the script/style rules should go there, or at least be represented with a link to the section where they are described.)

Btw, note that HTML5 already says that the content of iframe must be empty in XML, so describing iframe should be a no-brainer. See http://www.w3.org/TR/html5/embedded-content-0.html#iframe-content-model And HTML5 has similar things to say about most, if not all, of these elements, so it is mostly a collection job.
<title> is among these elements.
(In reply to comment #1)
> <title> is among these elements.

More data: <title> falls under the "generic RCDATA element parsing algorithm", which means that character entities/references are still decoded but that tags (other than the end tag of the element itself) are not parsed as markup and end up as literal text. By contrast, <iframe>, for example, falls under the "generic raw text element parsing algorithm", which means that both tags (except for the end tag) and character entities/references are left as literal text. See: http://www.w3.org/html/wg/drafts/html/master/syntax.html#generic-rcdata-element-parsing-algorithm
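
To make the RCDATA versus raw text distinction concrete, here is a minimal Python sketch. It is not the HTML parsing algorithm, only a simplified model of what text each kind of element ends up containing, compared with what an XML parser does with the same source. The function names `html_text_content` and `xml_text_content` and the two lookup sets are made up for this illustration. The sets cover only elements named in this bug, and <noscript> is left out because its treatment depends on whether scripting is enabled.

```python
import html
import xml.etree.ElementTree as ET

# Simplified subset of elements mentioned in this bug (hypothetical
# lookup tables for illustration, not the complete list from the spec).
RAW_TEXT = {"script", "style", "iframe", "xmp", "noframes"}
RCDATA = {"title"}


def html_text_content(tag: str, inner: str) -> str:
    """Approximate the text an HTML parser yields for `inner` inside <tag>."""
    if tag in RAW_TEXT:
        # Raw text: neither tags nor character references are interpreted.
        return inner
    if tag in RCDATA:
        # RCDATA: character references are decoded, tags stay literal text.
        return html.unescape(inner)
    raise ValueError(f"<{tag}> is not in this simplified model")


def xml_text_content(tag: str, inner: str):
    """Parse the same source as XML; return (text, child element names)."""
    root = ET.fromstring(f"<{tag}>{inner}</{tag}>")
    return "".join(root.itertext()), [child.tag for child in root]


if __name__ == "__main__":
    source = "a &amp; <b>b</b>"
    print(html_text_content("title", source))   # a & <b>b</b>
    print(html_text_content("iframe", source))  # a &amp; <b>b</b>
    print(xml_text_content("title", source))    # ('a & b', ['b'])
```

The same source therefore gives three different results: in HTML the <b> tag is literal text inside both <title> and <iframe>, with only <title> decoding `&amp;`, while in XML it becomes a real child element. That mismatch is the gotcha a polyglot document has to avoid.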
This sounds good, Leif. Can you create proposed text for this?
I have just committed a fix for this bug. However, for polyglot markup, only script, style, iframe and title are relevant, unless I missed something. Hopefully this can now be closed, but I will look at it once more first.