This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.
+++ This bug was initially created as a clone of Bug #23587 +++ Consider following HTML: <!doctype html> <html> <head> <script type="text/javascript"> var user={name:"Jakub <!-- <script>"}</script> <!-- innocent comment --> <script type="text/javascript"> console.log("Hello" + user.name); </script> </head> </html> To put it into perspective imagine, that the server which generated it was executing something "innocent" like this: <?php echo 'var user=' . json_encode($user) ?> The problem is, that given the current automaton description found at http://www.w3.org/TR/html5/syntax.html#script-data-state it leads to matching the "<!--" found in the username with "-->" located after the "innocent comment". Moreover the script body surprisingly extends over to the "next" script. This can be verified in current version of Chrome for example, using the Chrome's console: $$('script')[0].innerHTML " var user={name:"Jakub <!-- <script>"}</script> <!-- innocent comment --> <script type="text/javascript"> console.log("Hello" + user.name); " Observe that there is no warning for the developer at the moment of parsing HTML. However when the JS parser kicks in it gives a (rather) surprising error: Uncaught SyntaxError: Invalid regular expression: missing / The reason for that is that the line var user={name:"Jakub <!-- <script>"}</script> gets parsed as var user=X<Y where Y is "/script>", which resembles a regular expression. Now, what I want to complain about is that the story can end in various different ways depending on such "details" as: 1. do I put the </script> in the same line or not 2. do I put semicolon after definition of user variable 3. do I have an <!-- innocent comment --> after the tag or not 4. do I have a second <script> tag after the innocent comment or not Clearly, this does not help to reach goals which are mentioned in Section "1.10.3 Restrictions on content models and on attribute values", as I wasted 8 hours today debugging this issue in a real life scenario. The reason it was so hard to debug, was that it required aligment of so many planets to reproduce (the username had to contain both <!-- and <script> but not </script>, we needed an html comment afterwards, and another script tag, all of which were independent conditions which happened or not depending on things like adserver targeting etc.). It would help me a lot if the section "4.3.1.2 Restrictions for contents of script elements" at least provided reasons behind this strange set of rules -- I would really like to understand why the "double escape" mode triggered by "<!-- <script>" combo is needed. It would helped even more if some practices were suggested, which could help avoided such problems (for example: "Authors should always escape "<" character as "\x3C" in their strings" or something).
My recommendation is to introduce a parse error here: [[ If the temporary buffer is the string "script", then switch to the script data double escaped state. ]] http://www.whatwg.org/specs/web-apps/current-work/multipage/tokenization.html#script-data-double-escape-start-state Rationale: it's only once you enter that state where the parser can swallow a "</script>", which is a problematic thing in conforming parsers that currently goes without any error in validators despite checking the content model restrictions. The "false positive" case of people doing something like <script><!-- document.write('<script></'+'script>') //--></script> is relatively rare and I don't mind the validator flagging an error about it. The current spec catches various errors already (that are less severe now that legacy parsers aren't around), but doesn't properly cover this case.
Jakub: I recommend reading the WHATWG spec, not the W3C spec, and definitely not the spec on the W3C's TR page. The W3C spec is an out-of-date fork of the WHATWG spec with numerous differences (and it's unclear which are intentional — they haven't documented their changes). The WHATWG HTML standard (as of yesterday) has some text that explains this, at the top (in green) and the bottom (in an example) of this section: http://whatwg.org/specs/web-apps/current-work/#restrictions-for-contents-of-script-elements Is that sufficient? Simon: How would you phrase the corresponding authoring conformance criteria? Would it actually really just mean changing the "Restrictions for contents of script elements" in some way? (What way?)
Ian 'Hixie' Hickson : Yes, the explanations are very good. However, I would prefer a simplier advice for developers: "escape < character as \u003C", which is (I think) much easier to implement and remember. Also, maybe it's just me, by do not understand this part of explanation: What is going on here is that for legacy reasons, "<!--" and "<script" strings in script elements in HTML need to be balanced in order for the parser to consider closing the block. If that's really the reason behind this, then why the grammar does not actually require them to be balanced? In a context free gram,ar that should be trivial to express, yet the current grammar does something quite different (I do not even understand what it does, but it seems that <script><!-- <script><script><script><script><script><script></script> --></script> is a good HTML even though it is clearly not balanced).
Yeah I guess we could ban <script (and maybe also </script while at it?) in <!-- --> in the ABNF. Don't ask me how, sorry. I can follow the parser but not the ABNF. :-|
(In reply to Jakub Łopuszański from comment #3) > Ian 'Hixie' Hickson : Yes, the explanations are very good. However, I would > prefer a simplier advice for developers: "escape < character as \u003C", > which is (I think) much easier to implement and remember. That would work too, sure. I'll add that in. > Also, maybe it's just me, by do not understand this part of explanation: > > What is going on here is that for legacy reasons, "<!--" and "<script" > strings in script elements in HTML need to be balanced in order for the > parser to consider closing the block. > > If that's really the reason behind this, then why the grammar does not > actually require them to be balanced? Yeah, "balanced" is the wrong word. I'm not sure what the right word is. > In a context free gramar that should > be trivial to express, yet the current grammar does something quite > different (I do not even understand what it does, but it seems that > <script><!-- <script><script><script><script><script><script></script> > --></script> is a good HTML even though it is clearly not balanced). Yeah... it's more like, it has to be balanced, but only the outer most of each type counts, and only in a particular order, and you can have a trailing <!--, and it's ok to be missing the </script> if you have a -->, and... I tried to write it in prose one time, and couldn't figure out how to do it, which is why it's in ABNF. (In reply to Simon Pieters from comment #4) > Yeah I guess we could ban <script (and maybe also </script while at it?) in > <!-- --> in the ABNF. Don't ask me how, sorry. I can follow the parser but > not the ABNF. :-| script = data1 *( comment-start data2 [ comment-end data1 ] ) data1 = < string that doesn't contain substring that matches not-data1 > not-data1 = comment-start data2 = < string that doesn't contain substring that matches not-data2 > not-data2 = script-start / script-end / comment-end command-start = "<!--" comment-end = "-->" script-start = lt s c r i p t tag-end script-end = lt slash s c r i p t tag-end lt = %x003C ; U+003C LESS-THAN SIGN character (<) slash = %x002F ; U+002F SOLIDUS character (/) s = %x0053 ; U+0053 LATIN CAPITAL LETTER S s =/ %x0073 ; U+0073 LATIN SMALL LETTER S c = %x0043 ; U+0043 LATIN CAPITAL LETTER C c =/ %x0063 ; U+0063 LATIN SMALL LETTER C r = %x0052 ; U+0052 LATIN CAPITAL LETTER R r =/ %x0072 ; U+0072 LATIN SMALL LETTER R i = %x0049 ; U+0049 LATIN CAPITAL LETTER I i =/ %x0069 ; U+0069 LATIN SMALL LETTER I p = %x0050 ; U+0050 LATIN CAPITAL LETTER P p =/ %x0070 ; U+0070 LATIN SMALL LETTER P t = %x0054 ; U+0054 LATIN CAPITAL LETTER T t =/ %x0074 ; U+0074 LATIN SMALL LETTER T tag-end = %x0009 ; U+0009 CHARACTER TABULATION (tab) tag-end =/ %x000A ; U+000A LINE FEED (LF) tag-end =/ %x000C ; U+000C FORM FEED (FF) tag-end =/ %x0020 ; U+0020 SPACE tag-end =/ %x002F ; U+002F SOLIDUS (/) tag-end =/ %x003E ; U+003E GREATER-THAN SIGN (>) ...but IIRC the reason we didn't do this in the first place was that we decided it would match too many pages, and thus cause too many pages to be non-conforming despite working fine.
Checked in as WHATWG revision r8236. Check-in comment: Add a related way to escape scripts. http://html5.org/tools/web-apps-tracker?from=8235&to=8236
(In reply to Ian 'Hixie' Hickson from comment #5) > ...but IIRC the reason we didn't do this in the first place was that we > decided it would match too many pages, and thus cause too many pages to be > non-conforming despite working fine. I think the focus back then was on the behavior, document conformance wasn't discussed much. I think <script><!--...</script> is more common than <script><!--<script></script>...-->...</script>. We give an error for the former but not the latter, even though the former works fine (in non-legacy parsers) but the latter might not match what was intended.
Checked in as WHATWG revision r8237. Check-in comment: Missed a case in previous checkin. http://html5.org/tools/web-apps-tracker?from=8236&to=8237
Actually I'm going to back out the suggestion to just escape "<", because you can't escape it in expressions but it could still be problematic there, as in: var didOk = actualCount<script (In reply to Simon Pieters from comment #7) > > I think the focus back then was on the behavior, document conformance wasn't > discussed much. We might not have discussed it. It was definitely on my mind, or I wouldn't have written this whole big section. :-) > I think <script><!--...</script> is more common than > <script><!--<script></script>...-->...</script>. We give an error for the > former but not the latter, even though the former works fine (in non-legacy > parsers) but the latter might not match what was intended. We shouldn't give an error for the former as far as I can tell from reading the spec. Why do you think it should give an error? (I'm assuming your outer <script> and </script> tags are wrapping the contents of a script element, and are not part of the script element's contents.)
Checked in as WHATWG revision r8238. Check-in comment: Revert the last two checkins. http://html5.org/tools/web-apps-tracker?from=8237&to=8238
(In reply to Ian 'Hixie' Hickson from comment #9) > We shouldn't give an error for the former as far as I can tell from reading > the spec. Why do you think it should give an error? Hmm. I thought it did, but it appears you're right. Validator.nu gives an error though. I think it would make sense to have an error for that case because it resulted in bad parsing in legacy UAs and is clearly a mistake. The <style> element doesn't allow it for that reason. > (I'm assuming your outer <script> and </script> tags are wrapping the > contents of a script element, and are not part of the script element's > contents.) Right.
(In reply to Simon Pieters from comment #11) > (In reply to Ian 'Hixie' Hickson from comment #9) > > > We shouldn't give an error for the former as far as I can tell from reading > > the spec. Why do you think it should give an error? > > Hmm. I thought it did, but it appears you're right. Validator.nu gives an > error though. Maybe because of http://bugzilla.validator.nu/show_bug.cgi?id=697#c0 ? (the "I'd suggest having a warning for the other (r)cdata elements that don't match the ABNF for style." part).
I don't think so. It was asking for a warning, but it appears to not be implemented for e.g. <xmp><!--</xmp>
Simon and others pointed out on IRC that even the advice in the spec now, of escaping <\!-- and <\script> and so on, isn't really something you can always do. Maybe the right solution is for validators to also syntax-check the scripts, so that we have three possible ways of catching the errors — the content restrictions, the HTML parser, and the JS parser — but even then I guess you can't avoid all problems.
I can't figure out a good thing to do here. What's in the spec does vaguely explin the issue, and gives an example, but I can't really see anything we could add that would make it better.
I think the spec should make these non-conforming: <script><!--</script> (to match <style> and to not get weird behavior in legacy browsers and because it's clearly a mistake) <script><!--<script> ...(whatever)... </script> (because getting into this state means confusing things can happen and author intent isn't so clear, and it's easy to avoid it by not using <!-- --> cargo cult and doing escaping)
Ok, done. Is that enough?
Checked in as WHATWG revision r8299. Check-in comment: Remove the parts of the script content model that could lead to unbalanced crazy http://html5.org/tools/web-apps-tracker?from=8298&to=8299
AFAICT it still allows <script><!--<script> ...(whatever)... </script> if the ...(whatever)... is </script>-->
That was my intent. I didn't realise you didn't mean to include that, my bad. Which of the following do you want to allow vs not allow? (Each line is the textContent of a <script> block; actual markup not shown to avoid confusing internal textual contents with identical-looking external markup.) <!-- --> <!-- <script> <!-- <script> --> <!-- <script> </script> <!-- <script> </script> --> <script> <script> </script> </script> --> <!-- --> <!-- --> <!-- --> <script> ...anything else?
(In reply to Ian 'Hixie' Hickson from comment #20) > <!-- --> Allow. > <!-- <script> > <!-- <script> --> > <!-- <script> </script> > <!-- <script> </script> --> Don't allow. > <script> > <script> </script> > </script> Allow. (I think it's pointless to check for </script> here because you can't end up with it from parsing. But I don't have a strong opinion on that either way.) > --> > <!-- --> <!-- --> > <!-- --> <script> Allow. > > ...anything else?
I missed some: <!-- --> <!-- <script> <!-- <script> <!-- <script> <script> <!-- <script> -->
(I assume from previous comments that they're all "don't allow".)
How about: <!-- <!-- -->
I'm assuming that one is ok. Please check my latest attempt!
Checked in as WHATWG revision r8313. Check-in comment: Another attempt at redefining <script> content rules to make zcorpan happy http://html5.org/tools/web-apps-tracker?from=8312&to=8313
(In reply to Ian 'Hixie' Hickson from comment #23) > (I assume from previous comments that they're all "don't allow".) Right. (In reply to Ian 'Hixie' Hickson from comment #25) > I'm assuming that one is ok. Yep. > Please check my latest attempt! LGTM! Thanks!
(In reply to Ian 'Hixie' Hickson from comment #14) > Simon and others pointed out on IRC that even the advice in the spec now, of > escaping <\!-- and <\script> and so on, isn't really something you can > always do. > > Maybe the right solution is for validators to also syntax-check the scripts, > so that we have three possible ways of catching the errors — the content > restrictions, the HTML parser, and the JS parser — but even then I guess you > can't avoid all problems. It's doable for the validator to syntax-check the scripts, so I guess I'll plan to go ahead and add that. (Though we're limited by whatever Rhino currently supports. And I don't know if Rhino is still be maintained and if it's up to date with ES6 now, but if not and there's any new syntax in ES6 that's not allowed in ES5, then the validator will end up flagging it as an error.)