This is an archived snapshot of W3C's public Bugzilla bug tracker, decommissioned in April 2019.
This was cloned from bug 16768 as part of operation convergence.
Originally filed: 2012-04-18 07:54:00 +0000
Original reporter: Anne <annevk@annevk.nl>

================================================================================
#0 Anne 2012-04-18 07:54:33 +0000
--------------------------------------------------------------------------------
The IANA registry is unbounded, does not match implementations when it comes to encodings and their labels, does not detail extensions to encodings that need to be supported, and does not detail error handling for encodings; it is inadequate by today's standards. http://dvcs.w3.org/hg/encoding/raw-file/tip/Overview.html was written to solve this problem, and by using it in HTML we can simplify the following:

* Instead of "preferred MIME name" we can now talk about the "name" of the "encoding".
* "ASCII-compatible character encoding" is no longer needed, as only utf-16 and utf-16be are incompatible per the restricted list.
* The "decode a byte string as UTF-8, with error handling" algorithm can be removed in favor of "utf-8 decode", which has the correct error handling (they should be identical).
* For encoding (URLs and <form>) a custom "encoder error" needs to be defined, by returning from the encoder algorithm and feeding it the intended replacement characters. (You do not know in advance which code points cannot be encoded; a sketch of this fallback follows below.)
* In the suggested default encoding list, the encoding names can be updated to use the canonical name rather than a label.
* "Misinterpreted for compatibility" is no longer needed, and the encoding overrides table can also be removed.

================================================================================
#1 Jirka Kosek 2012-04-18 11:05:14 +0000
--------------------------------------------------------------------------------
Hi Anne,

thanks for the draft. Where are the labels coming from? I'm asking because if the aim of the spec is to handle legacy content, then additional labels should be added. For example, windows-1250 was sometimes referred to as cp1250, and you will find plenty of such pages in the wild.

Jirka

================================================================================
#2 Anne 2012-04-18 11:31:55 +0000
--------------------------------------------------------------------------------
The current draft is indeed rather conservative when it comes to single-byte labels (IE is the only browser that does not recognize that label, as far as I can tell). I filed bug 16773 to change that.

================================================================================
#3 Anne 2012-05-23 07:52:49 +0000
--------------------------------------------------------------------------------
*** Bug 17151 has been marked as a duplicate of this bug. ***

================================================================================
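A minimal TypeScript sketch of the "encoder error" fallback described in comment #0, assuming a hypothetical legacy-encoder callback (browsers only ship a UTF-8 TextEncoder, so no real legacy-encoder API is shown here). HTML's form submission replaces an unencodable code point with a decimal character reference; the names below are illustrative, not the spec's.

    // Hypothetical shape of a legacy encoder: null means "no mapping".
    type LegacyEncoder = (codePoint: number) => Uint8Array | null;

    function encodeWithHtmlFallback(input: string, encode: LegacyEncoder): Uint8Array {
      const out: number[] = [];
      for (const ch of input) { // string iteration yields whole code points
        const cp = ch.codePointAt(0)!;
        const bytes = encode(cp);
        if (bytes !== null) {
          out.push(...bytes);
        } else {
          // Encoder error: only discovered by attempting the encode. Fall back
          // to "&#<cp>;", whose ASCII bytes every supported encoding can carry.
          for (const c of `&#${cp};`) out.push(c.charCodeAt(0));
        }
      }
      return new Uint8Array(out);
    }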
Do you flag people using bytes that aren't compatible between ISO-8859-1 and Win1252 as a conformance error anywhere, or are we just saying ISO-8859-1 is bogus and these are the new tables, end of story?

I've left references to "ASCII-compatible character encoding" for now; is it not still plausible that people are using EBCDIC mainframes and implementing HTML parsers for them?

The "utf-8 decode" and "decode" algorithms are too clever for HTML's use, so I just directly use the relevant decoder algorithms. "encode" doesn't seem to add anything useful vs "encoder", either.

> (You do not know in advance which code points cannot be encoded.)

Can you elaborate on this?

This patch is kinda long and I'm not at all sure I got it all right, so if you see anything I missed don't hesitate to let me know.
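For what it's worth, the resolution the Encoding Standard ended up with is observable in today's engines (the TextDecoder API postdates this thread): "iso-8859-1" is simply a label for windows-1252, so the bytes where the two historically diverged (0x80-0x9F) decode to the Windows graphic characters either way.

    // Both labels name the same encoding in the Encoding Standard.
    const bytes = new Uint8Array([0x80, 0x93, 0x94]); // windows-1252: € “ ”
    const a = new TextDecoder("iso-8859-1").decode(bytes);
    const b = new TextDecoder("windows-1252").decode(bytes);
    console.log(a === b, a); // true "€“”" (not C1 control characters)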
Checked in as WHATWG revision r7647. Check-in comment: Embrace the Encodings specification. http://html5.org/tools/web-apps-tracker?from=7646&to=7647
Basically, HTML now outlaws EBCDIC, so I don't think we should account for that possibility, just like specifications leave non-8-bit byte architectures as an exercise for the reader.

> Can you elaborate on this?

What I meant is that knowing whether you can encode a given code point or decode a given byte requires running through an algorithm that effectively attempts that operation. There's no concept of "X can encode/decode set Y".

As for utf-8 decode: I was hoping we could end up with all specifications using the same routine and the same algorithm in the backend. Having HTML use utf-8 decode (and similar) would encourage that and would make it completely obvious that it is in fact possible. (And if we later need to tweak something, there's only one place to do it, yadayadayada.)
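A sketch of that last point in TypeScript, with a hypothetical Encoder shape (not a real API): there is no membership test to consult, only the attempt itself.

    // Hypothetical encoder: "error" signals an unmappable code point.
    type Encoder = (codePoint: number) => Uint8Array | "error";

    // No "set of encodable code points" exists per encoding; the only way to
    // answer "can this encoding represent this code point?" is to run it.
    function canEncode(encoder: Encoder, codePoint: number): boolean {
      return encoder(codePoint) !== "error";
    }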
I don't think having specs ignore real problems is a good policy. I'm not at all convinced that there are no EBCDIC systems out there connected to the Web. If it's true that EBCDIC is dead, then great, but if it's only almost dead like XML, then we should still cater for it (like we do with XML). Leaving open to see if I can move the BOM handling more to the encoding spec.
I've attempted to fix this for HTML, but WebVTT still needs fixing.
Checked in as WHATWG revision r7782. Check-in comment: Strip a leading BOM from scripts in workers, if any. Also, use more of the encoding spec. http://html5.org/tools/web-apps-tracker?from=7781&to=7782
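Roughly what that change amounts to, as a sketch (the helper name is ours, not the spec's): after decoding a worker script's source, drop a single leading U+FEFF if present.

    // Strip one leading BOM from an already-decoded script source.
    function stripLeadingBom(source: string): string {
      return source.charCodeAt(0) === 0xfeff ? source.slice(1) : source;
    }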
I assume it should go into the new version of the WebVTT spec. So, just checking what needs to be changed: basically replace "<span>decoded as UTF-8, with error handling</span>" with "decoded using the <span>UTF-8 decoder</span>"?
Hmm, also probably remove:

<li><p>If the character indicated by <var title="">position</var> is a U+FEFF BYTE ORDER MARK (BOM) character, advance <var title="">position</var> to the next character in <var title="">input</var>.</p></li>

?
Sylvia, yes, but you want to use http://encoding.spec.whatwg.org/#utf-8-decode rather than the utf-8 decoder. And then you can indeed remove the step about the BOM.
(In reply to comment #9)
> Sylvia, yes, but you want to use
> http://encoding.spec.whatwg.org/#utf-8-decode rather than the utf-8 decoder.
> And then you can indeed remove the step about the BOM.

Yes, that's what I meant. :-) Thanks!
I am confused by the note in the WHATWG spec: "The UTF-8 decoder is distinct from the UTF-8 decode algorithm. The latter first strips a Byte Order Mark (BOM), if any, and then invokes the former." If I remove the BOM paragraph, I should then reference the "UTF-8 decode algorithm", right?
Yes.
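For illustration, today's TextDecoder API (which postdates this thread) exposes both behaviours from that note; a short sketch:

    const bytes = new Uint8Array([0xef, 0xbb, 0xbf, 0x68, 0x69]); // BOM + "hi"

    // "utf-8 decode": a leading BOM is stripped before decoding (the default).
    new TextDecoder("utf-8").decode(bytes); // "hi"

    // The raw "utf-8 decoder": the BOM survives as U+FEFF in the output.
    new TextDecoder("utf-8", { ignoreBOM: true }).decode(bytes); // "\uFEFFhi"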
Good. Here we go: https://dvcs.w3.org/hg/text-tracks/rev/27fcd202d32d
Patch was applied as prepared.