This is related to I18N-ISSUE-371 (http://www.w3.org/International/track/issues/371).

The tests linked from http://www.w3.org/International/tests/repository/encoding/indexes/results-indexes.en.php show errors and discrepancies in various legacy single-byte encodings, mainly in the handling of unassigned code units. A cursory investigation suggests that the variation stems mainly from whether browsers pass unmapped bytes through, generate the U+FFFD replacement character, or generate a PUA code point.

Please examine what the majority of browsers currently do, combined with what makes sense, and adapt the mapping tables appropriately.
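For what it's worth, this kind of discrepancy can be probed directly from a browser console with TextDecoder (a minimal sketch, assuming TextDecoder support; byte 0xAA is one of the unassigned bytes in windows-1253):

```ts
const decoder = new TextDecoder("windows-1253");
// 0xAA has no entry in the windows-1253 index, so a spec-conformant
// decoder emits U+FFFD rather than passing the byte through or
// producing a PUA code point.
const result = decoder.decode(new Uint8Array([0xaa]));
console.log(result === "\uFFFD"); // true if the decoder follows the spec
```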
Per http://lists.w3.org/Archives/Public/www-international/2014JulSep/0189.html no changes are required. It seems there might have been a mistake in earlier editions of the tests.
I'm absolutely okay if somebody points out actual mistakes in earlier editions of my tests, but as long as we have unexplained discrepancies between different versions of tests (see http://lists.w3.org/Archives/Public/www-international/2014JulSep/0198.html), it's premature to close this bug. I have therefore reopened it (I hope this is temporary).
Are those tests publicly available now? It might be better to file a separate bug with links to your tests, as comment 0 does not point to them. (By the way, I got similar results to Richard's in http://dump.testsuite.org/encoding/single-byte-test.html, which is why I thought your tests might be buggy.)
My tests so far checked for the characters actually listed in the index files. Where there are gaps in the index files, Martin's tests checked for U+FFFD, which makes them tests for the decoding algorithm. I'm planning to change my tests and results to also check for U+FFFD where there is no correspondence listed in the index file.
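For reference, here is a sketch of that single-byte decoding algorithm (the `index` array here is illustrative, not real index data: entry i holds the code point for byte 0x80 + i, or null where the index file has a gap):

```ts
function decodeSingleByte(bytes: Uint8Array, index: Array<number | null>): string {
  let out = "";
  for (const byte of bytes) {
    if (byte < 0x80) {
      // The ASCII range decodes to itself.
      out += String.fromCharCode(byte);
    } else {
      const codePoint = index[byte - 0x80];
      // A gap in the index file is a decode error; in the non-fatal
      // (replacement) mode this becomes U+FFFD.
      out += codePoint === null || codePoint === undefined
        ? "\uFFFD"
        : String.fromCodePoint(codePoint);
    }
  }
  return out;
}
```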
My tests and results have been updated to check what happens when there is no line for a pointer in the index file. According to the single-byte decoding algorithm, this should produce U+FFFD. See the updated results at http://www.w3.org/International/tests/repository/encoding/indexes/results-aliases

Where the pass is only partial, I have tried to indicate in the summary how many errors were due to U+FFFD not being served, vs. how many were due to unexpected characters being served that are not in the tables. For details, open the test in the relevant browser (by clicking on the link to the left of the row). See for example http://www.w3.org/International/tests/repository/encoding/indexes/results-aliases#iso-8859-6

The main differences are for windows-1253 and windows-874 in Chrome/Safari/Opera, but six more IE boxes also turned orange.
Per http://lists.w3.org/Archives/Public/www-international/2014JulSep/0286.html I believe no changes are required. However, it would be great to get WebKit/Chromium engineers to comment on windows-1253 and windows-874.
Also, perhaps Travis can give input from Microsoft's side since they deviate the most for single-byte encodings? (Even if this bug is closed once you get to it, please do comment, we can always revisit given better data.)
WebKit just uses ICU, so I would advise contacting the ICU project if there is a desire to have a custom variation of these encodings for the Web. ICU is the place where encodings live :)
I already commented in the mail thread. If we tested encoding as well, there would be a lot more discrepancy between ICU and the Encoding spec. For instance, ICU's single-byte tables for windows-12xx and windows-874 map the full-width ASCII block (U+FFxx) to the corresponding positions in [0x20, 0x7E].

Anyway, there's an ICU bug to add tables matching the Encoding spec; see http://www.icu-project.org/trac/ticket/10303

http://www.icu-project.org/trac/ticket/11231 deals with a windows-874-specific issue: the encode-only mapping of box-drawing and a number of other characters to [0x80, 0xFF], positions that are used for Thai characters.
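To make the contrast concrete, here is a sketch of the spec's single-byte encoder, using the same illustrative `index` array as in the decoder sketch above (not ICU's actual data): the fullwidth block has no pointers in the spec indexes, so encoding it is an error, whereas ICU's encode-only mapping folds it into ASCII.

```ts
function encodeSingleByte(input: string, index: Array<number | null>): Uint8Array {
  const out: number[] = [];
  for (const ch of input) {
    const codePoint = ch.codePointAt(0)!;
    if (codePoint < 0x80) {
      // The ASCII range encodes to itself.
      out.push(codePoint);
      continue;
    }
    const pointer = index.indexOf(codePoint);
    if (pointer === -1) {
      // No entry in the index: per spec this is an encode error. For
      // example U+FF21 (FULLWIDTH LATIN CAPITAL LETTER A) has no pointer,
      // so a conformant encoder errors (HTML forms would emit &#65313;),
      // while ICU's encode-only mapping would have produced 0x41.
      throw new RangeError(`cannot encode U+${codePoint.toString(16).toUpperCase()}`);
    }
    out.push(pointer + 0x80);
  }
  return new Uint8Array(out);
}
```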
(In reply to Jungshik Shin from comment #9)
> If we test encoding as well, there'd be a lot more discrepancy between ICU
> and the encoding spec. For instance, ICU's single-byte tables for
> windows-12xx and windows-874 map the full-width ASCII block (U+FFxx) to the
> corresponding position in [0x20 - 0x7E].

I think it would not be a bad thing if the Encoding spec explicitly forbade mapping any code point in the ASCII block to any code point outside it, in either direction. My gut reaction is that doing that with web content is a security bug waiting to happen.
Jungshik, it's unclear to me whether those ICU encoder extensions are actually interoperable. E.g. I know for a fact that Opera before it switched to Chromium did not have them. I believe I tested other browsers as well, but I can't find my test right now.

Simon, hopefully that falls out of the respective algorithms; there is no need to make a redundant requirement. Also note that it can never hold for utf-16le/utf-16be, or for the replacement encoding's decoder.
Anne, I'm not arguing for changing the Encoding spec so that U+FF01-U+FF5E is converted to 0x21-0x7E in windows-12xx (an encode-only mapping). I just added the observation that there are discrepancies beyond those found in the test results mentioned in comment 0, because that test suite only tested decoding. I'm not sure about the security implications of encoding full-width ASCII to the ASCII range, though. Anyway, perhaps Blink will get rid of that encode-only mapping of U+FF01-U+FF5E in windows-12xx; then Blink would be aligned with the spec.
Reassigning to Richard so he can make sure the test suite covers encoders as well.
FYI, a Chromium bug was filed to get rid of the encode-only mapping (of U+FF01-U+FF5E to 0x21-0x7E) as well as the discrepancies in windows-874 and windows-1253: http://crbug.com/412053
Richard, any progress on this? https://github.com/w3c/web-platform-tests/pull/1367 demonstrates how you can test an encoder from JavaScript. It shouldn't be that hard to extrapolate something for more extensive testing, although it's a bit cumbersome since the quirks of URL parsing have an impact as well.
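For anyone picking this up, the trick in that pull request boils down to something like the following sketch. It assumes it runs in a test page that is itself encoded as windows-1253 (e.g. via <meta charset="windows-1253">); the query of an <a> href is percent-encoded with the document's encoding, so reading it back exposes the encoder's output bytes.

```ts
const a = document.createElement("a");
a.href = "https://example.invalid/?" + "\u0391"; // U+0391 GREEK CAPITAL LETTER ALPHA
// In a windows-1253 document a spec-conformant encoder maps U+0391 to
// byte 0xC1, so the parsed query comes back as "?%C1".
console.log(a.search);
// Characters the encoder cannot map come back as a percent-encoded
// numeric character reference, e.g. "%26%2365313%3B" for U+FF21.
```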
I submitted single-byte decoder tests to web-platform-tests: https://github.com/w3c/web-platform-tests/pull/1384

Apart from the document.characterSet API, Chrome and Firefox pass all tests. For document.characterSet there are some casing differences. I'm hoping we can make it consistent with TextEncoder.prototype.encoding, but if not I'm happy to get a new bug report.

I have not yet written encoder tests. (The interaction with either <form> or URL makes those trickier.) Given that comment 0 seems addressed, I'm going to close this.
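As an aside, the casing issue is easy to check from the console (a sketch, assuming TextDecoder is available; TextDecoder is used here rather than TextEncoder because its constructor accepts legacy labels and normalizes them to the canonical name):

```ts
// document.characterSet should match the canonical (lowercase) name
// that TextDecoder reports for the same label.
const canonical = new TextDecoder(document.characterSet).encoding;
console.log(document.characterSet === canonical);
```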