This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.
THE KS X 1001 & UHC INDEX FILE

As a result of our discussion on this, I have come to realise that my concern can be addressed simply by listing the characters in a different order in the index file: KS X 1001 from 0 to 94 × 94 − 1 (following the Japanese indexes) and the additional UHC hangul from 94 × 94 onwards.

Steps 1 and 2 of the ISO-2022-KR algorithm could then be simplified as follows:

If iso-2022-kr lead is in the range 0x21 to 0x7E and byte is in the range 0x21 to 0x7E, set code point to the euc-kr code point for 94 × (iso-2022-kr lead − 0x21) + byte − 0x21.

The relevant steps of the EUC-KR algorithm might be expressed along the following lines:

If lead and byte are both in the range 0xA1 to 0xFE, set index to 94 × (lead − 0xA1) + (byte − 0xA1).

Otherwise, if lead is in the range 0x81 to 0xC6, set x depending on byte:

  0x41 to 0x5A: byte − 0x41
  0x61 to 0x7A: 26 + byte − 0x61
  0x81 to 0xFE: 2 × 26 + byte − 0x81

If x was set:

  If lead < 0xA1, set index to 94 × 94 + 178 × (lead − 0x81) + x.
  Otherwise, set index to 94 × 94 + 32 × 178 + 84 × (lead − 0xA1) + x.

Some more technical arguments for this change:

The ISO-2022-KR algorithm is simpler.

A Johab algorithm would also be able to use this index more easily.

The complexity of the EUC-KR algorithm does not change much. In practice, it will probably be slightly more efficient, since the entire "Otherwise..." part of the algorithm above will only be needed occasionally: UHC hangul occur much less frequently in Korean text than the ones included in KS X 1001. (In comparison, the common case in the current algorithm is the most deeply nested one, viz. 5.1, range 0x81 to 0xFE.)

More importantly, ISO-2022-KR encoders (and strict EUC-KR ones, if you want to cover that) need to know whether a given Unicode hangul character is part of KS X 1001 or not, since UHC hangul will have to be encoded as 8-byte sequences (4 characters: compose character, initial jamo, middle jamo, final jamo) or not at all. For that test, something like "if index(u) < 94 × 94" seems hard to beat.
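To make the arithmetic above concrete, here is a minimal Python sketch of the proposed pointer computations (the function names are mine, and only the index arithmetic from the prose above is shown, not the full decoder state machine):

```python
# Sketch of the proposed arrangement: KS X 1001 rows first
# (indexes 0 to 94*94 - 1), the additional UHC hangul after.

def iso_2022_kr_index(lead, byte):
    """Steps 1-2 of the simplified ISO-2022-KR algorithm."""
    if 0x21 <= lead <= 0x7E and 0x21 <= byte <= 0x7E:
        return 94 * (lead - 0x21) + (byte - 0x21)
    return None

def euc_kr_index(lead, byte):
    """EUC-KR index computation under the proposed arrangement."""
    # Common case: both bytes in the KS X 1001 range.
    if 0xA1 <= lead <= 0xFE and 0xA1 <= byte <= 0xFE:
        return 94 * (lead - 0xA1) + (byte - 0xA1)
    # UHC extension rows: lead 0x81-0xC6.
    if 0x81 <= lead <= 0xC6:
        if 0x41 <= byte <= 0x5A:
            x = byte - 0x41
        elif 0x61 <= byte <= 0x7A:
            x = 26 + byte - 0x61
        elif 0x81 <= byte <= 0xFE:
            x = 2 * 26 + byte - 0x81
        else:
            return None
        if lead < 0xA1:
            # Rows 0x81-0xA0 have 178 extension entries each.
            return 94 * 94 + 178 * (lead - 0x81) + x
        # Rows 0xA1-0xC6 have 84 extension entries each.
        return 94 * 94 + 32 * 178 + 84 * (lead - 0xA1) + x
    return None
```

Note that C6-0x52, the last additional UHC hangul, lands on index 17657, i.e. at the very end of the 17658-entry table.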
This would also remove the annoying hole in the index (12,648–12,741) without complicating the algorithms, and reprocessing of (unused) bytes following C6 (cf. <https://www.w3.org/Bugs/Public/show_bug.cgi?id=16690>) would no longer be a special case.
Bug 20599 suggests removing the iso-2022-kr encoder (and possibly decoder). It might still be worthwhile to pursue this however. Is there an easy way to generate this improved index from the index we have now?
The following minimally tested Perl code shows a straightforward (but not particularly elegant) way of rearranging the existing index.

while (<>) {
    if (/^ *([0-9]+)(\t.*)/) { $x[$1] = $2 } else { print }
}
for $i (1..38) {
    for $j (1..94) {
        $n = (32+$i-1)*(84+94) + 84+$j-1;
        if ($x[$n]) { printf "%5i%s\n", 94*($i-1) + $j-1, $x[$n] }
    }
}
for $i (39..94) {
    for $j (1..94) {
        $n = (32+38)*(84+94) + ($i-39)*94 + $j-1;
        if ($x[$n]) { printf "%5i%s\n", 94*($i-1) + $j-1, $x[$n] }
    }
}
for $i (1..32) {
    for $j (1..84+94) {
        $n = ($i-1)*(84+94) + $j-1;
        if ($x[$n]) { printf "%5i%s\n", 94*94 + $n, $x[$n] }
    }
}
for $i (1..38) {
    for $j (1..84) {
        $n = (32+$i-1)*(84+94) + $j-1;
        if ($x[$n]) { printf "%5i%s\n", 94*94 + 32*(84+94) + ($i-1)*84 + $j-1, $x[$n] }
    }
}
More concise algorithm with fewer loops:

use integer;
while (<>) {
    if (/^ *([0-9]+)(\t.*)/) {
        if ($1 < 32*(84+94)) {
            $n = 94*94 + $1
        } elsif ($1 < (32+38)*(84+94)) {
            if ($1%(84+94) < 84) {
                $n = 94*94 + 32*(84+94) + 84*($1/(84+94)-32) + $1%(84+94)
            } else {
                $n = 94*($1/(84+94)-32) + $1%(84+94)-84
            }
        } else {
            $n = 38*94 + $1-(32+38)*(84+94)
        }
        $x{$n} = sprintf "%5i%s\n", $n, $2
    } else {
        print
    }
}
for $i (sort {$a <=> $b} keys %x) { print $x{$i} }
More enumeration, less maths:

while (<>) {
    if (/^ *([0-9]+)(\t.*)/) { $x[$1] = $2 } else { print }
}
$n = 0; $m = 94*94;
for $i (0..$#x) {
    if ($i < 32*(84+94) || $i < (32+38)*(84+94) && $i%(84+94) < 84) {
        $y[$m++] = $x[$i]
    } else {
        $y[$n++] = $x[$i]
    }
}
for $i (0..$#y) { printf "%5i%s\n", $i, $y[$i] if $y[$i] }
Sorry, cannot read Perl :-) And I guess that's for the .txt file, not https://github.com/whatwg/encoding/blob/master/indexes.json right? I guess I can take a stab at rewriting this. I hope it's worth it.
The interesting part of a Python implementation that reads indexes.json is not all that different:

import json
data = json.loads(open("indexes.json", "r").read())
n = 0; m = 94*94; y = [None]*17658
for i, cp in enumerate(data["euc-kr"]):
    if i < 32*(84+94) or i < (32+38)*(84+94) and i%(84+94) < 84:
        if cp != None:
            y[m] = cp
        m += 1
    else:
        y[n] = cp
        n += 1
print ",".join("null" if s == None else str(s) for s in y)
Now that we no longer have iso-2022-kr, could we have an even more efficient table? I guess that's worth investigating.
Created attachment 1391 [details] Rearranged EUC-KR index
Making consistent error handling less of a special case (cf. bug 16690) seems difficult unless code point C6-0x52 (the last additional UHC hangul) appears at the end of the table. This can be achieved as proposed previously (at the expense of complicating the EUC-KR algorithm slightly), or by rearranging the index as follows:

1) 126 × 94 entries corresponding to second byte 0xA1 to 0xFE;
2) the remaining entries, corresponding to second byte 0x41 to 0xA0.

The EUC-KR decoding algorithm can then be simplified slightly:

If byte is in the range 0xA1 to 0xFE, set pointer to (lead − 0x81) × 94 + (byte − 0xA1). Otherwise, let temp be 126 × 94 + (lead − 0x81) × 84, and then set pointer to the result of the equation below, depending on byte:

  0x41 to 0x5A: temp + byte − 0x41
  0x61 to 0x7A: temp + 26 + byte − 0x61
  0x81 to 0xFE: temp + 26 + 26 + byte − 0x81

(Incidentally, such an index would simplify the ISO-2022 algorithms as well.)
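In Python, the pointer arithmetic for this alternative arrangement might look as follows (a sketch only; the function name is mine, and the lead-byte range check 0x81 to 0xFE is assumed to have been done already by the caller):

```python
def euc_kr_pointer_alt(lead, byte):
    """Pointer computation for the alternative arrangement:
    the first 126 * 94 entries cover trail bytes 0xA1-0xFE,
    the remaining entries cover trail bytes 0x41-0xA0."""
    if 0xA1 <= byte <= 0xFE:
        return (lead - 0x81) * 94 + (byte - 0xA1)
    # Trail byte below 0xA1: 84 extra entries per row.
    temp = 126 * 94 + (lead - 0x81) * 84
    if 0x41 <= byte <= 0x5A:
        return temp + byte - 0x41
    if 0x61 <= byte <= 0x7A:
        return temp + 26 + byte - 0x61
    if 0x81 <= byte <= 0xFE:
        return temp + 26 + 26 + byte - 0x81
    return None
```

Note that the third byte range is only ever reached with bytes up to 0xA0 (bytes 0xA1 to 0xFE are caught by the first test), and that C6-0x52 again lands on pointer 17657, the end of the table, as the comment above requires.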
Would it not be even easier to base the index on windows-949? Use the range 0x81 to 0xFE for the lead byte and the range 0x41 to 0xFE for the trail byte, and then unconsume if there is no code point and the trail byte is less than 0x80, just like we do for big5.
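Under that scheme each lead byte gets a uniform row of 0xFE − 0x41 + 1 = 190 trail bytes, so the pointer computation collapses to a single formula. A sketch (function name mine; the big5-style unconsume step is handled by the caller, not shown here):

```python
def euc_kr_pointer_949(lead, byte):
    """windows-949-style pointer: lead 0x81-0xFE, trail 0x41-0xFE,
    190 trail bytes per row. Returns None for out-of-range bytes;
    a None result with an ASCII trail byte would trigger the
    big5-style unconsume in the decoder proper."""
    if 0x81 <= lead <= 0xFE and 0x41 <= byte <= 0xFE:
        return (lead - 0x81) * 190 + (byte - 0x41)
    return None
```

The resulting table has 126 × 190 = 23940 slots, with holes where windows-949 has no mapping, which trades some table size for the simplest possible arithmetic.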
*** Bug 16690 has been marked as a duplicate of this bug. ***
Jungshik, could you (or anyone reading this) perhaps give input on comment 11? It would be nice to get the remaining Encoding bugs fixed.
To me/Blink, it does not matter much how the index file for EUC-KR is arranged, because we won't use the index file directly (we use it to generate an ICU mapping file). However, I found an important incompatibility between browsers using ICU (Chrome, Opera, Safari) on the one hand and Firefox on the other when it comes to handling invalid/unassigned code points in legacy encodings. When coming across '\xF0\x61' in EUC-KR/CP949, ICU emits a single U+FFFD for the two-byte sequence. Firefox emits U+FFFD followed by U+0061. And that is what the current encoding spec requires for Big5 (I found it the other day while making the ICU mapping table for Big5 per the encoding spec). We need to reconcile this discrepancy. I'll file a separate bug.
Jungshik, the reason that happens is that otherwise there's an XSS risk. You can inject a lead byte to make sure a byte in the 0x00-0x7F range does not get seen, and bytes in that range are often important delimiters. See bug 19961 for more details and for getting these kinds of security considerations into the specification.
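The risk can be illustrated with a toy decoder (entirely hypothetical: it treats every byte ≥ 0x80 as an unmapped lead, where a real decoder would consult the index; only the consume/unconsume behaviour is modelled):

```python
def toy_decode(data, reconsume_ascii):
    """Toy DBCS error handling: every byte >= 0x80 starts an (always
    invalid) two-byte sequence. With reconsume_ascii=False the whole
    pair is swallowed, ICU-style; with True an ASCII trail byte is
    pushed back and decoded on its own, as the spec requires for big5."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            out.append(chr(b))
            i += 1
            continue
        trail = data[i + 1] if i + 1 < len(data) else None
        out.append("\ufffd")  # the invalid sequence becomes U+FFFD
        if trail is not None and trail < 0x80 and reconsume_ascii:
            i += 1  # unconsume: only the lead byte is eaten
        else:
            i += 2  # both bytes are eaten

    return "".join(out)

# An injected lead byte right before a quote character:
payload = b'\xf0" onload=x'
swallowed = toy_decode(payload, reconsume_ascii=False)
kept = toy_decode(payload, reconsume_ascii=True)
```

With reconsume_ascii=False the closing quote disappears from the output, so markup that follows can escape its attribute context; with True the quote survives.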
https://github.com/whatwg/encoding/commit/4b20cf61260ed00357663755886d9f7617d60b35