This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.
THE KS X 1001 & UHC INDEX FILE

As a result of our discussion on this, I have come to realise that my concern can be addressed simply by listing the characters in a different order in the index file: KS X 1001 from 0 to 94 × 94 − 1 (following the Japanese indexes) and the additional UHC hangul from 94 × 94 onwards.

Steps 1 and 2 of the ISO-2022-KR algorithm could then be simplified as follows:

If iso-2022-kr lead is in the range 0x21 to 0x7E and byte is in the range 0x21 to 0x7E, set code point to the euc-kr code point for 94 × (iso-2022-kr lead − 0x21) + byte − 0x21.

The relevant steps of the EUC-KR algorithm might be expressed along the following lines:

If lead and byte are both in the range 0xA1 to 0xFE, set index to 94 × (lead − 0xA1) + (byte − 0xA1).

Otherwise, if lead is in the range 0x81 to 0xC6, set x depending on byte:

  0x41 to 0x5A: byte − 0x41
  0x61 to 0x7A: 26 + byte − 0x61
  0x81 to 0xFE: 2 × 26 + byte − 0x81

If x was set:

  If lead < 0xA1, set index to 94 × 94 + 178 × (lead − 0x81) + x.
  Otherwise, set index to 94 × 94 + 32 × 178 + 84 × (lead − 0xA1) + x.

Some more technical arguments for this change:

The ISO-2022-KR algorithm is simpler.

A Johab algorithm would also be able to use this index more easily.

The complexity of the EUC-KR algorithm does not change much. In practice, it will probably be slightly more efficient, since the entire "Otherwise..." part of the algorithm above will only be needed occasionally: UHC hangul occur much less frequently in Korean text than the ones included in KS X 1001. (In comparison, the common case in the current algorithm is the most deeply nested one, viz. 5.1, range 0x81 to 0xFE.)

More importantly, ISO-2022-KR encoders (and strict EUC-KR ones, if you want to cover that) need to know whether a given Unicode hangul character is part of KS X 1001 or not, since UHC hangul will have to be encoded as 8-byte sequences (4 characters: compose character, initial jamo, middle jamo, final jamo) or not at all. For that test, something like "if index(u) < 94 × 94" seems hard to beat.
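To make the arithmetic above concrete, here is a minimal Python sketch of the proposed pointer computations (the function names are mine, and only the index arithmetic from the prose above is shown, not the full decoder state machine):

```python
# Sketch of the proposed arrangement: KS X 1001 rows first
# (indexes 0 to 94*94 - 1), the additional UHC hangul after.

def iso_2022_kr_index(lead, byte):
    """Steps 1-2 of the simplified ISO-2022-KR algorithm."""
    if 0x21 <= lead <= 0x7E and 0x21 <= byte <= 0x7E:
        return 94 * (lead - 0x21) + (byte - 0x21)
    return None

def euc_kr_index(lead, byte):
    """EUC-KR index computation under the proposed arrangement."""
    # Common case: both bytes in the KS X 1001 range.
    if 0xA1 <= lead <= 0xFE and 0xA1 <= byte <= 0xFE:
        return 94 * (lead - 0xA1) + (byte - 0xA1)
    # UHC extension rows: lead 0x81-0xC6.
    if 0x81 <= lead <= 0xC6:
        if 0x41 <= byte <= 0x5A:
            x = byte - 0x41
        elif 0x61 <= byte <= 0x7A:
            x = 26 + byte - 0x61
        elif 0x81 <= byte <= 0xFE:
            x = 2 * 26 + byte - 0x81
        else:
            return None
        if lead < 0xA1:
            # Rows 0x81-0xA0 have 178 extension entries each.
            return 94 * 94 + 178 * (lead - 0x81) + x
        # Rows 0xA1-0xC6 have 84 extension entries each.
        return 94 * 94 + 32 * 178 + 84 * (lead - 0xA1) + x
    return None
```

Note that C6-0x52, the last additional UHC hangul, lands on index 17657, i.e. at the very end of the 17658-entry table.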
This would also remove the annoying hole in the index (12,648–12,741) without complicating the algorithms, and reprocessing of (unused) bytes following C6 (cf. <https://www.w3.org/Bugs/Public/show_bug.cgi?id=16690>) would no longer be a special case.
Bug 20599 suggests removing the iso-2022-kr encoder (and possibly decoder). It might still be worthwhile to pursue this however. Is there an easy way to generate this improved index from the index we have now?
The following minimally tested Perl code shows a straightforward (but not particularly elegant) way of rearranging the existing index.

while (<>) {
    if (/^ *([0-9]+)(\t.*)/) { $x[$1] = $2 } else { print }
}
for $i (1..38) {
    for $j (1..94) {
        $n = (32+$i-1)*(84+94) + 84+$j-1;
        if ($x[$n]) { printf "%5i%s\n", 94*($i-1) + $j-1, $x[$n] }
    }
}
for $i (39..94) {
    for $j (1..94) {
        $n = (32+38)*(84+94) + ($i-39)*94 + $j-1;
        if ($x[$n]) { printf "%5i%s\n", 94*($i-1) + $j-1, $x[$n] }
    }
}
for $i (1..32) {
    for $j (1..84+94) {
        $n = ($i-1)*(84+94) + $j-1;
        if ($x[$n]) { printf "%5i%s\n", 94*94 + $n, $x[$n] }
    }
}
for $i (1..38) {
    for $j (1..84) {
        $n = (32+$i-1)*(84+94) + $j-1;
        if ($x[$n]) { printf "%5i%s\n", 94*94 + 32*(84+94) + ($i-1)*84 + $j-1, $x[$n] }
    }
}
More concise algorithm with fewer loops:

use integer;
while (<>) {
    if (/^ *([0-9]+)(\t.*)/) {
        if ($1 < 32*(84+94)) {
            $n = 94*94 + $1
        } elsif ($1 < (32+38)*(84+94)) {
            if ($1%(84+94) < 84) {
                $n = 94*94 + 32*(84+94) + 84*($1/(84+94)-32) + $1%(84+94)
            } else {
                $n = 94*($1/(84+94)-32) + $1%(84+94)-84
            }
        } else {
            $n = 38*94 + $1-(32+38)*(84+94)
        }
        $x{$n} = sprintf "%5i%s\n", $n, $2
    } else {
        print
    }
}
for $i (sort {$a <=> $b} keys %x) { print $x{$i} }
More enumeration, less maths:

while (<>) {
    if (/^ *([0-9]+)(\t.*)/) { $x[$1] = $2 } else { print }
}
$n = 0; $m = 94*94;
for $i (0..$#x) {
    if ($i < 32*(84+94) || $i < (32+38)*(84+94) && $i%(84+94) < 84) {
        $y[$m++] = $x[$i]
    } else {
        $y[$n++] = $x[$i]
    }
}
for $i (0..$#y) { printf "%5i%s\n", $i, $y[$i] if $y[$i] }
Sorry, cannot read Perl :-) And I guess that's for the .txt file, not https://github.com/whatwg/encoding/blob/master/indexes.json right? I guess I can take a stab at rewriting this. I hope it's worth it.
The interesting part of a Python implementation that reads indexes.json is not all that different:

import json
data = json.loads(open("indexes.json", "r").read())
n = 0; m = 94*94; y = [None]*17658
for i, cp in enumerate(data["euc-kr"]):
    if i < 32*(84+94) or i < (32+38)*(84+94) and i%(84+94) < 84:
        if cp != None:
            y[m] = cp
        m += 1
    else:
        y[n] = cp
        n += 1
print ",".join("null" if s == None else str(s) for s in y)
Now that we no longer have iso-2022-kr, could we have an even more efficient table? I guess that's worth investigating.
Created attachment 1391 [details] Rearranged EUC-KR index
Making consistent error handling less of a special case (cf. bug 16690) seems difficult unless code point C6-0x52 (the last additional UHC hangul) appears at the end of the table. This can be achieved as proposed previously (at the expense of complicating the EUC-KR algorithm slightly), or by rearranging the index as follows:

1) 126 × 94 entries corresponding to second byte 0xA1 to 0xFE;
2) the remaining entries, corresponding to second byte 0x41 to 0xA0.

The EUC-KR decoding algorithm can then be simplified slightly:

If byte is in the range 0xA1 to 0xFE, set pointer to (lead − 0x81) × 94 + (byte − 0xA1). Otherwise, let temp be 126 × 94 + (lead − 0x81) × 84, and then set pointer to the result of the equation below, depending on byte:

  0x41 to 0x5A: temp + byte − 0x41
  0x61 to 0x7A: temp + 26 + byte − 0x61
  0x81 to 0xFE: temp + 26 + 26 + byte − 0x81

(Incidentally, such an index would simplify the ISO-2022 algorithms as well.)
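In Python, the pointer arithmetic for this alternative arrangement might look as follows (a sketch only; the function name is mine, and the lead-byte range check 0x81 to 0xFE is assumed to have been done already by the caller):

```python
def euc_kr_pointer_alt(lead, byte):
    """Pointer computation for the alternative arrangement:
    the first 126 * 94 entries cover trail bytes 0xA1-0xFE,
    the remaining entries cover trail bytes 0x41-0xA0."""
    if 0xA1 <= byte <= 0xFE:
        return (lead - 0x81) * 94 + (byte - 0xA1)
    # Trail byte below 0xA1: 84 extra entries per row.
    temp = 126 * 94 + (lead - 0x81) * 84
    if 0x41 <= byte <= 0x5A:
        return temp + byte - 0x41
    if 0x61 <= byte <= 0x7A:
        return temp + 26 + byte - 0x61
    if 0x81 <= byte <= 0xFE:
        return temp + 26 + 26 + byte - 0x81
    return None
```

Note that the third byte range is only ever reached with bytes up to 0xA0 (bytes 0xA1 to 0xFE are caught by the first test), and that C6-0x52 again lands on pointer 17657, the end of the table, as the comment above requires.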
Would it not be even easier to base the index on windows-949? Use the range 0x81 to 0xFE for the lead byte and the range 0x41 to 0xFE for the trail byte, and then unconsume if there is no code point and the trail byte is less than 0x80, just like we do for big5.
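Under that scheme each lead byte gets a uniform row of 0xFE − 0x41 + 1 = 190 trail bytes, so the pointer computation collapses to a single formula. A sketch (function name mine; the big5-style unconsume step is handled by the caller, not shown here):

```python
def euc_kr_pointer_949(lead, byte):
    """windows-949-style pointer: lead 0x81-0xFE, trail 0x41-0xFE,
    190 trail bytes per row. Returns None for out-of-range bytes;
    a None result with an ASCII trail byte would trigger the
    big5-style unconsume in the decoder proper."""
    if 0x81 <= lead <= 0xFE and 0x41 <= byte <= 0xFE:
        return (lead - 0x81) * 190 + (byte - 0x41)
    return None
```

The resulting table has 126 × 190 = 23940 slots, with holes where windows-949 has no mapping, which trades some table size for the simplest possible arithmetic.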
*** Bug 16690 has been marked as a duplicate of this bug. ***
Jungshik, could you (or anyone reading this) perhaps give input on comment 11? It would be nice to get the remaining Encoding bugs fixed.
To me/Blink, it does not matter much how the index file for EUC-KR is arranged, because we won't use the index file directly (we use it to generate an ICU mapping file). However, I found an important incompatibility between browsers using ICU (Chrome, Opera, Safari) on the one hand and Firefox on the other when it comes to handling invalid/unassigned code points in legacy encodings. When coming across '\xF0\x61' in EUC-KR/CP949, ICU emits a single U+FFFD for the two-byte sequence. Firefox emits U+FFFD followed by U+0061. And that is what the current encoding spec requires for Big5 (I found it the other day while making the ICU mapping table for Big5 per the encoding spec). We need to reconcile this discrepancy. I'll file a separate bug.
Jungshik, the reason that happens is that otherwise there's an XSS risk. You can inject a lead byte to make sure a byte in the 0x00-0x7F range does not get seen, and bytes in that range are often important delimiters. See bug 19961 for more details and for getting these kinds of security considerations into the specification.
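The risk can be illustrated with a toy decoder (entirely hypothetical: it treats every byte ≥ 0x80 as an unmapped lead, where a real decoder would consult the index; only the consume/unconsume behaviour is modelled):

```python
def toy_decode(data, reconsume_ascii):
    """Toy DBCS error handling: every byte >= 0x80 starts an (always
    invalid) two-byte sequence. With reconsume_ascii=False the whole
    pair is swallowed, ICU-style; with True an ASCII trail byte is
    pushed back and decoded on its own, as the spec requires for big5."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            out.append(chr(b))
            i += 1
            continue
        trail = data[i + 1] if i + 1 < len(data) else None
        out.append("\ufffd")  # the invalid sequence becomes U+FFFD
        if trail is not None and trail < 0x80 and reconsume_ascii:
            i += 1  # unconsume: only the lead byte is eaten
        else:
            i += 2  # both bytes are eaten

    return "".join(out)

# An injected lead byte right before a quote character:
payload = b'\xf0" onload=x'
swallowed = toy_decode(payload, reconsume_ascii=False)
kept = toy_decode(payload, reconsume_ascii=True)
```

With reconsume_ascii=False the closing quote disappears from the output, so markup that follows can escape its attribute context; with True the quote survives.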
https://github.com/whatwg/encoding/commit/4b20cf61260ed00357663755886d9f7617d60b35