This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Bug 21556 - iso-2022-jp: How to handle cases when no character comes between shift sequences
Status: RESOLVED DUPLICATE of bug 27256
Alias: None
Product: WHATWG
Classification: Unclassified
Component: Encoding
Version: unspecified
Hardware: PC All
Importance: P2 normal
Target Milestone: Unsorted
Assignee: Anne
QA Contact: sideshowbarker+encodingspec
Reported: 2013-04-02 20:55 UTC by Peter Occil
Modified: 2014-11-06 10:44 UTC

Description Peter Occil 2013-04-02 20:55:42 UTC
Section 3.6.2 of Unicode Technical Report 36 says that conversion must use replacements or cause an error, even for "unrecognized or 'empty' state-change sequences".  But this does not happen in the current encoding algorithms.

For example, in the "hz-gb-2312" algorithm:

0x7E 0x7B 0x7E 0x7D 0x20 results in U+0020, rather than a decoder error followed by U+0020 (since I presume that the empty shift sequence is illegal).

Similarly, 0x7E 0x7D 0x7E 0x7B causes no decoder error for being an empty shift sequence.

In the "iso-2022-jp" algorithm:

0x1B 0x24 0x40 0x1B 0x28 0x42 0x20 (and other sequences like it) results in U+0020, rather than a decoder error followed by U+0020 (since I presume that the empty shift sequence is illegal).

In the "iso-2022-kr" algorithm:

The byte sequence 0x0E 0x0E 0x0E ... results in no characters, rather than one or more decoder errors (at least for reaching the end of the stream with no characters).

The byte sequence 0x0F 0x0F 0x0F ... results in no characters, rather than one or more decoder errors (at least for reaching the end of the stream with no characters).

0x0E 0x0F 0x20 results in U+0020, rather than a decoder error followed by U+0020.

All the cases above involve empty shift sequences that are not currently treated as decoder errors.
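
As a rough illustration of what "no characters between shift sequences" means at the byte level, the following Python sketch flags any shift sequence that starts directly after another one. The SHIFT_SEQUENCES table and the find_empty_shifts helper are hypothetical names introduced here, not part of any specification. The table holds only the sequences quoted in this report, not the full set for any encoding, and the scan does not track double-byte state, so it is meant only for short inputs like the ones above.

```python
from typing import List

# Shift/escape sequences taken from the byte examples in this report.
# Assumption: this is NOT a complete list for any of the three encodings.
SHIFT_SEQUENCES = {
    "hz-gb-2312": [b"~{", b"~}"],
    "iso-2022-jp": [b"\x1b$@", b"\x1b(B"],
    "iso-2022-kr": [b"\x0e", b"\x0f"],
}


def find_empty_shifts(data: bytes, encoding: str) -> List[int]:
    """Return offsets of shift sequences that directly follow another one."""
    shifts = SHIFT_SEQUENCES[encoding]
    offsets = []
    pos = 0
    previous_was_shift = False
    while pos < len(data):
        # Look for a known shift sequence starting at the current offset.
        match = next((s for s in shifts if data.startswith(s, pos)), None)
        if match is None:
            # Any other byte counts as content between shifts.
            previous_was_shift = False
            pos += 1
        else:
            if previous_was_shift:
                offsets.append(pos)
            previous_was_shift = True
            pos += len(match)
    return offsets


jp_sample = bytes([0x1B, 0x24, 0x40, 0x1B, 0x28, 0x42, 0x20])
print(find_empty_shifts(jp_sample, "iso-2022-jp"))  # prints [3]
```

A lone shift sequence at the very end of the input is not flagged by this sketch; only a shift that immediately follows another shift is.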

Should the encoding algorithms be changed to emit a decoder error if there are no characters in between shift sequences in "iso-2022-jp" and "iso-2022-kr"?  Or are the algorithms like this for compatibility? Another issue is how to deal with unrecognized ISO 2022 escape sequences; I feel that the current encoding algorithms don't deal with that well enough.
Comment 1 Anne 2013-04-03 14:59:00 UTC
* We do not want to change algorithms except where that leads to further convergence.

* Convergence with Unicode Technical Reports is a non-goal. Unicode Technical Reports can be updated if the reality is different.

* All sequences in iso-2022-* are handled as far as I can tell. What's the problem?
Comment 2 Peter Occil 2013-04-03 16:06:25 UTC
I will test these sequences with different browsers and report back.
Comment 3 Peter Occil 2013-04-03 18:49:55 UTC
I've made a test page at this address:

http://upokecenter.com/projects/iso2022.htm

and tested it with Safari 5.1.7, Internet Explorer 10, Opera 12, Google Chrome 26, and Firefox 19.  The results included the following:

- Safari and Chrome showed the same results, as expected since both use the same browser engine, WebKit.
- Firefox and WebKit emit U+FFFD when they reach a shift sequence immediately after another shift sequence, but IE and Opera do not.
- No browser showed a decoder error if a shift sequence occurs at the very end of the string, so this case should probably be ignored.
- Browsers differed in how they handle unrecognized escape sequences.  In ASCII mode, Opera and WebKit emit U+FFFD to replace the first bytes of the sequences, while IE and Firefox emit the 0x1B escape character and the rest of the sequence as ASCII.
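
A comparison like this can also be repeated outside a browser by feeding the same bytes to another decoder and checking whether U+FFFD appears in the output. The sketch below does that with Python's built-in hz, iso2022_jp, and iso2022_kr codecs. These are assumed here to be the closest matches for the three labels; they are separate implementations, not the WHATWG algorithms or any browser's decoder, so the output only shows how one more decoder behaves. The SAMPLES list and the probe_decoders helper are hypothetical names, and the byte sequences are the ones quoted in the description.

```python
# Byte sequences copied from the description of this bug, paired with the
# Python codec name assumed to correspond to each encoding label.
SAMPLES = [
    ("hz", bytes([0x7E, 0x7B, 0x7E, 0x7D, 0x20])),
    ("hz", bytes([0x7E, 0x7D, 0x7E, 0x7B])),
    ("iso2022_jp", bytes([0x1B, 0x24, 0x40, 0x1B, 0x28, 0x42, 0x20])),
    ("iso2022_kr", bytes([0x0E, 0x0E, 0x0E])),
    ("iso2022_kr", bytes([0x0F, 0x0F, 0x0F])),
    ("iso2022_kr", bytes([0x0E, 0x0F, 0x20])),
]


def probe_decoders():
    """Decode each sample and record whether U+FFFD was emitted."""
    results = []
    for codec, raw in SAMPLES:
        # errors="replace" substitutes U+FFFD instead of raising an exception.
        text = raw.decode(codec, errors="replace")
        hex_bytes = " ".join(f"0x{b:02X}" for b in raw)
        results.append((codec, hex_bytes, text, "\ufffd" in text))
    return results


if __name__ == "__main__":
    for codec, hex_bytes, text, replaced in probe_decoders():
        # repr() keeps control characters visible in the printed output.
        print(f"{codec}: {hex_bytes} -> {text!r} (U+FFFD emitted: {replaced})")
```

No expected output is listed because the result depends on the decoder implementation, which is the point of the comparison.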

I will collect all the test results on another page and report them here.
Comment 5 Anne 2014-11-06 10:44:31 UTC

*** This bug has been marked as a duplicate of bug 27256 ***