26693 – Edit for the 'Encoding false statement' thread

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Bug 26693 - Edit for the 'Encoding false statement' thread

Status:	RESOLVED FIXED

Alias:	None

Product:	WHATWG
Classification:	Unclassified
Component:	Encoding
Version:	unspecified
Hardware:	PC All

Importance:	P2 normal
Target Milestone:	Unsorted
Assignee:	Anne
QA Contact:	sideshowbarker+encodingspec

Reported:	2014-08-29 09:08 UTC by Richard Ishida
Modified:	2014-09-04 10:36 UTC
CC List:	6 users

Description Richard Ishida 2014-08-29 09:08:48 UTC

Please change:

"Historically encodings and their specifications (if any) were kept track of by the IANA Character Sets registry. This specification renders that registry obsolete."

to:

"Encodings and their specifications (if any) are also being kept track of by the IANA Character Sets registry. For the purposes of specifications using this specification, that registry is no longer relevant."

The first part of the second sentence (up to the comma) was suggested by Anne during an i18n telecon. The text for the second part is recommended to avoid the controversial word 'obsolete'. The edit to the first sentence was proposed by Martin Dürst.

Comment 1 Anne 2014-09-01 09:51:03 UTC

See http://lists.w3.org/Archives/Public/www-international/2014JulSep/0264.html for a proposal that hopefully addresses further concerns from Asmus Freytag.

Comment 2 Richard Ishida 2014-09-01 16:42:49 UTC

http://lists.w3.org/Archives/Public/www-international/2014JulSep/0271.html provides further fine-tuning suggestions.

Comment 3 Anne 2014-09-01 18:20:12 UTC

https://github.com/whatwg/encoding/commit/d6eaebe4436ada349218ce9b22ccc104e5f8caab

Comment 4 Martin Dürst 2014-09-02 01:17:28 UTC

The first paragraph now reads:

"Unicode is the universal alphabet and utf-8 is its encoding. This specification turns that into a requirement for new protocols and formats, as well as existing formats deployed in new contexts."

The first sentence reads as if it's from a religious document. It would be great if we could avoid this impression. What about something like "The universal alphabet Unicode and its encoding utf-8 are indispensable for interoperability." ?

The "that" in the second sentence is unclear because there are two referents (Unicode and utf-8). What about changing "This specification turns that into a requirement" to "This specification requires utf-8" ?

The last paragraph (of the preface) ends with "and thereby renders the registry irrelevant.". I don't care whether we say "obsolete" or "irrelevant", but we need to qualify the range of that statement. We already had various text proposals for this qualification, but apparently they got dropped. What about "and thereby renders the registry irrelevant for specifications, implementations, and content adhering to it." or some such?

Comment 5 Glenn Adams 2014-09-02 03:07:46 UTC

Referring to Unicode as an "alphabet" is a serious abuse of terminology. Please don't use "alphabet" to characterize Unicode as a whole or even in part.

Comment 6 Martin Dürst 2014-09-02 03:33:42 UTC

(In reply to Glenn Adams from comment #5)
> Referring to Unicode as an "alphabet" is a serious abuse of terminology.
> Please don't use "alphabet" to characterize Unicode as a whole or even in
> part.

Sorry, yes, very good point.

Comment 7 Anne 2014-09-02 08:42:34 UTC

Suggestions welcome, Glenn.

Comment 8 Richard Ishida 2014-09-02 08:46:56 UTC

How about "Unicode is the universal character set"?

I agree with Martin's last point about losing the scope. I suggest "and thereby renders the registry irrelevant for specifications using this specification."

(Martin's "adhering to it" appears to refer to the registry, which I don't think is what Martin meant.)

Comment 9 Richard Ishida 2014-09-02 08:48:30 UTC

Darn, hit the button too soon.

Amend my suggestion to "and thereby renders the registry irrelevant for specifications, applications and content using this specification."

Comment 10 Anne 2014-09-02 13:41:48 UTC

I think I'd rather replace the first paragraph with "For new protocols and formats, as well as existing formats deployed in new contexts, this specification requires the utf-8 encoding."

Comment 11 Richard Ishida 2014-09-02 13:43:21 UTC

Works for me.

Comment 12 Larry Masinter 2014-09-03 11:49:16 UTC

This specification, by itself, does nothing to the IANA charset registry, because the IANA charset registry remains, and remains relevant to those trying to understand how charset labels appear and are intended, e.g., in web-based email and instant messaging.

So it isn't true that all applications using this specification need not know about or look at the IANA charset registry. The IANA charset registry may remain inaccurate until someone fixes it, but this specification unfortunately does nothing to change its accuracy and very little to reduce its relevance.

Until there is a replacement for the IANA charset registry that is suitable for use by other (non-web) applications, and suitable for web-based implementations of those applications, the IANA charset registry remains an important (if flawed) source of information for designers and implementors of those applications.

The more you make false statements about how well this draft does something it doesn't really do (obsolete the IANA charset registry or render it irrelevant), the more you negatively affect the incentive to do what is necessary, which is to actually fix the IANA charset registry. Fixing the IANA charset registry so that it is useful and an acceptable reference for non-web applications may be out of scope or not addressed here, but just say so.

Comment 13 Anne 2014-09-03 11:56:44 UTC

What applications are you talking about? This is intended for all applications.

Comment 14 Richard Ishida 2014-09-03 15:16:37 UTC

This is how I understand it (and without speaking on their behalf, I think it's the view of the i18n WG too):

For those specifications, formats, protocols and applications that use and conform to the Encoding spec, the Encoding spec restricts the range of legacy encodings to a smaller set than that covered in the IANA registry. For the set of encodings covered by the Encoding spec, the Encoding spec provides information which is sufficiently detailed and complete that such conforming specifications, formats, protocols and applications don't need to refer to the IANA registry for more information.

Fixing the IANA registry itself by making changes to it is out of scope for and therefore not addressed by the Encoding spec.

Comment 15 Martin Dürst 2014-09-04 08:06:40 UTC

(In reply to Anne from comment #10)
> I think I'd rather replace the first paragraph with "For new protocols and
> formats, as well as existing formats deployed in new contexts, this
> specification requires the utf-8 encoding."

That's a very fine solution for the first paragraph.

The last paragraph still says "renders the registry irrelevant", which is still inappropriate.

Comment 16 Henri Sivonen 2014-09-04 09:40:30 UTC

Trying my luck painting this bikeshed, I suggest:

"
Unicode is the coded character set for the Web and, of the encodings that can represent all of Unicode, UTF-8 is the most appropriate one for interchange. This specification turns the use of UTF-8 as the interchange encoding into a requirement for new protocols and formats, as well as existing formats deployed in new contexts.

There are other (legacy) encodings and while they have been defined to some extent, implementations have not always implemented them in the same way, have not always used the same labels, and often differ in dealing with undefined and former proprietary areas of encodings. This specification attempts to fill those gaps so that new implementations do not have to reverse engineer encoding implementations of the market leaders and existing implementations can converge.

In particular, this specification defines all the encodings, their algorithms to go from bytes to scalar values and back, and their canonical names and identifying labels. This specification also defines an API to expose part of the encoding algorithms to JavaScript.

Implementations have significantly deviated from the labels listed in the IANA Character Sets registry. Combined with the desire to stop legacy encodings from spreading further, this specification is exhaustive about the aforementioned details and thereby renders the registry obsolete when adhering to this specification. In particular, this specification intentionally does not provide a mechanism for extending the set of encodings or the set of labels.
"

("obsolete when adhering to this specification" is pretty much a circular statement, but I suggest using slightly more weasely words if that allows a more productive focus on what the spec actually defines.)

Comment 17 Anne 2014-09-04 09:42:41 UTC

(In reply to Martin Dürst from comment #15)
> The last paragraph still says "renders the registry irrelevant", which is
> still inappropriate.

I disagree, but I changed it nonetheless.

https://github.com/whatwg/encoding/commit/45b854467f5a3ed48927b577735474c975c02f13

Comment 18 Anne 2014-09-04 10:24:08 UTC

Thanks Henri. Integrated some of your suggestions.

https://github.com/whatwg/encoding/commit/29c491bc217a55300b79f64559101533d33e9a37

Comment 19 Martin Dürst 2014-09-04 10:36:09 UTC

(In reply to Anne from comment #17)

> https://github.com/whatwg/encoding/commit/45b854467f5a3ed48927b577735474c975c02f13

Looks good to me!