This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019.
Please change: "Historically encodings and their specifications (if any) were kept track of by the IANA Character Sets registry. This specification renders that registry obsolete." to: "Encodings and their specifications (if any) are also being kept track of by the IANA Character Sets registry. For the purposes of specifications using this specification, that registry is no longer relevant." The first part of the second sentence (up to the comma) was suggested by Anne during an i18n telecon. The text for the second part is recommended to avoid the controversial word 'obsolete'. The edit to the first sentence was proposed by Martin Dürst.
See http://lists.w3.org/Archives/Public/www-international/2014JulSep/0264.html for a proposal that hopefully addresses further concerns from Asmus Freytag.
http://lists.w3.org/Archives/Public/www-international/2014JulSep/0271.html provides further fine-tuning suggestions.
https://github.com/whatwg/encoding/commit/d6eaebe4436ada349218ce9b22ccc104e5f8caab
The first paragraph now reads: "Unicode is the universal alphabet and utf-8 is its encoding. This specification turns that into a requirement for new protocols and formats, as well as existing formats deployed in new contexts." The first sentence reads as if it's from a religious document. It would be great if we could avoid this impression. What about something like "The universal alphabet Unicode and its encoding utf-8 are indispensable for interoperability."? The "that" in the second sentence is unclear because there are two possible referents (Unicode and utf-8). What about changing "This specification turns that into a requirement" to "This specification requires utf-8"? The last paragraph (of the preface) ends with "and thereby renders the registry irrelevant". I don't care whether we say "obsolete" or "irrelevant", but we need to qualify the range of that statement. We already had various text proposals for this qualification, but apparently they got dropped. What about "and thereby renders the registry irrelevant for specifications, implementations, and content adhering to it." or some such?
Referring to Unicode as an "alphabet" is a serious abuse of terminology. Please don't use "alphabet" to characterize Unicode as a whole or even in part.
(In reply to Glenn Adams from comment #5)
> Referring to Unicode as an "alphabet" is a serious abuse of terminology.
> Please don't use "alphabet" to characterize Unicode as a whole or even in
> part.
Sorry, yes, very good point.
Suggestions welcome, Glenn.
How about "Unicode is the universal character set"? I agree with Martin's last point about losing the scope. I suggest "and thereby renders the registry irrelevant for specifications using this specification." (Martin's "adhering to it" appears to refer to the registry, which I don't think is what Martin meant.)
Darn, hit the button too soon. Amend my suggestion to "and thereby renders the registry irrelevant for specifications, applications and content using this specification."
I think I'd rather replace the first paragraph with "For new protocols and formats, as well as existing formats deployed in new contexts, this specification requires the utf-8 encoding."
Works for me.
This specification, by itself, does nothing to the IANA charset registry. The registry remains, and it remains relevant to those trying to understand how charset labels appear and are intended, e.g., in web-based email and instant messaging. So it isn't true that all applications using this specification need not know about or look at the IANA charset registry. The IANA charset registry may remain inaccurate until someone fixes it, but this specification unfortunately does nothing to change its accuracy and very little to reduce its relevance. Until there is a replacement for the IANA charset registry that is suitable for use by other (non-web) applications, and suitable for web-based implementations of those applications, the IANA charset registry remains an important (if flawed) source of information for designers and implementors of those applications. The more you make false statements about how well this draft does something it doesn't really do (obsolete the IANA charset registry or render it irrelevant), the more you negatively affect the incentive to do what is necessary, which is to actually fix the IANA charset registry. Fixing the IANA charset registry so that it is a useful and acceptable reference for non-web applications may be out of scope or not addressed here, but then just say so.
What applications are you talking about? This is intended for all applications.
This is how I understand it (and without speaking on their behalf, I think it's the view of the i18n WG too): For those specifications, formats, protocols and applications that use and conform to the Encoding spec, the Encoding spec restricts the range of legacy encodings to a smaller set than that covered in the IANA registry. For the set of encodings covered by the Encoding spec, the Encoding spec provides information which is sufficiently detailed and complete that such conforming specifications, formats, protocols and applications don't need to refer to the IANA registry for more information. Fixing the IANA registry itself by making changes to it is out of scope for and therefore not addressed by the Encoding spec.
(In reply to Anne from comment #10)
> I think I'd rather replace the first paragraph with "For new protocols and
> formats, as well as existing formats deployed in new contexts, this
> specification requires the utf-8 encoding."
That's a very fine solution for the first paragraph. The last paragraph still says "renders the registry irrelevant", which is still inappropriate.
Trying my luck painting this bikeshed, I suggest:
"
Unicode is the coded character set for the Web and, of the encodings that can represent all of Unicode, UTF-8 is the most appropriate one for interchange. This specification turns the use of UTF-8 as the interchange encoding into a requirement for new protocols and formats, as well as existing formats deployed in new contexts.
There are other (legacy) encodings and while they have been defined to some extent, implementations have not always implemented them in the same way, have not always used the same labels, and often differ in dealing with undefined and former proprietary areas of encodings. This specification attempts to fill those gaps so that new implementations do not have to reverse engineer encoding implementations of the market leaders and existing implementations can converge.
In particular, this specification defines all the encodings, their algorithms to go from bytes to scalar values and back, and their canonical names and identifying labels. This specification also defines an API to expose part of the encoding algorithms to JavaScript.
Implementations have significantly deviated from the labels listed in the IANA Character Sets registry. Combined with the desire to stop legacy encodings from spreading further, this specification is exhaustive about the aforementioned details and thereby renders the registry obsolete when adhering to this specification. In particular, this specification intentionally does not provide a mechanism for extending the set of encodings or the set of labels.
"
("obsolete when adhering to this specification" is pretty much a circular statement, but I suggest using slightly more weaselly words if that allows a more productive focus on what the spec actually defines.)
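(Editorial illustration, not part of the original thread.) The proposed text above mentions canonical names, identifying labels, and a JavaScript API. A minimal sketch of what that looks like in practice follows, assuming a runtime that exposes the standard `TextEncoder` and `TextDecoder` globals (browsers and recent Node.js releases). The helper names `resolveLabel` and `roundTrip` are hypothetical and chosen only for this example.

```typescript
// Hypothetical helper: map a label to the encoding name the runtime reports.
// Returns null when the label is not recognized; the TextDecoder constructor
// throws a RangeError in that case, which reflects the closed label set.
function resolveLabel(label: string): string | null {
  try {
    return new TextDecoder(label).encoding;
  } catch (e) {
    if (e instanceof RangeError) {
      return null;
    }
    throw e; // anything else is unexpected, so let it propagate
  }
}

// Hypothetical helper: go from scalar values to bytes and back.
// TextEncoder always produces utf-8; there is no option to pick a legacy encoding.
function roundTrip(text: string): string {
  const bytes: Uint8Array = new TextEncoder().encode(text);
  return new TextDecoder("utf-8").decode(bytes);
}

console.log(resolveLabel("utf8"));                  // the alias "utf8" is a label for utf-8
console.log(resolveLabel("x-not-a-real-encoding")); // unknown labels are rejected
console.log(roundTrip("Dürst") === "Dürst");
```

In a conforming runtime, legacy labels such as "latin1" are expected to resolve to an encoding name from the Encoding spec (windows-1252 in that case) rather than to whatever the IANA registry lists under that label, which is the kind of deviation from the registry the thread is discussing.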
(In reply to Martin Dürst from comment #15)
> The last paragraph still says "renders the registry irrelevant", which is
> still inappropriate.
I disagree, but I changed it nonetheless. https://github.com/whatwg/encoding/commit/45b854467f5a3ed48927b577735474c975c02f13
Thanks Henri. Integrated some of your suggestions. https://github.com/whatwg/encoding/commit/29c491bc217a55300b79f64559101533d33e9a37
(In reply to Anne from comment #17)
> https://github.com/whatwg/encoding/commit/45b854467f5a3ed48927b577735474c975c02f13
Looks good to me!