utf-8 Growth On The Web
Part of Tools
On Google's blog, Mark Davis explains that Google is moving to Unicode 5.1. The article unfortunately conflates Unicode and utf-8, as David Goodger noted in Unicode misinformation. But the really interesting bit is the growth of utf-8 on the Web. These data should be interesting for the development of http, html 5 and validators.
© graph from Google.
I wonder how they determined the encoding of pages for purposes of the graph.
@Mark
Me neither, but it would be interesting to know more. The data have been compiled by Erik van der Poel. Maybe he could chime in and explain a bit more. I have sent him a pointer to this thread.
The rise of utf-8 matches the decline of US-ASCII-only pages.
Another sad point is that the share of Chinese (gb2312) shows no obvious change.
PS: Glad to see another comment system that supports Markdown Syntax (Extra)!
If you add US-ASCII and UTF-8 together, the picture is almost stable over the last seven years; a valid US-ASCII page (with NCRs) is also a valid UTF-8 page. Likely folks updated their defaults and dare to use more non-ASCII UTF-8 than in 2001; to be sure, you'd have to look inside the pages. Lies, damned lies, and statistics...
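To illustrate why that is: the ASCII byte range is a strict subset of UTF-8, so a pure-ASCII page that writes non-ASCII characters as NCRs decodes identically under both labels. A minimal Python sketch (the sample markup is made up):

```python
# Minimal illustration: a pure-ASCII document that writes non-ASCII
# characters as NCRs decodes identically as US-ASCII and as UTF-8,
# because the ASCII byte range is a subset of UTF-8.
import html

page = b'<p>caf&#233;</p>'          # hypothetical ASCII-only markup with an NCR

as_ascii = page.decode('us-ascii')  # succeeds: every byte is below 128
as_utf8 = page.decode('utf-8')      # also succeeds: ASCII is a subset of UTF-8
assert as_ascii == as_utf8

# Once the NCR is expanded, the text contains a non-ASCII character anyway.
print(html.unescape(as_ascii))      # <p>café</p>
```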
[I drafted this up earlier and no one had responded yet...now I'm a little late to the game]
This is an interesting look at encoding, but many questions remain, namely
about their methodology and their URL set.
There are a number of ways to determine the encoding of a document, including
just doing some blind scanning of the raw bytes. The Google blog post makes
no mention of how they detected the encoding. Based on the graph and the
wording, it seems that they did not look at any of the stated encodings from
the document (the "charset" parameter of the Content-Type HTTP header, the
same value via the META markup element, and the "encoding" attribute of the
XML declaration).
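For reference, those three stated-encoding sources can be pulled out of a response with a fairly small script. This is only a rough sketch (regex-based rather than a real parser, and the URL is a placeholder), not the methodology used in any of the studies discussed here:

```python
# Rough sketch: collect the "stated" encodings of a document from the three
# places mentioned above. Regex-based, so a real parser would catch more
# edge cases; the URL below is just a placeholder.
import re
import urllib.request

def stated_encodings(url):
    with urllib.request.urlopen(url) as resp:
        content_type = resp.headers.get('Content-Type', '')
        head = resp.read(4096).decode('latin-1', 'replace')  # byte-transparent view

    found = {}

    # 1. "charset" parameter of the Content-Type HTTP header
    m = re.search(r'charset=["\']?([\w.:-]+)', content_type, re.I)
    if m:
        found['http'] = m.group(1).lower()

    # 2. META element (http-equiv form or <meta charset="...">)
    m = re.search(r'<meta[^>]+charset=["\']?([\w.:-]+)', head, re.I)
    if m:
        found['meta'] = m.group(1).lower()

    # 3. "encoding" attribute of the XML declaration
    m = re.search(r'<\?xml[^>]+encoding=["\']([\w.:-]+)', head, re.I)
    if m:
        found['xml'] = m.group(1).lower()

    return found

print(stated_encodings('http://example.org/'))  # placeholder URL
```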
I've also been doing some studies of stated encodings recently using mainly the
Open Directory Project (DMoz) as the URL set. [Note: unlike the Google research,
the results I have found are currently a snapshot only and do not represent any
trends over time.] Results from that study will be released soon, but I've
noticed a few things about character encoding after analyzing about 3.5 million
URLs so far. Here are a few highlights:
- A minority of documents (~20%) use the HTTP header to declare the
  encoding[1]. The "utf-8" value IS the dominant value, but only slightly
  (318351 for "utf-8" versus 286967 for the next most popular value,
  "iso-8859-1"). This agrees with Google's research.
- The majority of documents (~66%) use the META element to declare the
  encoding. Here the result is much different: the #1 and #2 values are
  "iso-8859-1" and "windows-1252", which combined account for 1754820 cases,
  dominating the third-place "utf-8" at 249084 (a 7:1 ratio!).
- Most documents that use XML also specify an encoding in the XML declaration.
  Even there, "iso-8859-1" dominates over "utf-8", 54572 instances to 27052
  (although "utf-8" is the default encoding for XML documents...).
Stated encodings:
Most browsers will use a stated charset encoding from the HTTP header in
preference to using auto-detection methods (scanning all or part of the
document to look for encoding hints), so if Google used some other
method, they may be ignoring how a browser would actually treat the encoding
of the document, and hence how it is actually (and accurately) displayed.
Please note that the value of "us-ascii" - the closest value to what they are
claiming is on a big decline - was very rarely encountered...less than 1% of
all cases where encodings are specified in any way. So...what does Google mean
when they say that "ASCII" has such a high usage? Do they just mean the first
128 code points shared in common between the ASCII, iso-8859-* and UTF-8
encodings? For UTF-8, did they also use the Byte Order Mark to detect UTF
usage?
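On the BOM question: sniffing for the various byte order marks is the easy part. A hedged sketch of what such a check could look like (not necessarily what Google does):

```python
# Sketch of BOM sniffing, purely illustrative; order matters because the
# UTF-32 BOMs start with the same bytes as the UTF-16 ones.
BOMS = [
    (b'\xef\xbb\xbf', 'utf-8'),
    (b'\xff\xfe\x00\x00', 'utf-32-le'),
    (b'\x00\x00\xfe\xff', 'utf-32-be'),
    (b'\xff\xfe', 'utf-16-le'),
    (b'\xfe\xff', 'utf-16-be'),
]

def bom_encoding(data):
    """Return the encoding implied by a leading BOM, or None."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None

assert bom_encoding(b'\xef\xbb\xbfhello') == 'utf-8'
assert bom_encoding(b'hello') is None
```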
Now, certainly the DMoz URL set has its own issues, namely skewing more toward
western web pages, and skewing heavily toward top-level pages of a site (about
3/4). These problems with DMoz are known...but are there any known issues with
Google's URL set? We don't know anything about Google's URL set other than "it
is big" and it "probably represents the universe of the Web-at-large" in
some way.
[1] "utf-8" dominance here may actually be more impressive than it seems. Many
Web servers have a default encoding used by the HTTP header. The default
encoding for Apache 2.2 for example is not "utf-8", but "iso-8859-1".
I used an encoding detector that looks at the entire HTML file (not just the "charset" label). We (Google) have samples of Web documents from 2001 onwards, so I ran the detector on those. The detector reports the "lowest" encoding. I.e. it would not report US-ASCII if there were any non-ASCII characters (bytes with value greater than 127). An NCR analysis might also be interesting, I agree. What would you like to see? Unicode scripts over time? Languages over time?
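That "lowest encoding" rule can be sketched compactly. Assuming the simplification that a document is either pure US-ASCII, valid UTF-8 with at least one non-ASCII byte, or something else, the check is roughly:

```python
# Sketch of the "lowest encoding" report described above, simplified to
# three buckets: pure US-ASCII, valid UTF-8 with at least one non-ASCII
# byte, or something else entirely.
def lowest_encoding(data: bytes) -> str:
    if all(b < 128 for b in data):
        return 'US-ASCII'
    try:
        data.decode('utf-8')
        return 'UTF-8'
    except UnicodeDecodeError:
        return 'other'

assert lowest_encoding(b'plain text') == 'US-ASCII'
assert lowest_encoding('café'.encode('utf-8')) == 'UTF-8'
assert lowest_encoding(b'caf\xe9') == 'other'   # a lone 0xE9 is not valid UTF-8
```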
A while ago I collected very similar data to Brian Wilson, and did some analysis. (I was primarily looking for common errors, and for how many bytes you need to check before finding the meta charset, rather than comparing the frequency of encodings.)
An interesting point is that of the pages declared as UTF-8 (in HTTP headers or in <meta>), 4% are not actually valid UTF-8 and are relying on browsers doing error correction. GB2312 is worse, with 16% of the pages I looked at containing invalid byte sequences - it seems many people label their pages as GB2312 when actually they're using GBK/GB18030.
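That kind of measurement amounts to a strict decode under the declared label, with a fallback to the superset encoding. A rough sketch (the function names are made up; "declared" would come from the HTTP header or <meta>):

```python
# Rough sketch of checking whether a document's bytes are valid in the
# charset it declares. The function names are made up; "declared" would
# come from the HTTP header or <meta>.
def label_matches_bytes(data: bytes, declared: str) -> bool:
    try:
        data.decode(declared, errors='strict')
        return True
    except (UnicodeDecodeError, LookupError):
        return False

def diagnose(data: bytes, declared: str) -> str:
    if label_matches_bytes(data, declared):
        return 'ok'
    # The mislabelling mentioned above: pages tagged GB2312 that are really
    # GBK/GB18030 (GB18030 is a superset of both).
    if declared.lower() == 'gb2312' and label_matches_bytes(data, 'gb18030'):
        return 'labelled GB2312 but only valid as GBK/GB18030'
    return 'invalid byte sequences for the declared charset'

assert diagnose('café'.encode('utf-8'), 'utf-8') == 'ok'
assert diagnose(b'caf\xe9', 'utf-8') != 'ok'    # lone 0xE9 is not valid UTF-8
```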
Hello Brian, thank you for reporting your results. It's always nice to see results from other samples, since I worry that Google's sample may be somewhat biased toward our mechanisms for choosing a subset of the Web. As you know, we compute a value called PageRank, and this is used in our systems.
We do use the HTTP and HTML META charsets, but only as initial hints. The rest of the detection is based on the byte stream itself (after the HTTP response headers). Many encoding detectors use the frequencies of occurrences of certain byte sequences in a "base" set to build a model during training, and then compute the probability that a document is in a certain encoding based on the byte sequences in that document. Note that the HTTP and HTML charset labels are sometimes wrong or missing. One initial measurement of the fraction of documents that have an "incorrect" label was roughly 5%, I believe, but I will have to go back and confirm that some day. Of course, our own detector may be getting it wrong sometimes too, but when we actually look at the documents, we do find some incorrect charsets, and even some documents that mix UTF-8 with ISO-8859-1. Note also that browsers offer an encoding menu that the user can use to "correct" a garbled display (though novices may never use that feature).
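The frequency-based approach described here can be illustrated with a tiny byte-bigram model. This is only a toy version of the idea, trained on two made-up samples; real detectors use far larger training sets and smarter features:

```python
# Toy version of the frequency-based detection described above: build a
# byte-bigram model per encoding from labelled sample text, then score an
# unknown document by summed log-probabilities. Real detectors are far
# more sophisticated; this only illustrates the idea.
import math
from collections import Counter

def train(samples):
    """samples maps an encoding name to training bytes in that encoding."""
    models = {}
    for name, data in samples.items():
        counts = Counter(zip(data, data[1:]))
        models[name] = (counts, sum(counts.values()))
    return models

def detect(models, data):
    best_name, best_score = None, float('-inf')
    for name, (counts, total) in models.items():
        # add-one smoothing over the 65536 possible byte pairs
        score = sum(math.log((counts.get(pair, 0) + 1) / (total + 65536))
                    for pair in zip(data, data[1:]))
        if score > best_score:
            best_name, best_score = name, score
    return best_name

# Hypothetical usage with two tiny made-up training samples:
models = train({
    'utf-8': 'héllo wörld, café crème '.encode('utf-8') * 50,
    'iso-8859-1': 'héllo wörld, café crème '.encode('iso-8859-1') * 50,
})
print(detect(models, 'crème brûlée'.encode('utf-8')))        # likely 'utf-8'
print(detect(models, 'crème brûlée'.encode('iso-8859-1')))   # likely 'iso-8859-1'
```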
In another study, I found that the HTTP charset was present in 11% of responses in 2001, and 43% in 2007. For the HTML charset, those numbers were 44% and 74%, respectively, while for XML encoding they were 0.39% and 2.7%, respectively.
Some documents come with charset labels declaring iso-8859-1, windows-1252 or even utf-8 when the byte values themselves are never greater than 127. Such documents are pure US-ASCII (if no ISO 2022 escape sequences are encountered). One might even argue that the charset labels are incorrect in such cases. I realize that this is debatable, but I don't think such debate is valuable.
Note that the first 128 values (0-127) are common to many charsets, not just ascii, iso-8859-* and utf-8. Windows-*, EUC-*, Shift_JIS and Big5 all come to mind. Major browsers treat the first 128 values as ascii even if the spec for the charset itself has a few non-ascii characters in that range, such as Yen sign instead of Backslash, and so on.
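A quick illustration of that shared 0-127 range, using Python's codecs (which, like the major browsers, map those bytes to their ASCII meanings):

```python
# The first 128 byte values decode identically under many legacy charsets;
# Python's codecs, like the major browsers, treat them as plain ASCII.
sample = b'Content-Type: text/html'   # pure ASCII bytes

for enc in ('ascii', 'iso-8859-1', 'windows-1252', 'shift_jis',
            'euc-jp', 'big5', 'gb2312', 'utf-8'):
    assert sample.decode(enc) == sample.decode('ascii')
print('all identical')
```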
Yes, we do feed the various BOMs (utf-8, utf-16, etc) into our probability computation.