This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.
Specification: https://html.spec.whatwg.org/multipage/introduction.html
Multipage: https://html.spec.whatwg.org/multipage/#structure-of-this-specification
Complete: https://html.spec.whatwg.org/#structure-of-this-specification
Referrer: https://html.spec.whatwg.org/multipage/

Comment: why do these examples of <html> lack the lang attribute?

Posted from: 24.22.56.84
User agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:32.0) Gecko/20100101 Firefox/32.0
Why not? Realistically, few people include it. It just means the language is unknown.
(Note that this bug has not been closed, meaning the issue has not been resolved. If you disagree with comment 1, please describe why, to convince me that I'm wrong.)
Based on this tweet it appears that Adrian may be trying to collect data for this bug, so I'm leaving it open: https://twitter.com/aardrian/status/519873578515570688

The most useful thing for this bug would be a clear statement about why having the language explicitly set is important. As far as I'm aware, having the language set really only matters for font selection when distinguishing CJK languages and for speech synthesis selection in legacy products that can't autorecognise the language.

The HTML spec doesn't actually have a strong encouragement to add the attribute currently. I definitely don't want to encourage people to add something that's not necessary, so if there isn't a compelling reason to add the attribute (especially in non-CJK cases) then we should probably make that clear in the spec.
https://twitter.com/aardrian/status/535225090028630016
Took me longer to get back to you than I promised. I blame the holiday.

I was unable to pull down the latest file from http://webdevdata.org/ without it being corrupt, so I can't say how many of the 78,000 sites it contains (as of 2013-10-30) use the lang attribute.

However, you stated you want information on why setting the lang attribute is important. Here's what I have:

- VoiceOver on iOS uses the attribute to auto-switch voices.
  https://twitter.com/cookiecrook/status/535264071902580736
- VoiceOver can speak a particular language using a different accent when specified.
  https://twitter.com/pauljadam/status/535264133185556480
- Leaving out the lang attribute may require the user to manually switch to the correct language for proper pronunciation.
  https://twitter.com/pauljadam/status/535264906216751104
- JAWS uses it to load the correct phonetic engine / phonologic dictionary. Handy for sites with multiple languages.
  https://twitter.com/notabene/status/535450940070166528
  https://twitter.com/notabene/status/535451061163925504
- NVDA (Windows) uses it in the same way as VoiceOver and JAWS.
  https://twitter.com/MarcoInEnglish/status/535452203314868225
- When used in HTML that is used to form an ePub or Apple iBooks document, it affects how VoiceOver will read the book.
  https://twitter.com/MarcoInEnglish/status/535452358508306432
- Firefox, IE10, and Safari (as of a year ago) support CSS hyphens: auto only when the lang attribute is set. I did not personally test this because even in this age of evergreen browsers, I still run across year-old versions on a day-to-day basis.
  http://www.quirksmode.org/blog/archives/2012/11/hyphenation_wor.html

I think it's worth noting that I consider neither the current release of VoiceOver on iOS nor NVDA to be a legacy product.
I made a Storify of the responses I got on Twitter (all the tweets linked above are included, along with others that re-state the same points): https://storify.com/aardrian/lang-attribute-on-html-for-screen-readers
Interesting stuff, thanks. What language do those screen readers use when there's no language specified?
My understanding is that they use the user's default system setting, unless that has been overridden in the screen reader software.
I was able to download the latest archive from WebDevData.org (2015-01-08, 780 MB, 87,000 pages). Of the 84,054 pages that I was able to parse, 39,433 use the lang attribute on the <html> element. That's 47% (46.914% if I understand significant digits correctly).
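For anyone who wants to reproduce a count like this, here is a rough sketch using only Python's standard library. It is not the script that produced the numbers above; `RootLangParser`, `root_lang`, and `lang_usage` are names made up for illustration, and it assumes the pages are already loaded as strings. It also treats an empty `lang=""` as "not set", which is a choice the original count may or may not have made.

```python
from html.parser import HTMLParser


class RootLangParser(HTMLParser):
    """Records the lang attribute of the first <html> start tag, if any."""

    def __init__(self):
        super().__init__()
        self.seen_html = False
        self.lang = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag == "html" and not self.seen_html:
            self.seen_html = True
            self.lang = dict(attrs).get("lang")


def root_lang(markup):
    """Return the lang value on <html>, or None if it is absent."""
    parser = RootLangParser()
    parser.feed(markup)
    return parser.lang


def lang_usage(pages):
    """Return (pages_with_lang, total_pages, percentage)."""
    total = len(pages)
    # Assumption: an empty lang="" is counted as "no language given".
    with_lang = sum(1 for page in pages if root_lang(page))
    pct = 100.0 * with_lang / total if total else 0.0
    return with_lang, total, pct
```

Calling `lang_usage(list_of_page_strings)` gives the same kind of triple as the figures quoted above (pages with lang, pages parsed, percentage).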
Highest stats for page views in chromestatus:

- LangAttribute 0.2415% https://www.chromestatus.com/metrics/feature/timeline/popularity/587
- LangAttributeDoesNotMatchToUILocale 0.0736% https://www.chromestatus.com/metrics/feature/timeline/popularity/590
- LangAttributeOnBody 0.0028% https://www.chromestatus.com/metrics/feature/timeline/popularity/589
- LangAttributeOnHtml 0.2184% https://www.chromestatus.com/metrics/feature/timeline/popularity/588

See also comments in https://www.w3.org/Bugs/Public/show_bug.cgi?id=26951 for analysis of earlier webdevdata as well as github.

It seems to me that on top sites, lang is relatively common and most often used correctly, while on the long tail, it is used rarely and more often incorrectly.
It seems like the correct resolution here is to canvass the spec for examples and demos that include the `html` element, and add `lang="en"` to them.

Note that the spec already encourages lang usage:

> Authors are encouraged to specify a lang attribute on the root html element, giving the document's language. This aids speech synthesis tools to determine what pronunciations to use, translation tools to determine what rules to use, and so forth.

This seems like a pretty easy bug if someone is willing to submit a pull request.
Did you consider misuse due to copy/paste of examples into non-English pages, as in https://www.w3.org/Bugs/Public/show_bug.cgi?id=26951#c7 ?
I did not. What do you think that means we should do?
I'm not sure... I can see a few possible situations:

* Software uses the lang="" if specified, and otherwise system language or user setting (apparently most screen readers per comment 5). Omitting lang is no-harm if the page happens to be in the same language as the system language or user setting, otherwise as harmful as mislabeling. Mislabeling is harmful (requires user override).
* Software uses the lang="" if specified, and otherwise uses language analysis of the page (or user override). I don't know if any such software exists. Omitting lang would typically be no-harm, since language analysis works reasonably well I believe. Mislabeling is harmful (requires user override).
* Software uses one of the above approaches but ignores lang="en" due to too much mislabeled content.
* Software always uses language analysis (or user override) (possibly using lang as a hint). e.g. Google Translate, I think. Omitting or mislabeling would typically be no-harm.

So mislabeling is a problem, but not labeling at all can also be a problem. I suppose it is ineffective to try to combat mislabeling by not labeling at all in examples in the spec. It would be more effective to warn in HTML checkers if the specified language doesn't match with language analysis.

How about adding lang to about half of the examples, so that it doesn't appear like it's a fixed required preamble (like the doctype)? Maybe also add more non-English examples.
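To make the four situations easier to compare, here is a small sketch of the selection logic as one function. This is only a model of the cases listed above, not the behaviour of any particular screen reader or translator; `pick_language` and its parameters are hypothetical names.

```python
def pick_language(declared, system_default, detected=None,
                  ignored=frozenset(), prefer_detection=False):
    """Choose the language a tool would use for a page.

    declared         -- value of lang="" on the page, or None if absent
    system_default   -- the user's system language or tool setting
    detected         -- result of language analysis, or None if unavailable
    ignored          -- declared values the tool distrusts, e.g. {"en"}
    prefer_detection -- True for tools that always rely on analysis
    """
    # Situation 4: analysis wins whenever it produced a result.
    if prefer_detection and detected:
        return detected
    # Situations 1 and 2: a declared language is trusted,
    # unless (situation 3) the tool ignores that particular value.
    if declared and declared not in ignored:
        return declared
    # Situation 2 fallback: analysis, when the tool has it.
    if detected:
        return detected
    # Situation 1 fallback: system language or user setting.
    return system_default
```

For instance, a French page mislabeled as English on a German system gives `pick_language("en", "de-DE")`, which returns `"en"`, so the user has to override. With `detected="fr"` and `ignored={"en"}` the same call returns `"fr"`.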
Good breakdown. I'm not a big fan of the half-the-examples idea. I think in effect this bug is trying to communicate that it *is* a required preamble, for good screen reader support.
Yeah, OK. Let's add those 'lang's then, and then experiment with the HTML checker and other tools to combat mislabeling.
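As a starting point for that kind of experiment, here is a toy sketch of a mismatch warning. It is not how the HTML checker works: the stopword lists are tiny and made up, `guess_language` and `check_lang` are hypothetical names, and a real tool would need a proper language-identification model.

```python
import re

# Tiny, hypothetical stopword lists, for illustration only.
STOPWORDS = {
    "en": {"the", "and", "of", "to", "is", "in", "that", "it"},
    "fr": {"le", "la", "les", "et", "de", "des", "est", "un", "une"},
    "de": {"der", "die", "das", "und", "ist", "nicht", "ein", "eine"},
}


def guess_language(text):
    """Return the best-matching language code, or None if nothing matches."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    scores = {lang: sum(w in stop for w in words)
              for lang, stop in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def check_lang(declared, text):
    """Return a warning string on mismatch, or None if there is nothing to report."""
    guessed = guess_language(text)
    if not declared or guessed is None:
        return None  # nothing to compare
    # Compare only the primary subtag, so "en-GB" matches "en".
    primary = declared.lower().split("-")[0]
    if primary != guessed:
        return f'lang="{declared}" but content looks like "{guessed}"'
    return None
```

Calling `check_lang("en", "Le chat est sur la table et il mange des pommes")` returns a warning, while `check_lang("en-GB", "The cat is in the house and it sleeps")` returns `None`.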
https://github.com/whatwg/html/pull/1061
https://github.com/validator/validator/issues/284