This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.
http://lists.w3.org/Archives/Public/www-validator/2004May/0089.html http://lists.w3.org/Archives/Public/public-qa-dev/2004May/0024.html
I've added the O_CHARSET flag to &abort_if_error_flagged so that it gets triggered on byte errors. This has the potential to break some of the other exception cases that get handled by this particular instance, so it bears watching for further weirdness (there was a reason it was disabled, IIRC). This enables the Charset popup on result pages for URLs with no charset and bytes that are invalid in UTF-8 (i.e. unlabeled Latin-1, Windows-1252, etc.).
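For illustration, the byte-error condition mentioned above boils down to a failed UTF-8 decode. This is a minimal sketch in Python (the validator itself is Perl, so the function name and structure here are purely illustrative, not the validator's actual code):

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# "e-acute" encoded as Latin-1 is the single byte 0xE9, which is not a
# valid UTF-8 sequence -- so unlabeled Latin-1/Windows-1252 content
# containing non-ASCII bytes fails this check and trips the byte-error
# path described above.
assert is_valid_utf8("é".encode("utf-8"))
assert not is_valid_utf8("é".encode("latin-1"))
```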
Both Charset and DOCTYPE were borked due to logic errors. Now fixed, but the code is somewhat hairy, so there could still be edge cases, and it's likely to break again the next time someone futzes about in this part of the code. Anyone have ideas for a complete revamp of this code?
Still some issues; try for example http://www.hut.fi:
- HTTP Content-Type: text/html (no charset)
- No <meta> element with a charset in the markup
- Not XML, so no XML encoding declaration
--> the validator uses UTF-8. Shouldn't this be iso-8859-1, based on the "strong default" of HTTP?
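The fallback chain walked through above can be sketched as follows. This is an illustrative Python sketch only (the validator is Perl, and the function name and parameters here are hypothetical, not taken from its source):

```python
def pick_charset(http_charset, meta_charset, xml_encoding,
                 default="utf-8"):
    """Return the first charset found, in the precedence order described
    in this bug: HTTP Content-Type parameter, then <meta> declaration,
    then XML encoding declaration, then the fallback default. Note the
    fallback is UTF-8 here, not HTTP's ISO-8859-1 -- which is the point
    of contention in this bug."""
    for candidate in (http_charset, meta_charset, xml_encoding):
        if candidate:
            return candidate.lower()
    return default

# The www.hut.fi case: no charset anywhere, so everything falls
# through to the UTF-8 default.
assert pick_charset(None, None, None) == "utf-8"
# An explicit HTTP charset parameter wins outright.
assert pick_charset("ISO-8859-1", None, None) == "iso-8859-1"
```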
cf. Comment #3: "Shouldn't this be iso-8859-1 based on the 'strong default' of HTTP?" No. As we've been over a gazillion and one times on w-v, the HTML Recs have seen fit to override the HTTP RFC, making it impossible to conform to both standards at the same time. Given the mess this issue is in, we're erring on the side of 1) obeying the W3C Recs, since we're a W3C-hosted service, and 2) promoting Unicode over limited legacy charsets. This is not ideal, but it's the best we can do under the circumstances. If you have specific proposed changes (other than changing the default to ISO-8859-1), please outline them (warn about the fallback, perhaps?). Otherwise, please close this bug.
Correct me if I'm wrong: I don't see where the conflict between the specs lies in the situation outlined in comment 3. AFAIK, one of the specs has a "strong default"; the other (speaking of HTML here, not X(HT)ML) has no default at all. FWIW, I disagree with bluntly acting against the HTTP spec _when not necessary_. Anyway, we already have warnings in the validator code about not being able to find a character encoding to use. Why aren't those shown in this case? In which cases are they shown, then? I think the warnings should be shown no matter what charset we choose if none is explicitly specified. In addition, if we choose to use UTF-8 in these cases, for which the reasoning is not at all obvious IMO, a blurb/statement about it needs to be included in the documentation.
cf. Comment #5: HTTP specifies that the absence of a charset parameter in the Content-Type field means a default of ISO-8859-1. The HTML 4.01 Recommendation says something along the lines of "This has turned out to be sub-optimal; you should disregard it and default to UTF-8 instead." IOW, when no other charset information is present -- including any defaults implied by an XML Content-Type -- we can pick either ISO-8859-1 or UTF-8, depending on whether we choose to listen to the IETF or the W3C. After many (*many*) discussions on w-v, we've ended up listening to the W3C. As for why there is no warning in the case outlined in Comment #3: this is because the page generates a fatal error. The exception handler is conservative in what it tries to spit out, because fatal errors usually occur too early, while the data structures are in a garbage state. I'll look into whether we can fix it in this particular case.
OK, I've added the accumulated warnings to the output for all cases where the metadata table is also output. I think the initialization state for &add_table and &add_warning is the same at that stage. Should be up on qa-dev now; try it on hut.fi and let me know if it does the job.
HTML 4.01 actually says do what you want, but do not blindly default to ISO-8859-1, as that would fail in many situations. The specification mentions UTF-8 only as an example of a common encoding and in the section on how to handle illegal URIs. If we do not attempt to be clever, ISO-8859-1 makes the most sense among the possible default encodings, as there are most likely more ISO-8859-1 documents than UTF-8 documents on the web that do not declare any encoding.
As mentioned, this has been discussed, at length, on several occasions on w-v. Bugzilla is not the place to take this discussion up again, and 0.6.6 is not the target for revisiting this issue. Closing this bug as FIXED; let's hash this out on the list and, iff necessary, make any changes with a target of 0.7.
Now that the warnings are visible, the fallback is a lot less confusing; good enough for 0.6.6 IMO.