This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.
Fragment identifier parsing in the HTML spec has this crazy thing: "Let decoded fragid be the result of expanding any sequences of percent-encoded octets in fragid that are valid UTF-8 sequences into Unicode characters as defined by UTF-8. If any percent-encoded octets in that string are not valid UTF-8 sequences (e.g. they expand to surrogate code points), then skip this step and the next one." Any chance of getting an algorithm for this somehow? http://www.whatwg.org/specs/web-apps/current-work/#the-indicated-part-of-the-document
So the algorithm you want is: 1. Percent decode /input/ into /bytes/. 2. Run utf-8's decoder on /bytes/. If that emitted a decoder error, return /input/; otherwise return the decoder's output. Isn't that simple enough to just put in the HTML specification?
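For illustration only, the two steps above could be sketched in Python roughly as follows. The function name decode_fragid is hypothetical, and Python's strict UTF-8 codec is used here as a stand-in for the Encoding spec's utf-8 decoder; the two may differ in details such as byte order mark handling, so treat this as an approximation rather than the spec's algorithm.

```python
from urllib.parse import unquote_to_bytes


def decode_fragid(fragid: str) -> str:
    """Hypothetical helper: percent-decode, then strictly UTF-8 decode.

    Returns the input unchanged if the decoded bytes are not valid UTF-8.
    """
    # Step 1: percent decode the input into bytes.
    raw = unquote_to_bytes(fragid)
    try:
        # Step 2: strict decoding raises on any invalid sequence,
        # including bytes that would expand to surrogate code points.
        return raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        # Decoder error: fall back to the original, still-encoded input.
        return fragid
```

With this sketch, an input such as "caf%C3%A9" decodes to "café", while "%ED%A0%80" (an encoded surrogate) and "%FF" come back unchanged.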
Sure, I can do it in the HTML spec if you like.
Checked in as WHATWG revision r7796. Check-in comment: Update integration with URL spec and Encoding spec. http://html5.org/tools/web-apps-tracker?from=7795&to=7796