Copyright © 2014 W3C® (MIT, ERCIM, Keio, Beihang), All Rights Reserved. W3C liability, trademark and document use rules apply.
This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.
This document was published by the Internationalization Working Group as a Candidate Recommendation. This document is intended to become a W3C Recommendation.
W3C publishes a Candidate Recommendation to indicate that the document is believed to be stable and to encourage implementation by the developer community. W3C encourages everybody to implement this specification and return comments to the (archived) public mailing list www-international@w3.org (see instructions) or file a bug (open bugs). All comments are welcome. When sending e-mail, please put the text “Encoding” in the subject, preferably like this: “[Encoding] …summary of comment…” . This Candidate Recommendation is expected to advance to Proposed Recommendation no earlier than 16 March 2015. There is not yet an implementation report.
This is a snapshot of the WHATWG document, as of 4 September 2014, published after discussion with the WHATWG editors. No changes have been made in the body of this document other than to align with W3C house styles. The primary reason that W3C is publishing this document is so that HTML5 and other specifications may normatively refer to a stable W3C Recommendation.
This document takes into account comments during the Last Call; see Disposition of Comments for the Last Call. See a list of changes since the Last Call version was published. There are no features at risk.
Publication as a Candidate Recommendation does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.
This document is governed by the 14 October 2005 W3C Process Document.
This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.
The utf-8 encoding is the most appropriate encoding for interchange of Unicode, the universal coded character set. Therefore for new protocols and formats, as well as existing formats deployed in new contexts, this specification requires (and defines) the utf-8 encoding.
The other (legacy) encodings have been defined to some extent in the past. However, user agents have not always implemented them in the same way, have not always used the same labels, and often differ in dealing with undefined and former proprietary areas of encodings. This specification addresses those gaps so that new user agents do not have to reverse engineer encoding implementations and existing user agents can converge.
In particular, this specification defines all those encodings, their algorithms to go from bytes to scalar values and back, and their canonical names and identifying labels. This specification also defines an API to expose part of the encoding algorithms to JavaScript.
User agents have also significantly deviated from the labels listed in the IANA Character Sets registry. To stop spreading legacy encodings further, this specification is exhaustive about the aforementioned details and therefore has no need for the registry. In particular, this specification does not provide a mechanism for extending any aspect of encodings.
All diagrams, examples, and notes in this specification are non-normative, as are all sections explicitly marked non-normative. Everything else in this specification is normative.
The key words "MUST", "MUST NOT", "REQUIRED", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in the normative parts of this document are to be interpreted as described in RFC2119. For readability, these words do not appear in all uppercase letters in this specification. [RFC2119]
Conformance requirements phrased as algorithms or specific steps may be implemented in any manner, so long as the end result is equivalent. (In particular, the algorithms defined in this specification are intended to be easy to follow, and not intended to be performant.)
User agents may impose implementation-specific limits on otherwise unconstrained inputs, e.g. to prevent denial of service attacks, to guard against running out of memory, or to work around platform-specific limitations.
Hexadecimal numbers are prefixed with "0x".
In equations, all numbers are integers, addition is represented by "+", subtraction by "−", multiplication by "×", division by "/", calculating the remainder of a division (also known as modulo) by "%", logical left shifts by "<<", logical right shifts by ">>", bitwise AND by "&", and bitwise OR by "|".
For logical right shifts, operands must have at least twenty-one bits of precision.
A byte is a sequence of eight bits, represented as a double-digit hexadecimal number in the range 0x00 to 0xFF.
A code point is a Unicode code point and is represented as a four-to-six digit hexadecimal number, typically prefixed with "U+". In equations and indexes code points are prefixed with "0x". [UNICODE]
A scalar value is a code point that is not in the range U+D800 to U+DFFF.
The ASCII whitespace are code points U+0009, U+000A, U+000C, U+000D, and U+0020.
The ASCII digits are code points in the range U+0030 to U+0039.
A string is a sequence of code points.
Comparing two strings in an ASCII case-insensitive manner means comparing them exactly, code point for code point, except that the characters in the range U+0041 to U+005A (i.e. LATIN CAPITAL LETTER A to LATIN CAPITAL LETTER Z) and the corresponding characters in the range U+0061 to U+007A (i.e. LATIN SMALL LETTER A to LATIN SMALL LETTER Z) are considered to also match.
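The comparison above can be sketched in JavaScript as follows. This is a minimal illustration, not part of the specification; the helper names `asciiLowercase` and `asciiCaseInsensitiveMatch` are hypothetical. Only U+0041 to U+005A are folded, which is why the built-in `toLowerCase()` is not applied to the whole string: it also folds non-ASCII characters such as U+212A KELVIN SIGN.

```javascript
// Hypothetical helper: lowercases only U+0041..U+005A and leaves
// every other code point untouched.
function asciiLowercase(s) {
  return s.replace(/[A-Z]/g, (c) => String.fromCharCode(c.charCodeAt(0) + 0x20));
}

// Two strings match when they are identical after ASCII-only folding.
function asciiCaseInsensitiveMatch(a, b) {
  return asciiLowercase(a) === asciiLowercase(b);
}

const sameLabel = asciiCaseInsensitiveMatch("UTF-8", "utf-8");
const kelvin = asciiCaseInsensitiveMatch("K", "\u212A");
```

Here `sameLabel` is true, while `kelvin` is false because U+212A lies outside the ASCII range.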
A token is a piece of data, such as a byte or code point.
A stream represents an ordered sequence of tokens. End-of-stream is a special token that signifies no more tokens are in the stream.
When a token is read from a stream, the first token in the stream must be returned and subsequently removed, and end-of-stream must be returned otherwise.
When one or more tokens are prepended to a stream, those tokens must be inserted, in given order, before the first token in the stream.
Inserting the sequence of tokens "&#128169;" in a stream " hello world" results in a stream "&#128169; hello world". The next token to be read would be "&".
When one or more tokens are pushed to a stream, those tokens must be inserted, in given order, after the last token in the stream.
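The read, prepend, and push operations above can be sketched with a small class. This is an illustrative model only; the `Stream` class and the `END_OF_STREAM` symbol are hypothetical names, and tokens are stored in a plain array.

```javascript
// Hypothetical sentinel standing in for the end-of-stream token.
const END_OF_STREAM = Symbol("end-of-stream");

class Stream {
  constructor(tokens = []) {
    // Array.from splits a string into code points, one token each.
    this.tokens = Array.from(tokens);
  }
  // Return and remove the first token, or end-of-stream if none is left.
  read() {
    return this.tokens.length > 0 ? this.tokens.shift() : END_OF_STREAM;
  }
  // Insert tokens, in the given order, before the first token.
  prepend(...tokens) {
    this.tokens.unshift(...tokens);
  }
  // Insert tokens, in the given order, after the last token.
  push(...tokens) {
    this.tokens.push(...tokens);
  }
}

const stream = new Stream(" hello world");
stream.prepend(..."&#128169;"); // nine tokens: "&", "#", "1", ...
const first = stream.read();
```

After the prepend, `first` is "&", because the prepended tokens are the individual characters of the character reference rather than the character it denotes.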
An encoding defines a mapping from a scalar value sequence to a byte sequence (and vice versa). Each encoding has a name, and one or more labels.
Each encoding has an associated decoder and encoder. Each decoder and encoder have a handler algorithm. A handler algorithm takes an input stream and a token, and returns finished, one or more tokens, error optionally with a code point, or continue.
An error mode as used below is either replacement (default) or fatal for a decoder and one of fatal (default) or HTML for an encoder.
An XML processor would set error mode to fatal. [XML]
HTML exists as error mode due to URLs and HTML forms requiring a non-terminating legacy encoder. The HTML error mode causes a sequence to be emitted that cannot be distinguished from legitimate input and can therefore lead to silent data loss. Developers are strongly encouraged to use the utf-8 encoding to prevent this from happening. [URL] [HTML]
To run an encoding's decoder or encoder encoderDecoder with input stream input, output stream output, and error mode mode, run these steps:
If mode is not given, set it to replacement, if encoderDecoder is a decoder, and fatal otherwise.
Let encoderDecoderInstance be a new encoderDecoder.
While true:
Let result be the result of processing the result of reading from input for encoderDecoderInstance, input, output, and mode.
If result is not continue, return result.
Otherwise, do nothing.
To process a token token for an encoding's encoder or decoder instance encoderDecoderInstance, stream input, output stream output, and error mode mode, run these steps:
If mode is not given, set it to replacement if encoderDecoderInstance is a decoder instance, and to fatal otherwise.
Let result be the result of running encoderDecoderInstance's handler on token.
If result is continue or finished, return result.
Otherwise, if result is one or more tokens, push result to output.
Otherwise, if result is error, switch on mode and run the associated steps:
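The run and process steps can be sketched as below, under stated assumptions. `AsciiOnlyDecoder` is a hypothetical toy decoder, not one of this specification's encodings. The error handling shown (push U+FFFD in replacement mode, return error in fatal mode) and the final "return continue" are assumptions of this sketch. The HTML error mode, which applies to encoders only, is omitted.

```javascript
const END_OF_STREAM = Symbol("end-of-stream");
const FINISHED = Symbol("finished");
const CONTINUE = Symbol("continue");
const ERROR = Symbol("error");

// Streams are plain arrays here; reading removes the first token.
const read = (stream) => (stream.length > 0 ? stream.shift() : END_OF_STREAM);

// Hypothetical toy decoder: accepts bytes 0x00..0x7F, errors on the rest.
class AsciiOnlyDecoder {
  handler(input, byte) {
    if (byte === END_OF_STREAM) return FINISHED;
    if (byte <= 0x7f) return [byte]; // "one or more tokens"
    return ERROR;
  }
}

function processToken(token, instance, input, output, mode = "replacement") {
  const result = instance.handler(input, token);
  if (result === CONTINUE || result === FINISHED) return result;
  if (Array.isArray(result)) {
    output.push(...result);
  } else if (result === ERROR) {
    // Assumption: fatal stops with error, replacement emits U+FFFD.
    if (mode === "fatal") return ERROR;
    output.push(0xfffd);
  }
  return CONTINUE;
}

function run(Decoder, input, output, mode = "replacement") {
  const instance = new Decoder();
  while (true) {
    const result = processToken(read(input), instance, input, output, mode);
    if (result !== CONTINUE) return result;
  }
}

const lenientOut = [];
const lenientStatus = run(AsciiOnlyDecoder, [0x68, 0x69, 0xff], lenientOut);
const fatalOut = [];
const fatalStatus = run(AsciiOnlyDecoder, [0x68, 0x69, 0xff], fatalOut, "fatal");
```

In replacement mode the loop runs to the end of the input and the bad byte becomes U+FFFD; in fatal mode the loop stops at the first error and the output holds only what was decoded before it.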
The table below lists all encodings and their labels user agents must support. User agents must not support any other encodings or labels.
Authors must use the utf-8 encoding and must use the ASCII case-insensitive "utf-8" label to identify it.
New protocols and formats, as well as existing formats deployed in new contexts, must use the utf-8 encoding exclusively. If these protocols and formats need to expose the encoding's name or label, they must expose it as "utf-8".
To get an encoding from a string label, run these steps:
Remove any leading and trailing ASCII whitespace from label.
If label is an ASCII case-insensitive match for any of the labels listed in the table below, return the corresponding encoding, and failure otherwise.
This is a much simpler and more restrictive algorithm of mapping labels to encodings than section 1.4 of Unicode Technical Standard #22 prescribes, as that is found to be necessary to be compatible with deployed content.
Name | Labels |
---|---|
The Encoding | |
utf-8 | "unicode-1-1-utf-8", "utf-8", "utf8" |
Legacy single-byte encodings | |
ibm866 | "866", "cp866", "csibm866", "ibm866" |
iso-8859-2 | "csisolatin2", "iso-8859-2", "iso-ir-101", "iso8859-2", "iso88592", "iso_8859-2", "iso_8859-2:1987", "l2", "latin2" |
iso-8859-3 | "csisolatin3", "iso-8859-3", "iso-ir-109", "iso8859-3", "iso88593", "iso_8859-3", "iso_8859-3:1988", "l3", "latin3" |
iso-8859-4 | "csisolatin4", "iso-8859-4", "iso-ir-110", "iso8859-4", "iso88594", "iso_8859-4", "iso_8859-4:1988", "l4", "latin4" |
iso-8859-5 | "csisolatincyrillic", "cyrillic", "iso-8859-5", "iso-ir-144", "iso8859-5", "iso88595", "iso_8859-5", "iso_8859-5:1988" |
iso-8859-6 | "arabic", "asmo-708", "csiso88596e", "csiso88596i", "csisolatinarabic", "ecma-114", "iso-8859-6", "iso-8859-6-e", "iso-8859-6-i", "iso-ir-127", "iso8859-6", "iso88596", "iso_8859-6", "iso_8859-6:1987" |
iso-8859-7 | "csisolatingreek", "ecma-118", "elot_928", "greek", "greek8", "iso-8859-7", "iso-ir-126", "iso8859-7", "iso88597", "iso_8859-7", "iso_8859-7:1987", "sun_eu_greek" |
iso-8859-8 | "csiso88598e", "csisolatinhebrew", "hebrew", "iso-8859-8", "iso-8859-8-e", "iso-ir-138", "iso8859-8", "iso88598", "iso_8859-8", "iso_8859-8:1988", "visual" |
iso-8859-8-i | "csiso88598i", "iso-8859-8-i", "logical" |
iso-8859-10 | "csisolatin6", "iso-8859-10", "iso-ir-157", "iso8859-10", "iso885910", "l6", "latin6" |
iso-8859-13 | "iso-8859-13", "iso8859-13", "iso885913" |
iso-8859-14 | "iso-8859-14", "iso8859-14", "iso885914" |
iso-8859-15 | "csisolatin9", "iso-8859-15", "iso8859-15", "iso885915", "iso_8859-15", "l9" |
iso-8859-16 | "iso-8859-16" |
koi8-r | "cskoi8r", "koi", "koi8", "koi8-r", "koi8_r" |
koi8-u | "koi8-u" |
macintosh | "csmacintosh", "mac", "macintosh", "x-mac-roman" |
windows-874 | "dos-874", "iso-8859-11", "iso8859-11", "iso885911", "tis-620", "windows-874" |
windows-1250 | "cp1250", "windows-1250", "x-cp1250" |
windows-1251 | "cp1251", "windows-1251", "x-cp1251" |
windows-1252 | "ansi_x3.4-1968", "ascii", "cp1252", "cp819", "csisolatin1", "ibm819", "iso-8859-1", "iso-ir-100", "iso8859-1", "iso88591", "iso_8859-1", "iso_8859-1:1987", "l1", "latin1", "us-ascii", "windows-1252", "x-cp1252" |
windows-1253 | "cp1253", "windows-1253", "x-cp1253" |
windows-1254 | "cp1254", "csisolatin5", "iso-8859-9", "iso-ir-148", "iso8859-9", "iso88599", "iso_8859-9", "iso_8859-9:1989", "l5", "latin5", "windows-1254", "x-cp1254" |
windows-1255 | "cp1255", "windows-1255", "x-cp1255" |
windows-1256 | "cp1256", "windows-1256", "x-cp1256" |
windows-1257 | "cp1257", "windows-1257", "x-cp1257" |
windows-1258 | "cp1258", "windows-1258", "x-cp1258" |
x-mac-cyrillic | "x-mac-cyrillic", "x-mac-ukrainian" |
Legacy multi-byte Chinese (simplified) encodings | |
gb18030 | "chinese", "csgb2312", "csiso58gb231280", "gb18030", "gb2312", "gb_2312", "gb_2312-80", "gbk", "iso-ir-58", "x-gbk" |
hz-gb-2312 | "hz-gb-2312" |
Legacy multi-byte Chinese (traditional) encodings | |
big5 | "big5", "big5-hkscs", "cn-big5", "csbig5", "x-x-big5" |
Legacy multi-byte Japanese encodings | |
euc-jp | "cseucpkdfmtjapanese", "euc-jp", "x-euc-jp" |
iso-2022-jp | "csiso2022jp", "iso-2022-jp" |
shift_jis | "csshiftjis", "ms_kanji", "shift-jis", "shift_jis", "sjis", "windows-31j", "x-sjis" |
Legacy multi-byte Korean encodings | |
euc-kr | "cseuckr", "csksc56011987", "euc-kr", "iso-ir-149", "korean", "ks_c_5601-1987", "ks_c_5601-1989", "ksc5601", "ksc_5601", "windows-949" |
Legacy miscellaneous encodings | |
replacement | "csiso2022kr", "iso-2022-cn", "iso-2022-cn-ext", "iso-2022-kr" |
utf-16be | "utf-16be" |
utf-16le | "utf-16", "utf-16le" |
x-user-defined | "x-user-defined" |
All encodings and their labels are also available as a non-normative encodings.json resource.
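The get an encoding steps can be sketched as a lookup over the label table. This is a minimal sketch: `LABELS` carries only a small excerpt of the table, `getAnEncoding` is a hypothetical name, and `null` stands in for failure.

```javascript
// Partial excerpt of the label table; a full implementation would
// carry every row.
const LABELS = new Map([
  ["unicode-1-1-utf-8", "utf-8"], ["utf-8", "utf-8"], ["utf8", "utf-8"],
  ["ascii", "windows-1252"], ["iso-8859-1", "windows-1252"], ["latin1", "windows-1252"],
  ["utf-16", "utf-16le"], ["utf-16le", "utf-16le"], ["utf-16be", "utf-16be"],
  ["sjis", "shift_jis"],
]);

function getAnEncoding(label) {
  // Strip only ASCII whitespace: U+0009, U+000A, U+000C, U+000D, U+0020.
  const trimmed = label.replace(/^[\t\n\f\r ]+|[\t\n\f\r ]+$/g, "");
  // Fold only A-Z so the match is ASCII case-insensitive.
  const lowered = trimmed.replace(/[A-Z]/g, (c) => c.toLowerCase());
  return LABELS.has(lowered) ? LABELS.get(lowered) : null; // null models failure
}

const fromPadded = getAnEncoding("  UTF8\n");
const fromLatin1 = getAnEncoding("Latin1");
const unknown = getAnEncoding("utf-7");
```

Note that "latin1" and "ascii" resolve to windows-1252, as the table prescribes, and that labels absent from the table yield failure rather than a guess.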
Most legacy encodings make use of an index. An index is an ordered list of pointers and corresponding code points. Within an index pointers are unique and code points can be duplicated.
An efficient implementation likely has two indexes per encoding. One optimized for its decoder and one for its encoder.
To find the pointers and their corresponding code points in an index, let lines be the result of splitting the resource's contents on U+000A. Then remove each item in lines that is the empty string or starts with U+0023. Then the pointers and their corresponding code points are found by splitting each item in lines on U+0009. The first subitem is the pointer (as a decimal number) and the second is the corresponding code point (as a hexadecimal number). Other subitems are not relevant.
To signify changes an index includes an Identifier and a Date. If an Identifier has changed, so has the index.
The index code point for pointer in index is the code point corresponding to pointer in index, or null if pointer is not in index.
The index pointer for code point in index is the first pointer corresponding to code point in index, or null if code point is not in index.
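The index file format and the two lookups above can be sketched as follows. The function names are hypothetical, and `SAMPLE` is invented data used only to show the format (tab-separated fields, "#" comment lines, a duplicated code point); it is not taken from any real index.

```javascript
// Parse index text into an ordered list of [pointer, code point] pairs.
function parseIndex(text) {
  const entries = [];
  for (const line of text.split("\n")) {
    if (line === "" || line.startsWith("#")) continue;
    const [pointer, codePoint] = line.split("\t"); // later subitems are ignored
    entries.push([parseInt(pointer, 10), parseInt(codePoint, 16)]);
  }
  return entries;
}

// Code point for a pointer, or null if the pointer is not in the index.
function indexCodePoint(pointer, index) {
  const entry = index.find(([p]) => p === pointer);
  return entry ? entry[1] : null;
}

// First pointer for a code point, or null if the code point is not in the index.
function indexPointer(codePoint, index) {
  const entry = index.find(([, cp]) => cp === codePoint);
  return entry ? entry[0] : null;
}

// Invented sample data; pointer 2 is deliberately missing.
const SAMPLE = "# sample index\n0\t0x0402\tcomment\n1\t0x0403\n\n3\t0x0402\n";
const sampleIndex = parseIndex(SAMPLE);
```

With this sample, `indexCodePoint(2, sampleIndex)` is null, and `indexPointer(0x0402, sampleIndex)` is 0 rather than 3, since the first matching pointer wins.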
These are the indexes defined by this specification, excluding index single-byte, which has its own table:
Index | File | Notes |
---|---|---|
index big5 | index-big5.txt | This matches the Big5 standard in combination with the Hong Kong Supplementary Character Set and other common extensions. |
index euc-kr | index-euc-kr.txt | This matches the KS X 1001 standard and the Unified Hangul Code, more commonly known together as Windows Codepage 949. |
index gb18030 | index-gb18030.txt | This matches the GB18030 standard for code points encoded as two bytes, except 0xA3 0xA0 maps to U+3000 to be compatible with deployed content. |
index gb18030 ranges | index-gb18030-ranges.txt | This index works differently from all others. Listing all code points would result in over a million items whereas they can be represented neatly in 207 ranges combined with trivial limit checks. It therefore only superficially matches the GB18030 standard for code points encoded as four bytes. See also index gb18030 ranges code point and index gb18030 ranges pointer below. |
index jis0208 | index-jis0208.txt | This is the JIS X 0208 standard including formerly proprietary extensions from IBM and NEC. |
index jis0212 | index-jis0212.txt | This is the JIS X 0212 standard. |
The index gb18030 ranges code point for pointer is the return value of these steps:
If pointer is greater than 39419 and less than 189000, or pointer is greater than 1237575, return null.
Let offset be the last pointer in index gb18030 ranges that is equal to or less than pointer and let code point offset be its corresponding code point.
Return a code point whose value is code point offset + pointer − offset.
The index gb18030 ranges pointer for code point is the return value of these steps:
Let offset be the last code point in index gb18030 ranges that is equal to or less than code point and let pointer offset be its corresponding pointer.
Return a pointer whose value is pointer offset + code point − offset.
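The two range lookups can be sketched as below. `RANGES` is a four-row excerpt that is believed to match rows of index-gb18030-ranges.txt but should be treated as illustrative; with only an excerpt, results are meaningful only for pointers and code points that fall in the listed rows. The function names are hypothetical.

```javascript
// Illustrative excerpt of index gb18030 ranges as [pointer, code point]
// pairs, in ascending order. The real index has 207 rows.
const RANGES = [[0, 0x0080], [36, 0x00a5], [38, 0x00a9], [189000, 0x10000]];

function rangesCodePoint(pointer) {
  if ((pointer > 39419 && pointer < 189000) || pointer > 1237575) return null;
  let offset = 0;
  let codePointOffset = 0;
  // Keep the last row whose pointer is equal to or less than pointer.
  for (const [p, cp] of RANGES) {
    if (p <= pointer) { offset = p; codePointOffset = cp; }
  }
  return codePointOffset + pointer - offset;
}

function rangesPointer(codePoint) {
  let offset = 0;
  let pointerOffset = 0;
  // Keep the last row whose code point is equal to or less than codePoint.
  for (const [p, cp] of RANGES) {
    if (cp <= codePoint) { offset = cp; pointerOffset = p; }
  }
  return pointerOffset + codePoint - offset;
}

const lastPointer = rangesCodePoint(1237575); // 0x10000 + (1237575 - 189000)
const roundTrip = rangesPointer(lastPointer);
```

The upper limit 1237575 maps to U+10FFFF because 1237575 − 189000 is 0xFFFFF, which shows how the limit check and the final range together cover all supplementary code points.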
The index shift_jis pointer for code point is the return value of these steps:
Let index be index jis0208 excluding all pointers in the range 8272 to 8835.
Return the index pointer for code point in index.
All indexes are also available as a non-normative indexes.json resource. (index gb18030 ranges has a slightly different format here, to be able to represent ranges.)
The algorithms decode, utf-8 decode, utf-8 decode without BOM, encode, and utf-8 encode are intended for usage by other specifications. utf-8 decode is to be used by new formats. The get an encoding algorithm can be used first to turn a label into an encoding.
To decode a byte stream stream using fallback encoding encoding, run these steps:
Let buffer be an empty byte sequence.
Let BOM seen flag be unset.
Read bytes from stream into buffer until either buffer contains three bytes or read returns end-of-stream.
For each of the rows in the table below, starting with the first one and going down, if the first bytes of buffer match all the bytes given in the first column, then set encoding to the encoding given in the cell in the second column of that row and set BOM seen flag.
Byte order mark | Encoding |
---|---|
0xEF 0xBB 0xBF | utf-8 |
0xFE 0xFF | utf-16be |
0xFF 0xFE | utf-16le |
For compatibility with deployed content, the byte order mark (also known as BOM) is considered more authoritative than anything else.
If BOM seen flag is unset prepend buffer to stream.
Otherwise, if BOM seen flag is set, encoding is not utf-8, and buffer contains three bytes, prepend the last byte of buffer to stream.
Let output be a code point stream.
Run encoding's decoder with stream and output.
Return output.
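The BOM sniffing in the decode steps can be sketched as follows, assuming the input is available as a complete Uint8Array rather than a stream. `sniffBOM` and `decode` are hypothetical names. The actual decoding is delegated to the platform's TextDecoder, with ignoreBOM set so that it does not strip a second BOM on its own.

```javascript
// Rows are tried top to bottom, as in the byte order mark table.
const BOMS = [
  { bytes: [0xef, 0xbb, 0xbf], encoding: "utf-8" },
  { bytes: [0xfe, 0xff], encoding: "utf-16be" },
  { bytes: [0xff, 0xfe], encoding: "utf-16le" },
];

// Returns the encoding to use and how many leading bytes to skip.
function sniffBOM(bytes, fallbackEncoding) {
  for (const bom of BOMS) {
    if (bom.bytes.every((b, i) => bytes[i] === b)) {
      return { encoding: bom.encoding, bomLength: bom.bytes.length };
    }
  }
  return { encoding: fallbackEncoding, bomLength: 0 };
}

function decode(bytes, fallbackEncoding) {
  const { encoding, bomLength } = sniffBOM(bytes, fallbackEncoding);
  // Skipping bomLength bytes models "prepend the unconsumed bytes back".
  const decoder = new TextDecoder(encoding, { ignoreBOM: true });
  return decoder.decode(bytes.subarray(bomLength));
}

// A utf-8 BOM overrides the utf-16le fallback.
const text = decode(Uint8Array.of(0xef, 0xbb, 0xbf, 0x68, 0x69), "utf-16le");
```

Here `text` is "hi": the BOM wins over the fallback encoding passed by the caller.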
To utf-8 decode a byte stream stream, run these steps:
Let buffer be an empty byte sequence.
Read three bytes from stream into buffer.
If buffer does not match 0xEF 0xBB 0xBF, prepend buffer to stream.
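Let output be a code point stream.
Run utf-8's decoder with stream and output.
Return output.
To utf-8 decode without BOM a byte stream stream, run these steps:
Let output be a code point stream.
Run utf-8's decoder with stream and output.
Return output.
To encode a code point stream stream using encoding encoding, run these steps:
Let output be a byte stream.
Run encoding's encoder with stream, output, and HTML.
Return output.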
This is mostly a legacy hook for URLs and HTML forms. Layering utf-8 encode on top is safe as it never triggers errors. [URL] [HTML]
To utf-8 encode a code point stream stream, return the result of encoding stream using encoding utf-8.
If the input to encode or utf-8 encode stems from a DOMString, the convert a DOMString to a sequence of Unicode characters algorithm from Web IDL is to be used first.
This section uses terminology from the DOM and Web IDL. Non-browser user agents are not required to support this API. [DOM] [WEBIDL]
The following example uses the TextEncoder object to encode an array of strings into an ArrayBuffer. The result is a Uint8Array containing the number of strings (as a Uint32Array), followed by the length of the first string (as a Uint32Array), the utf-8 encoded string data, the length of the second string (as a Uint32Array), the string data, and so on.
function encodeArrayOfStrings(strings, encoding) {
  var encoder, encoded, len, i, bytes, view, offset;

  encoder = new TextEncoder(encoding);
  encoded = [];

  // Compute the total size: one Uint32 for the string count, plus a
  // Uint32 length prefix and the encoded bytes for each string.
  len = Uint32Array.BYTES_PER_ELEMENT;
  for (i = 0; i < strings.length; i += 1) {
    len += Uint32Array.BYTES_PER_ELEMENT;
    encoded[i] = encoder.encode(strings[i]); // reuse the single encoder
    len += encoded[i].byteLength;
  }

  bytes = new Uint8Array(len);
  view = new DataView(bytes.buffer);
  offset = 0;

  // Write the string count, then each length-prefixed string.
  view.setUint32(offset, strings.length);
  offset += Uint32Array.BYTES_PER_ELEMENT;
  for (i = 0; i < encoded.length; i += 1) {
    len = encoded[i].byteLength;
    view.setUint32(offset, len);
    offset += Uint32Array.BYTES_PER_ELEMENT;
    bytes.set(encoded[i], offset);
    offset += len;
  }
  return bytes.buffer;
}
The following example decodes an ArrayBuffer containing data encoded in the format produced by the previous example back into an array of strings.
function decodeArrayOfStrings(buffer, encoding) {
  var decoder, view, offset, num_strings, strings, i, len;

  decoder = new TextDecoder(encoding);
  view = new DataView(buffer);
  offset = 0;
  strings = [];

  // Read the string count, then each length-prefixed string.
  num_strings = view.getUint32(offset);
  offset += Uint32Array.BYTES_PER_ELEMENT;
  for (i = 0; i < num_strings; i += 1) {
    len = view.getUint32(offset);
    offset += Uint32Array.BYTES_PER_ELEMENT;
    // Decode exactly len bytes starting at offset.
    strings[i] = decoder.decode(
      new DataView(view.buffer, offset, len));
    offset += len;
  }
  return strings;
}
ScalarValueString
This is a temporary definition until IDL proper is updated to include it. typedef is used for now to keep IDL validity.
typedef DOMString ScalarValueString;
ScalarValueString is identical to DOMString except that convert a DOMString to a sequence of Unicode characters is used subsequently when converting to an IDL value. Ergo, the IDL value is a sequence of scalar values.
Only use ScalarValueString if the subsystem deals in scalar values rather than code points. When in doubt, use DOMString.
TextDecoder
dictionary TextDecoderOptions {
  boolean fatal = false;
  boolean ignoreBOM = false;
};

dictionary TextDecodeOptions {
  boolean stream = false;
};

[Constructor(optional DOMString label = "utf-8", optional TextDecoderOptions options),
 Exposed=Window,Worker]
interface TextDecoder {
  readonly attribute DOMString encoding;
  readonly attribute boolean fatal;
  readonly attribute boolean ignoreBOM;
  DOMString decode(optional ArrayBufferView input, optional TextDecodeOptions options);
};
A TextDecoder object has an associated encoding, decoder, stream, ignore BOM flag (initially unset), BOM seen flag (initially unset), error mode (initially replacement), and streaming flag (initially unset).
A TextDecoder object also has an associated serialize stream algorithm, that given a stream stream, runs these steps:
Let output be the empty string.
While true:
Let token be the result of reading from stream.
If encoding is one of utf-8, utf-16be, and utf-16le, and ignore BOM flag and BOM seen flag are unset, run these subsubsteps:
If token is U+FEFF, set BOM seen flag.
Otherwise, if token is not end-of-stream, set BOM seen flag and append token to output.
Otherwise, return output.
Otherwise, if token is not end-of-stream, append token to output.
Otherwise, return output.
This algorithm is intentionally different with respect to BOM handling from the decode algorithm used by the rest of the platform to give API users more control.
decoder = new TextDecoder([label = "utf-8" [, options]])
Returns a new TextDecoder object.
If label is either not a label or is a label for replacement, throws a TypeError.
decoder . encoding
Returns encoding's name.
decoder . fatal
Returns true if error mode is fatal, and false otherwise.
decoder . ignoreBOM
Returns true if ignore BOM flag is set, and false otherwise.
decoder . decode([input [, options]])
Returns the result of running encoding's decoder. If options's stream is set to true, the method can be invoked multiple times to process a fragmented stream.
If the error mode is fatal and encoding's decoder returns error, throws an "EncodingError".
The TextDecoder(label, options) constructor must run these steps:
Let encoding be the result of getting an encoding from label.
If encoding is failure or replacement, throw a TypeError.
Let dec be a new TextDecoder object.
Set dec's encoding to encoding.
If options's fatal member is true, set dec's error mode to fatal.
If options's ignoreBOM member is true, set dec's ignore BOM flag.
Return dec.
The encoding attribute must return encoding's name.
The fatal attribute must return true if error mode is fatal, and false otherwise.
The ignoreBOM attribute must return true if ignore BOM flag is set, and false otherwise.
The decode(input, options) method must run these steps:
If the streaming flag is unset, set decoder to a new encoding's decoder, set stream to a new stream, and unset the BOM seen flag.
If options's stream is true, set the streaming flag, and unset the streaming flag otherwise.
If input is given, then given input's buffer, byteOffset, and byteLength, push byteLength bytes from buffer, starting at byteOffset, to stream.
Let output be a new stream.
While true:
Let token be the result of reading from stream.
If token is end-of-stream and the streaming flag is set, return output, serialized.
Otherwise, run these subsubsteps:
Let result be the result of processing token for decoder, stream, output, and error mode.
If result is finished, return output, serialized.
Otherwise, if result is error, throw an "EncodingError".
Otherwise, do nothing.
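The TextDecoder options described above can be exercised as in the sketch below, which assumes an environment that exposes a global TextDecoder (current browsers and Node.js do). The exception type on a fatal error is deliberately not checked: this document names "EncodingError", while implementations may throw a different type.

```javascript
// Streaming: the three bytes of U+20AC (0xE2 0x82 0xAC) arrive in two chunks.
const streaming = new TextDecoder("utf-8");
let streamed = streaming.decode(new Uint8Array([0xe2, 0x82]), { stream: true });
streamed += streaming.decode(new Uint8Array([0xac]), { stream: true });
streamed += streaming.decode(); // final call without stream flushes the decoder

// Default error mode (replacement): the invalid byte becomes U+FFFD.
const lenient = new TextDecoder("utf-8").decode(new Uint8Array([0x61, 0xff]));

// Fatal error mode: the same input makes decode() throw.
let threw = false;
try {
  new TextDecoder("utf-8", { fatal: true }).decode(new Uint8Array([0x61, 0xff]));
} catch (e) {
  threw = true;
}

// BOM handling: stripped by default, preserved when ignoreBOM is true.
const withBOM = new Uint8Array([0xef, 0xbb, 0xbf, 0x61]);
const stripped = new TextDecoder("utf-8").decode(withBOM);
const kept = new TextDecoder("utf-8", { ignoreBOM: true }).decode(withBOM);
```

The first chunk of the streamed input yields an empty string because the decoder holds the incomplete sequence until the remaining byte arrives.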
TextEncoder
[Constructor(optional DOMString utfLabel = "utf-8"),
 Exposed=Window,Worker]
interface TextEncoder {
  readonly attribute DOMString encoding;
  Uint8Array encode(optional ScalarValueString input = "");
};
A TextEncoder object has an associated encoding and encoder.
encoder = new TextEncoder([utfLabel = "utf-8"])
Returns a new TextEncoder object.
If utfLabel is not a label for utf-8, utf-16be, or utf-16le, throws a TypeError.
encoder . encoding
Returns encoding's name.
encoder . encode([input = ""])
Returns the result of running encoding's encoder. If options's stream is set to true, the method can be invoked multiple times to process a fragmented stream.
The TextEncoder(utfLabel) constructor must run these steps:
Let encoding be the result of getting an encoding from utfLabel.
If encoding is failure, or is none of utf-8, utf-16be, and utf-16le, throw a TypeError.
Let enc be a new TextEncoder object.
Set enc's encoding to encoding.
Set enc's encoder to a new enc's encoding's encoder.
Return enc.
The encoding attribute must return encoding's name.
The encode(input, options) method must run these steps:
Convert input to a stream.
Let output be a new stream.
While true, run these substeps:
Let token be the result of reading from input.
Let result be the result of processing token for encoder, input, output.
If result is finished, convert output into a byte sequence, and then return a Uint8Array object wrapping an ArrayBuffer containing output.
utf-8's decoder has an associated utf-8 code point, utf-8 bytes seen, and utf-8 bytes needed (all initially 0), a utf-8 lower boundary (initially 0x80), and a utf-8 upper boundary (initially 0xBF).
utf-8's decoder's handler, given a stream and byte, runs these steps:
If byte is end-of-stream and utf-8 bytes needed is not 0, set utf-8 bytes needed to 0 and return error.
If byte is end-of-stream, return finished.
If utf-8 bytes needed is 0, based on byte:
0x00 to 0x7F: Return a code point whose value is byte.
0xC2 to 0xDF: Set utf-8 bytes needed to 1 and utf-8 code point to byte − 0xC0.
0xE0 to 0xEF: If byte is 0xE0, set utf-8 lower boundary to 0xA0. If byte is 0xED, set utf-8 upper boundary to 0x9F. Set utf-8 bytes needed to 2 and utf-8 code point to byte − 0xE0.
0xF0 to 0xF4: If byte is 0xF0, set utf-8 lower boundary to 0x90. If byte is 0xF4, set utf-8 upper boundary to 0x8F. Set utf-8 bytes needed to 3 and utf-8 code point to byte − 0xF0.
Otherwise: Return error.
Then (byte is in the range 0xC2 to 0xF4) set utf-8 code point to utf-8 code point << (6 × utf-8 bytes needed) and return continue.
If byte is not in the range utf-8 lower boundary to utf-8 upper boundary, run these substeps:
Set utf-8 code point, utf-8 bytes needed, and utf-8 bytes seen to 0, set utf-8 lower boundary to 0x80, and set utf-8 upper boundary to 0xBF.
Prepend byte to stream.
Return error.
Set utf-8 lower boundary to 0x80 and utf-8 upper boundary to 0xBF.
Increase utf-8 bytes seen by one and set utf-8 code point to utf-8 code point + (byte − 0x80) << (6 × (utf-8 bytes needed − utf-8 bytes seen)).
If utf-8 bytes seen is not equal to utf-8 bytes needed, return continue.
Let code point be utf-8 code point.
Set utf-8 code point, utf-8 bytes needed, and utf-8 bytes seen to 0.
Return a code point whose value is code point.
The constraints in the utf-8 decoder above match “Best Practices for Using U+FFFD” from the Unicode standard. No other behavior is permitted per the Encoding Standard (other algorithms that achieve the same result are obviously fine, even encouraged).
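The decoder handler above can be sketched as a single loop over a byte array. This is an illustration under assumptions: the function name `utf8Decode` is hypothetical, the lead-byte ranges (0x00 to 0x7F, 0xC2 to 0xDF, 0xE0 to 0xEF, 0xF0 to 0xF4) are the standard UTF-8 ones, and every error is turned into U+FFFD inline, as the replacement error mode would do.

```javascript
// Decodes an array of bytes into an array of code points.
function utf8Decode(bytes) {
  const input = Array.from(bytes);
  const output = [];
  let codePoint = 0, seen = 0, needed = 0, lower = 0x80, upper = 0xbf;
  while (true) {
    if (input.length === 0) {
      // End-of-stream: a pending sequence is an error, then finish.
      if (needed !== 0) output.push(0xfffd);
      return output;
    }
    const byte = input.shift();
    if (needed === 0) {
      if (byte <= 0x7f) { output.push(byte); continue; }
      if (byte >= 0xc2 && byte <= 0xdf) {
        needed = 1; codePoint = byte - 0xc0;
      } else if (byte >= 0xe0 && byte <= 0xef) {
        if (byte === 0xe0) lower = 0xa0; // rejects overlong forms
        if (byte === 0xed) upper = 0x9f; // rejects surrogates
        needed = 2; codePoint = byte - 0xe0;
      } else if (byte >= 0xf0 && byte <= 0xf4) {
        if (byte === 0xf0) lower = 0x90; // rejects overlong forms
        if (byte === 0xf4) upper = 0x8f; // rejects values above U+10FFFF
        needed = 3; codePoint = byte - 0xf0;
      } else { output.push(0xfffd); continue; }
      codePoint = codePoint << (6 * needed);
      continue;
    }
    if (byte < lower || byte > upper) {
      // Reset state, put the byte back so it is reprocessed, report error.
      codePoint = needed = seen = 0; lower = 0x80; upper = 0xbf;
      input.unshift(byte);
      output.push(0xfffd);
      continue;
    }
    lower = 0x80; upper = 0xbf;
    seen += 1;
    codePoint += (byte - 0x80) << (6 * (needed - seen));
    if (seen !== needed) continue;
    output.push(codePoint);
    codePoint = needed = seen = 0;
  }
}

const pile = utf8Decode([0xf0, 0x9f, 0x92, 0xa9]);
const broken = utf8Decode([0xe2, 0x41]);
```

In the second call the lead byte 0xE2 is followed by 0x41, which is outside the boundary range, so the decoder reports one error and then reprocesses 0x41 as an ordinary ASCII byte.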
utf-8's encoder's handler, given a stream and code point, runs these steps:
If code point is end-of-stream, return finished.
If code point is in the range U+0000 to U+007F, return a byte whose value is code point.
Set count and offset based on the range code point is in:
U+0080 to U+07FF: 1 and 0xC0.
U+0800 to U+FFFF: 2 and 0xE0.
U+10000 to U+10FFFF: 3 and 0xF0.
Let bytes be a byte sequence whose first byte is (code point >> (6 × count)) + offset.
Run these substeps while count is greater than 0:
Set temp to code point >> (6 × (count − 1)).
Append to bytes 0x80 | (temp & 0x3F).
Decrease count by one.
Return bytes bytes, in order.
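The encoder handler can be sketched for a single scalar value as below. The function name `utf8EncodeCodePoint` is hypothetical, and the count and offset values per range (1 and 0xC0 up to U+07FF, 2 and 0xE0 up to U+FFFF, 3 and 0xF0 beyond) are the standard UTF-8 ones, stated here as an assumption of the sketch.

```javascript
// Encodes one scalar value into an array of utf-8 bytes.
function utf8EncodeCodePoint(codePoint) {
  if (codePoint <= 0x7f) return [codePoint];
  let count, offset;
  if (codePoint <= 0x7ff) { count = 1; offset = 0xc0; }
  else if (codePoint <= 0xffff) { count = 2; offset = 0xe0; }
  else { count = 3; offset = 0xf0; }
  // Lead byte: the top bits of the code point plus the length marker.
  const bytes = [(codePoint >> (6 * count)) + offset];
  // Continuation bytes: six bits each, most significant group first.
  while (count > 0) {
    const temp = codePoint >> (6 * (count - 1));
    bytes.push(0x80 | (temp & 0x3f));
    count -= 1;
  }
  return bytes;
}

const euro = utf8EncodeCodePoint(0x20ac);
// Cross-check against the platform encoder, where available.
const reference = Array.from(new TextEncoder().encode("\u20ac"));
```

Both `euro` and `reference` hold the bytes 0xE2 0x82 0xAC.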
An encoding where each byte is either a single code point or nothing, is a single-byte encoding. Single-byte encodings share the decoder and encoder. Index single-byte, as referenced by the single-byte decoder and single-byte encoder, is defined by the following table, and depends on the single-byte encoding in use. All but two single-byte encodings have a unique index.
Name | Index |
---|---|
ibm866 | index-ibm866.txt |
iso-8859-2 | index-iso-8859-2.txt |
iso-8859-3 | index-iso-8859-3.txt |
iso-8859-4 | index-iso-8859-4.txt |
iso-8859-5 | index-iso-8859-5.txt |
iso-8859-6 | index-iso-8859-6.txt |
iso-8859-7 | index-iso-8859-7.txt |
iso-8859-8 | index-iso-8859-8.txt |
iso-8859-8-i | index-iso-8859-8.txt |
iso-8859-10 | index-iso-8859-10.txt |
iso-8859-13 | index-iso-8859-13.txt |
iso-8859-14 | index-iso-8859-14.txt |
iso-8859-15 | index-iso-8859-15.txt |
iso-8859-16 | index-iso-8859-16.txt |
koi8-r | index-koi8-r.txt |
koi8-u | index-koi8-u.txt |
macintosh | index-macintosh.txt |
windows-874 | index-windows-874.txt |
windows-1250 | index-windows-1250.txt |
windows-1251 | index-windows-1251.txt |
windows-1252 | index-windows-1252.txt |
windows-1253 | index-windows-1253.txt |
windows-1254 | index-windows-1254.txt |
windows-1255 | index-windows-1255.txt |
windows-1256 | index-windows-1256.txt |
windows-1257 | index-windows-1257.txt |
windows-1258 | index-windows-1258.txt |
x-mac-cyrillic | index-x-mac-cyrillic.txt |
iso-8859-8 and iso-8859-8-i are distinct encoding names, because iso-8859-8 has influence on the layout direction. And although historically this might have been the case for iso-8859-6 and "iso-8859-6-i" as well, that is no longer true.
Single-byte encodings' decoder's handler, given a stream and byte, runs these steps:
If byte is end-of-stream, return finished.
If byte is in the range 0x00 to 0x7F, return a code point whose value is byte.
Let code point be the index code point for byte − 0x80 in index single-byte.
If code point is null, return error.
Return a code point whose value is code point.
Single-byte encodings' encoder's handler, given a stream and code point, runs these steps:
If code point is end-of-stream, return finished.
If code point is in the range U+0000 to U+007F, return a byte whose value is code point.
Let pointer be the index pointer for code point in index single-byte.
If pointer is null, return error with code point.
Return a byte whose value is pointer + 0x80.
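The single-byte decoder and encoder steps above can be sketched as follows. This is an illustration only: `TOY_INDEX` is a made-up two-entry index rather than any of the index files listed in the table, the function names are hypothetical, decoder errors are rendered as U+FFFD, and an encoder error is raised as a `ValueError`.

```python
from typing import List, Optional

def single_byte_decode(data: bytes, index: List[Optional[int]]) -> str:
    out = []
    for byte in data:
        if byte <= 0x7F:
            out.append(chr(byte))
            continue
        code_point = index[byte - 0x80]
        # A null index entry is an error, shown here as U+FFFD.
        out.append("\uFFFD" if code_point is None else chr(code_point))
    return "".join(out)

def single_byte_encode(text: str, index: List[Optional[int]]) -> bytes:
    out = bytearray()
    for ch in text:
        code_point = ord(ch)
        if code_point <= 0x7F:
            out.append(code_point)
            continue
        if code_point not in index:
            raise ValueError(f"U+{code_point:04X} is not in the index")
        # list.index returns the first matching pointer.
        out.append(index.index(code_point) + 0x80)
    return bytes(out)

# Hypothetical index with only two populated pointers.
TOY_INDEX: List[Optional[int]] = [None] * 128
TOY_INDEX[0x00] = 0x20AC  # byte 0x80
TOY_INDEX[0x29] = 0x00A9  # byte 0xA9
```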
gb18030's decoder has an associated gb18030 first, gb18030 second, and gb18030 third (all initially 0x00).
gb18030's decoder's handler, given a stream and byte, runs these steps:
If byte is end-of-stream and gb18030 first, gb18030 second, and gb18030 third are 0x00, return finished.
If byte is end-of-stream, and gb18030 first, gb18030 second, or gb18030 third is not 0x00, set gb18030 first, gb18030 second, and gb18030 third to 0x00, and return error.
If gb18030 third is not 0x00, run these substeps:
Let code point be null.
If byte is in the range 0x30 to 0x39, set code point to the index gb18030 ranges code point for (((gb18030 first − 0x81) × 10 + gb18030 second − 0x30) × 126 + gb18030 third − 0x81) × 10 + byte − 0x30.
Let buffer be a byte sequence consisting of gb18030 second, gb18030 third, and byte, in order.
Set gb18030 first, gb18030 second, and gb18030 third to 0x00.
If code point is null, prepend buffer to stream and return error.
Return a code point whose value is code point.
If gb18030 second is not 0x00, run these substeps:
If byte is in the range 0x81 to 0xFE, set gb18030 third to byte and return continue.
Prepend gb18030 second followed by byte to stream, set gb18030 first and gb18030 second to 0x00, and return error.
If gb18030 first is not 0x00, run these substeps:
If byte is in the range 0x30 to 0x39, set gb18030 second to byte and return continue.
Let lead be gb18030 first, let pointer be null, and set gb18030 first to 0x00.
Let offset be 0x40 if byte is less than 0x7F and 0x41 otherwise.
If byte is in the range 0x40 to 0x7E or 0x80 to 0xFE, set pointer to (lead − 0x81) × 190 + (byte − offset).
Let code point be null if pointer is null and the index code point for pointer in index gb18030 otherwise.
If pointer is null, prepend byte to stream.
If code point is null, return error.
Return a code point whose value is code point.
If byte is in the range 0x00 to 0x7F, return a code point whose value is byte.
If byte is 0x80, return code point U+20AC.
If byte is in the range 0x81 to 0xFE, set gb18030 first to byte and return continue.
Return error.
gb18030's encoder's handler, given a stream and code point, runs these steps:
If code point is end-of-stream, return finished.
If code point is in the range U+0000 to U+007F, return a byte whose value is code point.
Let pointer be the index pointer for code point in index gb18030.
If pointer is not null, run these substeps:
Let lead be pointer / 190 + 0x81.
Let trail be pointer % 190.
Let offset be 0x40 if trail is less than 0x3F and 0x41 otherwise.
Return two bytes whose values are lead and trail + offset.
Set pointer to the index gb18030 ranges pointer for code point.
Let byte1 be pointer / 10 / 126 / 10.
Set pointer to pointer − byte1 × 10 × 126 × 10.
Let byte2 be pointer / 10 / 126.
Set pointer to pointer − byte2 × 10 × 126.
Let byte3 be pointer / 10.
Let byte4 be pointer − byte3 × 10.
Return four bytes whose values are byte1 + 0x81, byte2 + 0x30, byte3 + 0x81, byte4 + 0x30.
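The four-byte arithmetic used by the gb18030 encoder, and its inverse from the decoder, can be sketched as below. The sketch covers only the pointer arithmetic; looking the pointer up in index gb18030 ranges is left out, and both function names are hypothetical.

```python
def gb18030_ranges_pointer_to_bytes(pointer: int) -> bytes:
    """Split an index gb18030 ranges pointer into four bytes (encoder side)."""
    byte1 = pointer // 10 // 126 // 10
    pointer -= byte1 * 10 * 126 * 10
    byte2 = pointer // 10 // 126
    pointer -= byte2 * 10 * 126
    byte3 = pointer // 10
    byte4 = pointer - byte3 * 10
    return bytes([byte1 + 0x81, byte2 + 0x30, byte3 + 0x81, byte4 + 0x30])

def gb18030_bytes_to_ranges_pointer(data: bytes) -> int:
    """Recombine four bytes into a pointer (decoder side)."""
    first, second, third, fourth = data
    return (((first - 0x81) * 10 + second - 0x30) * 126
            + third - 0x81) * 10 + fourth - 0x30
```

The two functions are inverses of each other for any pointer whose first byte stays within 0x81 to 0xFE.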
The hz-gb-2312 encoding is considered for removal.
hz-gb-2312's decoder has an associated hz-gb-2312 flag (initially unset) and hz-gb-2312 lead (initially 0x00).
hz-gb-2312's decoder's handler, given a stream and byte, runs these steps:
If byte is end-of-stream and hz-gb-2312 lead is not 0x00, set hz-gb-2312 lead to 0x00 and return error.
If byte is end-of-stream and hz-gb-2312 lead is 0x00, return finished.
If hz-gb-2312 lead is 0x7E, set hz-gb-2312 lead to 0x00, and based on byte:
0x7B: Set the hz-gb-2312 flag and return continue.
0x7D: Unset the hz-gb-2312 flag and return continue.
0x7E: Return code point U+007E.
0x0A: Return continue.
If hz-gb-2312 lead is not 0x00, let lead be hz-gb-2312 lead, set hz-gb-2312 lead to 0x00, and then run these substeps:
If byte is in the range 0x21 to 0x7E, let code point be the index code point for (lead − 1) × 190 + (byte + 0x3F) in index gb18030.
If byte is 0x0A, unset the hz-gb-2312 flag.
If code point is null, return error.
Return a code point whose value is code point.
If byte is 0x7E, set hz-gb-2312 lead to 0x7E and return continue.
If the hz-gb-2312 flag is set:
If byte is in the range 0x20 to 0x7F, set hz-gb-2312 lead to byte and return continue.
If byte is 0x0A, unset the hz-gb-2312 flag.
Return error.
If byte is in the range 0x00 to 0x7F, return a code point whose value is byte.
Return error.
hz-gb-2312's encoder has an associated hz-gb-2312 flag.
hz-gb-2312's encoder's handler, given a stream and code point, runs these steps:
If code point is end-of-stream, return finished.
If code point is in the range U+0000 to U+007F and the hz-gb-2312 flag is set, prepend code point to stream, unset the hz-gb-2312 flag, and return two bytes 0x7E 0x7D.
If code point is 0x007E, return two bytes 0x7E 0x7E.
If code point is in the range U+0000 to U+007F, return a byte whose value is code point.
Let pointer be the index pointer for code point in index gb18030.
If pointer is null, return error with code point.
If the hz-gb-2312 flag is unset, prepend code point to stream, set the hz-gb-2312 flag, and return two bytes 0x7E 0x7B.
Let lead be pointer / 190 + 1.
Let trail be pointer % 190 − 0x3F.
If either lead or trail is less than 0x21, return error with code point.
Return two bytes whose values are lead and trail.
big5's decoder has an associated big5 lead (initially 0x00).
big5's decoder's handler, given a stream and byte, runs these steps:
If byte is end-of-stream and big5 lead is not 0x00, set big5 lead to 0x00 and return error.
If byte is end-of-stream and big5 lead is 0x00, return finished.
If big5 lead is not 0x00, let lead be big5 lead, let pointer be null, set big5 lead to 0x00, and then run these substeps:
Let offset be 0x40 if byte is less than 0x7F and 0x62 otherwise.
If byte is in the range 0x40 to 0x7E or 0xA1 to 0xFE, set pointer to (lead − 0x81) × 157 + (byte − offset).
If there is a row in the table below whose first column is pointer, return the two code points listed in its second column (the third column is irrelevant):
Pointer | Code points | Notes |
---|---|---|
1133 | U+00CA U+0304 | Ê̄ (LATIN CAPITAL LETTER E WITH CIRCUMFLEX AND MACRON) |
1135 | U+00CA U+030C | Ê̌ (LATIN CAPITAL LETTER E WITH CIRCUMFLEX AND CARON) |
1164 | U+00EA U+0304 | ê̄ (LATIN SMALL LETTER E WITH CIRCUMFLEX AND MACRON) |
1166 | U+00EA U+030C | ê̌ (LATIN SMALL LETTER E WITH CIRCUMFLEX AND CARON) |
Since indexes are limited to single code points this table is used for these pointers.
Let code point be null if pointer is null and the index code point for pointer in index big5 otherwise.
If pointer is null and byte is in the range 0x00 to 0x7F, prepend byte to stream.
If code point is null, return error.
Return a code point whose value is code point.
If byte is in the range 0x00 to 0x7F, return a code point whose value is byte.
If byte is in the range 0x81 to 0xFE, set big5 lead to byte and return continue.
Return error.
big5's encoder's handler, given a stream and code point, runs these steps:
If code point is end-of-stream, return finished.
If code point is in the range U+0000 to U+007F, return a byte whose value is code point.
Let pointer be the index pointer for code point in index big5.
If pointer is null, return error with code point.
Let lead be pointer / 157 + 0x81.
If lead is less than 0xA1, return error with code point.
Avoid returning Hong Kong Supplementary Character Set extensions literally.
Let trail be pointer % 157.
Let offset be 0x40 if trail is less than 0x3F and 0x62 otherwise.
Return two bytes whose values are lead and trail + offset.
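The relationship between a big5 pointer and its two bytes can be sketched as follows. Only the arithmetic is shown; the lookup in index big5 is omitted, and the function names are hypothetical. `None` stands in for null (decoder) or for the "return error with code point" case (encoder).

```python
from typing import Optional

def big5_bytes_to_pointer(lead: int, byte: int) -> Optional[int]:
    """Decoder side: compute the pointer, or None for an invalid trail byte."""
    if not (0x40 <= byte <= 0x7E or 0xA1 <= byte <= 0xFE):
        return None
    offset = 0x40 if byte < 0x7F else 0x62
    return (lead - 0x81) * 157 + (byte - offset)

def big5_pointer_to_bytes(pointer: int) -> Optional[bytes]:
    """Encoder side: None when the lead byte would fall below 0xA1."""
    lead = pointer // 157 + 0x81
    if lead < 0xA1:
        return None
    trail = pointer % 157
    offset = 0x40 if trail < 0x3F else 0x62
    return bytes([lead, trail + offset])
```

With these definitions, the bytes 0x88 0x62 give pointer 1133, the first row of the table above, while the encoder side refuses that pointer because its lead byte is below 0xA1.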
euc-jp's decoder has an associated euc-jp jis0212 flag (initially unset) and euc-jp lead (initially 0x00).
euc-jp's decoder's handler, given a stream and byte, runs these steps:
If byte is end-of-stream and euc-jp lead is not 0x00, set euc-jp lead to 0x00, and return error.
If byte is end-of-stream and euc-jp lead is 0x00, return finished.
If euc-jp lead is 0x8E and byte is in the range 0xA1 to 0xDF, set euc-jp lead to 0x00 and return a code point whose value is 0xFF61 + byte − 0xA1.
If euc-jp lead is 0x8F and byte is in the range 0xA1 to 0xFE, set the euc-jp jis0212 flag, set euc-jp lead to byte, and return continue.
If euc-jp lead is not 0x00, let lead be euc-jp lead, set euc-jp lead to 0x00, and run these substeps:
Let code point be null.
If lead and byte are both in the range 0xA1 to 0xFE, set code point to the index code point for (lead − 0xA1) × 94 + byte − 0xA1 in index jis0208 if the euc-jp jis0212 flag is unset and in index jis0212 otherwise.
Unset the euc-jp jis0212 flag.
If byte is not in the range 0xA1 to 0xFE, prepend byte to stream.
If code point is null, return error.
Return a code point whose value is code point.
If byte is in the range 0x00 to 0x7F, return a code point whose value is byte.
If byte is 0x8E, 0x8F, or in the range 0xA1 to 0xFE, set euc-jp lead to byte and return continue.
Return error.
euc-jp's encoder's handler, given a stream and code point, runs these steps:
If code point is end-of-stream, return finished.
If code point is in the range U+0000 to U+007F, return a byte whose value is code point.
If code point is U+00A5, return byte 0x5C.
If code point is U+203E, return byte 0x7E.
If code point is in the range U+FF61 to U+FF9F, return two bytes whose values are 0x8E and code point − 0xFF61 + 0xA1.
Let pointer be the index pointer for code point in index jis0208.
If pointer is null, return error with code point.
Let lead be pointer / 94 + 0xA1.
Let trail be pointer % 94 + 0xA1.
Return two bytes whose values are lead and trail.
The index jis0212 is not used by the euc-jp encoder due to lack of widespread support.
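The byte arithmetic in the euc-jp encoder can be sketched as below. The lookup in index jis0208 is not included; the caller is assumed to already hold the pointer. Both function names are hypothetical.

```python
def euc_jp_katakana_bytes(code_point: int) -> bytes:
    """Half-width katakana (U+FF61 to U+FF9F) become 0x8E plus one byte."""
    if not 0xFF61 <= code_point <= 0xFF9F:
        raise ValueError("not a half-width katakana code point")
    return bytes([0x8E, code_point - 0xFF61 + 0xA1])

def euc_jp_pointer_to_bytes(pointer: int) -> bytes:
    """Turn an index jis0208 pointer into a lead and trail byte."""
    lead = pointer // 94 + 0xA1
    trail = pointer % 94 + 0xA1
    return bytes([lead, trail])
```

As an example, pointer 283 (row 4, cell 2 of JIS X 0208, the hiragana letter A) yields the bytes 0xA4 0xA2.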
iso-2022-jp's decoder has an associated iso-2022-jp state (initially ASCII state), iso-2022-jp jis0212 flag (initially unset), and iso-2022-jp lead (initially 0x00).
iso-2022-jp's decoder's handler, given a stream and byte, runs these steps, switching on iso-2022-jp state:
ASCII state:
Based on byte:
0x1B: Set iso-2022-jp state to escape start state and return continue.
0x00 to 0x7F: Return a code point whose value is byte.
end-of-stream: Return finished.
Otherwise: Return error.
escape start state:
If byte is either 0x24 or 0x28, set iso-2022-jp lead to byte, iso-2022-jp state to escape middle state, and return continue.
If byte is not end-of-stream, prepend byte to stream.
Set iso-2022-jp state to ASCII state and return error.
escape middle state:
Let lead be iso-2022-jp lead and set iso-2022-jp lead to 0x00.
If lead is 0x24 and byte is either 0x40 or 0x42, unset the iso-2022-jp jis0212 flag, set iso-2022-jp state to lead state, and return continue.
If lead is 0x24 and byte is 0x28, set iso-2022-jp state to escape final state and return continue.
If lead is 0x28 and byte is either 0x42 or 0x4A, set iso-2022-jp state to ASCII state and return continue.
If lead is 0x28 and byte is 0x49, set iso-2022-jp state to Katakana state and return continue.
Let buffer be byte if byte is end-of-stream, and two bytes lead byte otherwise.
Prepend buffer to stream.
Set iso-2022-jp state to ASCII state and return error.
escape final state:
If byte is 0x44, set the iso-2022-jp jis0212 flag, set iso-2022-jp state to lead state, and return continue.
Let buffer be two bytes 0x28 byte, if byte is end-of-stream, and three bytes 0x24 0x28 byte otherwise.
Prepend buffer to stream.
Set iso-2022-jp state to ASCII state and return error.
lead state:
Based on byte:
0x0A: Set iso-2022-jp state to ASCII state and return code point U+000A.
0x1B: Set iso-2022-jp state to escape start state and return continue.
end-of-stream: Return finished.
Otherwise: Set iso-2022-jp lead to byte, iso-2022-jp state to trail state, and return continue.
trail state:
Set the iso-2022-jp state to lead state.
If byte is end-of-stream, return error.
Let code point be null and let pointer be (iso-2022-jp lead − 0x21) × 94 + byte − 0x21.
If iso-2022-jp lead and byte are both in the range 0x21 to 0x7E, set code point to the index code point for pointer in index jis0208, if the iso-2022-jp jis0212 flag is unset, and in index jis0212 otherwise.
If code point is null, return error.
Return a code point whose value is code point.
Katakana state:
Based on byte:
0x1B: Set iso-2022-jp state to escape start state and return continue.
0x21 to 0x5F: Return a code point whose value is 0xFF61 + byte − 0x21.
end-of-stream: Return finished.
Otherwise: Return error.
iso-2022-jp's encoder has an associated iso-2022-jp state.
iso-2022-jp's encoder's handler, given a stream and code point, runs these steps:
If code point is end-of-stream, return finished.
If code point is in the range U+0000 to U+007F, or is U+00A5 or U+203E, and iso-2022-jp state is not ASCII state, prepend code point to stream, set iso-2022-jp state to ASCII state, and return three bytes 0x1B 0x28 0x42.
If code point is in the range U+0000 to U+007F, return a byte whose value is code point.
If code point is U+00A5, return byte 0x5C.
If code point is U+203E, return byte 0x7E.
If code point is in the range U+FF61 to U+FF9F and iso-2022-jp state is not Katakana state, prepend code point to stream, set iso-2022-jp state to Katakana state, and return three bytes 0x1B 0x28 0x49.
If code point is in the range U+FF61 to U+FF9F, return a byte whose value is code point − 0xFF61 + 0x21.
Let pointer be the index pointer for code point in index jis0208.
If pointer is null, return error with code point.
If iso-2022-jp state is not lead state, prepend code point to stream, set iso-2022-jp state to lead state, and return three bytes 0x1B 0x24 0x42.
Let lead be pointer / 94 + 0x21.
Let trail be pointer % 94 + 0x21.
Return two bytes whose values are lead and trail.
The index jis0212 is not used by the iso-2022-jp encoder due to lack of widespread support.
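The state switching in the iso-2022-jp encoder can be sketched as follows. This is an illustration under stated assumptions: the function name is hypothetical, the index jis0208 lookup is passed in as a hypothetical callable `jis0208_pointer` that returns a pointer or `None`, and "return error with code point" is modelled as a `ValueError`. Instead of prepending the code point to the stream, the sketch emits the escape sequence and then continues with the same code point, which produces the same bytes.

```python
from typing import Callable, Optional

ASCII, KATAKANA, LEAD = "ASCII", "Katakana", "lead"

def iso_2022_jp_encode(text: str,
                       jis0208_pointer: Callable[[int], Optional[int]]) -> bytes:
    state = ASCII
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp <= 0x7F or cp in (0x00A5, 0x203E):
            if state != ASCII:
                state = ASCII
                out += b"\x1b(B"  # 0x1B 0x28 0x42
            out.append(0x5C if cp == 0x00A5 else 0x7E if cp == 0x203E else cp)
        elif 0xFF61 <= cp <= 0xFF9F:
            if state != KATAKANA:
                state = KATAKANA
                out += b"\x1b(I"  # 0x1B 0x28 0x49
            out.append(cp - 0xFF61 + 0x21)
        else:
            pointer = jis0208_pointer(cp)
            if pointer is None:
                # The error is raised before any escape sequence is written.
                raise ValueError(f"U+{cp:04X} is not encodable")
            if state != LEAD:
                state = LEAD
                out += b"\x1b$B"  # 0x1B 0x24 0x42
            out += bytes([pointer // 94 + 0x21, pointer % 94 + 0x21])
    return bytes(out)

# Hypothetical one-entry stand-in for index jis0208.
toy_lookup = {0x3042: 283}.get
```

As in the steps above, nothing is emitted at end-of-stream, so the output can end in a state other than ASCII state.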
shift_jis's decoder has an associated shift_jis lead (initially 0x00).
shift_jis's decoder's handler, given a stream and byte, runs these steps:
If byte is end-of-stream and shift_jis lead is not 0x00, set shift_jis lead to 0x00 and return error.
If byte is end-of-stream and shift_jis lead is 0x00, return finished.
If shift_jis lead is not 0x00, let lead be shift_jis lead, let pointer be null, set shift_jis lead to 0x00, and then run these substeps:
Let offset be 0x40, if byte is less than 0x7F, and 0x41 otherwise.
Let lead offset be 0x81, if lead is less than 0xA0, and 0xC1 otherwise.
If byte is in the range 0x40 to 0x7E or 0x80 to 0xFC, set pointer to (lead − lead offset) × 188 + byte − offset.
Let code point be null, if pointer is null, and the index code point for pointer in index jis0208 otherwise.
If code point is null and pointer is in the range 8836 to 10528, return a code point whose value is 0xE000 + pointer − 8836.
This is interoperable legacy from Windows known as EUDC.
If pointer is null, prepend byte to stream.
If code point is null, return error.
Return a code point whose value is code point.
If byte is in the range 0x00 to 0x80, return a code point whose value is byte.
If byte is in the range 0xA1 to 0xDF, return a code point whose value is 0xFF61 + byte − 0xA1.
If byte is in the range 0x81 to 0x9F or 0xE0 to 0xFC, set shift_jis lead to byte and return continue.
Return error.
shift_jis's encoder's handler, given a stream and code point, runs these steps:
If code point is end-of-stream, return finished.
If code point is in the range U+0000 to U+0080, return a byte whose value is code point.
If code point is U+00A5, return byte 0x5C.
If code point is U+203E, return byte 0x7E.
If code point is in the range U+FF61 to U+FF9F, return a byte whose value is code point − 0xFF61 + 0xA1.
Let pointer be the index shift_jis pointer for code point.
If pointer is null, return error with code point.
Let lead be pointer / 188.
Let lead offset be 0x81, if lead is less than 0x1F, and 0xC1 otherwise.
Let trail be pointer % 188.
Let offset be 0x40, if trail is less than 0x3F, and 0x41 otherwise.
Return two bytes whose values are lead + lead offset and trail + offset.
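The lead and trail arithmetic shared by the shift_jis decoder and encoder can be sketched as below. The index lookups are omitted, the function names are hypothetical, and `None` stands in for a null pointer.

```python
from typing import Optional

def shift_jis_bytes_to_pointer(lead: int, byte: int) -> Optional[int]:
    """Decoder side: pointer for a lead/trail pair, or None if byte is invalid."""
    if not (0x40 <= byte <= 0x7E or 0x80 <= byte <= 0xFC):
        return None
    offset = 0x40 if byte < 0x7F else 0x41
    lead_offset = 0x81 if lead < 0xA0 else 0xC1
    return (lead - lead_offset) * 188 + byte - offset

def shift_jis_pointer_to_bytes(pointer: int) -> bytes:
    """Encoder side: split a pointer into lead and trail bytes."""
    lead = pointer // 188
    lead_offset = 0x81 if lead < 0x1F else 0xC1
    trail = pointer % 188
    offset = 0x40 if trail < 0x3F else 0x41
    return bytes([lead + lead_offset, trail + offset])
```

For pointer 283 this gives 0x82 0xA0. The two offsets exist because the lead byte skips 0xA0 to 0xDF and the trail byte skips 0x7F.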
euc-kr's decoder has an associated euc-kr lead (initially 0x00).
euc-kr's decoder's handler, given a stream and byte, runs these steps:
If byte is end-of-stream and euc-kr lead is not 0x00, set euc-kr lead to 0x00 and return error.
If byte is end-of-stream and euc-kr lead is 0x00, return finished.
If euc-kr lead is not 0x00, let lead be euc-kr lead, let pointer be null, set euc-kr lead to 0x00, and then run these substeps:
If lead is in the range 0x81 to 0xC6, let temp be (26 + 26 + 126) × (lead − 0x81), and then set pointer to the result of the equation below, depending on byte:
0x41 to 0x5A: temp + byte − 0x41
0x61 to 0x7A: temp + 26 + byte − 0x61
0x81 to 0xFE: temp + 26 + 26 + byte − 0x81
If lead is in the range 0xC7 to 0xFE and byte is in the range 0xA1 to 0xFE, set pointer to (26 + 26 + 126) × (0xC7 − 0x81) + (lead − 0xC7) × 94 + (byte − 0xA1).
Let code point be null, if pointer is null, and the index code point for pointer in index euc-kr otherwise.
If pointer is null, prepend byte to stream.
If code point is null, return error.
Return a code point whose value is code point.
If byte is in the range 0x00 to 0x7F, return a code point whose value is byte.
If byte is in the range 0x81 to 0xFE, set euc-kr lead to byte and return continue.
Return error.
euc-kr's encoder's handler, given a stream and code point, runs these steps:
If code point is end-of-stream, return finished.
If code point is in the range U+0000 to U+007F, return a byte whose value is code point.
Let pointer be the index pointer for code point in index euc-kr.
If pointer is null, return error with code point.
If pointer is less than (26 + 26 + 126) × (0xC7 − 0x81), run these substeps:
Let lead be pointer / (26 + 26 + 126) + 0x81.
Let trail be pointer % (26 + 26 + 126).
Let offset be 0x41 if trail is less than 26, 0x47 if trail is less than 26 + 26, and 0x4D otherwise.
Return two bytes whose values are lead and trail + offset.
Set pointer to pointer − (26 + 26 + 126) × (0xC7 − 0x81).
Let lead be pointer / 94 + 0xC7.
Let trail be pointer % 94 + 0xA1.
Return two bytes whose values are lead and trail.
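The euc-kr pointer arithmetic can be sketched as follows. The index euc-kr lookup is omitted and the function names are hypothetical. The three trail-byte ranges on the decoder side (0x41 to 0x5A, 0x61 to 0x7A, 0x81 to 0xFE) are inferred from the 26 + 26 + 126 layout and from the encoder offsets 0x41, 0x47 and 0x4D.

```python
from typing import Optional

ROW = 26 + 26 + 126            # trail positions per lead byte below 0xC7
SPLIT = ROW * (0xC7 - 0x81)    # first pointer that uses the 94-wide layout

def euc_kr_bytes_to_pointer(lead: int, byte: int) -> Optional[int]:
    if 0x81 <= lead <= 0xC6:
        temp = ROW * (lead - 0x81)
        if 0x41 <= byte <= 0x5A:
            return temp + byte - 0x41
        if 0x61 <= byte <= 0x7A:
            return temp + 26 + byte - 0x61
        if 0x81 <= byte <= 0xFE:
            return temp + 26 + 26 + byte - 0x81
        return None
    if 0xC7 <= lead <= 0xFE and 0xA1 <= byte <= 0xFE:
        return SPLIT + (lead - 0xC7) * 94 + (byte - 0xA1)
    return None

def euc_kr_pointer_to_bytes(pointer: int) -> bytes:
    if pointer < SPLIT:
        lead = pointer // ROW + 0x81
        trail = pointer % ROW
        offset = 0x41 if trail < 26 else 0x47 if trail < 26 + 26 else 0x4D
        return bytes([lead, trail + offset])
    pointer -= SPLIT
    return bytes([pointer // 94 + 0xC7, pointer % 94 + 0xA1])
```

For example, the bytes 0xB0 0xA1 correspond to pointer 8450 under this arithmetic.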
The replacement encoding exists to prevent certain attacks that abuse a mismatch between encodings supported on the server and the client.
replacement's decoder has an associated replacement error returned flag (initially unset).
replacement's decoder's handler, given a stream and byte, runs these steps:
If replacement error returned flag is unset, set the replacement error returned flag and return error.
Return finished.
replacement's encoder is utf-8's encoder.
To convert a code unit to bytes using a utf-16be flag, run these steps:
Let byte1 be code unit >> 8.
Let byte2 be code unit & 0x00FF.
Then return the bytes in order:
utf-16be flag is set: byte1, then byte2.
utf-16be flag is unset: byte2, then byte1.
A byte order mark has priority over a label as it has been found to be more accurate in deployed content. Therefore it is not part of the shared utf-16 decoder algorithm but rather the decode algorithm.
shared utf-16 decoder has an associated utf-16 lead byte and utf-16 lead surrogate (both initially null), and utf-16be flag (no initial value).
shared utf-16 decoder's handler, given a stream and byte, runs these steps:
If byte is end-of-stream and either utf-16 lead byte or utf-16 lead surrogate is not null, set utf-16 lead byte and utf-16 lead surrogate to null, and return error.
If byte is end-of-stream and utf-16 lead byte and utf-16 lead surrogate are null, return finished.
If utf-16 lead byte is null, set utf-16 lead byte to byte and return continue.
Let code unit be the result of:
utf-16be flag is set: (utf-16 lead byte << 8) + byte.
utf-16be flag is unset: (byte << 8) + utf-16 lead byte.
Then set utf-16 lead byte to null.
If utf-16 lead surrogate is not null, let lead surrogate be utf-16 lead surrogate, set utf-16 lead surrogate to null, and then run these substeps:
If code unit is in the range U+DC00 to U+DFFF, return a code point whose value is 0x10000 + ((lead surrogate − 0xD800) << 10) + (code unit − 0xDC00).
Prepend the sequence resulting from converting code unit to bytes using utf-16be flag to stream and return error.
If code unit is in the range U+D800 to U+DBFF, set utf-16 lead surrogate to code unit and return continue.
If code unit is in the range U+DC00 to U+DFFF, return error.
Return code point code unit.
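The shared utf-16 decoder steps above can be sketched as a single loop. This is a non-normative illustration: the function name is hypothetical, end-of-stream is modelled as `None`, and every error is rendered as U+FFFD (an assumed error-handling mode rather than part of the handler itself).

```python
from collections import deque

def utf16_decode(data: bytes, utf16be: bool) -> str:
    stream = deque(data)
    lead_byte = None
    lead_surrogate = None
    out = []
    while True:
        byte = stream.popleft() if stream else None  # None = end-of-stream
        if byte is None:
            if lead_byte is not None or lead_surrogate is not None:
                out.append("\uFFFD")  # truncated code unit or lone lead surrogate
            break
        if lead_byte is None:
            lead_byte = byte
            continue
        code_unit = (lead_byte << 8) + byte if utf16be else (byte << 8) + lead_byte
        lead_byte = None
        if lead_surrogate is not None:
            lead, lead_surrogate = lead_surrogate, None
            if 0xDC00 <= code_unit <= 0xDFFF:
                out.append(chr(0x10000 + ((lead - 0xD800) << 10)
                               + (code_unit - 0xDC00)))
                continue
            # Put the code unit back, in stream byte order, and report an error.
            byte1, byte2 = code_unit >> 8, code_unit & 0x00FF
            stream.extendleft([byte2, byte1] if utf16be else [byte1, byte2])
            out.append("\uFFFD")
            continue
        if 0xD800 <= code_unit <= 0xDBFF:
            lead_surrogate = code_unit
            continue
        if 0xDC00 <= code_unit <= 0xDFFF:
            out.append("\uFFFD")
            continue
        out.append(chr(code_unit))
    return "".join(out)
```

Note that `deque.extendleft` inserts its items in reverse, so the lists above are written back to front to leave the bytes in their original order at the head of the stream.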
shared utf-16 encoder has an associated utf-16be flag.
shared utf-16 encoder's handler, given a stream and code point, runs these steps:
If code point is end-of-stream, return finished.
If code point is in the range 0x00 to 0xFFFF, return the sequence resulting from converting code point to bytes using utf-16be flag.
Let lead be ((code point − 0x10000) >> 10) + 0xD800, converted to bytes using utf-16be flag.
Let trail be ((code point − 0x10000) & 0x3FF) + 0xDC00, converted to bytes using utf-16be flag.
Return a byte sequence of lead followed by trail.
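The code unit conversion and the shared utf-16 encoder steps can be sketched together as below. The function names are hypothetical and the input is assumed to be a Unicode scalar value.

```python
def code_unit_to_bytes(code_unit: int, utf16be: bool) -> bytes:
    """Convert one 16-bit code unit to two bytes in the requested order."""
    byte1 = code_unit >> 8
    byte2 = code_unit & 0x00FF
    return bytes([byte1, byte2]) if utf16be else bytes([byte2, byte1])

def utf16_encode_code_point(code_point: int, utf16be: bool) -> bytes:
    """Encode one code point, using a surrogate pair above U+FFFF."""
    if code_point <= 0xFFFF:
        return code_unit_to_bytes(code_point, utf16be)
    lead = ((code_point - 0x10000) >> 10) + 0xD800
    trail = ((code_point - 0x10000) & 0x3FF) + 0xDC00
    return code_unit_to_bytes(lead, utf16be) + code_unit_to_bytes(trail, utf16be)
```

For U+1F600 the lead surrogate is 0xD83D and the trail surrogate is 0xDE00, so the big-endian output is 0xD8 0x3D 0xDE 0x00.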
utf-16be's decoder is shared utf-16 decoder with its utf-16be flag set.
utf-16be's encoder is shared utf-16 encoder with its utf-16be flag set.
Both "utf-16" and "utf-16le" are labels for utf-16le to deal with deployed content.
utf-16le's decoder is shared utf-16 decoder with its utf-16be flag unset.
utf-16le's encoder is shared utf-16 encoder with its utf-16be flag unset.
While technically x-user-defined is a single-byte encoding, it is defined separately as it can be implemented algorithmically.
x-user-defined's decoder's handler, given a stream and byte, runs these steps:
If byte is end-of-stream, return finished.
If byte is in the range 0x00 to 0x7F, return a code point whose value is byte.
Return a code point whose value is 0xF780 + byte − 0x80.
x-user-defined's encoder's handler, given a stream and code point, runs these steps:
If code point is end-of-stream, return finished.
If code point is in the range U+0000 to U+007F, return a byte whose value is code point.
If code point is in the range U+F780 to U+F7FF, return a byte whose value is code point − 0xF780 + 0x80.
Return error with code point.
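Because x-user-defined is purely algorithmic, both handlers fit in a few lines. The following is a minimal sketch with hypothetical function names, where "return error with code point" is modelled as a `ValueError`.

```python
def x_user_defined_decode(data: bytes) -> str:
    # Bytes 0x80 to 0xFF map to U+F780 to U+F7FF; ASCII maps to itself.
    return "".join(
        chr(byte) if byte <= 0x7F else chr(0xF780 + byte - 0x80)
        for byte in data
    )

def x_user_defined_encode(text: str) -> bytes:
    out = bytearray()
    for ch in text:
        code_point = ord(ch)
        if code_point <= 0x7F:
            out.append(code_point)
        elif 0xF780 <= code_point <= 0xF7FF:
            out.append(code_point - 0xF780 + 0x80)
        else:
            raise ValueError(f"U+{code_point:04X} is not encodable")
    return bytes(out)
```

Every byte value round-trips, which is why this encoding has been used to carry binary data through text-oriented APIs.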
Many people have helped make encodings more interoperable over the years and thereby furthered the goals of this standard. Likewise, many people have helped make this standard what it is today.
Ideally they are all listed here so please contact the editor with any omissions.
With that, many thanks to Alan Chaney, Alexander Shtuchkin, Allen Wirfs-Brock, Asmus Freytag, Ben Noordhuis, Boris Zbarsky, Cameron McCormack, Charles McCathieNeville, David Carlisle, Dominique Hazaël-Massieux, Doug Ewell, Erik van der Poel, 譚永鋒 (Frank Yung-Fong Tang), Sam Sneddon, Glenn Maynard, Gordon P. Hemsley, Henri Sivonen, Ian Hickson, James Graham, John Tamplin, Joshua Bell, 신정식 (Jungshik Shin), 강 성훈 (Kang Seonghoon), 川幡太一 (Kawabata Taichi), Ken Lunde, Ken Whistler, Kenneth Russell, Leif Halvard Silli, Makoto Kato, Mark Callow, Mark Davis, Martin Dürst, Masatoshi Kimura, Ms2ger, Nigel Tao, Norbert Lindenberg, Øistein E. Andersen, Peter Krefting, Philip Jägenstedt, Philip Taylor, Richard Ishida, Robbert Broersma, Robert Mustacchi, Ryan Dahl, Shawn Steele, Simon Montagu, Simon Pieters, Simon Sapin, Vyacheslav Matva, and 成瀬ゆい (Yui Naruse) for being awesome.