This article provides guidelines for the migration of software and data to Unicode. It covers planning the migration, and design and implementation of Unicode-enabled software.
A basic understanding of Unicode and the principles of character encoding is assumed. Some sources for information about these include:
There are a number of reasons for adopting Unicode:
Note that simply changing the character encoding of your pages to Unicode will not eliminate all character encoding problems. In fact, during the migration there is a significantly increased risk of such bugs, because existing data must be converted to Unicode, and the current encoding is not always known. This document provides tips on how to minimize this risk, and how to provide mechanisms to correct character conversion issues.
To scope the migration to Unicode, you need to understand the use of character encodings in your current setup and decide on the internal and external use of character encodings for the Unicode-based design. You also need to know the state of Unicode support in software components you rely on, and where needed, the migration plans for these components. This enables you to plan the upgrade of your software to be based on Unicode, and the conversion of existing data to Unicode encodings.
A project to migrate to Unicode may also be a good time to improve internationalization in general. In particular, you should consider whether you can use the multilingual capabilities of Unicode to break down unnecessary barriers between different audiences, cultures, or languages. Especially for sites or applications that enable communication between users and thus host or transmit user-generated content, it may make sense to have a single worldwide site with shared multilingual content, despite having several localized user interfaces.
As a starting point, you need to thoroughly understand how character encodings are used in your current software. Identify the components of the software and the data containers: front end, back end, storage, APIs, web interfaces, and so on, and clarify their use of encodings:
The last question may be surprising, but is particularly important. Lack of correct information about the character encoding used for text that is coming in from outside the site (such as content feeds or user input) or that's already in your data collections is a common problem, and needs particular attention. (Actually, you need to pay attention to such things even if you're not converting to Unicode.) There are a variety of ways this lack of correct information may come about:
To deal with such situations, character encoding detection is commonly used. Encoding detection attempts to determine the encoding used in a byte sequence based on characteristics of the byte sequence itself. In most cases it's a statistical process that needs long input byte sequences to work well, although you may be able to improve its accuracy by using other information available to your application. Because of the high error rate, it's often necessary to provide ways for humans to discover and correct errors. This requires keeping the original byte sequence available for later reconversion. Examples of encoding detection libraries include:
Software often depends on other software for its implementation:
You need to check whether the software you depend on supports Unicode, or at least doesn't put obstacles into your adopting it. It will commonly be necessary to upgrade to newer versions of underlying platforms, and in a few cases it will be necessary to migrate from obsolete platforms to newer ones.
Unicode offers three encoding forms: UTF-8, UTF-16, and UTF-32. For transportation over the network or for storage in files UTF-8 usually works the best because it is ASCII-compatible, while the ASCII-look-alike bytes contained in UTF-16 and UTF-32 text are a problem for some network devices or file processing tools. For in-memory processing, all three encoding forms can be useful, and the best choice often depends on the programming platforms and libraries you use: Java, JavaScript, ICU, and most Windows APIs are based on UTF-16, while Unix systems tend to prefer UTF-8. Storage size is rarely a factor in deciding between UTF-8 and UTF-16 because either one can have a better size profile, depending on the mix of markup and European or Asian languages. UTF-32 is inefficient for storage and therefore rarely used for that purpose, but it is very convenient for processing, and some libraries, such as Java and ICU, provide string accessors and processing API in terms of UTF-32 code points. Conversion between the three encoding forms is fast and safe, so it's quite feasible and common to use different encoding forms in different components of large software systems.
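Conversion between the three encoding forms is indeed lossless and cheap; a quick illustration in Python:

```python
s = "Unicode \u6587\u5b57"  # mixed ASCII and CJK text

utf8 = s.encode("utf-8")       # ASCII-compatible; each CJK character takes 3 bytes
utf16 = s.encode("utf-16-le")  # 2 bytes per BMP character
utf32 = s.encode("utf-32-le")  # fixed 4 bytes per code point

# All three forms round-trip to the identical string.
assert utf8.decode("utf-8") == utf16.decode("utf-16-le") == utf32.decode("utf-32-le") == s
```

For this sample (8 ASCII code points plus 2 CJK characters), the UTF-8 form is 14 bytes, UTF-16 is 20, and UTF-32 is 40, which illustrates why the size comparison depends on the mix of ASCII markup and non-Latin text.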
Storage of text whose character encoding is not known with certainty is an exception from the Unicode-only rule. Such text often has to be interpreted using character encoding detection. And character encoding detection is not a reliable process. Thus, you should keep the original bytes around (along with the detected character encoding) so that the text can be reconverted if a human corrects the encoding selection.
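One way to keep the original bytes available for later reconversion is to store them next to the detected encoding, and re-decode on demand. A minimal sketch (the record layout and field names here are illustrative, not from the original article):

```python
from dataclasses import dataclass

@dataclass
class StoredText:
    raw: bytes     # original byte sequence, never discarded
    encoding: str  # detected (possibly wrong) character encoding

    def text(self) -> str:
        # 'replace' keeps the text displayable even while the guess is wrong
        return self.raw.decode(self.encoding, errors="replace")

# Byte 0x80 was misdetected as ISO 8859-1, where it is an invisible control character...
record = StoredText(raw=b"price: \x80 5", encoding="iso-8859-1")
# ...a human corrects the encoding, and the text is simply re-decoded:
record.encoding = "cp1252"
print(record.text())  # price: € 5
```

Because the raw bytes were kept, the correction is a metadata update, not a data migration.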
For communicating with the world outside your application, UTF-8 should be used wherever possible. However, there are situations where you can't control the encoding, or need to communicate with systems that don't support UTF-8. Here are recommendations for common cases:
Email: You need to accept incoming email in whatever encoding it uses. For outgoing email, you might have to take into consideration that many older email applications don't support UTF-8. For the time being, outgoing email should therefore be converted to a legacy encoding that can represent its entire content; UTF-8 should only be used if no such legacy encoding can be found. The encoding of outgoing email must be specified. Selecting which outbound encodings to use will depend on your application's needs.
URIs and POST data: These are used in several different contexts: form submissions, web services, or URLs entered directly into a web browser. For form submissions, the HTML form should be designed to identify the character encoding even if the user changes it in the browser. (You can use JavaScript or special field values to determine if the user has changed the encoding: perhaps you should gently remind the user not to do that?)
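One common variant of the "special field values" trick is a hidden form field containing a known non-ASCII character; on submission, the server compares the received bytes against the expected encoding of that character. A sketch (the field name and sentinel character are assumptions):

```python
# The form would include: <input type="hidden" name="_enc_probe" value="€">
# If the page was served as UTF-8 and the user did not switch encodings,
# the probe arrives as the UTF-8 bytes of U+20AC.
EXPECTED_PROBE = "\u20ac".encode("utf-8")  # b'\xe2\x82\xac'

def form_encoding_intact(probe_bytes: bytes) -> bool:
    """True if the hidden sentinel arrived in the encoding the form was served in."""
    return probe_bytes == EXPECTED_PROBE

assert form_encoding_intact(b"\xe2\x82\xac")  # browser kept UTF-8
assert not form_encoding_intact(b"\x80")      # user switched to windows-1252
```

When the check fails, the application can reconvert the submission from the suspected encoding or ask the user to resubmit.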
Web services: Web services (especially REST-based Web services) should specify the use of UTF-8 and reject requests that are not valid UTF-8. For third party web services, use UTF-8 if supported by the web service and ensure correct identification of the character encoding used. For URIs entered directly into a web browser (such as marketing URIs), other encodings may need to be supported.
HTML: When serving pages to desktop browsers, use UTF-8; support for it is now almost universal. Mobile browsers don't always support UTF-8, so you may have to use a legacy encoding depending on the target device. The encoding used should be declared using the HTTP Content-Type header, the HTML meta tag, or (preferably) both. If you cannot control the configuration of your server on a file or request level, or can't set it to send UTF-8 as the encoding, then at least make sure that it does not send an encoding at all: user agents pay attention to the HTTP-declared encoding before any declared in a meta tag. The correct HTML meta tag is
<meta http-equiv="Content-Type" content="text/html;charset=utf-8">
How to set the HTTP Content-Type header depends on your runtime environment. For example, in PHP, you'd use
<?php header("Content-type: text/html; charset=UTF-8"); ?>
See more information on setting the content-type.
If you take content from outside your site (such as when you spider for a search engine or include external HTML), you'll need to deal with whatever encoding that material is in. In some cases this means accepting and processing the encoding used. If you incorporate that material into your own pages, you'll need to make the encoding consistent with your page (or put it into an iframe where it can use its original encoding).
XML: Outgoing XML should always be encoded in UTF-8; the XML specification requires every XML parser to understand it. For XML data sources, specify the use of UTF-8. In other cases, XML files in other encodings need to be accepted as long as they are valid. Note that the HTTP Content-Type text/xml defaults to US-ASCII (for this reason, the application/xml and application/*+xml types are usually preferred), so you'll still need to specify the charset if you use the text/xml Content-Type.
JSON: Outgoing JSON data should always be encoded in UTF-8 and preferably in ASCII with \u escapes for all non-ASCII characters. For incoming data, support for UTF-16 and UTF-32 might also be considered. The JSON specification does not allow any other encodings.
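In Python, for instance, the default JSON serialization already produces the recommended ASCII form with \u escapes:

```python
import json

payload = {"city": "Z\u00fcrich", "price": "\u20ac10"}
encoded = json.dumps(payload)  # ensure_ascii=True is the default
print(encoded)                 # {"city": "Z\u00fcrich", "price": "\u20ac10"}

# The result is pure ASCII (hence valid UTF-8) and survives any ASCII-safe channel.
assert all(ord(c) < 128 for c in encoded)
assert json.loads(encoded) == payload  # the escapes round-trip losslessly
```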
Serialized PHP: This data format should be avoided because it does not allow the specification of the encoding used, and is therefore likely to lead to data corruption for non-ASCII text. JSON is a good alternative. If you absolutely can't avoid serialized PHP, specify and use UTF-8.
Other data feeds: Where you can influence the character encoding of incoming feeds, it should be UTF-8, or at least well-specified. There may be cases though where you want to use feeds that you can't control, in which case you have to use whatever you get.
Generally, incoming text should be converted to a Unicode encoding as soon as possible, and outgoing text, if it has to be sent in a non-Unicode encoding, converted from Unicode to that other encoding as late as possible. However, if the encoding of incoming text cannot be determined with certainty, then the original text must be stored along with information about the likely encoding. This enables corrective action if it turns out that the encoding was wrong.
For very simple sites or applications it may be possible to change the entire software to be based on Unicode, convert all data to a Unicode encoding, and switch over from the pre-Unicode version to the Unicode version in one instant. But many sites or applications offer external interfaces, have large bodies of code, and have accumulated huge data sets, so their conversion is a big project with multiple dependencies that needs to be carefully planned. Here's a breakdown into likely sub-projects:
Convert your databases and/or data stores to Unicode. If the databases were accessed directly by other products, this may have to wait until those products have migrated to the APIs introduced in the first step. The section Migrating Data has more information on this step.
Some of these sub-projects can be executed in parallel or in a different order, depending on the specific situation of your product. For example, migration of the implementation of your product may be held up by dependencies on other software components that haven't sufficiently progressed in their migration yet. On the other hand, SQL databases can be migrated to Unicode much earlier because the client component of the database software insulates clients from the encoding used in the database and performs character encoding conversion when necessary. Migrating databases early on has benefits: It simplifies testing, because the database can be tested independently from the software using it, while testing higher-level software typically requires a database, and it may allow you to merge multiple separate databases using legacy encodings into a single multilingual database.
Byte sequences can only be correctly interpreted as text if the character encoding is known. Many applications are written such that they just move around byte sequences, without naming the character encoding. As discussed above, this has always caused problems. But it happened to work in many cases in which users all speak the same language or are willing to adapt to some content being incorrectly rendered on the page. During the transition to Unicode, however, each language will be handled in at least two encodings, the legacy encoding for that language and UTF-8, so specifying the encoding for each byte sequence will be critical in order to avoid an avalanche of data corruption bugs.
Character encodings can be specified in a variety of ways:
In the format specification: The specification for a data format may specify either the character encoding directly or a simple deterministic mechanism for detecting the character encoding by looking at the beginning of a byte sequence. Examples are the specification of the Java String class, which specifies UTF-16, and the JSON specification, which prescribes the use of Unicode encodings and how to distinguish between them.
As part of the byte sequence: The specification for a data format may provide a mechanism to specify the character encoding as part of the byte sequence. The XML specification does so in an elegant way using the encoding declaration, the HTML specification in a less elegant way using the meta tag. For data formats that allow such specification, the data should contain the encoding specification unless the byte sequence is in UTF-8 and the specification ensures correct detection of UTF-8 (as for XML). For HTML files, the meta tag specifying the content type and character encoding should be the first sub-element of the head element, and it must not be preceded by non-ASCII characters.
In data external to the byte sequence: In many cases, a container of the byte sequence provides the encoding specification. Examples include HTTP or MIME, where the Content-Type header field can specify the character encoding, and databases, where the encoding is specified as part of the schema or the database configuration. Again, where such capabilities exist, textual data should make use of them. In some cases, such as sending HTML over HTTP, the external encoding specification may duplicate one that's part of the byte sequence. That's a good thing, because the HTTP header is preferred by browsers, while the meta tag is the only specification that survives if the file is saved to disk.
In interface specifications: Specifications of interfaces that accept or return byte sequences without any of the character encoding specifications above can (and should) specify the character encoding used. The specification may be absolute or relative to an environment setting. For example, some libraries provide functions that accept or return strings in UTF-8, while others accept or return strings encoded in some legacy encoding.
By context: The character encoding may be derived from the context in which the byte sequence occurs. For example, browsers generally send form data in the character encoding of the web page that contains the form. This is a very weak form of specification because the byte sequence often is transferred out of the context, such as into log files, where the encoding can no longer be reconstructed. Also, users sometimes change the browser encoding so that the encoding of the returned form data no longer matches the encoding of the page that the web application generated. Using UTF-8 is one way to minimize the damage resulting from this weak form of encoding specification because it eliminates the need for users to change the browser encoding and because character encoding detection works better for UTF-8 than for most other encodings.
There is a standard for naming character encodings on the internet, RFC 2978, and an associated IANA charset registry. However, actual use is often different. Many encodings come in different variants or have siblings supporting extended character sets, and different software often uses different names for the same encoding or the same name for different encodings. For example, the name ISO-8859-1 is often used to describe data that actually uses the encoding windows-1252. This latter encoding (Microsoft Windows code page 1252) is very similar to ISO 8859-1 but assigns graphic characters to the range of bytes between 0x80 and 0x9F. Many Web applications (such as browsers, search engines, etc.) treat content bearing the ISO 8859-1 label as using the windows-1252 encoding instead, since, for all practical purposes, windows-1252 is a "superset" of ISO 8859-1. Other applications, such as encoding converters (like iconv or ICU) are pretty literal, and you must specify the right encoding name in order to get the proper results.
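The difference is easy to demonstrate with a literal converter: bytes in the range 0x80 to 0x9F decode to graphic characters under windows-1252 but to invisible C1 control characters under a strict ISO 8859-1 conversion. In Python, whose codecs are literal in this sense:

```python
data = b"\x93smart quotes\x94 and a euro: \x80"

print(data.decode("cp1252"))      # curly quotes and a euro sign appear
print(data.decode("iso-8859-1"))  # the same bytes become C1 control characters

# A literal converter maps byte 0x80 to U+0080, not to the euro sign:
assert data.decode("iso-8859-1")[-1] == "\x80"
assert data.decode("cp1252")[-1] == "\u20ac"
```

This is why data labeled ISO-8859-1 but actually containing windows-1252 bytes survives in browsers yet breaks when run through a strict converter with the wrong name.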
Whenever a byte sequence is interpreted as text and processed, its character encoding must be known. In many cases determining the character encoding is so trivial that it's not even thought about - for example, when processing a string in a programming language that specifies that strings are encoded in UTF-16. However, in other cases, no clear specification of the character encoding is available, or the text comes from a source that may not be fully trusted to provide a correct specification. In such cases, a more complicated process is necessary to determine the character encoding and to enable later correction of mistakes made:
When sending text, the appropriate character encoding needs to be selected based on the data format and the recipient. The section Deciding on Character Encoding Use for External Interfaces discusses encoding use based on data formats. In most cases, a Unicode encoding is recommended. However, there are two major exceptions:
Email: Many older email applications don't support UTF-8. For the time being, email should therefore be converted to a legacy encoding that can represent its entire content; UTF-8 should only be used if no such legacy encoding can be found.
Mobile browsers: Mobile systems don't always support UTF-8. It may therefore be necessary to select other encodings based on the specific device.
No matter which encoding is used, the character encoding really must be unambiguously specified using one of the mechanisms described in the Character Encoding Specifications section.
Whenever text is expected to be in one character encoding in one place and in a different character encoding in the next place, encoding conversion is necessary. Some commonly used libraries for character encoding conversion are ICU and iconv, however, some platforms, such as Java and Perl, bring along their own conversion libraries.
When using the libraries, it is important to use the right encoding names for the specific library. See the section Character Encoding Names above for more information.
There are some specific conversion issues that may affect particular products:
Private use characters: Several character encodings, including Unicode and most East Asian encodings, have code point ranges that are reserved for private use or just undefined. These are often used for company specific or personal use characters — the emoji defined by Japanese mobile operators are an example. Standard character converters don't know how to map such characters. For applications where support for private use characters is critical, you either have to use custom character converters or use workarounds such as numeric character references to ensure correct mapping.
Versions of character encodings and mappings: Many character encodings evolve over time, and so do the mappings between them. An example is the mapping from HKSCS to Unicode: Early versions of HKSCS had to map numerous characters into the Unicode private use area because Unicode didn't support the characters, but later these characters were added to the Unicode character repertoire, and the mappings from HKSCS were changed to map to the newly added characters. In general, you should make sure you're using the latest versions of the character converters.
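As an illustration of the numeric-character-reference workaround mentioned above for private use characters, a sketch that shields BMP private-use code points before a lossy conversion (the range shown is the BMP private use area; real emoji handling would use operator-specific mapping tables):

```python
def pua_to_ncr(text: str) -> str:
    """Replace BMP private-use characters (U+E000..U+F8FF) with decimal NCRs
    so they survive conversion to an encoding that cannot represent them."""
    return "".join(
        f"&#{ord(ch)};" if 0xE000 <= ord(ch) <= 0xF8FF else ch
        for ch in text
    )

print(pua_to_ncr("logo:\ue000!"))  # logo:&#57344;!
```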
Some characters have more than one way of being represented in Unicode. Unicode defines several ways of eliminating these differences when they do not matter to text processing. For more information on Normalization, see CharMod-Norm.
Unicode does not prescribe when to use a specific Unicode normalization form. However, a number of processes work better if text is normalized, in particular processes involving text comparison such as collation, search, and regular expression processing. Some libraries performing these processes offer normalization as part of the process; otherwise, you should ensure that text is normalized before using these processes. Generally, normalization form C (NFC) is recommended for web applications. However, some processes, such as internationalized domain names, use other normalization forms.
Some languages will require normalization before processing since different input methods may generate different sequences of Unicode codepoints. Vietnamese is a prime example where the Vietnamese keyboard layout in Windows 2000 onwards produces a different character sequence than most other third party Vietnamese input software does. Similar issues arise in a number of African languages, Yoruba being the first that comes to mind.
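The difference between such input sequences is invisible on screen but breaks plain comparison unless the text is normalized. For example, in Python:

```python
import unicodedata

composed = "\u1ebf"           # Vietnamese ế as one precomposed code point
decomposed = "e\u0302\u0301"  # e + combining circumflex + combining acute

assert composed != decomposed  # renders identically, compares unequal
assert unicodedata.normalize("NFC", decomposed) == composed
```

Normalizing both sides to NFC before comparison, search, or indexing makes the two input methods' output interchangeable.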
Storing text as Unicode often takes more space than storing it in legacy encodings. The exact amount of expansion depends on the language and particular text involved. Expansions for some common encodings might be as much as:
Source Encoding        | Languages           | UTF-8 | UTF-16
-----------------------|---------------------|-------|-------
ASCII                  | English, Malay, ... | 0%    | 100%
ISO-8859-1             | Western European    | 10%   | 100%
ISO-8859-7, plain text | Greek               | 90%   | 100%
ISO-8859-7, 50% markup | Greek               | 45%   | 100%
TIS-620, plain text    | Thai                | 190%  | 100%
TIS-620, 50% markup    | Thai                | 95%   | 100%
EUC-KR, plain text     | Korean              | 50%   | 5%
EUC-KR, 50% markup     | Korean              | 25%   | 55%
At a macro level, this doesn't really matter much. Network bandwidth and storage nowadays are dominated by videos, images, and sound files, while text only consumes a fraction. There may be an impact on storage systems that store only text. If text size is really a concern, it can be reduced using compression.
At the micro level, however, the increased storage size has a number of implications:
To work with Unicode, it is often advantageous to use software libraries that are dedicated to Unicode support. Older libraries may support Unicode less well, or not at all.
While Unicode enables multilingual applications and documents, there are many processes that require knowledge about the actual language in use. Such processes range from simple case-folding to searching and spelling checking.
Unicode-based APIs should therefore enable the specification of the language(s) used wherever such knowledge may be needed, and the language of user-generated content should be recorded where possible. Where the language cannot be captured at the source, a language detection library may be useful.
To help other applications, the language of web content, where known, should be declared using the HTTP Content-Language header or the HTML/XML lang attributes.
Web sites using Unicode need to be more careful about specifying fonts than web sites using legacy encodings. Many languages have unique or specific writing traditions, even though they share a script with other languages. In other cases, font support can be a barrier because the fonts necessary to display specific scripts are not installed on most systems.
The Chinese and Japanese writing systems, for example, share a large number of characters but have different typographic traditions, so Chinese fonts are generally not acceptable for Japanese text and vice versa. Here is the same character being displayed using a Chinese and a Japanese font (along with the HTML code used to generate the screen capture):
<span style="font-size:3em;font-family:sans-serif;">
  <span lang="zh-Hans-CN" style="font-family: simsun, hei, sans-serif;">直</span>
  <span lang="ja" style="font-family: 'ms gothic', osaka;">直</span>
</span>
When legacy encodings are used, browsers often guess the language from the encoding and pick an appropriate font.
Since Unicode supports both Chinese and Japanese, this trick doesn't work for Unicode-encoded pages, and the result can be an inappropriate font or even an ugly mix of fonts being used to render the content.
One solution is to keep track of the language in use, and communicate both the language and the preferred fonts for the language to the browser. For monolingual pages, using a language-specific style sheet is a simple and effective approach. For multilingual pages, you should use the lang attribute on HTML tags to identify the language; a few browsers use this information as guidance in selecting the right font. For precise control over the font you can also use classes to identify the language and class selectors in the style sheet to set the font. The CSS 2.1 language pseudo-class selector, which would select directly based on language attributes, isn't supported by Internet Explorer and so is of limited usefulness. See Test results: Language-dependent styling.
Converting the data associated with a product will in many cases be the biggest challenge in migrating the product to Unicode. For example, some applications own or access a number of databases, some of which are managed by database engines such as Oracle or MySQL. Others use custom file formats and access mechanisms. These databases, regardless of type, need to be migrated to support Unicode.
Migration of the data to Unicode is also a good time to consider consolidating databases that were previously separate because of different character encodings. Using a single database worldwide or just a few for the main regions may simplify deployment and maintenance, and may enable content sharing between different markets, and Unicode is ideal for this because it can represent text in all languages. Consolidation will however have to keep in mind that other restrictions on content sharing may remain - such as language availability, licensing conditions, and legal or cultural restrictions on publishing material related to politics, religion, sex, and violence.
Strategies for conversion of the data will vary based on a number of factors:
Because of variations in these factors, there's no simple recipe that can be followed in converting the databases of a product. The following is a discussion of common considerations; however, it will generally be necessary to create a tailored conversion plan for each product. Such a plan will likely have several phases for analysis, conversion with checking of the conversion results, and recovery if something goes wrong.
As mentioned in Text Size Issues (above), converting text to Unicode generally results in expanded storage requirements, and you need to carefully consider whether to measure text lengths in bytes, characters, or UTF-16 code units. To reduce the impact of increased field sizes, it may make sense to switch CHAR fields in SQL databases to VARCHAR, thus allowing the database to allocate just as much space as needed.
On text measurement, some databases don't give you a choice. For example, MySQL always measures in terms of Unicode BMP characters, reserving 3 bytes per character. Others, such as Oracle, let you choose between character or byte semantics. Other storage systems that impose size limits are likely to measure in bytes.
During the migration, which involves encoding conversion, be careful to avoid truncation. In some cases, unfortunately, you may not be able to do so because of external constraints, such as Oracle's limit of 30 bytes for schema object names in data dictionaries (use of ASCII characters for schema names helps avoid this issue). In such cases, at least make sure to truncate at a character boundary.
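Where a byte limit must be enforced on UTF-8 data, the cut should fall on a character boundary rather than in the middle of a multi-byte sequence. A simple sketch:

```python
def truncate_utf8(data: bytes, max_bytes: int) -> bytes:
    """Cut UTF-8 data to at most max_bytes without splitting a character.
    Decoding with errors='ignore' drops a trailing incomplete sequence."""
    return data[:max_bytes].decode("utf-8", errors="ignore").encode("utf-8")

name = "na\u00efve".encode("utf-8")  # b'na\xc3\xafve' (6 bytes)
print(truncate_utf8(name, 3))        # b'na'        - the split 2-byte ï is dropped
print(truncate_utf8(name, 4))        # b'na\xc3\xaf' - cut lands on a boundary
```

A naive `data[:max_bytes]` would instead leave a dangling lead byte that makes the whole field invalid UTF-8.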
Note also that there can be text expansion due to translation. See Text size in translation.
It is worthwhile identifying data sets (files, database tables, database columns) that are entirely in ASCII. If the desired Unicode encoding is UTF-8, no conversion is necessary for such data sets because ASCII byte sequences are identical to the corresponding UTF-8 byte sequences. Indices over ASCII text fields are also valid for the corresponding UTF-8 or UTF-16 text fields, unless they are based on language-sensitive sort orders. However, you have to be strict in identifying ASCII data sets. The term "ASCII" is often mistakenly used for things that aren't ASCII, such as plain text (in any encoding) or text in the ISO 8859-1 or Windows-1252 encodings. In addition, a number of character encodings have been designed to fit into 7-bit byte sequences while representing completely different character sets from ASCII.
To verify that the data set is indeed in ASCII, check for the following:
The byte values 0x0E, 0x0F, and 0x1B are not used. The presence of these byte values is likely to indicate that the data is really in some ISO-2022 encoding.
If the characters "+" and "-" occur, the intervening bytes are not valid base64. If they are, it is possible that the text is in the UTF-7 encoding. Note that UTF-7 should be avoided wherever possible because of the potential for XSS attacks.
The character sequences "~{" and "~}" don't occur. If they do, the text is likely to be in HZ encoding.
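The checks above can be combined into a quick screening function. A sketch (the base64 test shown is a simplification of the real UTF-7 check):

```python
import re

def looks_like_ascii(data: bytes) -> bool:
    """Screen a byte sequence for the pitfalls listed above."""
    if any(b > 0x7F for b in data):
        return False  # not 7-bit at all
    if any(b in (0x0E, 0x0F, 0x1B) for b in data):
        return False  # shift/escape bytes: likely some ISO-2022 encoding
    if re.search(rb"\+[A-Za-z0-9+/]{2,}-", data):
        return False  # plausible UTF-7 shifted sequence between "+" and "-"
    if b"~{" in data or b"~}" in data:
        return False  # likely HZ encoding
    return True

assert looks_like_ascii(b"plain old ASCII text")
assert not looks_like_ascii(b"+ZyVnLIqe-")  # UTF-7 shifted sequence
```

A function like this can only flag suspects; data that trips a check still needs human or statistical confirmation of its actual encoding.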
As mentioned earlier, it sometimes occurs that databases contain text whose encoding isn't known. Character encoding detection can be used to get an idea of the encoding, but this process is not reliable. To deal with the uncertainty, a number of additional steps may be necessary:
For simplicity, the following sections assume that the encoding can be determined with certainty and conversion therefore is a one-time event. Where this is not the case, strategies need to be adjusted.
Databases holding user-generated content often contain numeric character references (NCRs) for non-ASCII characters that users have entered, such as "&#338;" (Œ) or "&#8364;" (€). Many browsers generate the NCRs when users enter text into form fields that cannot be expressed in the form's character encoding. NCRs work fine if the text is subsequently redisplayed in HTML. They do not work, however, for other processes: they don't match the text they represent in searching, they get sorted in the wrong place, and they're ignored by case conversion. Migration to Unicode is therefore also a good time to convert NCRs to the corresponding Unicode characters. You'll need to be careful however to avoid conversions that change the meaning of the text (as might a conversion of "&amp;" to "&") or conversion to text that would have been filtered out for security reasons.
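A conversion along these lines must leave the HTML-significant characters escaped. A sketch handling decimal NCRs only (hexadecimal NCRs and named entities would need the same treatment):

```python
import re

# & < > " ' would change the meaning of the markup if unescaped
KEEP_ESCAPED = {38, 60, 62, 34, 39}

def expand_ncrs(text: str) -> str:
    """Replace decimal NCRs with the characters they denote, except for
    characters that must stay escaped in HTML."""
    def repl(match):
        cp = int(match.group(1))
        return match.group(0) if cp in KEEP_ESCAPED else chr(cp)
    return re.sub(r"&#(\d+);", repl, text)

print(expand_ncrs("&#338;uvre for &#8364;5 &#38; up"))  # Œuvre for €5 &#38; up
```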
During the migration from legacy encodings to Unicode, it's common to use legacy encodings and Unicode in parallel, and you need to be able to distinguish between them. In the general case, this requires character encoding specifications. However, if you need to distinguish between just one specific legacy encoding (such as the site's old default encoding) and UTF-8, you might use the Unicode byte order mark (BOM) as a prefix to identify UTF-8 strings. This is particularly handy if there is no provision for a character encoding specification, for example in plain text files or in cookies. The BOM in UTF-8 is the byte sequence 0xEF 0xBB 0xBF, which is very unlikely to be meaningful in any legacy encoding.
A reader for data that identifies its encoding in this way reads the first three bytes to determine the encoding. If the bytes match the BOM, the three bytes are stripped off and the remaining content returned as UTF-8. If they don't, the entire content is converted from the legacy encoding to UTF-8. This stripping, however, isn't automatic and it interferes with some platforms or languages. For example, a PHP file that starts with a BOM won't be interpreted properly by the PHP processor. So this trick is best confined to well-known parts of your site or code.
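Such a reader can be sketched as follows (the legacy encoding shown is an assumption standing in for a site's old default):

```python
UTF8_BOM = b"\xef\xbb\xbf"

def read_mixed(data: bytes, legacy_encoding: str = "cp1252") -> str:
    """Decode data that is UTF-8 if (and only if) it starts with a BOM,
    and in the site's old default encoding otherwise."""
    if data.startswith(UTF8_BOM):
        return data[len(UTF8_BOM):].decode("utf-8")
    return data.decode(legacy_encoding)

assert read_mixed(UTF8_BOM + "d\u00e9j\u00e0".encode("utf-8")) == "d\u00e9j\u00e0"
assert read_mixed("d\u00e9j\u00e0".encode("cp1252")) == "d\u00e9j\u00e0"
```

The matching writer simply prefixes the BOM to every newly written UTF-8 value, so that the stock of legacy data shrinks over time without a flag-day conversion.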
Plain text files that use a single character encoding are easy to convert. For example, the iconv tool is available on most Unix/Linux systems. On systems that don't have it, a convenient approach is to install a Java Development Kit and use its native2ascii tool:
native2ascii -encoding sourceencoding sourcefile | native2ascii -reverse -encoding targetencoding > targetfile
For small numbers of files, editors can also be used: TextPad on Windows, TextEdit on Mac, or jEdit on any platform are just a few editors that can convert files. Note that some editors, such as Notepad, like to prefix Unicode files with a Unicode byte-order mark (BOM), which in the case of UTF-8 files is unnecessary and may cause problems with software reading the files.
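If you'd rather script the conversion than rely on iconv or an editor, re-encoding a whole file takes only a few lines; this Python sketch assumes the file genuinely uses a single known encoding, and the encoding names are examples:

```python
def convert_text_file(src_path, dst_path, src_encoding, dst_encoding="utf-8"):
    """Re-encode a plain text file that uses a single known character encoding."""
    with open(src_path, "r", encoding=src_encoding) as src:
        text = src.read()
    with open(dst_path, "w", encoding=dst_encoding) as dst:
        dst.write(text)
```

Decoding with the wrong source encoding raises an error rather than silently corrupting the data, which is the behavior you want during a migration.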
Structured files in this context means any files, other than SQL databases, that have components that might have different encodings or that have length limitations. Examples are: log files, where different entries may use different encodings; email messages, where different headers and MIME body components may use different encodings, and the headers have length limitations; and cookies, which are often treated as having multiple fields. For such files, every component has to be converted separately, and length limitations have to be dealt with separately for each component.
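For example, a log file whose entries may be in either UTF-8 or a legacy encoding has to be decoded entry by entry. A common heuristic, shown here as a Python sketch with windows-1252 as an assumed legacy fallback, is to try strict UTF-8 first:

```python
def decode_log_entry(raw):
    """Decode one log entry; entries written at different times may differ
    in encoding, so each is handled independently."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("windows-1252")  # assumed legacy fallback
```

This works reasonably well because byte sequences produced by most legacy encodings are very unlikely to also be well-formed UTF-8.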
An SQL database really consists of two components: a server component, which actually manages the data, and a client component, which interfaces with other software (such as a PHP or Java runtime) and communicates with the server component. The character encoding that the client uses to communicate with the server can be set separately from the character encodings used by the server; the server will convert if necessary.
Depending on the size of a database and its uptime requirements, various strategies for conversion are possible:
The SQL language and documentation have the unfortunate habit of using the term "character set" for character encodings, ignoring the fact that UTF-8 and UTF-16 (and even GB18030) are different encodings of the same character set.
Oracle has general Unicode support starting with version 8, but support for supplementary characters is only available starting with version 9r2, and support for Unicode 3.2 only starting with version 10. Also, the use of the NCHAR and NVARCHAR data types before version 9 is somewhat difficult. Oracle provides comprehensive Globalization Support Guides for versions 9r1, 9r2, 10r1, and 10r2. The chapters on Character Set Migration and Character Set Scanner are particularly relevant.
The character encoding selected for an Oracle database is set for the entire database, including data, schema, and queries, with one exception: the NCHAR and NVARCHAR types always use Unicode. Different Unicode encodings are offered for the database as a whole and for the NCHAR and NVARCHAR data types. For the database, there is correct UTF-8 under the name AL32UTF8, and a variant, UTF8, that encodes supplementary characters as two 3-byte sequences. For databases migrating to Unicode, you should use AL32UTF8 (databases that already use UTF8 can in most cases continue to do so - the difference between these encodings may affect collation and indexing within the database, but in general it doesn't matter much since the client interface converts UTF8 to correct UTF-8). For the NCHAR and NVARCHAR data types, UTF-16 is available under the name AL16UTF16, along with the variant UTF8 encoding. The semantics of length specifications for the CHAR, VARCHAR2, and LONG data types can be set using the NLS_LENGTH_SEMANTICS parameter, with byte semantics as the default, while the NCHAR and NVARCHAR data types always use character semantics.
For correct conversion between the encoding(s) used within the database and the client's encoding, it is essential to define the NLS_LANG environment variable on the client side. This variable describes the language, territory, and encoding used by the client OS. Oracle has numerous other settings to specify locale-sensitive behavior in the database; these can generally be set separately from the encoding, as long as the encoding can represent the characters of the selected locale. Unicode supports all locales.
Oracle provides built-in support for several conversion strategies. The Character Set Scanner tool helps identify possible conversion and truncation problems in the pre-conversion analysis. The Export and Import utilities help in implementing a dump and reload strategy. Adding Unicode columns is easy because the NCHAR and NVARCHAR data types support Unicode independent of the database encoding. Converting in place with an encoding tag is possible if the database itself doesn't interpret the text - the ALTER DATABASE CHARACTER SET statement can be used to inform the database of the actual encoding once conversion has completed.
There are reports that the NCHAR data types are not supported in the PHP Oracle Call Interface.
To get Unicode support in MySQL databases, you'll need to use MySQL 4.1 or higher. For information on upgrading to this version and on possible compatibility issues, see Upgrading Character Sets from MySQL 4.0. For detailed information on character encoding support in MySQL, see the Character Set Support chapter of the MySQL documentation. The character encoding for the database content can be set separately at the server, database, table, or column level. Where the encoding isn't set explicitly, it's inherited from the next higher level.
MySQL's default encoding is latin1, that is, ISO-8859-1. The supported Unicode encodings are called utf8 and ucs2. Usually the recommended character encoding for MySQL should be utf8. Both utf8 and ucs2 are limited to the characters of the Unicode Basic Multilingual Plane (BMP), so there's no support for supplementary characters in MySQL. As a result, utf8 isn't a wholly compliant implementation of UTF-8 (although for most purposes it is fine). The NCHAR and NVARCHAR data types always use utf8.
Length specifications for character data types are interpreted in Unicode BMP characters, so a specification CHAR(5) CHARACTER SET utf8 will reserve 15 bytes. Metadata, such as user names, is always stored in UTF-8, so non-Latin names can be used. The character encoding for the client connection can be set separately for client, connection, and results, but to avoid confusion, it's best to set them all together using SET NAMES 'utf8'. The ucs2 encoding is not supported for the client connection, so there's no good reason to use this encoding for the database content either.
Collations are related to character encodings, so they should always be set at the same time as the encoding. If utf8 is used without specifying a collation, the default collation utf8_general_ci is used. This is a legacy collation algorithm that's not good for any particular language. The collation utf8_unicode_ci is a better default, since it implements the Unicode Collation Algorithm (UCA) and works for many languages that are not specifically supported by a named collation. You can also select one of the language-named UTF-8 collations to get language-specific collation "tailoring" based on UCA. See the list of collations for Unicode Character Sets. MySQL supports the CONVERT function, which allows the results of a query to be converted from one encoding to another. MySQL also supports in-place conversion from one encoding to another using the ALTER statement: ALTER TABLE table CONVERT TO CHARACTER SET utf8 COLLATE collation;
In some cases, the encoding of a column may be incorrectly declared in the schema - for example, UTF-8 data may have been stored in a MySQL database under the encoding name latin1 before MySQL really supported UTF-8, or Japanese data may have been labeled sjis when it was actually using the Windows version of Shift-JIS, which MySQL calls cp932 (see The cp932 Character Set for more information on this case). In such cases, a column can be relabeled without conversion by changing its type to the binary equivalent (BINARY, VARBINARY, BLOB), then back to characters (CHAR, VARCHAR, TEXT) with the correct encoding name. For example, for a TEXT column: ALTER TABLE table CHANGE column column BLOB; ALTER TABLE table CHANGE column column TEXT CHARACTER SET utf8 COLLATE collation; You can and should change all columns of one table together in order to minimize the overhead of rebuilding the table.
The PHP client for MySQL by default specifies latin1 as the connection encoding for each new connection, so it is necessary to issue a SET NAMES 'utf8' statement for each new connection.
Several server operating systems (for example, FreeBSD, Red Hat) store file names as simple byte sequences whose interpretation is up to higher-level processes. Server processes may interpret the byte sequences according to the character encoding of the locale they run in, or just pass them on to client processes. The actual encoding must therefore be determined by evaluating how the name was created, which might be through a web page in the default encoding for the particular site or user. If that's not conclusive, character encoding detection may also be used.
If the encoding of a file name can be determined with certainty, it can be converted to UTF-8, and a Byte Order Mark can be used to mark it as converted. If the encoding is uncertain, it may be necessary to create a database parallel to the file system to record the detected encoding and possibly the UTF-8 version, so that the original file name can be kept around for later correction of the encoding.
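The conversion-and-tagging step might look like this in Python; the legacy encoding and the use of a UTF-8 BOM as a "converted" marker follow the scheme described above and are assumptions of this sketch, not a platform convention:

```python
BOM = b"\xef\xbb\xbf"

def convert_file_name(name_bytes, legacy_encoding="windows-1252"):
    """Convert a legacy-encoded file name to UTF-8, marking it with a BOM
    so that an already-converted name is never converted a second time."""
    if name_bytes.startswith(BOM):
        return name_bytes  # already converted and tagged
    return BOM + name_bytes.decode(legacy_encoding).encode("utf-8")
```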
Testing Unicode support with ASCII text is useless. Make sure that you test handling of user data with text in a variety of the languages you will support:
Testing with these languages will require your computer to be configured to support them. Learn To Type Japanese and Other Languages has information on how to do this for all common operating systems.
To show that text in the user interface of your application is handled correctly, pseudo-localization is a useful testing strategy. Pseudo-translation tools can automatically replace ASCII characters in user interface messages with equivalent fullwidth Latin characters from the Unicode range U+FF01 to U+FF5E (English -> Ｅｎｇｌｉｓｈ) or with variant letters with diacritic marks from the complete Latin range (English -> Ëñgłíšh).
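The fullwidth variant of pseudo-translation relies on the fact that the fullwidth forms U+FF01..U+FF5E are arranged in the same order as printable ASCII U+0021..U+007E, so the mapping is a fixed offset of 0xFEE0. A minimal sketch:

```python
def pseudo_fullwidth(text):
    """Replace printable ASCII with the corresponding fullwidth Latin forms,
    leaving spaces, control characters, and non-ASCII text unchanged."""
    return "".join(
        chr(ord(ch) + 0xFEE0) if 0x21 <= ord(ch) <= 0x7E else ch
        for ch in text
    )
```

Running the user interface through such a filter quickly exposes places where text is truncated, not extracted for translation, or mangled by a non-Unicode code path.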
Issues that need particular attention when testing Unicode support include:
Some guidelines for testing Unicode support and internationalization are available at: International Testing Basics
Character Set Support is the chapter of the MySQL documentation that covers character encodings in MySQL.
Learn To Type Japanese and Other Languages has information on configuring and using all common operating systems for multilingual testing.