W3C Internationalization Workshop

Position Statement

SUBMITTED BY

Suzanne Topping

BizWonk Inc.

INTRODUCTION

The following position statement was developed as a result of recent discussions on various email distribution lists such as www-international@w3.org, I18N-prog@yahoogroups.com, and unicode@unicode.org. Due to the high volume of locale-related discussions, a new list (locales@yahoogroups.com) was created by David Possin in November of 2001. The broad-based recommendations in this paper have been compiled from the discussions on these lists.

POSITION SUMMARY

The participants in the Locales group at yahoogroups.com feel that the topic of locales should be one of the areas of focus for the re-chartering of the W3C's Internationalization Working Group.

As one Locales group member stated:

"One of the strengths of Unicode over the years has been the early recognition that the scheme decided upon would need to be extensible without breaking. I think at least as much thought needs to be given up front to the locale issue if we expect people to adopt it as widely as Unicode."

The Locales group recognizes this reality, and hopes that a W3C initiative could help drive the development of appropriate standards.

Experience: I am representing a group of internationalization professionals from a wide range of backgrounds and experiences. Contributors to this summary include (but are not limited to):

Carl Brown

Barry Caplan

John Clews

Peter Constable

Mark Davis

Michael Kaplan

David Possin

Thierry Sourbier

Tex Texin

Suzanne Topping

Needs: The needs of Locales group members vary depending on their specific working situations. There is no single specific area of need, other than a general interest in developing more standardized handling for customizing data based on user preferences.

Expected Outcomes: The Locales group would like to see attention given to the locale issue as part of the new internationalization efforts being put in place by the W3C. Ultimately a set of guidelines would be developed for dealing with locale issues. A working group may be a good way to proceed with developing these guidelines.

Potential Contributions: The Locales group could consolidate and summarize suggestions to act as a starting point for WG activities. In addition, some Locales group members could take part in a WG, if one is formed.

SUMMARY OF CURRENT STATE

Portions of discussions from the Locales group and other lists are included below, to provide an overview of the range of viewpoints presented. Comments were condensed and paraphrased in some cases. Initial comments are general in nature, and then move on to more specific issues. Individual comments are separated by asterisks (***).

***

Locale definitions should be separated from language definitions.

***

The name of a country is not the same as a language, which is not the same as a currency, which is not the same as an address format, which is not the same as a name format and so on.  I cannot think of a single good reason why any of these unique characteristics should be inextricably linked to any other in any way or standard whatsoever, nor why they should be made unchangeable, nor can I think of any good programming reason (except time and money) as to why each characteristic cannot be approached separately (understanding that a set of defaults can be assigned on the basis of "locale").

***

The challenge with managing locales is that:

1) Every user has a specific idea of what they want.

2) Individual users do not all agree about what they want, and so they end up with what they -don't- want.

3) Unhappy users do not understand the UI well enough to fix the problem. And even if they know the UI well, it may not be flexible enough to allow the degree of customization they desire.

***

What some people appear to want is some structured way to indicate and/or communicate a raft of information about a user's preferences. That would presumably include the traditional features of a locale, such as how dates are formatted, but may also include currency, timezone, preferred character set, smoker/non-smoker, vegetarian or not, music preference, religion, party affiliation, favorite charity, etc.

***

Major US software developers deferred to ISO 639 which couples language and country.  We need to break the tie between language and locale, and to create real-world minimal defaults for the locale definition. While IBM, Sun, and Microsoft use predefined language_country locales for their operating systems, that does not mean that they are right.

***

Standardization of the "locale components" is an open-ended process. ISO has provided us with the basic language identifications, but we all understand that it is not enough. We have to translate between various applications of the ISO standard; for example, a Java locale is written "es_ES", but try putting this into xml:lang.
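To illustrate the kind of translation this comment refers to, here is a minimal sketch that turns an underscore-separated identifier such as "es_ES" into the hyphen-separated form that xml:lang expects. The function name is hypothetical, and the sketch assumes only a language and an optional country are present; any further variant fields are simply dropped, so the conversion is lossy.

```python
def locale_id_to_lang_tag(locale_id: str) -> str:
    """Convert an identifier like "es_ES" into a tag like "es-ES".

    Assumption: the input is language[_COUNTRY[_variant...]].
    Variant fields have no agreed mapping here, so they are discarded.
    """
    parts = [p for p in locale_id.split("_") if p]
    if not parts:
        raise ValueError("empty locale identifier")
    tag = [parts[0].lower()]          # language subtag, conventionally lowercase
    if len(parts) > 1:
        tag.append(parts[1].upper())  # country subtag, conventionally uppercase
    return "-".join(tag)


print(locale_id_to_lang_tag("es_ES"))
```

The reverse direction is not always possible, because a language tag can carry distinctions that a language_country pair cannot express.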

***

When we say "language as used in country X", I think what we are really identifying is a particular set of orthographic conventions. In the model I'd propose, a language can have multiple writing systems, and a writing system can have multiple orthographies. Orthography is typically a function of language standardization policies, and an orthography object determines spellings, hyphenation conventions, and default sort order. When we talk about a language having certain properties in the context of a given country, it is generally going to be orthography; it is not some kind of linguistic sub-variety or dialect. An orthography distinction may require reference to a script, or to a date, a defining document/policy/legislation or a promoting agency.

***

Some general questions regarding orthographic and user preference option handling were raised:

--Should every orthographic and user preference option be viewed as a modifiable selection?

--Should every option have at least one value, and allow a null value?

--Should some sort of fallback mechanism be required?

--Should a minimal set of default options be required?
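One possible reading of these questions can be sketched as a layered lookup: an explicit user choice wins, then the defaults implied by the locale, then a minimal global default. Everything in the sketch is an assumption made for illustration, including the function name, the option names, and the sample values; it also assumes that an explicit null (None) is a legitimate answer that stops the fallback.

```python
# Hypothetical minimal defaults; the real minimal set is an open question.
MINIMAL_DEFAULTS = {"date_format": "YYYY-MM-DD", "decimal_separator": "."}


def resolve_option(name, user_prefs, locale_defaults,
                   minimal_defaults=MINIMAL_DEFAULTS):
    """Return the first value found for `name`, searching the user's own
    preferences, then the locale defaults, then the minimal defaults."""
    for layer in (user_prefs, locale_defaults, minimal_defaults):
        if name in layer:
            # A key that is present with value None is treated as an
            # explicit "no value" and is returned as-is.
            return layer[name]
    raise KeyError(f"no value or default defined for option {name!r}")


user = {"decimal_separator": ","}        # illustrative user override
locale = {"date_format": "DD.MM.YYYY"}   # illustrative locale default
print(resolve_option("decimal_separator", user, locale))
print(resolve_option("date_format", user, locale))
```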

***

The concept of Areas should be explored as part of the locale guideline development. Countries within a particular area of the world often have elements of a particular locale in common, with relatively small variations.

Examples are larger areas of the Caribbean, North America, Central America, South America, Western Europe, Central Asia, Western Asia, South Asia, the Pacific, North Africa, etc.

Currently there is no framework that system designers (or users) can refer to, for getting information on areas of the world that share similar characteristics.

***

I think it will be hard to have people agree on a *default* Area->country->region relationship even if they know they can override it. We will probably need several different hierarchies to reflect the different usage (political, geographical, economical, trade based, military based, etc...).

***

I don't think it is possible to divorce the discussion of locale from the discussion of internal data format.  I don't think anyone can say that their idea for locale is complete unless they also have a complete understanding of the "normalized" format of the underlying data.

***

HTML currently uses only the lang attribute; there is no attribute for region/territory/country, and that is one of the sources of problems. The added country code should only be used to enhance the language definition, but in reality it is also used to select the locale settings, which is a wrong assumption.

***

It would be nice if we could have standard attribute names, maintained by W3C, like lang. I know there are many ideas for XML out there, but what about HTML?

***

We not only need a new way of specifying locales, but also a new attribute, perhaps something like xml:locale, on the quixotic possibility that we might be able to convince the W3C to declare a new standard attribute. I want to be able to use this in an attribute position in xml.

***

I am afraid that a 'standard' without an industry-strength implementation is a waste of time. So I pray that one day, a decent company will come up with a notion of WEB "preference server", and along with that it will define both the identifier "format" as well as the "preferences content model".

EXAMPLES OF GUIDELINE ITEMS TO BE CREATED

Some suggested items to include in guidelines are listed below. Also useful to include in the guidelines would be checklists, best practice recommendations, and examples of what NOT to do.

1. Locale Definition

Which of the many variables for internationalization belong in a locale and which belong in some other structure? Which variables are best associated with the locale, which with the data, and which with the application?

2. Locale Identification

Locale-specific options should have a standard identifier and a default. How many parameters are needed for a default minimal locale description? Country codes should be extended by regions, as described in the W3C international list, and user-defined regions should be possible. It would be desirable to have machine-readable lists provided by the W3C.

3. Language Identification

Language codes need to be improved to reflect the languages actually spoken, including their dialects and local rules. How can we identify languages that are not included in the ISO 639 language group standard? (Current locale identifiers use the 2-letter code, not the 3-letter code.)

(TC 37/SC 2 has set things in motion to start a new work item on ISO 639, and there is a reasonable probability that it will lead to a Part 3, perhaps consisting of 4-letter codes, and almost certainly consisting of a lot more codes than are currently in 639-2.)

4. Orthographic and User Preference Attribute Definitions

A list of orthographic and user preference options should be defined. Some examples of issues that are impacted by locale, and which should therefore be included as attributes, are listed below.

Language

Dialect

Script

Personal name format

Address format

Date/time format

Date/time selection

Data casing (upper/lower case)

Telephone number format

Currency format

Measurement format

Numeric format

Collation/sorting

Other issues

5. Time zone Handling

Global time and date displays need to be dealt with. A locale may include more than one time zone (e.g. the US includes EST, CST, MST, and PST, among others), and daylight saving time rules may vary within a locale.

Language, country, and time zone are not sufficient to determine which calendar is being used. Perhaps timezone should be replaced with something representing calendar+date+time formats and timezone?

6. Currency and Euro Handling

Locales typically have only one currency associated with them, and European locales still all have their national currencies implied. Even when the euro becomes standard for a country, older transactions will still have to work with the old currencies and/or triangulation. We can't just convert them.
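As a rough illustration of triangulation, the sketch below converts an amount from one national currency to another by way of the euro, using the fixed conversion rates for the German mark and the French franc. The function and table names are hypothetical, and the rounding of the intermediate euro amount to three decimals reflects the commonly cited triangulation rule; it should be checked against the applicable regulation before being relied on.

```python
from decimal import Decimal, ROUND_HALF_UP

# Fixed conversion rates: units of national currency per 1 euro.
EURO_RATES = {
    "DEM": Decimal("1.95583"),
    "FRF": Decimal("6.55957"),
}


def triangulate(amount: Decimal, source: str, target: str) -> Decimal:
    """Convert `amount` from `source` to `target` through the euro."""
    # Step 1: national currency -> euro, intermediate kept to 3 decimals
    # (assumed rounding rule, see the note above).
    euros = (amount / EURO_RATES[source]).quantize(
        Decimal("0.001"), rounding=ROUND_HALF_UP)
    # Step 2: euro -> target currency, rounded to 2 decimals.
    return (euros * EURO_RATES[target]).quantize(
        Decimal("0.01"), rounding=ROUND_HALF_UP)


print(triangulate(Decimal("100"), "DEM", "FRF"))
```

Because each step rounds, converting back again does not always reproduce the original amount, which is one reason an old transaction needs to keep its original currency rather than being converted in place.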

USE CASES/STRAWMEN

Below are some suggestions for approaches.

Single Locale Specification

One approach is to stick with a single locale specification rather than dealing with multiple parameters, but one which supplies more information.

For example:

"es-mx_US.iso-8859-1#America/Los_Angeles"

Since many applications would require a lot of change to track multiple parameters rather than a single locale specification, some people feel that this is the best approach.
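A minimal sketch of how such a string might be taken apart is shown below. It assumes the layout language_location.charset#timezone, with every part after the language optional; that layout is inferred from the example, and the type and function names are hypothetical.

```python
from typing import NamedTuple, Optional


class LocaleSpec(NamedTuple):
    language: str
    location: Optional[str]
    charset: Optional[str]
    timezone: Optional[str]


def parse_single_spec(spec: str) -> LocaleSpec:
    """Split "lang_LOC.charset#timezone" into its parts."""
    # Split off the time zone first: it may itself contain "_" or "/".
    rest, _, timezone = spec.partition("#")
    rest, _, charset = rest.partition(".")
    language, _, location = rest.partition("_")
    if not language:
        raise ValueError(f"no language found in {spec!r}")
    # Missing parts are reported as None rather than as empty strings.
    return LocaleSpec(language, location or None,
                      charset or None, timezone or None)


print(parse_single_spec("es-mx_US.iso-8859-1#America/Los_Angeles"))
```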

Multiple Parameters

Another approach is to have each set of parameters treated independently to avoid problems like the ones resulting from the language_country coupling. The number of parameters should stay fully optional when defining an international operating/run environment.

For example:

"lang_es-mx#loc_US#tz_America/Los_Angeles#char_iso-8859-1"

or

"lang[es-mx]#loc[US]#tz[America/Los_Angeles]#char[iso-8859-1]"

If using multiple parameters, it would be necessary to determine which parameters (listed under "Orthographic and User Preference Attribute Definitions" in this document) are subcategories of others, and which ones stand alone.  For example, there may not be any relationship between time zone and language but dialect is definitely a subcategory of language and may not merit a separate category. (But this issue is also up for debate.)

Also required would be development of defaults for use when parameters are missing/undefined, or if the addressed resources are missing.
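The two notations above, together with defaults for missing parameters, could be handled along the following lines. This is only a sketch: the function name is hypothetical, and the default values are placeholders, since which defaults are appropriate is exactly what would need to be agreed.

```python
import re

# key[value] form, e.g. "tz[America/Los_Angeles]"
_BRACKET = re.compile(r"^([A-Za-z]+)\[(.+)\]$")
# key_value form, e.g. "tz_America/Los_Angeles" (split at the first "_")
_UNDERSCORE = re.compile(r"^([A-Za-z]+)_(.+)$")

# Placeholder defaults, assumed for illustration only.
DEFAULTS = {"lang": "en", "loc": None, "tz": "UTC", "char": "utf-8"}


def parse_parameters(spec: str, defaults=DEFAULTS) -> dict:
    """Parse "#"-separated parameters in either notation and fill in
    defaults for any parameter that is not supplied."""
    result = dict(defaults)
    for field in spec.split("#"):
        if not field:
            continue
        match = _BRACKET.match(field) or _UNDERSCORE.match(field)
        if match is None:
            raise ValueError(f"unrecognized parameter {field!r}")
        key, value = match.groups()
        result[key] = value
    return result


a = parse_parameters("lang_es-mx#loc_US#tz_America/Los_Angeles#char_iso-8859-1")
b = parse_parameters("lang[es-mx]#loc[US]#tz[America/Los_Angeles]#char[iso-8859-1]")
print(a == b)
print(parse_parameters("lang_es-mx"))
```

Because the parameters are independent, their order does not matter and any subset can be supplied, which is the main attraction of this approach over a single positional string.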

Make Parameters Full Text Based

Language and country identifiers are English based and are 2 to 3 characters long. Future extensions might be 4 or more characters long, but that doesn't improve understandability much. It was therefore suggested that parameters be full text based.

For example:

*lang#German#Bavarian# would be equivalent to *lang#Deutsch#Bayrisch#

*lang#English#American# would be equivalent to *lang#Englisch#Amerikanisch#

*region#Germany#Bavaria# would be equivalent to *region#Deutschland#Bayern#

*tz#Middle_European# would be equivalent to *tz#mitteleuropäisch#

Note: *tz#US_Arizona# would be enough to identify daylight saving time issues as well as the time zone.
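The equivalences above imply some table that maps each translated name onto a single canonical value. A minimal sketch follows, using only the names from the examples; the table layout, the choice of English as the canonical form, and the function name are all assumptions made for illustration.

```python
# Translated name (casefolded) -> canonical name, per parameter category.
ALIASES = {
    "lang": {"deutsch": "German", "bayrisch": "Bavarian",
             "englisch": "English", "amerikanisch": "American"},
    "region": {"deutschland": "Germany", "bayern": "Bavaria"},
    "tz": {"mitteleuropäisch": "Middle_European"},
}


def normalize(param: str):
    """Turn "*lang#Deutsch#Bayrisch#" into ("lang", ("German", "Bavarian"))."""
    if not (param.startswith("*") and param.endswith("#")):
        raise ValueError(f"not a full-text parameter: {param!r}")
    category, *values = param[1:-1].split("#")
    table = ALIASES.get(category, {})
    # Names without a known translation are passed through unchanged.
    canonical = tuple(table.get(v.casefold(), v) for v in values)
    return category, canonical


print(normalize("*lang#Deutsch#Bayrisch#") == normalize("*lang#German#Bavarian#"))
```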

Translations of all parameters already exist in operating systems, programming languages, databases, etc.; they just need to be standardized. Using full text, a future level of the ISO standards would be more descriptive and precise than any 4-letter acronym could be.  Memory is not that much of an issue anymore, and legacy systems would still have to use the older standards anyway.

CONCLUSION

The Locales group feels that given the degree of variation in current handling across operating systems, programming languages, etc., the issue of locales is worthy of attention during the re-chartering of the W3C's Internationalization Working Group.