W3C

Internationalization and Localization Markup Requirements

W3C Working Draft 5 August 2005

This version:
http://www.w3.org/TR/2005/WD-itsreq-20050805
Latest version:
http://www.w3.org/TR/itsreq
Editor:
Yves Savourel, ENLASO

Abstract

When creating schemas (XML Schema, DTD, etc.), it is important to include constructs that meet the needs of content authors dealing with international audiences, and address the needs of the localization community. This document provides a list of key requirements to achieve such a goal. It will be used to provide a framework and direction for a detailed solution proposal (or set of proposals) to be developed later.

Status of this Document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.

This document is a First Public Working Draft published by the Internationalization Tag Set Working Group, part of the W3C Internationalization Activity. The Working Group expects to advance this Working Draft to Working Group Note (see W3C document maturity levels).

This document defines requirements for a set of solutions that would address the main challenges and issues of internationalizing and localizing XML documents. The solutions are expected to include several aspects: a specialized vocabulary that XML users can include in their own documents, a set of guidelines to apply when using existing XML technologies, and a range of possible mechanisms for applying those.

Feedback about the content of this document is encouraged. Send your comments to www-i18n-comments@w3.org. Use "Comment on ITS requirements WD" in the subject line of your email. The archives for this list are publicly available.

Publication as a Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.

Table of Contents

Appendices

A References
B Acknowledgements (Non-Normative)

1 Introduction

1.1 Background

Content or software that is authored in one language (i.e. source language) is often made available in additional languages. This is done through a process called localization, where the original material is translated and adapted to the target audience.

From the viewpoints of feasibility, cost, and efficiency, it is important that the original material should be suitable for localization. This is achieved by proper design and development, and the corresponding process is referred to as internationalization.

The increasing use of XML as a medium for documentation-related content (e.g. DocBook, a format for writing structured documentation that is well suited to computer hardware and software manuals) and software-related content (e.g. the eXtensible User Interface Language (XUL)) creates growing challenges and opportunities in the domain of XML internationalization and localization.

1.2 Who Should Read This

The target audience of this document includes the following categories:

  • Designers of content-related formats

  • Developers of schemas in various formats

  • Developers of XML authoring tools

  • Authors of XML content

  • Developers of localization tools

  • Localizers involved with XML

  • Developers of Internet specifications at the World Wide Web Consortium and related bodies

1.3 Overview

This document describes requirements for a list of guidelines and a set of recommended approaches to developing schemas which address issues related to international use of document formats and localization of XML content.

Regardless of the final form and syntax such approaches ultimately take, it is possible to envision their usage at different levels:

  1. In a document instance, grouped in a single location, to associate information with multiple parts of the document using some kind of linking or addressing mechanism. Such usage would be similar to the <style> element in an HTML document.

  2. In a document instance, within the content, at the location where the information applies. This usage would be similar to the style attribute in an HTML document.

  3. In schemas, along with the definition of an element or an attribute, to provide data categories for internationalization and localization.
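As an illustration only, the first two levels might look like the sketch below. The its prefix, its namespace URI, and the rules, rule, select and translate names are assumptions made for this example; like the other samples in this document, they are not intended to represent the actual solution. The <doc>, <filename> and <span> elements are equally hypothetical.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical markup: every "its" name and the namespace URI are invented for this sketch. -->
<doc xmlns:its="http://example.org/hypothetical-its">
  <head>
    <!-- Level 1: grouped in one location, pointing at several parts of the document -->
    <its:rules>
      <its:rule select="//filename" translate="no"/>
    </its:rules>
  </head>
  <body>
    <p>Open <filename>config.xml</filename> to change the settings.</p>
    <!-- Level 2: declared within the content, at the location where it applies -->
    <p>The product name <span its:translate="no">pictoMagic</span> is never translated.</p>
  </body>
</doc>
```

The third level would carry the same kind of information in the schema itself, alongside the declaration of the element or attribute concerned.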

Such approaches are not meant to describe the configuration settings of localization tools for XML content. However, it is expected that the tools will be able to infer such properties from the information provided by the ITS implementations. For example, the tools should be able to build a list of all nodes that are to be translated in a given document using the ITS information in the document itself and in its corresponding schema(s) or DTD.

1.4 Key Definitions

When used in this document, the following terms have the meaning described here:

Internationalization

[Definition: Internationalization is the process of generalizing a product so that it can handle multiple languages and cultural conventions without the need for redesign. Internationalization takes place at the level of program design and document development.] (Definition based on LISA's FAQ [LISA FAQ])

Localization

[Definition: Localization is the process of taking a product and making it linguistically and culturally appropriate to a given target locale (country/region and language) where it will be used.] (Definition based on LISA's FAQ [LISA FAQ])

Schema

[Definition: The term schema(s) refers to any schema language (e.g. DTD, XML Schema, etc). The term "XML Schema" is used when referring specifically to XML Schema.]

2 Usage Scenarios

2.1 Content Authoring

2.1.1 Description

As an author develops content that is meant to be localized, he or she may need to label specific parts of the text for various purposes, such as:

  • terms that either should not be translated or should be translated using a pre-existing terminology list

  • sections of the document that should remain in the source language

  • acronyms or specific terminology that requires an explanation note for the translator

  • identification of re-usable text

In other cases, the original text itself may need to be labeled for specific information required for correct rendering, such as ruby text in Japanese [Ruby], or bidirectional overrides in Arabic [Bidi].

The use of a standardized set of tags allows authoring systems to provide a common solution for these special markers across all XML documents. This, in turn, increases the feasibility of a simple interface for performing the labeling task.

For example, the author selects a portion of text not to translate and clicks a button to mark it up as "not translatable" with a tag identical across all markup vocabularies. The availability of such an interface allows the author to give translators better working context with minimal effort.

2.1.2 Stakeholders

This scenario is relevant to:

  • The technical writers developing localizable content

  • The developers of authoring systems

  • The localizers and the translators

2.2 Terminology Creation and Translation

2.2.1 Description

During the development of documentation material, it is common practice to scan the content of the documents to create a list of frequently used terms.

This list is used to provide a consistent terminology across the different parts of the documentation. It is also used as the base for translation glossaries.

During the terminology creation phase the insertion of special markers to delimit terms within the source material helps the user to identify the proposed entries and view them within their context.

The same markup can be used at later stages in the translation process, to help the translators match the source terms with their agreed-upon translations.

The use of a common set of markers allows for better re-usability of the information across the different steps of the localization process and across the various tools used to facilitate it.

2.2.2 Stakeholders

This scenario is relevant to:

  • The authors or the terminologists that create the glossaries

  • The people working on quality management/assurance

  • The translators

2.3 Software Development

2.3.1 Description

Software-related material is now often stored in XML repositories. Examples include UI resources and message files, comments in the source code used to generate documentation, or even temporary XML storage generated from proprietary formats for the duration of the localization.

A software developer often needs to provide localization-related information along with the resources that will be translated. For instance, he or she may need to indicate that a string has a maximum length because the program processes it using a fixed-length buffer.

Using a common set of tags in the XML documents to carry such information across the different tools used during the localization process offers better control to the developer. He or she can affect how the resources will be modified, and ultimately prevent bugs or incorrect translations from being introduced.

Localizers also often need to add their own information in the resource material. They do this to complete what has been already set by the developer, or to add their own instructions.

In all these cases, a common set of tags allows the localization providers to develop re-usable verification tools to ensure that the translated material meets the requirements set by the developers. It also helps the different parties communicate information to each other in context.

2.3.2 Stakeholders

This scenario is relevant to:

  • The software developers that create the resources

  • The localization engineers that prepare the resources for translation

  • The translators modifying the data

3 Requirements

Note: Several of the following requirements are illustrated with XML code samples using yet-to-be-defined ITS elements and attributes. Their names are completely arbitrary and are not intended to represent the appearance of the actual solution. The solution also may or may not be implemented as a namespace. These elements and attributes are represented with a strong emphasis in the examples.

3.1 Indicator of Constraints

[R001] It should be possible to associate one or more constraints with specific content.

3.1.1 Challenges

Translatable data may come with various constraints in the way they can be modified. For example, the content of the following <string> element must accommodate the length restriction imposed by the small display panel where it is used:

Example 1: Length restriction
<!-- LED display has only 16 characters -->
<string id="s123">Printing...</string>

In this case a standard method should be used for indicating the dimensions of the container so that localization tools can automatically recognize them and, when possible, enforce the constraint during translation.

Examples of constraints are:

  • Container size (e.g. maximum length, etc.)

  • Text allowed in a limited set of characters (e.g. translatable paths or filenames)

These constraints may need to be defined at the schema level or they may need to be defined for specific instances of an element.

In some cases, the constraint may be applicable only for a given context or a given tool.

3.1.2 Notes

XSD (XML Schema Part 2: Datatypes Second Edition) provides a mechanism to define "Constraining Facets" ([XSD], section 4.3) that may provide some solution for this requirement at the schema level. At the instance level, Schematron [Schematron] could be used for the same purpose.
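A minimal sketch of the schema-level approach, assuming the <string> element of Example 1 and its 16-character display. The type name LedText is invented for this illustration, and the schema is deliberately reduced to the one declaration that matters here.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <!-- LedText is a hypothetical type name: text limited to 16 characters -->
  <xs:simpleType name="LedText">
    <xs:restriction base="xs:string">
      <xs:maxLength value="16"/>
    </xs:restriction>
  </xs:simpleType>

  <!-- <string id="...">text</string>, with the text content constrained by LedText -->
  <xs:element name="string">
    <xs:complexType>
      <xs:simpleContent>
        <xs:extension base="LedText">
          <xs:attribute name="id" type="xs:ID" use="required"/>
        </xs:extension>
      </xs:simpleContent>
    </xs:complexType>
  </xs:element>

</xs:schema>
```

For types derived from xs:string, the maxLength facet counts characters, so a facet of this kind cannot by itself express a limit in bytes, pixels or display cells.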

Sometimes the constraint may need to be expressed using units different from the unit used in the document. For example, the maximum length of a string may need to be expressed in bytes, pixels, or display cells instead of characters. This may lead to the need for quite a few parameters with the constraint (e.g. the encoding to use, or the font and point-size information, etc.).
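At the instance level, a constraint with an explicit unit might be carried as in the sketch below. The maxLength, lengthUnit and encoding attribute names, and the <messages> wrapper, are assumptions made for this illustration only.

```xml
<!-- Hypothetical attributes: maxLength, lengthUnit and encoding are invented names. -->
<messages>
  <!-- LED display has only 16 characters -->
  <string id="s123" maxLength="16" lengthUnit="char">Printing...</string>
  <!-- Fixed-length buffer: the limit applies to the encoded form of the text -->
  <string id="s124" maxLength="32" lengthUnit="byte" encoding="UTF-8">Paper jam</string>
</messages>
```

A localization tool that understood such attributes could check the second string by encoding the translation in UTF-8 and counting bytes rather than characters.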

3.2 Span-Like Element

[R002] A span-like element is required to allow authors to mark sections of text that may have special properties, from a localization and internationalization point of view.

3.2.1 Challenges

Given a section of XML text, there is often insufficient information in the original markup to determine exactly how the contents should be dealt with from a localization and internationalization point of view. Adding various span-like elements to the markup at the authoring stage would allow this information to be passed on to localization processes (either human or machine-assisted processes).

For example, span-like elements could be used to mark sections of text that need to be translated by a domain-expert (as with source code fragments) or mark those that need special terminology in order to be properly translated. In particular, a span-like element can be useful to help translation tools determine where to apply sentence-breaks and also to assist metrics-calculating algorithms.

A span-like element is also extremely useful for marking language information in source files that translation tools can use to determine which translation process to use for each given section of text (e.g. a Latin quotation in a section of English text is often intended to be left in Latin for the translated version of the English text.) Other uses are foreseen, within the scope of the ITS.

One example would be the following sentence, which contains some source code that we would like to treat specially during translation:

Example 2: Text with portion of source code

The Java statement System.out.println("Hello World!"); prints the text "Hello World!" to standard output.

Here, we would like to put a span-like element around the source code fragment to indicate that it is not standard text for translation and should be translated by someone familiar with the Java programming language. Also, translation tools should treat the exclamation points in this sample text carefully with respect to sentence segmentation, if they perform that function.

While the <code> tag in XHTML could be used to mark up this text (in an XHTML document), it is often not specific enough for translators: it does not tell the translator what sort of source code is contained inside the tag, nor does it mark which portions of the code contents are translatable.

A suggestion of the sort of usage we could foresee for a span-like element could be the following:

Example 3: Text with marked-up source code

The Java statement <code><span trans="no">System.out.println("</span>Hello World!<span trans="no">");</span></code> prints the text "Hello World!" to standard output.

An alternative to this sort of construction would be to put the translatable text in a separate document, and then refer to it using some form of linking mechanism, for example:

Example 4: Source code with entity reference

<code>System.out.println("&java.code.example.text;");</code>

Another example is shown below, where we have a piece of text that contains a file name which should also not be translated:

Example 5: Text with non-translatable file name

The file /etc/passwd is a local source of information about users' accounts.

In this case, the filename /etc/passwd should not be translated, and we would like to add markup to indicate this.
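Reusing the arbitrary trans attribute from Example 3, one possible marking would be the following. The <p> wrapper is an assumption added only to give the sentence a container.

```xml
<!-- trans="no" is the same placeholder attribute used in Example 3, not actual ITS markup. -->
<p>The file <span trans="no">/etc/passwd</span> is a local source of information about users' accounts.</p>
```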

These examples show that we are aiming to shift some of the responsibility for identifying translatable versus non-translatable content from the author of the translation tools to the content author or, at the very least, to recommend that content authors separate the translatable and non-translatable portions of text more clearly.

3.2.2 Notes

This requirement is related to Section 3.8: Purpose Specification/Mapping. We need to ensure that any related semantics in the target schema are also sufficient for translation. For example, saying that a <programlisting> element in DocBook is related to a <code> element in XHTML is interesting, but neither helps the translator determine which contents of <code> or <programlisting> are actually translatable.

A span-like element could be used in cases like these where specific text properties are identified.

3.3 CDATA Section

[Ed. note: Text for this requirement is still pending discussion, and is linked from the ITS home page.]

[R003] TBD

3.3.1 Challenges

TBD

3.3.2 Notes

TBD

3.4 Unique Identifier

[Ed. note: This requirement is still at its Initial Draft stage.]

[R004] It should be possible to attach a unique identifier to any localizable item. This identifier should be unique within a document set, but should be identical across all translations of the same item.

3.4.1 Challenges

In order to most effectively re-use translated text where content is re-used (either across update versions or across deliverables) it is necessary to have a unique and persistent identifier associated with the element.

This identifier allows the translation tools to correctly track an item from one version or location to the next. After one is sure that this is the same item, the content can be examined for changes, and if no change has taken place the potential for re-use of the previous translation is very high.

Change analysis constitutes an extremely powerful productivity tool for translation when compared to the typical source matching (a.k.a. translation memory) techniques, which simply look for similar source text in the database without, most of the time, being able to tell whether the context of its use is the same.

This change analysis technique has been possible with user-interface messages in the past, but the introduction of structured XML (and SGML) documents will allow for its use in documents also.

3.4.2 Notes

The xml:id attribute [XML ID] may be a means to carry the unique identifier. Note however, that xml:id is unique within a document, not necessarily within a set of documents.
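As a sketch of how xml:id could carry such an identifier, the two fragments below show the same item in a source document and in its translation. The file names, the <chapter> and <para> elements, and the convention of prefixing the identifier with a section name are assumptions for this illustration; the prefix is one possible way to reduce collisions across a document set, not something xml:id itself guarantees.

```xml
<!-- manual_en.xml (illustrative file name): source document -->
<chapter xml:lang="en">
  <para xml:id="install.warn01">Unplug the device before opening the cover.</para>
</chapter>
```

```xml
<!-- manual_fr.xml (illustrative file name): translation of the same item, same identifier -->
<chapter xml:lang="fr">
  <para xml:id="install.warn01">Débranchez l'appareil avant d'ouvrir le capot.</para>
</chapter>
```

When a later version of the source keeps install.warn01 with unchanged content, a tool can look up the existing French item by identifier instead of relying on text similarity alone.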

3.5 Handling of Entities

[R005] XML applications which combine contents from various modules/entities need to adhere to certain guidelines in order to ensure that the XML application itself and the contents can be localized easily.

3.5.1 Challenges

XML applications (i.e. a combination of DTD/XSD, style-sheets, XML instances) often make use of so-called general entities ([XML 1.0], section 4). Various types of entities exist, for example:

  1. Character entity. The entity defines a single Unicode character. Example: <!ENTITY aacute "á">

  2. A short element-free text. The entity defines a short text that contains only text (no element or other XML constructs). This is for instance an entity for a product name. Example: <!ENTITY productName "pictoMagic for Windows">

  3. A longer text with one or more elements. The entity defines a piece of boiler-plate text such as a copyright paragraph. Example: <!ENTITY copyrightInfo "<a href='copyright.htm'>Copyright</a> 2005 W3C.">

Two aspects of entities are of particular importance with regard to internationalization and localization: entities are defined, and entities are used. For example, the snippet:

Example 6: Entity declaration

<!ENTITY productName "pictoMagic for Windows">

defines an entity called "productName", and the snippet

Example 7: Entity reference

The latest version of &productName; features many enhancements.

references/uses the entity.

If internationalization and localization are not addressed for entity-related work several issues may arise:

  1. Entity reference cannot be resolved. Example: the definition is not available to the XML processor.

  2. Entity definition does not fit with the surrounding context language-wise. Example: The context in "Das Produkt &productName; ist mit vielen Erweiterungen ausgestattet worden" is German whereas the definition of the entity may be in English.

  3. Entity definition does not fit with the surrounding context grammar-wise. Example: The syntax in "The &objectName; has been disabled." will work, in English, only if the value for &objectName; is singular. If it is plural, "has" must be changed. In other languages "The" and "disabled" may also have to be adjusted.

  4. In addition, even if the entity itself is translated there may be significant grammatical problems for inflected languages for nouns. The translation will inevitably follow the case of the original. For example, if the original is genitive, the translation is genitive as well (of course this requires that the original language and the translation language have a concept for "genitive").
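One common way to approach the second issue is to keep entity declarations in one file per language and to load the file that matches the document. The sketch below assumes a file named entities.de.ent and a <para> root element, both invented for this illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE para [
  <!-- Hypothetical per-language entity file; the file name is an assumption -->
  <!ENTITY % productEntities SYSTEM "entities.de.ent">
  %productEntities;
]>
<para xml:lang="de">Das Produkt &productName; ist mit vielen Erweiterungen ausgestattet worden.</para>
```

```xml
<!-- entities.de.ent: German declarations for the entities shared across the document set -->
<!ENTITY productName "pictoMagic für Windows">
```

This arrangement only aligns the language of the entity text with its context. It does not address the grammatical issues in items 3 and 4, and a processor that does not read the external file runs into the first issue.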

Since entities affect the content of the document, and XSLT processors and other kinds of XML processors act on the content, various processing-related issues may arise. An XSLT style sheet for example, which is sensitive to content contributed by an entity, may fail to work as expected (e.g. may not be able to generate the alt attribute for HTML pages).

3.5.2 Notes

Ideally, the solution which the WG will produce will be applicable not only with regard to entities but also in the realm of XInclude [XInclude] or even fragments ([XFI], appendix B).

Note that character entity references (e.g. &aacute;) and numeric character references (NCRs, e.g. &#x00E1;) are different things. This requirement addresses character entity references, as well as all user defined entities.
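The difference can be seen in a small document such as the following, where the <p> element is only a placeholder. The character entity reference depends on a declaration being available, while the numeric character references are resolved by any XML processor without one.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE p [
  <!-- Character entity: must be declared before &aacute; can be used -->
  <!ENTITY aacute "&#xE1;">
]>
<!-- &aacute; is an entity reference; &#x00E1; and &#225; are numeric character references.
     All three denote U+00E1 (LATIN SMALL LETTER A WITH ACUTE). -->
<p>&aacute; &#x00E1; &#225;</p>
```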

3.6 Identifying Language/Locale

[R006] Any document should declare, at its beginning, a language/locale that applies both to the main content and to external content stored separately. While the language/locale may be declared for the whole document, an element or a text span that is in a different language/locale from the document-level language should be labeled appropriately. Therefore, DTDs and schemas should allow any element to have an attribute specifying the language/locale. The language/locale declaration should use industry-standard approaches.

3.6.1 Challenges

Identifying languages (such as French and Spanish) and locales (such as Canadian French and Ecuadorian Spanish) is very important for rendering and processing document text and content properly, since they determine language-dependent properties such as hyphenation, text wrapping rules, color usage, fonts, spell checking, quotation marks and other punctuation, etc.

In order to simplify parsing by documentation and localization tools, there should be a declaration of a language/locale that applies to the whole document as well as to externalized content. This should be done as a document-level property. Meanwhile, as a document may contain content in multiple languages/locales, subsets of the document need a language/locale attribute. Such a local language/locale specification should be declared on an element or a span of text.
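Using the existing xml:lang attribute, a document-level declaration with local overrides could look like the sketch below. The <doc>, <p> and <q> element names are placeholders chosen for this illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Document-level declaration: applies to everything not overridden below -->
<doc xml:lang="en-US">
  <!-- Span-level override: a Latin quotation inside English text -->
  <p>The motto <q xml:lang="la">carpe diem</q> appears on the title page.</p>
  <!-- Element-level override: a whole paragraph in Canadian French -->
  <p xml:lang="fr-CA">Ce paragraphe est en français canadien.</p>
</doc>
```

The value of xml:lang is inherited by child elements, so only the exceptions need to be labeled.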

3.6.2 Notes

Currently there are several different standards for language/locale specifications, such as RFC 1766 [RFC 1766] and RFC 3066 [RFC 3066]. XML 1.0 prescribes a language identification attribute xml:lang ([XML 1.0], section 2.12, and [XML 1.0 Errata], E01). There is also a technical standard from Unicode regarding the locale data markup language [LDML]. ITS should carefully review these existing industry standards and clearly define what a language/locale is and what its purpose is, in order to successfully meet this requirement.

3.7 Identifying Terms

[Ed. note: This requirement is still at its Initial Draft stage.]

[R007] It should be possible to identify terms inside an element or a span and to provide data for terminology management and index generation. Terms should be either associated with attributes for related term information or linked to external terminology data.

3.7.1 Challenges

The capability of specifying terms within the source content is important for terminology management that is beneficial to translation/localization quality. Terms to be identified include any domain-specific words and abbreviations for which translators need additional information in order to find appropriate concepts in their target languages. Term identification also facilitates the creation of glossaries and allows validation of terminology usage in the source and target documents.

Meanwhile, identified terms could be used for indexing, which may require some language-specific information. For example, Japanese words are sorted not by script characters, but by phonetic characters. Therefore, when a Japanese index item is created, it should be accompanied by a phonetic string, called Yomigana.

As a result, terms may require various attributes, such as part of speech, gender, number, term type, definition, notes on usage, etc. To avoid repeating such a large amount of attribute data within a document, it should be possible for identified terms to link to externalized attribute data, such as glossary documents and terminology databases.
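The two options, inline attributes and a link to external data, might look like the following sketch. The <term> element, its partOfSpeech, yomi and termRef attributes, and the glossary.xml target are all hypothetical names used only for illustration.

```xml
<!-- Hypothetical markup: the term element and all of its attributes are invented names. -->
<doc>
  <!-- Term data carried inline, plus a pointer to a fuller external entry -->
  <para xml:lang="en">Run the <term partOfSpeech="noun"
      termRef="glossary.xml#currency-restatement">Currency Restatement</term> program.</para>
  <!-- Japanese index term accompanied by its phonetic string (Yomigana) for sorting -->
  <para xml:lang="ja"><term yomi="とうきょう" termRef="glossary.xml#tokyo">東京</term></para>
</doc>
```

Keeping only a reference in the content leaves definitions, usage notes and similar data in one place, where they can be maintained without touching each document.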

3.7.2 Notes

For more details, please see discussions on term links at OASIS/XLIFF.

The OSCAR/TBX working group is currently working on drafting the TBX-Link specification [TBX-Link].

3.8 Purpose Specification/Mapping

[R008] Currently, it does not appear realistic to expect all XML vocabularies to tag localization-relevant information identically (e.g. to all use the "term" tag for terms). One way to handle diverse localization-relevant markup in localization environments is a mapping mechanism which maps localization-relevant markup onto a canonical representation (such as the Internationalization Tag Set).

3.8.1 Challenges

From a localization point of view, many XML vocabularies include markup which requires special attention, since the markup is associated with a specific type of content. Examples:

  • elements which are associated with embedded/binary graphics

  • elements which are associated with specific text styles (e.g. underline and bold)

  • elements which are associated with linking (e.g. <a> in HTML)

  • elements which are associated with lists

  • elements which are associated with tables

  • elements which are associated with generated content (e.g. an element that fires a query to a database in order to pull in the data for a product catalogue)

Here are some reasons why this type of markup may require special attention:

  • the localization tool may be able to render specific text styles in a standard way (e.g. increased font weight for bold)

  • embedded binary images may have to follow a specific workflow

  • content generation queries may have to be adapted

Since it is hardly imaginable that all content developers will be able to work with the same elements and attributes for this specific type of content, the ITS should include markup which allows people to specify the purpose of specific elements.

Challenges arise for example from the fact that the 'source/original' vocabularies may vary widely with regards to the representation they choose for a specific data category (e.g. their markup related to graphics; see the longer discussion of this).

3.8.2 Notes

This requirement is related to the "Section 3.14: Limited Impact" requirement.

For the specific case of linking, an existing specification is worth examining: HLink [HLink].

The approach may be used to support term identification. Suppose that an original document has the following:

Example 8: Markup to map

You can define multiple computation IDs for one company in the <index sortstr="currency restatement">Currency Restatement</index> program.

If you want the <index> element to serve as an ITS "term", you could use the following mapping:

Example 9: Mapping
<purposeSpec>
 <servesPurpose origVoc="index" its="term"/>
</purposeSpec>

One question to answer is: How can existing attributes (e.g. sortstr in the sample above) be carried over, or how can new attributes (like partOfSpeech, termType) be introduced?

3.9 Cultural Aspects of the Content

[R009] It must be possible to specify finer or coarser granularity of cultural aspects of content than a language, locale or country. Such aspects may include script usages, regions, geographical areas, dialects or content context. The declaration of such an attribute should be done at the beginning of a document. Any content within a document which varies from the primary declaration should be labeled appropriately.

3.9.1 Challenges

In order to successfully and efficiently parse document content, there should be more information than a language or a locale. Here are some examples of these types of issues:

  • A language/locale cannot perfectly represent orthography: e.g. "zh" does not stipulate if it is simplified or traditional Chinese. Locale for Azerbaijan does not provide guidance as to whether the language should be written in Latin or Cyrillic scripts.

  • Multiple cultural preferences within one locale: e.g. In Japanese ("ja-JP"), there are two official date formats – Japanese emperor date (和暦 [Wareki]) and Gregorian date format (西暦 [Seireki]).

  • Finer language variations: e.g. how does one indicate that a voice track is in the language spoken in German-speaking Switzerland rather than the language written there, since one is Schwyzertuutsch (Swiss German) and the other is very close to but not the same as "High German"?

  • Different writing styles and tones in one language: e.g. Japanese uses a polite style (です・ます調 [Desu/masu] tone) for user guides and a formal style (だ・である調 [Da/dearu] tone) for academic and legal content. Italian uses an informal style for software help content and a formal style for user guides. Identifying these variations is very important especially for content reusability. When the content is reused both in source and target languages, context information (such as whether the content is for a user guide or a user help) must be provided in order to reuse content with an appropriate writing style.

3.9.2 Notes

RFC 3066bis, "Tags for Identifying Languages" [RFC 3066bis], defines in detail the structure and usage of language tags, extending RFC 3066. It proposes ways to define extended language sub-tags, such as variant sub-tags, region sub-tags and private use sub-tags, which could be solutions for the issues described above. See also "Supplementary information for RFC 3066bis" [RFC 3066bis Info].
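As a sketch of how such sub-tags could address some of the issues above, the fragment below uses script sub-tags for the Chinese and Azerbaijani cases and a private use sub-tag for writing style. The <doc> and <p> elements are placeholders, and the meaning of x-informal is an assumption: private use sub-tags only carry meaning by agreement between the parties exchanging the content.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<doc xml:lang="en">
  <!-- Script sub-tags: distinguish orthographies of the same language -->
  <p xml:lang="zh-Hans">简体中文</p>
  <p xml:lang="zh-Hant">繁體中文</p>
  <p xml:lang="az-Latn">Azərbaycan</p>
  <p xml:lang="az-Cyrl">Азәрбајҹан</p>
  <!-- Private use sub-tag (hypothetical): informal Italian style, e.g. for software help -->
  <p xml:lang="it-x-informal">Fai clic su OK per continuare.</p>
</doc>
```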

3.10 Link to Internal/External Text

[Ed. note: Text for this requirement is still pending discussion, and is linked from the ITS home page.]

[R010] TBD

3.10.1 Challenges

TBD

3.10.2 Notes

TBD

3.11 Bidirectional Text Support

[Ed. note: Text for this requirement is still pending discussion, and is linked from the ITS home page.]

[R011] TBD

Go to the table of contents.3.11.1 Challenges

TBD

Go to the table of contents.3.11.2 Notes

TBD

3.12 Indicator of Translatability

[Ed. note: This requirement is still at its Initial Draft stage.]

[R012] Methods must exist to specify which parts of a document are to be translated and which are not.

3.12.1 Challenges

The content of XML documents can usually be seen as either generally translatable (e.g. an XHTML file), or generally not translatable (e.g. an SVG file). A mechanism should exist to identify the parts of the document that are exceptions to the rule.

The mechanism should also allow for the specification of exceptions within exceptions. For example, within the elements of an SVG document, which are generally not translatable, it should allow one to specify that <text> is to be translated, but also that some occurrences of the <text> element (e.g. with an attribute translate="no") are not to be translated.
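
The layering described above (a default for the document type, an exception per element name, and an exception within that exception per occurrence) could be sketched as follows. This is only an illustration under stated assumptions: the function translatable_elements, its parameters, and the sample SVG content are hypothetical, and the sketch does not model inheritance of the flag by child elements.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"

# Hypothetical sample: an SVG file, generally not translatable.
SVG = """<svg xmlns="http://www.w3.org/2000/svg">
 <style>text { fill: black }</style>
 <text>Press Start</text>
 <text translate="no">ACME-9000</text>
</svg>"""


def translatable_elements(root, default=False, element_rules=None,
                          flag="translate"):
    """Return the elements whose own text is to be translated.

    Precedence, from weakest to strongest:
    document-type default, per-element-name rule, per-occurrence flag.
    """
    element_rules = element_rules or {}
    selected = []
    for elem in root.iter():
        decision = element_rules.get(elem.tag, default)
        override = elem.get(flag)
        if override is not None:
            # Exception within the exception: an explicit flag wins.
            decision = (override == "yes")
        if decision:
            selected.append(elem)
    return selected


root = ET.fromstring(SVG)
rules = {"{%s}text" % SVG_NS: True}  # exception: <text> is translatable
texts = [e.text for e in translatable_elements(root, False, rules)]
print(texts)  # ['Press Start']
```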

The mechanism should be able to map existing elements that already carry translatability information, whether implicitly or explicitly. Here are some examples of this:

  • The <trademark> element in DocBook may be an indicator of non-translatable content.

  • The <text> element in SVG indicates translatable content.

  • The translate attribute in DITA is used to flag translatability.

The mechanism should provide a way to delimit a portion of the content if such a mechanism does not exist in the original vocabulary (so that parts of the content can be marked as translatable or not).

The methods used to identify the translatable parts of a document should be usable by localization tools for both:

  • Processing the document directly.

  • Generating localization properties settings files that can be used on all documents of the same document type.

3.12.2 Notes

Part of this requirement is related to the "Section 3.2: Span-Like Element" requirement.

Another part is related to the "Section 3.8: Purpose Specification/Mapping" requirement.

There is a relationship between indicating the parts of the content that are to be translated and the parts of the content that are to be included in "Section 3.13: Metrics Count".

Indicators of translatability may be used to help translation tools create localization properties files (i.e. tool settings describing how to handle a given type of document from the viewpoint of localization). They can also be used to complement the localization properties by adding information in document instances.

The information about which parts of a document are translatable is not limited to localization. Such information can be used in other contexts. For instance, when implementing accessibility features, it can be used to identify content that needs to be processed differently from the rest of the document.

3.13 Metrics Count

[Ed. note: Text for this requirement is still pending discussion, and is linked from the ITS home page.]

[R013] TBD

3.13.1 Challenges

TBD

3.13.2 Notes

TBD

3.14 Limited Impact

[Ed. note: This requirement is still at its Initial Draft stage.]

[R014] All proposed solutions should be designed to have as little impact as possible on the tree structure of the original document and on the content models in the original schema.

3.14.1 Challenges

Inserting elements or attributes of a different namespace in an XML document can have side effects on various processing aspects. For example, the inserted nodes may:

  • Break the XPath expressions already in use to access part of the document.

  • Interfere with <xsl:value-of/> for extracting information.

  • Interfere with numbering and other aspects of styling the original document.

Solutions for any of the ITS requirements must take into account these potential drawbacks and offer implementations that have limited impact on the original document and on the content models in the original schema.
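
The first side effect can be reproduced with a few lines of Python and the standard xml.etree.ElementTree module. The sketch below is illustrative only: the element names, the <notrans> wrapper and the path "./table/tr" are assumptions standing in for whatever markup and path expressions an existing tool chain already relies on.

```python
import xml.etree.ElementTree as ET

# Three variants of the same hypothetical document.
ORIGINAL = "<doc><table><tr>A</tr></table></doc>"
WITH_ATTRIBUTE = '<doc><table translate="no"><tr>A</tr></table></doc>'
WITH_WRAPPER = "<doc><notrans><table><tr>A</tr></table></notrans></doc>"

# A path that an existing tool might already use to reach table rows.
ROW_PATH = "./table/tr"


def count_rows(xml_text):
    """Count the rows found by the pre-existing path expression."""
    return len(ET.fromstring(xml_text).findall(ROW_PATH))


print(count_rows(ORIGINAL))        # 1
print(count_rows(WITH_ATTRIBUTE))  # 1: the added attribute is invisible to the path
print(count_rows(WITH_WRAPPER))    # 0: the added element changes the tree
```

The added attribute leaves the element tree unchanged, while the added wrapper element inserts a level that the existing path does not expect.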

For instance:

  • Use attributes whenever possible (they have a lesser impact than elements). For example:

    Example 10: Using an extra attribute
    <table translate="no">
     <tr>...
    </table>

    is better than:

    Example 11: Using an extra element
    <notrans>
     <table>
      <tr>...
     </table>
    </notrans>
  • Use data categories that already exist in the original markup by either mapping them to ITS concepts (see "Section 3.8: Purpose Specification/Mapping") or by using them to carry ITS attributes. For example:

    Example 12: Mapping concepts
    <info>
     <mapping target='quote' its='notrans'/>
    </info>
    ...
    <para>The motto of Québec is:
     <quote>"je me souviens"</quote>.</para>
  • Group general ITS information in branches that are placed in locations where they have a minimal impact:

    Example 13: Information placement
    <doc>
     <info>
     ...
     </info>
     <header>...
     <body>...
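
To make the mapping approach of Example 12 more concrete, the following Python sketch shows how a tool might read such a mapping and find the elements it designates. It is a sketch under assumptions: the <mapping> element with its target and its attributes is the illustrative markup of Example 12, not a defined ITS construct, and the helper names read_mappings and elements_for are hypothetical.

```python
import xml.etree.ElementTree as ET

# The document of Example 12, wrapped in a hypothetical <doc> root.
DOC = """<doc>
 <info>
  <mapping target='quote' its='notrans'/>
 </info>
 <para>The motto of Québec is:
  <quote>"je me souviens"</quote>.</para>
</doc>"""


def read_mappings(root):
    """Collect {element name: ITS concept} from the <info> branch."""
    return {m.get("target"): m.get("its")
            for m in root.findall("./info/mapping")}


def elements_for(root, concept):
    """Return the elements mapped to the given ITS concept."""
    names = {name for name, its in read_mappings(root).items()
             if its == concept}
    return [elem for elem in root.iter() if elem.tag in names]


root = ET.fromstring(DOC)
print([e.text for e in elements_for(root, "notrans")])
# ['"je me souviens"']
```

The <para> and <quote> markup is untouched: all ITS information sits in the <info> branch.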

3.14.2 Notes

One question that remains to be discussed is whether ITS should encompass not only a tag set but also a specification of processing steps for documents. One such step could be the separation of the document into namespace-specific sections, which would limit the side effects mentioned above.

The Namespace Routing Language [NRL] could be used for this purpose. "Part 4: Namespace-based Validation Dispatching Language — NVDL" [NVDL] of the ISO/IEC 19757 proposal "Document Schema Definition Languages (DSDL)" [DSDL] relies mainly on NRL. The following example NRL document can be applied to XML documents with markup from the XHTML namespace and a fictitious ITS namespace. With this NRL document, the XML documents are validated only against the XHTML schema "xhtml.rng":

Example 14: Using NRL with XHTML and ITS
<rules startMode="root"
 xmlns="http://www.thaiopensource.com/validate/nrl">
 <mode name="root">
  <namespace ns="http://www.w3.org/1999/xhtml">
   <validate schema="xhtml.rng" useMode="xhtml"/>
  </namespace>
 </mode>
 <mode name="xhtml">
  <namespace ns="http://www.example.org/its">
   <unwrap/>
  </namespace>
  <namespace ns="http://www.w3.org/1999/xhtml">
   <attach/>
  </namespace>
 </mode>
</rules>
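
The separation step itself can also be pictured outside of NRL. The Python sketch below removes attributes in the fictitious ITS namespace and replaces ITS elements by their content, which loosely mirrors the effect of the unwrap action above. It is not an NRL implementation; the function names and the sample paragraph are hypothetical.

```python
import xml.etree.ElementTree as ET

ITS_NS = "http://www.example.org/its"  # fictitious, as in Example 14
ITS_PREFIX = "{" + ITS_NS + "}"


def _add_text_before(parent, index, text):
    """Attach text just before child position `index` of `parent`."""
    if not text:
        return
    if index == 0:
        parent.text = (parent.text or "") + text
    else:
        previous = parent[index - 1]
        previous.tail = (previous.tail or "") + text


def unwrap_its(elem):
    """Drop ITS attributes and replace ITS elements by their content."""
    for name in [a for a in elem.attrib if a.startswith(ITS_PREFIX)]:
        del elem.attrib[name]
    i = 0
    while i < len(elem):
        child = elem[i]
        if child.tag.startswith(ITS_PREFIX):
            grandchildren = list(child)
            elem.remove(child)
            _add_text_before(elem, i, child.text)
            for offset, node in enumerate(grandchildren):
                elem.insert(i + offset, node)
            _add_text_before(elem, i + len(grandchildren), child.tail)
            # Stay at position i: the promoted content is examined next.
        else:
            unwrap_its(child)
            i += 1
    return elem


SOURCE = (
    '<p xmlns="http://www.w3.org/1999/xhtml" '
    'xmlns:its="http://www.example.org/its" its:translate="no">'
    'Hello <its:span>big <b>wide</b> </its:span>world</p>'
)
clean = unwrap_its(ET.fromstring(SOURCE))
print("".join(clean.itertext()))  # Hello big wide world
```

After this step only XHTML markup remains, so tools written for the original vocabulary see the tree they expect.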

3.15 Attributes and Translatable Text

[Ed. note: Text for this requirement is still pending discussion, and is linked from the ITS home page.]

[R015] TBD

3.15.1 Challenges

TBD

3.15.2 Notes

TBD

3.16 Naming Scheme

[Ed. note: Text for this requirement is still pending discussion, and is linked from the ITS home page.]

[R016] TBD

3.16.1 Challenges

TBD

3.16.2 Notes

TBD

3.17 Localization Notes

[Ed. note: Text for this requirement is still pending discussion, and is linked from the ITS home page.]

[R017] TBD

3.17.1 Challenges

TBD

3.17.2 Notes

TBD

3.18 Handling of White-Spaces

[Ed. note: Text for this requirement is still pending discussion, and is linked from the ITS home page.]

[R018] TBD

3.18.1 Challenges

TBD

3.18.2 Notes

TBD

3.19 Multilingual Documents

[Ed. note: Text for this requirement is still pending discussion, and is linked from the ITS home page.]

[R019] TBD

3.19.1 Challenges

TBD

3.19.2 Notes

TBD

3.20 Annotation Markup

[Ed. note: Text for this requirement is still pending discussion, and is linked from the ITS home page.]

[R020] TBD

3.20.1 Challenges

TBD

3.20.2 Notes

TBD

3.21 Identifying Date and Time

[Ed. note: Text for this requirement is still pending discussion, and is linked from the ITS home page.]

[R021] TBD

3.21.1 Challenges

TBD

3.21.2 Notes

TBD

A References

A.2 Other References

Bidi
Richard Ishida. What you need to know about the bidi algorithm and inline markup, W3C Internationalization FAQ. Available at http://www.w3.org/International/articles/inline-bidi-markup/.
DSDL
ISO/IEC. ISO/IEC 19757 - DSDL, Document Schema Definition Languages. Available at http://dsdl.org/.
HLink
Steven Pemberton, Masayasu Ishikawa, editors. Link recognition for the XHTML Family, W3C Working Draft 13 September 2002. Available at http://www.w3.org/TR/2002/WD-hlink-20020913/. The latest version of HLink is available at http://www.w3.org/TR/hlink/.
LISA FAQ
Localization Industry Standards Association. Frequently Asked Questions. Available at http://www.lisa.org/info/faqs.html.
NRL
James Clark. Namespace Routing Language (NRL), Thai Open Source Software Center Ltd, 2003-06-13. Available at http://www.thaiopensource.com/relaxng/nrl.html.
NVDL
ISO/IEC JTC 1/SC 34. Document Schema Definition Languages (DSDL) — Part 4: Namespace-based Validation Dispatching Language — NVDL, 2004-05-31. Available at http://dsdl.org/0525.pdf.
XFI
Paul Grosso, Daniel Veillard, editors. XML Fragment Interchange, W3C Candidate Recommendation 12 February 2001. Available at http://www.w3.org/TR/2001/CR-xml-fragment-20010212. The latest version of XFI is available at http://www.w3.org/TR/xml-fragment.
RFC 3066bis
Addison Phillips, Mark Davis, editors. Tags for Identifying Languages, draft-ietf-ltru-registry-09. Available at http://www.ietf.org/internet-drafts/draft-ietf-ltru-registry-09.txt.
RFC 3066bis Info
Doug Ewell. Supplementary codes for RFC 3066bis. Available at http://users.adelphia.net/~dewell/rfc3066bis-codes.html.
Ruby
Richard Ishida. What is Ruby?, W3C Internationalization FAQ. Available at http://www.w3.org/International/questions/qa-ruby.
Schematron
Schematron Committee. Schematron Home Page. Available at http://www.schematron.com/.
TBX-Link
Alan K. Melby, Andrzej Zydroń, editors. TermBase eXchange Link (TBX Link) 1.0 Specification, Initial Draft 0.1. Available at http://www.lisa.org/oscar/tbxlink/TBX-Link.html.
XML ID
Jonathan Marsh, Daniel Veillard, Norman Walsh, editors. xml:id Version 1.0, W3C Proposed Recommendation 12 July 2005. Available at http://www.w3.org/TR/2005/PR-xml-id-20050712/. The latest version of XML ID is available at http://www.w3.org/TR/xml-id/.

B Acknowledgements (Non-Normative)

The initial requirements in this document have been developed and edited on a wiki system driven by several members of the ITS Working Group: Tim Foster (Sun Microsystems), Richard Ishida (W3C), Masaki Itagaki (Invited Expert), Christian Lieske (SAP), Naoyuki Nomura (Ricoh), Yves Savourel (ENLASO), and Andrzej Zydroń (Invited Expert).

The other members of the ITS Working Group have also contributed their valuable time and comments to the creation of these requirements: Karunesh Arora (CDAC), Martin Dürst (Invited Expert), François Richard (HP), Felix Sasaki (W3C), Dianne Stoick (Boeing), and Najib Tounsi (Ecole Mohammadia d’Ingénieurs).