The I18N WG herewith formally objects to the post-lastcall removal of external language information from XML Literals. This document gives the reasons for this objection, and some background on our motivation and the history of this discussion.
This document is perhaps not as polished as one would like. Some sections are rather well worked out, others are not. A lot of links could be added.
Main mail messages: RDF decision (point 12: Language tags in typed literals); notice of this decision to I18N WG
Main specs/proposals: RDF M&S, lastcall WDs: Primer, Concepts, Semantics, Syntax, Tests, Schema, post-lastcall internal WDs: Primer, Concepts, Semantics, Syntax, Tests, Schema.
The I18N WG requests that the following XML/RDF document produce two triples (as it did at lastcall), rather than one (as at post-lastcall):
<rdf:RDF>
  <rdf:Description rdf:about="http://example.org/node">
    <eg:property xml:lang="fr" rdf:parseType="Literal">chat</eg:property>
    <eg:property xml:lang="en" rdf:parseType="Literal">chat</eg:property>
  </rdf:Description>
</rdf:RDF>
[It would have been possible to express this in terms of test cases in the last call, but the tests have been changed in the meantime, and depended on syntactic details that are irrelevant.]
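The difference between the two counts can be illustrated with a small sketch. It is a toy model, not an RDF parser: the `triples` helper and the `eg:` namespace URI are assumptions made for illustration, the namespace declarations are added only so that the example parses, and a literal is modeled as a (text, language) pair.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# The example document, with namespace declarations added so that it parses.
# The eg: namespace URI is a placeholder assumption.
DOC = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:eg="http://example.org/terms#">
  <rdf:Description rdf:about="http://example.org/node">
    <eg:property xml:lang="fr" rdf:parseType="Literal">chat</eg:property>
    <eg:property xml:lang="en" rdf:parseType="Literal">chat</eg:property>
  </rdf:Description>
</rdf:RDF>"""


def triples(doc, keep_language):
    """Return the set of (subject, predicate, object) triples.

    Toy model: only handles text-only literals directly under rdf:Description.
    """
    result = set()
    root = ET.fromstring(doc)
    for description in root.findall(f"{{{RDF_NS}}}Description"):
        subject = description.get(f"{{{RDF_NS}}}about")
        for prop in description:
            text = prop.text or ""
            # Lastcall keeps the language with the literal; post-lastcall drops it.
            lang = prop.get(XML_LANG) if keep_language else None
            result.add((subject, prop.tag, (text, lang)))
    return result


lastcall = triples(DOC, keep_language=True)
post_lastcall = triples(DOC, keep_language=False)
print(len(lastcall), len(post_lastcall))  # prints "2 1"
```

Because a set of triples cannot contain duplicates, dropping the language makes the French and the English "chat" collapse into a single statement.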
These are our main requirements for language information in RDF:
The reasons for our objection are listed below grouped as follows:
The post-lastcall proposal is in direct violation of the provisions for xml:lang in XML 1.0. This will lead to problems for both tools and humans, and sets a bad precedent for other specifications:
The post-lastcall approach relies on the use of <dummy> elements to carry language information inside XML Literals (for all forms of RDF, not only for RDF/XML). This raises the following problems:
For internationalization purposes, text sometimes needs micro-markup. In many cases, this need is not evident to data designers and application designers. It is therefore important to provide for a transition from plain literals to XML Literals that is as smooth as possible. This in particular applies to XML literals without any markup.
If <dummy> elements are inserted to carry language information, it is impossible for a general application or a general technology such as a future RDF Query mechanism to know whether an element was inserted as a dummy element or carries actual meaning.
The change from lastcall to post-lastcall interpretation of xml:lang in RDF/XML documents has several problems:
xml:lang="" for each XML Literal, thus effectively making the post-lastcall change irrelevant.]
Many alternative solutions are available. Any of them would be acceptable to us, because they avoid the problems listed above.
The original design in RDF M&S is best shown by the following example:
<rdf:Description
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dc="http://purl.org/metadata/dublin_core#"
    xmlns="http://www.w3.org/TR/REC-mathml"
    rdf:about="http://mycorp.com/papers/NobelPaper1">
  <dc:Title rdf:parseType="Literal">
    Ramifications of
    <apply>
      <power/>
      <apply>
        <plus/>
        <ci>a</ci>
        <ci>b</ci>
      </apply>
      <cn>2</cn>
    </apply>
    to World Peace
  </dc:Title>
  <dc:Creator>David Hume</dc:Creator>
</rdf:Description>
This example shows the following salient design points in RDF M&S:
Micro-markup here refers to markup at the phrasal level. This is important for the following reasons, the first four of which are related to Internationalization:
He said <span xml:lang='fr'>Oui</span> because he spoke French fluently.
This is clearly documented, among other places, in: I18N last call comments on M&S
Consistent and easy to use way of identifying the language of text pieces
It is very important to have a consistent way to identify the language of a piece of text in any technology so that generic operations needing this information can use it easily. Such operations include rendering-related operations such as (CJK) glyph disambiguation, font selection, hyphenation, text-to-speech conversion (important for accessibility), proofing operations such as spell-checking, as well as operations related to the semantics of the text.
Language identification should not be different for each application, but should be the same independent of the application, i.e. it should depend only on the underlying technology. The best example for this is xml:lang. XML applications are not required to use xml:lang if they do not need it, but they can use it off-the-shelf whenever needed.
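As an illustration of how a generic tool can use xml:lang off-the-shelf, the following sketch resolves the language of each piece of text by applying the inheritance rule of XML 1.0, without any knowledge of the application vocabulary. The function name `text_with_language` and the enclosing `<p>` element are assumptions made for this example; the sentence itself is the micro-markup example used elsewhere in this document.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def text_with_language(element, inherited=None):
    """Yield (text, language) pairs, applying xml:lang inheritance."""
    lang = element.get(XML_LANG, inherited)
    if lang == "":
        lang = None  # xml:lang="" means: no language information available
    if element.text and element.text.strip():
        yield element.text.strip(), lang
    for child in element:
        yield from text_with_language(child, lang)
        # Tail text follows the child but belongs to this element,
        # so it takes this element's language, not the child's.
        if child.tail and child.tail.strip():
            yield child.tail.strip(), lang


SAMPLE = (
    '<p xml:lang="en">He said <span xml:lang="fr">Oui</span> '
    "because he spoke French fluently.</p>"
)
pairs = list(text_with_language(ET.fromstring(SAMPLE)))
print(pairs)
# [('He said', 'en'), ('Oui', 'fr'), ('because he spoke French fluently.', 'en')]
```

The same function works unchanged for any XML vocabulary, which is the point of tying language identification to the underlying technology rather than to each application.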
Consistency also applies across base technologies. All W3C technologies, and all IETF technologies we know, use the same RFC 3066 language tags for language identification.
Language information is in many cases obvious to human readers. Also, humans often deal with information that is mostly in a single language. Therefore, it is easy for humans, from data providers to application programmers, to ignore the importance of language information. If given a choice between preserving language information and preserving other aspects of information, language information easily loses.
This example uses RDF/XML notation because this notation is more stable; the example is about the model rather than the notation. Consider the following six statements:
<rdf:Description rdf:about='resource'>
  <prop>foo</prop>                                          <!-- (A) -->
  <prop xml:lang='en'>foo</prop>                            <!-- (B) -->
  <prop xml:lang='fr'>foo</prop>                            <!-- (C) -->
  <prop rdf:parseType='Literal'>foo</prop>                  <!-- (D) -->
  <prop rdf:parseType='Literal' xml:lang='en'>foo</prop>    <!-- (E) -->
  <prop rdf:parseType='Literal' xml:lang='fr'>foo</prop>    <!-- (F) -->
</rdf:Description>
In a widely shared understanding of M&S, there are two possible interpretations:
At last call, there was the following interpretation: None of the above entails any other one.
After last call, this was changed to the following interpretation: (D), (E), and (F) mutually entail each other, but (A), (B), and (C) are mutually different and are all different from the D-F group. To get the distinction implied by the different xml:lang attribute values in D-F, RDF Core is proposing to add 'dummy' elements, as follows:
<prop rdf:parseType='Literal'>foo</prop>                                              <!-- (D) -->
<prop rdf:parseType='Literal' xml:lang='en'><dummy xml:lang='en'>foo</dummy></prop>   <!-- (E') -->
<prop rdf:parseType='Literal' xml:lang='fr'><dummy xml:lang='fr'>foo</dummy></prop>   <!-- (F') -->
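The difficulty with such wrapper elements can be sketched as follows. The helper names `wrap_with_dummy` and `looks_like_language_wrapper` are hypothetical, and the heuristic is only one guess at what a consumer might try; nothing here is taken from the RDF Core drafts.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def wrap_with_dummy(text, lang, element_name="dummy"):
    """Producer side: wrap plain text in an element that carries xml:lang."""
    wrapper = ET.Element(element_name)
    wrapper.set(XML_LANG, lang)
    wrapper.text = text
    return ET.tostring(wrapper, encoding="unicode")


def looks_like_language_wrapper(literal_xml):
    """Consumer-side heuristic (hypothetical): one element with xml:lang
    and no child elements."""
    root = ET.fromstring(literal_xml)
    return XML_LANG in root.attrib and len(root) == 0


inserted = wrap_with_dummy("foo", "en")       # added only to carry the language
authored = '<span xml:lang="en">foo</span>'   # markup the author actually meant

# Both literals have the same shape, so the heuristic accepts both.
print(looks_like_language_wrapper(inserted), looks_like_language_wrapper(authored))
# prints "True True"
```

Since the element name of a wrapper is arbitrary, a generic consumer has no reliable way to decide which of the two elements may be stripped and which carries meaning.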
Table of observable artefacts and their handling by RDF:
M&S | Last Call | Post Last Call | |||||
plain | XML | xsd:string | plain | XML | xsd:string | ||
Text | X | X | |||||
Text with language info | X | ||||||
Text with markup | X | ||||||
Text with language info and markup | X | ||||||
XML data | (X) |
In discussion, two contrasting uses of XML Literals in RDF and RDF/XML have become apparent, and can roughly be characterized as follows:
The post-lastcall proposal makes it unduly difficult for usages according to the second view. On the other hand, the lastcall proposal does not needlessly complicate usages according to the first view. Adding xml:lang="" is much easier than adding arbitrary dummy elements.
There is also a serious concern that users will simply ignore the potential of micro-markup if it is too difficult to use.
RDF data created according to RDF M&S or to lastcall.
Message calling for "unacceptably adversely affected" cases.
The following things are important for Internationalization:
parseType="Literal" syntax and the handling of xml:lang. See also I18N last call comments. On xml:lang, RDF M&S says:
The xml:lang attribute may be used as defined by [XML] to associate a language with the property value. There is no specific data model representation for xml:lang (i.e., it adds no triples to the data model); the language of a literal is considered by RDF to be a part of the literal. An application may ignore language tagging of a string. All RDF applications must specify whether or not language tagging in literals is significant; that is, whether or not language is considered when performing string matching or other processing.
@@@ add link to Jeremy's mail
Massimo Marchiori (look for *** Section 3.2.2): This is the only comment asking for explicit removal of XML Literals as a special case.
Joseph Reagle one mail, other mail: Joseph wanted to make sure there is no confusion between Canonical XML and exclusive canonicalization, but did not say anything one way or another on xml:lang.
Peter Patel-Schneider:
Tim Berners-Lee: A good interpretation of Tim's comments is provided by Patrick Stickler. The comments are not related to xml:lang.
Eric Prud'hommeaux:
First round of proposals by Jeremy (nuking language information on XML Literals is option 4)
Notable reply by Patrick (coming to the same interpretation of Tim's last call comments and the relation to M&S and charter as we do)
Confirmation from Pat that any of the solutions would be "Not very difficult."... "I am ready for almost any decision we make,"
Solution to "API issues" with wrapper proposal (Jeremy)
Ugly parade (Jeremy)
Jeremy's summary of arguments by RDF Core
This section lists some of the arguments that have been made for the post-lastcall solution that we think are unsubstantiated:
Against the wrapper solution: Unclear where wrapper comes from (Patrick): The differentiation is very easy: if there is a wrapper in the RDF/XML, there will be two wrappers in the wrapped literal. (@@@ add link to Martin's answer to Brian)
Exclusive Canonicalization says so: Exclusive Canonicalization is a tool with some limitations. The tool should not be used without taking into account its limitations. (@@@ add links)
Use XML Fragments: XML Fragments (CR) is not designed to include independent document pieces in another document, so it is not directly applicable.
rdf:parseType="Literal" as an enveloping mechanism for XML content
@@@ Jeremy's mail to Jena-Devel