I18N Discussion of RDF dir literal issue -- 29 May 2019

<scribe> scribenick: addison

<chaals> scribe: chaals

how to handle text direction in Verifiable Claims …

Manu: Don't think we will solve this completely by the time VC moves to PR, but would like to have at least a sound direction that we should be aiming at.

<scribe> chair: Addison

UNKNOWN_SPEAKER: Is it OK if we focus on what that should be, or is there something else we need to do here?

AP: Publishing Internet Draft attracted attention, which was not terrible. This issue has been longstanding, and we would like to get it sorted

Manu: I apologise for my methods… but at least we are now here working towards a solution

IH: We had a similar discussion with Addison/Richard about WebPublication, which has the same situation.

<ivan> https://w3c.github.io/wpub/#manifest-specific-language-and-dir

IH: the agreement there is documented, so the group acknowledges there is no standard solution. Which is apparently acceptable for now...
... so it should be acceptable for VC too, no?
... But we still need an actual solution…

<r12a> https://w3c.github.io/rdf-dir-literal/

<Zakim> manu, you wanted to ask if referencing 2.6.3.4.2 Item-specific Language would be "good enough" for VC spec?

AP: Agree with Ivan that VC is yet another spec that is stuck in this situation. We are not thrilled with the way this is, but it's not awful for now. There are lots of specs that need to deal with this, so the WebPub approach lets specs move forward. But we are not satisfied with this as a recommendation...

MS: Is the fallback to reference the specific language thing or string-meta, as a way to get specs to Rec, and in parallel look for a proper solution? I don't think anyone is happy with the current situation.
... If we refer to some specific language can we move the VC spec forward and resolve the review issue?

<ivan> ack

IH: Process-wise no: webpub is behind you in progress and will stay there, so it won't be a viable reference.
... string-meta is a WG note - OK for an informative reference but not exactly solid.
... You could copy it, or we can ask for string-meta to be updated.

<addison> https://w3c.github.io/string-meta/#best-practices-recommendations-and-gaps

AP: string-meta is meant to serve as a host for best practices as well as explain the problem.

IH: So it would be useful to update string-meta with what we are putting in specs.

<Ralph> [when would the JSON-LD WG be confident that the relevant language in its drafts would not change?]

<dlongley> i note that the item-specific language link does not use a scoped context for aliases for `@value` and `@language` -- so we'd want to do something slightly different anyway

AP: Yes. Would like to avoid lots of specs linking in circles. Specs should have a common core reference. That's what string-meta is for but the challenge is to get everything right there.

Manu: So if we refer to string-meta and it says the right thing we are good to go for Rec.

AP: Yes, modulo maybe an update to string-meta.

Manu: OK, so I can make a PR to look at.

<r12a> https://w3c.github.io/string-meta/

Manu: So let's talk about the various proposals.
... Let's introduce the proposals first, without debating them

IH: I had a discussion with various people before I put this together.

<Ralph> RDF Literals and Base Directions

IH: There are 3 viable options, and I noted some pros and cons. This would need some cleaning...
... Option 1: Fix the RDF original sin of not dealing with the problem yet and defining literals with a language tag but no direction info.
... Looking at RDF it is relatively easy in principle to fix that by adding direction. But this means setting up a WG to revise RDF, and then there is a lot of deployment that has to be reworked.
... Personally, I think this would be good, but it might be too hard in practice

RI: Please explain what needs to get changed if we do this

AP: If a language tag is expected at the end as "foo"@en and then you have "bar"@en^rtl then the language tag will get lost in the string.

IH: Problem in parsing is that there is an internal structure that would have to be extended, e.g. in RDFLib. Unlike HTML which has only 3 engines in use, there are a dozen or so in RDF.

RI: Understand the problem you describe. If we use this extension syntax for the language, would that be a possible option?

AP: It's not at the end

<addison> "bar"@en-d-rtl-u-nu-latn

<dlongley> *that* would work^

<dlongley> but that's not this option

RI: Where "this extension" means what AP just put in.
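One way to read the parsing concern above, as a minimal sketch: the Turtle grammar defines language tags with the LANGTAG production ('@' [a-zA-Z]+ ('-' [a-zA-Z0-9]+)*), so a tag such as en^rtl falls outside it while the BCP47-style en-d-rtl-u-nu-latn stays inside. The function name is hypothetical and the check covers only this one production, not a full Turtle parser.

```python
import re

# LANGTAG production from the Turtle grammar, without the leading "@".
LANGTAG = re.compile(r"[a-zA-Z]+(-[a-zA-Z0-9]+)*")


def is_turtle_langtag(tag: str) -> bool:
    """Return True if the whole string matches the LANGTAG production."""
    return LANGTAG.fullmatch(tag) is not None


print(is_turtle_langtag("en"))                  # True
print(is_turtle_langtag("en^rtl"))              # False: "^" is not allowed
print(is_turtle_langtag("en-d-rtl-u-nu-latn"))  # True
```

An existing parser built around this production would accept the extension-style tag unchanged, which seems to be the point of the "that would work" remark.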

IH: RDF environments take the language tag and pass it on.
... in extending the model we need to carry two annotation strings.
... instead of just one, as in the current situation.

<gkellogg> “bar”@en^ltr

IH: I'm describing option 1, and you are pasting examples of option X

Mark: Not sure why we need to have a hacky language tag option. The base can *normally* be derived from the language

Manu: On why the serialisation in this option matters: in VC we sign the credentials. The language direction would be something we want to have serialised properly so we can sign it. That's why it matters how it gets serialised.
... some options require us to fix RDF properly, some are a "hack" that we can serialise today but may have other issues.

<Ralph> [is there consensus in the I18N community on the equality of strings that differ only base direction?]

IH: Option 2 uses RDF datatypes. We already have strings, booleans, …
... This means you define a lexical space in the serialisation, the value space, and you need a URL to identify the datatype.
... So we would define a new datatype that is a string which handles language *and* direction.
... This is on top of the core of RDF instead of changing the core.

<gkellogg> “foo@en^ltr”^^rdf:LocalizableString

IH: Serialisation can be used, but the handling of such literal can be done with existing structures because there is a way to handle new datatypes.
... This means RDF canonicalisation needs to be able to handle new datatypes, so it is relatively easy and doesn't need rebuilding existing tooling.
... That option is not so nice because we duplicate the core language thing which is a problem. RDF already has a story of doing this so we can imagine this could be handled.
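A rough sketch of how a processor might map the lexical space of such a datatype to a value space, assuming the "text@lang^dir" lexical form pasted above. The datatype is only a proposal in this discussion, and the names LocalizedText and parse_localizable are hypothetical.

```python
from typing import NamedTuple


class LocalizedText(NamedTuple):
    """Value-space triple for the proposed (hypothetical) datatype."""
    text: str
    language: str
    direction: str


def parse_localizable(lexical: str) -> LocalizedText:
    """Split 'text@lang^dir' from the right, so '@' or '^' in the text survive."""
    rest, sep, direction = lexical.rpartition("^")
    if not sep or direction not in ("ltr", "rtl"):
        raise ValueError(f"missing or invalid direction in {lexical!r}")
    text, sep, language = rest.rpartition("@")
    if not sep or not language:
        raise ValueError(f"missing language tag in {lexical!r}")
    return LocalizedText(text, language, direction)


print(parse_localizable("foo@en^ltr"))
```

Splitting from the right is a design choice of this sketch: the language tag and direction have a restricted alphabet, while the text itself may contain either separator.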

<Zakim> chaals, you wanted to note the problem of defining a parallel datatype is adoption.

<dlongley> scribe: dlongley

chaals: So the problem with that is that when you define a parallel datatype for something we already have, getting people to use it is really really hard. Everyone is set up in their tooling to use the existing one and there's very little incentive to switch. In my experience this has failed in the past. Don't know if you have a different experience.

<scribe> scribe: chaals

IH: Acknowledge the problem. We don't really have the experience. It does depend how you serialise it… if the datatype is defined, JSON-LD and Turtle would have a transition that makes this as smooth as possible. JSON-LD is being worked on now. Acknowledge this is not an ideal solution...
... but maybe easier
... Option 3 is to extend the language identifier with a direction annotation, e.g. en-d-rtl (for direction Right to Left)
... From RDF perspective, this would be simple because the language tag isn't parsed by RDF processors - at most they are compared as strings
... For now, "foo"@en != "foo"@en-au
... So this is simple from implementation.
... Then there is the discussion that occurred in Github. I would have liked feedback on the first two options.
... But this one generated a sizeable discussion.
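The string-comparison behaviour described for option 3 can be sketched as follows. This is a simplified model, not any particular RDF library: LangString and same_literal are hypothetical names, and lower-casing the tag before comparing is an assumption of this sketch about how tags are normalised.

```python
from typing import NamedTuple


class LangString(NamedTuple):
    lexical: str
    langtag: str


def same_literal(a: LangString, b: LangString) -> bool:
    """Compare literals without interpreting the language tag's structure."""
    return a.lexical == b.lexical and a.langtag.lower() == b.langtag.lower()


# The tag is opaque: "en" and "en-au" are simply different strings.
print(same_literal(LangString("foo", "en"), LangString("foo", "en-au")))     # False
# By the same rule, adding a direction subtag yields a different literal.
print(same_literal(LangString("foo", "ar"), LangString("foo", "ar-d-rtl")))  # False
```

The second comparison shows the flip side of the simplicity: two strings that differ only in the direction annotation stop being equal.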

<Zakim> manu, you wanted to note we have 4 options... :) and to note that this is also not just about RDF

<manu> option #1 -- Fix RDF by adding language direction to the language expression syntax -- "foo"@ar^rtl

<manu> option #2 -- Fix RDF by adding language direction to a new RDF string type "foo@ar^rtl"^^rdf:LocalizableString

<manu> option #3 -- Extend langtags in a private use way "foo"@ar-x-dir-rtl

<manu> option #4 -- Extend langtags as a BCP47 extension "foo"@ar-d-rtl

Manu: We would like option 1 to happen, but it won't be fast.
... Option 3 has two subtypes. We can make a private-use extension, or we can actually standardise the extension formally in BCP47

<Zakim> r12a, you wanted to note that the 3rd option is not in string-meta

RI: There is another possible option…
... the options described are not in the string-meta doc, which describes inferring the direction from language tags instead of stating it.
... in talking to Web of things, that was rejected by them as requiring a lot of processing, and misses some use cases

Mark: In Japanese that is a wild edge case, where you could require the language tag to have an explicit script, so you then have a lookup of 9 different strings.
... You cannot always get the correct direction without finer grained information about the text.

<manu> scribe: manu

mark: You'd need something wrt. string internals.

ivan: I have two questions

<ivan> { "@value":"sfsd", "@language":"ar", "@direction":"rtl"}

ivan: Going back to the options listed above, what has to be emphasized, in the case of JSON and JSON-LD, there is a way to make this more palatable. There is a syntax that works in pure JSON and JSON-LD... it can be used soon.
... In JSON-LD, we can define that as "this maps onto the localizable string datatype", so this is a solution that works w/ JSON and CBOR; we can make that work, even if we take option 1 or 2.
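A minimal sketch of the mapping ivan describes, assuming the value-object shape he pasted and the option 2 lexical form "text@lang^dir" with the proposed rdf:LocalizableString datatype. The function name is hypothetical, and string escaping is deliberately left out.

```python
def value_object_to_literal(obj: dict) -> str:
    """Render a JSON value object as a Turtle-like literal (no escaping)."""
    text = obj["@value"]
    lang = obj.get("@language")
    direction = obj.get("@direction")
    if direction and not lang:
        # Direction without a language is out of scope for this sketch.
        raise ValueError("@direction given without @language")
    if direction:
        return f'"{text}@{lang}^{direction}"^^rdf:LocalizableString'
    if lang:
        return f'"{text}"@{lang}'
    return f'"{text}"'


example = {"@value": "sfsd", "@language": "ar", "@direction": "rtl"}
print(value_object_to_literal(example))
```

Objects without a direction fall through to the ordinary language-tagged or plain literal, so existing data would serialise exactly as before.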

<dlongley> i note that we care about N-Quads too for digital signatures

ivan: The other question, what I don't understand, is what Richard just proposed... do you mean to say that we can get away w/o base direction altogether... what I learned in the past year, is that we really need base direction, now you're saying we don't, so I'm lost.

mark: Wait a second, you need base direction if you want to avoid certain artifacts... that doesn't mean you need base direction to be separate from the language tag.
... If I see ar-Arab, I know it's right-to-left.

<chaals> scribe: chaals

<manu> mark: While the parsing is complicated, parsing of script tag is trivial.

mark: so parsing lang tags is complicated, but parsing script becomes easy, there are 9 you need to match.
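The script lookup mark describes might look like the sketch below. The RTL_SCRIPTS set is an illustrative subset chosen for this example, not the list of 9 mentioned on the call, and both function names are hypothetical. The subtag scan relies on the BCP47 layout, where a 4-letter alphabetic subtag before any singleton is the script.

```python
from typing import Optional

# Illustrative subset of right-to-left scripts (lower-cased ISO 15924 codes).
RTL_SCRIPTS = frozenset({"arab", "hebr", "syrc", "thaa", "nkoo", "adlm"})


def script_subtag(tag: str) -> Optional[str]:
    """Return the script subtag of a language tag, or None if absent."""
    subtags = tag.split("-")
    if len(subtags[0]) == 1:
        return None  # private-use or grandfathered tag such as "x-..."
    for sub in subtags[1:]:
        if len(sub) == 1:
            return None  # a singleton starts an extension or private use
        if len(sub) == 4 and sub.isascii() and sub.isalpha():
            return sub.title()
    return None


def direction_from_tag(tag: str) -> Optional[str]:
    """Return 'rtl' or 'ltr' from the script subtag, or None if unknown."""
    script = script_subtag(tag)
    if script is None:
        return None
    return "rtl" if script.lower() in RTL_SCRIPTS else "ltr"


print(direction_from_tag("ar-Arab"))  # rtl
print(direction_from_tag("az-Latn"))  # ltr
print(direction_from_tag("ar"))       # None: no script subtag present
```

Returning None for a bare "ar" reflects the limitation raised in the discussion: without an explicit script, a consumer needs a further lookup from language to default script.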

<manu> ivan: I don't know if we're discussing this whole thing, then...

RI: In your situation it is easier to deal with. There is more involved, especially if we start messing with BCP47.
... If it gets used in HTML, life gets complicated.

<manu> richard: Your thing, it's easier, that's why we have it in string-meta... there is more involved in all of this ... especially if you start putting stuff in BCP 47, and so on and so forth, if we start using this in HTML, there is a whole raft of things.

<manu> mark: I think the question on the table is, why don't we just use the language tag and put script in where it's important.

RI: that would be fine because script parsing is trivial.

<manu> YES! EXACTLY THE RIGHT QUESTION, IVAN! :)

IH: So why does HTML need "dir" attribute

mark: so you can set these things formally, instead of falling back on the defaults

<Ralph> String-Meta 4.5 Script subtags

<manu> but, why!? what's the use case?

mark: we would prefer to have the metadata, but for *most* cases, the script is enough.

AP: If you have the explicit metadata we can map the data around more clearly.

<Zakim> manu, you wanted to note that this isn't just about RDF

Manu: This is not just RDF. In JSON-LD we have been trying to make a linked-data format that web developers who don't do RDF can and will actually use, so it has to be simple enough for that.
... This needs to be something that will get adopted in syntaxes that don't have the information now. It's a tall order but think there are achievable ways to do it.

<r12a> 5 options

Manu: so are we meeting all our requirements? Ability to express information, be canonicalisable across different languages/syntaxes to allow signing, can get adoption in JSON/CBOR/RDF/…

mark: Think there would be resistance to the proposed extension. Me for example.

Manu: So we need to understand why and see if the mitigations are acceptable.

Mark: Looks like you're trying to graft a piece of information that suits a small use case into a larger space. Occupying an extension point for this seems messy.

<dlongley> will type to save time: as a non expert in this space, i find mark's argument (and others have said the same) that you can derive the direction information from the language tag is a strong indicator that either there is no problem -- or that there are *SOME* language tags that are missing that information and they need to be amended *SOMEHOW* (which is why I said, hey, let's add this -d- extension to solve that problem)

Manu: Do you oppose the -x-dir- approach?

Mark: "no" [hard to hear]

Manu: In deployment it would work like -d

<Ralph> Mark: x- is available to anyone

Manu: would like to hear the pros and cons of the proposal

Mark: You can derive the base direction already, why complicate language tags for that?

AP: Timecheck.

<manu> I think we'll need another call :)

AP: look at language tag extension as a way to fix the serialisation problem.
... We need to transport direction metadata across formats.
... Private use approach to me was a hack to smuggle in a separator. I expected indoctrinated RDF processors would always snap off the -x- piece that is always at the end.

<manu> chaals: My concern with the extension of the language tag is, if we start putting that out and people start using it, then it'll start to get copied around.

<mark> the "no base language tag" can be handled with, eg, und-Arab

<manu> chaals: Because you can already use language tags in HTML and various formats, if we think it's going to stay in RDF, I think we're mistaken.

<manu> manu: Yes, exactly why we're proposing -d- :)

<manu> chaals: Getting HTML systems to parse that out would be a pain.

<manu> chaals: Getting other systems where we haven't envisaged this would be a pain... HTML would be first up, annoyed because they have to do the work to solve our problem... that's a harder sell than it seems, not sure how big the scope is.

<manu> chaals: There are going to be a bunch of parsing things - we need to retrofit the new microsyntax in, while we save RDF the pain of fixing itself, we introduce a problem into different spaces. The assumption of base direction, if you add the script, you can normally get the data that is useful given that we're talking about a small set of cases, right now, there is nothing doing at all... it would be a net improvement. At present, we have zero ability to do this,

<manu> it seems like a reasonable first step that doesn't break anything else.

<dlongley> HTML4 said that you may not use "lang" to figure out the direction, not sure about HTML5, so it would be ignored until anyone decided they wanted to change it

<manu> chaals: People typically don't put the script in, getting people to adopt that, in the specific RDF case, the fact that language direction data suddenly means that strings aren't equal, feels problematic.

<manu> chaals: I'm not sure how painful that is and where it fits.

<scribe> scribe: chaals

IH: Trying to understand RI's proposal.
... In HTML dir is necessary in a complex text with mixed directions in it. You can nest elements.
... but that is not the case, we are talking about a single string.
... where we don't have a specification of the internal string.
... and we want to define a base direction. The way I understand the comments, this extra attribute is not necessary, it is a convenience.

<pchampin> we can still document the good practice of specifying the script explicitly

IH: so we can drop it.

<r12a> https://w3c.github.io/string-meta/#script_subtag

IH: As I understand it, if you put in e.g. lang="ar" you know the base direction.

AP: Follow-up call?

[Yes please]

IH: Yes please.

AP: same bat-time?

[works for me]

<Ralph> [I heard Mark assert that language+script is always sufficient]

<manu> great, thanks addison ! :)

AP: Will set up another meeting, send invite to everyone here, can be forwarded, and link it in the github issue
... Thanks all....

[audio closed]

<dlongley> i wanted to respond to mark but he was on the phone only ...

<dlongley> i was going to say i was the one who originally proposed -d- ....

<ivan> but we do not have internal structure for our cases

<dlongley> precisely because i had heard that directional information could be determined from MOST language tags but NOT all

<dlongley> so, therefore, for those language tags that didn't include that information (for whatever reason) ... could be amended in some way to fix the problem

ivan, HTML uses it for that but also for the example of a string without nested elements, where the direction of the initial script doesn't match the overall direction of the content. I.e. the case you are describing.

<dlongley> and ... there's a spec that says "here's how you define an extension" ... so that's where we went with that idea.

<dlongley> define an extension to provide that information that is missing for certain language tags.

<scribe> scribe: nobody

<r12a-too> uk

<r12a> https://w3c.github.io/string-meta/#script_subtag

[Summary of my thoughts on the options: Fixing it properly is easy to describe in terms of a pathway to achieve, but some pieces like updating RDF core, and deployed RDF infrastructure are hard and will take significant time. Adding a datatype is relatively easy but getting adoption seems unlikely in general, and even more so for e.g. plain JSON, because there isn't *enough* pressure to do this since the overall need is pretty low. Extending language tags will lead to the extended tags being used in other formats, especially ones that do rendering, meaning they will need to be retrofitted which will be at least as hard as fixing RDF, it complicates things that already have some problems, and it is hard to be sure what the scope is…

scribe: Relying on language+script to detect direction will work for a lot of use cases (more than we currently deal with since we have no solution), but doesn't appear intuitive in the case where there are mixed scripts so may lead to some problems in adoption, and it *seems to me* there are cases it doesn't cover (although I am not certain of that yet)…
... Conclusion: My current thinking is that we should actually suggest to use lang+script as a quick'n'dirty interim approach that won't close upgrade pathways, and start the work of fixing RDF itself (which IMHO should happen one day, and probably won't get finished sooner if we start later).]

<addison> language tags don't help in cases where the language is indeterminate. there are many strings on the Web where the best we can say is that the language is undetermined. In some cases, we have a base direction or base direction estimate

<addison> I think I will write this somewhere non-transient

<r12a> conclusion of continuation discussion:

<r12a> look more closely at the possibility of following the approach at https://w3c.github.io/string-meta/#script_subtag, which may require no change to RDF

<r12a> to evaluate its potential

<r12a> note, however, that there is an effect on producers and consumers

<r12a> involve Addison in those discussions

<r12a> potential issues:

<r12a> 1. producers need to remember to label things like MAC addresses, ISBN numbers, etc, even though they apparently don't have a language

<r12a> 2. some scripts have alternate directions, eg. egyptian - however, this is probably NOT a real issue, since that kind of thing is likely to be style-related rather than native to the string

<r12a> 3. for this approach to work, we'd need to NOT have a default dir setting for a resource, because otherwise you wouldn't be able to override it for specific strings

<r12a> 4. producers need to label strings adequately to indicate the expected direction for cases such as azeri, which can be written using cyrl, latn, or arab

<r12a> 5. consumers need to know that they are expected to use detection algorithms based on language tags

<r12a> also to clarify, what is the cost to consumers of detecting the direction from language (vague memory that this was an issue for WoT)

<r12a> Ivan, Manishearth, others on the call: is this ok for a summary ?

<ivan> r12a, in a JSON-LD environment, MAC addresses, ISBN-s, etc, are not necessarily relevant because they may not be set to have a language altogether, they are just pure, clean strings

<ivan> So 2-5 is relevant for strings that are really considered to be natural language strings, ie, which may have a language set

<ivan> but otherwise it reads o.k. to me

<r12a> exactly, but they may need to have direction ! that's my point - you could use zxx-Latn or some such though

<ivan> as for chaals' conclusion above: I do not see lang+script is a hack, it is a bona-fide usage of BCP47 tags

<ivan> ie, RDF does not have to do anything, neither now nor later...

<ivan> r12a: an ISBN does not have a direction, it is just a number, as far as RDF/JSON-LD is concerned. Just as in a programming language a number and a string are different, the same holds here

<ivan> but I guess we are getting to details

<r12a> incorrect. See the examples in string-meta

<r12a> https://w3c.github.io/string-meta/#neutral-ltr-text

<r12a> that's a telephone number - these are notorious for arabic/hebrew users, and they do all sorts of nasty things to cope with them if the html isn't set properly

<ivan> Ok. MAC/ISBN may be a borderline. In turtle parlance, I can say [] <isbn> "12334"^^integer

<ivan> in which case the direction and language is irrelevant

<r12a> MAC and ISBN sequences are very dangerous, and they contain alpha characters, and it may be completely transparent to a user that they are displayed incorrectly

<r12a> they are also hyphen-separated, which introduces many of the problems

<ivan> I can also say [] <isbn> "1234-345" which, for RDF, means it is 'just' a string, nothing else, no language information stored whatever

<r12a> yes, ivan, but the consumer needs to ensure that this is treated as an embedded ltr string in an overall rtl paragraph, otherwise you'll see

<ivan> If I say [] <isbn> "1234-345"@en . then I have a 'langString' in RDF parlance which is different

<r12a> 345-1234

<ivan> Well, if the consumer _knows_ that it is an ISBN, because that is what the semantics tells you, then a proper consumer will not make that mistake!

<r12a> and as long as it knows all the types of special string there are, or will be...

<ivan> well, in RDF, that information is usually part of the model...

<r12a> but there are other strings, such as 10-12 (a range) which may not be ltr strings in Arabic, but are in Hebrew/Persian

<r12a> it's complicated !

<r12a> it may be good to just think through this a little - it may involve language tagging or it may involve use of ALM (https://r12a.github.io/scripts/arabic/block#char061C) to ensure good results
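For the consumer-side handling discussed in this exchange, one possible approach is to wrap a string in Unicode directional isolates before inserting it into surrounding text. The sketch below is only an illustration under that assumption: the isolate function is hypothetical, and it does not cover the ALM case mentioned above.

```python
from typing import Optional

LRI = "\u2066"  # LEFT-TO-RIGHT ISOLATE
RLI = "\u2067"  # RIGHT-TO-LEFT ISOLATE
FSI = "\u2068"  # FIRST STRONG ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE


def isolate(text: str, direction: Optional[str] = None) -> str:
    """Wrap text in an isolate matching its base direction.

    With no known direction, fall back to first-strong detection (FSI).
    """
    openers = {"ltr": LRI, "rtl": RLI, None: FSI}
    if direction not in openers:
        raise ValueError(f"unknown direction: {direction!r}")
    return f"{openers[direction]}{text}{PDI}"


# An ISBN-like sequence marked as left-to-right before being placed
# inside a right-to-left paragraph.
wrapped = isolate("1234-345", "ltr")
print([hex(ord(ch)) for ch in (wrapped[0], wrapped[-1])])  # ['0x2066', '0x2069']
```

This only works when the consumer knows the intended direction, which is the metadata question the rest of the discussion is about.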

