W3C

– DRAFT –
RCH WG Teleconference

16 May 2023

Attendees

Present
dlehn, dlongley, gkellogg, ivan, manu, markus, pchampin, TallTed, yamdan
Regrets
-
Chair
-
Scribe
manu

Meeting minutes

<msporny__> markus_sabadello: any general updates?

<msporny__> No general updates.

<msporny__> markus_sabadello: We have 3 topics today, summary from special topic call, issue 89 -- supporting input data set modifications.

<msporny__> markus_sabadello: subset of people from the group met last week. Could we have a summary of what happened? Is the issue resolved?

<dlongley> A link to Gregg's summary in the issue: w3c/rdf-canon#89 (comment)

<gkellogg> w3c/rdf-canon#100

gkellogg: We talked about the ability to correlate blank node IDs on input and mapping them to the output dataset. Practically speaking, most implementations retain such identifiers, and we provided a way to support an N-Quads document as input, used as the basis of the input dataset, and from that take an initial identifier map to associate with it.

gkellogg: This describes a new structure, an input map, initialized from the N-Quads; for any blank nodes that don't have identifiers, we arbitrarily assign identifiers to those blank nodes. That way, the second step of the algorithm, which creates a blank-nodes-to-quads map, has the identifiers it specifically needs for those blank nodes. We were running over other IDs created; this now allows you to do that by adding language on initializing the dataset and assigning necessary blank node identifiers.

gkellogg: Specifically, using that as a separate output, it could be used, but at the last step we may need language that describes that the algorithm can be run to get a canonical N-Quads document, or a combination of the normalized dataset and the input blank node identifier map. The normalized dataset is in itself an abstract dataset and a map to canonical identifiers; we may want to fold language in and use the issuer in place of the other map. That's why it's still in draft form.
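The input-map initialization described above might be sketched roughly as follows. This is an illustrative reading of the discussion, not text from PR 100: `BlankNode`, `init_input_identifier_map`, and the `b`-prefixed labels are all hypothetical names.

```python
# Minimal sketch (not spec text): build an input blank node identifier
# map from parsed quads, reusing parsed labels where they exist and
# arbitrarily assigning identifiers to unlabeled blank nodes.

class BlankNode:
    def __init__(self, label=None):
        self.label = label  # label from the N-Quads document, if any

def init_input_identifier_map(quads):
    """Map each blank node to an identifier, reusing parsed labels."""
    identifier_map = {}
    counter = 0
    for quad in quads:
        for term in quad:
            if isinstance(term, BlankNode) and term not in identifier_map:
                if term.label is not None:
                    identifier_map[term] = term.label
                else:
                    # arbitrary fresh identifier; a real implementation
                    # must also avoid clashes with parsed labels
                    identifier_map[term] = f"b{counter}"
                    counter += 1
    return identifier_map

b1, b2 = BlankNode("e0"), BlankNode()  # one labeled, one not
quads = [(b1, "ex:knows", b2)]
print(init_input_identifier_map(quads))
```

The later steps of the algorithm can then key their blank-nodes-to-quads map off these identifiers regardless of whether the input arrived with labels.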

<dlongley> +1 to ivan to emphasize in the spec that practical implementations already have identifiers in their parsed datasets that are passed to the algorithm

ivan: I realized I wanted to put in a comment and didn't. It's one thing if we get N-Quads as input, use the IDs in the N-Quads, and the problem is solved; it would be worth emphasizing (in the spec) that, in practice, almost all environments that people use for implementing have blank node identifiers... all those implementations can just use the identifiers in their particular implementations.

<dlongley> +1 to make it clear we can get the canonical issuer identifier map of input blank node IDs => canonical blank node IDs as an additional output

ivan: Making this whole thing we're talking about, it's for "theoretical purity", in practice, I don't know of any implementations of RDF that don't use a blank node identifier in representing a blank node.

gkellogg: There is a note that says that "some implementations retain blank node identifiers".

ivan: It is mild for my taste; for most implementations it's not a matter of parsing, they get a dataset... the way runtimes work, they use blank node identifiers regardless of parsing.

dlongley: Just wanted to agree with Ivan; we might say "in implementations or environments that have blank node identifiers, those identifiers should be reused and not overwritten by the algorithm" -- that might be the most helpful thing to say.

dlongley: DB's implementation, we made it so we could have an additional output, we pass an abstract map... we already pass a dataset, and then we pass an additional map as an additional output. Some input I can offer -- use blank nodes if they exist, and populate/create this extra data structure.

<dlongley> +1 to lightweight touch

<ivan> +1 to gregg

gkellogg: One of my objectives was not to change algorithm numbers, trying to have a lightweight touch in doing this. I don't think it's necessary for us to update the algorithm heavily... you're talking about the canonical identifier map?

gkellogg: That might be a component of your implementation -- strengthen the language, re-use blank nodes from input dataset when they're available and then not say too much about adding additional identifiers that might be necessary.

gkellogg: In terms of outputs, around step 6 or 7 of the canonicalization algorithm, it describes the normalized dataset and the input dataset blank node map.

dlongley: Agree on the lightweight touch. Point of clarification: it's the dataset that's passed in; the additional map is a reference to an empty map that the canonical issuer identifier populates with input IDs to output IDs... not suggesting we be prescriptive, just looking for language to note that it's an acceptable way to do an implementation since it's easy.

gkellogg: That map is effectively part of the normalized dataset. The problem with the canonical issuer is that it maps identifiers from the input dataset to canonical identifiers; in order to get from blank nodes to initial identifiers -- that's what I was noticing coming through here -- that is now handled, but it rapidly gets confusing. Maybe the normalized dataset is a map from blank nodes to canonical identifiers plus an initial map from initial blank nodes to identifiers, and if you want to pass in such a map, that would be a component of the normalized dataset.

<Zakim> dlongley, you wanted to say the map we pass in is from IDs to IDs ...

dlongley: That map that we pass in IS a mapping of input identifiers to output identifiers -- not a mapping of abstract blank nodes to IDs. We're just passing in a reference to it, built and ... slightly different from what you described; want to make sure it's an acceptable implementation choice, the simplest thing for our implementation.
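A rough sketch of the implementation choice being described: the caller passes a reference to an empty map, and the algorithm populates it with input-ID-to-canonical-ID entries as a side output. The function name and the sorting shortcut are placeholders, not the real canonicalization ordering.

```python
# Illustrative sketch (not an actual implementation): expose the
# input ID -> canonical ID mapping by filling a caller-supplied dict.

def canonicalize(input_ids, issued_map=None):
    """Return canonical IDs; optionally expose the ID->ID map."""
    if issued_map is None:
        issued_map = {}
    # placeholder ordering; the real algorithm orders by hashing steps
    for n, input_id in enumerate(sorted(input_ids)):
        issued_map[input_id] = f"c14n{n}"
    return [issued_map[i] for i in input_ids]

mapping = {}                      # extra output, passed by reference
canonicalize(["e2", "e0"], mapping)
print(mapping)                    # {'e0': 'c14n0', 'e2': 'c14n1'}
```

The point under discussion is only that filling such a map should be an acceptable implementation strategy, since practical implementations already hold identifiers for their blank nodes.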

gkellogg: Formally, we can't talk about identifiers in the input dataset; we can talk about something that's associated with it. It's not needed as a separate data structure; I see it as an implementation detail. I do think we need to be careful to distinguish identifiers from things that might be associated with them. Formally speaking, they aren't.

dlongley: That makes sense; we might need more fuzzy notes around implementation details ... "Implementations might concretely deal w/ blank node identifiers more directly... algorithm implementations shouldn't mess with those, but should facilitate their usage" -- broad enough to cover the translation from theoretical to practical.

gkellogg: maybe this is a way to express it... please push something into PR or use suggestion.

<dlongley> +1 to Gregg

ivan: The input mechanism might be less important, we might want to be explicit, the output returns the map and the canonical issuer, whether it's by reference or copy is implementation detail; that's the important point.

gkellogg: Yes, we might need to bulk up text around step 7... serialized form or...

ivan: returning normalized data set and application may not necessarily want to see ... yes, important to put it in step 7.

<markus_sabadello> Draft PR 100 to address issue 89: w3c/rdf-canon#100

ivan: I've added a comment to that effect there.

markus_sabadello: I think I hear agreement and +1s in the chat, PR100 is a draft PR that's going to address this with a few more comments/tweaks.

markus_sabadello: Anything else on this issue?

Nothing else.

markus_sabadello: We can move on to the other PR.

w3c/rdf-canon#99

markus_sabadello: This is about privacy considerations, issue 84... what's the current state of this? Some comments and suggestions that have been applied.

markus_sabadello: Has anyone else looked at this? Thought this was mostly good, any comments/thoughts?

yamdan: I didn't have time to see the whole comment, but still agree that we explicitly need additional statement/sentences to say that all of our ad-hoc proposals which have been described by myself... HMAC, salted hash, these kinds of things, are not officially proposed by this WG or this specification.

yamdan: We must make this explicit additional statement, which is and is not a part of this WG's work.

ivan: My worry goes a bit beyond that... the way I could read the text that's in the PR, and I'm sure some privacy reviews will read it, is that there are potential privacy problems and these privacy problems can't be solved w/ the algorithm as defined in this WG... that's how I read it.

ivan: It describes the problem, and notes that if we do different things, if we use a different hash function, or use a nonce, or something else, then we can solve the privacy problem. What I'm worried about is that privacy reviewers will say that we have a problem (which we don't); we need to avoid that danger.

dlongley: I wonder if talking about parameterizing the hash function covers us there -- you're thinking it doesn't unless we go into details about how to parametrize it, there are additional problems w/ taking that approach... default behaviour, creates new privacy problems and protecting nonce and externalizing mapping things... using nonce, HMAC, nonce is secret, use cases w/ Verifier that run algorithm, they won't have nonce, can't follow steps.

dlongley: There are also differences between running it inside the algorithm and outside the algorithm -- two different things; don't know how we can address that and not run afoul of our charter... different ways to do selective disclosure.

TallTed: I don't think Ivan's concerns are unwarranted, but might be more than is justified/needed, Privacy Considerations are just that... we can't guarantee Privacy, many other specs have gone through W3C that expose far more than we end up exposing w/ this spec.

TallTed: Saying things like "use a nonce to lower visibility" should be sufficient to what we're doing.

<Zakim> gkellogg, you wanted to suggest that these issues only come in for selective disclosure.

gkellogg: Privacy considerations are just to be considered, and for the base use case of creating a canonical form of a dataset, I don't think that introduces the considerations that are highlighted here; those considerations are there when altering the result and expecting that you're not revealing information. That should be considered as the motivation: we should enable specs to be based on this and use things like HMACs and alternative hash ordering in order to address that... the privacy considerations are the same as for RDF Datasets in general, which can be used to store PII; we're only providing a canonical form for that.

gkellogg: That's something we need to enable and leave for other specifications to handle.

ivan: Just to make it clear, I'm typically pessimistic :) -- Just raising some red flags and we have to be careful about how we formulate things.

ivan: We've had privacy review put in normative statements on privacy-related issues and going beyond suggestions... we survived that in epub, just saying push might come back... we'll see what Dan will come up with (but we have to be critical).

ivan: I'd welcome TallTed's input to help us revise the privacy considerations section.

markus_sabadello: To me it seems like the question is to what extent certain privacy considerations are in scope for us. To what extent are there privacy considerations inherent to canonicalization and hashing, and what privacy considerations exist once you start using that for VCs and selective disclosure... that's what it sounds like to me... the more considerations we cover, being aware of the warnings from Ivan, the more we can cover this and anticipate certain things that could happen when using this work.

markus_sabadello: The more we cover it the better. It seems like maybe there are not a lot of privacy considerations for just canonicalization, but once you use it in certain ways, it becomes important.

dlongley: The main thing for us to highlight: there are two sub-parts -- an algorithm that produces blank node IDs and a normalized dataset, and an algorithm that serializes that in a particular order. Both the identifiers generated and the order are entirely dependent on the input data -- we are creating a canonical representation of the input data, and the identifiers generated are dependent on that data. If you want to decouple those things, then you should be able to do that and decouple them... something you could do.

<TallTed> +1

<gkellogg> +1

dlongley: Even if we offer places to parameterize, that could help.

<ivan_> +1 to manu

<markus_sabadello> manu: Agree with dlongley .. I wonder if there is a way for us to put content into Privacy Considerations that would go into the Data Integrity spec. I.e., deal with privacy considerations there, where it's an actual problem. This spec here may be too low-level for certain topics such as HMACs and parameterization for privacy considerations.

<markus_sabadello> manu: Going up the spec stack might help us here

<markus_sabadello> manu: We could say this spec doesn't deal with that, but that other spec in the other WG deals with it.

ivan: That's probably the best thing... that we want to use BBS... it's a problem of BBS and selective disclosure, not of the canonicalization specification.

ivan: Maybe we shouldn't do what we're doing here.

<Zakim> dlongley, you wanted to say we still need to walk a line where it's clear a spec like BBS can use the spec we produce without issue

<ivan_> qq

dlongley: Maybe offer a good reason: the reason we want to talk about this is to make it abundantly clear that specs like BBS can build on top of this spec. There is no expectation that running canonicalization has to be the last step in a transformation; you can do other things afterwards. We provide mappings, and parameterization of the hash function -- make sure that's clear and doesn't cause trouble.

<dlongley> +1 to an appendix

ivan: Simply, we should not put references to selective disclosure in the privacy section, simple as that; we can have an appendix on using this for x and y purposes, or a general introduction. It should not be a part of the privacy section.

<dlongley> +1 to address the goals of "this spec can be used for other things" in an appendix.

<markus_sabadello> manu: I was going to argue that we can mention something lightweight in the privacy section. I don't feel strongly about putting it into a privacy section or appendix, but I would like to point outwards to other specs that deal more with it.

<markus_sabadello> manu: We will talk more about hashing functions, parameterization, etc. in the higher-level Data Integrity spec. We should point to it from our spec here.

<markus_sabadello> manu: There should be an easy link to click on to read more about it.

<ivan_> +1 to what manu said

<dlongley> +1 to keep Dan's analysis and find "the best" place to put it

<ivan_> dlongley, the best place might be to take his text and put it into the DI spec...

gkellogg: I do think it belongs as considerations, but the caveat is that it's for other uses of downstream results of canonicalization; this is a foundational spec in that regard. I'm reluctant to have pointers forward to other things... some we know now, but there may be many others in the future that we haven't considered yet... that sort of information might be more difficult -- we might want to be more guarded in suggesting ways in which other specs might use features (HMAC, parameterization, remapping) -- that is the province of other specifications.

yamdan: Some of these concerns are far away from the c14n spec... they're more to do with BBS and selective disclosure than they are about canonicalization... the Data Integrity spec might be the most appropriate place. What do we need to have as essential things in the specification as a kind of minimum extension point? What is an extension point for downstream use in selective disclosure schemes? We just need to prepare a minimum extension point, and all other extensions should describe what we're talking about here.

<dlongley> having our privacy section say "the outputs of the algorithm depend on the inputs, so you might want to make them independent thereafter for privacy purposes ... and we provide some tools (input ID => output ID mappings, hash parameterization, and separation of serialization steps and so on) to enable that."

<markus_sabadello> manu: Agree with yamdan . I don't think this is the "end" of what yamdan created, but it might land in a different place. It could land in Data Integrity for now.

<dlongley> +1 to land in DI so many cryptosuites can use it generally.

<markus_sabadello> manu: If we feel that it needs to be more specific e.g. to BBS, then it could later be moved there.

<markus_sabadello> manu: I'd be in favor of shifting this PR into Data Integrity and see if it fits.

<markus_sabadello> manu: If it doesn't fit there, we can push it further to BBS

markus_sabadello: Ok, what's the concrete next step for the spec here? Maybe we have a reduced section on privacy considerations that points to Data Integrity? We still need a little bit here.

ivan: We have to show we've discussed these things.

ivan: We can point to the higher level specs.

<markus_sabadello> manu: Suggestion is that yamdan re-opens the current PR in the Data Integrity section

<markus_sabadello> manu: And then a new PR needs to be opened in RCH spec that mostly talks about privacy effects that come from using this spec. Something like, see the Data Integrity spec for more information.

ivan: Additionally to what Manu said, we know that sorting of quads when we do c14n, creates a way to connect to privacy related issues, we know that c14n may lead to some inherent danger due to application area... any algorithm that has privacy concern has to take some measures, then we can point to stuff in the Data Integrity spec... some of this is relevant for this specification as general statements.

dlongley: I think we can more directly address the general problem: outputs from the algorithms depend on the inputs; if you don't want those to be coupled, then you will want to make them as independent as you can, and we've provided tooling (parameterization, HMAC, remapping, separate serialization)...
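One of the decoupling tools mentioned here, remapping the output identifiers through an HMAC with a secret key, might look like this in outline. This is an illustration of the idea, not something defined by the spec; the key handling and the `u` prefix are invented for the example.

```python
# Hedged sketch: derive replacement blank node identifiers from the
# canonical ones via HMAC-SHA-256 with a per-document secret key, so
# the published identifiers no longer reveal the canonical ordering.
import hmac
import hashlib

def hmac_remap(canonical_ids, key):
    """Map each canonical ID to a keyed, truncated HMAC-based ID."""
    return {
        cid: "u" + hmac.new(key, cid.encode(), hashlib.sha256).hexdigest()[:16]
        for cid in canonical_ids
    }

remapped = hmac_remap(["c14n0", "c14n1"], key=b"secret-per-document-key")
print(remapped)
```

As noted in the discussion, a scheme like this raises its own questions (who holds the key, whether a verifier re-running the algorithm can reproduce it), which is part of why the group leans toward handling it in downstream specs.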

ivan: Do we parameterize the hashing function?

dlongley: We say that you can change things... BUT we are specific about the hash for URDNA2015.

ivan: This is a little fuzzy... we need to be careful here.

gkellogg: We are going to have to specify how spec can be derived... for example, derivations could use a different hash.
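The hash parameterization point can be illustrated as follows: URDNA2015 fixes SHA-256, but a derived spec could substitute another digest. `hash_nquads` is a hypothetical helper, not the spec's actual hashing step.

```python
# Sketch of hash parameterization: the same hashing helper accepts a
# digest constructor, defaulting to SHA-256 as URDNA2015 specifies.
import hashlib

def hash_nquads(nquads, digest=hashlib.sha256):
    """Hash a serialized N-Quads string with the given digest."""
    return digest(nquads.encode("utf-8")).hexdigest()

doc = '_:c14n0 <http://example.com/p> "v" .\n'
print(hash_nquads(doc))                  # default SHA-256
print(hash_nquads(doc, hashlib.sha384))  # a parameterized variant
```

A derivation that changed the digest would produce different (and incompatible) canonical output, which is why the group expects such variations to be specified explicitly by downstream specs.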

<dlongley> +1 to Markus

markus_sabadello: Ok, conclusion is move current PR to Data Integrity, work on new PR that is more limited for rdf-canon, but could point out general considerations if canonicalization work is used, but for details, mostly points to other specifications.

<gkellogg> +1

<yamdan> +1

<ivan_> +1

markus_sabadello: I'll try to summarize that in the issue and we'll go ahead.

manu: +1

markus_sabadello: That's it for today, we'll triage later.

Minutes manually created (not a transcript), formatted by scribe.perl version 210 (Wed Jan 11 19:21:32 2023 UTC).

Diagnostics

Succeeded: s/SD/selective disclosure/

Succeeded: s/liek/like/

No scribenick or scribe found. Guessed: manu

Maybe present: markus_sabadello

All speakers: dlongley, gkellogg, ivan, manu, markus_sabadello, TallTed, yamdan

Active on IRC: dlehn, dlongley, gkellogg, ivan, ivan_, manu, markus_sabadello, msporny__, pchampin, TallTed, yamdan