RCH bi-weekly meeting – 27 September 2023

Meeting minutes

Hash parameterization

<markus_sabadello> https://www.w3.org/2023/09/11-rch-minutes.html#r01

markus_sabadello: During TPAC we resolved that implementations must support a parameter to define which hash function is used
… everyone seemed happy
… but Seabass raised an issue (via email) pointing out that some aspects may not be sufficiently covered
… concerns around interop and security as choice of hash function can be controlled by a param. Comments from PA, Dave Longley and Ivan
… Questions: what do we do with that now?

gkellogg: When we're discussing hashing, we're expecting the same function is used internally as well as for the result. But we don't say that.
… Not sure why we expect that they would be the same
… What is the real purpose of being able to change the hash algorithm within the algorithm, since nothing is exposed and the algorithm has features to avoid collisions?
… What is the need to parameterize the *internal* hash function?

gkellogg: It's outside the text to print out the internal hashes used

dlongley: I agree with what Gregg just mentioned. Specifying how to express the hash info we've decided is outside the scope of our spec.
… There are a number of external meta methods for expressing hash methods. Those are responsible for talking about which internal steps may be needed. I don't think it's our responsibility to create a new metadata field
… Multihash exists, for example
… There's a ?? spec that does something similar
… There's an RFC for naming things with hashes
… There are IANA registries for this sort of thing.
… Good to say non-normatively in our spec: we've said there's a default hash for the internal piece. Could say "Don't change this unless you have a good reason to", and maybe document it.
… To answer Gregg's question - the fact that we call out and use a hash in the algo. Someone may say "you need to use hash function X as a regulation" so I don't think we need to change what we've done, but we need to be able to say how it can be done.
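
As an illustration of the existing mechanisms mentioned above, here is a minimal sketch of a "named information" style identifier, assuming the RFC referred to is RFC 6920 and that the labels come from the IANA Named Information hash registry. The function name `ni_uri` and the `REGISTRY_LABELS` table are hypothetical and are not part of any spec discussed in this meeting.

```python
import base64
import hashlib

# Assumption: maps Python hashlib names to labels from the IANA
# Named Information hash registry (only two entries shown).
REGISTRY_LABELS = {"sha256": "sha-256", "sha384": "sha-384"}


def ni_uri(data: bytes, hash_name: str = "sha256") -> str:
    """Build an RFC 6920 style URI that names the hash function and the digest."""
    label = REGISTRY_LABELS[hash_name]
    digest = hashlib.new(hash_name, data).digest()
    # RFC 6920 uses base64url encoding with the trailing padding removed.
    encoded = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    return f"ni:///{label};{encoded}"


print(ni_uri(b"Hello World!"))
```

The point of the sketch is only that the hash function name travels with the digest, so a recipient does not have to guess which function was used.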

<Zakim> manu, you wanted to provide some proposals -- can we keep this "implementation defined" but c14n "has to return hash identifier"

manu: What we're saying ... why are we even considering this. We have had conversations with some individuals who would object if we didn't allow this kind of flexibility.
… whether we agree or not, putting some text in the spec mitigates that risk
… The concrete thing that we could do is to say in the algo, when you canonicalize, right now we output the quads, we could also output the internal hashing algo that was used and we can define maybe 2 function names used (referring to SRI spec).

manu: I think what Ivan wants to do goes a little too far.
… Problem is that there are 2 things we're trying to express. I don't think that what Ivan is proposing expresses the internal hash function used, and that's what I think seabass is concerned about
… So maybe we can define that as one of the output pieces

<dlongley> -1 to returning a value, it's unnecessary, it's an input

<dlongley> +1 to you can encode it however you want, that's not our spec's job

manu: You can encode that however you want

<gkellogg> +1 to what dlongley said, it makes it more complicated and is an invariant from the caller's context.

manu: Concrete proposal is to allow the hash function to be changed by providing input and you get that same value as part of the output. Implementation specific how that's done

gkellogg: I think Dave and I have similar thinking. If the caller is providing the hash function to use, I don't then need it to tell me what hash function I used

<manu> That's true, gkellogg -- I retract my concrete proposal to provide the internal hash function as an output.

gkellogg: There might be regulatory reason for disallowing use of specific algos. We could use MD5 internally, it really doesn't matter, but if you think it does then, OK.

<dlongley> +1 to gregg's comments generally

gkellogg: We already have two things you can get. The blank node map or the C14N representation. We're talking about adding a third thing

<manu> I'd be fine w/ explanatory text... saying that how to serialize the hash is implementation specific.

markus_sabadello: Since the param is in the input it doesn't seem necessary to have it as an output as well. I think seabass is concerned with not knowing what to do if you just have the hash. I don't think it's our job to define a new metadata mechanism.
… Some extra text could say that the hash function used in the input is going to be important for uses of the output so it should be preserved or clear from the context or whatever.
… Some sort of guidance seems worth adding.

<Zakim> dlongley, you wanted to say just having a hash is always insufficient

dlongley: You're never going to be able to regenerate a hash if you don't know all the inputs

dlongley: That's true in any system of course.

<TallTed> maybe "The hash function that was used SHOULD be available as an output, e.g., with a +debug flag."?

dlongley: I don't think there's anything normative we need to add. But some informative text could highlight the need for any function to have all its inputs
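
A minimal sketch of the point about needing all the inputs: the same bytes hashed with two different functions give unrelated digests, so a verifier who only holds a digest cannot reproduce it without also knowing which function was used. The helper names `digest` and `matches` and the sample data are hypothetical and only illustrate the argument.

```python
import hashlib


def digest(data: bytes, hash_name: str) -> str:
    """Hash the data with the named function and return a hex digest."""
    return hashlib.new(hash_name, data).hexdigest()


def matches(data: bytes, expected_hex: str, hash_name: str) -> bool:
    """Recompute the digest with the given function and compare."""
    return digest(data, hash_name) == expected_hex


# Illustrative stand-in for some canonical N-Quads output.
nquads = b'_:c14n0 <http://example.org/p> "o" .\n'

# The producer used a non-default function (SHA-384 here).
recorded = digest(nquads, "sha384")

# A verifier guessing the default fails; one told the function succeeds.
print(matches(nquads, recorded, "sha256"))
print(matches(nquads, recorded, "sha384"))
```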

dlehn: It seems like a communication issue for how you name what you're doing.

<manu> TallTed, no, we don't need that, because you know which hash function is used when you called the function... and this notion that you have only a hash is misguided, that is always insufficient.

<dlongley> -1 to invent new names for every possible hash function in our spec

<manu> ^ yes, to that.

dlehn: I made a comment in the original PR, when you're naming... it seems like there's a discussion here about the hash you use in the output on the canonicalized quads and seems beyond the output of the spec.

dlehn: Not sure I'm really understanding the problem.

markus_sabadello: I think you are.

<dlongley> -1 to invent just N-many names in our spec that include hashes in the names as that is too restrictive, but +1 to have non-normative text that says meta data will need to identify input parameters to enable reproduction

dlehn: It's about how to communicate that between systems.

markus_sabadello: Yes, and seabass seems to think the way to do that is to only allow one algorithm to be used. But the TAG review said that for future proofing, we need a way to parameterize it.

gkellogg: If the principal output is an n-quads doc that is in canonical form with blank node IDs in canonical order
… If they are taken out of the context where the original function was called, there's no way to add comments to an n-quads doc, for example. I don't think we want a structure to include comments etc.

<dlehn> my earlier comment was here: w3c/rdf-canon#161 (comment) was wondering if the alg naming needed to include the hash name.

gkellogg: A dataset using RDC and using a non-default hashing function must not allow the results to be used in a way that is separated from the original function.

<Zakim> yamdan, you wanted to mention IANA registry for hash alg identifier https://www.iana.org/assignments/named-information/named-information.xhtml

<dlongley> i don't think there is any normative language we can put here that's reasonably testable, it's just strongly worded advice we can do.

<gkellogg> +1 to what dlongley said (again)

yamdan: I originally thought this is just a naming problem. I think we don't need to invent a new ID for each hash algorithm - we already have a registry
… we can just pick up an ID from this registry
… and can just combine it with our name. Like RDFC1.0-SHA256 etc.

yamdan: We always mention H-mac SHA 256 etc.

yamdan: But I may have missed seabass's original intent.

<Zakim> manu, you wanted to follow on w/ what gkellogg was saying wrt. spec guidance.

markus_sabadello: Everyone agrees we don't want to invent new hash IDs. Concatenating RDFC1.0 with the hash name is an interesting idea.

manu: -1 to that. This feels like a slippery slope
… In the data integrity specs, we have tried very hard to stay away from parameterization. We keep it simple. In the algo we say you must call RDFC with *this* hash function. I don't think this is a big deal
… When you get the result back, you know the hash function used because you provided it. What you do is important but it's outside the spec.
… This is not a problem in the data integrity specs.
… So I agree with what Gregg was saying. It feels external to the spec.
… So providing some guidance that it's right to convey the internal hash used.

manu: Rattles off list of hash functions

manu: It's up to implementations to convey what they've done

markus_sabadello: That seems in line with the idea that the param is important and is preserved in other layers of the application.

<dlongley> IMO, a summary:

<dlongley> 1. It would be simpler to not parameterize the hash algorithm.

<dlongley> 2. However, we can't do that without creating problems for people who need to comply with regulations and for future proofing.

<dlongley> 3. It's not our job to define meta data expressions.

<dlongley> 4. There's no testable normative text we can create here, but more informative advice could be given to address concerns and we have lots of time to bikeshed that.

<gkellogg> +1

markus_sabadello: We should summarize this in the GH issue and ask for his help in adding some language

markus_sabadello: I don't think we can make further progress without him present

<dlongley> i think we could indicate that CR is ready to go

<dlongley> +1 to Phil's comments

phila: I think we could help today by taking a resolution that we could agree that informative text is needed, but no normative text needs to be changed. Therefore, our previous resolution stands.

markus_sabadello: Summarizes discussion so far for seabass

markus_sabadello: Presses seabass for an answer whether he's happy with the expected outcome

seabass: First impressions: seems dlongley and I spoke last week and went through some of the emails on the list. Having this extra metadata at least solves a couple of the issues.
… Avoids having to force-try every possible algo.
… So it seems like an improvement.
… If we're not going to limit it to one algorithm, how can we best ensure that people use SHA256 rather than using some other one?

<dlongley> +1 to non-normative text encouraging use of the default if possible and using as few hash algorithms as possible for interoperability purposes.

phila: We've got a default, if you don't say what to use, it will use SHA256. We can add informative text to say that you've got to hang onto the parameter if you do provide one and include that with whomever you communicate / share with. We can provide informative guidance that you need to make that information discoverable or available. We can work on informative text over the next few weeks without the pressure of timing and you can be a part of that and we can send the normative text to CR as it stands now.

seabass: ... Looking through the text...

<gkellogg> https://www.w3.org/TR/rdf-canon/#dfn-hash-algorithm

gkellogg: This hasn't changed since before TPAC.

seabass: I missed that first TPAC meeting

seabass: Can we make the default a recommendation, not a requirement?

manu: I think it harms things if we remove the default as the default. They'll use it unless there's a reason not to.

manu: We said earlier today that the expression of a hash on its own is not enough. We want to add non-normative text to say you can't do that.
… You should never look at a hash output and not know which hash function was used. We want to give that advice to prevent what you have highlighted
… If you see a hash and only a hash, you shouldn't presume you know which hash function was used

manu: What you suggested was to remove the default. I would be a strong -1 on that. Our test suite is built on that.
… Also we discussed what the output should be. There was agreement that since you call the function with the hash parameter there's no need to have it in the output.
… We want to provide strong guidance, albeit non-normative, if there is upstream software, they need to convey which hash function was used.

seabass: When I suggest removing the default, I mean make it mandatory that you say you used SHA256
… Does that mean implementations should say they use SHA256, or that they use something else?

gkellogg: I'm not quite following you, sorry.
… This is a normative requirement of people implementing the algorithm. They implement with the default, but provide a mechanism for using an alternative. But it's up to the implementation to make it clear what they used.

<dlongley> we can't say *how* external metadata will be expressed, but we can have our informative text say that it should always be clear what hash algorithm was used internally

<dlongley> (and a "default" here in our spec is orthogonal to that)

markus_sabadello: If an implementation uses SHA 256, do they have to use this parameter? No, because that's the default. So if you invoke without the param, SHA 256 will be used.
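
The default-plus-parameter behaviour described here can be sketched as follows. This is a hypothetical stand-in and NOT the RDFC-1.0 algorithm: `toy_canonical_digest` just sorts the lines and hashes them, and its name, signature, and `DEFAULT_HASH` constant are assumptions made purely to illustrate how an optional hash parameter with a SHA-256 default behaves for the caller.

```python
import hashlib

DEFAULT_HASH = "sha256"  # assumption: mirrors the default discussed above


def toy_canonical_digest(nquads: list[str], hash_name: str = DEFAULT_HASH) -> str:
    """Hypothetical stand-in, NOT RDFC-1.0: sort the lines, then hash them."""
    canonical = "".join(sorted(nquads))
    return hashlib.new(hash_name, canonical.encode("utf-8")).hexdigest()


quads = [
    '<http://example.org/s> <http://example.org/p> "b" .\n',
    '<http://example.org/s> <http://example.org/p> "a" .\n',
]

# Omitting the parameter and passing the default explicitly are equivalent.
default_digest = toy_canonical_digest(quads)
explicit_digest = toy_canonical_digest(quads, hash_name="sha256")

# A caller who overrides the default already knows which function was used,
# so it is an input to remember rather than an output to be returned.
alternative_digest = toy_canonical_digest(quads, hash_name="sha384")
```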

<dlongley> i think we're talking about informative text changes at this point

<gkellogg> Should be RDFC-1.0 to identify the algorithm, though.

markus_sabadello: What if we summarise this in the GH issue.

<dlongley> and we could run Phil's proposal to move onto that

draft proposal: While we continue to discuss Issue 176, there is consensus that there will not be a need for a change to the normative text discussed that the WG resolved to seek transition to CR recently

<gkellogg> +1

<manu> I'd +1 that above ^

<seabass> +1

Proposal: While we continue to discuss Issue 176, there is consensus that there will not be a need for a change to the normative text discussed that the WG resolved to seek transition to CR recently

<manu> +1

<dlongley> +1

<yamdan> +1

<seabass> +1

<dlehn> +1

RESOLUTION: While we continue to discuss Issue 176, there is consensus that there will not be a need for a change to the normative text discussed that the WG resolved to seek transition to CR recently
