This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.
FT semantics defines distance by ordering tokens by their starting positions. However, if tokens overlap and have different ending positions but the same starting positions, the results will be indeterminate.
Proposal:
(1) Change the semantics functions to order by startPos, endPos. This still leads to some non-determinism when the startPos and endPos are identical. However, for the purposes of distance calculations, this is irrelevant.
(2) In addition, and this relates to #4715 more than this bug, we could in section 2 say something like: "Tokens are ordered by their starting positions and, if necessary, their ending positions."
I will go ahead with (1) as we agree this is the right thing to do. Comments on (2)?
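To illustrate the effect of (1), here is a minimal sketch in Python rather than in the XQuery of the semantics functions. It is not the specification's code: the `TokenInfo` tuple, the helper names, the integer positions, and the sample words are all assumptions made up for this illustration. It shows only that sorting on the start position alone lets the result depend on the order in which overlapping tokens arrive, while sorting on (start, end) does not.

```python
from typing import NamedTuple


class TokenInfo(NamedTuple):
    # Hypothetical stand-in for a token with startPos/endPos.
    word: str
    start_pos: int
    end_pos: int


def order_by_start(tokens):
    # sorted() is stable, so tokens that tie on start_pos keep
    # whatever relative order they had in the input.
    return sorted(tokens, key=lambda t: t.start_pos)


def order_by_start_then_end(tokens):
    # Ties on start_pos are broken by end_pos, as proposed in (1).
    return sorted(tokens, key=lambda t: (t.start_pos, t.end_pos))


# Two overlapping tokens with the same start but different ends,
# followed by a third token. All values are invented.
a = TokenInfo("Dampf", 1, 1)
b = TokenInfo("Dampfschiff", 1, 2)
c = TokenInfo("fahrt", 3, 3)

first = [a, b, c]
second = [b, a, c]  # same tokens, delivered in a different order

# Ordering by start only: the two inputs give different results.
start_only_differs = order_by_start(first) != order_by_start(second)

# Ordering by (start, end): both inputs give the same result.
start_end_agrees = (
    order_by_start_then_end(first) == order_by_start_then_end(second)
)
```

Tokens whose start and end positions are both identical would still tie under the second key, which matches the remaining non-determinism noted in (1).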
Thanks, Mary. Your change to (1) is OK with me, and I agree that it is basically what we agreed on the call. I'm OK with (2) as well. On (1), you said "This still leads to some non-determinism when the startPos and endPos are identical". True, that's irrelevant for distance calculations. Can you identify a place in the language or document where it is relevant?
It can come into play when we talk about phrases, I think -- a phrase being an ordered sequence of tokens. However, I believe that non-determinism doesn't matter in practice, because phrase matching is implementation-dependent anyhow, and we want to allow implementations to come to their own conclusions about matching overlapping tokens in such cases (cf. the Dampfschiffmumble example).
Bug was fixed as part of other work.