This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.
A.2 Lexical structure

[See a later comment for suggested alternate wording.]

Thanks for excising the state machine!

"and [XML Names]are"
Insert space before "are".

"When patterns are simple string matches, the strings are embedded directly into the EBNF. In other cases, named terminals are used."
Delete. It doesn't say anything that isn't already said better in the EBNF notation section. Plus it isn't connected to anything else in the section.

"that together may help disambiguate the individual symbols."
Ditto my comments re this sentence in A.1.

"When tokenizing, the longest possible match that is valid in the current context is preferred ."
Delete space before period.

What constitutes "the current context"? What constitutes "valid"? Longest match of what? Given that tokenization is up to the implementor, it seems that the effect of this sentence would vary between implementations, which is probably not what you want.

Luckily, I think this rule can be deleted. The rules about required whitespace (to prevent two adjacent terminals from being mis-recognized as one) should (if fixed) handle anything that the "longest possible match" would have.
(In reply to comment #0)
> A.2 Lexical structure
>
> [See a later comment for suggested alternate wording.]
>
> Thanks for excising the state machine!

You're welcome. You were right that it needed to be excised.

> "and [XML Names]are"
> Insert space before "are".

Done.

> "When patterns are simple string matches, the strings are embedded directly into
> the EBNF. In other cases, named terminals are used."
> Delete. It doesn't say anything that isn't already said better in the EBNF
> notation section. Plus it isn't connected to anything else in the section.

Deleted.

> "that together may help disambiguate the individual symbols."
> Ditto my comments re this sentence in A.1.

Fixed.

> "When tokenizing, the longest possible match that is valid in the current
> context is preferred ."
> Delete space before period.

Fixed.

> What constitutes "the current context"? What constitutes "valid"? Longest
> match of what? Given that tokenization is up to the implementor, it seems
> that the effect of this sentence would vary between implementations, which
> is probably not what you want.
>
> Luckily, I think this rule can be deleted. The rules about required
> whitespace (to prevent two adjacent terminals from being mis-recognized as
> one) should (if fixed) handle anything that the "longest possible match"
> would have.

I'm not inclined to delete this, at least at this time. I think the rule is clear enough, and longstanding.

This should be discussed at the Seattle F2F, especially in light of other changes or non-changes.
(In reply to comment #1)
> > Luckily, I think this rule can be deleted. The rules about required
> > whitespace (to prevent two adjacent terminals from being mis-recognized as
> > one) should (if fixed) handle anything that the "longest possible match"
> > would have.
>
> I'm not inclined to delete this, at least at this time. I think the rule is
> clear enough, and longstanding.
>
> This should be discussed at the Seattle F2F, especially in light of
> other changes or non-changes.

One simple example of where I think the longest token rule is still needed is that of ">" and ">>".
> One simple example of where I think the longest token rule is still needed is
> that of ">" and ">>".

Is there a context for which '>>' and '>' '>' are both valid continuations?
(In reply to comment #3)
> Is there a context for which '>>' and '>' '>' are both valid continuations?

Not if you don't consider non-legal sentences.

Another case is "descendant-or-self" vs. "descendant", which can occur in the same context. In my parser-oriented mind, you need to decide if descendant-or-self::foo has "descendant" followed by some other characters, vs. "descendant-or-self", thus you keep searching for the longest token that matches. On the other hand, you're saying, in terms of the spec, if keyword delimitation is clear, which I think it is, there's only one choice: "descendant-or-self", which is either legal or not. If "descendant-or-self::foo" could be interpreted as "descendant - or-self::foo" (i.e. a subtraction operation), then we would need a longest token rule perhaps.

In summary, after thinking about it more, I can't justify having the longest token rule, especially when the spec requires no specific tokenization spec. So, I'm leaning on the side that this rule should be deleted. I'm interested if any other WG members can justify it.
Couldn't descendant - or - self::foo be interpreted as two subtractions?

Best regards
Michael
(In reply to comment #5)
> Couldn't descendant - or - self::foo be interpreted as two subtractions?

Sure, but that's the same as the 'a-b' vs 'a - b' example, unless I'm missing something.
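For readers following the 'a-b' vs 'a - b' point, here is a minimal sketch of why the hyphenated form reads as one name. It assumes a simplified, ASCII-only approximation of an XML name (the real production allows many more characters) and a hypothetical `tokenize` helper that knows only names, "::" and "-"; it is not the tokenization of any particular implementation.

```python
import re

# Assumption: ASCII-only stand-in for an XML NCName. A hyphen may appear
# inside a name but cannot start one.
NAME = r"[A-Za-z_][A-Za-z0-9_.\-]*"

# Hypothetical token pattern: optional whitespace, then a name or a symbol.
TOKEN = re.compile(rf"\s*(?:(?P<name>{NAME})|(?P<sym>::|-))")

def tokenize(text):
    """Split text into names and the symbols '::' and '-'."""
    tokens = []
    pos = 0
    text = text.rstrip()
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character at offset {pos}")
        tokens.append(m.group("name") or m.group("sym"))
        pos = m.end()
    return tokens

print(tokenize("descendant-or-self::foo"))
print(tokenize("descendant - or - self::foo"))
```

Under these assumptions the first input yields a single name followed by "::" and "foo", while the second yields separate names with "-" symbols between them, because only the whitespace ends each name early.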
It is related, but in one case it is an axis name and in the other a user-defined element name. I still prefer the longest token rule to deal with these cases.
While I sympathize in general with the idea of deleting rules like the longest-token rule from grammars when they are redundant, in this particular case I am inclined to keep this particular rule. There are several reasons:

1. I am not absolutely sure whether it's actually redundant in this case; I haven't proven that it's not, but I haven't seen anything that looks like a proof that it is.

2. Considering cases like "<<" vs. "<" + "<" (or similarly the two-character tokens vs. the two single-character tokens for "(:", ":)", "/>", ">>", "{{", "}}", "..", "::", ":=", ">=", "?>", "//"), if I have a choice of getting the right answer by knowing exactly where I am in the grammar or by following a longest-token rule, it seems clear to me that the longest-token rule is a lot simpler to understand and a lot simpler to use in practice.

If we could get rid of the qualification about being valid in the current context, I'd be even happier, but I don't see how to eliminate that without more complications.
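A minimal sketch of a longest-token rule applied to the symbol pairs mentioned in this comment. The `SYMBOLS` table and both function names are hypothetical, and the sketch deliberately ignores the "valid in the current context" qualification: it always prefers the longest symbol that matches at the current position.

```python
# Hypothetical symbol table: the two-character tokens from the comment above,
# plus the single characters they are built from.
SYMBOLS = [
    "(:", ":)", "/>", ">>", "<<", "{{", "}}", "..", "::", ":=", ">=", "?>", "//",
    "(", ")", "/", ">", "<", "{", "}", ".", ":", "=", "?",
]

def longest_symbol(text, pos=0):
    """Return the longest entry of SYMBOLS that starts at text[pos], or None."""
    candidates = [s for s in SYMBOLS if text.startswith(s, pos)]
    return max(candidates, key=len) if candidates else None

def split_symbols(text):
    """Split a string of symbols, skipping whitespace, by longest match."""
    out = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        sym = longest_symbol(text, pos)
        if sym is None:
            raise ValueError(f"no symbol matches at offset {pos}")
        out.append(sym)
        pos += len(sym)
    return out

print(split_symbols(">>"))   # one two-character token
print(split_symbols("> >"))  # two one-character tokens
```

With this rule the tokenizer never needs to know where it is in the grammar; whether the resulting token sequence is legal is left entirely to the parser.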
A joint meeting of the Query and XSLT working groups considered this comment on July 20, 2005. The WGs agreed to resolve this issue as per my previous note, and C. M. Sperberg-McQueen's comment in regards to the longest token rule. If you do not agree with this resolution, please add a comment explaining why. If you wish to appeal the WG's decision to the Director, then change the Status of the record to Reopened. If we do not hear from you in the next two weeks, we will assume you agree with the WG decision.
Closing bug because commenter has not objected to the resolution posted and more than two weeks have passed.