This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.
In Section 3. Full-Text, the text

"This sample tokenization uses white space, punctuation and XML tags as word-breakers and <p> for paragraph boundaries. The results may be different for other tokenizations."

fails to state what rule has been used to identify sentence boundaries. The guidelines for running the test suite give the rule as:

"sentences are separated by a period (a/k/a "full stop") followed immediately by white space,"

The example in 3 Full-Text Selections uses the following XML.

<books>
  <book number="1">
    <title shortTitle="Improving Web Site Usability">Improving the Usability of a Web Site Through Expert Reviews and Usability Testing</title>
    <author>Millicent Marigold</author>
    <author>Montana Marigold</author>
    <editor>Véra Tudor-Medina</editor>
    <content>
      <p>The usability of a Web site is how well the site supports the users in achieving specified goals. A Web site should facilitate learning, and enable efficient and effective task completion, while propagating few errors. </p>
      <note>This book has been approved by the Web Site Users Association. </note>
    </content>
  </book>
</books>

Following the rule for sentence breaking from the test suite guidelines, test examples-364-2, derived from section 3.6.4 of the specification, appears to be incorrect. The specification says:

The following expression returns true, because the tokens "usability" and "Marigold" are contained within different sentences:

//book contains text "usability" ftand "Marigold" different sentence

However, the first sentence break appears after the word "goals", so the two words only ever appear in the same sentence. There is no suggestion in the text that the beginning (end) of a paragraph necessarily starts (ends) a sentence.

It is also unclear how paragraph boundaries are identified. Consider the following input:

<root>
  A <p>B</p> C
</root>

I can see three possibilities:

1. There are three paragraphs: one containing A, one containing B and one containing C.
2. There are two paragraphs: one containing A, one containing B C.
3. There are two paragraphs: one containing A B, one containing C.

It is not clear from the specification which interpretation is correct.
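For illustration, the sentence rule quoted from the test suite guidelines can be applied to the sample document with a short sketch. This is not the specification's tokenizer: it assumes every XML tag is replaced by a single space (so tags act only as word-breakers) and that paragraph boundaries do not introduce sentence breaks. The names BOOK_TEXT, split_sentences and sentence_indexes are made up for this example.

```python
import re

# Assumption: text of the <book> element in document order, with each tag
# replaced by a single space (tags act only as word-breakers here).
BOOK_TEXT = (
    "Improving the Usability of a Web Site Through Expert Reviews and "
    "Usability Testing Millicent Marigold Montana Marigold "
    "Véra Tudor-Medina "
    "The usability of a Web site is how well the site supports the users "
    "in achieving specified goals. A Web site should facilitate learning, "
    "and enable efficient and effective task completion, while propagating "
    "few errors. This book has been approved by the Web Site Users "
    "Association. "
)


def split_sentences(text):
    """Split on a period followed immediately by white space."""
    parts = re.split(r"(?<=\.)\s+", text)
    return [p for p in parts if p.strip()]


def sentence_indexes(text, word):
    """Return the indexes of the sentences containing `word` (case-insensitive)."""
    word = word.lower()
    hits = set()
    for i, sentence in enumerate(split_sentences(text)):
        tokens = re.findall(r"\w+", sentence.lower())
        if word in tokens:
            hits.add(i)
    return hits


if __name__ == "__main__":
    print(len(split_sentences(BOOK_TEXT)))
    print(sentence_indexes(BOOK_TEXT, "usability"))
    print(sentence_indexes(BOOK_TEXT, "Marigold"))
```

Under these assumptions both words are found only in the first sentence, which runs from the title through "goals.", so "different sentence" would not be satisfied.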
(personal response:)

(In reply to comment #0)
> In Section 3. Full-Text, the text
>
> "This sample tokenization uses white space, punctuation and XML tags as
> word-breakers and <p> for paragraph boundaries. The results may be different
> for other tokenizations."
>
> fails to state what rule has been used to identify sentence boundaries.

Hm, right. Since we have some examples involving sentences (in section 3.6.4), we should probably copy the text you quoted from test suite's guidelines.

> There is no suggestion in the text that the beginning (end) of a paragraph
> necessarily start (ends) a sentence.

We should probably add that to the description of the sample tokenization.

> It is also unclear how paragraph boundaries are identified. Consider the
> following input:
>
> <root>
> A <p>B</p> C
> </root>
>
> I can see three possibilities:
>
> 1. There are three paragraphs: one containing A, one containing B and one
> containing C).
> 2. There are two paragraphs: one containing A, one containing B C.
> 3. There are two paragraphs: one containing A B, one containing C.
>
> It is not clear from the specification which interpretation is correct.

I think they're all conformant (and there are perhaps other possibilities). It's up to each implementation to indicate how it identifies paragraph boundaries (if it supports paragraphs).

In the sample tokenization, I'd say it's clear that A and B are not in the same paragraph (because <p> is a "paragraph boundary"), so #3 is out. I'm not sure it's necessary to describe the sample tokenization precisely enough to distinguish between #1 and #2 -- do you know of any examples or tests where it makes a difference to the result?
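The three readings of the <root> example can be made concrete with a small sketch. It is only an illustration: the function paragraphs and its flags break_at_start and break_at_end are hypothetical, and it models a paragraph boundary as occurring at the <p> start tag, the </p> end tag, or both.

```python
import xml.etree.ElementTree as ET


def paragraphs(xml, break_at_start, break_at_end):
    """Group the words of `xml` into paragraphs.

    break_at_start: a <p> start tag closes the paragraph in progress.
    break_at_end:   a </p> end tag closes the paragraph in progress.
    """
    root = ET.fromstring(xml)
    paras, current = [], []

    def flush():
        # Close the paragraph in progress, if it holds any words.
        if current:
            paras.append(" ".join(current))
            current.clear()

    def walk(elem):
        is_p = elem.tag == "p"
        if is_p and break_at_start:
            flush()
        if elem.text and elem.text.strip():
            current.extend(elem.text.split())
        for child in elem:
            walk(child)
            # Text following the child's end tag belongs to the parent.
            if child.tail and child.tail.strip():
                current.extend(child.tail.split())
        if is_p and break_at_end:
            flush()

    walk(root)
    flush()
    return paras


SAMPLE = "<root> A <p>B</p> C </root>"

if __name__ == "__main__":
    print(paragraphs(SAMPLE, True, True))    # reading 1
    print(paragraphs(SAMPLE, True, False))   # reading 2
    print(paragraphs(SAMPLE, False, True))   # reading 3
```

With both flags set the sketch gives reading #1, with only break_at_start it gives #2, and with only break_at_end it gives #3, the one ruled out above for the sample tokenization.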
Note, in some fields of discourse a sentence can indeed span multiple paragraph-like objects - e.g. poetry, or Biblical verses. I don't think it necessary to forbid this.
> I don't think it necessary to forbid this. The spec doesn't forbid implementations from supporting such definitions of paragraph and sentence, and the comments above are not suggesting it should. This issue is about the sample tokenization, which does not affect what the spec allows or disallows of implementations.
> Do you know of any examples or tests where it makes a difference to the result? I'm afraid I can't name one, but I'm positive there is one as I had to fiddle around with our tokenizer to match the behaviour expected by the test suite.
Our spec also says that words, sentences, and paragraphs form a hierarchy, so a paragraph break also implies a sentence break. Since the instructions say that <p> is a paragraph break, there is a sentence boundary at the start of the first <p> element.
(In reply to comment #5) > Our spec also says that words, sentences, and paragraphs form a hierarchy, My mistake. The relevant text is in 4.1 Tokenization, point 6 ("The tokenizer MUST preserve the containment hierarchy"). So the spec *does* forbid what Liam suggests in comment #2.
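The effect of the containment hierarchy on the 3.6.4 example can be sketched as follows. The sketch assumes the sample tokenization splits the book into three paragraph-level chunks (text before <p>, the <p> content, and the text after it), and that sentences are then found inside each chunk with the period-plus-white-space rule. PARAGRAPHS, sentences_in and in_different_sentences are names invented for the example, not anything defined by the spec or the test suite.

```python
import re

# Assumption: <p> both starts and ends a paragraph, giving three chunks.
PARAGRAPHS = [
    "Improving the Usability of a Web Site Through Expert Reviews and "
    "Usability Testing Millicent Marigold Montana Marigold Véra Tudor-Medina",
    "The usability of a Web site is how well the site supports the users "
    "in achieving specified goals. A Web site should facilitate learning, "
    "and enable efficient and effective task completion, while propagating "
    "few errors. ",
    "This book has been approved by the Web Site Users Association. ",
]


def split_sentences(paragraph):
    """Split on a period followed immediately by white space."""
    return [s for s in re.split(r"(?<=\.)\s+", paragraph) if s.strip()]


def sentences_in(paragraphs):
    """Sentences never cross a paragraph boundary: split each paragraph alone."""
    out = []
    for para in paragraphs:
        out.extend(split_sentences(para))
    return out


def in_different_sentences(paragraphs, word_a, word_b):
    """True if some occurrence of word_a and some occurrence of word_b
    fall in different sentences."""
    sents = [set(re.findall(r"\w+", s.lower())) for s in sentences_in(paragraphs)]
    a = {i for i, toks in enumerate(sents) if word_a.lower() in toks}
    b = {i for i, toks in enumerate(sents) if word_b.lower() in toks}
    return any(i != j for i in a for j in b)


if __name__ == "__main__":
    # Paragraph breaks imply sentence breaks.
    print(in_different_sentences(PARAGRAPHS, "usability", "Marigold"))
    # Same text with no paragraph boundaries at all.
    print(in_different_sentences([" ".join(PARAGRAPHS)], "usability", "Marigold"))
```

With paragraph boundaries respected, "usability" in the <p> text lands in a different sentence from "Marigold", which matches the result the specification claims. With the boundaries ignored, both words stay in one sentence.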
WG agrees to resolve this bug by adding clarifying text to section 3 of the specification. We believe the instructions for the test suite already say this. Please indicate your satisfaction with this resolution by marking the bug as CLOSED.