This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.
>Issue: paragraphs and sentences (Test, mostly)
>Sentence boundary detection is highly language-dependent and
>relies on specific language and perhaps even vocabulary knowledge.
>Paragraph boundaries ditto likewise, although in practice folks
>put paragraph structure into their markup, so then the issue is
>which markup counts as breaking paragraphs and which doesn't?
>
>Issue: flow-through/flow-around markup (Test, mostly)
>Similarly: which markup indicates word breaks and which doesn't?
>Which markup is flowed-around (e.g. footnotes) for phrase and
>proximity matching?
>
>I call these two spec issues also only because it is weird that
>we have query options for ignoring some nodes, but not for
>specifying any of these other important facts. For the record,
>I think it is correct not to have them in the query, but I also
>think putting ignored nodes into the query is a big mistake as
>well. I also think we need to acknowledge them in some way in
>testing and the spec.

Here we have a testing issue as well as a spec problem, and we should discuss it in the taskforce right away.

I would categorize these two issues under the same umbrella: when to flow through or flow around markup. In other words, some nodes should be considered or ignored for tokenization and querying, and that choice might alter the semantics of some of the operators defined in the spec.

You have a valid point about FTIgnoreOption. For example, can bold markup, which is not a word breaker and is therefore ignored by the tokenizer, be considered part of the search context (i.e. allowing the search to be restricted to bolded nodes only)?

I propose adding the capability to:
* Ignore tags in a particular namespace (e.g. the XHTML namespace)
* Declare tags as delimiters for words, sentences, and paragraphs
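To make the flow-through/flow-around distinction concrete, here is a minimal Python sketch of a tokenizer driven by such declarations. Everything in it is an assumption made for illustration, not anything defined by the spec: the `tokenize` function, its parameters, the `para`, `footnote`, and `cell` element names, and the policy that any undeclared element breaks words.

```python
import re
import xml.etree.ElementTree as ET

XHTML = "http://www.w3.org/1999/xhtml"


def split_tag(tag):
    """Split ElementTree's '{namespace}local' form into (namespace, local)."""
    if tag.startswith("{"):
        ns, _, local = tag[1:].rpartition("}")
        return ns, local
    return "", tag


def tokenize(xml_text,
             flow_through_ns=XHTML,                # hypothetical: tags here are ignored
             flow_around=frozenset({"footnote"}),  # hypothetical: content is skipped
             paragraph_tags=frozenset({"para"})):  # hypothetical: paragraph delimiters
    """Return a list of paragraphs, each a list of word tokens."""
    root = ET.fromstring(xml_text)
    paragraphs = []
    buf = []  # text chunks of the paragraph currently being built

    def flush():
        words = re.findall(r"\w+", "".join(buf))
        if words:
            paragraphs.append(words)
        buf.clear()

    def visit(el):
        ns, local = split_tag(el.tag)
        if ns == "" and local in flow_around:
            return  # flow-around: skip the element's content entirely
        is_para = ns == "" and local in paragraph_tags
        transparent = ns == flow_through_ns
        if is_para:
            flush()              # paragraph delimiter: close what came before
        elif not transparent:
            buf.append(" ")      # any other element breaks words
        buf.append(el.text or "")
        for child in el:
            visit(child)
            buf.append(child.tail or "")  # text following the child
        if is_para:
            flush()
        elif not transparent:
            buf.append(" ")

    visit(root)
    flush()
    return paragraphs


doc = (
    '<doc xmlns:h="http://www.w3.org/1999/xhtml">'
    '<para>un<h:b>believ</h:b>able claim<footnote>see p. 3</footnote> here</para>'
    '<para>left<cell>right</cell></para>'
    '</doc>'
)
print(tokenize(doc))
# [['unbelievable', 'claim', 'here'], ['left', 'right']]
```

With these settings the bold tag does not split "unbelievable", the footnote text does not sit between "claim" and "here" for phrase or proximity matching, and `cell` separates "left" from "right". A different implementation could reasonably choose other defaults, which is the point of the issue.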
We agreed at the F2F to leave this completely implementation-defined.