This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.
Dear authors of the XQuery Full-Text Specification,

Please clarify the following issues:

1. I am a bit confused by the definitions of TokenInfo and StringInclude.

[Definition: A TokenInfo represents a contiguous collection of tokens from an XML document.]

[Definition: A StringInclude is a StringMatch that describes a TokenInfo that must be contained in the document.]

The UML Static Class diagram of AllMatches shows a one-to-one correspondence between StringMatch and TokenInfo. But from the XML Schema definition:

    <xs:element name="stringInclude" type="fts:stringMatch" />
    <xs:complexType name="stringMatch">
      <xs:sequence>
        <xs:element ref="fts:tokenInfo"/>
      </xs:sequence>
      <xs:attribute name="queryPos" type="xs:integer" use="required"/>
      <xs:attribute name="isContiguous" type="xs:boolean" use="required"/>
    </xs:complexType>
    <xs:complexType name="tokenInfo">
      <xs:attribute name="startPos" type="xs:integer" use="required"/>
      <xs:attribute name="endPos" type="xs:integer" use="required"/>
      <xs:attribute name="startSent" type="xs:integer" use="required"/>
      <xs:attribute name="endSent" type="xs:integer" use="required"/>
      <xs:attribute name="startPara" type="xs:integer" use="required"/>
      <xs:attribute name="endPara" type="xs:integer" use="required"/>
    </xs:complexType>
    <xs:element name="tokenInfo" type="fts:tokenInfo"/>

it follows that StringMatch can contain a SEQUENCE of tokenInfo. So we have a one-to-many relationship. Please clarify the correct relationship between StringMatch and tokenInfo.

2. In section 4.2.7.9 FTDistance you have an example:

    ("Ford Mustang" ftand "excellent") distance at most 3 words

And at the end you say:

"The result for the FTDistance selection consists of only the first Match (with positions 1, 2, and 5) and the fifth Match (with positions 25, 27, and 28), because only for these Matches the word distance between consecutive TokenInfos is always less than or equal to 3. It is 1 for the first pair and 3 for the second in the first case, and 2 and 1 in the second."

Here for the first match you have 2 StringIncludes (shown in the diagram):

1) the first StringInclude, with startPos = 1 and endPos = 2
2) the second StringInclude, with startPos = 5 (endPos = 5)

But what are the consecutive pairs? It looks like we have 2 StringIncludes and therefore only ONE pair, with distance = 5 - 2 - 1 = 2, but you say "It is 1 for the first pair and 3 for the second in the first case", which describes something different. Please clarify how you define the consecutive pairs.

Thank you in advance,
Peter Pleshachkov
[personal response:]

> 1. I am a bit confused with the definition of TokenInfo and StringInclude.
> ...
> <xs:complexType name="stringMatch">
>   <xs:sequence>
>     <xs:element ref="fts:tokenInfo"/>
>   </xs:sequence>
>   ...
> </xs:complexType>
> ...
> follows that StringMatch can contain a SEQUENCE of tokenInfo. So, we
> have one-to many relationship.

I think you are misreading the Schema definition. The construct

    <xs:sequence>
      <xs:element ref="fts:tokenInfo"/>
    </xs:sequence>

doesn't mean "a sequence of any number of tokenInfo elements"; it means "a sequence of exactly one tokenInfo element". To specify other than "exactly one", we would use the 'minOccurs' and/or 'maxOccurs' attributes (of <sequence> or <element>).

(By the way, if you have two independent comments, it's better to submit them as separate Bugzilla issues.)
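The "exactly one" reading follows from the XML Schema defaults: when 'minOccurs' and 'maxOccurs' are omitted, each defaults to 1. As a minimal sketch (not a schema validator), the following reads the occurrence bounds off the quoted fragment using only the Python standard library. The helper name `occurrence_bounds` and the `urn:example:fts` namespace URI are placeholders introduced here for illustration, not taken from the specification.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"

# The stringMatch content model quoted above. The fts namespace URI is a
# placeholder (assumption); only the xs: structure matters for this sketch.
SCHEMA_FRAGMENT = f"""
<xs:complexType xmlns:xs="{XS}" xmlns:fts="urn:example:fts" name="stringMatch">
  <xs:sequence>
    <xs:element ref="fts:tokenInfo"/>
  </xs:sequence>
</xs:complexType>
"""


def occurrence_bounds(particle):
    """Return (minOccurs, maxOccurs) of a schema particle as strings.

    XML Schema defaults both attributes to 1 when they are absent.
    """
    return particle.get("minOccurs", "1"), particle.get("maxOccurs", "1")


root = ET.fromstring(SCHEMA_FRAGMENT)
sequence = root.find(f"{{{XS}}}sequence")
element = sequence.find(f"{{{XS}}}element")

print(occurrence_bounds(sequence))  # ('1', '1')
print(occurrence_bounds(element))   # ('1', '1')
```

Both the sequence and the element reference occur exactly once, so each stringMatch carries a single tokenInfo, which agrees with the one-to-one relationship in the UML diagram.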
[personal response:]

Re your point #2: Yes, I think that's a mistake in the specification. Where we say:

    It is 1 for the first pair and 3 for the second in the first case,
    and 2 and 1 in the second.

we should instead say something like:

    For the first Match, the word distance between
    the two TokenInfos is 3 (startPos 5 - endPos 2),
    and for the fifth Match, it's 2 (startPos 27 - endPos 25).
(In reply to comment #2)

But according to the spec: "the distance between the two is M2's starting position minus M1's ending position, minus 1." So, for the first match we should get distance = 5 - 2 - 1 = 2. Is that right?

By the way, section 3.6.3 contains the example:

    /books/book ftcontains "web" ftand "site" ftand "usability" distance at most 2 words

with the following explanation:

"The following expression returns false: The search context does contain the phrase "The usability of a Web site", in which the tokens "usability" and "Web" have a distance of 2 words, and the tokens "Web" and "site" have a distance of 0 words, both of which satisfy the constraint distance at most 2 words. However, the problem is that "usability" and "site" have a distance of 3 words, which does not satisfy the constraint, and so the distance selection yields no matches, and the expression as a whole yields false. (The phrase "Improving Web Site Usability" would satisfy the given full-text selection, but it occurs in an attribute value, and so is not subject to tokenization.)"

But the spec says that we have to check the distance between each "successive pair of matches". So we have to check the distance constraint for the pairs ("usability", "Web") and ("Web", "site"), but not for the pair ("usability", "site"). This follows from the formal function as well:

    declare function fts:ApplyFTWordDistanceAtMost (
          $allMatches as element(fts:allMatches),
          $n as xs:integer )
       as element(fts:allMatches)
    {
       <fts:allMatches stokenNum="{$allMatches/@stokenNum}">
       {
          for $match in $allMatches/fts:match
          let $sorted := for $si in $match/fts:stringInclude
                         order by $si/fts:tokenInfo/@startPos ascending,
                                  $si/fts:tokenInfo/@endPos ascending
                         return $si
          where
             if (fn:count($sorted) le 1) then fn:true() else
                every $index in (1 to fn:count($sorted) - 1)
                satisfies fts:wordDistance(
                             $sorted[$index]/fts:tokenInfo,
                             $sorted[$index+1]/fts:tokenInfo
                          ) <= $n
          return
             <fts:match>
             {
                fts:joinIncludes($match/fts:stringInclude),
                for $stringExcl in $match/fts:stringExclude
                where some $stringIncl in $match/fts:stringInclude
                      satisfies fts:wordDistance(
                                   $stringIncl/fts:tokenInfo,
                                   $stringExcl/fts:tokenInfo
                                ) <= $n
                return $stringExcl
             }
             </fts:match>
       }
       </fts:allMatches>
    };

So, is the example correct?

> [personal response:]
>
> Re your point #2: Yes, I think that's a mistake in the specification.
> Where we say:
>     It is 1 for the first pair and 3 for the second in the first case,
>     and 2 and 1 in the second.
> We should instead say something like:
>     For the first Match, the word distance between
>     the two TokenInfos is 3 (startPos 5 - endPos 2),
>     and for the fifth Match, it's 2 (startPos 27 - endPos 25).
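To make the successive-pair reading concrete, here is a rough Python sketch of the check performed by the `where` clause of the quoted function: sort the StringIncludes by (startPos, endPos) and test only adjacent pairs. It is an illustration, not the specification's definition. The names `word_distance` and `within_at_most` are hypothetical, the distance formula is the sentence quoted from the spec above, and the token positions are an assumption obtained by counting words within the phrase "The usability of a Web site" alone.

```python
from typing import List, Tuple

# A TokenInfo is reduced here to (startPos, endPos); sentence and paragraph
# attributes are ignored because word distance does not use them.
TokenInfo = Tuple[int, int]


def word_distance(first: TokenInfo, second: TokenInfo) -> int:
    """M2's starting position minus M1's ending position, minus 1."""
    return second[0] - first[1] - 1


def within_at_most(token_infos: List[TokenInfo], n: int) -> bool:
    """Check only successive pairs after sorting by (startPos, endPos).

    With zero or one TokenInfo there are no pairs, so the result is True,
    mirroring the fn:count($sorted) le 1 branch.
    """
    ordered = sorted(token_infos)
    return all(word_distance(a, b) <= n for a, b in zip(ordered, ordered[1:]))


# Assumed positions: The=1 usability=2 of=3 a=4 Web=5 site=6
usability, web, site = (2, 2), (5, 5), (6, 6)

print(word_distance(usability, web))               # 2
print(word_distance(web, site))                    # 0
print(word_distance(usability, site))              # 3
print(within_at_most([web, site, usability], 2))   # True
```

Under this reading the ("usability", "site") pair is never compared, so the match would survive "distance at most 2 words", which is the discrepancy with the section 3.6.3 prose raised in this comment.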
(In reply to comment #3)
> (In reply to comment #2)
> But according to the spec: "the distance between the two is M2's starting
> position minus M1's ending position, minus 1.".
> So, for the first match we should get the distance = 5 - 2 - 1 = 2. Is it right
> ?

Whoops, right, I forgot the minus one. So:

    For the first Match, the word distance between
    the two TokenInfos is 2 (startPos 5 - endPos 2 - 1),
    and for the fifth Match, it's 1 (startPos 27 - endPos 25 - 1).
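The corrected figures can be rechecked with a few lines of Python. This is only an arithmetic sketch: `word_distance` is a hypothetical helper implementing the quoted rule, and the TokenInfo spans for the fifth Match (a single token at 25, a two-token phrase at 27 to 28) are inferred from the positions and the subtraction given in this thread, not read from the specification's diagram.

```python
def word_distance(first, second):
    """Distance between two (startPos, endPos) spans:
    second's startPos minus first's endPos, minus 1."""
    return second[0] - first[1] - 1


# (startPos, endPos) spans per Match; the fifth Match's spans are inferred.
matches = {
    "first": [(1, 2), (5, 5)],
    "fifth": [(25, 25), (27, 28)],
}

for name, (m1, m2) in matches.items():
    d = word_distance(m1, m2)
    print(name, d, d <= 3)
# first 2 True
# fifth 1 True
```

Each Match has two StringIncludes and therefore a single pair, and both distances satisfy "distance at most 3 words".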
With respect to the prose around that example in section 3.6.3, this problem was raised in Bug 5886 and has been fixed. The revised wording will appear the next time the specification is published.
At its meeting on 2008-12-22, the Task Force accepted my responses in comments #1, #4, and #5. For comment #1, there is no change to the document. For comment #4, I have modified the editors' copy of the Full Text document as proposed. For comment #5, there has already been a change to the document. Consequently, I'm marking this issue RESOLVED-fixed. If you accept this resolution, please mark the issue CLOSED.