6303 – [FT] TokenInfo and StringInclude definition

This is an archived snapshot of W3C's public Bugzilla bug tracker, decommissioned in April 2019.

Status:	CLOSED FIXED

Alias:	None

Product:	XPath / XQuery / XSLT
Classification:	Unclassified
Component:	Full Text 1.0
Version:	Candidate Recommendation
Hardware:	PC Linux

Importance:	P2 normal
Target Milestone:	---
Assignee:	Jim Melton
QA Contact:	Mailing list for public feedback on specs from XSL and XML Query WGs

URL:	http://www.w3.org/TR/xpath-full-text-10/

Reported:	2008-12-11 15:30 UTC by Petr Pleshachkov
Modified:	2011-01-06 15:43 UTC

Description Petr Pleshachkov 2008-12-11 15:30:44 UTC

Dear authors of XQuery Full-Text Specification,

Please clarify the following issues:

1. I am a bit confused by the definitions of TokenInfo and StringInclude.

[Definition: A TokenInfo represents a contiguous collection of tokens
from an XML document. ]

[Definition: A StringInclude is a StringMatch that describes a
TokenInfo that must be contained in the document.]

The UML Static Class diagram of AllMatches shows a one-to-one
correspondence between StringMatch and TokenInfo.

But from the XML Schema definition:

 <xs:element name="stringInclude"
             type="fts:stringMatch" />


 <xs:complexType name="stringMatch">
   <xs:sequence>
     <xs:element ref="fts:tokenInfo"/>
   </xs:sequence>
   <xs:attribute name="queryPos"
                 type="xs:integer"
                 use="required"/>
   <xs:attribute name="isContiguous"
                 type="xs:boolean"
                 use="required"/>
 </xs:complexType>

 <xs:complexType name="tokenInfo">
   <xs:attribute name="startPos"
                 type="xs:integer"
                 use="required"/>
   <xs:attribute name="endPos"
                 type="xs:integer"
                 use="required"/>
   <xs:attribute name="startSent"
                 type="xs:integer"
                 use="required"/>
   <xs:attribute name="endSent"
                 type="xs:integer"
                 use="required"/>
   <xs:attribute name="startPara"
                 type="xs:integer"
                 use="required"/>
   <xs:attribute name="endPara"
                 type="xs:integer"
                 use="required"/>
 </xs:complexType>

 <xs:element name="tokenInfo" type="fts:tokenInfo"/>

it follows that a StringMatch can contain a SEQUENCE of tokenInfo
elements. So we have a one-to-many relationship.

Please clarify the correct relationship between StringMatch and TokenInfo.


2. In section 4.2.7.9 FTDistance you have an example: ("Ford Mustang"
ftand "excellent") distance at most 3 words

And you say at the end : "The result for the FTDistance selection
consists of only the first Match (with positions 1, 2, and 5) and the
fifth Match (with positions 25, 27, and 28), because only for these
Matches the word distance between consecutive TokenInfos is always
less than or equal to 3. It is 1 for the first pair and 3 for the
second in the first case, and 2 and 1 in the second."

Here, for the first match, you have 2 StringIncludes (shown on the diagram):
1) first StringInclude with startPos = 1 and endPos = 2
2) second StringInclude with startPos = 5 (endPos = 5)

But what are the consecutive pairs? It looks like we have 2
StringIncludes and therefore only ONE pair, with distance = 5 - 2 - 1 = 2,
but you say "It is 1 for the first pair and 3 for the second in the first
case", which describes something different.

Please clarify how you define the consecutive pairs.


Thank you in advance,
Peter Pleshachkov

Comment 1 Michael Dyck 2008-12-11 18:37:47 UTC

[personal response:]

> 1. I am a bit confused with the definition of TokenInfo and StringInclude.
> ...
>  <xs:complexType name="stringMatch">
>    <xs:sequence>
>      <xs:element ref="fts:tokenInfo"/>
>    </xs:sequence>
>    ...
>  </xs:complexType>
> ...
> follows that StringMatch can contain a SEQUENCE of tokenInfo. So, we
> have one-to many relationship.

I think you are misreading the Schema definition. The construct
    <xs:sequence>
      <xs:element ref="fts:tokenInfo"/>
    </xs:sequence>
doesn't mean "a sequence of any number of tokenInfo elements", it means "a sequence of exactly one tokenInfo element". To specify other than "exactly one", we would use the 'minOccurs' and/or 'maxOccurs' attributes (of <sequence> or <element>).
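The defaulting rule described above can be checked mechanically. The following is a minimal sketch in Python, assuming a trimmed copy of the stringMatch type; the helper name occurrence_bounds and the namespace URI urn:example:fts are placeholders for illustration, not taken from the specification. It only reads the occurrence attributes and applies XML Schema's default of 1 when they are absent; it does not validate anything.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"

# Trimmed copy of the stringMatch type; urn:example:fts is a placeholder URI.
SCHEMA_FRAGMENT = """
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:fts="urn:example:fts">
  <xs:complexType name="stringMatch">
    <xs:sequence>
      <xs:element ref="fts:tokenInfo"/>
    </xs:sequence>
  </xs:complexType>
</xs:schema>
"""


def occurrence_bounds(particle):
    """Return (minOccurs, maxOccurs) as strings; XML Schema defaults both to 1."""
    return particle.get("minOccurs", "1"), particle.get("maxOccurs", "1")


root = ET.fromstring(SCHEMA_FRAGMENT)
sequence = root.find(f".//{{{XS}}}sequence")
element = sequence.find(f"{{{XS}}}element")

seq_bounds = occurrence_bounds(sequence)
elem_bounds = occurrence_bounds(element)
print("sequence:", seq_bounds, "element:", elem_bounds)
```

Both particles report ('1', '1'), so the content model admits exactly one tokenInfo child. A one-to-many relationship would need something like maxOccurs="unbounded" on the element or on the sequence.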

(By the way, if you have two independent comments, it's better to submit them as separate Bugzilla issues.)

Comment 2 Michael Dyck 2008-12-11 19:11:21 UTC

[personal response:]

Re your point #2: Yes, I think that's a mistake in the specification.
Where we say:
    It is 1 for the first pair and 3 for the second in the first case,
    and 2 and 1 in the second.
We should instead say something like:
    For the first Match, the word distance between 
    the two TokenInfos is 3 (startPos 5 - endPos 2),
    and for the fifth Match, it's 2 (startPos 27 - endPos 25).

Comment 3 Petr Pleshachkov 2008-12-11 21:07:49 UTC

(In reply to comment #2)
But according to the spec: "the distance between the two is M2's starting position minus M1's ending position, minus 1."
So, for the first match we should get the distance = 5 - 2 - 1 = 2. Is that right?

By the way, section 3.6.3 contains an example:

"/books/book ftcontains "web" ftand "site" ftand
"usability" distance at most 2 words"

with the following explanation:

"The following expression returns false:

The search context does contain the phrase "The usability of a Web site", in which the tokens "usability" and "Web" have a distance of 2 words, and the tokens "Web" and "site" have a distance of 0 words, both of which satisfy the constraint distance at most 2 words. However, the problem is that "usability" and "site" have a distance of 3 words, which does not satisfy the constraint, and so the distance selection yields no matches, and the expression as a whole yields false. (The phrase "Improving Web Site Usability" would satisfy the given full-text selection, but it occurs in an attribute value, and so is not subject to tokenization.)"

But the spec says that we have to check the distance between "successive pair of matches".

So we have to check the distance constraint for the pairs ("usability", "Web") and ("Web", "site"), but not for the pair ("usability", "site").

This follows from the formal function as well:

declare function fts:ApplyFTWordDistanceAtMost (
      $allMatches as element(fts:allMatches),
      $n as xs:integer ) 
   as element(fts:allMatches) 
{
   <fts:allMatches stokenNum="{$allMatches/@stokenNum}">
   {
      for $match in $allMatches/fts:match
      let $sorted := for $si in $match/fts:stringInclude          
                     order by $si/fts:tokenInfo/@startPos ascending,
                              $si/fts:tokenInfo/@endPos ascending
                     return $si
      where
         if (fn:count($sorted) le 1) then fn:true() else
            every $index in (1 to fn:count($sorted) - 1)
            satisfies fts:wordDistance(
                          $sorted[$index]/fts:tokenInfo,
                          $sorted[$index+1]/fts:tokenInfo
                      ) <= $n 
      return 
        <fts:match>
        {
           fts:joinIncludes($match/fts:stringInclude),
           for $stringExcl in $match/fts:stringExclude
           where some $stringIncl in $match/fts:stringInclude
                 satisfies fts:wordDistance(
                               $stringIncl/fts:tokenInfo,
                               $stringExcl/fts:tokenInfo
                           ) <= $n
           return $stringExcl
        }
        </fts:match>
   }
   </fts:allMatches>
};

So, is the example correct?
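The two readings can be compared side by side. The following Python sketch is an illustration only, not the specification's semantics: spans are (startPos, endPos) tuples, the helper names are invented here, and the token positions assigned to "usability", "Web" and "site" are assumed values chosen to reproduce the distances of 2, 0 and 3 quoted in the explanation.

```python
from itertools import combinations


def word_distance(first, second):
    """Later span's start minus earlier span's end, minus 1."""
    a, b = sorted([first, second])
    return b[0] - a[1] - 1


def successive_pairs_ok(spans, n):
    """Check only neighbouring spans after sorting by (start, end)."""
    ordered = sorted(spans)
    return all(word_distance(x, y) <= n for x, y in zip(ordered, ordered[1:]))


def all_pairs_ok(spans, n):
    """Check every pair of spans, not just neighbours."""
    return all(word_distance(x, y) <= n for x, y in combinations(spans, 2))


# Assumed positions in "The usability of a Web site":
# usability = 2, Web = 5, site = 6 (single-token spans).
spans = [(5, 5), (6, 6), (2, 2)]

successive = successive_pairs_ok(spans, 2)
every_pair = all_pairs_ok(spans, 2)
print(successive, every_pair)
```

Under the successive-pairs reading the match survives "distance at most 2 words", because the checked distances are 2 and 0. Under the all-pairs reading it is rejected, because "usability" and "site" are 3 words apart.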

> [personal response:]
> 
> Re your point #2: Yes, I think that's a mistake in the specification.
> Where we say:
>     It is 1 for the first pair and 3 for the second in the first case,
>     and 2 and 1 in the second.
> We should instead say something like:
>     For the first Match, the word distance between 
>     the two TokenInfos is 3 (startPos 5 - endPos 2),
>     and for the fifth Match, it's 2 (startPos 27 - endPos 25).
>

Comment 4 Michael Dyck 2008-12-11 22:08:56 UTC

(In reply to comment #3)
> (In reply to comment #2)
> But according to the spec: "the distance between the two is M2's starting
> position minus M1's ending position, minus 1.". 
> So, for the first match we should get the distance = 5 - 2 - 1 = 2. Is it right
> ? 

Whoops, right, I forgot the minus one. So:
    For the first Match, the word distance between 
    the two TokenInfos is 2 (startPos 5 - endPos 2 - 1),
    and for the fifth Match, it's 1 (startPos 27 - endPos 25 - 1).
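The corrected arithmetic can be reproduced with a short Python sketch. It assumes each TokenInfo is reduced to a (startPos, endPos) tuple and that word distance is the later span's start minus the earlier span's end, minus 1, as quoted from the spec in comment #3. The function name and the tuple representation are illustrative choices, not part of the specification.

```python
def word_distance(first, second):
    """Word distance between two (startPos, endPos) spans, in either order."""
    earlier, later = sorted([first, second])
    return later[0] - earlier[1] - 1


# First Match: "Ford Mustang" at positions 1-2, "excellent" at position 5.
first_match = word_distance((1, 2), (5, 5))

# Fifth Match: "excellent" at position 25, "Ford Mustang" at positions 27-28.
fifth_match = word_distance((27, 28), (25, 25))

print(first_match, fifth_match)
```

Both values are within the "at most 3 words" bound, which is consistent with the first and fifth Matches being kept.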

Comment 5 Michael Dyck 2008-12-11 22:20:07 UTC

With respect to the prose around that example in section 3.6.3,
this problem was raised in Bug 5886 and has been fixed.
The revised wording will appear the next time the specification is published.

Comment 6 Michael Dyck 2009-01-08 02:33:02 UTC

At its meeting on 2008-12-22, the Task Force accepted my responses in
comments #1, #4, and #5.

For comment #1, there is no change to the document.

For comment #4, I have modified the editors' copy of the Full Text
document as proposed.

For comment #5, there has already been a change to the document.

Consequently, I'm marking this issue RESOLVED-fixed.
If you accept this resolution, please mark the issue CLOSED.