Copyright © 2004 W3C® (MIT, ERCIM, Keio), All Rights Reserved. W3C liability, trademark, document use and software licensing rules apply.
This document defines the syntax and formal semantics of XQuery 1.0 and XPath 2.0 Full-Text, a language that extends XQuery 1.0 and XPath 2.0 with full-text search capabilities.
This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.
This is the first public working draft of the XQuery 1.0 and XPath 2.0 Full-Text specification. This WD attempts to meet the requirements in [XQuery and XPath Full-Text Requirements]. The syntax and semantics in this specification are used in [XQuery 1.0 and XPath 2.0 Full-Text Use Cases]. The grammar in this document is aligned with the XQuery 1.0 Last Call Working Draft grammar in [XQuery 1.0: A Query Language for XML].
Publication as a Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.
This document contains many open issues, and should not be considered to be fully stable. Vendors who wish to create preview implementations based on this document do so at their own risk. While this document reflects the general consensus of the working groups, there are still controversial areas that may be subject to change.
Public comments on this document and its open issues are welcome. Comments should be sent to the W3C mailing list, public-qt-comments@w3.org (http://lists.w3.org/Archives/Public/public-qt-comments/) with "[FT]" at the beginning of the subject field.
XQuery 1.0 and XPath 2.0 Full-Text has been defined jointly by the XML Query Working Group and the XSL Working Group (both part of the XML Activity).
The patent policy for this document is expected to become the 5 February 2004 W3C Patent Policy, pending the Advisory Committee review of the renewal of the XML Query Working Group. Patent disclosures relevant to this specification may be found on the XML Query Working Group's patent disclosure page and the XSL Working Group's patent disclosure page. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) with respect to this specification should disclose the information in accordance with section 6 of the W3C Patent Policy.
1 Introduction
1.1 Full-Text Search and XML
1.2 Organization of this document
2 Full-Text Extensions to XQuery and XPath
2.1 Expression FTContainsExpr
2.1.1 FTContainsExpr Description
2.1.2 FTContainsExpr Examples
2.1.3 Extending the Grammars of XQuery and XPath
2.2 Function ft:score()
2.2.1 Function ft:score() Description
2.2.2 ft:score() Examples
3 FTSelection and FTMatchOptions
3.1 FTSelection
3.1.1 FTSelection Example
3.1.2 FTWords
3.1.3 FTOr
3.1.4 FTAnd
3.1.5 FTUnaryNot
3.1.6 FTMildNegation
3.1.7 FTOrder
3.1.8 FTScope
3.1.9 FTDistance
3.1.10 FTWindow
3.1.11 FTTimes
3.2 FTMatchOptions
3.2.1 FTCaseOption
3.2.2 FTDiacriticsOption
3.2.3 FTSpecialCharOption
3.2.4 FTStemOption
3.2.5 FTThesaurusOption
3.2.6 FTStopwordOption
3.2.7 FTLanguageOption
3.2.8 FTIgnoreOption
3.2.9 FTRegexOption
4 Semantics
4.1 Introduction
4.2 Nested XQuery and XPath Expressions
4.2.1 Left-hand Side of a FTContainsExpr
4.2.2 FTWords
4.2.3 FTRangeSpec
4.2.4 FTStopWordOption
4.2.5 FTThesaurusOption
4.2.6 FTLanguageOption
4.2.7 FTIgnoreOption
4.2.8 Tokenization
4.3 Evaluation of FTSelections
4.3.1 AllMatches
4.3.1.1 Formal Model
4.3.1.2 Examples
4.3.1.3 XML representation
4.3.2 FTSelections
4.3.2.1 XML Representation
4.3.2.2 The evaluate function
4.3.2.3 Formal semantics functions
4.3.2.4 FTWords
4.3.2.5 FTOr
4.3.2.6 FTAnd
4.3.2.7 FTUnaryNot
4.3.2.8 FTMildNot
4.3.2.9 FTOrder
4.3.2.10 FTScope
4.3.2.11 FTDistance
4.3.2.12 FTWindow
4.3.2.13 FTTimes
4.3.3 Match Options Semantics
4.3.3.1 Types
4.3.3.2 High-Level Semantics
4.3.3.3 Formal Semantics Functions
4.3.3.4 FTCaseOption
4.3.3.5 FTDiacriticsOption
4.3.3.6 FTSpecialCharOption
4.3.3.7 FTStemOption
4.3.3.8 FTThesaurusOption
4.3.3.9 FTStopWordOption
4.3.3.10 FTLanguageOption
4.3.3.11 FTRegexOption
4.3.3.12 FTIgnoreOption
4.4 XQuery 1.0 and XPath 2.0 Full-Text and Scoring Expressions
4.4.1 FTContainsExpr
4.4.1.1 Semantics of FTContainsExpr
4.4.1.2 Example
4.4.2 Scoring
A EBNF for XQuery 1.0 Grammar with Full-Text extensions
A.1 Grammar Notes
A.2 White Space Rules
B EBNF for XPath 2.0 Grammar with Full-Text extensions
C References
C.1 Normative References
D Issues List
E Acknowledgements
F Glossary
This document defines the language and the formal semantics of XQuery 1.0 and XPath 2.0 Full-Text. This language is designed to meet the requirements identified in http://www.w3.org/TR/xmlquery-full-text-requirements/ and the use cases in http://www.w3.org/TR/xmlquery-full-text-use-cases/.
XQuery 1.0 and XPath 2.0 Full-Text extends the syntax and semantics of XQuery 1.0 and XPath 2.0.
XML documents may contain highly-structured data (numbers, dates), unstructured data (untagged free-flowing text), and semi-structured data (text with embedded tags). Where a document contains unstructured or semi-structured data, it is important to be able to search that data using Information Retrieval techniques such as full-text search. Full-text search is different from substring search in many ways:
A full-text search searches for phrases (a sequence of words) rather than substrings. A substring search for news items that contain the string "lease" will return a news item that contains "Foobar Corporation releases the 20.9 version ...". A full-text search for the phrase "lease" will not.
There is an expectation that a full-text search will support language-based and token-based searches, which substring search cannot. An example of a language-based search is "find me all the news items that contain a word with the same linguistic stem as 'mouse'" (finds "mouse" and "mice"). An example of a token-based search is "find me all the news items that contain the word 'XML' within 3 words (tokens) of 'Query'".
Full-text search is subject to the vagaries and nuances of language. The results it returns are often of varying usefulness. When you search a web site for all cameras that cost less than $100, this is an exact search. There is a set of cameras that match this search, and a set that do not. Similarly, when you do a string search across news items for "mouse", there is only 1 expected result set. When you do a full-text search for, say, all the news items that contain the word "mouse", you probably expect to find news items with the word "mice", and possibly "rodents" (or possibly "computers"!). But not all results are equal: some results are more "mousey" than others. Because full-text search can be inexact, we have the notion of score or relevance: we generally expect to see the most relevant results at the top of the results list. Of course, relevance is in the eye of the beholder. Note: as XQuery/XPath evolves, it may apply the notion of score to structured search. For example, when making travel plans or shopping for cameras, it is sometimes more useful to get an ordered list of near-matches. If XQuery/XPath defines a generalized inexact match, we assume that XQuery/XPath can utilize the scoring framework provided by the full-text language.
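Note: The difference between substring search and full-text search described above can be illustrated with a minimal Python sketch. The tokenizer used here (lower-cased runs of letters and digits) and the function names are assumptions made for illustration only; this specification leaves tokenization implementation-defined.

```python
import re

def tokenize(text):
    """Naive tokenizer (an assumption): lower-cased runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())

def substring_search(text, needle):
    """Substring search: matches anywhere, even inside a longer word."""
    return needle.lower() in text.lower()

def full_text_search(text, phrase):
    """Full-text search: the phrase must match a run of whole tokens."""
    words, query = tokenize(text), tokenize(phrase)
    n = len(query)
    return n > 0 and any(words[i:i + n] == query
                         for i in range(len(words) - n + 1))

news = "Foobar Corporation releases the 20.9 version"
print(substring_search(news, "lease"))  # the string occurs inside "releases"
print(full_text_search(news, "lease"))  # no token is equal to "lease"
```

The substring search reports a hit because "lease" occurs inside "releases", whereas the token-based search does not, since no token equals "lease".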
As XML becomes mainstream, users expect to be able to store and search all their documents in XML. This requires a standard way to do full-text search, as well as structured searches, against XML documents. A similar requirement for full-text search led ISO to define the SQL/MM-FT standard. SQL/MM-FT defines extensions to SQL to express full-text queries, providing functionality similar to that of this full-text language extension to XQuery 1.0/XPath 2.0.
Full-text queries are performed on text which has been tokenized, i.e., broken into a sequence of words, units of punctuation, and spaces.
A word is defined as any character, n-gram, or sequence of characters returned by a tokenizer as a basic unit to be queried. Each instance of a word consists of one or more consecutive characters. Beyond that, words are implementation defined. Note that consecutive words need not be separated by either punctuation or space, and words may overlap. A phrase is a sequence of ordered words which can contain any number of words.
Tokenization enables functions and operators which work with the relative positions of words (e.g., proximity operators). It also uniquely identifies sentences and paragraphs in which words appear. Tokenization also enables functions and operators which operate on a part or the root of the word (e.g., wildcards, stemming).
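Note: As a rough illustration of what a tokenizer might provide, the following Python sketch assigns each word a position, a sentence number, and a paragraph number. The delimiters (".", "!", "?" for sentences, blank lines for paragraphs) and the Token structure are assumptions of this sketch; actual tokenization is implementation-defined and may, for example, treat "20.9" as a single word.

```python
import re
from typing import NamedTuple

class Token(NamedTuple):
    word: str
    pos: int        # word position, starting at 1
    sentence: int   # sentence number, starting at 1
    paragraph: int  # paragraph number, starting at 1

def tokenize(text):
    """Split text into words, recording the sentence and paragraph of each."""
    tokens, pos, sentence = [], 0, 0
    # Assumption: paragraphs are separated by blank lines.
    for p_no, para in enumerate(re.split(r"\n\s*\n", text), start=1):
        # Assumption: sentences end with ".", "!", or "?".
        for sent in re.split(r"[.!?]+", para):
            words = re.findall(r"\w+", sent)
            if not words:
                continue
            sentence += 1
            for w in words:
                pos += 1
                tokens.append(Token(w, pos, sentence, p_no))
    return tokens

for token in tokenize("Web site usability. It matters!\n\nTesting helps."):
    print(token)
```

Positional operators can then be evaluated on the recorded positions alone, without consulting the original text again.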
We use the namespace prefix "ft" (for full-text), which corresponds to the URI http://www.w3.org/2004/07/xquery-full-text and identifies the namespace of full-text search. We also use the prefix "fts" for definitional purposes in the semantics section.
This document is organized as follows. We first present a high-level syntax for the XQuery 1.0 and XPath 2.0 Full-Text language along with some examples. Then, we present the syntax and examples of the basic primitives in the XQuery 1.0 and XPath 2.0 Full-Text language. This is followed by the semantics of the XQuery 1.0 and XPath 2.0 Full-Text language. The appendices contain a section on how to integrate full-text into the XPath grammar, a section on how to integrate full-text into the XQuery grammar, a list of issues, acknowledgements, and a glossary.
To extend the languages of XQuery and XPath for full-text search we introduce a new kind of expression, called FTContainsExpr, as well as the new function ft:score.
The XQuery and XPath languages are extended by adding the expression FTContainsExpr. Syntactically an FTContainsExpr is just an additional comparison expression, similar to those in Section 3.5.2 General Comparisons of [XQuery 1.0: A Query Language for XML].
From the XQuery Language spec:
"Comparison expressions allow two values to be compared. XQuery provides three kinds of comparison expressions, called value comparisons, general comparisons, and node comparisons."
For the moment, let us assume that the following production is added to the grammars of XQuery and XPath.
ComparisonExpr ::= FTContainsExpr
Let us briefly describe what an FTContainsExpr does and how it is specified, before we reconsider the way it is integrated into the grammars of XQuery and XPath.
FTContainsExpr ::= RangeExpr "ftcontains" FTSelection FTIgnoreOption?
An expression of form FTContainsExpr returns a boolean value. It returns true if there is some node in RangeExpr that matches FTSelection. For the purpose of determining a match some nodes in RangeExpr may be ignored, as specified in FTIgnoreOption. The precise semantics of matching is described in Section 4 Semantics.
Expressions of the form FTSelection are composed of the following ingredients.
Words or combinations of words, that are the search strings to be found as matches
Match options, such as case sensitivity or indication to use stop words
Boolean operators, which allow an FTSelection to be composed from simpler FTSelections
Positional constraints, such as indication of match distance or window
The following example returns the author of each book whose title contains a word with the same root as "dog" and the word "cat".
for $b in /books/book where $b/title ftcontains ("dog" with stemming) && "cat" return $b/author
The same example in XPath 2.0:
/books/book[title ftcontains ("dog" with stemming) && "cat"]/author
The concrete and normative grammars in the specifications of XQuery and XPath are written such that they can be used directly for LL(k) parsing. Our introduction of a new production for ComparisonExpr above, however, would violate the LL(k) property of the grammar, since now there are multiple productions for ComparisonExpr that can start with the same sequence of tokens. Hence, the way we actually extend the grammar looks a little more complicated, but has the same effect.
[68] | ComparisonExpr |
::= | RangeExpr (((ValueComp |
Here we added FTContains to the possible continuations of RangeExpr that can give rise to a ComparisonExpr, where FTContains expands to the ftcontains keyword and the right-hand side of the FTContainsExpr. Note that we have folded the production of FTContainsExpr into ComparisonExpr, factoring out the left-hand sides of the operations, which in each case are of the form RangeExpr. Hence, this effectively adds the new kind of ComparisonExpr to XQuery or XPath, as described above.
For the full grammar, see A EBNF for XQuery 1.0 Grammar with Full-Text extensions.
The XQuery Language is extended by adding the function ft:score().
float ft:score(Expr)
The argument Expr of ft:score() is restricted to be a Boolean combination of FTContainsExprs; more precisely, a Boolean combination involving only "and" and "or".
ft:score() returns a value of type xs:float in the range [0, 1]. The value returned by ft:score() reflects the relevance of the match criteria in the FTSelections to the nodes in the respective RangeExprs. The way relevance is calculated is left implementation-dependent, but ft:score must follow these rules:
ft:score() must return values of type xs:float in the range [0, 1]
If evaluation of the Expr argument would yield false, then ft:score() must return 0
For score values greater than 0, a higher score must imply a higher degree of relevance [see issue scoring-properties]
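Note: Since the calculation of relevance is implementation-dependent, the following Python sketch shows only one hypothetical way of satisfying the rules above. The leaf scoring formula (fraction of matched tokens) and the use of minimum and maximum for "and" and "or" are assumptions of this sketch, not requirements of this specification.

```python
def leaf_score(matches, tokens):
    """Score of one FTContainsExpr: 0.0 if nothing matched, else in (0, 1].

    Hypothetical formula: the fraction of tokens in the node that matched.
    """
    if matches <= 0 or tokens <= 0:
        return 0.0
    return min(1.0, matches / tokens)

def score_and(*scores):
    """'and' is false as soon as one operand is false, so take the minimum."""
    return min(scores)

def score_or(*scores):
    """'or' is false only if every operand is false, so take the maximum."""
    return max(scores)

# Hypothetical counts: 3 matched tokens out of 40 in $b/content,
# and no matched tokens out of 12 in $b//chapter/title.
content = leaf_score(3, 40)
title = leaf_score(0, 12)
print(score_and(content, title))
print(score_or(content, title))
```

With these choices a combined score is 0 exactly when the Boolean combination would evaluate to false, and every score stays within the range [0, 1].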
The following example returns the relevance of $b/content ftcontains "web site" && "usability" and $b//chapter/title ftcontains "testing" to each book.
for $b in /books/book return ft:score($b/content ftcontains "web site" && "usability" and $b//chapter/title ftcontains "testing")
This section describes FTSelection, which defines the full-text selection expressions used in FTContainsExpr, and FTMatchOptions, which adjust the matching semantics of those full-text selection expressions.
The FTSelection production specifies all permitted kinds of full-text search conditions.
[163] FTSelection ::= FTOr (FTMatchOption | FTProximity)*
In the following, we define the syntax and semantics of the individual full-text selection operators and provide some examples based on the example document presented in section 3.1.1.
We will use the following XML document as an example throughout this section.
<book number="1">
  <title shortTitle="Improving Web Site Usability">Improving the Usability of a Web Site Through Expert Reviews and Usability Testing</title>
  <author>Millicent Marigold</author>
  <author>Montana Marigold</author>
  <editor>Véra Tudor-Medina</editor>
  <content>
    <p>The usability of a Web site is how well the site supports the users in achieving specified goals. A Web site should facilitate learning, and enable efficient and effective task completion, while propagating few errors. </p>
    <note>This book has been approved by the Web Site Users Association. </note>
  </content>
</book>
FTWords specifies the words and phrases that are being searched for in the searched text that is provided as the left-hand side argument of FTContainsExpr.
[169] FTWords ::= PrimaryExpr FTAnyallOption?
The right-hand side Expr of the above production must evaluate to a sequence of string values or nodes of type "xs:string". The result of the Expr is atomized into a sequence of strings, which is then tokenized into a sequence of phrases (see section 2.x.x for details). If the atomized sequence is not a subtype of xs:string*, a type error [err:XP0006] is raised.
If the "any" option is specified then a match occurs if and only if at least one phrase in the sequence has a match in the searched text.
If the "all" option is specified then a match occurs if and only if all of the phrases in the sequence of phrases are matched in the searched text.
If the "phrase" option is specified then the sequence of phrases is used to create a single phrase by concatenating the phrases and interleaving whitespace. A match occurs if and only if the resulting phrase is matched in the searched text.
If the "any word" option is specified then a match occurs if and only if at least one word in the sequence of phrases is matched in the searched text.
If the "all word" option is specified then a match occurs if and only if all words in the sequence of phrases are matched in the searched text.
If no option is specified then "any" is implied as default.
Note that if Expr results in a single string, the default and "any", "all" and "phrase" are equivalent.
The case where Expr results in the empty sequence, or the tokenization results in a zero-length phrase, is discussed in the issue zero-length-phrase.
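Note: The five options described above can be sketched in Python as follows. The tokenizer, the case-insensitive comparison, and the function names are assumptions of this sketch. The behaviour for an empty sequence or a zero-length phrase is an open issue, so whatever this sketch returns in that case carries no meaning.

```python
import re

def tokenize(text):
    """Naive tokenizer (an assumption): lower-cased runs of word characters."""
    return re.findall(r"\w+", text.lower())

def contains_phrase(words, phrase):
    """True if the list of words contains the phrase as a contiguous run."""
    n = len(phrase)
    return n > 0 and any(words[i:i + n] == phrase
                         for i in range(len(words) - n + 1))

def ft_words(text, strings, option="any"):
    """Evaluate FTWords over the searched text; "any" is the default."""
    words = tokenize(text)
    phrases = [tokenize(s) for s in strings]
    if option == "any":
        return any(contains_phrase(words, p) for p in phrases)
    if option == "all":
        return all(contains_phrase(words, p) for p in phrases)
    if option == "phrase":
        # Concatenate all phrases into a single phrase.
        return contains_phrase(words, [w for p in phrases for w in p])
    single = [[w] for p in phrases for w in p]
    if option == "any word":
        return any(contains_phrase(words, w) for w in single)
    if option == "all word":
        return all(contains_phrase(words, w) for w in single)
    raise ValueError(f"unknown option: {option}")

title = ("Improving the Usability of a Web Site Through "
         "Expert Reviews and Usability Testing")
print(ft_words(title, ["Expert", "Reviews"], "all"))
print(ft_words(title, ["Web Site Usability"], "any"))
print(ft_words(title, ["Web Site Usability"], "all word"))
```

The second call does not match because the three words are not adjacent in the title, while the third call matches because each word occurs somewhere in it.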
Note: The results assume a case-insensitive match in the following expressions.
/book[@number="1" and ./title ftcontains "Expert"]
returns true because the phrase "Expert" is contained in the title element of the book element.
/book[@number="1" and ./title ftcontains "Expert Reviews"]
returns true because the phrase "Expert Reviews" is contained in the title element of the book element.
/book[@number="1" and ./title ftcontains ("Expert", "Reviews") all]
returns true because the two phrases "Expert" and "Reviews" are both contained in the title element.
/book[@number="1"]//p ftcontains "Web Site Usability"
returns false because the p element in the book element doesn't contain the phrase "Web Site Usability" though it contains all of the words in the phrase.
for $book in /book[.//author ftcontains "Marigold"] let $score := ft:score($book/title ftcontains "Web Site Usability") where $score > 0.8 order by $score descending return $book/@number
returns the numbers of the most relevant book elements by Marigold with a title about "Web Site Usability", sorted by descending score.
[164] FTOr ::= FTAnd ( "||" FTAnd )*
FTOr finds matches that satisfy at least one of the input selection criteria.
Any match should satisfy at least one of the FTSelection criteria.
/book[.//author ftcontains "Millicent" || "Voltaire"]
returns book elements written by "Millicent" or "Voltaire". The book element is returned because it is written by "Millicent".
[165] FTAnd ::= FTUnaryNot ( "&&" FTUnaryNot )*
FTAnd finds matches that satisfy simultaneously two selection criteria.
Any match must satisfy all of the FTSelection criteria which are specified by one or more FTUnaryNot expressions.
/book[@number="1"]/title ftcontains ("usability" && "testing") case insensitive
returns true because it contains "usability" and "testing" if we ignore the letter case (see FTCaseOption for more details on case sensitivity).
/book[@number="1"]/author ftcontains "Millicent" && "Montana"
returns false because "Millicent" and "Montana" are not contained by the same author element of the book element.
[166] FTUnaryNot ::= ("!")? FTMildnot
FTUnaryNot finds matches in which the words and phrases being searched for do not occur in the searched text that is provided as the left-hand side argument of FTContainsExpr.
This is unary negation. Only one operand is required.
/book[. ftcontains "information" && "retrieval" && ! "information retrieval"]
returns book elements containing "information" and "retrieval" but not "information retrieval".
/book[. ftcontains "web site usability" && !"usability testing"]
returns book elements about "web site usability" but not "usability testing".
[167] FTMildnot ::= FTWordsSelection ( "mild" "not" FTWordsSelection )*
FTMildNegation is a milder form of "&& !". The selection 'a mild not b' matches an expression that contains a on its own, and not just as part of b. For example, to find articles that mention Mexico, one might search for '"Mexico" mild not "New Mexico"', which matches any Expr that contains "Mexico" on its own. An Expr that contains "New Mexico" is not excluded from the result, since it may mention "Mexico" on its own as well. An Expr that contains "Mexico" only as part of the phrase "New Mexico" will not match '"Mexico" mild not "New Mexico"'.
A match to FTMildNegation must contain at least one word occurrence that satisfies the first condition and does not satisfy the second condition. If it contains a word occurrence that satisfies both the first and the second condition, the occurrence is not considered as a result.
/book[@number="1" and . ftcontains "usability" mild not "usability testing"]
returns the book since "usability" appears in the title and the p elements of the book. However, the occurrence of "Usability Testing" in the title element is not considered.
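Note: The treatment of word occurrences by FTMildNegation can be sketched in Python as follows. An occurrence of the first operand is kept only if none of its word positions is also part of an occurrence of the second operand. The tokenizer and the function names are assumptions of this sketch; the formal definition is given in the semantics section.

```python
import re

def tokenize(text):
    """Naive tokenizer (an assumption): lower-cased runs of word characters."""
    return re.findall(r"\w+", text.lower())

def phrase_positions(words, phrase):
    """Return one set of word positions for each occurrence of the phrase."""
    query = tokenize(phrase)
    n = len(query)
    if n == 0:
        return []
    return [set(range(i, i + n))
            for i in range(len(words) - n + 1)
            if words[i:i + n] == query]

def mild_not(text, wanted, unwanted):
    """Occurrences of `wanted` that are not part of an occurrence of `unwanted`."""
    words = tokenize(text)
    excluded = set().union(*phrase_positions(words, unwanted))
    return [occ for occ in phrase_positions(words, wanted)
            if not (occ & excluded)]

print(mild_not("I drove from New Mexico to Mexico", "Mexico", "New Mexico"))
print(mild_not("Welcome to New Mexico", "Mexico", "New Mexico"))
```

The first text matches through its final "Mexico" only; the second text does not match, because its single occurrence of "Mexico" is part of "New Mexico".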
FTOrderedIndicator ::= "ordered"
FTOrder enforces that the order of word occurrences in the match is the same as their order in the query.
By default, there are no restrictions on the order in which the query words are matched in the document.
FTOrder imposes such an order. A match must satisfy the nested selection condition and the match must contain the words in the order specified in the query.
/book[. ftcontains ("web site" && "usability") ordered]/title
returns titles of book elements that contain "web site" and "usability" in the order in which they appear in the query, i.e., "web site" must precede "usability".
/book[@number="1"]/title ftcontains ("Montana" && "Millicent") ordered
returns false because although "Montana" and "Millicent" appear in the title element, they do not appear in the order specified in the query.
[185] FTScope ::= ("same" | "different") FTBigUnit
FTScope specifies a condition on the scope of the occurrences of the matched words.
FTScope specifies whether any matched word in FTSelection should be directly contained in the same ('same') or different ('different') scope.
Possible scopes are sentence (e.g., delimited by ".", "!", or "?"), and paragraph (e.g., delimited by blank lines and EOLN/CR characters). Sentences and paragraphs are defined in the introduction.
By default, there are no restrictions on the scope of the occurrences, i.e., they may occur in any sentence or paragraph. FTScope is used to restrict this scope.
If two words appear in the same sentence and in different sentences then both 'same sentence' and 'different sentence' return true. The same thing applies to the 'paragraph' scope.
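Note: The observation that 'same sentence' and 'different sentence' can both hold for the same pair of words can be illustrated with the following Python sketch. The sentence delimiters and the function names are assumptions of this sketch.

```python
import re
from itertools import product

def sentence_index(text):
    """Map each lower-cased word to the set of sentence numbers it occurs in."""
    index = {}
    # Assumption: sentences end with ".", "!", or "?".
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    for number, sentence in enumerate(sentences, start=1):
        for word in re.findall(r"\w+", sentence.lower()):
            index.setdefault(word, set()).add(number)
    return index

def same_sentence(text, a, b):
    """True if some occurrences of a and b share a sentence."""
    idx = sentence_index(text)
    return bool(idx.get(a, set()) & idx.get(b, set()))

def different_sentence(text, a, b):
    """True if some occurrences of a and b lie in different sentences."""
    idx = sentence_index(text)
    return any(x != y
               for x, y in product(idx.get(a, set()), idx.get(b, set())))

text = "The site has few errors. The site is usable."
print(same_sentence(text, "site", "errors"))
print(different_sentence(text, "site", "errors"))
```

Both calls succeed for "site" and "errors": the first occurrence of "site" shares a sentence with "errors", while the second occurrence lies in a different sentence.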
/book[@number="1" and . ftcontains "usability" && "Marigold" same sentence]
will not return the book element because the words "usability" and "Marigold" are not contained by the same sentence.
/book[@number="1" and . ftcontains "usability" && "Marigold" different sentence]
will return the book element because the words "usability" and "Marigold" are contained by different sentences.
/book[. ftcontains "usability" && "testing" same paragraph]
returns book elements mentioning "usability" and "testing" in the same paragraph.
/book[. ftcontains "site" && "errors" same sentence]
returns the book element because "site" and "errors" appear in the same sentence. Note that the book is returned even though there is another occurrence of "site", namely the one in the title element, which does not appear in the same sentence as the occurrence of "errors".
Some subtle relationships between FTScope and FTDistance will be discussed in the semantics section.
[182] FTDistance ::= "with"? "distance" FTRange FTUnit
[181] | FTRange |
::= | ("exactly" UnionExpr) |
FTDistance limits the distance in number of words, sentences, or paragraphs between consecutive occurrences of the words in FTSelection. These correspond to "word distance", "sentence distance", and "paragraph distance" forms of FTDistance.
FTRange specifies a range of integers.
The XQuery expression(s) Expr must evaluate (with atomization) to a singleton sequence containing an integer. Otherwise, the expression containing the clause must raise an error.
Let the first XQuery expression Expr evaluate to M and, in the last form of FTRange, let the second XQuery expression evaluate to N.
FTDistance may cross element boundaries when computing distance:
Zero words means adjacent.
Zero sentences means the same sentence.
Zero paragraphs means the same paragraph.
The format with "exactly" limits the range to a single integer: [M, M]. "at least" specifies the range [M, ∞). The "at most" variant specifies the range [0, M]. The last variant specifies a range of allowable values: the closed interval [M, N].
'exactly 0' specifies the range [0, 0].
'at least 1' specifies the range [1, ∞).
'at most 1' specifies the range [0, 1].
'from 5 to 10' specifies the range [5, 10].
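Note: The four forms of FTRange can be sketched in Python as a mapping from each form to a pair of bounds. The function names are assumptions of this sketch, and the "at least" form is assumed to have no upper bound.

```python
import math

def ft_range(kind, m, n=None):
    """Return the (lower, upper) bounds denoted by an FTRange."""
    if kind == "exactly":
        return (m, m)
    if kind == "at least":
        return (m, math.inf)  # assumption: unbounded above
    if kind == "at most":
        return (0, m)
    if kind == "from":
        if n is None:
            raise ValueError("'from M to N' requires both M and N")
        return (m, n)
    raise ValueError(f"unknown FTRange form: {kind}")

def in_range(value, bounds):
    """True if value lies within the closed range given by bounds."""
    low, high = bounds
    return low <= value <= high

print(ft_range("exactly", 0))
print(ft_range("at least", 1))
print(ft_range("at most", 1))
print(ft_range("from", 5, 10))
```

A distance, window size, or occurrence count can then be tested against the resulting bounds with in_range.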
Stop words are counted against the word distance.
/book[. ftcontains ("information" && "retrieval") mild not ("information" && "retrieval" with distance at least 11 words)]
returns book elements containing "information" and "retrieval" and discards those occurrences of the words that are more than 10 words apart.
/book[. ftcontains "web" && "site" && "usability" with distance at most 2 words]/title
returns the titles of book elements mentioning "web", "site", and "usability" with at most 2 intervening words between consecutive occurrences of the words.
/book[@number="1" and . ftcontains "web site" && "usability" with distance at most 1 words]/title
returns the title element; the p element will not be returned when stop words are not ignored because its occurrences of "web site" and "usability" are at a word distance of 2.
/book[@number="1" and . ftcontains ("web site" && "completion" && ! "learning") with distance exactly 15 words]/title
returns the title element because the word "learning" does not appear within 15 words of the words "web site" and "completion".
/book[@number="1" and . ftcontains "web site" && "completion" with distance exactly 15 words same paragraph]/title
returns the title element if the words "web site" and "completion" appear within 15 words of each other and in the same paragraph.
[183] FTWindow ::= "within"? "window" FTRange
FTWindow allows control over the distance between the leftmost word occurrence (the one with the smallest position) and the rightmost one.
FTWindow may cross element boundaries when computing distances.
FTRange specifies a range of integers.
The number of words spanned by the occurrences of the nested selection condition, counted from the smallest word position to the largest word position (inclusive on both sides), must be within the specified range. As with FTDistance, stop words are counted.
Zero words means adjacent.
Zero sentences means the same sentence.
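Note: The window computation described above can be sketched in Python as follows. For every combination of one occurrence per query word, the window size is the largest position minus the smallest position plus one. The tokenizer, the restriction to single-word terms, and the function names are assumptions of this sketch.

```python
import re
from itertools import product

def positions(words, term):
    """Word positions (starting at 1) at which the term occurs."""
    return [i for i, w in enumerate(words, start=1) if w == term]

def window_matches(text, terms, low, high):
    """Combinations of occurrences whose window size lies in [low, high]."""
    words = re.findall(r"\w+", text.lower())
    hits = []
    for combo in product(*(positions(words, t) for t in terms)):
        # Inclusive on both sides, so adjacent words span a window of 2.
        size = max(combo) - min(combo) + 1
        if low <= size <= high:
            hits.append(combo)
    return hits

terms = ["web", "site", "usability"]
short_title = "Improving Web Site Usability"
sentence = ("The usability of a Web site is how well "
            "the site supports the users")
print(window_matches(short_title, terms, 0, 3))
print(window_matches(sentence, terms, 0, 3))
```

The first text contains the three words within a window of 3, whereas the smallest window in the second text spans 5 words.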
/book[./title ftcontains "web" && "site" && "usability" window at most 5]/@number
returns the numbers of book elements containing "web", "site", and "usability" in their title within a window of 5.
/book[. ftcontains ("web" && "site" ordered) && ("usability" || "testing") window at most 10]
returns book elements that contain "web" and "site" in this order plus either "usability" or "testing" and all the matched words occur within a window of at most 10.
/book[@number="1" and . ftcontains "web site" && "usability" window at most 3]
returns the title element because it contains "Web Site Usability"; the p element will not be returned because its occurrences of "web site" and "usability" are not within a window of 3.
[184] FTTimes ::= "occurs" FTRange
FTTimes controls the number of times a specified FTSelection must be matched.
FTTimes limits the number of different occurrences of FTSelection, which must be within the specified range.
An occurrence of the criterion is a distinct set of word occurrences that satisfies it.
The FTSelection '("very big")' has one occurrence in the text fragment "very very big": it consists of the second "very" and "big".
The FTSelection '"very" && "big"' has two occurrences in the text fragment "very very big": one consisting of the first "very" and "big", and the other containing the second "very" and "big".
The FTSelection '"very" || "big"' has 3 occurrences in the text fragment "very very big": one for each of the three word occurrences.
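Note: The occurrence counts given above for the text fragment "very very big" can be reproduced with the following Python sketch, in which an occurrence is modelled as a set of word positions. The tokenizer and the function names are assumptions of this sketch; the formal model is given in the semantics section.

```python
import re
from itertools import product

def tokenize(text):
    """Naive tokenizer (an assumption): lower-cased runs of word characters."""
    return re.findall(r"\w+", text.lower())

def phrase_occurrences(words, phrase):
    """Each occurrence of a phrase is the set of word positions it covers."""
    query = tokenize(phrase)
    n = len(query)
    if n == 0:
        return set()
    return {frozenset(range(i, i + n))
            for i in range(len(words) - n + 1)
            if words[i:i + n] == query}

def and_occurrences(left, right):
    """Pair every occurrence of the left operand with every one of the right."""
    return {a | b for a, b in product(left, right)}

def or_occurrences(left, right):
    """An occurrence of either operand is an occurrence of the disjunction."""
    return left | right

words = tokenize("very very big")
very = phrase_occurrences(words, "very")
big = phrase_occurrences(words, "big")
print(len(phrase_occurrences(words, "very big")))
print(len(and_occurrences(very, big)))
print(len(or_occurrences(very, big)))
```

Because occurrences are stored as sets, two combinations that cover the same word positions count only once, which corresponds to the requirement that occurrences be distinct.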
/book[. ftcontains "usability" occurs at least 2]/@number
returns the numbers of the book elements that contain 2 or more occurrences of "usability".
/book[@number="1" and title ftcontains "usability" || "testing" occurs at most 3]
returns false because "usability" has 3 occurrences and "testing" has 1 occurrence; therefore, there are 4 occurrences of "usability" || "testing".
/book[@number="1" and . ftcontains "usability" occurs at least 2]
returns the book element because its title element contains 3 occurrences of "usability" although its p element contains only one occurrence.
FTMatchOptions modify the operational semantics of the FTSelection they are applied to.
FTMatchOptions productions set an environment for the matching options of FTSelection.
[171] | FTMatchOption |
::= | FTCaseOption |
FTMatchOption operates with the following defaults:
FTCaseOption is "case insensitive".
FTDiacriticsOption is "diacritics insensitive".
FTSpecialCharOption is "without special characters".
FTStemOption is "without stemming".
FTThesaurusOption is "without thesaurus".
FTStopWordOption is "without stopwords".
For FTLanguageOption, no language is selected.
For FTIgnoreOption, no element content or tags are ignored.
FTRegexOption is "without regex".
As a result, the query:
/book/title ftcontains "usability"
is equivalent to the query
/book/title ftcontains "usability" case insensitive diacritics insensitive without special characters without stemming without thesaurus without regex
FTMatchOptions are applied in the order in which they are given in the query. More information on their semantics is given in 4.3.3 Match Options Semantics.
We illustrate each match option in more detail in the following sections.
[172] | FTCaseOption |
::= | "lowercase" |
FTCaseOption controls the way words are matched with regards to the letter case.
FTCaseOption influences the way FTWords is applied.
"lowercase" ("uppercase") specify that only words in lower-case (upper-case) letters can be matched exactly; "case insensitive" specifies that matching word occurrences can have both small and capital letters; their case is ignored; "case sensitive" specifies that the case of the letters in the result must match the case of the letters in the word from the query.
The default is "case insensitive".
/book[@number="1"]/title ftcontains "Usability" lowercase
returns false because the title element doesn't contain "usability" (in lower case).
/book[@number="1"]/title ftcontains "usability" case insensitive
returns true because the case of the letters is not considered.
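Note: One possible reading of the four case options is sketched in Python below. The reading of "lowercase" and "uppercase" (the query word is converted to that case and then matched exactly) is an assumption based on the first example above, as are the function and parameter names.

```python
def case_match(query, candidate, option="case insensitive"):
    """Compare a query word with a word from the searched text."""
    if option == "case insensitive":
        return query.casefold() == candidate.casefold()
    if option == "case sensitive":
        return query == candidate
    if option == "lowercase":
        # Assumption: only the lower-case form of the query word matches.
        return candidate == query.lower()
    if option == "uppercase":
        # Assumption: only the upper-case form of the query word matches.
        return candidate == query.upper()
    raise ValueError(f"unknown case option: {option}")

print(case_match("Usability", "Usability", "lowercase"))
print(case_match("usability", "Usability", "case insensitive"))
```

Under this reading, the query word "Usability" with the "lowercase" option does not match the word "Usability" in the title, because only the form "usability" is accepted.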
[173] | FTDiacriticsOption |
::= | "with" "diacritics" |
FTDiacriticsOption controls the way words are matched with regards to the use of diacritic symbols.
FTDiacriticsOption influences the way FTWords is applied.
"with" ("without") "diacritics" specifies that only words that contain (do not contain) diacritics can be matched exactly; "diacritics insensitive" specifies that there are no restrictions on the matching word occurrences with regards to diacritic symbols: letters containing diacritics can be matched with their non-diacritics counterparts and vice versa; "diacritics sensitive" specifies that the diacritic symbols must match the symbols in the word from the query.
The default is "diacritics insensitive".
/book[@number="1"]//editor ftcontains "Vera" with diacritics
returns the editor element.
/book[@number="1"]/editors ftcontains "Véra" without diacritics
returns false.
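Note: The "diacritics insensitive" and "diacritics sensitive" comparisons can be approximated in Python with Unicode normalization, as in the following sketch. Removing combining marks after canonical decomposition is an assumption of this sketch; this specification does not prescribe how diacritics are to be recognized. The function names are likewise hypothetical.

```python
import unicodedata

def strip_diacritics(word):
    """Remove combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def diacritics_match(query, candidate, sensitive=False):
    """Compare two words, optionally ignoring diacritic symbols."""
    if sensitive:
        # Normalize both sides so that equivalent encodings compare equal.
        return (unicodedata.normalize("NFC", query)
                == unicodedata.normalize("NFC", candidate))
    return strip_diacritics(query) == strip_diacritics(candidate)

print(diacritics_match("Vera", "V\u00e9ra"))                  # insensitive
print(diacritics_match("Vera", "V\u00e9ra", sensitive=True))  # sensitive
```

With the default "diacritics insensitive" behaviour, the query word "Vera" matches the editor's name "Véra"; with "diacritics sensitive" it does not.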
[174] FTSpecialcharOption ::= "with" "special" "characters" | "without" "special" "characters"
FTSpecialCharOption specifies whether special characters such as punctuation should or should not be ignored.
FTSpecialCharOption influences the way FTWords is applied.
The option "with special characters" specifies that special characters such as punctuation must also be matched. The option "without special characters" specifies that special characters such as punctuation need not be matched.
The default is "without special characters".
/book[@number="1"]//editor ftcontains "Tudor Medina" with special characters
returns true.
/book[@number="1"]/editors ftcontains "Tudor-Medina" without special characters
returns false.
[175] | FTStemOption |
::= | "with" "stemming" | "without" "stemming" |
FTStemOption controls the use of stemming during string matching.
FTStemOption influences the way FTWords is applied. It produces a disjunction of the query words by expanding the words into the list of words that share the same stem. By definition, the query words are included in that disjunction.
When the "with stemming" option is present, string matches may also contain words that have the same stem as the query string. It is implementation-defined what a stem of a word is.
The clause "without stemming" turns off the use of stemming when words are matched.
It is implementation-defined whether the stemming will be based on an algorithm, a dictionary, or a mixed approach.
The default is "without stemming".
/book[@number="1"]/title ftcontains "improve" with stemming
returns true because the title contains "improving", which has the same stem as "improve".
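The expansion of a query word into a disjunction of words sharing its stem can be sketched as follows. The suffix-stripping stemmer here is a deliberately crude stand-in, since the document leaves the notion of a stem implementation-defined; `toy_stem`, `expand_with_stemming`, and the vocabulary are all hypothetical.

```python
SUFFIXES = ("ing", "ed", "es", "s", "e")


def toy_stem(word: str) -> str:
    """Toy stemmer: strip the first matching suffix, keeping at least 3 letters."""
    w = word.lower()
    for suffix in SUFFIXES:
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[:-len(suffix)]
    return w


def expand_with_stemming(query_word: str, vocabulary) -> set:
    """Return the set of words that may stand in for query_word.

    The query word itself is always part of the disjunction.
    """
    stem = toy_stem(query_word)
    expansion = {query_word}
    expansion.update(w for w in vocabulary if toy_stem(w) == stem)
    return expansion


# Hypothetical vocabulary of words occurring in the searched title.
vocabulary = ["improving", "web", "site", "usability"]
alternatives = expand_with_stemming("improve", vocabulary)
```

A real implementation would more likely rely on a language-specific algorithm or dictionary; only the shape of the expansion is meant to carry over.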
[176] | FTThesaurusOption |
::= | ("with" "thesaurus" UnionExpr) | "without" "thesaurus" |
FTThesaurusOption controls the use of thesauri during string matching.
The operand of each FTThesaurusOption is converted as though it were an argument of a function with the expected parameter type xs:string. A type error [err:XP0006] is raised if any operand cannot be converted to a string.
Note: The above rule implies atomization of the Expr values followed by an implementation-defined tokenization.
FTThesaurusOption influences the way FTWords is applied.
The Expr must result in a sequence of strings (string atoms or nodes of type "xs:string") that represent valid thesauri names. Otherwise, an error is returned. What is a valid thesaurus name is implementation-dependent and can be either a name of system-provided or user-specified thesaurus. If Expr evaluates to an empty sequence, the construct is equivalent to "without thesaurus".
When the "with thesaurus" match option is specified, string matches also include words that can be found in the specified thesauri and that correspond to the query string.
The statement "without thesaurus" instructs the query engine not to use thesauri when matching words.
The default is "without thesaurus".
It is implementation-defined how a thesaurus is represented. Possible representations include files in a predefined format and modules using a common interface.
[177] | FTStopwordOption |
::= | ("with" "stop" "words" | "without" "stop" "words") UnionExpr |
FTStopWordOption controls the use of stop words (frequent functional words such as "a", "an", "the" that are ignored) during string matching.
FTStopWordOption influences the way FTWords is applied.
Expr must evaluate to a sequence of string atoms or nodes of type "xs:string". No tokenization is performed on the strings: they are used as they occur in the sequence.
When the "with stop words" option is used, if a word is within a collection of stop words, it should be ignored. If Expr is not specified, an implementation-defined system collection of stop words is used. If Expr is present and "additional" is not specified, the strings in its result sequence are used as the new stop-word collection. If "additional" is specified, the strings from the result sequence are appended to the current stop-word collection. It is a syntax error to use "additional" without specifying an Expr.
"without stop words" turns off the use of the words in the expression result as stop words or clears the whole stop-word collection if no expression is specified.
The default is "without stop words".
/book[@number="1"]//p ftcontains "usability web site" with stop words ("a", "the", "of")
returns true.
/book[@number="1"]//title ftcontains "usability web site" without stop words
returns false.
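The effect of a stop-word collection on phrase matching can be sketched in Python as below. This is a simplified reading in which stop words are dropped from both the document words and the query words before a contiguous phrase comparison. The function name, the case-insensitive comparison, and the sample paragraph text are assumptions made for illustration.

```python
def phrase_matches(doc_words, query_words, stop_words=frozenset()) -> bool:
    """Hypothetical helper: phrase match that ignores the given stop words."""
    stops = {w.lower() for w in stop_words}
    doc = [w.lower() for w in doc_words if w.lower() not in stops]
    query = [w.lower() for w in query_words if w.lower() not in stops]
    if not query:
        # Assumption: a query made only of stop words matches nothing.
        return False
    n = len(query)
    # Look for the query as a contiguous run in the filtered document words.
    return any(doc[i:i + n] == query for i in range(len(doc) - n + 1))


# Hypothetical paragraph content.
paragraph = "Improving the usability of a web site".split()
query = "usability web site".split()

with_stops = phrase_matches(paragraph, query, {"a", "the", "of"})
without_stops = phrase_matches(paragraph, query)
```

With the stop words ignored, the phrase is found; without them, the intervening "of a" prevents the match.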
[178] | FTLanguageOption |
::= | "language" UnionExpr |
FTLanguageOption controls the language of the matched words.
FTLanguageOption influences the way FTWords is applied.
The operand of FTLanguageOption is converted as though it were an argument of a function with the expected parameter type xs:string. A type error [err:XP0006] is raised if any operand cannot be converted to a string.
Note: The above rule implies atomization of the Expr values followed by an implementation-defined tokenization.
Language can have implications for various aspects of string matching. These include how the tokenization into words is performed, how symbols are transformed into lower/upper case, what the valid diacritic symbols are, what the possible special characters are, how stemming is performed, and which words can be considered stop words. In particular, the language option may determine the default thesaurus and stop-word sets.
Expr is an XQuery expression that must evaluate to a string atom, a node with typed value of type "xs:string", or an empty sequence.
If Expr evaluates to "none", "", or an empty sequence, this means that there is no language selected; otherwise, it should be a valid identifier of a language.
By default, there is no language selected.
/book[@number="1"]//editor ftcontains "tudor" with diacritics language "Romanian"
returns true.
[188] | FTIgnoreOption |
::= | "without" "content" UnionExpr |
FTIgnoreOption specifies a set of element nodes whose content should be ignored. The set of nodes is identified by the XQuery expression Expr that should evaluate to a sequence of element nodes.
If "without content" is specified, all the words directly contained by the elements are ignored. For example, "Web <b>Site</b> Usability" can be matched by "Web Usability" if the option is "without content .//b". If the XQuery sub-expression evaluates to an empty sequence no words from element content are ignored.
By default element content is not ignored.
/book[@number="1"] ftcontains "Testing" without content .//title
returns false because "Testing" does not occur outside the title element, whose content is ignored.
[179] | FTRegexOption |
::= | "with" "regex" | "without" "regex" |
FTRegexOption controls the use of regular expressions in words.
FTRegexOption influences the way words in FTWords are interpreted.
When the "with regex" option is present, the words are interpreted as grep-style regular expressions.
The clause "without regex" turns off the use of regular expressions. Any special characters used in regular expressions are uninterpreted and matched directly or ignored (depending on FTSpecialCharOption).
The default is "without regex".
/book[@number="1"]/title ftcontains "improv*" with regex
returns true because the title contains "improving".
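The regular-expression option can be illustrated with Python's `re` module. Note that this is a substitution: Python's regular-expression dialect is not identical to grep's, although the two agree on a simple pattern such as "improv*". The function name, the unanchored search, the case-insensitive flag, and the sample title words are assumptions.

```python
import re


def regex_matches(doc_words, pattern: str) -> bool:
    """Hypothetical helper: does any document word match the pattern?

    The search is unanchored, as with grep, and ignores case to mirror
    the default "case insensitive" option.
    """
    compiled = re.compile(pattern, re.IGNORECASE)
    return any(compiled.search(word) for word in doc_words)


# Hypothetical title words.
title = ["Improving", "Web", "Site", "Usability"]
found = regex_matches(title, "improv*")
```

In both dialects "improv*" means "impro" followed by zero or more "v" characters, so it matches inside "Improving".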
This section describes the formal semantics of XQuery 1.0 and XPath 2.0 Full-Text. The figure below shows how XQuery 1.0 and XPath 2.0 Full-Text integrates with XQuery and XPath.
The arrow (1) represents the composability of the XQuery and XPath expressions. It is described in the XQuery language specification. Regular XQuery expressions can be nested inside FTSelections (arrow (2)) by evaluating them to a sequence of items and then converting them to tokenized text, depending on the role in which they are used in an XQuery 1.0 and XPath 2.0 Full-Text expression. The process is described in Nested XQuery and XPath Expressions. As with arrow (1), FTSelections are fully composable (arrow (3)). The composability is achieved by evaluating FTSelections to AllMatches. Each FTSelection operates on zero or more AllMatches and returns AllMatches. The process is described in the Evaluation of FTSelections section. Finally, the result of the evaluation of XQuery 1.0 and XPath 2.0 Full-Text and scoring expressions needs to be integrated into the XPath and XQuery model (arrow (4)). The section XQuery 1.0 and XPath 2.0 Full-Text and scoring expressions describes how this is achieved.
The following section discusses the nesting of XQuery and XPath expressions inside FTContainsExpr.
The general rule is that the nested XQuery and XPath expressions are evaluated to a sequence of items before the evaluation of FTContainsExpr. The sequence of items must satisfy certain constraints depending on the context in which it is used. These constraints are described below.
Full-text queries are performed on text which has been tokenized, i.e., broken into a sequence of words, units of punctuation, and spaces. The tokenization is applied on the string value of the evaluation of the left-hand side of the FTContainsExpr expression.
The XQuery expression nested inside an FTWords must evaluate to a sequence of string values after applying atomization (otherwise the entire FTSelection returns an error). Then, FTWords performs a tokenization on the string values from the sequence.
The XQuery expression (or expressions, in the case of a "from-to" range) must evaluate to a singleton sequence of integers after applying atomization (otherwise the entire FTSelection returns an error). The resulting integer values are treated as boundaries for the corresponding range.
The XQuery sub-expression must evaluate to either an empty sequence or a singleton sequence of a string value after applying atomization (otherwise the entire FTSelection returns an error). The resulting string value is treated as a language identifier specifying the language of the matched document or documents.
[Definition: Tokenization] is the process of converting a string to a sequence of TokenInfos.
A [Definition: TokenInfo] is the identity of a word occurrence inside an XML document. Each TokenInfo is associated with:
the word it identifies: word
a unique identifier that captures the relative position of the word in the document order: pos
the relative position of the sentence containing the word: sentence
the relative position of the paragraph containing the word: para
The tokenization is performed by the formal semantics functions:
function fts:getTokenInfo( $searchContext as node(), $matchOptions as fts:FTMatchOptions, $searchToken as fts:TokenInfo) as fts:TokenInfo*
The above function returns all the TokenInfos in nodes in $searchContext that match the search string in $searchToken when using the match options in $matchOptions. The match options that occur at the beginning of the list should be applied before match options that occur later in the list.
function fts:getSearchTokenInfo( $searchString as xs:string, $matchOptions as fts:FTMatchOptions) as fts:TokenInfo*
The above function tokenizes the search string $searchString and returns a sequence of TokenInfos that describe the sequence of tokens in the search string.
A compliant implementation should provide implementations of the above functions.
As an illustration, consider the following XML fragment:
<offers>
  <offer id="1000" price="10000">
    Ford Mustang 2000, 65K, excellent condition, runs great, AC, CC, power all
  </offer>
  <offer id="1001" price="8000">
    Honda Accord 1999, 78K, A/C, cruise control, runs and looks great, excellent condition
  </offer>
  <offer id="1005" price="5500">
    Ford Mustang, 1995, 150K highway mileage, no rust, excellent condition
  </offer>
</offers>
If we assume that words are delimited by punctuation and whitespace symbols (as in English), the first word "Ford" from the first element content will be assigned a TokenInfo with a relative position of 1, the word "Mustang" will be assigned a TokenInfo with a relative position of 2, the word "2000" will be assigned a TokenInfo with a relative position of 3, and so on. The relative positions of the TokenInfos are shown below in parentheses.
<offers>
  <offer id="1000" price="10000">
    Ford(1) Mustang(2) 2000(3), 65K(4), excellent(5) condition(6), runs(7) great(8), AC(9), CC(10), power(11) all(12)
  </offer>
  <offer id="1001" price="8000">
    Honda(13) Accord(14) 1999(15), 78K(16), A(17)/C(18), cruise(19) control(20), runs(21) and(22) looks(23) great(24), excellent(25) condition(26)
  </offer>
  <offer id="1005" price="5500">
    Ford(27) Mustang(28), 1995(29), 150K(30) highway(31) mileage(32), no(33) rust(34), excellent(35) condition(36)
  </offer>
</offers>
The relative positions of paragraphs are determined similarly. Assuming that the paragraph delimiters are start tag, end tag, and end of line characters, the words in the first element's content will be assigned a paragraph relative number 1, the words from the following element content will be assigned a relative number 2, and so on.
The relative positions of sentences are also determined similarly using sentence delimiters such as ".", "!", and "?".
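The numbering described above can be reproduced with a small Python sketch. It is only one possible tokenizer, since tokenization is implementation-defined: the `TokenInfo` dataclass mirrors the four properties listed earlier, while `tokenize`, the word pattern, and the choice to start a new sentence at every paragraph boundary are assumptions. Each offer element's text content is passed in as one paragraph.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenInfo:
    word: str
    pos: int       # relative position of the word in document order
    sentence: int  # relative position of the containing sentence
    para: int      # relative position of the containing paragraph


def tokenize(paragraphs):
    """Assign word, sentence, and paragraph positions (all starting at 1)."""
    tokens = []
    pos = 0
    sentence = 0
    for para, text in enumerate(paragraphs, start=1):
        # Sentences end at ".", "!" or "?"; empty chunks are skipped.
        for chunk in re.split(r"[.!?]", text):
            words = re.findall(r"[A-Za-z0-9]+", chunk)
            if not words:
                continue
            sentence += 1
            for word in words:
                pos += 1
                tokens.append(TokenInfo(word, pos, sentence, para))
    return tokens


OFFERS = [
    "Ford Mustang 2000, 65K, excellent condition, runs great, AC, CC, power all",
    "Honda Accord 1999, 78K, A/C, cruise control, runs and looks great, "
    "excellent condition",
    "Ford Mustang, 1995, 150K highway mileage, no rust, excellent condition",
]

tokens = tokenize(OFFERS)
mustang_positions = [t.pos for t in tokens if t.word == "Mustang"]
```

Because "/" is treated as a delimiter, "A/C" yields the two tokens "A" and "C", as in the annotated fragment.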
The XQuery/XPath data model of a "sequence of nodes" is inadequate for fully composable FTSelections. The main reason is that full-text operations (such as FTSelections) operate on linguistic units, such as positions of words, and such information is not captured in the XQuery/XPath data model. We thus define AllMatches that allows for fully compositional FTSelections.
An [Definition: AllMatches] object describes all the possible results of an FTSelection. The UML Static Class diagram of AllMatches is shown in the diagram.
The AllMatches object contains zero or more Matches. Each Match describes one result to the FTSelection. The result is described in terms of zero or more StringIncludes and zero or more StringExcludes, which describe the TokenInfos that must be contained and respectively, those that must not be contained. Both StringInclude and StringExclude are of type StringMatch, which describes a possible match of a query search token with a document word. The queryString attribute of StringMatch contains the query search token that has been matched. The queryPos attribute specifies the position of this search token in the query (this attribute is needed for FTOrders). The TokenInfo associated with the StringMatch describes the word in the document that matches the query search token.
Intuitively, AllMatches specifies the TokenInfos that a node should contain, and the TokenInfos that a node should not contain, in order to satisfy an FTSelection.
The AllMatches structure resembles the Disjunctive Normal Form (DNF) in propositional and first-order logic. The AllMatches is a disjunction of Matches. Each Match is a conjunction of positive "atoms", the StringIncludes, and negative "atoms", the StringExcludes.
Consider the FTWords "Mustang" evaluated over the sample document fragment in the previous section. The AllMatches corresponding to this FTWords is shown in the figure below.
As shown, the AllMatches consists of two Matches. Each Match represents one possible result of the FTWords "Mustang". The result represented by the first Match contains (represented as StringInclude) the word "Mustang" at position 2. The result described by the second Match contains the word "Mustang" at position 28.
Let us now consider a more complex example. Consider the FTWords "Ford Mustang" evaluated over the XML fragment used above. The AllMatches for this FTWords is shown in the figure below.
There are two possible results of this FTWords, and these are represented by the two Matches. Each of the Matches requires two words to be matched. The result corresponding to the first Match is obtained by matching "Ford" at position 1 and matching "Mustang" at position 2. Similarly, the result described by the second Match is obtained by matching "Ford" at position 27 and "Mustang" at position 28.
Let us now consider a more sophisticated example of an AllMatches. Consider the FTSelection "Mustang" && ! "rust" that searches for nodes that contain "Mustang" but not "rust". The AllMatches for this FTSelection is shown in the figure below.
Observe the use of StringExclude. This is the component that corresponds to negation. It specifies that the result described by the corresponding Match should not match the word at the specified position. For instance, the first Match specifies the solution that "Mustang" should be matched at position 2, and "rust" should not be matched at position 34.
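The disjunctive-normal-form reading of AllMatches can be made concrete with a short Python sketch that rebuilds the "Mustang" && ! "rust" example. The class and function names are hypothetical, and `negate` and `conjoin` are the textbook DNF operations, used here only as simplified stand-ins for the apply functions that the specification defines for negation and conjunction.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class StringMatch:
    query_string: str
    query_pos: int
    pos: int  # position of the matched TokenInfo in the document


@dataclass(frozen=True)
class Match:
    includes: tuple = ()  # StringIncludes: must be matched
    excludes: tuple = ()  # StringExcludes: must not be matched


def words(query_string, query_pos, positions):
    """One Match per occurrence of the search token."""
    return [Match(includes=(StringMatch(query_string, query_pos, p),))
            for p in positions]


def conjoin(left, right):
    """AND of two DNFs: pair every left Match with every right Match."""
    return [Match(l.includes + r.includes, l.excludes + r.excludes)
            for l in left for r in right]


def negate(all_matches):
    """NOT of a DNF: pick one literal from each Match and flip it."""
    literal_sets = [
        [(sm, True) for sm in m.includes] + [(sm, False) for sm in m.excludes]
        for m in all_matches
    ]
    result = []
    for choice in product(*literal_sets):
        new_includes = tuple(sm for sm, was_include in choice if not was_include)
        new_excludes = tuple(sm for sm, was_include in choice if was_include)
        result.append(Match(new_includes, new_excludes))
    return result


# Positions taken from the annotated sample fragment.
mustang = words("Mustang", 1, [2, 28])
rust = words("rust", 2, [34])
all_matches = conjoin(mustang, negate(rust))
```

The result has two Matches, each including "Mustang" (at position 2 or 28) and excluding "rust" at position 34, which is the structure described above.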
AllMatches has a well-defined hierarchical structure. Therefore, the AllMatches can be easily modeled in XML. In subsequent sections, we will use this XML representation to formally describe the semantics of FTSelections. In particular, we will use the XML representation of AllMatches to formally specify how an FTSelection operates on zero or more AllMatches to produce a resulting AllMatches. We will also use the XML representation to specify the formal semantics of the FTContainsExpr and FTScoreExpr.
The XML schema for representing AllMatches is given below:
<xs:schema targetNamespace="http://www.w3.org/2004/07/xquery-full-text"
           xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:fts="http://www.w3.org/2004/07/xquery-full-text"
           elementFormDefault="qualified"
           attributeFormDefault="unqualified">
  <xs:complexType name="AllMatches">
    <xs:sequence>
      <xs:element name="match" type="fts:Match" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:complexType>
  <xs:complexType name="Match">
    <xs:sequence>
      <xs:element name="stringInclude" type="fts:StringMatch" minOccurs="0" maxOccurs="unbounded"/>
      <xs:element name="stringExclude" type="fts:StringMatch" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:complexType>
  <xs:complexType name="StringMatch">
    <xs:sequence>
      <xs:element name="tokenInfo" type="fts:TokenInfo"/>
    </xs:sequence>
    <xs:attribute name="queryString" type="xs:string" use="required"/>
    <xs:attribute name="queryPos" type="xs:integer" use="required"/>
  </xs:complexType>
  <xs:complexType name="TokenInfo">
    <xs:attribute name="word" type="xs:string" use="required"/>
    <xs:attribute name="pos" type="xs:integer" use="required"/>
    <xs:attribute name="para" type="xs:integer" use="required"/>
    <xs:attribute name="sentence" type="xs:integer" use="required"/>
  </xs:complexType>
</xs:schema>
In this section, we define the semantics of FTSelections. FTSelections are fully composable, and can be arbitrarily nested under other FTSelections. Also, each FTSelection can be associated with match options (such as stemming, stop words, etc.) and score weights. Since score weights are solely interpreted by the formal semantics scoring function, score weights do not influence the semantics of FTSelections in any way. We will thus not consider score weights when defining the formal semantics.
Here, we define the XML representation of the FTSelections as used in the fts:evaluate function. The representation for FTSelection and FTSelectionWithScoreWeights is the same; the former does not use the weight element.
<xs:schema targetNamespace="http://www.w3.org/2004/07/xquery-full-text"
           xmlns:fts="http://www.w3.org/2004/07/xquery-full-text"
           xmlns:xs="http://www.w3.org/2001/XMLSchema"
           elementFormDefault="qualified"
           attributeFormDefault="unqualified">
  <xs:include schemaLocation="AllMatches.xsd" />
  <xs:include schemaLocation="MatchOptions.xsd" />
  <xs:complexType name="FTSelection">
    <xs:sequence>
      <xs:choice>
        <xs:element name="FTWords" type="fts:FTWords"/>
        <xs:element name="FTAnd" type="fts:FTAnd"/>
        <xs:element name="FTOr" type="fts:FTOr"/>
        <xs:element name="FTUnaryNot" type="fts:FTUnaryNot"/>
        <xs:element name="FTMildNot" type="fts:FTMildNot"/>
        <xs:element name="FTOrder" type="fts:FTOrder"/>
        <xs:element name="FTScope" type="fts:FTScope"/>
        <xs:element name="FTDistance" type="fts:FTDistance"/>
        <xs:element name="FTWindow" type="fts:FTWindow"/>
        <xs:element name="FTTimes" type="fts:FTTimes"/>
      </xs:choice>
      <xs:element name="matchOption" type="fts:FTMatchOption" minOccurs="0"/>
      <xs:element name="weight" type="xs:float" minOccurs="0"/>
    </xs:sequence>
  </xs:complexType>
  <xs:complexType name="FTWords">
    <xs:sequence>
      <xs:element name="searchToken" type="fts:TokenInfo" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
    <xs:attribute name="type" type="fts:FTWordsType" use="required"/>
  </xs:complexType>
  <xs:complexType name="FTAnd">
    <xs:sequence>
      <xs:element name="left" type="fts:FTSelection"/>
      <xs:element name="right" type="fts:FTSelection"/>
    </xs:sequence>
  </xs:complexType>
  <xs:complexType name="FTOr">
    <xs:sequence>
      <xs:element name="left" type="fts:FTSelection"/>
      <xs:element name="right" type="fts:FTSelection"/>
    </xs:sequence>
  </xs:complexType>
  <xs:complexType name="FTUnaryNot">
    <xs:sequence>
      <xs:element name="selection" type="fts:FTSelection"/>
    </xs:sequence>
  </xs:complexType>
  <xs:complexType name="FTMildNot">
    <xs:sequence>
      <xs:element name="selection" type="fts:FTSelection"/>
    </xs:sequence>
  </xs:complexType>
  <xs:complexType name="FTOrder">
    <xs:sequence>
      <xs:element name="selection" type="fts:FTSelection"/>
    </xs:sequence>
  </xs:complexType>
  <xs:complexType name="FTScope">
    <xs:sequence>
      <xs:element name="selection" type="fts:FTSelection"/>
    </xs:sequence>
    <xs:attribute name="type" type="fts:ScopeType" use="required"/>
    <xs:attribute name="scope" type="fts:ScopeSelector" use="required"/>
  </xs:complexType>
  <xs:complexType name="FTDistance">
    <xs:sequence>
      <xs:element name="range" type="fts:FTRangeSpec"/>
      <xs:element name="selection" type="fts:FTSelection"/>
    </xs:sequence>
    <xs:attribute name="type" type="fts:DistanceType" use="required"/>
  </xs:complexType>
  <xs:complexType name="FTWindow">
    <xs:sequence>
      <xs:element name="range" type="fts:FTRangeSpec"/>
      <xs:element name="selection" type="fts:FTSelection"/>
    </xs:sequence>
  </xs:complexType>
  <xs:complexType name="FTTimes">
    <xs:sequence>
      <xs:element name="range" type="fts:FTRangeSpec"/>
      <xs:element name="selection" type="fts:FTSelection"/>
    </xs:sequence>
  </xs:complexType>
  <xs:complexType name="FTCaseOption">
    <xs:attribute name="value" use="required">
      <xs:simpleType>
        <xs:restriction base="xs:string">
          <xs:enumeration value="lowercase"/>
          <xs:enumeration value="uppercase"/>
          <xs:enumeration value="case insensitive"/>
          <xs:enumeration value="case sensitive"/>
        </xs:restriction>
      </xs:simpleType>
    </xs:attribute>
  </xs:complexType>
  <xs:complexType name="FTRangeSpec">
    <xs:attribute name="type" type="fts:RangeSpecType" use="required"/>
    <xs:attribute name="m" type="xs:integer"/>
    <xs:attribute name="n" type="xs:integer" use="required"/>
  </xs:complexType>
  <xs:simpleType name="FTWordsType">
    <xs:restriction base="xs:string">
      <xs:enumeration value="any"/>
      <xs:enumeration value="all"/>
      <xs:enumeration value="phrase"/>
      <xs:enumeration value="any word"/>
      <xs:enumeration value="all word"/>
    </xs:restriction>
  </xs:simpleType>
  <xs:simpleType name="ScopeType">
    <xs:restriction base="xs:string">
      <xs:enumeration value="same"/>
      <xs:enumeration value="different"/>
    </xs:restriction>
  </xs:simpleType>
  <xs:simpleType name="ScopeSelector">
    <xs:restriction base="xs:string">
      <xs:enumeration value="paragraph"/>
      <xs:enumeration value="sentence"/>
    </xs:restriction>
  </xs:simpleType>
  <xs:simpleType name="RangeSpecType">
    <xs:restriction base="xs:string">
      <xs:enumeration value="exactly"/>
      <xs:enumeration value="at least"/>
      <xs:enumeration value="at most"/>
      <xs:enumeration value="from to"/>
    </xs:restriction>
  </xs:simpleType>
  <xs:simpleType name="DistanceType">
    <xs:restriction base="xs:string">
      <xs:enumeration value="paragraph"/>
      <xs:enumeration value="sentence"/>
      <xs:enumeration value="word"/>
    </xs:restriction>
  </xs:simpleType>
</xs:schema>
The XML representation of the match options is discussed in the match options section.
We present denotational semantics for the evaluation of FTSelections. Specifically, we define a function fts:evaluate that takes four parameters: (1) an FTSelection, (2) a search context node, (3) the default set of match options that apply to the evaluation of the FTSelection, and (4) the number of search tokens that have been processed so far. The fts:evaluate function returns the AllMatches that is the result of evaluating the FTSelection. When fts:evaluate is applied to some FTSelection X, it calls the function fts:applyX to build the resulting AllMatches. If X is applied on nested FTSelections, the fts:evaluate function is recursively called on these nested FTSelections and the returned AllMatches are used in the evaluation of fts:applyX.
See the section Match Options Semantics for the semantics of the full-text match options.
We first present a high-level description of the fts:evaluate function, and then describe the details.
The fts:evaluate function is given below.
function fts:evaluate($ftSelect as element(*, fts:FTSelection),
                      $searchContext as node(),
                      $matchOptions as fts:FTMatchOptions,
                      $searchTokenNum as xs:integer)
  as fts:AllMatches
{
  if (fn:count($ftSelect/matchOption) > 0) then
    (: First we deal with all match options that the    :)
    (: FTSelection might bear: we add the match options :)
    (: in front of the current match options sequence   :)
    (: and pass the new sequence to the recursive call  :)
    let $newFTSelection :=
        $ftSelect/*[fn:not(. instance of element(matchOption))]
    return fts:evaluate($newFTSelection, $searchContext,
                        ($ftSelect/matchOption, $matchOptions),
                        $searchTokenNum)
  else if (fn:count($ftSelect/weight) > 0) then
    (: Weight has no bearing on semantics: just :)
    (: call "evaluate" on nested FTSelection    :)
    let $newFTSelection :=
        $ftSelect/*[fn:not(. instance of element(weight))]
    return fts:evaluate($newFTSelection, $searchContext,
                        $matchOptions, $searchTokenNum)
  else
    typeswitch ($ftSelect)
      case $nftSelection as element(FTWords) return
        (: Apply the FTWords in the search context :)
        applyFTWords($searchContext, $matchOptions,
                     $nftSelection/searchToken, $searchTokenNum + 1)
      case $nftSelection as element(FTAnd) return
        let $left := fts:evaluate($nftSelection/left, $searchContext,
                                  $matchOptions, $searchTokenNum)
        let $newSearchTokenNum := fn:max($left//@queryPos) + 1
        (: The right operand continues numbering after the left one :)
        let $right := fts:evaluate($nftSelection/right, $searchContext,
                                   $matchOptions, $newSearchTokenNum)
        return applyFTAnd($left, $right)
      case $nftSelection as element(FTOr) return
        let $left := fts:evaluate($nftSelection/left, $searchContext,
                                  $matchOptions, $searchTokenNum)
        let $newSearchTokenNum := fn:max($left//@queryPos) + 1
        let $right := fts:evaluate($nftSelection/right, $searchContext,
                                   $matchOptions, $newSearchTokenNum)
        return applyFTOr($left, $right)
      case $nftSelection as element(FTUnaryNot) return
        let $nested := fts:evaluate($nftSelection/selection, $searchContext,
                                    $matchOptions, $searchTokenNum)
        return applyFTUnaryNot($nested)
      case $nftSelection as element(FTMildNot) return
        let $left := fts:evaluate($nftSelection/left, $searchContext,
                                  $matchOptions, $searchTokenNum)
        let $newSearchTokenNum := fn:max($left//@queryPos) + 1
        let $right := fts:evaluate($nftSelection/right, $searchContext,
                                   $matchOptions, $newSearchTokenNum)
        return applyFTMildNot($left, $right)
      case $nftSelection as element(FTOrder) return
        let $nested := fts:evaluate($nftSelection/selection, $searchContext,
                                    $matchOptions, $searchTokenNum)
        return applyFTOrder($nested)
      case $nftSelection as element(FTScope) return
        let $nested := fts:evaluate($nftSelection/selection, $searchContext,
                                    $matchOptions, $searchTokenNum)
        return applyFTScope($nftSelection/@type, $nftSelection/@scope, $nested)
      case $nftSelection as element(FTDistance) return
        let $nested := fts:evaluate($nftSelection/selection, $searchContext,
                                    $matchOptions, $searchTokenNum)
        return applyFTDistance($matchOptions, $nftSelection/@type,
                               $nftSelection/range, $nested)
      case $nftSelection as element(FTWindow) return
        let $nested := fts:evaluate($nftSelection/selection, $searchContext,
                                    $matchOptions, $searchTokenNum)
        return applyFTWindow($matchOptions, $nftSelection/range, $nested)
      case $nftSelection as element(FTTimes) return
        let $nested := fts:evaluate($nftSelection/selection, $searchContext,
                                    $matchOptions, $searchTokenNum)
        return applyFTTimes($nftSelection/range, $nested)
}
Let us now walk through the above pseudo-code to understand the semantics of the function. For concreteness, let us assume that the FTSelection was invoked inside an ftcontains expression such as searchContext ftcontains ftselection. In order to determine the AllMatches result of ftselection, the fts:evaluate function is invoked as follows: fts:evaluate($ftselection, $searchContext, $matchOptions, 0), where $ftselection is the XML representation of the ftselection and $searchContext is bound to the result of the evaluation of the XQuery expression searchContext.
Initially, $searchTokenNum is 0, i.e., no search tokens have been processed yet.
The $matchOptions above is the default, implementation-defined list of match options that apply to the evaluation of ftselection (such as stemming but not thesaurus). Match options embedded in ftselection can change the match options collection as evaluation proceeds. In order to express the order in which match options are applied to an FTSelection, the match options are organized in a stack. The top match option in the stack is to be applied first, the next match option is to be applied second, and so on. The ordering among match options is necessary because match options are not always commutative. For example, synonym(stem(word)) is not always the same as stem(synonym(word)). Of course, match options can be reordered when they commute, but this is an optimization issue and is beyond the scope of this semantics document.
Given the invocation fts:evaluate($ftselection, $searchContext, $matchOptions, 0), evaluation proceeds as follows. First, $ftselection is checked to see whether it is a match option applied on a nested FTSelection (case 1), a weight specification (case 2), an FTWords (case 3), or some other FTSelection (case 4). Let us consider these four cases in turn.
Case 1: If $ftselection contains a match option, then it modifies the context for the nested FTSelection. Consequently, a new match option element is created and pushed onto the top of the stack of match options. The createOptionElement function used to create a stack element corresponding to the match option simply creates a data structure that stores the type of match option (such as stemming, thesaurus, synonyms, ignore, etc.) and the details relating to the match option (such as the name of the thesaurus, the words to ignore, etc.). The context match option created is added to the top of the stack because, in the FTSelection, it was applied before the other match options in the current match options stack. The evaluate function is then invoked on the nested FTSelection with the new match options stack. When the function returns, the match option is popped from the stack, and the result of the nested evaluate function is returned. The match option is popped because it should not apply to FTSelections outside its scope.
Case 2: If $ftselection contains a weight specification, then the specification is simply ignored (because it does not alter semantics). The evaluate function is recursively called on the nested FTSelection and the resulting AllMatches is directly returned.
Case 3: If $ftselection is an FTWords, then it does not have any nested FTSelections. Consequently, this is the base of the recursive call, and the AllMatches result of the FTWords is computed and returned. The AllMatches is computed by invoking the applyFTWords function with the current search context and other necessary information. The semantics of how exactly the corresponding applyFTWords creates AllMatches for FTWords will be specified in the next section.
Case 4: If $ftselection contains neither a match option nor a weight specification and is not an FTWords, the FTSelection performs some form of full-text operation such as &&, ||, window, etc. Note that these operations are fully compositional and can be invoked on nested FTSelections. Consequently, evaluation proceeds as follows. First, the evaluate function is recursively invoked on each nested FTSelection. The result of evaluating each nested FTSelection is an AllMatches. These AllMatches are transformed into a result AllMatches by applying the full-text operation corresponding to the enclosing FTSelection, FTSelection1 (generically named applyX for some type of FTSelection X in the pseudo-code). As an example, let FTSelection1 be FTSelection2 && FTSelection3. Here FTSelection2 and FTSelection3 can themselves be arbitrarily nested FTSelections. Thus, evaluate is invoked on FTSelection2 and FTSelection3, and the resulting AllMatches are transformed to the output AllMatches using the applyFTAnd function corresponding to &&.
Note that specifying the semantics of the applyX function for each FTSelection kind X is key to specifying the semantics of the FTSelection itself. In the subsequent sections, we define the semantics of the applyX function for each FTSelection kind X.
The formal semantics of the applyX functions for each FTSelection kind X are specified in terms of four functions. How these four functions are computed is implementation-defined, but the functions have to satisfy some well-defined properties. We first present the properties of these four functions, and then present the semantics of the family of functions applyX in terms of them.
The first function, getTokenInfo, has been described in the tokenization section.
The function wordDistance returns the number of words that occur between the positions of the TokenInfos $tokenInfo1 and $tokenInfo2. For example, two consecutive words have a distance of 0.
function fts:wordDistance( $tokenInfo1 as fts:TokenInfo, $tokenInfo2 as fts:TokenInfo, $matchOptions as fts:FTMatchOptions) as xs:integer
Similarly, the function paraDistance returns the number of paragraphs that occur between the TokenInfos $tokenInfo1 and $tokenInfo2.
function fts:paraDistance( $tokenInfo1 as fts:TokenInfo, $tokenInfo2 as fts:TokenInfo, $matchOptions as fts:FTMatchOptions) as xs:integer
The function sentenceDistance returns the number of sentences that occur between the TokenInfos $tokenInfo1 and $tokenInfo2.
function fts:sentenceDistance( $tokenInfo1 as fts:TokenInfo, $tokenInfo2 as fts:TokenInfo, $matchOptions as fts:FTMatchOptions) as xs:integer
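As a rough illustration of the word distance property, the following Python sketch assumes that each TokenInfo carries a 1-based word position. The TokenInfo class and the word_distance function are hypothetical, the match options parameter is omitted, and how positions are actually assigned remains implementation-defined.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TokenInfo:
    word: str
    pos: int  # assumed 1-based word position in the document

def word_distance(a: TokenInfo, b: TokenInfo) -> int:
    """Number of words strictly between the two token positions."""
    # Clamped at 0 so that a token compared with itself does not yield -1
    # (this clamping is an assumption of the sketch).
    return max(abs(a.pos - b.pos) - 1, 0)

ford = TokenInfo("Ford", 1)
mustang = TokenInfo("Mustang", 2)
rust = TokenInfo("rust", 5)
```

With these definitions two consecutive words have a distance of 0, as stated above, and the order of the arguments does not matter.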
We first consider the case where FTWords consists of a single search string. The parameters of the applySingleSearchToken function are the search context, the list of match options, the search TokenInfo, and the position where the latter occurs in the query.
declare function fts:applySingleSearchToken( $searchContext as node(), $matchOptions as fts:FTMatchOptions, $searchToken as fts:TokenInfo, $queryPos as xs:integer) as element(allMatches, fts:AllMatches) { <allMatches> { let $token_pos := fts:getTokenInfo($searchContext, $matchOptions, $searchToken) for $pos in $token_pos return <match> <stringInclude queryPos="{$queryPos}" queryString="{$searchToken/@word}" > {$pos} </stringInclude> </match> } </allMatches> }
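The function above can be approximated in Python as follows. This is only a sketch: the document is assumed to be already tokenized into a list of words, match options are ignored, and the dictionary keys simply mirror the attribute names used in the pseudo-code.

```python
def apply_single_search_token(tokens, search_token, query_pos):
    """Return an AllMatches: one Match per occurrence of search_token."""
    all_matches = []
    for pos, word in enumerate(tokens, start=1):  # 1-based word positions
        if word == search_token:
            string_include = {
                "queryPos": query_pos,
                "queryString": search_token,
                "pos": pos,
            }
            # Each Match consists of a single stringInclude.
            all_matches.append([string_include])
    return all_matches

# Hypothetical document text, not the specification's sample document.
tokens = "the ford mustang is a classic mustang".split()
result = apply_single_search_token(tokens, "mustang", query_pos=1)
positions = [match[0]["pos"] for match in result]
```

The resulting AllMatches contains two Matches, one for each position at which the search token occurs.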
Intuitively, the AllMatches corresponding to an FTWords corresponds to a set of Matches, each of which is associated with a position where the corresponding search token was found. For example, the AllMatches result for the FTWords "Mustang" evaluated in the context of the sample document will be (in graphical terms):
The other cases can be rewritten as complex FTSelections that operate on single-string FTWords.
In the case of an FTWords with any word specified, the semantics is given below. Since an FTWords does not have nested FTSelections, the ApplyFTWords function does not take any AllMatches parameters corresponding to nested FTSelection results.
declare function fts:MakeDisjunction(
    $curRes as element(allMatches, fts:AllMatches),
    $rest as element(allMatches, fts:AllMatches)*)
  as element(allMatches, fts:AllMatches)
{
  if (fn:count($rest) = 0) then $curRes
  else
    let $firstAllMatches := fn:item-at($rest, 1)
    let $restAllMatches := fn:subsequence($rest, 2)
    let $newCurRes := fts:ApplyFTOr($curRes, $firstAllMatches)
    return fts:MakeDisjunction($newCurRes, $restAllMatches)
}
declare function fts:ApplyFTWordsAnyWord(
    $searchContext as node(),
    $matchOptions as fts:FTMatchOptions,
    $searchStrings as xs:string*,
    $queryPos as xs:integer)
  as element(allMatches, fts:AllMatches)
{
  if (fn:count($searchStrings) = 0) then <allMatches />
  else
    let $searchTokens :=
      for $searchString in $searchStrings
      return fts:getSearchTokenInfo($searchString, $matchOptions)
    let $allAllMatches :=
      for $searchToken at $pos in $searchTokens
      return fts:applySingleSearchToken($searchContext, $matchOptions,
                                        $searchToken, $queryPos + $pos - 1)
    let $firstAllMatches := fn:item-at($allAllMatches, 1)
    let $restAllMatches := fn:subsequence($allAllMatches, 2)
    return fts:MakeDisjunction($firstAllMatches, $restAllMatches)
}
Intuitively, all search strings are tokenized and a single sequence that consists of all TokenInfos is constructed. For each of these, the result of FTWords is computed using applySingleSearchToken. Finally, the disjunction of all resulting AllMatches is computed.
Similarly, in the case of an FTWords with all word specified, the semantics is given below.
declare function fts:MakeConjunction(
    $curRes as element(allMatches, fts:AllMatches),
    $rest as element(allMatches, fts:AllMatches)*)
  as element(allMatches, fts:AllMatches)
{
  if (fn:count($rest) = 0) then $curRes
  else
    let $firstAllMatches := fn:item-at($rest, 1)
    let $restAllMatches := fn:subsequence($rest, 2)
    let $newCurRes := fts:ApplyFTAnd($curRes, $firstAllMatches)
    return fts:MakeConjunction($newCurRes, $restAllMatches)
}
declare function fts:ApplyFTWordsAllWord(
    $searchContext as node(),
    $matchOptions as fts:FTMatchOptions,
    $searchStrings as xs:string*,
    $queryPos as xs:integer)
  as element(allMatches, fts:AllMatches)
{
  if (fn:count($searchStrings) = 0) then <allMatches />
  else
    let $searchTokens :=
      for $searchString in $searchStrings
      return fts:getSearchTokenInfo($searchString, $matchOptions)
    let $allAllMatches :=
      for $searchToken at $pos in $searchTokens
      return fts:applySingleSearchToken($searchContext, $matchOptions,
                                        $searchToken, $queryPos + $pos - 1)
    let $firstAllMatches := fn:item-at($allAllMatches, 1)
    let $restAllMatches := fn:subsequence($allAllMatches, 2)
    return fts:MakeConjunction($firstAllMatches, $restAllMatches)
}
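The any word and all word cases differ only in the operation used to fold the per-token results together. The Python sketch below makes this explicit. It assumes the search strings are already single tokens, ignores match options, and represents a Match as a list of (queryPos, pos) pairs; all function names are hypothetical.

```python
from functools import reduce
from itertools import product

def single(tokens, word, query_pos):
    """AllMatches for one search token: Matches holding one (queryPos, pos) pair."""
    return [[(query_pos, pos)] for pos, w in enumerate(tokens, start=1) if w == word]

def ft_or(a, b):
    return a + b  # union of the Matches of both operands

def ft_and(a, b):
    return [m1 + m2 for m1, m2 in product(a, b)]  # pairwise combination

def any_word(tokens, words, query_pos=1):
    parts = [single(tokens, w, query_pos + i) for i, w in enumerate(words)]
    return reduce(ft_or, parts) if parts else []

def all_word(tokens, words, query_pos=1):
    parts = [single(tokens, w, query_pos + i) for i, w in enumerate(words)]
    return reduce(ft_and, parts) if parts else []

# Hypothetical document text, not the specification's sample document.
tokens = "great car in great condition".split()
any_result = any_word(tokens, ["great", "condition"])
all_result = all_word(tokens, ["great", "condition"])
```

For this input, any_result contains three single-position Matches, while all_result contains two Matches that each pair one occurrence of "great" with the occurrence of "condition".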
In the case of an FTWords with phrase specified, the semantics is given below.
declare function fts:ApplyFTWordsPhrase(
    $searchContext as node(),
    $matchOptions as fts:FTMatchOptions,
    $searchStrings as xs:string*,
    $queryPos as xs:integer)
  as element(allMatches, fts:AllMatches)
{
  let $conj := fts:ApplyFTWordsAllWord($searchContext, $matchOptions,
                                       $searchStrings, $queryPos)
  let $ordered := fts:ApplyFTOrder($conj)
  let $distance1 := fts:ApplyFTDistance($matchOptions, "word",
                                        <fts:range type="exactly" n="0"/>,
                                        $ordered)
  return $distance1
}
The above function is similar to the one in the case of all word. The only difference is that the additional FTSelections ordered and with word distance exactly 0 are applied, so that the search tokens must occur consecutively and in query order.
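The composition used for phrases can be sketched in Python as a pipeline of three steps: conjunction, ordering, and a word distance of exactly 0. This is an illustration under simplifying assumptions (pre-tokenized input, no match options, a Match as a list of (queryPos, pos) pairs), and all function names are hypothetical.

```python
from itertools import product

def single(tokens, word, query_pos):
    """AllMatches for one search token: Matches holding one (queryPos, pos) pair."""
    return [[(query_pos, pos)] for pos, w in enumerate(tokens, start=1) if w == word]

def all_word(tokens, words):
    """Conjunction of the single-token AllMatches (Cartesian product)."""
    result = [[]]
    for query_pos, word in enumerate(words, start=1):
        result = [m + s for m, s in product(result, single(tokens, word, query_pos))]
    return result

def ordered(all_matches):
    """Keep Matches whose document order agrees with the query order."""
    def agree(a, b):
        return (a[1] <= b[1] and a[0] <= b[0]) or (a[1] >= b[1] and a[0] >= b[0])
    return [m for m in all_matches if all(agree(a, b) for a in m for b in m)]

def adjacent(all_matches):
    """Keep Matches in which consecutive positions have word distance 0."""
    kept = []
    for match in all_matches:
        positions = sorted(pos for _, pos in match)
        if all(b - a - 1 == 0 for a, b in zip(positions, positions[1:])):
            kept.append(match)
    return kept

def phrase(tokens, words):
    return adjacent(ordered(all_word(tokens, words)))

# Hypothetical document text, not the specification's sample document.
tokens = "ford mustang and mustang ford".split()
result = phrase(tokens, ["ford", "mustang"])
```

Only the occurrence at positions 1 and 2 survives: the other combinations are either out of order or not adjacent.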
The semantics for the case of an FTWords with any specified is given below.
declare function fts:ApplyFTWordsAny(
    $searchContext as node(),
    $matchOptions as fts:FTMatchOptions,
    $searchStrings as xs:string*,
    $queryPos as xs:integer)
  as element(allMatches, fts:AllMatches)
{
  if (fn:count($searchStrings) = 0) then <allMatches />
  else
    let $firstSearchString := fn:item-at($searchStrings, 1)
    let $restSearchString := fn:subsequence($searchStrings, 2)
    let $firstAllMatches := fts:ApplyFTWordsPhrase($searchContext, $matchOptions,
                                                   $firstSearchString, $queryPos)
    let $newQueryPos := fn:max($firstAllMatches//@queryPos) + 1
    let $restAllMatches := fts:ApplyFTWordsAny($searchContext, $matchOptions,
                                               $restSearchString, $newQueryPos)
    return fts:ApplyFTOr($firstAllMatches, $restAllMatches)
}
Intuitively, an FTWords with any specified forms the disjunction of the AllMatches that result from matching each separate search string as a phrase.
Analogously, the semantics for the case of an FTWords with all specified is:
declare function fts:ApplyFTWordsAll(
    $searchContext as node(),
    $matchOptions as fts:FTMatchOptions,
    $searchStrings as xs:string*,
    $queryPos as xs:integer)
  as element(allMatches, fts:AllMatches)
{
  if (fn:count($searchStrings) = 0) then <allMatches />
  else
    let $firstSearchString := fn:item-at($searchStrings, 1)
    let $restSearchString := fn:subsequence($searchStrings, 2)
    let $firstAllMatches := fts:ApplyFTWordsPhrase($searchContext, $matchOptions,
                                                   $firstSearchString, $queryPos)
    let $newQueryPos := fn:max($firstAllMatches//@queryPos) + 1
    let $restAllMatches := fts:ApplyFTWordsAll($searchContext, $matchOptions,
                                               $restSearchString, $newQueryPos)
    return
      (: an empty remainder must not empty the conjunction :)
      if (fn:count($restSearchString) = 0) then $firstAllMatches
      else fts:ApplyFTAnd($firstAllMatches, $restAllMatches)
}
As before, the difference from the case of any is the use of conjunction instead of disjunction.
Finally, we define the function that combines all of the above cases.
declare function fts:ApplyFTWords(
    $searchContext as node(),
    $matchOptions as fts:FTMatchOptions,
    $type as element(type, fts:FTWordsType),
    $searchTokens as xs:string*,
    $queryPos as xs:integer)
  as element(allMatches, fts:AllMatches)
{
  if ($type eq "any word") then
    fts:ApplyFTWordsAnyWord($searchContext, $matchOptions, $searchTokens, $queryPos)
  else if ($type eq "all word") then
    fts:ApplyFTWordsAllWord($searchContext, $matchOptions, $searchTokens, $queryPos)
  else if ($type eq "phrase") then
    fts:ApplyFTWordsPhrase($searchContext, $matchOptions, $searchTokens, $queryPos)
  else if ($type eq "any") then
    fts:ApplyFTWordsAny($searchContext, $matchOptions, $searchTokens, $queryPos)
  else
    fts:ApplyFTWordsAll($searchContext, $matchOptions, $searchTokens, $queryPos)
}
The parameters of the ApplyFTOr function are the two AllMatches parameters corresponding to the results of the two nested FTSelections. The search context and the match options stack are not used in this case. The function definition is given below.
declare function fts:ApplyFTOr(
    $allMatches1 as element(allMatches, fts:AllMatches),
    $allMatches2 as element(allMatches, fts:AllMatches))
  as element(allMatches, fts:AllMatches)
{
  <allMatches> {($allMatches1/match, $allMatches2/match)} </allMatches>
}
The function creates a new AllMatches whose Matches are simply the union of those found in the input AllMatches. The rationale for this semantics is that each Match represents one possible "solution" to the corresponding FTSelection. Thus, if we "or" two AllMatches, a Match from either of the AllMatches should also be a solution.
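The union semantics can be illustrated with a very small Python sketch. The function name apply_ft_or is hypothetical, a Match is modeled as a list of (queryString, pos) pairs, and the positions are made up for illustration rather than taken from the sample document.

```python
def apply_ft_or(all_matches1, all_matches2):
    """Union of the Matches of both operands; each Match stays unchanged."""
    return list(all_matches1) + list(all_matches2)

# Hypothetical positions, for illustration only.
mustang = [[("Mustang", 2)], [("Mustang", 31)]]
honda = [[("Honda", 12)]]
combined = apply_ft_or(mustang, honda)
```

Every Match of either operand appears in combined, so a node that satisfies either operand also satisfies the disjunction.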
As an example, consider the FTSelection "Mustang" || "Honda" in the context of the sample document. The AllMatches corresponding to "Mustang" and "Honda" are:
The AllMatches produced by ApplyFTOr is:
The parameters of the ApplyFTAnd function are the two AllMatches parameters corresponding to the results of the two nested FTSelections. The search context and the match options are not used in this case. The function definition is given below.
declare function fts:ApplyFTAnd(
    $allMatches1 as element(allMatches, fts:AllMatches),
    $allMatches2 as element(allMatches, fts:AllMatches))
  as element(allMatches, fts:AllMatches)
{
  <allMatches>
  {
    for $sm1 in $allMatches1/match
    for $sm2 in $allMatches2/match
    return <match> {($sm1/*, $sm2/*)} </match>
  }
  </allMatches>
}
Intuitively, the result of a conjunction is a new AllMatches that contains the "Cartesian product" of the Matches of the participating FTSelections. Every resulting Match is formed by combining the stringInclude and stringExclude components of one Match from each of the AllMatches of the nested FTSelections. Thus every resulting Match contains the positions needed to satisfy a Match from both original FTSelections and excludes the positions that would violate those Matches.
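The Cartesian product described above can be sketched in Python with itertools.product. The function name apply_ft_and is hypothetical, each component is a (kind, queryString, pos) tuple, and the positions are made up for illustration.

```python
from itertools import product

def apply_ft_and(all_matches1, all_matches2):
    """Combine every Match of the first operand with every Match of the second."""
    return [m1 + m2 for m1, m2 in product(all_matches1, all_matches2)]

# Hypothetical positions, for illustration only.
mustang = [[("include", "Mustang", 2)], [("include", "Mustang", 31)]]
rust = [[("include", "rust", 20)]]
combined = apply_ft_and(mustang, rust)
```

Two Matches on one side and one on the other give two combined Matches. If either operand has no Matches, the product is empty and so is the result.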
As an example, let us consider the FTSelection "Mustang" && "rust" in the context of the sample document. The source AllMatches are:
The AllMatches produced by ApplyFTAnd is:
The parameters of the ApplyFTUnaryNot function are the search context, the list of match options, and one AllMatches parameter corresponding to the result of the nested FTSelection to be negated. The search context and the match options are not used in this case. The function definition is given below.
declare function fts:InvertStringMatch($strm as element()) as element()
{
  if ($strm instance of element(stringExclude)) then
    <stringInclude queryPos="{$strm/@queryPos}" queryString="{$strm/@queryString}">
      {$strm/*}
    </stringInclude>
  else
    <stringExclude queryPos="{$strm/@queryPos}" queryString="{$strm/@queryString}">
      {$strm/*}
    </stringExclude>
}
declare function fts:UnaryNotHelper($matches as element(match)*) as element(match)*
{
  (: no source Match left: a single empty Match ends the recursion :)
  if (fn:empty($matches)) then <match />
  else
    for $sm in $matches[1]/child::element()
    for $rest in fts:UnaryNotHelper(fn:subsequence($matches, 2))
    return <match> {(fts:InvertStringMatch($sm), $rest/*)} </match>
}
declare function fts:ApplyFTUnaryNot(
    $allMatches as element(allMatches, fts:AllMatches))
  as element(allMatches, fts:AllMatches)
{
  <allMatches> {fts:UnaryNotHelper($allMatches/match)} </allMatches>
}
The generation of the resulting AllMatches of an FTUnaryNot resembles the transformation of the negation of a propositional formula in DNF back into DNF. The intuition is that negating an AllMatches requires inverting all the conditions on the nodes encoded by the AllMatches.
In the implementation above, this inversion is implemented as follows. The function fts:InvertStringMatch inverts a stringInclude into a stringExclude and vice versa. The function fts:UnaryNotHelper transforms the source Matches into the resulting Matches by combining the inversion of one stringInclude or stringExclude component from every source Match into a new Match.
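The DNF analogy can be made concrete with a short Python sketch: negating a disjunction of conjunctions means choosing one component from every source Match and inverting it. The function names are hypothetical, each component is a (kind, queryString, pos) tuple, and the positions are made up for illustration.

```python
from itertools import product

def invert(component):
    """Turn an include into an exclude and vice versa."""
    kind, word, pos = component
    return ("exclude" if kind == "include" else "include", word, pos)

def apply_ft_unary_not(all_matches):
    """One result Match per way of picking one component from every source Match."""
    return [[invert(c) for c in choice] for choice in product(*all_matches)]

# Hypothetical positions for ! ("Mustang" || "Honda").
source = [[("include", "Mustang", 2)], [("include", "Honda", 12)]]
negated = apply_ft_unary_not(source)
```

For an empty source AllMatches, product yields exactly one empty choice, so the result is a single empty Match, which is satisfied by any node. A source Match with two components produces two result Matches, one per inverted component.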
As an example, let us consider the FTSelection ! ("Mustang" || "Honda") in the context of the sample document. The source AllMatches is:
The FTUnaryNot will transform it to:
The parameters of the ApplyFTMildNot function are the two AllMatches parameters corresponding to the results of the two nested FTSelections. The search context and the match options stack are not used in this case. The function definition is given below.
declare function fts:ApplyFTMildNot(
    $allMatches1 as element(allMatches, fts:AllMatches),
    $allMatches2 as element(allMatches, fts:AllMatches))
  as element(allMatches, fts:AllMatches)
{
  <allMatches>
  {
    let $posSet2 := $allMatches2/match/stringInclude/pos
    return $allMatches1/match[every $pos1 in ./stringInclude/pos,
                                    $pos2 in $posSet2
                              satisfies $pos1 ne $pos2]
  }
  </allMatches>
}
The resulting AllMatches consists of those Matches of the first operand that do not mention in their stringInclude components positions mentioned in a stringInclude component in the AllMatches of the second operand.
As an example, consider the FTSelection ("Ford" mildnot "Ford Mustang") in the context of the sample document. The source AllMatches are:
The FTMildNot will transform these to an empty AllMatches because both positions in the first AllMatches, position 1 and position 27, also occur in stringInclude components of the second AllMatches.
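The filtering performed by FTMildNot can be sketched in Python as follows. The function name is hypothetical and each component is a (kind, queryString, pos) tuple. The positions 1 and 27 for "Ford" come from the example above, whereas the positions 2 and 28 for "Mustang" are assumed for illustration.

```python
def apply_ft_mild_not(all_matches1, all_matches2):
    """Drop Matches of the first operand that touch an included position of the second."""
    blocked = {
        pos
        for match in all_matches2
        for kind, _, pos in match
        if kind == "include"
    }
    return [
        match
        for match in all_matches1
        if all(pos not in blocked for kind, _, pos in match if kind == "include")
    ]

ford = [[("include", "Ford", 1)], [("include", "Ford", 27)]]
ford_mustang = [
    [("include", "Ford", 1), ("include", "Mustang", 2)],
    [("include", "Ford", 27), ("include", "Mustang", 28)],
]
result = apply_ft_mild_not(ford, ford_mustang)
```

Here result is empty, in line with the example. A further occurrence of "Ford" that is not followed by "Mustang" would be kept.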
The parameters of the ApplyFTOrder function are the search context, the list of match options, and one AllMatches parameter corresponding to the result of the nested FTSelections. The evaluation context and the match options are not used in this case. The function definition is given below.
declare function fts:ApplyFTOrder(
    $allMatches as element(allMatches, fts:AllMatches))
  as element(allMatches, fts:AllMatches)
{
  <allMatches>
  {
    for $match in $allMatches/match
    where every $stringInclude1 in $match/stringInclude,
                $stringInclude2 in $match/stringInclude
          satisfies
            (($stringInclude1/tokenInfo/@pos <= $stringInclude2/tokenInfo/@pos) and
             ($stringInclude1/@queryPos <= $stringInclude2/@queryPos))
            or
            (($stringInclude1/tokenInfo/@pos >= $stringInclude2/tokenInfo/@pos) and
             ($stringInclude1/@queryPos >= $stringInclude2/@queryPos))
    return
      <match>
        {$match/stringInclude}
        {
          for $stringExcl in $match/stringExclude
          where every $stringIncl in $match/stringInclude
                satisfies
                  (($stringExcl/tokenInfo/@pos <= $stringIncl/tokenInfo/@pos) and
                   ($stringExcl/@queryPos <= $stringIncl/@queryPos))
                  or
                  (($stringExcl/tokenInfo/@pos >= $stringIncl/tokenInfo/@pos) and
                   ($stringExcl/@queryPos >= $stringIncl/@queryPos))
          return $stringExcl
        }
      </match>
  }
  </allMatches>
}
The resulting AllMatches contains all Matches of the parameter in which the positions in the stringInclude elements are in the order of the query positions of their query strings. Only those stringExcludes that preserve the order are retained.
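The order check can be sketched in Python as below. The function name apply_ft_order is hypothetical, each component is a (kind, queryPos, pos) tuple, and the positions are made up for illustration.

```python
def apply_ft_order(all_matches):
    """Keep Matches whose stringInclude positions follow the query order."""
    def in_order(a, b):
        # Same pairwise condition as in the pseudo-code: document order
        # and query order must point the same way.
        return (a[2] <= b[2] and a[1] <= b[1]) or (a[2] >= b[2] and a[1] >= b[1])

    result = []
    for match in all_matches:
        includes = [c for c in match if c[0] == "include"]
        if all(in_order(a, b) for a in includes for b in includes):
            excludes = [
                c for c in match
                if c[0] == "exclude" and all(in_order(c, i) for i in includes)
            ]
            result.append(includes + excludes)
    return result

# Hypothetical positions for ("great" && "condition") ordered.
source = [
    [("include", 1, 10), ("include", 2, 12)],  # "great" before "condition"
    [("include", 1, 20), ("include", 2, 12)],  # "great" after "condition"
]
result = apply_ft_order(source)
```

Only the first Match is retained, because in the second one the token with the smaller query position occurs later in the document.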
As an example, consider the FTSelection ("great" && "condition") ordered in the context of the sample document. The source AllMatches is:
The FTOrder will return:
The parameters of the ApplyFTScope function are the search context, the list of match options, the type of the scope (same or different), the linguistic unit (sentence or paragraph), and one AllMatches parameter corresponding to the result of the nested FTSelections. The search context and the match options are not used in this case. The function definitions for each combination of linguistic unit (sentence, paragraph) and scope type (same, different) are given below.
In the case of same sentence, the semantics is given by:
declare function fts:ApplyFTScopeSameSentence(
    $allMatches as element(allMatches, fts:AllMatches))
  as element(allMatches, fts:AllMatches)
{
  <allMatches>
  {
    for $match in $allMatches/match
    where every $stringInclude1 in $match/stringInclude,
                $stringInclude2 in $match/stringInclude
          satisfies $stringInclude1/tokenInfo/@sentence =
                    $stringInclude2/tokenInfo/@sentence
    return
      <match>
        {$match/stringInclude}
        {
          for $stringExcl in $match/stringExclude
          where every $stringIncl in $match/stringInclude
                satisfies $stringIncl/tokenInfo/@sentence =
                          $stringExcl/tokenInfo/@sentence
          return $stringExcl
        }
      </match>
  }
  </allMatches>
}
Similarly, the semantics for different sentence is given by:
declare function fts:ApplyFTScopeDifferentSentence(
    $allMatches as element(allMatches, fts:AllMatches))
  as element(allMatches, fts:AllMatches)
{
  <allMatches>
  {
    for $match in $allMatches/match
    where every $stringInclude1 in $match/stringInclude,
                $stringInclude2 in $match/stringInclude
          satisfies $stringInclude1 is $stringInclude2
                    or $stringInclude1/tokenInfo/@sentence !=
                       $stringInclude2/tokenInfo/@sentence
    return
      <match>
        {$match/stringInclude}
        {
          for $stringExcl in $match/stringExclude
          where every $stringIncl in $match/stringInclude
                satisfies $stringIncl/tokenInfo/@sentence !=
                          $stringExcl/tokenInfo/@sentence
          return $stringExcl
        }
      </match>
  }
  </allMatches>
}
In the case of same paragraph, the semantics is given by:
declare function fts:ApplyFTScopeSameParagraph(
    $allMatches as element(allMatches, fts:AllMatches))
  as element(allMatches, fts:AllMatches)
{
  <allMatches>
  {
    for $match in $allMatches/match
    where every $stringInclude1 in $match/stringInclude,
                $stringInclude2 in $match/stringInclude
          satisfies $stringInclude1/tokenInfo/@para =
                    $stringInclude2/tokenInfo/@para
    return
      <match>
        {$match/stringInclude}
        {
          for $stringExcl in $match/stringExclude
          where every $stringIncl in $match/stringInclude
                satisfies $stringIncl/tokenInfo/@para =
                          $stringExcl/tokenInfo/@para
          return $stringExcl
        }
      </match>
  }
  </allMatches>
}
Finally, the semantics for different paragraph is given by:
declare function fts:ApplyFTScopeDifferentParagraph(
    $allMatches as element(allMatches, fts:AllMatches))
  as element(allMatches, fts:AllMatches)
{
  <allMatches>
  {
    for $match in $allMatches/match
    where every $stringInclude1 in $match/stringInclude,
                $stringInclude2 in $match/stringInclude
          satisfies $stringInclude1 is $stringInclude2
                    or $stringInclude1/tokenInfo/@para !=
                       $stringInclude2/tokenInfo/@para
    return
      <match>
        {$match/stringInclude}
        {
          for $stringExcl in $match/stringExclude
          where every $stringIncl in $match/stringInclude
                satisfies $stringIncl/tokenInfo/@para !=
                          $stringExcl/tokenInfo/@para
          return $stringExcl
        }
      </match>
  }
  </allMatches>
}
If, for instance, the linguistic unit of the scope is the sentence, the semantics is straightforward. Of the Matches in the AllMatches of the operand, only those whose stringInclude components all lie in the same sentence (for same) or in pairwise different sentences (for different) are retained. Of the stringExcludes of each retained Match, only those that lie in the same sentence as every stringInclude (respectively, in a different sentence from every stringInclude) are retained. The case of the paragraph unit is analogous.
The semantics for the general case is given by:
declare function fts:ApplyFTScope(
    $type as fts:ScopeType,
    $selector as fts:ScopeSelector,
    $allMatches as element(allMatches, fts:AllMatches))
  as element(allMatches, fts:AllMatches)
{
  if ($type eq "same" and $selector eq "sentence") then
    fts:ApplyFTScopeSameSentence($allMatches)
  else if ($type eq "different" and $selector eq "sentence") then
    fts:ApplyFTScopeDifferentSentence($allMatches)
  else if ($type eq "same" and $selector eq "paragraph") then
    fts:ApplyFTScopeSameParagraph($allMatches)
  else
    fts:ApplyFTScopeDifferentParagraph($allMatches)
}
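The four scope functions share one pattern, which the following Python sketch folds into a single function. The StringMatch class and the apply_ft_scope function are hypothetical, sentence and paragraph numbers are assumed to be available on each component, and the sample values are made up for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StringMatch:
    kind: str      # "include" or "exclude"
    pos: int       # word position
    sentence: int  # assumed sentence number of the token
    para: int      # assumed paragraph number of the token

def apply_ft_scope(all_matches, scope_type, unit):
    """Filter Matches by scope_type ("same" or "different") and unit ("sentence" or "paragraph")."""
    def key(s):
        return s.sentence if unit == "sentence" else s.para

    result = []
    for match in all_matches:
        includes = [s for s in match if s.kind == "include"]
        units = [key(s) for s in includes]
        if scope_type == "same":
            ok = len(set(units)) <= 1            # all includes share one unit
        else:
            ok = len(set(units)) == len(units)   # all includes in distinct units
        if not ok:
            continue
        excludes = [
            e for e in match
            if e.kind == "exclude"
            and all((key(e) == u) == (scope_type == "same") for u in units)
        ]
        result.append(includes + excludes)
    return result

# One Match whose two tokens share a paragraph but not a sentence.
match = [StringMatch("include", 2, 1, 1), StringMatch("include", 7, 2, 1)]
```

With this Match, same paragraph and different sentence both retain it, while same sentence and different paragraph filter it out.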
As an example, consider the FTSelection ("Mustang" && "Honda") same paragraph in the context of the sample document. The source AllMatches is:
The FTScope will convert this to an empty AllMatches because neither of the Matches contains TokenInfos from a single paragraph.
The parameters of the ApplyFTDistance function are the search context, the list of match options, one AllMatches parameter corresponding to the result of the nested FTSelections, the unit of the distance (words, sentences, paragraphs), and the range specification used. The search context is not used in this case. The semantics for the different cases, depending on the distance unit and the range specification, are given below.
The function for the case word distance exactly N is presented below:
declare function fts:ApplyFTWordDistanceExactly(
    $matchOptions as element(matchOptions, fts:FTMatchOptions),
    $allMatches as element(allMatches, fts:AllMatches),
    $n as xs:integer)
  as element(allMatches, fts:AllMatches)
{
  <allMatches>
  {
    for $match in $allMatches/match
    let $sorted := for $si in $match/stringInclude
                   order by $si/tokenInfo/@pos ascending
                   return $si
    where every $idx in (1 to fn:count($sorted) - 1)
          satisfies fts:wordDistance($sorted[$idx]/tokenInfo,
                                     $sorted[$idx+1]/tokenInfo,
                                     $matchOptions) = $n
    return
      <match>
        {$match/stringInclude}
        {
          for $stringExcl in $match/stringExclude
          where every $stringIncl in $match/stringInclude
                satisfies fts:wordDistance($stringIncl/tokenInfo,
                                           $stringExcl/tokenInfo,
                                           $matchOptions) = $n
          return $stringExcl
        }
      </match>
  }
  </allMatches>
}
Similarly, the semantics for the case of word distance at least N is presented below:
declare function fts:ApplyFTWordDistanceAtLeast(
    $matchOptions as element(matchOptions, fts:FTMatchOptions),
    $allMatches as element(allMatches, fts:AllMatches),
    $n as xs:integer)
  as element(allMatches, fts:AllMatches)
{
  <allMatches>
  {
    for $match in $allMatches/match
    let $sorted := for $si in $match/stringInclude
                   order by $si/tokenInfo/@pos ascending
                   return $si
    where every $index in (1 to fn:count($sorted) - 1)
          satisfies fts:wordDistance($sorted[$index]/tokenInfo,
                                     $sorted[$index+1]/tokenInfo,
                                     $matchOptions) >= $n
    return
      <match>
        {$match/stringInclude}
        {
          for $stringExcl in $match/stringExclude
          where every $stringIncl in $match/stringInclude
                satisfies fts:wordDistance($stringIncl/tokenInfo,
                                           $stringExcl/tokenInfo,
                                           $matchOptions) >= $n
          return $stringExcl
        }
      </match>
  }
  </allMatches>
}
The semantics for the case of word distance at most N is given by:
declare function fts:ApplyFTWordDistanceAtMost(
    $matchOptions as element(matchOptions, fts:FTMatchOptions),
    $allMatches as element(allMatches, fts:AllMatches),
    $n as xs:integer)
  as element(allMatches, fts:AllMatches)
{
  <allMatches>
  {
    for $match in $allMatches/match
    let $sorted := for $si in $match/stringInclude
                   order by $si/tokenInfo/@pos ascending
                   return $si
    where every $index in (1 to fn:count($sorted) - 1)
          satisfies fts:wordDistance($sorted[$index]/tokenInfo,
                                     $sorted[$index+1]/tokenInfo,
                                     $matchOptions) <= $n
    return
      <match>
        {$match/stringInclude}
        {
          for $stringExcl in $match/stringExclude
          where every $stringIncl in $match/stringInclude
                satisfies fts:wordDistance($stringIncl/tokenInfo,
                                           $stringExcl/tokenInfo,
                                           $matchOptions) <= $n
          return $stringExcl
        }
      </match>
  }
  </allMatches>
}
The semantics for the final case of word distance from M to N is given by:
declare function fts:ApplyFTWordDistanceFromTo(
    $matchOptions as element(matchOptions, fts:FTMatchOptions),
    $allMatches as element(allMatches, fts:AllMatches),
    $m as xs:integer,
    $n as xs:integer)
  as element(allMatches, fts:AllMatches)
{
  <allMatches>
  {
    for $match in $allMatches/match
    let $sorted := for $si in $match/stringInclude
                   order by $si/tokenInfo/@pos ascending
                   return $si
    where every $index in (1 to fn:count($sorted) - 1)
          satisfies fts:wordDistance($sorted[$index]/tokenInfo,
                                     $sorted[$index+1]/tokenInfo,
                                     $matchOptions) >= $m
                    and
                    fts:wordDistance($sorted[$index]/tokenInfo,
                                     $sorted[$index+1]/tokenInfo,
                                     $matchOptions) <= $n
    return
      <match>
        {$match/stringInclude}
        {
          for $stringExcl in $match/stringExclude
          where every $stringIncl in $match/stringInclude
                satisfies fts:wordDistance($stringIncl/tokenInfo,
                                           $stringExcl/tokenInfo,
                                           $matchOptions) >= $m
                          and
                          fts:wordDistance($stringIncl/tokenInfo,
                                           $stringExcl/tokenInfo,
                                           $matchOptions) <= $n
          return $stringExcl
        }
      </match>
  }
  </allMatches>
}
The function for the case sentence distance exactly N is presented below:
declare function fts:ApplyFTSentenceDistanceExactly(
    $matchOptions as element(matchOptions, fts:FTMatchOptions),
    $allMatches as element(allMatches, fts:AllMatches),
    $n as xs:integer)
  as element(allMatches, fts:AllMatches)
{
  <allMatches>
  {
    for $match in $allMatches/match
    let $sorted := for $si in $match/stringInclude
                   order by $si/tokenInfo/@pos ascending
                   return $si
    where every $index in (1 to fn:count($sorted) - 1)
          satisfies fts:sentenceDistance($sorted[$index]/tokenInfo,
                                         $sorted[$index+1]/tokenInfo,
                                         $matchOptions) = $n
    return
      <match>
        {$match/stringInclude}
        {
          for $stringExcl in $match/stringExclude
          where every $stringIncl in $match/stringInclude
                satisfies fts:sentenceDistance($stringIncl/tokenInfo,
                                               $stringExcl/tokenInfo,
                                               $matchOptions) = $n
          return $stringExcl
        }
      </match>
  }
  </allMatches>
}
Similarly, the semantics for the case of sentence distance at least N is presented below:
declare function fts:ApplyFTSentenceDistanceAtLeast(
    $matchOptions as element(matchOptions, fts:FTMatchOptions),
    $allMatches as element(allMatches, fts:AllMatches),
    $n as xs:integer)
  as element(allMatches, fts:AllMatches)
{
  <allMatches>
  {
    for $match in $allMatches/match
    let $sorted := for $si in $match/stringInclude
                   order by $si/tokenInfo/@pos ascending
                   return $si
    where every $index in (1 to fn:count($sorted) - 1)
          satisfies fts:sentenceDistance($sorted[$index]/tokenInfo,
                                         $sorted[$index+1]/tokenInfo,
                                         $matchOptions) >= $n
    return
      <match>
        {$match/stringInclude}
        {
          for $stringExcl in $match/stringExclude
          where every $stringIncl in $match/stringInclude
                satisfies fts:sentenceDistance($stringIncl/tokenInfo,
                                               $stringExcl/tokenInfo,
                                               $matchOptions) >= $n
          return $stringExcl
        }
      </match>
  }
  </allMatches>
}
The semantics for the case of sentence distance at most N is given by:
declare function fts:ApplyFTSentenceDistanceAtMost(
    $matchOptions as element(matchOptions, fts:FTMatchOptions),
    $allMatches as element(allMatches, fts:AllMatches),
    $n as xs:integer)
  as element(allMatches, fts:AllMatches)
{
  <allMatches>
  {
    for $match in $allMatches/match
    let $sorted := for $si in $match/stringInclude
                   order by $si/tokenInfo/@pos ascending
                   return $si
    where every $index in (1 to fn:count($sorted) - 1)
          satisfies fts:sentenceDistance($sorted[$index]/tokenInfo,
                                         $sorted[$index+1]/tokenInfo,
                                         $matchOptions) <= $n
    return
      <match>
        {$match/stringInclude}
        {
          for $stringExcl in $match/stringExclude
          where every $stringIncl in $match/stringInclude
                satisfies fts:sentenceDistance($stringIncl/tokenInfo,
                                               $stringExcl/tokenInfo,
                                               $matchOptions) <= $n
          return $stringExcl
        }
      </match>
  }
  </allMatches>
}
The semantics for the final case of sentence distance from M to N is given by:
declare function fts:ApplyFTSentenceDistanceFromTo(
    $matchOptions as element(matchOptions, fts:FTMatchOptions),
    $allMatches as element(allMatches, fts:AllMatches),
    $m as xs:integer,
    $n as xs:integer)
  as element(allMatches, fts:AllMatches)
{
  <allMatches>
  {
    for $match in $allMatches/match
    let $sorted := for $si in $match/stringInclude
                   order by $si/tokenInfo/@pos ascending
                   return $si
    where every $index in (1 to fn:count($sorted) - 1)
          satisfies fts:sentenceDistance($sorted[$index]/tokenInfo,
                                         $sorted[$index+1]/tokenInfo,
                                         $matchOptions) >= $m
                    and
                    fts:sentenceDistance($sorted[$index]/tokenInfo,
                                         $sorted[$index+1]/tokenInfo,
                                         $matchOptions) <= $n
    return
      <match>
        {$match/stringInclude}
        {
          for $stringExcl in $match/stringExclude
          where every $stringIncl in $match/stringInclude
                satisfies fts:sentenceDistance($stringIncl/tokenInfo,
                                               $stringExcl/tokenInfo,
                                               $matchOptions) >= $m
                          and
                          fts:sentenceDistance($stringIncl/tokenInfo,
                                               $stringExcl/tokenInfo,
                                               $matchOptions) <= $n
          return $stringExcl
        }
      </match>
  }
  </allMatches>
}
The function for the case paragraph distance exactly N is presented below:
declare function fts:ApplyFTParagraphDistanceExactly(
    $matchOptions as element(matchOptions, fts:FTMatchOptions),
    $allMatches as element(allMatches, fts:AllMatches),
    $n as xs:integer)
  as element(allMatches, fts:AllMatches)
{
  <allMatches>
  {
    for $match in $allMatches/match
    let $sorted := for $si in $match/stringInclude
                   order by $si/tokenInfo/@pos ascending
                   return $si
    where every $index in (1 to fn:count($sorted) - 1)
          satisfies fts:paraDistance($sorted[$index]/tokenInfo,
                                     $sorted[$index+1]/tokenInfo,
                                     $matchOptions) = $n
    return
      <match>
        {$match/stringInclude}
        {
          for $stringExcl in $match/stringExclude
          where every $stringIncl in $match/stringInclude
                satisfies fts:paraDistance($stringIncl/tokenInfo,
                                           $stringExcl/tokenInfo,
                                           $matchOptions) = $n
          return $stringExcl
        }
      </match>
  }
  </allMatches>
}
Similarly, the semantics for the case of paragraph distance at least N is presented below:
declare function fts:ApplyFTParagraphDistanceAtLeast(
    $matchOptions as element(matchOptions, fts:FTMatchOptions),
    $allMatches as element(allMatches, fts:AllMatches),
    $n as xs:integer)
  as element(allMatches, fts:AllMatches)
{
  <allMatches>
  {
    for $match in $allMatches/match
    let $sorted := for $si in $match/stringInclude
                   order by $si/tokenInfo/@pos ascending
                   return $si
    where every $index in (1 to fn:count($sorted) - 1)
          satisfies fts:paraDistance($sorted[$index]/tokenInfo,
                                     $sorted[$index+1]/tokenInfo,
                                     $matchOptions) >= $n
    return
      <match>
        {$match/stringInclude}
        {
          for $stringExcl in $match/stringExclude
          where every $stringIncl in $match/stringInclude
                satisfies fts:paraDistance($stringIncl/tokenInfo,
                                           $stringExcl/tokenInfo,
                                           $matchOptions) >= $n
          return $stringExcl
        }
      </match>
  }
  </allMatches>
}
The semantics for the case of paragraph distance at most N is given by:
declare function fts:ApplyFTParagraphDistanceAtMost(
    $matchOptions as element(matchOptions, fts:FTMatchOptions),
    $allMatches as element(allMatches, fts:AllMatches),
    $n as xs:integer)
  as element(allMatches, fts:AllMatches)
{
  <allMatches>
  {
    for $match in $allMatches/match
    let $sorted := for $si in $match/stringInclude
                   order by $si/tokenInfo/@pos ascending
                   return $si
    where every $index in (1 to fn:count($sorted) - 1)
          satisfies fts:paraDistance($sorted[$index]/tokenInfo,
                                     $sorted[$index+1]/tokenInfo,
                                     $matchOptions) <= $n
    return
      <match>
        {$match/stringInclude}
        {
          for $stringExcl in $match/stringExclude
          where every $stringIncl in $match/stringInclude
                satisfies fts:paraDistance($stringIncl/tokenInfo,
                                           $stringExcl/tokenInfo,
                                           $matchOptions) <= $n
          return $stringExcl
        }
      </match>
  }
  </allMatches>
}
The semantics for the final case of paragraph distance from M to N is given by:
declare function fts:ApplyFTParagraphDistanceFromTo(
    $matchOptions as element(matchOptions, fts:FTMatchOptions),
    $allMatches as element(allMatches, fts:AllMatches),
    $m as xs:integer,
    $n as xs:integer)
  as element(allMatches, fts:AllMatches)
{
  <allMatches>
  {
    for $match in $allMatches/match
    let $sorted := for $si in $match/stringInclude
                   order by $si/tokenInfo/@pos ascending
                   return $si
    where every $index in (1 to fn:count($sorted) - 1)
          satisfies fts:paraDistance($sorted[$index]/tokenInfo,
                                     $sorted[$index+1]/tokenInfo,
                                     $matchOptions) >= $m
                    and
                    fts:paraDistance($sorted[$index]/tokenInfo,
                                     $sorted[$index+1]/tokenInfo,
                                     $matchOptions) <= $n
    return
      <match>
        {$match/stringInclude}
        {
          for $stringExcl in $match/stringExclude
          where every $stringIncl in $match/stringInclude
                satisfies fts:paraDistance($stringIncl/tokenInfo,
                                           $stringExcl/tokenInfo,
                                           $matchOptions) >= $m
                          and
                          fts:paraDistance($stringIncl/tokenInfo,
                                           $stringExcl/tokenInfo,
                                           $matchOptions) <= $n
          return $stringExcl
        }
      </match>
  }
  </allMatches>
}
Intuitively, the resulting AllMatches contains those Matches of the operand in which the distance (measured in words, sentences, or paragraphs) between every pair of consecutive valid positions in stringInclude elements lies in the specified interval. Here, by consecutive we mean with no other valid positions from the same stringInclude element between them.
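The filtering condition can also be illustrated outside XQuery. The following Python sketch is not part of the formal semantics. It assumes that a Match is reduced to the list of integer positions of its stringInclude TokenInfos and that the distance between two positions is simply their difference; the function names are hypothetical, and the handling of stringExclude elements is omitted.

```python
def consecutive_distances(positions):
    """Return the distances between neighbouring positions once sorted."""
    ordered = sorted(positions)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


def satisfies_distance(positions, low=None, high=None):
    """Check that every consecutive distance lies in [low, high].

    A bound of None leaves that side of the interval open, which covers
    the "at least" and "at most" cases; low == high covers "exactly".
    """
    return all(
        (low is None or distance >= low) and (high is None or distance <= high)
        for distance in consecutive_distances(positions)
    )


# Positions 1, 2 and 5 give consecutive distances 1 and 3.
print(consecutive_distances([1, 2, 5]))       # [1, 3]
print(satisfies_distance([1, 2, 5], high=3))  # True
print(satisfies_distance([2, 35], high=3))    # False
```

A Match with fewer than two included positions passes trivially in this sketch, mirroring an every quantifier that ranges over an empty sequence.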
In the general case, the semantics is given by:
declare function fts:ApplyFTDistance(
    $matchOptions as element(matchOptions, fts:FTMatchOptions),
    $type as fts:DistanceType,
    $range as element(range, fts:FTRangeSpec),
    $allMatches as element(allMatches, fts:AllMatches)
) as element(allMatches, fts:AllMatches) {
  if ($type eq "word") then
    if ($range/@type eq "exactly") then
      fts:ApplyFTWordDistanceExactly($matchOptions, $allMatches, $range/@n)
    else if ($range/@type eq "at least") then
      fts:ApplyFTWordDistanceAtLeast($matchOptions, $allMatches, $range/@n)
    else if ($range/@type eq "at most") then
      fts:ApplyFTWordDistanceAtMost($matchOptions, $allMatches, $range/@n)
    else
      fts:ApplyFTWordDistanceFromTo($matchOptions, $allMatches, $range/@m, $range/@n)
  else if ($type eq "sentence") then
    if ($range/@type eq "exactly") then
      fts:ApplyFTSentenceDistanceExactly($matchOptions, $allMatches, $range/@n)
    else if ($range/@type eq "at least") then
      fts:ApplyFTSentenceDistanceAtLeast($matchOptions, $allMatches, $range/@n)
    else if ($range/@type eq "at most") then
      fts:ApplyFTSentenceDistanceAtMost($matchOptions, $allMatches, $range/@n)
    else
      fts:ApplyFTSentenceDistanceFromTo($matchOptions, $allMatches, $range/@m, $range/@n)
  else if ($range/@type eq "exactly") then
    fts:ApplyFTParagraphDistanceExactly($matchOptions, $allMatches, $range/@n)
  else if ($range/@type eq "at least") then
    fts:ApplyFTParagraphDistanceAtLeast($matchOptions, $allMatches, $range/@n)
  else if ($range/@type eq "at most") then
    fts:ApplyFTParagraphDistanceAtMost($matchOptions, $allMatches, $range/@n)
  else
    fts:ApplyFTParagraphDistanceFromTo($matchOptions, $allMatches, $range/@m, $range/@n)
}
As an example, consider the FTDistance selection ("Ford Mustang" && "excellent") word distance at most 3
over the sample document. The six Matches of the source AllMatches for ("Ford Mustang" && "excellent")
are given below:
The result for the above FTDistance selection consists of only the first Match, because only in that Match is the distance between consecutive TokenInfos (distance 1 and distance 3 in this case) less than or equal to 3.
The parameters of the ApplyFTWindow
function are the search context, the list of match options, a range specification, and one AllMatches parameter corresponding to the result of the nested FTSelections. The search context is not used in this case. The semantics for the different cases depending on the range specification FTRange used follow.
The function for the case word window exactly N
is presented below:
declare function fts:ApplyFTWordWindowExactly(
    $matchOptions as element(matchOptions, fts:FTMatchOptions),
    $allMatches as element(allMatches, fts:AllMatches),
    $n as xs:integer
) as element(allMatches, fts:AllMatches) {
  <allMatches>
  {
    for $match in $allMatches/match
    let $minpos := fn:min($match/stringInclude/tokenInfo/@pos),
        $maxpos := fn:max($match/stringInclude/tokenInfo/@pos),
        $tokenInfo1 := $match/stringInclude/tokenInfo[@pos = $minpos][1],
        $tokenInfo2 := $match/stringInclude/tokenInfo[@pos = $maxpos][1],
        $windowSize := fts:wordDistance1($tokenInfo1, $tokenInfo2, $matchOptions) + 1
    where $windowSize = $n
    return
      <match>
        {$match/stringInclude}
        {
          for $stringExclude in $match/stringExclude
          where fts:wordDistance1($tokenInfo1, $stringExclude/tokenInfo, $matchOptions) <= $windowSize
            and fts:wordDistance1($stringExclude/tokenInfo, $tokenInfo2, $matchOptions) <= $windowSize
          return $stringExclude
        }
      </match>
  }
  </allMatches>
}
The function for the case word window at least N
is presented below:
declare function fts:ApplyFTWordWindowAtLeast(
    $matchOptions as element(matchOptions, fts:FTMatchOptions),
    $allMatches as element(allMatches, fts:AllMatches),
    $n as xs:integer
) as element(allMatches, fts:AllMatches) {
  <allMatches>
  {
    for $match in $allMatches/match
    let $minpos := fn:min($match/stringInclude/tokenInfo/@pos),
        $maxpos := fn:max($match/stringInclude/tokenInfo/@pos),
        $tokenInfo1 := $match/stringInclude/tokenInfo[@pos = $minpos][1],
        $tokenInfo2 := $match/stringInclude/tokenInfo[@pos = $maxpos][1],
        $windowSize := fts:wordDistance1($tokenInfo1, $tokenInfo2, $matchOptions) + 1
    where $windowSize >= $n
    return
      <match>
        {$match/stringInclude}
        {
          for $stringExclude in $match/stringExclude
          where fts:wordDistance1($tokenInfo1, $stringExclude/tokenInfo, $matchOptions) <= $windowSize
            and fts:wordDistance1($stringExclude/tokenInfo, $tokenInfo2, $matchOptions) <= $windowSize
          return $stringExclude
        }
      </match>
  }
  </allMatches>
}
The function for the case word window at most N
is presented below:
declare function fts:ApplyFTWordWindowAtMost(
    $matchOptions as element(matchOptions, fts:FTMatchOptions),
    $allMatches as element(allMatches, fts:AllMatches),
    $n as xs:integer
) as element(allMatches, fts:AllMatches) {
  <allMatches>
  {
    for $match in $allMatches/match
    let $minpos := fn:min($match/stringInclude/tokenInfo/@pos),
        $maxpos := fn:max($match/stringInclude/tokenInfo/@pos),
        $tokenInfo1 := $match/stringInclude/tokenInfo[@pos = $minpos][1],
        $tokenInfo2 := $match/stringInclude/tokenInfo[@pos = $maxpos][1],
        $windowSize := fts:wordDistance1($tokenInfo1, $tokenInfo2, $matchOptions) + 1
    where $windowSize <= $n
    return
      <match>
        {$match/stringInclude}
        {
          for $stringExclude in $match/stringExclude
          where fts:wordDistance1($tokenInfo1, $stringExclude/tokenInfo, $matchOptions) <= $windowSize
            and fts:wordDistance1($stringExclude/tokenInfo, $tokenInfo2, $matchOptions) <= $windowSize
          return $stringExclude
        }
      </match>
  }
  </allMatches>
}
The function for the case word window from M to N
is presented below:
declare function fts:ApplyFTWordWindowFromTo(
    $matchOptions as element(matchOptions, fts:FTMatchOptions),
    $allMatches as element(allMatches, fts:AllMatches),
    $m as xs:integer,
    $n as xs:integer
) as element(allMatches, fts:AllMatches) {
  <allMatches>
  {
    for $match in $allMatches/match
    let $minpos := fn:min($match/stringInclude/tokenInfo/@pos),
        $maxpos := fn:max($match/stringInclude/tokenInfo/@pos),
        $tokenInfo1 := $match/stringInclude/tokenInfo[@pos = $minpos][1],
        $tokenInfo2 := $match/stringInclude/tokenInfo[@pos = $maxpos][1],
        $windowSize := fts:wordDistance1($tokenInfo1, $tokenInfo2, $matchOptions) + 1
    where ($windowSize >= $m) and ($windowSize <= $n)
    return
      <match>
        {$match/stringInclude}
        {
          for $stringExclude in $match/stringExclude
          where fts:wordDistance1($tokenInfo1, $stringExclude/tokenInfo, $matchOptions) <= $windowSize
            and fts:wordDistance1($stringExclude/tokenInfo, $tokenInfo2, $matchOptions) <= $windowSize
          return $stringExclude
        }
      </match>
  }
  </allMatches>
}
Intuitively, the resulting AllMatches contains those Matches of the operand for which the distance between the maximum position and the minimum position plus two (because the window includes both positions) is within the specified interval. Only those StringExcludes that fall within the specified window range are retained.
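As an informal illustration, the window test can be sketched in Python. This sketch is not part of the formal semantics. It assumes that a Match is reduced to the integer positions of its stringInclude TokenInfos and that the window size is the number of positions spanned from the smallest to the largest position, both included; the function names are hypothetical, and the retention of StringExcludes is not modelled.

```python
def window_size(positions):
    """Number of positions spanned from the smallest to the largest, inclusive."""
    return max(positions) - min(positions) + 1


def within_window(positions, low=None, high=None):
    """Check that the window size lies in [low, high].

    A bound of None leaves that side of the interval open.
    """
    size = window_size(positions)
    return (low is None or size >= low) and (high is None or size <= high)


# Positions 1, 2 and 5 span a window of five positions.
print(window_size([1, 2, 5]))             # 5
print(within_window([1, 2, 5], high=10))  # True
print(within_window([1, 2, 25], high=10)) # False
```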
In the general case, the semantics is given by:
declare function fts:ApplyFTWindow(
    $matchOptions as element(matchOptions, fts:FTMatchOptions),
    $range as element(range, fts:FTRangeSpec),
    $allMatches as element(allMatches, fts:AllMatches)
) as element(allMatches, fts:AllMatches) {
  if ($range/@type eq "exactly") then
    fts:ApplyFTWordWindowExactly($matchOptions, $allMatches, $range/@n)
  else if ($range/@type eq "at least") then
    fts:ApplyFTWordWindowAtLeast($matchOptions, $allMatches, $range/@n)
  else if ($range/@type eq "at most") then
    fts:ApplyFTWordWindowAtMost($matchOptions, $allMatches, $range/@n)
  else
    fts:ApplyFTWordWindowFromTo($matchOptions, $allMatches, $range/@m, $range/@n)
}
As an example, consider the FTWindow selection ("Ford Mustang" && "excellent") window at most 10
over the sample document. The six Matches of the source AllMatches for ("Ford Mustang" && "excellent")
are given below:
The result for the above FTWindow selection will consist of only the first, the fifth, and the sixth Matches because their respective window sizes are 5, 4, and 9.
The parameters of the ApplyFTTimes
function are the search context, the list of match options, a range specification, and one AllMatches parameter corresponding to the result of the nested FTSelection. The search context and the match options stack are not used in this case.
The function definitions, depending on the range specification FTRange limiting the number of occurrences, follow.
declare function fts:FormCombinations($sms, $times) {
  if (fn:count($sms) lt $times) then ()
  else if (fn:count($sms) eq $times) then
    <match> {$sms/*} </match>
  else (
    fts:FormCombinations(fn:subsequence($sms, 2), $times),
    <match>
      {$sms[1]/*}
      {fts:FormCombinations(fn:subsequence($sms, 2), $times - 1)/*}
    </match>
  )
}

declare function fts:FormRange($sms, $l, $u) {
  let $lower_match := <allMatches> {fts:FormCombinations($sms, $l)} </allMatches>
  return
    if ($l > $u) then ()
    else fts:ApplyFTAnd(
           $lower_match,
           fts:ApplyFTUnaryNot(
             <allMatches> {fts:FormCombinations($sms, $u + 1)} </allMatches>))
}
We now define the semantics for the case exactly N occurrences
:
declare function fts:ApplyFTTimesExactly(
    $allMatches as element(allMatches, fts:AllMatches),
    $n as xs:integer
) as element(allMatches, fts:AllMatches) {
  fts:FormRange($allMatches/match, $n, $n)
}
We next define the semantics for the case at least N occurrences
:
declare function fts:ApplyFTTimesAtLeast(
    $allMatches as element(allMatches, fts:AllMatches),
    $n as xs:integer
) as element(allMatches, fts:AllMatches) {
  <allMatches> {fts:FormCombinations($allMatches/match, $n)} </allMatches>
}
We next define the semantics for the case at most N occurrences
:
declare function fts:ApplyFTTimesAtMost(
    $allMatches as element(allMatches, fts:AllMatches),
    $n as xs:integer
) as element(allMatches, fts:AllMatches) {
  fts:FormRange($allMatches/match, 0, $n)
}
Finally, we define the semantics for the case from M to N occurrences
:
declare function fts:ApplyFTTimesFromTo(
    $allMatches as element(allMatches, fts:AllMatches),
    $m as xs:integer,
    $n as xs:integer
) as element(allMatches, fts:AllMatches) {
  fts:FormRange($allMatches/match, $m, $n)
}
The intuition is as follows. The way to ensure that there are at least N different matches of an FTSelection is to ensure that at least N of its Matches occur simultaneously. This is similar to forming their conjunction: combine N distinct Matches into one simple match. Therefore, the full match for the selection condition involving the range specifier at least N
is obtained by forming all possible combinations of N simple matches of the operand and forming
one simple match for each combination, negating the rest of the simple matches. This operation is performed in the function fts:FormCombinations
.
In the case of the range [l, u], it is treated as the condition at least l and not at least u + 1
. This transformation is performed in the function fts:FormRange
.
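The combination step can be sketched informally in Python. This is not part of the formal semantics: it assumes each simple match is reduced to a set of included positions, it does not model the negation of the remaining matches, and the function names are hypothetical. The second function only illustrates the rewriting of a range [l, u] on a plain occurrence count.

```python
from itertools import combinations


def form_combinations(matches, times):
    """Merge every choice of `times` distinct matches into one combined match."""
    return [frozenset().union(*combo) for combo in combinations(matches, times)]


def count_in_range(count, lower, upper):
    """Express [lower, upper] as "at least lower and not at least upper + 1"."""
    at_least_lower = count >= lower
    at_least_upper_plus_one = count >= upper + 1
    return at_least_lower and not at_least_upper_plus_one


# Two single-position matches combine into one match holding both positions.
pairs = form_combinations([frozenset({2}), frozenset({28})], 2)
print([sorted(match) for match in pairs])  # [[2, 28]]
```

Asking for more matches than are available yields an empty list, which corresponds to the first branch of fts:FormCombinations.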
The semantics in the general case is given by:
declare function fts:ApplyFTTimes(
    $range as element(range, fts:FTRangeSpec),
    $allMatches as element(allMatches, fts:AllMatches)
) as element(allMatches, fts:AllMatches) {
  if ($range/@type eq "exactly") then
    fts:ApplyFTTimesExactly($allMatches, $range/@n)
  else if ($range/@type eq "at least") then
    fts:ApplyFTTimesAtLeast($allMatches, $range/@n)
  else if ($range/@type eq "at most") then
    fts:ApplyFTTimesAtMost($allMatches, $range/@n)
  else
    fts:ApplyFTTimesFromTo($allMatches, $range/@m, $range/@n)
}
As an example, consider the FTTimes selection "Mustang" at least 2 occurrences
over the sample document. The source AllMatches of the FTWords selection "Mustang"
is:
The result will consist of all couples of Matches from above:
We define the following type: An ordered sequence of match options.
<xs:schema targetNamespace="http://www.w3.org/2004/07/xquery-full-text" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:fts="http://www.w3.org/2004/07/xquery-full-text" elementFormDefault="qualified" attributeFormDefault="unqualified"> <xs:complexType name="FTMatchOptions"> <xs:sequence> <xs:element name="matchOption" type="fts:FTMatchOption"/> </xs:sequence> </xs:complexType> <xs:complexType name="FTMatchOption"> <xs:choice> <xs:element name="case" type="fts:FTCaseOption" /> <xs:element name="diacritics" type="fts:FTDiacriticsOption" /> <xs:element name="specialChar" type="fts:FTSpecialcharOption" /> <xs:element name="thesaurus" type="fts:FTThesaurusOption" /> <xs:element name="stem" type="fts:FTStemOption" /> <xs:element name="regex" type="fts:FTRegexOption" /> <xs:element name="language" type="fts:FTLanguageOption" /> <xs:element name="stopWord" type="fts:FTStopwordOption" /> <xs:element name="ignore" type="fts:FTIgnoreMatchOption" /> </xs:choice> </xs:complexType> <xs:complexType name="FTCaseOption"> <xs:attribute name="caseIndicator"> <xs:simpleType> <xs:restriction base="xs:string"> <xs:enumeration value="insensitive"/> <xs:enumeration value="sensitive"/> <xs:enumeration value="lowercase"/> <xs:enumeration value="uppercase"/> </xs:restriction> </xs:simpleType> </xs:attribute> <xs:attribute name="caseLanguage" type="xs:string"/> </xs:complexType> <xs:complexType name="FTDiacriticsOption"> <xs:attribute name="diacriticsIndicator"> <xs:simpleType> <xs:restriction base="xs:string"> <xs:enumeration value="insensitive"/> <xs:enumeration value="sensitive"/> <xs:enumeration value="with"/> <xs:enumeration value="without"/> </xs:restriction> </xs:simpleType> </xs:attribute> <xs:attribute name="language" type="xs:string"/> </xs:complexType> <xs:complexType name="FTSpecialcharOption"> <xs:attribute name="specialCharIndicator"> <xs:simpleType> <xs:restriction base="xs:string"> <xs:enumeration value="with"/> <xs:enumeration value="without"/> </xs:restriction> 
</xs:simpleType> </xs:attribute> <xs:attribute name="language" type="xs:string"/> </xs:complexType> <xs:complexType name="FTThesaurusOption"> <xs:sequence> <xs:element name="thesaurusName" type="xs:string" minOccurs="1" maxOccurs="unbounded"/> </xs:sequence> <xs:attribute name="thesaurusIndicator"> <xs:simpleType> <xs:restriction base="xs:string"> <xs:enumeration value="with"/> <xs:enumeration value="without"/> </xs:restriction> </xs:simpleType> </xs:attribute> <xs:attribute name="language" type="xs:string"/> </xs:complexType> <xs:complexType name="FTStemOption"> <xs:attribute name="stemIndicator"> <xs:simpleType> <xs:restriction base="xs:string"> <xs:enumeration value="with"/> <xs:enumeration value="without"/> </xs:restriction> </xs:simpleType> </xs:attribute> <xs:attribute name="language" type="xs:string"/> </xs:complexType> <xs:complexType name="FTRegexOption"> <xs:attribute name="regexIndicator"> <xs:simpleType> <xs:restriction base="xs:string"> <xs:enumeration value="with"/> <xs:enumeration value="without"/> </xs:restriction> </xs:simpleType> </xs:attribute> <xs:attribute name="language" type="xs:string"/> </xs:complexType> <xs:complexType name="FTLanguageOption"> <xs:attribute name="languageName" type="xs:string"/> </xs:complexType> <xs:complexType name="FTStopwordOption"> <xs:sequence> <xs:element name="additionalStopWords" type="xs:string" minOccurs="0" maxOccurs="unbounded"/> </xs:sequence> <xs:attribute name="stopWordIndicator"> <xs:simpleType> <xs:restriction base="xs:string"> <xs:enumeration value="with"/> <xs:enumeration value="without"/> </xs:restriction> </xs:simpleType> </xs:attribute> <xs:attribute name="language" type="xs:string"/> </xs:complexType> <xs:complexType name="FTIgnoreMatchOption"> <xs:sequence> <xs:element name="tagName" type="xs:anyType" minOccurs="0" maxOccurs="unbounded"/> <xs:element name="contentName" type="xs:anyType" minOccurs="0" maxOccurs="unbounded"/> </xs:sequence> </xs:complexType> </xs:schema>
Modification to the current semantics.
declare function fts:applySingleSearchToken(
    $searchContext as node(),
    $matchOptions as fts:FTMatchOptions,
    $searchToken as fts:TokenInfo,
    $queryPos as xs:integer
) as element(allMatches, fts:AllMatches) {
  <allMatches>
  {
    let $searchTokens :=
      if ($matchOptions//regex)
      then fts:applyRegexOption($searchContext, $matchOptions, $searchToken/@word)
      else $searchToken
    let $searchTokens2 :=
      if ($matchOptions//stopWord)
      then fts:applyStopWordOption($searchContext, $matchOptions, $searchTokens)
      else $searchTokens
    let $effectiveOptions := $matchOptions except $matchOptions[self::regex]
    let $token_pos := fts:matchStr($searchContext, $effectiveOptions, $searchTokens2)
    for $pos in $token_pos
    return
      <match>
        <stringInclude queryPos="{$queryPos}" queryString="{$searchToken/@word}">
          <tokenInfo>{$pos}</tokenInfo>
        </stringInclude>
      </match>
  }
  </allMatches>
}

declare function fts:matchStr(
    $searchContext as node(),
    $matchOptions as fts:FTMatchOptions,
    $searchToken as xs:string
) as element(tokenInfo, fts:TokenInfo)* {
  let $nonexpOptions := $matchOptions[self::language or self::ignore]
  let $expOptions := $matchOptions except $nonexpOptions
  let $searchTokens := $searchToken
  for $matchOption in $expOptions
  let $searchTokens := (fts:applyMatchOption($matchOption, $searchTokens), $searchTokens)
  return fts:getTokenInfo($searchContext, $nonexpOptions, $searchTokens)
}

declare function fts:applyMatchOption(
    $matchOption as fts:FTMatchOption,
    $searchTokens as xs:string*
) as xs:string* {
  if (fn:local-name($matchOption) eq "case") then
    fts:applyCaseOption($matchOption, $searchTokens)
  else if (fn:local-name($matchOption) eq "diacritics") then
    fts:applyDiacriticsOption($matchOption, $searchTokens)
  else if (fn:local-name($matchOption) eq "specialChar") then
    fts:applySpecialCharOption($matchOption, $searchTokens)
  else if (fn:local-name($matchOption) eq "thesaurus") then
    fts:applyThesaurusOption($matchOption, $searchTokens)
  else if (fn:local-name($matchOption) eq "stem") then
    fts:applyStemOption($matchOption, $searchTokens)
  else
    $searchTokens
}
function fts:lowerCase($word as xs:string, $caseLanguage as xs:string) as xs:string* function fts:upperCase($word as xs:string, $caseLanguage as xs:string) as xs:string* function fts:insensitiveCase($word as xs:string, $caseLanguage as xs:string) as xs:string* function fts:removeDiacritics( $word as xs:string, $diacriticsLanguage as xs:string) as xs:string* function fts:insensitiveDiacritics( $word as xs:string, $diacriticsLanguage as xs:string) as xs:string* function fts:removeSpecialCharOption( $word as xs:string, $specialCharLanguage as xs:string) as xs:string* function fts:getThesaurus($word as xs:string, $thesaurusName as xs:string, $thesaurusLanguage as xs:string) as xs:string* function fts:stemmedForm($word as xs:string, $stemLanguage as xs:string) as xs:string* function fts:regexForm($word as xs:string, $regexLanguage as xs:string) as xs:string*
declare function fts:applyCaseOption($matchOption as fts:FTMatchOption, $searchTokens as xs:string*) as string* { let $searchToken := first($searchTokens) let $nextTokens := except($searchTokens,$searchToken) if ($matchOption/@caseIndicator = "lowercase") let $returnedTokens := lowerCase($searchToken, $matchOption/@language), applyCaseOption($matchOption, $nextTokens) else if ($matchOption/@caseIndicator = "uppercase") { let $returnedTokens := upperCase($searchToken, $matchOption/@language), applyCaseOption($matchOption, $nextTokens) } else if ($matchOption/@caseIndicator = "insensitive") let $returnedTokens := insensitiveCase($searchToken, $matchOption/@language), applyCaseOption($matchOption, $nextTokens) else //case where $matchOption/@caseIndicator="sensitive" //operates on data?? return $returnedTokens }
declare function fts:applyDiacriticsOption( $matchOption as fts:FTMatchOption, $searchTokens as xs:string*) as xs:string* { let $searchToken := first($searchTokens) let $nextTokens := except($searchTokens,$searchToken) if ($matchOption/@diacriticsIndicator = "with") let $returnedTokens := $searchTokens else if ($matchOption/@diacriticsIndicator = "without") { let $returnedTokens := removeDiacritics($searchToken, $matchOption/@language), applyDiacriticsOption($matchOption, $nextTokens) } else if ($matchOption/@diacriticsIndicator = "insensitive") let $returnedTokens := insensitiveDiacritics( $searchToken, $matchOption/@language), applyDiacriticsOption($matchOption, $nextTokens) else //case where $matchOption/@diacriticsIndicator= //"sensitive" operates on data?? return $returnedTokens }
declare function fts:applySpecialCharOption( $matchOption as fts:FTMatchOption, $searchTokens as xs:string*) as xs:string* { if ($matchOption/@specialCharIndicator = "with") let $returnedTokens := $searchTokens else if ($matchOption/@specialCharIndicator = "without") { let $searchToken := first($searchTokens) let $nextTokens := except($searchTokens,$searchToken) let $returnedTokens := removeSpecialCharOption( $searchToken, $matchOption/@language), applySpecialCharOption($matchOption, $nextTokens) } return $returnedTokens }
declare function fts:applyStemOption($matchOption as fts:FTMatchOption, $searchTokens as xs:string*) as xs:string* { if ($matchOption/@stemIndicator = "with") { let $searchToken := first($searchTokens) let $nextTokens := except($searchTokens,$searchToken) let $returnedTokens := stemmedForm($searchToken, $matchOption/@language), applyStemOption($matchOption, $nextTokens) } else if ($matchOption/@stemIndicator = "without") let $returnedTokens := $searchTokens return $returnedTokens }
declare function fts:applyThesaurusOption( $matchOption as fts:FTMatchOption, $searchTokens as xs:string*) as xs:string* { if ($matchOption/@thesaurusIndicator = "with") { let $searchToken := first($searchTokens) let $nextTokens := except($searchTokens,$searchToken) let $returnedTokens := applyThesaurus( $matchOption/@thesaurusName, $matchOption/@language, $searchToken), applyThesaurusOption($matchOption, $nextTokens) } else if ($matchOption/@thesaurusIndicator = "without") let $returnedTokens := $searchTokens return $returnedTokens } declare function fts:applyThesaurus($thesaurusName as xs:string*, $thesaurusLanguage as xs:string, $searchToken as xs:string) as xs:string* { let $returnedTokens as xs:string* := $searchToken for $thesaurus in $thesaurusName let $returnedTokens := getThesaurus( $searchToken,$thesaurus,$thesaurusLanguage), $returnedTokens return $returnedTokens }
declare function fts:applyRegexOption($matchOption as fts:FTMatchOption, $searchTokens as xs:string*) as xs:string* { if ($matchOption/@regexIndicator = "with") { let $searchToken := first($searchTokens) let $nextTokens := except($searchTokens,$searchToken) let $returnedTokens := regexForm($searchToken, $matchOption/@language), applyRegexOption($matchOption, $nextTokens) } else if ($matchOption/@regexIndicator = "without") let $returnedTokens := $searchTokens return $returnedTokens }
We now present the formal semantics of the FTContainsExpr expression. It takes in (1) a search context consisting of a sequence of nodes (which is the result of a regular XQuery/XPath expression), and (2) AllMatches corresponding to an FTSelection, and returns a sequence of nodes. Since FTContainsExpr returns results in the XQuery data model (a sequence of nodes), it can be treated like regular XQuery expressions and can be fully composed with other XQuery expressions. In addition, since FTContainsExpr and FTScoreExpr map AllMatches to a sequence of nodes, they provide the "glue" and well-defined semantics for mapping from AllMatches to the XQuery data model.
The formal semantics of FTContainsExpr is specified in terms of a formal semantics function. The function has to comply with the prototype defined below. The semantics of FTContainsExpr will be presented based on this function.
Consider an FTContainsExpr expression of the form EvaluationContext ftcontains FTSelection
, where EvaluationContext
is an XQuery expression that returns a sequence of nodes, and FTSelection
is an FTSelection that returns AllMatches. Intuitively, the FTContainsExpr returns true if and only if some node in the result of EvaluationContext
satisfies the AllMatches returned by FTSelection
.
We now formally define the semantics of FTContainsExpr. The semantics is defined in terms of a regular XQuery function (without any XQuery 1.0 and XPath 2.0 Full-Text extensions). The XQuery function takes in three parameters: the first parameter is the sequence of nodes returned by EvaluationContext
; the second parameter is the XML node representation of FTSelection
; the third parameter is the XML representation of the default set of FTMatchOptions. The XQuery
function (by definition) returns true if and only if the corresponding FTContainsExpr returns true, and thus specifies the semantics of FTContainsExpr. Note that by using regular XQuery to specify the formal semantics, we avoid the need to introduce new formalism. We simply reuse the formal semantics of XQuery.
declare function FTContainsExpr(
    $searchContext as node()*,
    $ftSelection as fts:FTSelection,
    $defOptions as fts:FTMatchOptions
) as xs:boolean {
  some $node in $searchContext
  satisfies
    let $allMatches := fts:evaluate($ftSelection, $node, $defOptions, 0)
    return some $match in $allMatches/match
           satisfies fn:count($match/stringExclude) eq 0
}
Intuitively, the above function returns true if and only if the AllMatches that is the result of the application of the FTSelection for some node in the search context contains a Match with no StringExcludes. This means that there is a set of TokenInfos in that node which satisfy the condition of the FTSelection.
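The same test can be sketched informally in Python. This sketch is not part of the formal semantics. It assumes that each Match is a dictionary with hypothetical keys "includes" and "excludes", and that a caller-supplied function stands in for fts:evaluate; the sample data is invented for illustration only.

```python
def ft_contains(search_context, evaluate):
    """True if some node yields at least one match with no excludes."""
    return any(
        any(not match["excludes"] for match in evaluate(node))
        for node in search_context
    )


# Invented sample data: node "a" has only a match carrying an exclude,
# node "b" has a match with no excludes.
SAMPLE = {
    "a": [{"includes": [1], "excludes": [3]}],
    "b": [{"includes": [2], "excludes": []}],
}


def sample_evaluate(node):
    """Stand-in for the evaluation of an FTSelection against one node."""
    return SAMPLE[node]


print(ft_contains(["a", "b"], sample_evaluate))  # True
print(ft_contains(["a"], sample_evaluate))       # False
```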
We will now show the evaluation of a more elaborate example of FTContainsExpr. We use the same sample document. For convenience, we present it again here.
<offers> <offer id="1000" price="10000"> Ford(1) Mustang(2) 2000(3), 65K(4), excellent(5) condition(6), runs(7) great(8), AC(9), CC(10), power(11) all(12) </offer> <offer id="1001" price="8000"> Honda(13) Accord(14) 1999(15), 78K(16), A(17)/C(18), cruise(19) control(20), runs(21) and(22) looks(23) great(24), excellent(25) condition(26) </offer> <offer id="1005" price="5500"> Ford(27) Mustang(28), 1995(29), 150K(30) highway(31) mileage(32), little(33) rust(34), excellent(35) condition(36) </offer> </offers>
Let the above document be assigned to $doc
. We will walk through the evaluation of the following FTContainsExpr
$doc ftcontains ( ( "mustang" && (("great" || "excellent") at least 2 occurrences) ) window at most 30 && ! "rust" ) same node
We first evaluate the FTSelection to AllMatches
( ( "mustang" && (("great" || "excellent") at least 2 occurrences) ) window at most 30 && ! "rust" ) same node
Step 1 - Evaluate the FTWords "Mustang"
Step 2 - Evaluate the FTWords "great"
Step 3 - Evaluate the FTWords "excellent"
Step 4 - Apply the FTOr ("great" || "excellent")
: form the union of the Matches
Step 5 - Apply the FTTimes ("great" || "excellent") at least 2 occurrences
: form 2-tuples (couples) of Matches
Step 6 - Apply the FTAnd "Mustang" && (("great" || "excellent") at least 2 occurrences)
: form the "Cartesian product" of Matches
Step 7 - Apply the FTWindow ("Mustang" && (("great" || "excellent") at least 2 occurrences)) window at most 30
: filter out Matches for which the window is not less than or equal to 30
Step 8 - Match FTWords "rust"
Step 9 - Apply the FTUnaryNot ! "rust"
: transform the stringInclude
into stringExclude
Step 10 - Apply the FTAnd (("Mustang" && (("great" || "excellent") at least 2 occurrences)) window at most 30) && ! "rust"
: form the "Cartesian product" of the Matches
Step 11 - Apply the final FTScope same node: filter out Matches whose TokenInfos are not within the same node
This is the final AllMatches from the evaluation of the FTSelection.
Every Match in the resulting AllMatches contains a StringExclude. Therefore, the sample FTContainsExpr returns false
.
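Steps 1 through 10 can be retraced informally in Python. This sketch is not part of the formal semantics and rests on several assumptions: the positions are read off the numbered sample document, matching is taken to be case-insensitive, a Match is reduced to a set of included positions, the window size is the largest position minus the smallest plus one, and Step 11 (same node) is left out because it needs node identity. The counts hold only under these assumptions.

```python
from itertools import combinations, product

# Word positions read off the numbered sample document.
POSITIONS = {
    "mustang": [2, 28],
    "great": [8, 24],
    "excellent": [5, 25, 35],
    "rust": [34],
}


def words(term):
    """Steps 1-3 and 8: one match (a set of positions) per occurrence."""
    return [frozenset({pos}) for pos in POSITIONS[term]]


mustang = words("mustang")
either = words("great") + words("excellent")             # Step 4: union
pairs = [a | b for a, b in combinations(either, 2)]      # Step 5: pairs of matches
joined = [m | p for m, p in product(mustang, pairs)]     # Step 6: Cartesian product
windowed = [m for m in joined if max(m) - min(m) + 1 <= 30]  # Step 7: window filter
excluded = frozenset(POSITIONS["rust"])                  # Steps 8-9: "rust" negated
final = [(m, excluded) for m in windowed]                # Step 10: attach the exclude

# The expression holds only if some match carries no exclude.
result = any(len(excl) == 0 for _, excl in final)
print(len(either), len(pairs), len(joined), len(windowed))  # 5 10 20 15
print(result)                                               # False
```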
The EBNF in this document and in this section is aligned with the XML Query 1.0 Last Call grammar.
[1] Pragma ::= "(::" "pragma" QName ExtensionContents* "::)"    /* gn: parens */
[2] MUExtension ::= "(::" "extension" QName ExtensionContents* "::)"    /* gn: parens */
[3] ExprComment ::= "(:" (ExprCommentContent | ExprComment)* ":)"    /* gn: comments */
[4] ExprCommentContent ::= Char    /* gn: parens */
[5] ExtensionContents ::= Char
[6] IntegerLiteral ::= Digits
[7] DecimalLiteral ::= ("." Digits) | (Digits "." [0-9]*)    /* ws: explicit */
[8] DoubleLiteral ::= (("." Digits) | (Digits ("." [0-9]*)?)) ("e" | "E") ("+" | "-")? Digits    /* ws: explicit */
[9] StringLiteral ::= ('"' (PredefinedEntityRef | CharRef | ('"' '"') | [^"&])* '"') | ("'" (PredefinedEntityRef | CharRef | ("'" "'") | [^'&])* "'")    /* ws: significant */
[10] S ::= [http://www.w3.org/TR/REC-xml#NT-S]XML    /* gn: xml-version */
[11] ValidationMode ::= "lax" | "strict" | "skip"
[12] SchemaGlobalTypeName ::= "type" "(" QName ")"
[13] SchemaGlobalContext ::= QName | SchemaGlobalTypeName
[14] SchemaContextStep ::= QName
[15] Digits ::= [0-9]+
[16] EscapeQuot ::= '"' '"'
[17] PITarget ::= NCName
[18] NCName ::= [http://www.w3.org/TR/REC-xml-names/#NT-NCName]Names    /* gn: xml-version */
[19] VarName ::= QName
[20] QName ::= [http://www.w3.org/TR/REC-xml-names/#NT-QName]Names    /* gn: xml-version */
[21] PredefinedEntityRef ::= "&" ("lt" | "gt" | "amp" | "quot" | "apos") ";"    /* ws: explicit */
[22] HexDigits ::= ([0-9] | [a-f] | [A-F])+
[23] CharRef ::= "&#" (Digits | ("x" HexDigits)) ";"    /* ws: explicit */
[24] EscapeApos ::= "''"
[25] Char ::= [http://www.w3.org/TR/REC-xml#NT-Char]XML    /* gn: xml-version */
[26] ElementContentChar ::= Char - [{}<&]
[27] QuotAttContentChar ::= Char - ["{}<&]
[28] AposAttContentChar ::= Char - ['{}<&]
[29] Module ::= VersionDecl? (MainModule | LibraryModule)
[30] MainModule ::= Prolog QueryBody
[31] LibraryModule ::= ModuleDecl Prolog
[32] ModuleDecl ::= <"module" "namespace"> NCName "=" StringLiteral Separator
[33] Prolog ::= (Setter Separator)* ((Import | NamespaceDecl | VarDecl | FunctionDecl) Separator)*
[34] Separator ::= ";"
[35] VersionDecl ::= <"xquery" "version" StringLiteral> Separator
[36] Setter ::= XMLSpaceDecl | DefaultCollationDecl | BaseURIDecl | ValidationDecl | DefaultNamespaceDecl
[37] Import ::= SchemaImport | ModuleImport
[38] ModuleImport ::= <"import" "module"> ("namespace" NCName "=")? StringLiteral <"at" StringLiteral>?
[39] VarDecl ::= <"declare" "variable" "$"> VarName TypeDeclaration? (("{" Expr "}") | "external")
[40] XPath ::= Expr?
[41] QueryBody ::= Expr
[42] Expr ::= ExprSingle ("," ExprSingle)*
[43] Pattern ::= PathPattern (("union" | "|") Pattern)?
[44] PathPattern ::= ("/" RelativePathPattern?)    /* gn: leading-lone-slash */
[45] RelativePathPattern ::= PatternStep (("/" | "//") RelativePathPattern)?
[46] PatternStep ::= PatternAxis? NodeTest PredicateList
[47] PatternAxis ::= <"child" "::">
[48] IdKeyPattern ::= (<"id" "("> IdKeyValue ")") | (<"key" "("> StringLiteral "," IdKeyValue ")")
[49] IdKeyValue ::= StringLiteral | ("$" VarName)
[50] ExprSingle ::= FLWORExpr
[51] FLWORExpr ::= (ForClause | LetClause)+ (ForClause | LetClause) WhereClause? OrderByClause? "return" ExprSingle
[52] ForExpr ::= SimpleForClause "return" ExprSingle
[53] ForClause ::= <"for" "$"> VarName TypeDeclaration? PositionalVar? "in" ExprSingle ("," "$" VarName TypeDeclaration? PositionalVar? "in" ExprSingle)*
[54] PositionalVar ::= "at" "$" VarName
[55] SimpleForClause ::= <"for" "$"> VarName "in" ExprSingle ("," "$" VarName "in" ExprSingle)*
[56] LetClause ::= <"let" "$"> VarName TypeDeclaration? ":=" ExprSingle ("," "$" VarName TypeDeclaration? ":=" ExprSingle)*
[57] WhereClause ::= "where" Expr
[58] OrderByClause ::= (<"order" "by"> | <"stable" "order" "by">) OrderSpecList
[59] OrderSpecList ::= OrderSpec ("," OrderSpec)*
[60] OrderSpec ::= ExprSingle OrderModifier
[61] OrderModifier ::= ("ascending" | "descending")? (<"empty" "greatest"> | <"empty" "least">)? ("collation" StringLiteral)?
[62] QuantifiedExpr ::= (<"some" "$"> | <"every" "$">) VarName TypeDeclaration? "in" ExprSingle ("," "$" VarName TypeDeclaration? "in" ExprSingle)* "satisfies" ExprSingle
[63] TypeswitchExpr ::= <"typeswitch" "("> Expr ")" CaseClause+ "default" ("$" VarName)? "return" ExprSingle
[64] CaseClause ::= "case" ("$" VarName "as")? SequenceType "return" ExprSingle
[65] IfExpr ::= <"if" "("> Expr ")" "then" ExprSingle "else" ExprSingle
[66] OrExpr ::= AndExpr ( "or" AndExpr )*
[67] AndExpr ::= ComparisonExpr ( "and" ComparisonExpr )*
[68] ComparisonExpr ::= RangeExpr (((ValueComp
[69] RangeExpr ::= AdditiveExpr ( "to" AdditiveExpr )?
[70] AdditiveExpr ::= MultiplicativeExpr ( ("+" | "-") MultiplicativeExpr )*
[71] MultiplicativeExpr ::= UnaryExpr ( ("*" | "div" | "idiv" | "mod") UnaryExpr )*
[72] UnaryExpr ::= ("-" | "+")* UnionExpr
[73] UnionExpr ::= IntersectExceptExpr ( ("union" | "|") IntersectExceptExpr )*
[74] IntersectExceptExpr ::= InstanceofExpr ( ("intersect" | "except") InstanceofExpr )*
[75] InstanceofExpr ::= TreatExpr ( <"instance" "of"> SequenceType )?
[76] TreatExpr ::= CastableExpr ( <"treat" "as"> SequenceType )?
[77] CastableExpr ::= CastExpr ( <"castable" "as"> SingleType )?
[78] CastExpr ::= ValueExpr ( <"cast" "as"> SingleType )?
[79] ValueExpr ::= ValidateExpr | PathExpr | AxisStep
[80] PathExpr ::= ("/" RelativePathExpr?)    /* gn: leading-lone-slash */
[81] | RelativePathExpr |
::= | StepExpr (("/" | "//") StepExpr)* |
|
[82] | StepExpr |
::= | AxisStep | FilterExpr |
|
[83] | AxisStep |
::= | (ForwardStep | ReverseStep) PredicateList |
|
[84] | FilterExpr |
::= | PrimaryExpr PredicateList |
|
[85] | ContextItemExpr |
::= | "." |
|
[86] | PrimaryExpr |
::= | Literal | VarRef | ParenthesizedExpr | ContextItemExpr | FunctionCall | Constructor |
|
[87] | VarRef |
::= | "$" VarName |
|
[88] | Predicate |
::= | "[" Expr "]" |
|
[89] | PredicateList |
::= | Predicate* |
|
[90] | ValidateExpr |
::= | (<"validate" "{"> | (<"validate" "global"> "{") | (<"validate" "context"> SchemaContextLoc "{") | (<"validate" ValidationMode> ValidationContext? "{")) Expr "}" |
/* gn: validate */ |
[91] | ValidationContext |
::= | ("context" SchemaContextLoc) | "global" |
|
[92] | Constructor |
::= | DirElemConstructor |
|
[93] | ComputedConstructor |
::= | CompElemConstructor |
|
[94] | GeneralComp |
::= | "=" | "!=" | "<" | "<=" | ">" | ">=" |
/* gn: lt */ |
[95] | ValueComp |
::= | "eq" | "ne" | "lt" | "le" | "gt" | "ge" |
|
[96] | NodeComp |
::= | "is" | "<<" | ">>" |
|
[97] | ForwardStep |
::= | (ForwardAxis NodeTest) | AbbrevForwardStep |
|
[98] | ReverseStep |
::= | (ReverseAxis NodeTest) | AbbrevReverseStep |
|
[99] | AbbrevForwardStep |
::= | "@"? NodeTest |
|
[100] | AbbrevReverseStep |
::= | ".." |
|
[101] | ForwardAxis |
::= | <"child" "::"> |
|
[102] | ReverseAxis |
::= | <"parent" "::"> |
|
[103] | NodeTest |
::= | KindTest | NameTest |
|
[104] | NameTest |
::= | QName | Wildcard |
|
[105] | Wildcard |
::= | "*" |
/* ws: explicit */ |
[106] | Literal |
::= | NumericLiteral | StringLiteral |
|
[107] | NumericLiteral |
::= | IntegerLiteral | DecimalLiteral | DoubleLiteral |
|
[108] | ParenthesizedExpr |
::= | "(" Expr? ")" |
|
[109] | FunctionCall |
::= | (<QName "("> | <"id" "("> | <"key" "(">) (ExprSingle ("," ExprSingle)*)? ")" |
|
[110] | DirElemConstructor |
::= | "<" QName AttributeList ("/>" | (">" ElementContent* "</" QName S? ">")) |
/* ws: explicit */ |
/* gn: lt */ |
[111] | CompDocConstructor |
::= | <"document" "{"> Expr "}" |
|
[112] | CompElemConstructor |
::= | (<"element" QName "{"> | (<"element" "{"> Expr "}" "{")) CompElemBody? "}" |
|
[113] | CompElemBody |
::= | (CompElemNamespace | ExprSingle) ("," (CompElemNamespace | ExprSingle))* |
|
[114] | CompElemNamespace |
::= | "namespace" NCName? "{" StringLiteral "}" |
|
[115] | CompAttrConstructor |
::= | (<"attribute" QName "{"> | (<"attribute" "{"> Expr "}" "{")) Expr? "}" |
|
[116] | CompXmlPI |
::= | (<"processing-instruction" NCName "{"> | (<"processing-instruction" "{"> Expr "}" "{")) Expr? "}" |
|
[117] | CompXmlComment |
::= | <"comment" "{"> Expr "}" |
|
[118] | CompTextConstructor |
::= | <"text" "{"> Expr? "}" |
|
[119] | CdataSection |
::= | "<![CDATA[" Char* "]]>" |
/* ws: significant */ |
[120] | XmlPI |
::= | "<?" PITarget (S Char*)? "?>" |
/* ws: explicit */ |
[121] | XmlComment |
::= | "<!--" Char* "-->" |
/* ws: significant */ |
[122] | ElementContent |
::= | ElementContentChar |
/* ws: significant */ |
[123] | AttributeList |
::= | (S (QName S? "=" S? AttributeValue)?)* |
/* ws: explicit */ |
[124] | AttributeValue |
::= | ('"' (EscapeQuot | QuotAttrValueContent)* '"') |
/* ws: significant */ |
[125] | QuotAttrValueContent |
::= | QuotAttContentChar |
/* ws: significant */ |
[126] | AposAttrValueContent |
::= | AposAttContentChar |
/* ws: significant */ |
[127] | EnclosedExpr |
::= | "{" Expr "}" |
|
[128] | XMLSpaceDecl |
::= | <"declare" "xmlspace"> ("preserve" | "strip") |
|
[129] | DefaultCollationDecl |
::= | <"declare" "default" "collation"> StringLiteral |
|
[130] | BaseURIDecl |
::= | <"declare" "base-uri"> StringLiteral |
|
[131] | NamespaceDecl |
::= | <"declare" "namespace"> NCName "=" StringLiteral |
|
[132] | DefaultNamespaceDecl |
::= | (<"declare" "default" "element"> | <"declare" "default" "function">) "namespace" StringLiteral |
|
[133] | FunctionDecl |
::= | <"declare" "function"> <QName "("> ParamList? (")" | (<")" "as"> SequenceType)) (EnclosedExpr | "external") |
/* gn: parens */ |
[134] | ParamList |
::= | Param ("," Param)* |
|
[135] | Param |
::= | "$" VarName TypeDeclaration? |
|
[136] | TypeDeclaration |
::= | "as" SequenceType |
|
[137] | SingleType |
::= | AtomicType "?"? |
|
[138] | SequenceType |
::= | (ItemType OccurrenceIndicator?) |
|
[139] | AtomicType |
::= | QName |
|
[140] | ItemType |
::= | AtomicType | KindTest | <"item" "(" ")"> |
|
[141] | KindTest |
::= | DocumentTest |
|
[142] | ElementTest |
::= | <"element" "("> ((SchemaContextPath ElementName) |
|
[143] | AttributeTest |
::= | <"attribute" "("> ((SchemaContextPath AttributeName) |
|
[144] | ElementName |
::= | QName |
|
[145] | AttributeName |
::= | QName |
|
[146] | TypeName |
::= | QName |
|
[147] | ElementNameOrWildcard |
::= | ElementName | "*" |
|
[148] | AttribNameOrWildcard |
::= | AttributeName | "*" |
|
[149] | TypeNameOrWildcard |
::= | TypeName | "*" |
|
[150] | PITest |
::= | <"processing-instruction" "("> (NCName | StringLiteral)? ")" |
|
[151] | DocumentTest |
::= | <"document-node" "("> ElementTest? ")" |
|
[152] | CommentTest |
::= | <"comment" "("> ")" |
|
[153] | TextTest |
::= | <"text" "("> ")" |
|
[154] | AnyKindTest |
::= | <"node" "("> ")" |
|
[155] | SchemaContextPath |
::= | <SchemaGlobalContext "/"> <SchemaContextStep "/">* |
|
[156] | SchemaContextLoc |
::= | (SchemaContextPath? QName) | SchemaGlobalTypeName |
|
[157] | OccurrenceIndicator |
::= | "?" | "*" | "+" |
|
[158] | ValidationDecl |
::= | <"declare" "validation"> ValidationMode |
|
[159] | SchemaImport |
::= | <"import" "schema"> SchemaPrefix? StringLiteral <"at" StringLiteral>? |
|
[160] | SchemaPrefix |
::= | ("namespace" NCName "=") | (<"default" "element"> "namespace") |
|
[161] | RHSPrimaryExpr |
::= | StepExpr |
|
[162] | FTContains |
::= | "ftcontains" FTSelection FTIgnoreOption? |
|
[163] | FTSelection |
::= | FTOr (FTMatchOption | FTProximity)* |
|
[164] | FTOr |
::= | FTAnd ( "||" FTAnd )* |
|
[165] | FTAnd |
::= | FTUnaryNot ( "&&" FTUnaryNot )* |
|
[166] | FTUnaryNot |
::= | ("!")? FTMildnot |
|
[167] | FTMildnot |
::= | FTWordsSelection ( <"mild" "not"> FTWordsSelection )* |
|
[168] | FTWordsSelection |
::= | FTWords | ("(" FTSelection ")") |
|
[169] | FTWords |
::= | PrimaryExpr FTAnyallOption? |
|
[170] | FTProximity |
::= | "ordered" | FTWindow | FTDistance | FTTimes | FTScope |
|
[171] | FTMatchOption |
::= | FTCaseOption |
|
[172] | FTCaseOption |
::= | "lowercase" |
|
[173] | FTDiacriticsOption |
::= | <"with" "diacritics"> |
|
[174] | FTSpecialcharOption |
::= | <"with" "special" "characters"> | <"without" "special" "characters"> |
|
[175] | FTStemOption |
::= | <"with" "stemming"> | <"without" "stemming"> |
|
[176] | FTThesaurusOption |
::= | (<"with" "thesaurus"> UnionExpr) | <"without" "thesaurus"> |
|
[177] | FTStopwordOption |
::= | (<"with" "stop" "words"> | <"without" "stop" "words">) UnionExpr |
|
[178] | FTLanguageOption |
::= | "language" UnionExpr |
|
[179] | FTRegexOption |
::= | <"with" "regex"> | <"without" "regex"> |
|
[180] | FTAnyallOption |
::= | "any" | "all" | "phrase" | <"any" "word"> | <"all" "words"> |
|
[181] | FTRange |
::= | ("exactly" UnionExpr) |
|
[182] | FTDistance |
::= | "with"? "distance" FTRange FTUnit |
|
[183] | FTWindow |
::= | "within"? "window" FTRange |
|
[184] | FTTimes |
::= | "occurs" FTRange |
|
[185] | FTScope |
::= | ("same" | "different") FTBigUnit |
|
[186] | FTUnit |
::= | "words" | "sentences" | "paragraphs" |
|
[187] | FTBigUnit |
::= | "sentence" | "paragraph" |
|
[188] | FTIgnoreOption |
::= | <"without" "content"> UnionExpr |
This section contains general notes on the EBNF productions, which may be helpful in understanding how to create a parser based on this EBNF, how to read the EBNF, and generally call out issues with the syntax. The notes below are referenced from the right side of the production, with the notation: /* gn: <id> */.
A look-ahead of one character is required to distinguish function patterns from a QName followed by a comment. For example: address (: this may be empty :)
may be mistaken for a call to a function named "address" unless this lookahead is employed.
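The look-ahead described in this note can be sketched as follows. This is a minimal illustration in Python, not the lexer of this specification: the function name starts_function_call and the simplified NAME pattern are assumptions made for the example, and real QNames follow the XML Namespaces rules.

```python
import re

# Simplified name pattern (an assumption for this sketch, not the QName production).
NAME = re.compile(r"[A-Za-z_][\w.-]*(?::[A-Za-z_][\w.-]*)?")

def starts_function_call(text: str) -> bool:
    """Return True if text begins with a name whose "(" opens an argument list."""
    m = NAME.match(text)
    if m is None:
        return False
    rest = text[m.end():].lstrip()
    if not rest.startswith("("):
        return False
    # One character of look-ahead past "(": the pair "(:" opens a comment,
    # so the name is not the start of a function call.
    return not rest.startswith("(:")
```

With this check, the input address (: this may be empty :) is treated as a name followed by a comment, while address() and fn:count(1) are treated as function calls.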
Token disambiguation of the overloaded "<" pattern is defined in terms of positional lexical states. The "<" comparison operator can not occur in the same places as a "<" tag open pattern. The "<" comparison operator can only occur in the OPERATOR state and the "<" tag open pattern can only occur in the DEFAULT and the ELEMENT_CONTENT states. (These states are only a specification tool, and do not mandate an implementation strategy for this same effect.)
The ValidateExpr in the exposition, which does not use the "< ... >" token grouping, presents the production in a much simplified, and understandable, form. The ValidateExpr presented in the appendix is technically correct, but structurally hard to understand, because of limitations of the "< ... >" token grouping.
The "/" presents an issue because it occurs both in a leading position and an operator position in expressions. Thus, expressions such as "/ * 5" can easily be confused with the path expression "/*". Therefore, a stand-alone slash, in a leading position, that is followed by an operator, will need to be parenthesized in order to stand alone, as in "(/) * 5". "5 * /", on the other hand, is fine.
Expression comments are allowed inside expressions everywhere that ignorable white space is allowed. Note that expression comments are not allowed in constructor content.
The general rules for XML 1.1 vs. XML 1.0, as described in Section A.2 Lexical structure [XQ], should be applied to this production.
For readability, white space may be used in most expressions even though not explicitly notated in the EBNF. White space is tolerated before the first token and after the last token. White space is optional between terminals, except a few cases where white space is needed to disambiguate the token. For instance, in XML, "-" is a valid character in an element or attribute name. When used as an operator after the characters of a name, it must be separated from the name, e.g. by using white space or parentheses.
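The effect of the "-" rule can be illustrated with a small tokenizer. This is a sketch in Python under simplifying assumptions: the TOKEN pattern and the tokenize function are hypothetical, and they cover only ASCII-style names, integers and single-character symbols rather than the full lexical structure.

```python
import re

# A name may contain "-" after its first character, so the longest match wins.
# (Hypothetical, simplified pattern: names, integers, then any single symbol.)
TOKEN = re.compile(r"[A-Za-z_][\w.-]*|\d+|\S")

def tokenize(expr: str) -> list[str]:
    """Split expr into tokens, skipping white space between them."""
    return TOKEN.findall(expr)
```

Under these assumptions, price-discount is read as one name, whereas price - discount and (price)-discount are read as a subtraction, because white space or a parenthesis ends the name before the "-".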
Special white space notation is specified with the EBNF productions, when it is different from the default rules, as follows.
"ws: explicit" means that the EBNF notation explicitly notates where white space is allowed, and white space is otherwise not allowed.
"ws: significant" means that white space is significant as value content.
For XQuery, white space is not freely allowed in the non-computed Constructor productions, but is specified explicitly in the grammar, in order to be more consistent with XML. The lexical states where white space must have explicit specification are as follows: START_TAG, END_TAG, ELEMENT_CONTENT, XML_COMMENT, PROCESSING_INSTRUCTION, PROCESSING_INSTRUCTION_CONTENT, CDATA_SECTION, QUOT_ATTRIBUTE_CONTENT, and APOS_ATTRIBUTE_CONTENT.
For other usage of white space, one or more white space characters are required to separate "words". Zero or more white space characters may optionally be used around punctuation and non-word symbols.
This section contains an extension of the XPath 2.0 grammar with Full-Text.
TBD.
This section contains the current issues related to this document.
Scoring Properties
Is it possible to specify anything other than range? Examples: do we want to define scoring rules for efficient scoring, or rules to guarantee score monotonicity?
Resolution:
None recorded.
Scoring Values
Do we require that 0 be returned if there are no matches?
Resolution:
None recorded.
Semantics Data Model
Data model incorporates new names - TokenInfo, Match, AllMatches.
Resolution:
CLOSED.
All occurrences of FullMatch, SimpleMatch, and Position in the text, in the schemas, and in the XQuery implementations of the semantics have been replaced with AllMatches, Match, and TokenInfo respectively.
FTContains Grammar
Expr "ftcontains" FTSelection FTIgnoreCtxMod?. One production for FTSelection which includes FTIgnoreCtxMod?
Resolution:
CLOSED.
We replaced the previous grammar production Expr "ftcontains" FTSelection that allowed FTIgnoreCtxMod to be combined with any FTSelection with the new one that restricts the application of FTIgnoreCtxMod to the highest level.
FTContextModifiers
Paul C.: Change the name of the FTContextModifier productions, which modify the operational semantics of the FTSelections they are applied to. Abandon the use of "ContextModifier" as in FTCaseCtxMod, FTStemCtxMod, FTIgnoreCtxMod. Issue raised at FTTF Feb 5-6, 2004 meeting. Find in the minutes at: http://lists.w3.org/Archives/Member/member-query-fttf/2004Feb/0010.html (Ctrl-F on FTContextModifiers) (W3C members only)
Resolution:
CLOSED.
Replaced FTContextModifiers with FTMatchOptions as in FTCaseOption, FTStemOption, FTIgnoreOption in the February 26, 2004 Editor's Draft.
CLOSED February 26, 2004.
Grammar
Grammar: Where does the ftcontains expression belong in the XQuery grammar: Boolean expression or comparison expression?
Resolution:
CLOSED.
The ftcontains expression plugs in to the XQuery grammar in the "FTComparisonExpr" production. This seems to give ftcontains the correct precedence among other XQuery operations, and it makes intuitive sense.
Wildcards
Pat Case: There are a few inconsistencies between this document and the Use Cases Working Draft.
This document and the Use Cases Working Draft present different syntax in regex examples. I can find no syntax provided in this document for the starts-with and exact match functionality. Should we rename the Wildcard section in the Use Cases to Regex Section and possibly rethink the use cases?
Resolution:
None recorded.
Thesaurus
Thesaurus names: "synonyms", "narrower terms", "soundex", "spellcheck" and "wordnet". We need to define Thesaurus operators. We need more options when specifying a thesaurus: Name, URI, Depth, Dimension. Standards: ISO 2788/ANSI Z39.19.
We need to discuss what the grammar of ThesaurusMatchOption is. Current grammar is:
FTThesaurusOption ::= ("with"? "thesaurus" Expr) | "without thesaurus".
Proposed grammar is:
FTThesaurusOption ::= ("with"? "thesaurus" Expr "operation" Expr) | "without thesaurus".
Resolution:
None recorded.
Window
Currently, FTDistanceSpec only permits a single distance specification for all of the terms specified by an FTSelection.
For example:
("dog" && "cat" && "bird") with word distance at most 10
In this scenario above, the terms "dog", "cat", and "bird" must all occur within 10 words of one another.
However, if one wanted to return documents where "dog" occurs within 10 words of "cat" and this SAME "cat" term occurs within 5 words of "bird", that cannot be expressed in the current language specification. The best that could be done is the following:
(("dog" && "cat") with word distance at most 10) and (("cat" && "bird") with word distance at most 5)
But this will not lead to the exact desired result, because the "cat" and "bird" comparison will not use only those "cat" terms which occurred within 10 positions of "dog"; it can use any "cat" term within the search context.
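The difference between the two readings can be made concrete with a small sketch. The Python below is illustrative only and is not the semantics defined in this document: it assumes that text is already tokenized into a list of words and that distance is the difference between token positions. The helper names positions, within and chained are hypothetical.

```python
from itertools import product

def positions(tokens, word):
    """Indexes at which word occurs in the token list."""
    return [i for i, t in enumerate(tokens) if t == word]

def within(tokens, a, b, max_gap):
    """True if SOME occurrence of a and SOME occurrence of b are close enough."""
    return any(abs(i - j) <= max_gap
               for i, j in product(positions(tokens, a), positions(tokens, b)))

def chained(tokens, a, b, c, gap_ab, gap_bc):
    """True only if one and the same occurrence of b is close to both a and c."""
    return any(
        any(abs(i - j) <= gap_ab for i in positions(tokens, a))
        and any(abs(k - j) <= gap_bc for k in positions(tokens, c))
        for j in positions(tokens, b)
    )

# "dog cat", twenty filler words, then "cat bird".
doc = ["dog", "cat"] + ["x"] * 20 + ["cat", "bird"]

# The conjunction of two independent distance tests accepts the document ...
both_tests = within(doc, "dog", "cat", 10) and within(doc, "cat", "bird", 5)
# ... although no single "cat" is near both "dog" and "bird".
same_cat = chained(doc, "dog", "cat", "bird", 10, 5)
```

Here both_tests is True while same_cat is False, which is the gap between the workaround and the desired query.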
Resolution:
None recorded.
MildNot
Andrew E.: Should we remove the mild not? It has never been included in a query language before.
Pat Case has provided use cases to justify its inclusion at: http://lists.w3.org/Archives/Member/member-query-fttf/2003Dec/0034.html (W3C members only)
Discussion followed. Michael Rys' reply: http://lists.w3.org/Archives/Member/member-query-fttf/2003Dec/0038.html (W3C members only)
Pat Case's reply: http://lists.w3.org/Archives/Member/member-query-fttf/2003Dec/0043.html (W3C members only)
Use case paraphrase (for non-members): Consider a collection of 3 documents:
The Delights of Mexico - a document that includes "Mexico" several times.
The Perils of New Mexico - a document that includes "New Mexico" several times.
Travel in North America - a document that includes both "Mexico" and "New Mexico" several times.
Suppose you are planning a trip to Mexico. You want documents 1 and 3, but not 2. You could search for "Mexico" and get documents 1, 2 and 3. Or you could search for "Mexico AND NOT 'New Mexico'" and get just document 1. But the "strong not" has ruled out document 3 - even though it contained the thing you were looking for - just because it contained the thing you were not looking for.
The "mild not" operator allows you to say "Mexico MILD NOT 'New Mexico'", which means "find me all the documents that contain 'Mexico'. Do not take any notice of occurrences of 'New Mexico', but do not rule out a document just because it contains 'New Mexico'".
There are many cases where you may want to search for a word, but NOT get documents just because they contain a common phrase that includes that word. e.g. "security" mildnot "social security", "house" mildnot "house of representatives", "estate tax" mildnot "real estate tax"
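The three-document use case above can be worked through in a short sketch. The Python below is an illustration under stated assumptions, not the formal semantics: documents are lowercase token lists, and an occurrence of the wanted phrase is ignored when it overlaps any occurrence of the unwanted phrase. The function names are hypothetical.

```python
def phrase_positions(tokens, phrase):
    """Start indexes at which phrase (a list of words) occurs in tokens."""
    n = len(phrase)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == phrase]

def covered(tokens, phrase):
    """Set of token positions covered by any occurrence of phrase."""
    n = len(phrase)
    return {p for i in phrase_positions(tokens, phrase) for p in range(i, i + n)}

def strong_not(tokens, wanted, unwanted):
    """wanted AND NOT unwanted: any occurrence of unwanted rules the document out."""
    return bool(phrase_positions(tokens, wanted)) and not phrase_positions(tokens, unwanted)

def mild_not(tokens, wanted, unwanted):
    """wanted MILD NOT unwanted: ignore occurrences of wanted inside unwanted."""
    excluded = covered(tokens, unwanted)
    n = len(wanted)
    return any(not (set(range(i, i + n)) & excluded)
               for i in phrase_positions(tokens, wanted))

doc1 = "the delights of mexico mexico is warm".split()
doc2 = "the perils of new mexico new mexico is dry".split()
doc3 = "travel in north america visit mexico and new mexico".split()
docs = [doc1, doc2, doc3]

strong = [strong_not(d, ["mexico"], ["new", "mexico"]) for d in docs]
mild = [mild_not(d, ["mexico"], ["new", "mexico"]) for d in docs]
```

In this sketch strong selects only document 1, while mild selects documents 1 and 3, matching the outcome described in the use case.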
Resolution:
None recorded.
Markup vs Structure
Some tags are "markup" - e.g. b - some are "structure" - e.g. title. We generally want to treat structure tags as word boundaries, but not markup tags. How do we distinguish between markup and structure?
Michael to provide reformulation.
Resolution:
None recorded.
MatchOption Policy
We need some indirection to specify match context and defaults. "Thesaurus name" gives us a way to define a thesaurus, then specify it in the query - an indirection. Steve Buxton proposes there are many classes of things that are needed for context-match (stoplist, special characters, etc.) that need an indirection. So we need an extra level of indirection - a named policy that refers to a set of named things.
Resolution:
None recorded.
Loose Grammar
The grammar allows lots of queries that do not make sense, e.g. "(dog || cat) within word distance N", "dog within word distance N", "(dog || cat) ordered", "!dog 5 times". If the grammar does not provide a way of identifying these "nonsense queries", then the implementation still has to identify them - i.e. implementors will have to augment the grammar to identify nonsense queries, and augment the semantics to do something with them.
J. Doerre asks if we should allow nested FTNegations in the RHS of a FTMildNegation. From his email (http://lists.w3.org/Archives/Member/member-query-fttf/2004Apr/0019.html (W3C members only)) point 3: "The ApplyFTSelection ignores all StringExcludes in the arguments of the FTMildNegation. I think, if we don't want to deal with StringExcludes in that function, we should explicitly forbid them to appear, i.e. require arguments of FTMildNegation to not include any FTNegation."
Resolution:
None recorded.
FTTimesSelection
How do I count occurrences, where the query is NOT a single term? How many occurrences of "!dog" are there in "very very big"? Zero or very many?
Resolution:
None recorded.
RegExp Escape
Need to define some escaping mechanism for regexp characters, and for (||, ...).
Resolution:
None recorded.
FTScopeSelection
Is there a need for both FTScopeSelection and FTDistance? For example, how is the 'same sentence' or 'same paragraph' really different from an FTDistance of 'with sentence exactly 1' or 'with paragraph exactly 1'?
Resolution:
None recorded.
Weighting
Michael R.: What syntactic form should scoring take? How do we describe the constraints on the types of expressions that are allowed? Should scoring be expressed using a second-order function, a stand-alone operator, or as a clause in a FLWOR expression? Consider moving weighting to ftContains, something like the following: TreatExpr ("ftcontains" FTSelection ("weight" Expr)? )?
Options in presentation of full-text language proposal and some discussion at XQuery January meeting, Tampa at: http://www.w3.org/XML/Group/2004/01/xquery-minutes (W3C members only) (Ctrl-F on Report of Full-Text Task Force)
Resolution:
None recorded.
Weight Values
Valid values for weights must be defined.
Resolution:
None recorded.
Issue (ftscopeselection-on-structure):
FTScopeSelection on structure
Scoping based on structure (e.g. same node and different node) should be considered. Support for queries where distance is measured in terms of "number of intervening elements" where elements can be any markup including chapter, paragraph and sentence. Consider sentence/paragraph/node distance.
Resolution:
None recorded.
LanguageMatchOption
What is the default language? SA: Dana F.: does the language have to be a literal or an Expr that returns xs:string? Is there an implementation-defined list of valid languages?
Resolution:
None recorded.
Issue (casematchoption-specialcharmatchoption):
CaseMatchOption and SpecialCharMatchOption
Paul C. asked whether "lowercase", "uppercase", "case sensitive" and "case insensitive" should be defined in the context of Unicode. J. Doerre provided this link to the Unicode standard: http://www.unicode.org. The current version is 4.0.0. Case folding is described in Chapter 3.13. Please note that the case folding operations, like toUppercase(X), only depend on the characters to be folded, not on additional information, like language.
Resolution:
None recorded.
Issue (diacriticsmatchoption):
DiacriticsMatchOption
Paul C.: We need to define what a diacritic is. Steve B. asked whether "with diacritics" and "without diacritics" are needed or not.
Resolution:
None recorded.
Tokenizers
Darin/Paul C.: What is the most general behavior for tokenizers?
Michael Kay: Can we define a set of rules that apply regardless of which tokenizer we are using, in the same manner as the rules we defined for scoring? For example, we could impose constraints on words, sentences and paragraphs.
Resolution:
None recorded.
Issue (specialcharmatchoption):
SpecialCharMatchOption
We need to say more about special characters: what kind of special characters do we want to consider, what is their impact on the ability to use a given index, and what is their impact on tokenization?
Resolution:
None recorded.
MatchOption Syntax
Paul C.: It may be that we should reconsider the syntax and allow modifiers to be applied to individual words.
Resolution:
None recorded.
StopWordsMatchOption
We need to say more about stopwords: what kind of stopwords do we want to consider, what is their impact on the ability to use a given index, and what is their impact on tokenization? Should we allow the URI of a StopWords list to be specified? Paul C.: What would a single search with a stopword return?
Resolution:
None recorded.
Issue (matchoptionstokenization):
MatchOption and Tokenization
Does the language document clearly state the impact of match options on tokenization? Consider regex * when does it get applied? What effect does it have on word breaks? Example: expr ftcontains "brown .ox" with regex, expr ftcontains "brown .*ox" with regex.
Resolution:
None recorded.
IGNORE Syntax
Do we need special syntax for IGNORE in case of level by level search?
Resolution:
None recorded.
Scoping
Do we need same sentence, same paragraph search? * in semantics, not in requirements.
Resolution:
None recorded.
Issue (precedencexqueryfulltext):
Precedence of XQuery and full-text
We need to distinguish between XQuery expressions embedded in full-text expressions and FTSelections themselves. S. Buxton suggests that we use different kinds of parentheses to distinguish between these two expressions. See his message in http://lists.w3.org/Archives/Member/member-query-fttf/2004Apr/0042.html (W3C members only) and subsequent messages. A simple example is to distinguish between ("cat") as an XQuery expression that builds an XQuery sequence and ("cat") as an FTSelection.
In the current draft of the document, we are using lookahead.
Other possibilities include the use of "{}" to switch from full-text to XQuery when XQuery expressions are embedded in full-text expressions. This is similar to element construction in XQuery and has been pointed out by Mary H in her email at http://lists.w3.org/Archives/Member/member-query-fttf/2004May/0163.html (W3C members only)
Resolution:
None recorded.
Optional Keyword "with" in FTDistance
In 3.1.9 FTDistance: Do we need "with" in FTDistance?
Raised by Pat Case by email April 28, 2004
Resolution:
None recorded.
Optional Keyword "within" in FTWindow
In 3.1.20 FTWindow: Do we need "within" in FTWindow?
Raised by Pat Case by email April 28, 2004
Resolution:
None recorded.
Issue (ftspecialcharoption-issue):
FTSpecialCharOption Specify Which
In 3.2.3 FTSpecialCharOption: Should we have to or be able to specify which special characters are to be matched or not? Should the following syntax be allowed: without special characters "-" or with special characters "-"?
Raised by Pat Case by email April 28, 2004
Resolution:
None recorded.
FTNegation Includes Unary Not
In 3.1.5 FTNegation: If we are supporting the unary not which is shown in the production, please add text and examples to show that both the "unary not" and the "and not" are supported.
Raised by Pat Case by email April 28, 2004
Resolution:
None recorded.
FTOrder Unordered Option
In 3.1.7 FTOrder: [30] FTOrder ::= FTSelection "ordered". Should we have an explicit "unordered" for the default?
Raised by Pat Case by email April 28, 2004
Resolution:
None recorded.
FTIgnoreOption Naming
Would FTFilterOption be a better name than FTIgnoreOption?
Raised by Pat Case by email April 28, 2004
Resolution:
None recorded.
FTRangeSpec Syntax for 1 to 4
We should consider aligning the syntax for the FTRangeSpec with an upper and lower boundary in 3.1.9 FTDistance (from 1 to 4) with the syntax for using range expressions to construct sequences in XQuery and XPath (1 to 4). See the XQuery/XPath language document Section 3.3.1 Constructing Sequences.
Raised by Pat Case by email April 28, 2004
Resolution:
None recorded.
Boolean (&& || !) Naming
Is it not possible and maybe preferable to use ftand ftor ftnot instead of && || ! following the lead of ftcontains?
Raised by Pat Case by email April 28, 2004
Resolution:
None recorded.
Exact Element Content
We have a use case for an exact element content query which finds the exact words or phrases being queried, no more and no less in an element and allows variations on case, diacritics, and special characters. Should this functionality be in XQuery full-text? If so, should we use the keywords "exact content"?
Raised by Pat Case by email April 28, 2004
Resolution:
None recorded.
Starts With
We have a use case for a starts with query which finds the words or phrases being queried as the first content of an element. Should this functionality be in XQuery full-text? If so, should we use the keywords "starts with"?
Raised by Pat Case by email April 28, 2004
Resolution:
None recorded.
What should we call the mild not
The name "mild not" or "mild negation" is not really helpful in understanding what we want it to denote. We should try hard to find a better name for this construct. Since it is used to exclude certain matches, why not call it "FTMatchExclude" or just "FTExclude"? Keeping "mild not" as the name makes it recognizable as a form of "not". If it remains as "mild not" and the ! continues as the syntax for "not", consider using mild! as the syntax for "mild not".
Raised by Jochen by email April 21, 2004; Additional comments by Pat Case May 4, 2004
Resolution:
None recorded.
Issue (multi-word-phrases-thesauri-lookup):
Thesauri lookup for multi-word phrases
It should be decided whether thesauri lookups can be performed only on single words or whether it is possible to apply it on multi-word phrases. For example, should we allow the thesaurus to replace "bells and whistles" with "frills"?
In the latter case, should thesauri lookup be applied only to the FTWord "bells and whistles", or should it also be applied to the ("bells" "and" "whistles") phrase? Another question is which one takes precedence if the thesauri expansion can be applied both to a phrase and to a word in the phrase.
Resolution:
None recorded.
Exactly in FTRangeSpec
Should "exactly" be optional? Should we allow both "word distance 6" and "word distance exactly 6"? Raised at Redmond May 2004 by Steve Buxton and Pat Case.
Resolution:
None recorded.
FTContains Semantics
FTContains operates on a sequence of nodes. Strings cannot be searched.
Raised at Redmond May 2004 by Steve Buxton. See also: http://lists.w3.org/Archives/Member/member-query-fttf/2004May/0085.html (W3C members only)
Resolution:
None recorded.
Issue (matchoptions-defaults):
MatchOptions Default
We need to specify defaults for MatchOptions. We should align this default with the static context for XQuery/XPath and add to the XQuery prolog corresponding declarations to set query-wide defaults.
Resolution:
None recorded.
FTNegation Semantics
We need to specify the semantics of FTNegation.
Raised by Jochen Doerre. See http://lists.w3.org/Archives/Member/member-query-fttf/2004May/0082.html (W3C members only).
Resolution:
We decided to use <allMatches/> to denote false. See answer to http://lists.w3.org/Archives/Member/member-query-fttf/2004May/0082.html (W3C members only).
Zero-length phrase
If Expr in FTWords results in the empty sequence or the tokenization results in a zero-length phrase, the result is? Always a match, never a match? Depending on the keyword?
Resolution:
None recorded.
Stop words option
The syntax and semantics of stop words are still under discussion.
3.2.6 FTStopwordOption is inconsistent with the grammar and semantics.
the second example includes "without stop words" NOT followed by an expression, which is not valid according to the EBNF (see also the default options query in 3.2 FTMatchOptions)
the keyword "additional" is not part of the current grammar
the text and examples in 3.2.6 FTStopwordOption imply that queries work as though stop words were removed from documents before positions are calculated, which is inconsistent with the description in 4.2.4 FTStopWordOption
Resolution:
None recorded.
Grammar Precedence and Lookahead
When integrating the XQuery Full-Text grammar with the XQuery 1.0 grammar, there were a number of challenges. Challenges include (using pseudo-code for examples):
The Full-Text operators must have the correct precedence (binding order) with respect to XQuery operators
It must be possible to override the default precedence of the Full-text operators - e.g. you must be able to express "(cat and dog) or mouse" as well as "cat and (dog or mouse)"
You must be able to embed XQuery expressions in the Full-Text expression, e.g. "cat and $i"
You must be able to embed the XQuery Full-Text expression in an arbitrarily-complex XQuery expression, e.g. "where title ftcontains ('dog' and 'cat') and price/dollars < 3 or disclaimer ftcontains 'buy this'"
The Working Groups discussed a number of ways of achieving this. The current grammar satisfies these requirements at the cost of introducing ambiguity in one place. The current XQuery 1.0 grammar is LL(1) - i.e. it is possible to write a parser that reads a query from left to right and only looks 1 token ahead. But the XQuery Full-Text grammar is NOT LL(1). At production [168] (FTWordsSelection) the parser must look ahead a full non-terminal - it must try to expand FTWords, and if that fails it must try to expand (FTSelection).
This is still under discussion - the Working Groups may remove the requirement for lookahead in a future publication.
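The "try FTWords, then fall back to (FTSelection)" behaviour can be sketched with a small backtracking parser. The Python below is an illustration only and makes strong simplifying assumptions: the stand-in for PrimaryExpr accepts just string literals and parenthesized, comma-separated sequences of them, match options and proximity are omitted, and all function names and tree labels are hypothetical.

```python
import re

TOKEN = re.compile(r'"[^"]*"|\|\||&&|[(),]')

class ParseError(Exception):
    pass

def tokenize(text):
    tokens, pos = [], 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        m = TOKEN.match(text, pos)
        if m is None:
            raise ParseError(f"unexpected character at {pos}")
        tokens.append(m.group())
        pos = m.end()
    return tokens

def expect(tokens, i, tok):
    if i < len(tokens) and tokens[i] == tok:
        return i + 1
    raise ParseError(f"expected {tok!r} at token {i}")

def primary(tokens, i):
    """Stand-in for PrimaryExpr: a string literal or a parenthesized sequence."""
    if i < len(tokens) and tokens[i].startswith('"'):
        return ("literal", tokens[i][1:-1]), i + 1
    i = expect(tokens, i, "(")
    item, i = primary(tokens, i)
    items = [item]
    while i < len(tokens) and tokens[i] == ",":
        item, i = primary(tokens, i + 1)
        items.append(item)
    return ("xquery-paren", items), expect(tokens, i, ")")

def ft_words_selection(tokens, i):
    try:
        tree, j = primary(tokens, i)       # first attempt: FTWords
        return ("FTWords", tree), j
    except ParseError:
        j = expect(tokens, i, "(")         # backtrack: "(" FTSelection ")"
        tree, j = ft_selection(tokens, j)
        return ("FTParen", tree), expect(tokens, j, ")")

def ft_and(tokens, i):
    left, i = ft_words_selection(tokens, i)
    while i < len(tokens) and tokens[i] == "&&":
        right, i = ft_words_selection(tokens, i + 1)
        left = ("&&", left, right)
    return left, i

def ft_selection(tokens, i):
    left, i = ft_and(tokens, i)
    while i < len(tokens) and tokens[i] == "||":
        right, i = ft_and(tokens, i + 1)
        left = ("||", left, right)
    return left, i

def parse(text):
    tokens = tokenize(text)
    tree, i = ft_selection(tokens, 0)
    if i != len(tokens):
        raise ParseError("trailing tokens")
    return tree
```

In this sketch, parse('("cat")') succeeds on the first attempt and yields an FTWords node wrapping an XQuery parenthesized expression, whereas parse('("cat" && "dog")') fails as FTWords at the "&&" token, is re-read from the opening parenthesis, and yields an FTParen node. The amount of input re-read is unbounded, which is why a single token of look-ahead is not enough.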
Resolution:
None recorded.
We would like to thank the members of the XQuery and XPath Full-Text group for their fruitful discussions.
We would like to thank the following people for their contributions on earlier drafts of this document.
"Andrew Eisenberg" - IBM - andrew.eisenberg@us.ibm.com
"Roland Seiffert" - IBM - seiffert@de.ibm.com
"Andrew Cencini" - Microsoft - acencini@microsoft.com
"Nimish Khanolkar" - Microsoft - nimishk@exchange.microsoft.com
"Ashok Malhotra" - Microsoft - ashokma@microsoft.com
"Tapas Nayak" - Microsoft - tapasnay@exchange.microsoft.com