This document is also available in these non-normative formats: XML.
Copyright © 2005 W3C® (MIT, ERCIM, Keio), All Rights Reserved. W3C liability, trademark and document use rules apply.
This document defines the syntax and formal semantics of XQuery 1.0 and XPath 2.0 Full-Text, a language that extends XQuery 1.0 [XQuery 1.0: An XML Query Language] and XPath 2.0 [XML Path Language (XPath) 2.0] with full-text search capabilities.
This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.
This document has been produced following the procedures set out for the W3C Process. This document was produced through the efforts of XML Query Working Group and the XSL Working Group (both part of the XML Activity). It is designed to be read in conjunction with the following documents: W3C XQuery and XPath Full-Text Requirements [XQuery and XPath Full-Text Requirements] and the W3C XQuery Full-Text Use Cases [XQuery 1.0 and XPath 2.0 Full-Text Use Cases].
This is the fourth version of this document. Since the last version was published, several technical and editorial changes have been made to all the sections of the document. Among the most significant changes are: a reformulation of FTIgnore, including alignment of the specifications in Section 3 and Section 4; a more complete normalization of the rules for matching, significantly simplifying the behavior; a thorough pass through the entire document to correct grammar, spelling, and punctuation, resulting in significantly higher document quality; the distance between sentences and between paragraphs has been respecified to align with the distance between words (that is, adjacent sentences or paragraphs now have a distance between them of zero sentences or paragraphs, respectively); and the addition of two new appendices, one summarizing the error codes used in the Full-Text document and the other summarizing all items specified in the document to be implementation-defined.
The text of the XQuery functions used to define the semantics has not been completely syntax checked; that continues to be an ongoing activity.
This is a public W3C Working Draft for review by W3C members and other interested parties. Publication as a Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.
Public comments on this document and its open issues are invited. Comments should be entered into the last-call issue tracking system for this specification (instructions can be found at http://www.w3.org/XML/2005/04/qt-bugzilla). If access to that system is not feasible, you may send your comments to the W3C mailing list, public-qt-comments@w3.org (http://lists.w3.org/Archives/Public/public-qt-comments/) with "[FT]" at the beginning of the subject field of email messages involving such comments.
The patent policy for this document is specified in the 5 February 2004 W3C Patent Policy. Patent disclosures relevant to this specification may be found on the XML Query Working Group's patent disclosure page and the XSL Working Group's patent disclosure page. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) with respect to this specification should disclose the information in accordance with section 6 of the W3C Patent Policy.
1 Introduction
1.1 Full-Text Search and XML
1.2 Organization of this document
1.3 A word about namespaces
2 Full-Text Extensions to XQuery and XPath
2.1 Expression FTContainsExpr
2.1.1 FTContainsExpr Description
2.1.2 FTContainsExpr Examples
2.2 Score Variables
2.2.1 Using Weights Within a Scored FTContainsExpr
2.3 Extensions to the Static Context
3 FTSelections, FTMatchOptions, and FTIgnoreOption
3.1 FTSelection
3.1.1 FTSelection Example
3.1.2 FTWords
3.1.3 FTOr
3.1.4 FTAnd
3.1.5 FTMildNot
3.1.6 FTUnaryNot
3.1.7 FTOrder
3.1.8 FTScope
3.1.9 FTDistance
3.1.10 FTWindow
3.1.11 FTTimes
3.1.12 FTContent
3.2 FTMatchOptions
3.2.1 FTCaseOption
3.2.2 FTDiacriticsOption
3.2.3 FTStemOption
3.2.4 FTThesaurusOption
3.2.5 FTStopwordOption
3.2.6 FTLanguageOption
3.2.7 FTWildCardOption
3.3 FTIgnoreOption
4 Semantics
4.1 Introduction
4.2 Nested XQuery 1.0 and XPath 2.0 Expressions
4.2.1 Left-hand Side of a FTContainsExpr
4.2.2 FTWords
4.2.3 FTRangeSpec
4.2.4 FTStopWordOption
4.2.5 FTThesaurusOption
4.2.6 FTLanguageOption
4.2.7 Tokenization
4.3 Evaluation of FTSelections
4.3.1 AllMatches
4.3.1.1 Formal Model
4.3.1.2 Examples
4.3.1.3 XML representation
4.3.1.4 Match and AllMatches Normal Form
4.3.1.5 The normalizeAllMatches function
4.3.2 FTSelections
4.3.2.1 XML Representation
4.3.2.2 The evaluate function
4.3.2.3 Formal semantics functions
4.3.2.4 FTWords
4.3.2.5 FTOr
4.3.2.6 FTAnd
4.3.2.7 FTUnaryNot
4.3.2.8 FTMildNot
4.3.2.9 FTOrder
4.3.2.10 FTScope
4.3.2.11 FTContent
4.3.2.12 FTDistance
4.3.2.13 FTWindow
4.3.2.14 FTTimes
4.3.3 Match Options Semantics
4.3.3.1 Types
4.3.3.2 High-Level Semantics
4.3.3.3 Formal Semantics Functions
4.3.3.4 FTCaseOption
4.3.3.5 FTDiacriticsOption
4.3.3.6 FTStemOption
4.3.3.7 FTStopWordOption
4.3.3.8 FTLanguageOption
4.3.3.9 FTWildCardOption
4.4 XQuery 1.0 and XPath 2.0 Full-Text and Scoring Expressions
4.4.1 FTContainsExpr
4.4.1.1 Semantics of FTContainsExpr
4.4.1.2 Example
4.4.2 Scoring
A EBNF for XQuery 1.0 Grammar with Full-Text extensions
A.1 Terminal Symbols
B EBNF for XPath 2.0 Grammar with Full-Text extensions
B.1 Terminal Symbols
C Static Context Components
D Error Conditions
E References
E.1 Normative References
E.2 Non-normative References
F Acknowledgements (Non-Normative)
G Glossary (Non-Normative)
H Checklist of Implementation-Defined Features (Non-Normative)
I Issues List (Non-Normative)
J Change Log (Non-Normative)
This document defines the language and the formal semantics of XQuery 1.0 and XPath 2.0 Full-Text. This language is designed to meet the requirements identified in W3C XQuery and XPath Full-Text Requirements [XQuery and XPath Full-Text Requirements] and to support the queries in the W3C XQuery Full-Text Use Cases [XQuery 1.0 and XPath 2.0 Full-Text Use Cases].
XQuery 1.0 and XPath 2.0 Full-Text extends the syntax and semantics of XQuery 1.0 and XPath 2.0.
As XML becomes mainstream, users expect to be able to search their XML documents. This requires a standard way to do full-text search, as well as structured searches, against XML documents. A similar requirement for full-text search led ISO to define the SQL/MM-FT [SQL/MM] standard. SQL/MM-FT [SQL/MM] defines extensions to SQL to express full-text searches, providing functionality similar to that of this full-text language extension to XQuery 1.0 and XPath 2.0.
XML documents may contain highly-structured data (numbers, dates), unstructured data (untagged free-flowing text), and semi-structured data (text with embedded tags). Where a document contains unstructured or semi-structured data, it is important to be able to search using Information Retrieval techniques such as scoring and weighting.
Full-text search is different from substring search in many ways:
A full-text search searches for words and phrases rather than substrings. A substring search for news items that contain the string "lease" will return a news item that contains "Foobar Corporation releases the 20.9 version ...". A full-text search for the word "lease" will not.
There is an expectation that a full-text search will support language-based searches, which substring search cannot. An example of a language-based search is "find me all the news items that contain a word with the same linguistic stem as 'mouse'" (which finds "mouse" and "mice"). Another example, based on word proximity, is "find me all the news items that contain the words 'XML' and 'Query', allowing up to 3 intervening words".
Full-text search must address the vagaries and nuances of language. Search results are often of varying usefulness. When you search a web site for cameras that cost less than $100, this is an exact search. There is a set of cameras that matches this search, and a set that does not. Similarly, when you do a string search across news items for "mouse", there is only 1 expected result set. When you do a full-text search for all the news items that contain the word "mouse", you probably expect to find news items containing the word "mice", and possibly "rodents", or possibly "computers". Not all results are equal. Some results are more "mousey" than others. Because full-text search may be inexact, we have the notion of score or relevance. We generally expect to see the most relevant results at the top of the results list.
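The difference between substring search and word-based search described above can be illustrated with a small, non-normative Python sketch. The function names and the regular-expression tokenizer are assumptions made for this illustration only; actual tokenization is implementation-defined.

```python
import re

def substring_search(text: str, needle: str) -> bool:
    """Plain substring test: matches anywhere, even inside a longer word."""
    return needle.lower() in text.lower()

def word_search(text: str, word: str) -> bool:
    """Word-level test using a toy tokenizer (runs of letters and digits)."""
    tokens = re.findall(r"\w+", text.lower())
    return word.lower() in tokens

news = "Foobar Corporation releases the 20.9 version"
print(substring_search(news, "lease"))  # the string occurs inside "releases"
print(word_search(news, "lease"))       # no token is exactly "lease"
```

The sketch covers only the first difference in the list; stemming, proximity, and scoring are not modeled here.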
As XQuery and XPath evolve, they may apply the notion of score to querying structured data. For example, when making travel plans or shopping for cameras, it is sometimes useful to get an ordered list of near matches in addition to exact matches. If XQuery and XPath define a generalized inexact match, we expect XQuery and XPath to utilize the scoring framework provided by XQuery and XPath Full-Text.
The following definitions apply to full-text search:
Full-text queries are performed on text which has been tokenized, i.e., broken into a sequence of words, units of punctuation, and spaces.
A word is defined as a character, n-gram, or sequence of characters returned by a tokenizer as a basic unit to be searched. Each instance of a word consists of one or more consecutive characters. Beyond that, words are implementation-defined. Note that consecutive words need not be separated by either punctuation or space, and words may overlap. A phrase is a sequence of ordered words which may contain any number of words.
Tokenization enables functions and operators which work with the relative positions of words (e.g., proximity operators). It also uniquely identifies sentences and paragraphs in which words appear. It enables functions and operators which operate on a part or the root of the word (e.g., wildcards, stemming). Whatever a tokenizer for a particular language chooses to do, it must preserve the containment hierarchy: paragraphs contain sentences which contain words. The tokenizer has to evaluate two equal strings in the same way, i.e., it should identify the same tokens. Everything else is implementation-defined.
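As a non-normative illustration of the containment hierarchy, the following Python sketch shows one possible sample tokenizer. It assumes that paragraphs are separated by blank lines and that sentences end in ".", "!" or "?"; the Token type and the tokenize function are hypothetical names, not part of this specification.

```python
import re
from typing import List, NamedTuple

class Token(NamedTuple):
    word: str
    position: int   # 0-based word position in the whole text
    sentence: int   # 0-based sentence index in the whole text
    paragraph: int  # 0-based paragraph index

def tokenize(text: str) -> List[Token]:
    """Toy tokenizer: every word belongs to one sentence and one paragraph."""
    tokens: List[Token] = []
    position = 0
    sentence = 0
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    for p_index, para in enumerate(paragraphs):
        # Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
        for sent in re.split(r"(?<=[.!?])\s+", para.strip()):
            words = re.findall(r"\w+", sent.lower())
            if not words:
                continue
            for w in words:
                tokens.append(Token(w, position, sentence, p_index))
                position += 1
            sentence += 1
    return tokens

sample = ("The usability of a Web site matters. A Web site should help.\n\n"
          "This book has been approved.")
for tok in tokenize(sample):
    print(tok)
```

Because the function is deterministic, two equal strings are always tokenized in the same way, which is the only other constraint stated above.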
This specification focuses on functionality that serves all languages. It also selectively includes functionalities useful within specific families of languages. For example, searching within sentences and paragraphs is useful to many western languages and to some non-western languages, so that functionality is incorporated into this specification.
Some XML elements represent semantic markup, e.g., <title>. Others represent formatting markup, e.g., <b> to indicate bold. Semantic markup serves well as token boundaries, while formatting markup sometimes does not. Implementations are free to provide implementation-defined ways to differentiate between the markup's effect on token boundaries during tokenization.
This document is organized as follows. We first present a high-level syntax for the XQuery 1.0 and XPath 2.0 Full-Text language along with some examples. Then, we present the syntax and examples of the basic primitives in the XQuery 1.0 and XPath 2.0 Full-Text language. This is followed by the semantics of the XQuery 1.0 and XPath 2.0 Full-Text language. The appendices provide an EBNF for the XPath 2.0 Grammar with Full-Text extensions, an EBNF for the XQuery 1.0 Grammar with Full-Text extensions, a list of issues, acknowledgements, and a glossary.
Certain namespace prefixes are predeclared by XQuery 1.0 and, by implication, by this specification, and bound to fixed namespace URIs. These namespace prefixes are as follows:
xml = http://www.w3.org/XML/1998/namespace
xs = http://www.w3.org/2001/XMLSchema
xsi = http://www.w3.org/2001/XMLSchema-instance
fn = http://www.w3.org/2005/xpath-functions
xdt = http://www.w3.org/2005/xpath-datatypes
local = http://www.w3.org/2005/xquery-local-functions
In addition to the prefixes in the above list, this document uses the prefix err to represent the namespace URI http://www.w3.org/2005/xqt-errors. This namespace prefix is not predeclared and its use in this document is not normative. Error codes that are not defined in this document are defined in other XQuery 1.0 and XPath 2.0 specifications, particularly [XML Path Language (XPath) 2.0] and [XQuery 1.0 and XPath 2.0 Functions and Operators].
Finally, this document uses the prefix fts to represent a namespace containing a number of functions used in this document to describe the semantics of XQuery 1.0 and XPath 2.0 Full-Text functions. There is no requirement that these functions be implemented; therefore, no URI is associated with that prefix.
XQuery 1.0 and XPath 2.0 Full-Text extends the languages of XQuery 1.0 and XPath 2.0 in three ways. It:
Adds a new expression called FTContainsExpr;
Enhances the syntax of FLWOR expressions in XQuery 1.0 and for expressions in XPath 2.0 with optional score variables; and
Adds static context declarations for full-text match options to the query prolog.
XQuery 1.0 and XPath 2.0 Full-Text extends the languages of XQuery 1.0 and XPath 2.0 by adding the expression FTContainsExpr. An FTContainsExpr is similar to a comparison expression (see Section 3.5.2 General ComparisonsXQ). This grammar rule introduces FTContainsExpr.
[50] | ComparisonExpr |
::= | FTContainsExpr ( (ValueComp |
An FTContainsExpr may be used anywhere a ComparisonExpr may be used. FTContainsExprs have higher precedence than comparison operators, so the results of FTContainsExpr may be compared without enclosing them in parentheses.
[51] | FTContainsExpr |
::= | RangeExpr ( "ftcontains" FTSelection FTIgnoreOption? )? |
An FTContainsExpr returns a Boolean value. It returns true, if there is some node in RangeExpr that matches FTSelection. For the purpose of determining a match, certain descendants of nodes in RangeExpr may be ignored, as specified in FTIgnoreOption.
FTSelections are composed of the following ingredients:
Words and phrases that are the strings to be found as matches;
Match options, such as indicators for case sensitivity and stop words;
Boolean operators, that compose an FTSelection from simpler FTSelections; and
Constraints on the positions of matches, such as indicators for distance between words and for the cardinality of matches.
The following example in extended XQuery 1.0 returns the author of each book with a title containing a word with the same root as dog and the word cat.
for $b in /books/book where $b/title ftcontains ("dog" with stemming) && "cat" return $b/author
The same example in extended XPath 2.0 is written as:
/books/book[title ftcontains ("dog" with stemming) && "cat"]/author
Besides specifying a match of a full-text search as a Boolean condition, full-text search applications typically also have the ability to associate scores with the results. Such scores express the relevance of those results to the full-text search conditions.
XQuery 1.0 and XPath 2.0 Full-Text extends the languages of XQuery 1.0 and XPath 2.0 further by adding optional score variables to the for and let clauses of FLWOR expressions.
The production for the extended for clause follows.
[35] | ForClause |
::= | "for" "$" VarName TypeDeclaration? PositionalVar? FTScoreVar? "in" ExprSingle ("," "$" VarName TypeDeclaration? PositionalVar? FTScoreVar? "in" ExprSingle)* |
[37] | FTScoreVar |
::= | "score" "$" VarName |
When a score variable is present in a for clause, the evaluation of the expression following the in keyword must determine not only the result sequence of the expression, i.e., the sequence of items which are iteratively bound to the for variable, but also, in each iteration, the relevance "score" value of the current item, to which the score variable is bound.
In the following example, book elements are determined that satisfy the condition [content ftcontains "web site" && "usability" and .//chapter/title ftcontains "testing"]. The scores assigned to the book elements are returned.
for $b score $s in /books/book[content ftcontains "web site" && "usability" and .//chapter/title ftcontains "testing"] return $s
XPath 2.0 Full-Text extends the language of XPath 2.0 in the for expression in the same way: with optional score variables. The example above is also a legal example of the XPath 2.0 extension.
Scores are typically used to order results, as in the following, more complete example.
for $b score $s in /books/book[content ftcontains "web site" && "usability"] where $s > 0.5 order by $s descending return <result> <title> {$b//title} </title> <score> {$s} </score> </result>
The score variable is bound to a value which reflects the relevance of the match criteria in the FTSelections to the nodes in the respective RangeExprs. The calculation of relevance is implementation-dependent, but score evaluation must follow these rules:
Score values are of type xs:float in the range [0, 1].
For score values greater than 0, a higher score must imply a higher degree of relevance.
Similar to their use in a for clause, score variables may be specified in a let clause. A score variable in a let clause is also bound to the score of the expression evaluation, but in the let clause one score is determined for the complete result. The let variable may be dropped from the let clause, if the score variable is present.
The production for the extended let clause follows.
[38] | LetClause |
::= | (("let" "$" VarName TypeDeclaration? FTScoreVar?) | ("let" "score" "$" VarName)) ":=" ExprSingle ("," (("$" VarName TypeDeclaration? FTScoreVar?) | FTScoreVar) ":=" ExprSingle)* |
While the score option in a for clause conveniently allows one to specify that the filtering expression, which drives the iteration, is at the same time the expression that determines the scores, it is possible to separate the filtering expression from the scoring expression using the let clause syntax. The following is an example of this.
for $b in /books/book[.//chapter/title ftcontains "testing"] let score $s := $b/content ftcontains "web site" && "usability" order by $s descending return <result score="{$s}">{$b}</result>
This example returns book elements with chapter titles that contain "testing". Scores are returned along with the book elements. These scores, however, reflect whether the book content contains "web site" and "usability".
Note that the score of an FTContainsExpr is not required to be 0 if the expression evaluates to false, nor to be non-zero if the expression evaluates to true. Hence, in the example above it is not possible to infer the Boolean value of the FTContainsExpr in the let clause from the calculated score of a returned result element. For instance, an implementation may want to assign a non-zero score to a book that contains only "web site", but not "usability", as this may be considered more relevant than a book that contains neither.
The use of score variables introduces a second-order aspect to the evaluation of expressions which cannot be emulated by (first-order) XQuery functions. Consider the following replacement of the clause let score $s := FTContainsExpr:
let $s := score(FTContainsExpr)
where a function score is applied to some FTContainsExpr. If the function score were first-order, it would only be applied to the result of the evaluation of its argument, which is one of the Boolean constants true or false. Hence, there would be at most two possible values such a score function would be able to return and no further differentiation would be possible.
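The argument above can be made concrete with a small, non-normative Python sketch. The function first_order_score and the substring-based stand-in for an FTContainsExpr are hypothetical and serve only to show that a function receiving a Boolean cannot distinguish a partial match from no match at all.

```python
def first_order_score(matched: bool) -> float:
    """A hypothetical first-order score(): it sees only the Boolean result."""
    return 1.0 if matched else 0.0

contents = ["web site usability", "web site design", "gardening tips"]

# Toy stand-in for: content ftcontains "web site" && "usability"
results = [("web site" in c) and ("usability" in c) for c in contents]

# Whatever the function does internally, it can produce at most two values.
scores = {first_order_score(r) for r in results}
print(results, scores)
```

The second and third items receive the same score, although the second contains "web site" and the third contains neither search string.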
Scoring may be influenced by adding weight declarations to individual search words, phrases, and expressions. Weight declarations are described in detail in Section 3.1.
for $b in /books/book let score $s := $b/content ftcontains ("web site" weight 0.2) && ("usability" weight 0.8) return <result score="{$s}">{$b}</result>
The effect of weights on the result score is implementation-dependent. However, weight declarations must follow these rules:
Weights in an FTContainsExpr are significant only in relation to each other; and
When no explicit weight is specified, the default weight is 0.5.
Weight declarations in an FTContainsExpr for which no scores are evaluated are ignored.
The XQuery Static Context is extended by a component for each of the full-text match options. Thus, the default of a match option in a query may be changed by providing a setting in the static context using the following declaration syntax.
[6] | Prolog |
::= | ((DefaultNamespaceDecl | Setter | NamespaceDecl | Import) Separator)* ((VarDecl | FunctionDecl | OptionDecl | FTOptionDecl) Separator)* |
[14] | FTOptionDecl |
::= | "declare" "ft-option" FTMatchOption |
Match options modify the match semantics of full-text expressions. They are described in detail in Section 3.2 FTMatchOptions. When a match option is specified explicitly in a query, that setting overrides the setting of the respective match option in the static context.
This section describes FTSelection which contains the full-text operators in the FTContainsExpr, and the match options in FTMatchOptions which modify the matching semantics of the full-text selection expressions.
The FTSelection production specifies the possible full-text search conditions.
[144] | FTSelection |
::= | FTOr (FTMatchOption | FTProximity)* ("weight" DecimalLiteral)? |
The syntax and semantics of the individual full-text selection operators follow.
This XML document fragment is the source document for examples in this section.
Tokenization is implementation-defined. A sample tokenization is used for the examples in this section. The results may be different for other tokenizations.
Unless stated otherwise, the results assume a case-insensitive match.
<book number="1"> <title shortTitle="Improving Web Site Usability">Improving the Usability of a Web Site Through Expert Reviews and Usability Testing</title> <author>Millicent Marigold</author> <author>Montana Marigold</author> <editor>Véra Tudor-Medina</editor> <content> <p>The usability of a Web site is how well the site supports the users in achieving specified goals. A Web site should facilitate learning, and enable efficient and effective task completion, while propagating few errors. </p> <note>This book has been approved by the Web Site Users Association. </note> </content> </book>
FTWords specifies the words and phrases that are searched for in the left-hand side argument of an FTContainsExpr.
[150] | FTWords |
::= | (Literal | VarRef | ContextItemExpr | FunctionCall | ("{" Expr "}")) FTAnyallOption? |
The right-hand side of the above production must evaluate to a sequence of string values or nodes of type "xs:string". The result is then atomized into a sequence of strings which is tokenized into a sequence of words and phrases. If the atomized sequence is not a subtype of "xs:string*", an error is raised: [err:XPTY0004]XP.
If the "any" option is specified, a match occurs, if and only if at least one word or phrase in the sequence has a match in the searched text.
If the "all" option is specified, a match occurs, if and only if all of the words and phrases in the sequence are matched in the searched text.
If the "phrase" option is specified, the sequence of words and phrases is used to create a single phrase by concatenating the words and phrases and interleaving whitespace. A match occurs, if and only if the resulting phrase is matched in the searched text.
If the "any word" option is specified, a match occurs, if and only if at least one word in the sequence of words and phrases is matched in the searched text.
If the "all word" option is specified, a match occurs, if and only if all words in the sequence of words and phrases are matched in the searched text.
If no option is specified, "any" is the default.
If the result is a single string, "any", "all", and "phrase" are equivalent.
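The options listed above can be sketched in Python as follows. This is a minimal, non-normative model that assumes a case-insensitive match and a toy tokenizer; the names ft_words, contains_phrase, and words_of are hypothetical, and match options such as stemming are not modeled.

```python
import re
from typing import List

def words_of(s: str) -> List[str]:
    """Toy tokenizer: lower-cased runs of letters and digits."""
    return re.findall(r"\w+", s.lower())

def contains_phrase(tokens: List[str], phrase: List[str]) -> bool:
    """True if the words of phrase occur consecutively in tokens."""
    n = len(phrase)
    return n > 0 and any(tokens[i:i + n] == phrase
                         for i in range(len(tokens) - n + 1))

def ft_words(text: str, strings: List[str], option: str = "any") -> bool:
    tokens = words_of(text)
    phrases = [words_of(s) for s in strings]
    if option == "any":
        return any(contains_phrase(tokens, p) for p in phrases)
    if option == "all":
        return all(contains_phrase(tokens, p) for p in phrases)
    if option == "phrase":
        # Concatenate all words and phrases into one single phrase.
        joined = [w for p in phrases for w in p]
        return contains_phrase(tokens, joined)
    singles = [[w] for p in phrases for w in p]
    if option == "any word":
        return any(contains_phrase(tokens, w) for w in singles)
    if option == "all word":
        return all(contains_phrase(tokens, w) for w in singles)
    raise ValueError("unknown option: " + option)

title = ("Improving the Usability of a Web Site Through Expert Reviews "
         "and Usability Testing")
print(ft_words(title, ["Expert", "Reviews"], "all"))
print(ft_words(title, ["Web Site Usability"], "any"))
print(ft_words(title, ["Web Site Usability"], "all word"))
```

With a single search string, "any", "all", and "phrase" take the same code path through contains_phrase, which mirrors the equivalence stated above.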
/book[@number="1" and ./title ftcontains "Expert"]
returns the book element whose number is 1, because its title element contains the word "Expert".
/book[@number="1" and ./title ftcontains "Expert Reviews"]
returns the book element whose number is 1, because its title element contains the phrase "Expert Reviews".
/book[@number="1" and ./title ftcontains {"Expert", "Reviews"} all]
returns the book element whose number is 1, because its title element contains the two words "Expert" and "Reviews".
/book[@number="1"]//p ftcontains "Web Site Usability"
returns false, because the p element doesn't contain the phrase "Web Site Usability" although it contains all of the words in the phrase.
for $book in /book[.//author ftcontains "Marigold"] let score $score := $book/title ftcontains "Web Site Usability" where $score > 0.8 order by $score descending return $book/@number
returns the numbers of book elements by "Marigold" with a title about "Web Site Usability", sorted in descending score order.
[145] | FTOr |
::= | FTAnd ( "||" FTAnd )* |
FTOr finds matches that satisfy at least one of the selection criteria.
A match must satisfy at least one of the FTSelection criteria.
/book[.//author ftcontains "Millicent" || "Voltaire"]
returns the book element written by "Millicent".
[146] | FTAnd |
::= | FTMildnot ( "&&" FTMildnot )* |
FTAnd finds matches that satisfy all of the selection criteria.
A match must satisfy all of the FTSelection criteria which are specified by one or more FTMildNot expressions.
/book[@number="1"]/title ftcontains ("usability" && "testing")
returns true, since the book title contains "usability" and "testing".
/book/author ftcontains "Millicent" && "Montana"
returns false, because "Millicent" and "Montana" are not contained by the same author element in any book element.
[147] | FTMildnot |
::= | FTUnaryNot ( "not" "in" FTUnaryNot )* |
FTMildNot is a milder form of && ! (and not). 'a not in b' matches an expression that contains "a", but not when it is a part of "b". For example, a search for "Mexico" not in "New Mexico" returns, among others, a document which is all about "Mexico" but mentions at the end that "New Mexico was named after Mexico", which would not be returned by an "and not" search.
A match to FTMildNot must contain at least one word occurrence that satisfies the first condition and does not satisfy the second condition. If it contains a word occurrence that satisfies both the first and the second condition, the occurrence is not considered as a result.
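A simplified, non-normative Python sketch of this rule follows. It assumes that an occurrence of the first operand is discarded when all of its word positions are covered by a single occurrence of the second operand; the function names and the toy tokenizer are hypothetical.

```python
import re
from typing import List, Set

def tokenize(text: str) -> List[str]:
    """Toy tokenizer: lower-cased runs of letters and digits."""
    return re.findall(r"\w+", text.lower())

def occurrences(tokens: List[str], phrase: str) -> List[Set[int]]:
    """Word positions covered by each occurrence of phrase in tokens."""
    words = tokenize(phrase)
    n = len(words)
    if n == 0:
        return []
    return [set(range(i, i + n))
            for i in range(len(tokens) - n + 1)
            if tokens[i:i + n] == words]

def mild_not(text: str, a: str, b: str) -> bool:
    """'a not in b': keep occurrences of a not covered by an occurrence of b."""
    tokens = tokenize(text)
    b_occurrences = occurrences(tokens, b)
    kept = [occ for occ in occurrences(tokens, a)
            if not any(occ <= cover for cover in b_occurrences)]
    return len(kept) > 0

doc = "Mexico is large. New Mexico was named after Mexico."
print(mild_not(doc, "Mexico", "New Mexico"))
print(mild_not("I live in New Mexico", "Mexico", "New Mexico"))
```

In the first call, the occurrences of "Mexico" at the start and at the end of the text survive, so the document matches even though it also mentions "New Mexico".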
/book ftcontains "usability" not in "usability testing"
returns true, because "usability" appears in the title and the p elements and the occurrence within the phrase "Usability Testing" in the title element is not considered.
The right-hand side of an FTMildNot may not contain an FTSelection that evaluates to an AllMatches that contains a StringExclude. Such FTSelections are FTUnaryNot and FTTimes with "at most", "from-to", and "exactly" occurrence ranges.
[148] | FTUnaryNot |
::= | ("!")? FTWordsSelection |
FTUnaryNot finds matches that do not satisfy the selection criteria.
/book[. ftcontains ! "usability"]
returns the empty sequence, because all book elements contain "usability".
/book ftcontains "information" && "retrieval" && ! "information retrieval"
returns true, because book elements contain "information" and "retrieval" but not "information retrieval".
/book[. ftcontains "web site usability" && !"usability testing"]
returns book elements containing "web site usability" but not "usability testing".
[152] | FTOrderedIndicator |
::= | "ordered" |
FTOrder controls the order of words and phrases to be the same as the order in which they are written in the query.
The default is unordered. Unordered is in effect when ordered is not specified in the query. Unordered cannot be written explicitly in the query.
FTOrder finds matches which must satisfy the nested selection condition and the match must contain the words in the order specified in the query.
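The ordering condition can be sketched in Python as follows. This non-normative model is restricted to single words and a toy tokenizer; the function name ordered is hypothetical.

```python
import re
from typing import List

def ordered(text: str, words: List[str]) -> bool:
    """True if the words can be matched at strictly increasing positions."""
    tokens = re.findall(r"\w+", text.lower())
    start = 0
    for word in words:
        try:
            # Earliest occurrence at or after 'start'; continue after it.
            start = tokens.index(word.lower(), start) + 1
        except ValueError:
            return False
    return True

title = ("Improving the Usability of a Web Site Through Expert Reviews "
         "and Usability Testing")
print(ordered(title, ["web", "site", "usability"]))
print(ordered(title, ["testing", "usability"]))
```

Taking the earliest possible occurrence of each word is sufficient here, because choosing a later occurrence can never make a following word easier to place.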
/book/title ftcontains ("web site" && "usability") ordered
returns true, because titles of book elements contain "web site" and "usability" in the order in which they are written in the query, i.e., "web site" must precede "usability".
/book[@number="1"]/title ftcontains ("Montana" && "Millicent") ordered
returns false, because although "Montana" and "Millicent" appear in the title element, they do not appear in the order they are written in the query.
[170] | FTScope |
::= | ("same" | "different") FTBigUnit |
[172] | FTBigUnit |
::= | "sentence" | "paragraph" |
FTScope finds words and phrases contained in the same or a different scope.
Possible scopes are sentences and paragraphs.
By default, there is no restriction on the scope of the matches.
If two words appear in the same sentence and in different sentences, then both same sentence and different sentence return true. The same is true for same paragraph and different paragraph.
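The observation that both scopes can hold at once can be sketched in Python. This is a non-normative model that assumes sentences end in ".", "!" or "?"; the function names are hypothetical.

```python
import re
from itertools import product
from typing import List

def sentence_indexes(text: str, word: str) -> List[int]:
    """Indexes of the sentences in which word occurs (one per occurrence)."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [i for i, s in enumerate(sentences)
            for w in re.findall(r"\w+", s.lower()) if w == word.lower()]

def same_sentence(text: str, w1: str, w2: str) -> bool:
    pairs = product(sentence_indexes(text, w1), sentence_indexes(text, w2))
    return any(a == b for a, b in pairs)

def different_sentence(text: str, w1: str, w2: str) -> bool:
    pairs = product(sentence_indexes(text, w1), sentence_indexes(text, w2))
    return any(a != b for a, b in pairs)

text = "Usability matters for every site. The site avoids errors."
print(same_sentence(text, "usability", "site"))
print(different_sentence(text, "usability", "site"))
```

Here "site" occurs in both sentences, so one pairing with "usability" satisfies same sentence while another satisfies different sentence.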
/book ftcontains "usability" && "Marigold" same sentence
returns false, because the words "usability" and "Marigold" are not contained within the same sentence.
/book ftcontains "usability" && "Marigold" different sentence
returns true, because the words "usability" and "Marigold" are contained within different sentences.
/book[. ftcontains "usability" && "testing" same paragraph]
returns a book element, because it contains "usability" and "testing" in the same paragraph.
/book[. ftcontains "site" && "errors" same sentence]
returns a book element, because "site" and "errors" appear in the same sentence.
Some subtle relationships between FTScope and FTDistance will be discussed in Section 4.
[167] | FTDistance |
::= | "distance" FTRange FTUnit |
[166] | FTRange |
::= | ("exactly" UnionExpr) |
[171] | FTUnit |
::= | "words" | "sentences" | "paragraphs" |
FTDistance finds matches by specifying the distance between words and phrases in FTUnit (words, sentences, and paragraphs). The number of intervening FTUnits is specified in the integer value of FTRange.
FTRange specifies a range of integer values, providing a minimum and maximum value. Each UnionExpr in an FTRange must evaluate (after atomization) to a singleton sequence with an atomic value of type "xs:integer". Otherwise, an error is raised [err:XPTY0004]XP.
Let the value of the first (or only) UnionExpr be M. If "from" is specified, let the value of the second UnionExpr be N. FTDistance may cross element boundaries when computing distance.
The following rule applies to FTDistance:
Zero words (sentences, paragraphs) means adjacent words (sentences, paragraphs).
If "exactly" is specified, then the range is the closed interval [M, M]. If "at least" is specified, then the range is the half-closed interval [M, unbounded). If "at most" is specified, then the range is the closed interval [0, M]. If "from-to" is specified, then the range is the closed interval [M, N].
Here are some examples of FTRanges:
'exactly 0' specifies the range [0, 0].
'at least 1' specifies the range [1, unbounded).
'at most 1' specifies the range [0, 1].
'from 5 to 10' specifies the range [5, 10].
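The mapping from an FTRange to an interval, together with the rule that zero units means adjacent, can be sketched in Python. This is a non-normative illustration; the function names are hypothetical, and the unbounded upper limit is modeled with math.inf.

```python
import math
from typing import Optional, Tuple

def ft_range(kind: str, m: int, n: Optional[int] = None) -> Tuple[float, float]:
    """Return the (lower, upper) bounds denoted by an FTRange."""
    if kind == "exactly":
        return (m, m)
    if kind == "at least":
        return (m, math.inf)   # no finite value exceeds the upper bound
    if kind == "at most":
        return (0, m)
    if kind == "from-to":
        if n is None:
            raise ValueError("from-to needs a second value")
        return (m, n)
    raise ValueError("unknown range kind: " + kind)

def in_range(value: int, bounds: Tuple[float, float]) -> bool:
    lower, upper = bounds
    return lower <= value <= upper

def intervening_words(pos_a: int, pos_b: int) -> int:
    """Number of words strictly between two distinct word positions."""
    return abs(pos_a - pos_b) - 1

print(ft_range("from-to", 5, 10))
print(in_range(intervening_words(3, 4), ft_range("exactly", 0)))
```

Adjacent words at positions 3 and 4 have zero intervening words, so they satisfy 'exactly 0'.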
The distances computed by FTDistance are not affected by the presence or absence of element boundaries in the text. Stop words are counted in those computations whether they are ignored or not.
/book ftcontains ("information" && "retrieval") not in ("information" && "retrieval" distance at least 11 words)
returns false, because "information" and "retrieval" are at least 11 words apart.
/book ftcontains "web" && "site" && "usability" distance at most 2 words
returns true, because "web", "site", and "usability" have at most 2 intervening words between them.
/book[. ftcontains "web site" && "usability" distance at most 1 words]/title
returns the book title. A similar query for the p element would return false because "web site" and "usability" have two intervening words between them.
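One possible reading of the word-distance rule, sketched in Python for single-word matches only: the number of intervening words is taken between each pair of consecutive matched positions, and every such number must fall in the range. The function names and the pairwise interpretation are assumptions of this sketch; the formal semantics section remains the authority.

```python
from typing import List, Sequence


def word_distances(positions: Sequence[int]) -> List[int]:
    """Intervening words between consecutive matched word positions.

    Adjacent words (positions p and p + 1) have distance 0.
    """
    ordered = sorted(positions)
    return [b - a - 1 for a, b in zip(ordered, ordered[1:])]


def satisfies_distance(positions: Sequence[int],
                       low: int,
                       high: float = float("inf")) -> bool:
    """True if every consecutive distance lies in the closed range [low, high]."""
    return all(low <= d <= high for d in word_distances(positions))
```

With matched positions 1, 3, and 5, the distances are 1 and 1, so "distance at most 2 words" is satisfied under this reading.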
[168] | FTWindow |
::= | "window" UnionExpr FTUnit |
FTWindow finds matches within a number of FTUnits (words, sentences, or paragraphs). The number of FTUnits is specified as an integer.
FTWindow may cross element boundaries. The size of the window is not affected by the presence or absence of element boundaries. Stop words are included in those computations whether they are ignored or not.
UnionExpr must evaluate to an atom of type "xs:integer".
A match of an FTSelection is considered a match within a window, if there exists a window of the given number of consecutive units (words, sentences, or paragraphs) in the document within which the match lies.
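For word units, the window condition above can be sketched in Python as a check on the span between the first and last matched position. The function name within_window is hypothetical, and the sketch assumes a non-empty list of matched word positions.

```python
from typing import Sequence


def within_window(positions: Sequence[int], size: int) -> bool:
    """True if all matched positions fit in `size` consecutive words.

    A window of `size` consecutive units exists around the match exactly
    when the span from the first to the last matched position, counted
    inclusively, is no longer than `size`.
    """
    if not positions:
        raise ValueError("at least one matched position is required")
    return max(positions) - min(positions) + 1 <= size
```

Stop words keep their positions, so they count toward the span whether or not they are ignored during matching.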
/book/title ftcontains "web" && "site" && "usability" window 5 words
returns true, because "web", "site", and "usability" are within a window of 5 words in the title element.
/book ftcontains ("web" && "site" ordered) && ("usability" || "testing") window 10 words
returns true, because "web" and "site" in the order they are written in the query and either "usability" or "testing" are within a window of at most 10 words.
/book//title ftcontains "web site" && "usability" window 3 words
returns true, because the title element contains "Web Site Usability". A similar query on the p element would not return true, because its occurrences of "web site" and "usability" are not within a window of 3 words.
/book[@number="1" and . ftcontains "efficient" && ! "and" window 3 words]
returns the empty sequence, because in the selected book element, there is no occurrence of "efficient" within a window of 3 words which would not also contain an occurrence of "and".
[169] | FTTimes |
::= | "occurs" FTRange "times" |
FTTimes finds matches in which an FTSelection occurs a specified number of times.
FTTimes limits the number of different occurrences of FTSelection, within the specified range.
In the document fragment "very very big":
The FTSelection "very big" has 1 occurrence consisting of the second "very" and "big".
The FTSelection "very && big" has 2 occurrences; one consisting of the first "very" and "big", and the other containing the second "very" and "big".
The FTSelection "very || big" has 3 occurrences.
The FTSelection ! "small" has 1 occurrence.
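The occurrence counts for "very very big" can be reproduced with a small Python sketch. The helper names are hypothetical, word positions are 1-based, and only the phrase, conjunction, and disjunction cases are covered.

```python
from itertools import product
from typing import List, Tuple

TOKENS = ["very", "very", "big"]


def positions(word: str, tokens: List[str] = TOKENS) -> List[int]:
    """1-based positions at which `word` occurs."""
    return [i for i, t in enumerate(tokens, start=1) if t == word]


def phrase_occurrences(words: List[str],
                       tokens: List[str] = TOKENS) -> List[Tuple[int, ...]]:
    """Occurrences of a phrase: runs of adjacent tokens equal to `words`."""
    n = len(words)
    return [tuple(range(i, i + n))
            for i in range(1, len(tokens) - n + 2)
            if tokens[i - 1:i - 1 + n] == words]


def and_occurrences(a: str, b: str) -> List[Tuple[int, int]]:
    """Occurrences of a && b: every pairing of an `a` with a `b`."""
    return list(product(positions(a), positions(b)))


def or_occurrences(a: str, b: str) -> List[Tuple[int]]:
    """Occurrences of a || b: every occurrence of either word."""
    return [(p,) for p in positions(a)] + [(p,) for p in positions(b)]
```

The three functions yield 1, 2, and 3 occurrences respectively, matching the counts listed above.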
/book[. ftcontains "usability" occurs at least 2 times]/@number
returns book numbers because book elements contain 2 or more occurrences of "usability".
/book[@number="1" and title ftcontains "usability" || "testing" occurs at most 3 times]
returns the empty sequence, because there are 4 occurrences of "usability" || "testing" in the designated title.
/book ftcontains "usability" occurs at least 2 times
returns true, because the book element contains 3 occurrences of "usability" in its title element although its p element contains only 1 occurrence.
[164] | FTContent |
::= | ("at" "start") | ("at" "end") | ("entire" "content") |
FTContent finds matches in which the words and phrases are the first, last or all of the words and phrases in the tokenized string value of the element being searched.
The "at" "start" option finds matches in which the words or phrases are the first words or phrases in the tokenized string value of the element being searched.
The "at" "end" option finds matches in which the words or phrases are the last words or phrases in the tokenized string value of the element being searched.
The "entire" "content" option finds matches in which the words or phrases are the entire content of the tokenized string value of the element being searched.
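The three FTContent options can be sketched in Python over 1-based word positions. This is an illustration under the assumption that a match is given as the list of document positions it covers; the function names and the sample tokens are hypothetical.

```python
from typing import List, Sequence


def at_start(match_positions: Sequence[int]) -> bool:
    """The match begins with the first token of the element."""
    return min(match_positions) == 1


def at_end(match_positions: Sequence[int], doc_tokens: List[str]) -> bool:
    """The match ends with the last token of the element."""
    return max(match_positions) == len(doc_tokens)


def entire_content(match_positions: Sequence[int],
                   doc_tokens: List[str]) -> bool:
    """The match covers every token of the element."""
    return sorted(set(match_positions)) == list(range(1, len(doc_tokens) + 1))


# Hypothetical tokenized string value of a title element.
DOC = "improving the usability of a web site through testing".split()
PHRASE_POSITIONS = [1, 2, 3, 4, 5, 6, 7]  # "improving ... web site"
```

Here at_start(PHRASE_POSITIONS) holds, while at_end and entire_content do not, because two further tokens follow the phrase.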
/books//title[. ftcontains "improving the usability of a web site" at start]
returns each title element starting with the phrase "improving the usability of a web site".
/books//p[. ftcontains "propagat*" && "few errors" distance at most 2 words at end]
returns each p element ending with the phrase "propagating few errors".
/books//note[. ftcontains "this site has been approved by the web site users association" entire content]
returns each note element whose entire content is "this site has been approved by the web site users association".
FTMatchOptions modify the operational semantics of the FTSelection on which they are applied.
[153] | FTMatchOption |
::= | FTCaseOption |
FTMatchOptions set environments for the matching options of FTSelection. If a match option isn't specified explicitly in the query, its value is given by its static context component. Details about these context components, including their default values, are given in Appendix C Static Context Components.
If no match options declarations are present in the prolog and the implementation does not define any overwriting of the static context components for the match options, the query:
/book/title ftcontains "usability"
is equivalent to the query
/book/title ftcontains "usability" case insensitive diacritics insensitive without stemming without thesaurus without stop words language "none" without wildcards
FTMatchOptions are applied in the order in which they are written in the query. More information on their semantics is given in 4.3.3 Match Options Semantics.
We describe each match option in more detail in the following sections.
[154] | FTCaseOption |
::= | "lowercase" |
FTCaseOption modifies word and phrase matching by specifying how uppercase and lowercase characters are considered.
FTCaseOption influences the way FTWords is applied.
There are four possible character case options:
The option "uppercase" matches words and phrases with uppercase characters, regardless of the case of characters of the words and phrases as they are written in the query.
The option "lowercase" matches words and phrases with lowercase characters, regardless of the case of characters of the words and phrases as they are written in the query.
The option "case" "insensitive" matches the uppercase and lowercase characters of words and phrases. The case of characters as they are written in the query is not considered.
The option "case" "sensitive" matches the case of the characters in words and phrases as they are written in the query.
The default is "case insensitive".
The following table summarizes the interactions between the case match options and the use of the default collations.
Default collation options/Case options | UCC (Unicode Codepoint Collation) | CCS (some generic case-sensitive collation) | CCI (some generic case-insensitive collation) |
insensitive | compare as if both lower | case-insensitive variant of CCS if it exists, else error | CCI |
sensitive | UCC | CCS | case-sensitive variant of CCI if it exists, else error |
uppercase | uppercase(Expr) + UCC | uppercase(Expr) + CCS | CCI |
lowercase | lowercase(Expr) + UCC | lowercase(Expr) + CCS | CCI |
Note:
In this table, "else error" means "Otherwise, an error is raised: [err:FOCH0002]FO". The phrase "if it exists" is used, because the case-sensitive collation CCS does not always have a case-insensitive variant (and, even if one exists, it may not be possible to determine it algorithmically), and because the case-insensitive collation CCI does not always have a case-sensitive variant (and, even if one exists, it may not be possible to determine it algorithmically).
/book[@number="1"]/title ftcontains "Usability" lowercase
returns false, because the title element doesn't contain "usability" in lower-case characters.
/book[@number="1"]/title ftcontains "usability" case insensitive
returns true, because the character case is not considered.
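The UCC column of the table above can be sketched in Python, where plain string equality stands in for the Unicode Codepoint Collation. The function name case_match is hypothetical, and the other collation columns are not modelled.

```python
def case_match(query_word: str, doc_word: str,
               option: str = "case insensitive") -> bool:
    """Compare a query word with a document word under a case option.

    Python string equality plays the role of the Unicode Codepoint
    Collation (UCC); this is a simplification for illustration.
    """
    if option == "case insensitive":
        # compare as if both were lowercase
        return query_word.lower() == doc_word.lower()
    if option == "case sensitive":
        return query_word == doc_word
    if option == "uppercase":
        # uppercase(Expr) + UCC: the document word must be uppercase
        return query_word.upper() == doc_word
    if option == "lowercase":
        # lowercase(Expr) + UCC: the document word must be lowercase
        return query_word.lower() == doc_word
    raise ValueError(f"unknown case option: {option!r}")
```

Under this sketch, the query word "Usability" with the "lowercase" option does not match a document word "Usability", which mirrors the first example above.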
[155] | FTDiacriticsOption |
::= | ("with" "diacritics") |
FTDiacriticsOption modifies word and phrase matching by specifying how diacritics are considered.
There are four possible diacritics options:
The option "with" "diacritics" matches words and phrases with diacritics, regardless of whether the diacritics are written in the query.
The option "without" "diacritics" matches words and phrases without diacritics, regardless of whether the diacritics are written in the query.
The option "diacritics" "insensitive" matches words and phrases with and without diacritics. Whether diacritics are written in the query or not is not considered.
The option "diacritics" "sensitive" matches words and phrases only if they contain the diacritics as they are written in the query.
The default is "diacritics insensitive".
The following table summarizes the interactions between the diacritics match options and the use of the default collations.
Default collation options/Diacritics options | UCC (Unicode Codepoint Collation) | CDS (some generic diacritics-sensitive collation) | CDI (some generic diacritics-insensitive collation) |
insensitive | compare as if with and without | diacritics-insensitive variant of CDS if it exists, else error | CDI |
sensitive | UCC | CDS | diacritics-sensitive variant of CDI if it exists, else error |
with diacritics | "resume diacritic insensitive" not in "resume" | "resume diacritic insensitive" not in "resume" | CDI |
without diacritics | "resume" not in "resume diacritic sensitive" | "resume" not in "resume diacritic sensitive" | CDI |
Note:
In this table, "else error" means "Otherwise, an error is raised: [err:FOCH0002]FO". The phrase "if it exists" is used, because the diacritics-sensitive collation CDS does not always have a diacritics-insensitive variant (and, even if one exists, it may not be possible to determine it algorithmically), and because the diacritics-insensitive collation CDI does not always have a diacritics-sensitive variant (and, even if one exists, it may not be possible to determine it algorithmically).
/book[@number="1"]//editor ftcontains "Vera" with diacritics
returns true, because the editor element contains the word "Vera" with an acute accent.
/book[@number="1"]/editors ftcontains "Véra" without diacritics
returns false, because the editor element does not contain the word "Vera" without an acute accent.
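The "diacritics insensitive" and "diacritics sensitive" options can be sketched in Python using Unicode normalization. The function names are hypothetical, stripping combining marks after NFD decomposition is an assumption about what ignoring diacritics means, and the "with" and "without" options are deliberately left out.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def diacritics_match(query_word: str, doc_word: str,
                     option: str = "diacritics insensitive") -> bool:
    """Compare two words under a diacritics option (two options only)."""
    if option == "diacritics insensitive":
        return strip_diacritics(query_word) == strip_diacritics(doc_word)
    if option == "diacritics sensitive":
        # Normalize so that precomposed and decomposed forms compare equal.
        return (unicodedata.normalize("NFC", query_word)
                == unicodedata.normalize("NFC", doc_word))
    raise ValueError(f"unsupported diacritics option: {option!r}")
```

With the default option, "Vera" matches "Véra"; with "diacritics sensitive" it does not.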
[156] | FTStemOption |
::= | ("with" "stemming") | ("without" "stemming") |
FTStemOption modifies word and phrase matching by specifying whether stemming is applied or not.
FTStemOption influences the way FTWords is applied. It produces a disjunction of the query words by expanding the words into the list of words that share the same stem. By definition, the query words are included in that disjunction.
The "with stemming" option specifies that matches may contain words that have the same stem as the words and phrases written in the query. It is implementation-defined what a stem of a word is.
The "without stemming" option specifies that the words and phrases are not stemmed.
It is implementation-defined whether the stemming is based on an algorithm, dictionary, or mixed approach.
The default is "without stemming".
/book[@number="1"]/title ftcontains "improve" with stemming
returns true, because the title of the specified book contains "improving" which has the same stem as "improve".
[157] | FTThesaurusOption |
::= | ("with" "thesaurus" (FTThesaurusID | "default")) |
[158] | FTThesaurusID |
::= | "at" StringLiteral ("relationship" StringLiteral)? (FTRange "levels")? |
FTThesaurusOption modifies word and phrase matching by specifying whether a thesaurus is used or not. If thesauri are used, it locates the thesauri by default or URI reference. It also states the relationship to be applied and how many levels within the thesaurus are to be traversed.
FTThesaurusOption influences the way FTWords is applied.
The StringLiteral following the keyword "at" in FTThesaurusID is of the form of a URI Reference.
Thesauri add related words and phrases to the search. Thus, the user may narrow, broaden, or otherwise modify the search using synonyms, hypernyms (more generic terms), etc. The search is performed as though the user has specified all related search words and phrases in a disjunction (FTOr).
Note:
A thesaurus may be standards-based or locally-defined. It may be a traditional thesaurus, or a taxonomy, soundex, ontology, or topic map. How the thesaurus is represented is implementation-dependent.
FTThesaurusID specifies the relationship sought between words and phrases written in the query and terms in the thesaurus and the number of levels to be queried in hierarchical relationships by including an FTRange "levels". If no levels are specified, the default is to query all levels in hierarchical relationships.
Relationships include, but are not limited to, the relationships and their abbreviations presented in [ISO 2788] and their equivalents in other languages:
equivalence relationships (synonyms): PREFERRED TERM (USE), NONPREFERRED USED FOR TERM (UF);
hierarchical relationships: BROADER TERM (BT), NARROWER TERM (NT), BROADER TERM GENERIC (BTG), NARROWER TERM GENERIC (NTG), BROADER TERM PARTITIVE (BTP), NARROWER TERM PARTITIVE (NTP), TOP TERM (TT); and
associative relationships: RELATED TERM (RT).
The "with thesaurus" option specifies that string matches include words that can be found in one of the specified thesauri.
The "without thesaurus" option specifies that no thesaurus will be used.
The "with default thesaurus" option specifies that a system-defined default thesaurus with a system-defined relationship is used. The default thesaurus may be used in combination with other explicitly specified thesauri.
The default is "without thesaurus".
count(.//book/content ftcontains "duties" with thesaurus at "http://bstore1.example.com/UsabilityThesaurus.xml" relationship "synonyms")>0
returns true, because it finds a content element containing "tasks" which the thesaurus identified as a synonym for "duties".
doc("http://bstore1.example.com/full-text.xml") /books/book[count(./content ftcontains "web site components" with thesaurus at "http://bstore1.example.com/UsabilityThesaurus.xml" relationship "narrower terms" at most 2 levels)>0]
returns book elements, because it finds a content element containing "web site components", and narrower terms "navigation" and "layout".
doc("http://bstore1.example.com/full-text.xml") /books/book[count(. ftcontains "Merrygould" with thesaurus at "http://bstore1.example.com/UsabilitySoundex.xml" relationship "sounds like")>0]
returns a book element containing "Marigold", which sounds like "Merrygould".
[159] | FTStopwordOption |
::= | ("with" "stop" "words" FTRefOrList FTInclExclStringLiteral*) |
[160] | FTRefOrList |
::= | ("at" StringLiteral) |
[161] | FTInclExclStringLiteral |
::= | ("union" | "except") FTRefOrList |
FTStopWordOption controls word matching by specifying whether stop words are used or not.
FTStopWordOption influences the way FTWords is applied.
FTRefOrList specifies the list of stop words either explicitly as a comma-separated list of string literals, or by a URI following the keyword "at". If a URI is used, it must point to a sequence of string atoms or nodes of type "xs:string". In both cases, no tokenization is performed on the strings: they are used as they occur in the sequence.
The "with stop words" option specifies that if a word is within the specified collection of stop words, it is removed from the search and any word may be substituted for it. Stop words retain their position numbers and are counted in FTDistance and FTWindow searches.
Stop word lists can be combined using the usual semantics of "except" and "union".
The "with default stop words" option specifies that an implementation-defined collection of stop words is used.
The "without stop words" option specifies that no stop words are used. This is equivalent to specifying an empty list of stop words.
The default is "without stop words".
/book[@number="1"]//p ftcontains "propagation of errors" with stemming with stop words ("a", "the", "of")
returns true, because the document contains the phrase "propagating few errors".
Note the asymmetry in the stop word semantics: the property of being a stop word is only relevant to query terms, not to document terms. Hence, it is irrelevant for the above-mentioned match whether "few" is a stop word or not, and on the other hand we do not want the query above to match "propagation" followed by 2 stop words, or even a sequence of 3 stop words in the document.
/book[@number="1"]//p ftcontains "propagation of errors" with stemming without stop words
returns false, because "of" is not in the p element between "propagating" and "errors".
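The asymmetry of stop words can be sketched in Python as phrase matching in which a stop word in the query acts as a placeholder for exactly one document word. The function names and the sample tokens are hypothetical, and stemming is left out of the sketch.

```python
from typing import AbstractSet, List


def phrase_matches_at(query: List[str], doc: List[str], start: int,
                      stop_words: AbstractSet[str]) -> bool:
    """Check whether `query` matches `doc` beginning at index `start`."""
    if start + len(query) > len(doc):
        return False
    for offset, query_word in enumerate(query):
        if query_word in stop_words:
            continue  # a query stop word accepts any document word here
        if doc[start + offset] != query_word:
            return False
    return True


def find_phrase(query: List[str], doc: List[str],
                stop_words: AbstractSet[str] = frozenset()) -> List[int]:
    """Return the 1-based start positions of all phrase matches."""
    return [i + 1 for i in range(len(doc))
            if phrase_matches_at(query, doc, i, stop_words)]
```

Only query terms are tested against the stop word list. A stop word that appears in the document still occupies its position, so the query phrase "planning conducting" does not match the document text "planning then conducting" even when "then" is a stop word.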
doc("http://bstore1.example.com/full-text.xml") /books/book[count(.//content ftcontains "planning then conducting" with stop words at "http://bstore1.example.com/StopWordList.xml")>0]
uses the stop words list specified at the URL. Assuming that the specified stop word list contains the word "then", this query is reduced to a query on the phrase "planning X conducting", allowing any word as a substitute for X. It returns a book element, because its content element contains "planning then conducting". It would also have returned the book element if the phrases "planning and conducting" or "planning before conducting" had been in its content element.
doc("http://bstore1.example.com/full-text.xml") /books/book[count(.//content ftcontains "planning then conducting" with stop words at "http://bstore1.example.com/StopWordList.xml" except ("the", "then"))>0]
returns book elements containing "planning then conducting", but does not return book elements containing "planning and conducting", since it exempts "then" from being a stop word.
[162] | FTLanguageOption |
::= | "language" StringLiteral |
FTLanguageOption modifies word matching by specifying the language of search words and phrases.
FTLanguageOption influences the way FTWords is applied.
The StringLiteral following the keyword "language" designates one language. It must either be castable to "xs:language", or be the value "none". Otherwise, an error is raised: [err:XPTY0004]XP.
The "language" option influences tokenization, stemming, and stop words.
If the language "none" option is specified, no language is selected.
The set of valid language identifiers is implementation-defined.
By default, there is no language selected.
/book[@number="1"]//editor ftcontains "salon de the" with default stop words language "fr"
This is an example where the language option is used to select the appropriate stop word list.
[163] | FTWildCardOption |
::= | ("with" "wildcards") | ("without" "wildcards") |
FTWildCardOption modifies word and phrase matching by specifying whether wildcards are used or not.
FTWildCardOption influences the way FTWords is applied.
In addition to specifying the "with wildcards" option, indicators (represented by periods (.)) and qualifiers are appended to or inserted into words being searched. Zero or more characters replace each indicator and qualifier.
Indicators are mandatory. When the "with wildcards" option is present, one or more periods (.) must be appended at the beginning or end of words or inserted into words. If the period is at the beginning of a word, the wildcard is a prefix wildcard. If the period is at the end of a word, it is a suffix wildcard. If the period is inserted into a word, it is an infix wildcard.
When the "with wildcards" option and one or more periods (.) appended to or inserted into words are present, characters are appended or inserted at each of the periods. Any characters may be appended or inserted except newline characters (#xA), return characters (#xD), and tab characters (#x9). The number of characters depends on the qualifier. Qualifiers available are none, question mark, asterisk, plus sign, and two numbers separated by a comma, both enclosed by curly braces.
If a period is present, but no qualifiers, one character is appended or inserted.
If a period is followed by a question mark (.?), zero or one characters are appended or inserted.
If a period is followed by an asterisk (.*), zero or more characters are appended or inserted.
If a period is followed by a plus sign (.+), one or more characters are appended or inserted.
If a period is followed by two numbers separated by a comma, both enclosed by curly braces (.{n,m}), a specified range of characters is appended or inserted.
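The indicator and qualifier rules above can be sketched in Python by translating a query word into a regular expression. The function names are hypothetical, and case-insensitive comparison is an assumption that follows the default case option.

```python
import re

# Any character except newline (#xA), return (#xD), and tab (#x9).
_ANY_CHAR = r"[^\n\r\t]"
# Qualifiers that may follow a period: ?, *, + or {n,m}.
_QUALIFIER = re.compile(r"\?|\*|\+|\{\d+,\d+\}")


def wildcard_to_regex(word: str) -> "re.Pattern[str]":
    """Translate a query word with wildcard indicators into a regex."""
    parts = []
    i = 0
    while i < len(word):
        if word[i] == ".":
            qualifier = _QUALIFIER.match(word, i + 1)
            if qualifier:
                parts.append(_ANY_CHAR + qualifier.group(0))
                i = qualifier.end()
            else:
                parts.append(_ANY_CHAR)  # bare period: exactly one character
                i += 1
        else:
            parts.append(re.escape(word[i]))  # literal character
            i += 1
    return re.compile("".join(parts), re.IGNORECASE)


def wildcard_match(pattern: str, token: str) -> bool:
    """True if the whole token matches the wildcard pattern."""
    return wildcard_to_regex(pattern).fullmatch(token) is not None
```

For instance, wildcard_match("improv.*", "improving"), wildcard_match(".?site", "site"), and wildcard_match("w.ll", "well") all hold under this translation.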
The "without wildcards" option finds words without recognizing wildcard indicators and qualifiers. Periods, question marks, asterisks, plus signs, and two numbers separated by a comma, both enclosed by curly braces, are recognized as regular characters.
The default is "without wildcards".
/book[@number="1"]/title ftcontains "improv.*" with wildcards
returns true, because the title element contains "improving".
/book[@number="1"]/title ftcontains ".?site" with wildcards
returns true, because the title element contains "site".
/book[@number="1"]/p ftcontains "w.ll" with wildcards
returns true, because the p element contains "well".
[173] | FTIgnoreOption |
::= | "without" "content" UnionExpr |
FTIgnoreOption specifies a set of element nodes whose content is ignored. Ignored nodes are identified by the XQuery expression UnionExpr. Let N1, N2, ..., Nk be the sequence of nodes of the search context. The expression UnionExpr is evaluated in the context of each node Ni being searched. That is, the search context expression of the ftcontains predicate creates a new focus for the evaluation of the UnionExpr given with FTIgnoreOption, similar to the creation of the dynamic context of a path expression E1/E2 or a filter expression E1[E2] (see Section 2.1.2 Dynamic ContextXQ).
Now, let I1, I2, ..., In be the sequence of items that UnionExpr evaluates to. For each Ni (i=1..k) a copy is made that omits each node Ij (j=1..n) that is not Ni. Those copies form the new search context. If UnionExpr evaluates to an empty sequence, no nodes are omitted.
In the following fragment, if .//annotation is ignored, "Web Usability" will be found 2 times: once in the title element and once in the editor element. The 2 occurrences in the 2 annotation elements are ignored. On the other hand, "expert" will not be found, as it appears only in an annotation element.
<book> <title>Web Usability and Practice</title> <author>Montana <annotation> this author is an expert in Web Usability</annotation> Marigold </author> <editor>Véra Tudor-Medina on Web <annotation> best editor on Web Usability</annotation> Usability </editor> </book>
By default, no element content is ignored.
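The effect of ignoring .//annotation on the fragment above can be sketched in Python with the standard xml.etree.ElementTree module. Instead of copying nodes, the sketch computes the string value while skipping ignored elements; the function names and the simple word tokenizer are assumptions made for illustration.

```python
import re
import xml.etree.ElementTree as ET
from typing import AbstractSet, List

FRAGMENT = """<book> <title>Web Usability and Practice</title> <author>Montana <annotation> this author is an expert in Web Usability</annotation> Marigold </author> <editor>Véra Tudor-Medina on Web <annotation> best editor on Web Usability</annotation> Usability </editor> </book>"""


def string_value(elem: ET.Element,
                 ignored: AbstractSet[str] = frozenset()) -> str:
    """String value of `elem`, omitting the content of ignored elements.

    Text that follows an ignored element (its tail) belongs to the
    parent element and is therefore kept.
    """
    parts = [elem.text or ""]
    for child in elem:
        if child.tag not in ignored:
            parts.append(string_value(child, ignored))
        parts.append(child.tail or "")
    return "".join(parts)


def tokenize(text: str) -> List[str]:
    """A sample tokenization: lowercase runs of word characters."""
    return re.findall(r"\w+", text.lower())


def count_phrase(tokens: List[str], phrase: List[str]) -> int:
    """Number of positions at which `phrase` occurs as adjacent tokens."""
    n = len(phrase)
    return sum(1 for i in range(len(tokens) - n + 1)
               if tokens[i:i + n] == phrase)


BOOK = ET.fromstring(FRAGMENT)
KEPT = tokenize(string_value(BOOK, {"annotation"}))
```

Because the annotation inside editor is skipped, "Web" and "Usability" become adjacent tokens there, which gives the second occurrence described above.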
This section describes the formal semantics of XQuery 1.0 and XPath 2.0 Full-Text. The figure below shows how XQuery 1.0 and XPath 2.0 Full-Text integrates with XQuery 1.0 and XPath 2.0.
The following diagram represents the interaction of XQuery 1.0 and XPath 2.0 Full-Text with the rest of the XQuery 1.0 and XPath 2.0 languages. It specifies how full-text expressions can be nested within XQuery 1.0 and XPath 2.0 expressions and vice versa.
Arrow 1 represents the composability of the XQuery 1.0 and XPath 2.0 expressions by showing that XQuery 1.0 expressions are nested inside FTSelections and evaluated to a sequence of items.
Arrow 2 shows how regular XQuery expressions can be nested inside FTSelections by evaluating them to a sequence of items and then converting them to tokenized text. The process is described in Nested XQuery and XPath Expressions.
Arrow 3 represents the composability of FTSelections. The composability is achieved by evaluating the FTSelections to AllMatches. Each FTSelection operates on zero or more AllMatches and returns AllMatches. The process is described in the Evaluation of FTSelections section.
Arrow 4 shows how the result of the evaluation of XQuery 1.0 and XPath 2.0 Full-Text and scoring expressions needs to be integrated in the XPath and XQuery model. The section XQuery 1.0 and XPath 2.0 Full-Text and scoring expressions describes how this is achieved.
The functions and schemas defined in this section are considered to be within the fts: namespace. These functions and schemas are used only for describing the semantics. There is no requirement that these functions and schemas be implemented, so no URI is associated with the fts: prefix.
XQuery 1.0 and XPath 2.0 expressions can be nested inside FTContainsExprs.
Nested XQuery 1.0 and XPath 2.0 expressions are evaluated to a sequence of items before the evaluation of FTContainsExpr. The sequence of items must satisfy certain constraints depending on the context in which it is used. These constraints are described below.
Full-text queries are performed on text which has been tokenized, i.e., broken into a sequence of words, units of punctuation, and spaces. The tokenization is applied on the string value of the evaluation of the left-hand side of the FTContainsExpr expression.
The XQuery 1.0 and XPath 2.0 expression nested inside an FTWords must evaluate to a sequence of string values after applying atomization. Otherwise, an error is raised: [err:XPTY0004]XP. Then, FTWords performs tokenization on the string values from the sequence.
The XQuery 1.0 and XPath 2.0 expression, or expressions in the case of a "from-to" range must evaluate to a singleton sequence of integers after applying atomization. Otherwise, an error is raised: [err:XPTY0004]XP. The resulting integer values are treated as boundaries for the range.
The XQuery 1.0 and XPath 2.0 expression must evaluate to a sequence of string values after applying atomization. Otherwise, an error is raised: [err:XPTY0004]XP. The resulting string values are treated as stop words for which any word may be substituted during string matching.
The XQuery 1.0 and XPath 2.0 expression must evaluate to a sequence of string values after applying atomization. Otherwise, an error is raised: [err:XPTY0004]XP. The resulting string values are treated as names of thesauri to use during string matching.
The XQuery 1.0 and XPath 2.0 expression must evaluate to either an empty sequence or a singleton sequence of a string value after applying atomization. Otherwise, an error is raised: [err:XPTY0004]XP. The resulting string value is treated as a language identifier. It specifies the language of the words and phrases in the query.
[Definition: Tokenization is the process of converting a string to a sequence of TokenInfos.]
A [Definition: TokenInfo is the identity of a word occurrence inside an XML document. ] Each TokenInfo is associated with:
the word it identifies: word
the unique identifier that captures the relative position of the word in the document order: pos
the relative position of the sentence containing the word: sentence
the relative position of the paragraph containing the word: para
The tokenization is performed by the formal semantics functions.
function fts:getTokenInfo( $searchContext as node(), $matchOptions as fts:FTMatchOptions, $searchToken as fts:TokenInfo) as fts:TokenInfo*
The above function returns the TokenInfos in nodes in $searchContext that match the search string in $searchToken when using the match options in $matchOptions. The match options that occur at the beginning of the list should be applied before match options that occur later in the list.
function fts:getSearchTokenInfo( $searchString as xs:string, $matchOptions as fts:FTMatchOptions) as fts:TokenInfo*
The above function tokenizes the search string $searchString and returns a sequence of TokenInfos that describes the sequence of tokens in the search string. If $searchString is the empty string, the function is required to return the empty sequence.
This document fragment is the source document for examples in this section. Tokenization is implementation-defined. A sample tokenization is used for the examples in this section. The results might be different for other tokenizations.
Unless stated otherwise, the results assume a case-insensitive match.
<offers>
<offer id="1000" price="10000"> Ford Mustang 2000, 65K, excellent condition, runs great, AC, CC, power all </offer>
<offer id="1001" price="8000"> Honda Accord 1999, 78K, A/C, cruise control, runs and looks great, excellent condition </offer>
<offer id="1005" price="5500"> Ford Mustang, 1995, 150K highway mileage, no rust, excellent condition </offer>
</offers>
In this sample tokenization, words are delimited by punctuation and whitespace symbols.
The word "Ford" will be assigned a TokenInfo with relative position of 1.
The word "Mustang" will be assigned a TokenInfo with relative position of 2.
The word "2000" will be assigned a TokenInfo with a relative position of 3.
Relative position numbers are assigned sequentially through the end of the document.
The relative positions of the TokenInfos are shown below in parentheses.
<offers>
<offer id="1000" price="10000"> Ford(1) Mustang(2) 2000(3), 65K(4), excellent(5) condition(6), runs(7) great(8), AC(9), CC(10), power(11) all(12) </offer>
<offer id="1001" price="8000"> Honda(13) Accord(14) 1999(15), 78K(16), A(17)/C(18), cruise(19) control(20), runs(21) and(22) looks(23) great(24), excellent(25) condition(26) </offer>
<offer id="1005" price="5500"> Ford(27) Mustang(28), 1995(29), 150K(30) highway(31) mileage(32), no(33) rust(34), excellent(35) condition(36) </offer>
</offers>
The relative positions of paragraphs are determined similarly. In this sample tokenization, the paragraph delimiters are start tags, end tags, and end of line characters.
The words in the first element will be assigned relative paragraph number 1.
The words from the next element will be assigned relative paragraph number 2.
Relative paragraph numbers are assigned sequentially through the end of the document.
The relative positions of sentences are determined similarly using sentence delimiters.
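The sample tokenization can be reproduced with a short Python sketch. The TokenInfo field names follow the definition above; the tokenizer itself, the regular expression, and the choice to give each offer element one paragraph number and the same sentence number are assumptions of this sketch, since tokenization is implementation-defined.

```python
import re
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TokenInfo:
    word: str      # the word this TokenInfo identifies
    pos: int       # relative position of the word in document order
    sentence: int  # relative position of the containing sentence
    para: int      # relative position of the containing paragraph


SAMPLE = """<offers> <offer id="1000" price="10000"> Ford Mustang 2000, 65K, excellent condition, runs great, AC, CC, power all </offer> <offer id="1001" price="8000"> Honda Accord 1999, 78K, A/C, cruise control, runs and looks great, excellent condition </offer> <offer id="1005" price="5500"> Ford Mustang, 1995, 150K highway mileage, no rust, excellent condition </offer> </offers>"""


def tokenize(root: ET.Element) -> List[TokenInfo]:
    """Assign sequential positions to words delimited by punctuation/space."""
    tokens: List[TokenInfo] = []
    pos = 0
    for para, offer in enumerate(root.iter("offer"), start=1):
        for word in re.findall(r"[A-Za-z0-9]+", offer.text or ""):
            pos += 1
            # Simplification: one sentence per paragraph.
            tokens.append(TokenInfo(word, pos, para, para))
    return tokens


TOKENS = tokenize(ET.fromstring(SAMPLE))
```

Under these assumptions "A/C" is split into two tokens at positions 17 and 18, as in the listing above.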
The "sequence of nodes" in the XQuery 1.0 and XPath 2.0 Data Model is inadequate to support fully composable FTSelection. Full-text operations, such as FTSelections, operate on linguistic units, such as positions of words, which are not captured in the XQuery 1.0 and XPath 2.0 Data Model (XDM).
XQuery 1.0 and XPath 2.0 Full-Text adds relative word, sentence, and paragraph position numbers via AllMatches. AllMatches make FTSelections fully composable.
An [Definition: AllMatches describes the possible results of an FTSelection.] The UML Static Class diagram of AllMatches is shown on the diagram given below.
The AllMatches object contains zero or more Matches.
Each [Definition: Match describes one result to the FTSelection.] The result is described in terms of zero or more StringIncludes and zero or more StringExcludes.
[Definition: StringIncludes and StringExcludes are known collectively as StringMatch, which describes a possible match of a search token with a word in a document.] The queryString attribute of StringMatch stores the search token. The queryPos attribute specifies the position of this search token in the query. This attribute is needed for FTOrders. The matched document word is described in the TokenInfo associated with the StringMatch.
[Definition: A StringInclude is a StringMatch that describes a TokenInfo that must be contained in the document.]
[Definition: A StringExclude is a StringMatch that describes a TokenInfo that must not be contained in the document.]
Intuitively, AllMatches specifies the TokenInfos that a node must contain and must not contain to satisfy an FTSelection.
The AllMatches structure resembles the Disjunctive Normal Form (DNF) in propositional and first-order logic. The AllMatches is a disjunction of Matches. Each Match is a conjunction of StringIncludes and StringExcludes.
The simplest example of an FTSelection is an FTWords such as "Mustang". The AllMatches corresponding to this FTWords is given below.
As shown, the AllMatches consists of two Matches. Each Match represents one possible result of the FTWords "Mustang". The result represented by the first Match, represented as a StringInclude, contains the word "Mustang" at position 2. The result described by the second Match contains the word "Mustang" at position 28.
A more complex example of an FTSelection is an FTWords such as "Ford Mustang". The AllMatches for this FTWords is given below.
There are two possible results for this FTWords, and these are represented by the two Matches. Each of the Matches requires two words to be matched. The first Match is obtained by matching "Ford" at position 1 and matching "Mustang" at position 2. Similarly, the second Match is obtained by matching "Ford" at position 27 and "Mustang" at position 28.
An even more complex example of an FTSelection is an FTSelection such as "Mustang" && ! "rust" that searches for "Mustang" but not "rust". The AllMatches for this FTSelection is given below.
This example introduces StringExclude. StringExclude corresponds to negation in DNF. It specifies that the result described by the corresponding Match must not match the word at the specified position. In this example, the first Match specifies that "Mustang" is matched at position 2, and that the word "rust" at position 34 is not matched.
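The DNF reading of these examples can be sketched in Python. This is an informal illustration, not part of the formal semantics: it assumes that a candidate node is reduced to a mapping from token positions to words, and the `StringMatch` tuple and both helper functions are hypothetical names introduced here. Only the first Match of the "Mustang" && ! "rust" example is modeled, using the positions quoted in the text.

```python
from typing import NamedTuple

class StringMatch(NamedTuple):
    include: bool      # True for a StringInclude, False for a StringExclude
    query_string: str  # the search token
    query_pos: int     # position of the search token in the query
    pos: int           # position of the document word

def match_holds(match, node_tokens):
    """A Match is a conjunction: includes must be present, excludes absent."""
    return all(
        (node_tokens.get(sm.pos) == sm.query_string) == sm.include
        for sm in match
    )

def all_matches_holds(all_matches, node_tokens):
    """An AllMatches is a disjunction of its Matches."""
    return any(match_holds(m, node_tokens) for m in all_matches)

# First Match of "Mustang" && ! "rust" (positions taken from the text).
first_match = [
    StringMatch(True, "Mustang", 1, 2),
    StringMatch(False, "rust", 2, 34),
]

node_without_rust = {1: "Ford", 2: "Mustang"}
node_with_rust = {1: "Ford", 2: "Mustang", 34: "rust"}

print(all_matches_holds([first_match], node_without_rust))  # True
print(all_matches_holds([first_match], node_with_rust))     # False
```

An empty AllMatches is never satisfied, while a Match without any StringMatches is always satisfied, which mirrors the usual conventions for an empty disjunction and an empty conjunction.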
AllMatches has a well-defined hierarchical structure. Therefore, the AllMatches can be easily modeled in XML. This XML representation and those which follow formally describe the semantics of FTSelections. For example, the XML representation of AllMatches formally specifies how an FTSelection operates on zero or more AllMatches to produce a resulting AllMatches.
The XML schema for representing AllMatches is given below.
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           elementFormDefault="qualified"
           attributeFormDefault="unqualified">
  <xs:complexType name="AllMatches">
    <xs:sequence>
      <xs:element name="match" type="fts:Match"
                  minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
    <xs:attribute name="stokenNum" type="xs:string" use="required" />
  </xs:complexType>
  <xs:complexType name="Match">
    <xs:sequence>
      <xs:element name="stringInclude" type="fts:StringMatch"
                  minOccurs="0" maxOccurs="unbounded"/>
      <xs:element name="stringExclude" type="fts:StringMatch"
                  minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:complexType>
  <xs:complexType name="StringMatch">
    <xs:sequence>
      <xs:element name="tokenInfo" type="fts:TokenInfo"/>
    </xs:sequence>
    <xs:attribute name="queryString" type="xs:string" use="required"/>
    <xs:attribute name="queryPos" type="xs:integer" use="required"/>
  </xs:complexType>
  <xs:complexType name="TokenInfo">
    <xs:attribute name="word" type="xs:string" use="required"/>
    <xs:attribute name="pos" type="xs:integer" use="required"/>
    <xs:attribute name="para" type="xs:integer" use="required"/>
    <xs:attribute name="sentence" type="xs:integer" use="required"/>
  </xs:complexType>
</xs:schema>
The stokenNum attribute in AllMatches is related to the representation of the semantics as XQuery functions. Therefore, it is not considered part of the AllMatches model. The stokenNum attribute stores the number of search tokens used when evaluating the AllMatches. This value is used to compute the correct value for the queryPos attribute in new StringMatches.
[Definition: A Match M is in Match Normal Form if and only if it satisfies the following properties]
[Definition: (Match minimality) M does not contain any duplicate StringIncludes or duplicate StringExcludes. ], and
[Definition: (Match non-contradiction) M does not contain a StringInclude and a StringExclude containing the same TokenInfo ]
Note: two StringMatches are duplicates of each other if they have the same queryPos attribute value and their TokenInfos have the same pos attribute value. Testing for these attributes is sufficient, since all attributes in a TokenInfo are functionally dependent on the pos attribute and queryString depends on queryPos.
[Definition: (Match subsumption) We say that a Match M1 subsumes a Match M2 if the following hold. ]
The set of StringIncludes in M1 is a subset of the set of StringIncludes in M2, and
The set of StringExcludes in M1 is a subset of the set of StringExcludes in M2.
[Definition: An AllMatches object A is in AllMatches Normal Form if and only if it satisfies the following properties. ]
Every Match M in A is in Match Normal Form, and
No Match M contained in A is subsumed by another Match M' contained in A.
In other words, in normal-form AllMatches the representations of the contained Matches can be viewed as sets, as opposed to multi-sets, of StringIncludes and StringExcludes. The representations of such AllMatches themselves can be considered as sets of alternatives of Matches, where Matches that are subsumed by others need not be represented, because such subsumed Matches only embody stronger conditions.
normalizeAllMatches function
The helper function fts:normalizeAllMatches() is used to transform an AllMatches object into AllMatches Normal Form. The denotational semantics of FTSelections defined below ensures, as an invariant, that any AllMatches produced as a result for an FTSelection is in AllMatches Normal Form.
The normalization of an AllMatches is conducted by normalizing each contained Match and then eliminating any Match subsumptions.
declare function fts:normalizeAllMatches(
    $allMatches as fts:AllMatches)
  as element(allMatches, fts:AllMatches)
{
  (: Normalize every contained Match. :)
  let $mSeq1 := for $m in $allMatches/match
                return fts:normalizeMatch($m)
  (: Remove Matches that are subsumed by another Match. :)
  let $mSeq2 := fts:eliminateMatchSubsumption($mSeq1)
  return
    <allMatches stokenNum="{$allMatches/@stokenNum}">
      {$mSeq2}
    </allMatches>
};
The normalization of a Match is conducted by eliminating the duplicate StringMatches and then eliminating contradictory Matches.
declare function fts:normalizeMatch(
    $match as fts:Match)
  as element(match, fts:Match)?
{
  let $m1 := <match>
               {fts:eliminateStrMatchDupl($match/*, ())}
             </match>
  return
    (: A contradictory Match can never be satisfied and is dropped. :)
    if (fts:isMatchContradictory($m1)) then ()
    else $m1
};

declare function fts:eliminateStrMatchDupl(
    $smSeq as fts:StringMatch*,
    $resultSoFar as fts:StringMatch*)
  as fts:StringMatch*
{
  if (fn:count($smSeq) eq 0) then $resultSoFar
  else if (fts:containsStrMatch($resultSoFar, $smSeq[1])) then
    fts:eliminateStrMatchDupl($smSeq[position() ge 2], $resultSoFar)
  else
    fts:eliminateStrMatchDupl($smSeq[position() ge 2],
                              ($resultSoFar, $smSeq[1]))
};

declare function fts:containsStrMatch(
    $smSeq as fts:StringMatch*,
    $strMatch as fts:StringMatch)
  as xs:boolean
{
  if (fn:count($smSeq) eq 0) then fn:false()
  (: Same kind (include or exclude), same document position, :)
  (: and same query position.                                 :)
  else if (($smSeq[1] instance of element(stringInclude)) eq
           ($strMatch instance of element(stringInclude))
           and $smSeq[1]/tokenInfo/@pos eq $strMatch/tokenInfo/@pos
           and $smSeq[1]/@queryPos eq $strMatch/@queryPos) then
    fn:true()
  else
    fts:containsStrMatch($smSeq[position() ge 2], $strMatch)
};

declare function fts:isMatchContradictory(
    $match as fts:Match)
  as xs:boolean
{
  some $si in $match/stringInclude satisfies
    let $se := <stringExclude queryPos="{$si/@queryPos}"
                              queryString="{$si/@queryString}">
                 {$si/tokenInfo}
               </stringExclude>
    return fts:containsStrMatch($match/stringExclude, $se)
};
The elimination of Match subsumption is defined as follows.
declare function fts:eliminateMatchSubsumption(
    $matches as fts:Match*)
  as fts:Match*
{
  for $m at $p in $matches
  (: $m is redundant if another Match requires a subset of its    :)
  (: StringMatches. Of several equal Matches, the first is kept. :)
  let $isNotMin :=
    some $q in (1 to fn:count($matches))[. ne $p] satisfies
      fts:isStrMatchSubset($matches[$q]/*, $m/*)
      and (fn:not(fts:isStrMatchSubset($m/*, $matches[$q]/*))
           or $q lt $p)
  where fn:not($isNotMin)
  return $m
};

declare function fts:isStrMatchSubset(
    $smSeq1 as fts:StringMatch*,
    $smSeq2 as fts:StringMatch*)
  as xs:boolean
{
  if (fn:count($smSeq1) eq 0) then fn:true()
  else if (fts:containsStrMatch($smSeq2, $smSeq1[1])) then
    fts:isStrMatchSubset($smSeq1[position() ge 2], $smSeq2)
  else fn:false()
};
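The normalization steps can also be sketched in Python as an informal cross-check. The sketch assumes a simplified encoding that is not part of the specification: a StringMatch is a tuple `(kind, queryPos, pos)` with `kind` either `"include"` or `"exclude"`, a Match is any iterable of such tuples, and the function names are hypothetical.

```python
def normalize_match(match):
    """Return the Match as a duplicate-free set, or None if contradictory."""
    items = frozenset(match)  # Match minimality: duplicates collapse
    includes = {(qp, pos) for kind, qp, pos in items if kind == "include"}
    excludes = {(qp, pos) for kind, qp, pos in items if kind == "exclude"}
    # Match non-contradiction: the same token must not be both
    # required and forbidden.
    return None if includes & excludes else items

def normalize_all_matches(matches):
    """Normalize each Match, then drop Matches made redundant by others."""
    normalized = []
    for match in matches:
        m = normalize_match(match)
        if m is not None and m not in normalized:
            normalized.append(m)
    # A Match with a strict subset among the others only embodies
    # a stronger condition and can be dropped.
    return [m for m in normalized if not any(o < m for o in normalized)]

a = [("include", 1, 2), ("include", 1, 2)]    # duplicate StringInclude
b = [("include", 1, 2), ("exclude", 2, 34)]   # stronger condition than a
c = [("include", 1, 28), ("exclude", 1, 28)]  # contradictory

print(normalize_all_matches([a, b, c]))
# [frozenset({('include', 1, 2)})]
```

Using sets for Matches makes the duplicate check implicit, which corresponds to the remark that normal-form Matches can be viewed as sets rather than multi-sets.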
FTSelections are fully composable and may be nested arbitrarily under other FTSelections. Each FTSelection may be associated with match options (such as stemming and stop words) and score weights. Since score weights are solely interpreted by the formal semantics scoring function, they do not influence the semantics of FTSelections. Therefore, score weights are not considered in the formal semantics.
The XML representation of the FTSelections used in the fts:evaluate function closely follows the grammar of the language. It can be viewed as an XML representation of an abstract syntax tree (AST) of a parsed full-text query. Every FTSelection is represented as an XML element. Every nested FTSelection is represented as a nested descendant element. For binary FTSelections, e.g., FTAnd, the nested FTSelections are represented in <left> and <right> descendant elements. For unary FTSelections, a <selection> descendant element is used. Additional characteristics of FTSelections, e.g., the distance unit for FTDistance, are stored in attributes.
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           elementFormDefault="qualified"
           attributeFormDefault="unqualified">
  <xs:include schemaLocation="AllMatches.xsd" />
  <xs:include schemaLocation="MatchOptions.xsd" />
  <xs:complexType name="FTSelection">
    <xs:sequence>
      <xs:choice>
        <xs:element name="FTWords" type="fts:FTWords"/>
        <xs:element name="FTAnd" type="fts:FTAnd"/>
        <xs:element name="FTOr" type="fts:FTOr"/>
        <xs:element name="FTUnaryNot" type="fts:FTUnaryNot"/>
        <xs:element name="FTMildNot" type="fts:FTMildNot"/>
        <xs:element name="FTOrder" type="fts:FTOrder"/>
        <xs:element name="FTScope" type="fts:FTScope"/>
        <xs:element name="FTContent" type="fts:FTContent"/>
        <xs:element name="FTDistance" type="fts:FTDistance"/>
        <xs:element name="FTWindow" type="fts:FTWindow"/>
        <xs:element name="FTTimes" type="fts:FTTimes"/>
      </xs:choice>
      <xs:element name="matchOption" type="fts:FTMatchOption" minOccurs="0"/>
      <xs:element name="weight" type="xs:float" minOccurs="0"/>
    </xs:sequence>
  </xs:complexType>
  <xs:complexType name="FTWords">
    <xs:sequence>
      <xs:element name="searchToken" type="fts:TokenInfo"
                  minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
    <xs:attribute name="type" type="fts:FTWordsType" use="required"/>
  </xs:complexType>
  <xs:complexType name="FTAnd">
    <xs:sequence>
      <xs:element name="left" type="fts:FTSelection"/>
      <xs:element name="right" type="fts:FTSelection"/>
    </xs:sequence>
  </xs:complexType>
  <xs:complexType name="FTOr">
    <xs:sequence>
      <xs:element name="left" type="fts:FTSelection"/>
      <xs:element name="right" type="fts:FTSelection"/>
    </xs:sequence>
  </xs:complexType>
  <xs:complexType name="FTUnaryNot">
    <xs:sequence>
      <xs:element name="selection" type="fts:FTSelection"/>
    </xs:sequence>
  </xs:complexType>
  <!-- FTMildNot is binary: fts:evaluate reads its left and right children. -->
  <xs:complexType name="FTMildNot">
    <xs:sequence>
      <xs:element name="left" type="fts:FTSelection"/>
      <xs:element name="right" type="fts:FTSelection"/>
    </xs:sequence>
  </xs:complexType>
  <xs:complexType name="FTOrder">
    <xs:sequence>
      <xs:element name="selection" type="fts:FTSelection"/>
    </xs:sequence>
  </xs:complexType>
  <xs:complexType name="FTScope">
    <xs:sequence>
      <xs:element name="selection" type="fts:FTSelection"/>
    </xs:sequence>
    <xs:attribute name="type" type="fts:ScopeType" use="required"/>
    <xs:attribute name="scope" type="fts:ScopeSelector" use="required"/>
  </xs:complexType>
  <xs:complexType name="FTContent">
    <xs:sequence>
      <xs:element name="selection" type="fts:FTSelection"/>
    </xs:sequence>
    <xs:attribute name="type" type="fts:ContentMatchType" use="required"/>
  </xs:complexType>
  <xs:complexType name="FTDistance">
    <xs:sequence>
      <xs:element name="range" type="fts:FTRangeSpec"/>
      <xs:element name="selection" type="fts:FTSelection"/>
    </xs:sequence>
    <xs:attribute name="type" type="fts:DistanceType" use="required"/>
  </xs:complexType>
  <xs:complexType name="FTWindow">
    <xs:sequence>
      <xs:element name="selection" type="fts:FTSelection"/>
    </xs:sequence>
    <xs:attribute name="size" type="xs:integer" use="required"/>
    <xs:attribute name="type" type="fts:DistanceType" use="required"/>
  </xs:complexType>
  <xs:complexType name="FTTimes">
    <xs:sequence>
      <xs:element name="range" type="fts:FTRangeSpec"/>
      <xs:element name="selection" type="fts:FTSelection"/>
    </xs:sequence>
  </xs:complexType>
  <xs:complexType name="FTCaseOption">
    <xs:attribute name="value" use="required">
      <xs:simpleType>
        <xs:restriction base="xs:string">
          <xs:enumeration value="lowercase"/>
          <xs:enumeration value="uppercase"/>
          <xs:enumeration value="case insensitive"/>
          <xs:enumeration value="case sensitive"/>
        </xs:restriction>
      </xs:simpleType>
    </xs:attribute>
  </xs:complexType>
  <xs:complexType name="FTRangeSpec">
    <xs:attribute name="type" type="fts:RangeSpecType" use="required"/>
    <xs:attribute name="m" type="xs:integer"/>
    <xs:attribute name="n" type="xs:integer" use="required"/>
  </xs:complexType>
  <xs:simpleType name="FTWordsType">
    <xs:restriction base="xs:string">
      <xs:enumeration value="any"/>
      <xs:enumeration value="all"/>
      <xs:enumeration value="phrase"/>
      <xs:enumeration value="any word"/>
      <xs:enumeration value="all word"/>
    </xs:restriction>
  </xs:simpleType>
  <xs:simpleType name="ScopeType">
    <xs:restriction base="xs:string">
      <xs:enumeration value="same"/>
      <xs:enumeration value="different"/>
    </xs:restriction>
  </xs:simpleType>
  <xs:simpleType name="ScopeSelector">
    <xs:restriction base="xs:string">
      <xs:enumeration value="paragraph"/>
      <xs:enumeration value="sentence"/>
    </xs:restriction>
  </xs:simpleType>
  <xs:simpleType name="RangeSpecType">
    <xs:restriction base="xs:string">
      <xs:enumeration value="exactly"/>
      <xs:enumeration value="at least"/>
      <xs:enumeration value="at most"/>
      <xs:enumeration value="from to"/>
    </xs:restriction>
  </xs:simpleType>
  <xs:simpleType name="DistanceType">
    <xs:restriction base="xs:string">
      <xs:enumeration value="paragraph"/>
      <xs:enumeration value="sentence"/>
      <xs:enumeration value="word"/>
    </xs:restriction>
  </xs:simpleType>
  <xs:simpleType name="ContentMatchType">
    <xs:restriction base="xs:string">
      <xs:enumeration value="at start"/>
      <xs:enumeration value="at end"/>
      <xs:enumeration value="entire content"/>
    </xs:restriction>
  </xs:simpleType>
</xs:schema>
evaluate function
The denotational semantics for the evaluation of FTSelections is defined using the fts:evaluate function. The function takes four parameters: 1) an FTSelection, 2) a search context node, 3) the default set of match options that apply to the evaluation of the FTSelection, and 4) the number of search tokens processed so far.
The fts:evaluate function returns the AllMatches that is the result of evaluating the FTSelection. When fts:evaluate is applied to some FTSelection X, it calls the function fts:ApplyX to build the resulting AllMatches. If X is applied on nested FTSelections, the fts:evaluate function is recursively called on these nested FTSelections and the returned AllMatches are used in the evaluation of fts:ApplyX.
The fts:evaluate function is defined below.
declare function fts:evaluate(
    $ftSelect as element(*, fts:FTSelection),
    $searchContext as node(),
    $matchOptions as fts:FTMatchOptions,
    $searchTokenNum as xs:integer)
  as element(allMatches, fts:AllMatches)
{
  if (fn:count($ftSelect/matchOption) > 0) then
    (: First we deal with all match options that the    :)
    (: FTSelection might bear: we add the match options :)
    (: in front of the current match options sequence   :)
    (: and pass the new sequence to the recursive call  :)
    let $newFTSelection :=
      $ftSelect/*[fn:not(. instance of element(matchOption))]
    return fts:evaluate($newFTSelection, $searchContext,
                        ($ftSelect/matchOption, $matchOptions),
                        $searchTokenNum)
  else if (fn:count($ftSelect/weight) > 0) then
    (: Weight has no bearing on semantics; just :)
    (: call "evaluate" on nested FTSelection    :)
    let $newFTSelection :=
      $ftSelect/*[fn:not(. instance of element(weight))]
    return fts:evaluate($newFTSelection, $searchContext,
                        $matchOptions, $searchTokenNum)
  else
    typeswitch ($ftSelect)
    case $nftSelection as element(FTWords) return
      (: Apply the FTWords in the search context :)
      fts:ApplyFTWords($searchContext, $matchOptions,
                       $nftSelection/@type,
                       $nftSelection/searchToken,
                       $searchTokenNum + 1)
    case $nftSelection as element(FTAnd) return
      let $left := fts:evaluate($nftSelection/left, $searchContext,
                                $matchOptions, $searchTokenNum)
      let $newSearchTokenNum := $left/@stokenNum
      let $right := fts:evaluate($nftSelection/right, $searchContext,
                                 $matchOptions, $newSearchTokenNum)
      return fts:ApplyFTAnd($left, $right)
    case $nftSelection as element(FTOr) return
      let $left := fts:evaluate($nftSelection/left, $searchContext,
                                $matchOptions, $searchTokenNum)
      let $newSearchTokenNum := $left/@stokenNum
      let $right := fts:evaluate($nftSelection/right, $searchContext,
                                 $matchOptions, $newSearchTokenNum)
      return fts:ApplyFTOr($left, $right)
    case $nftSelection as element(FTUnaryNot) return
      (: The nested FTSelection is evaluated before it is negated :)
      let $nested := fts:evaluate($nftSelection/selection, $searchContext,
                                  $matchOptions, $searchTokenNum)
      return fts:ApplyFTUnaryNot($nested)
    case $nftSelection as element(FTMildNot) return
      let $left := fts:evaluate($nftSelection/left, $searchContext,
                                $matchOptions, $searchTokenNum)
      let $newSearchTokenNum := $left/@stokenNum
      let $right := fts:evaluate($nftSelection/right, $searchContext,
                                 $matchOptions, $newSearchTokenNum)
      return fts:ApplyFTMildNot($left, $right)
    case $nftSelection as element(FTOrder) return
      let $nested := fts:evaluate($nftSelection/selection, $searchContext,
                                  $matchOptions, $searchTokenNum)
      return fts:ApplyFTOrder($nested)
    case $nftSelection as element(FTScope) return
      let $nested := fts:evaluate($nftSelection/selection, $searchContext,
                                  $matchOptions, $searchTokenNum)
      return fts:ApplyFTScope($nftSelection/@type,
                              $nftSelection/@scope, $nested)
    case $nftSelection as element(FTContent) return
      let $nested := fts:evaluate($nftSelection/selection, $searchContext,
                                  $matchOptions, $searchTokenNum)
      return fts:ApplyFTContent($searchContext,
                                $nftSelection/@type, $nested)
    case $nftSelection as element(FTDistance) return
      let $nested := fts:evaluate($nftSelection/selection, $searchContext,
                                  $matchOptions, $searchTokenNum)
      return fts:ApplyFTDistance($matchOptions, $nftSelection/@type,
                                 $nftSelection/range, $nested)
    case $nftSelection as element(FTWindow) return
      let $nested := fts:evaluate($nftSelection/selection, $searchContext,
                                  $matchOptions, $searchTokenNum)
      return fts:ApplyFTWindow($matchOptions, $nftSelection/@type,
                               $nftSelection/@size, $nested)
    case $nftSelection as element(FTTimes) return
      let $nested := fts:evaluate($nftSelection/selection, $searchContext,
                                  $matchOptions, $searchTokenNum)
      return fts:ApplyFTTimes($nftSelection/range, $nested)
    (: A typeswitch requires a default clause; no other kind is defined :)
    default return fn:error()
};
For concreteness, assume that the FTSelection was invoked inside an ftcontains expression such as searchContext ftcontains ftselection. In order to determine the AllMatches result of ftselection, the fts:evaluate function is invoked as follows: fts:evaluate($ftselection, $searchContext, $matchOptions, 0), where $ftselection is the XML representation of the ftselection and $searchContext is bound to the result of the evaluation of the XQuery expression searchContext.
Initially, $searchTokenNum is 0, i.e., no search tokens have been processed.
The variable $matchOptions is bound to the list of match options as defined in the static context (see Appendix C Static Context Components). Match options embedded in ftselection modify the match options collection as evaluation proceeds.
The match options that apply to an FTSelection are organized in a stack.
The top match option in the stack is applied first.
The second match option is applied next.
Match options are applied sequentially down to the bottom of the stack.
Ordering among match options is necessary because match options are not always commutative. For example, synonym(stem(word)) is not always the same as stem(synonym(word)). Naturally, match options may be reordered when they commute, but this is an optimization issue and is beyond the scope of this document.
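The effect of ordering can be illustrated with a toy Python sketch. Everything in it is an assumption made for illustration: the one-entry thesaurus, the naive stemmer that strips a trailing "s", and the `apply_options` helper, which applies a stack of options from the top (first list element) to the bottom.

```python
# Hypothetical one-entry thesaurus, for illustration only.
SYNONYMS = {"automobiles": "cars"}

def stem(word):
    """Naive stemmer: strip one trailing 's'."""
    return word[:-1] if word.endswith("s") else word

def synonym(word):
    """Replace a word by its thesaurus entry, if there is one."""
    return SYNONYMS.get(word, word)

def apply_options(word, stack):
    """Apply match options sequentially from the top of the stack down."""
    for option in stack:
        word = option(word)
    return word

word = "automobiles"
print(apply_options(word, [stem, synonym]))  # synonym(stem(word)) -> automobile
print(apply_options(word, [synonym, stem]))  # stem(synonym(word)) -> car
```

With `stem` on top, the stemmed form no longer has a thesaurus entry, so the two orders yield different words.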
Given the invocation fts:evaluate($ftselection, $searchContext, $matchOptions, $searchTokenNum), evaluation proceeds as follows. First, $ftselection is checked to see whether it 1) carries a match option, 2) carries a weight specification, 3) is an FTWords, or 4) is some other FTSelection.
If $ftselection contains a match option, then it modifies the context for the nested FTSelection. Consequently, a new match option element is created and pushed onto the top of the stack of match options. The createOptionElement function, which creates the stack element corresponding to the match option, builds a data structure that stores the type of match option, such as stemming or thesaurus, and the details relating to the match option, such as the name of the thesaurus or the stop words for which other words may be substituted. The match option created is added to the top of the stack because, in the FTSelection, it was applied before the other match options in the current match options stack. The evaluate function is then invoked on the nested FTSelection with the new match options stack. When the function returns, the match option is popped from the stack, and the result of the nested evaluate function is returned. The match option is popped because it does not apply to FTSelections outside its scope.
If $ftselection contains a weight specification, then the specification is ignored because it does not alter the semantics. The evaluate function is recursively called on the nested FTSelection and the resulting AllMatches is returned.
If $ftselection is an FTWords, then it does not have any nested FTSelections. Consequently, this is the base of the recursive call, and the AllMatches result of the FTWords is computed and returned. The AllMatches is computed by invoking the ApplyFTWords function with the current search context and other necessary information.
If $ftselection contains neither a match option nor a weight specification and is not an FTWords, the FTSelection performs a full-text operation, such as &&, ||, or window. These operations are fully compositional and may be invoked on nested FTSelections. Consequently, evaluation proceeds as follows.
First, the evaluate function is recursively invoked on each nested FTSelection. The result of evaluating each nested FTSelection is an AllMatches.
The AllMatches are transformed into the resulting AllMatches by applying the full-text operation corresponding to the FTSelection, which is generically named ApplyX for some type of FTSelection X in the code.
For example, let FTSelection1 be FTSelection2 && FTSelection3. Here FTSelection2 and FTSelection3 may themselves be arbitrarily nested FTSelections. Thus, evaluate is invoked on FTSelection2 and FTSelection3, and the resulting AllMatches are transformed to the final AllMatches using the ApplyFTAnd function corresponding to &&.
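The four cases of this recursion can be sketched in Python. This is a simplified model under stated assumptions, not the formal semantics: an FTSelection is encoded as a nested tuple, the search context is a mapping from positions to words, match options are plain string functions applied to both the search token and the document words, a Match is a frozenset of `(kind, queryPos, pos)` tuples, and only words, && and || are covered. All names are hypothetical.

```python
def apply_options(word, options):
    """Apply the match options from the top of the stack down."""
    for option in options:
        word = option(word)
    return word

def evaluate(selection, tokens, options, token_num):
    """Return (all_matches, token_num) for a nested-tuple FTSelection."""
    kind = selection[0]
    if kind == "option":   # ("option", function, nested): push, then recurse
        _, option, nested = selection
        return evaluate(nested, tokens, [option] + options, token_num)
    if kind == "weight":   # ("weight", value, nested): weight is ignored
        return evaluate(selection[2], tokens, options, token_num)
    if kind == "words":    # ("words", search_token): base of the recursion
        query_pos = token_num + 1
        wanted = apply_options(selection[1], options)
        matches = [frozenset({("include", query_pos, pos)})
                   for pos, word in sorted(tokens.items())
                   if apply_options(word, options) == wanted]
        return matches, query_pos
    # Binary operations: evaluate both operands, threading the token count.
    left, token_num = evaluate(selection[1], tokens, options, token_num)
    right, token_num = evaluate(selection[2], tokens, options, token_num)
    if kind == "and":      # Cartesian product of the Matches
        return [l | r for l in left for r in right], token_num
    if kind == "or":       # union of the Matches
        return left + right, token_num
    raise ValueError(f"unsupported FTSelection kind: {kind}")

tokens = {1: "Ford", 2: "Mustang", 27: "Ford", 28: "Mustang"}
query = ("option", str.lower,
         ("and", ("words", "ford"), ("words", "mustang")))
matches, count = evaluate(query, tokens, [], 0)
print(count, len(matches))  # 2 4
```

Because the option stack is passed as a new list to the recursive call, the "pop" on return is implicit: the caller's list is never modified.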
The semantics of the ApplyX function for each FTSelection kind X are given below.
The formal semantics of the ApplyX functions for each FTSelection kind X are specified in terms of six functions. How these six functions are computed is implementation-dependent, but the functions must satisfy some well-defined properties.
The getTokenInfo function is described in Section 4.2.7 Tokenization.
The wordDistance function returns the number of words that occur between the positions of the TokenInfos $tokenInfo1 and $tokenInfo2. For example, two consecutive words have a distance of 0 words.
function fts:wordDistance( $tokenInfo1 as fts:TokenInfo, $tokenInfo2 as fts:TokenInfo, $matchOptions as fts:FTMatchOptions) as xs:integer
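As a small illustration of this convention, the following Python sketch computes the distance from two token positions. It assumes that tokens are numbered consecutively and that no match option (such as stop words) changes which words are counted; the real function is implementation-dependent, and the name `word_distance` is hypothetical.

```python
def word_distance(pos1: int, pos2: int) -> int:
    """Number of words strictly between two token positions."""
    return abs(pos1 - pos2) - 1

print(word_distance(1, 2))   # 0, consecutive words
print(word_distance(2, 28))  # 25
```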
The paraDistance function returns the number of paragraphs between the TokenInfos $tokenInfo1 and $tokenInfo2.
function fts:paraDistance( $tokenInfo1 as fts:TokenInfo, $tokenInfo2 as fts:TokenInfo, $matchOptions as fts:FTMatchOptions) as xs:integer
The sentenceDistance function returns the number of sentences between the TokenInfos $tokenInfo1 and $tokenInfo2.
function fts:sentenceDistance( $tokenInfo1 as fts:TokenInfo, $tokenInfo2 as fts:TokenInfo, $matchOptions as fts:FTMatchOptions) as xs:integer
The isStartToken function returns true if the TokenInfo $tokenInfo describes the first token of the node $searchContext.
function fts:isStartToken( $searchContext as node(), $tokenInfo as fts:TokenInfo) as xs:boolean
The isEndToken function returns true if the TokenInfo $tokenInfo describes the last token of the node $searchContext.
function fts:isEndToken( $searchContext as node(), $tokenInfo as fts:TokenInfo) as xs:boolean
If an FTWords consists of a single search string, the parameters of the applySingleSearchToken function are 1) the search context, 2) the list of match options, 3) the search TokenInfo, and 4) the position where the latter occurs in the query.
If, after the application of all the match options, the sequence of search tokens returned for an FTWords is empty, an empty AllMatches is returned.
declare function fts:applySingleSearchToken(
    $searchContext as node(),
    $matchOptions as fts:FTMatchOptions,
    $searchToken as fts:TokenInfo,
    $queryPos as xs:integer)
  as element(allMatches, fts:AllMatches)
{
  let $res :=
    <allMatches stokenNum="{$queryPos}">
      {
        (: One Match per position at which the search token occurs :)
        let $token_pos := fts:getTokenInfo($searchContext,
                                           $matchOptions,
                                           $searchToken)
        for $pos in $token_pos
        return
          <match>
            <stringInclude queryPos="{$queryPos}"
                           queryString="{$searchToken/@word}">
              {$pos}
            </stringInclude>
          </match>
      }
    </allMatches>
  return fts:normalizeAllMatches($res)
};
The AllMatches corresponding to an FTWords consists of a set of Matches. Each Match is associated with a position where the corresponding search token was found. For example, the AllMatches result for the FTWords "Mustang" is given below.
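A minimal Python sketch of this construction follows. It assumes the search context is a mapping from positions to words, that matching is plain string equality (no match options), and that the positions are the ones used for "Mustang" earlier in this section; the dictionary layout and the function name are illustrative only.

```python
def apply_single_search_token(tokens, search_token, query_pos):
    """Build one Match per position at which the search token occurs."""
    return {
        "stokenNum": query_pos,
        "matches": [
            [("include", query_pos, search_token, pos)]
            for pos, word in sorted(tokens.items())
            if word == search_token
        ],
    }

tokens = {1: "Ford", 2: "Mustang", 27: "Ford", 28: "Mustang"}
result = apply_single_search_token(tokens, "Mustang", 1)
for match in result["matches"]:
    print(match)
# [('include', 1, 'Mustang', 2)]
# [('include', 1, 'Mustang', 28)]
```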
There are five variations of FTWords depending on how the words and phrases in the nested XQuery 1.0 and XPath 2.0 expression are matched.
When any word is specified, at least one word in the tokenization of the nested expression must be matched.
When all word is specified, all words in the tokenization of the nested expression must be matched.
When phrase is specified, all words in the tokenization of the nested expression must be matched as a phrase.
When any is specified, at least one string atomic value in the nested expression must be matched as a phrase.
When all is specified, all string atomic values in the nested expression must be matched as a phrase.
The semantics for FTWords when any word is specified are given below. Since FTWords does not have nested FTSelections, the ApplyFTWords function does not take AllMatches parameters corresponding to nested FTSelection results.
declare function fts:MakeDisjunction(
    $curRes as element(allMatches, fts:AllMatches),
    $rest as element(allMatches, fts:AllMatches)*)
  as element(allMatches, fts:AllMatches)
{
  if (fn:count($rest) = 0) then $curRes
  else
    let $firstAllMatches := $rest[1]
    let $restAllMatches := fn:subsequence($rest, 2)
    let $newCurRes := fts:ApplyFTOr($curRes, $firstAllMatches)
    return fts:MakeDisjunction($newCurRes, $restAllMatches)
};

declare function fts:ApplyFTWordsAnyWord(
    $searchContext as node(),
    $matchOptions as fts:FTMatchOptions,
    $searchStrings as xs:string*,
    $queryPos as xs:integer)
  as element(allMatches, fts:AllMatches)
{
  (: Tokenize all search strings into one sequence of TokenInfos :)
  let $searchTokens :=
    for $searchString in $searchStrings
    return fts:getSearchTokenInfo($searchString, $matchOptions)
  return
    if (fn:count($searchTokens) eq 0) then
      <allMatches stokenNum="0" />
    else
      let $allAllMatches :=
        for $searchToken at $pos in $searchTokens
        return fts:applySingleSearchToken($searchContext,
                                          $matchOptions,
                                          $searchToken,
                                          $queryPos + $pos - 1)
      let $firstAllMatches := $allAllMatches[1]
      let $restAllMatches := fn:subsequence($allAllMatches, 2)
      return fts:MakeDisjunction($firstAllMatches, $restAllMatches)
};
The search strings are tokenized and a single sequence that consists of all TokenInfos is constructed. For each of these, the result of FTWords is computed using applySingleSearchToken. Finally, the disjunction of all resulting AllMatches is computed.
The semantics for FTWords when all word is specified are given below.
declare function fts:MakeConjunction(
    $curRes as element(allMatches, fts:AllMatches),
    $rest as element(allMatches, fts:AllMatches)*)
  as element(allMatches, fts:AllMatches)
{
  if (fn:count($rest) = 0) then $curRes
  else
    let $firstAllMatches := $rest[1]
    let $restAllMatches := fn:subsequence($rest, 2)
    let $newCurRes := fts:ApplyFTAnd($curRes, $firstAllMatches)
    return fts:MakeConjunction($newCurRes, $restAllMatches)
};

declare function fts:ApplyFTWordsAllWord(
    $searchContext as node(),
    $matchOptions as fts:FTMatchOptions,
    $searchStrings as xs:string*,
    $queryPos as xs:integer)
  as element(allMatches, fts:AllMatches)
{
  (: Tokenize all search strings into one sequence of TokenInfos :)
  let $searchTokens :=
    for $searchString in $searchStrings
    return fts:getSearchTokenInfo($searchString, $matchOptions)
  return
    if (fn:count($searchTokens) eq 0) then
      <allMatches stokenNum="0" />
    else
      let $allAllMatches :=
        for $searchToken at $pos in $searchTokens
        return fts:applySingleSearchToken($searchContext,
                                          $matchOptions,
                                          $searchToken,
                                          $queryPos + $pos - 1)
      let $firstAllMatches := $allAllMatches[1]
      let $restAllMatches := fn:subsequence($allAllMatches, 2)
      return fts:MakeConjunction($firstAllMatches, $restAllMatches)
};
The semantics for FTWords when phrase is specified are given below.
declare function fts:ApplyFTWordsPhrase(
    $searchContext as node(),
    $matchOptions as fts:FTMatchOptions,
    $searchStrings as xs:string*,
    $queryPos as xs:integer)
  as element(allMatches, fts:AllMatches)
{
  let $conj := fts:ApplyFTWordsAllWord($searchContext,
                                       $matchOptions,
                                       $searchStrings,
                                       $queryPos)
  let $ordered := fts:ApplyFTOrder($conj)
  (: Arguments follow the ApplyFTDistance call in fts:evaluate: :)
  (: match options, distance unit, range, nested AllMatches     :)
  let $distance1 := fts:ApplyFTDistance($matchOptions,
                                        "word",
                                        <fts:range type="exactly" n="0"/>,
                                        $ordered)
  return $distance1
};
The ApplyFTWordsPhrase function is similar to the ApplyFTWordsAllWord function in the case of all word. The only differences are that the additional FTSelections ordered and distance 0 words are applied.
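The composition of all word, ordered, and distance 0 words can be sketched in Python. The sketch assumes a search context given as a mapping from positions to words, plain string equality for matching, and consecutive integer positions; `apply_phrase` and the tuple encoding `(kind, queryPos, pos)` are illustrative names, not part of the specification.

```python
from itertools import product

def apply_phrase(tokens, words, query_pos=1):
    """Conjunction of all words, filtered by order and distance 0."""
    # all word: candidate positions for each word of the phrase
    per_word = [[pos for pos, w in sorted(tokens.items()) if w == word]
                for word in words]
    results = []
    for combo in product(*per_word):   # Cartesian product, as in FTAnd
        pairs = list(zip(combo, combo[1:]))
        ordered = all(a < b for a, b in pairs)            # ordered
        adjacent = all(abs(b - a) - 1 == 0 for a, b in pairs)  # distance 0
        if ordered and adjacent:
            results.append([("include", query_pos + i, pos)
                            for i, pos in enumerate(combo)])
    return results

tokens = {1: "Ford", 2: "Mustang", 27: "Ford", 28: "Mustang"}
for match in apply_phrase(tokens, ["Ford", "Mustang"]):
    print(match)
# [('include', 1, 1), ('include', 2, 2)]
# [('include', 1, 27), ('include', 2, 28)]
```

Of the four combinations produced by the conjunction, only the two with adjacent, correctly ordered positions survive, which agrees with the "Ford Mustang" example earlier in this section.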
The semantics for FTWords when any is specified are given below.
declare function fts:ApplyFTWordsAny(
    $searchContext as node(),
    $matchOptions as fts:FTMatchOptions,
    $searchStrings as xs:string*,
    $queryPos as xs:integer)
  as element(allMatches, fts:AllMatches)
{
  if (fn:count($searchStrings) eq 0) then
    <allMatches stokenNum="0" />
  else
    let $firstSearchString := $searchStrings[1]
    let $restSearchString := fn:subsequence($searchStrings, 2)
    (: Match the first search string as a phrase :)
    let $firstAllMatches :=
      fts:ApplyFTWordsPhrase($searchContext, $matchOptions,
                             $firstSearchString, $queryPos)
    let $newQueryPos := fn:max($firstAllMatches//@queryPos) + 1
    (: Recurse on the remaining search strings :)
    let $restAllMatches :=
      fts:ApplyFTWordsAny($searchContext, $matchOptions,
                          $restSearchString, $newQueryPos)
    return fts:ApplyFTOr($firstAllMatches, $restAllMatches)
};
An FTWords with any specified forms the disjunction of the AllMatches that result from matching each search string as a phrase.
The semantics for FTWords when all is specified are given below.
declare function fts:ApplyFTWordsAll(
    $searchContext as node(),
    $matchOptions as fts:FTMatchOptions,
    $searchStrings as xs:string*,
    $queryPos as xs:integer)
  as element(allMatches, fts:AllMatches)
{
  if (fn:count($searchStrings) = 0) then
    <allMatches stokenNum="0" />
  else
    let $firstSearchString := $searchStrings[1]
    let $restSearchString := fn:subsequence($searchStrings, 2)
    (: Match the first search string as a phrase :)
    let $firstAllMatches :=
      fts:ApplyFTWordsPhrase($searchContext, $matchOptions,
                             $firstSearchString, $queryPos)
    return
      (: The last search string ends the recursion; a conjunction :)
      (: with an empty AllMatches would discard every Match       :)
      if (fn:count($restSearchString) = 0) then $firstAllMatches
      else
        let $newQueryPos := fn:max($firstAllMatches//@queryPos) + 1
        let $restAllMatches :=
          fts:ApplyFTWordsAll($searchContext, $matchOptions,
                              $restSearchString, $newQueryPos)
        return fts:ApplyFTAnd($firstAllMatches, $restAllMatches)
};
The difference between all and any is the use of conjunction instead of disjunction.
The ApplyFTWords function combines all of these functions.
declare function fts:ApplyFTWords(
    $searchContext as node(),
    $matchOptions as fts:FTMatchOptions,
    $type as fts:FTWordsType,
    $searchTokens as xs:string*,
    $queryPos as xs:integer)
  as element(allMatches, fts:AllMatches)
{
  (: Dispatch on the FTWords variation :)
  if ($type eq "any word") then
    fts:ApplyFTWordsAnyWord($searchContext, $matchOptions,
                            $searchTokens, $queryPos)
  else if ($type eq "all word") then
    fts:ApplyFTWordsAllWord($searchContext, $matchOptions,
                            $searchTokens, $queryPos)
  else if ($type eq "phrase") then
    fts:ApplyFTWordsPhrase($searchContext, $matchOptions,
                           $searchTokens, $queryPos)
  else if ($type eq "any") then
    fts:ApplyFTWordsAny($searchContext, $matchOptions,
                        $searchTokens, $queryPos)
  else
    fts:ApplyFTWordsAll($searchContext, $matchOptions,
                        $searchTokens, $queryPos)
};
The parameters of the ApplyFTOr function are the two AllMatches parameters corresponding to the results of the two nested FTSelections. The search context and the match options stack are not used by this function. The semantics are given below.
declare function fts:ApplyFTOr(
    $allMatches1 as element(allMatches, fts:AllMatches),
    $allMatches2 as element(allMatches, fts:AllMatches))
  as element(allMatches, fts:AllMatches)
{
  let $res :=
    <allMatches stokenNum="{fn:max(($allMatches1/@stokenNum,
                                    $allMatches2/@stokenNum))}">
      {$allMatches1/match, $allMatches2/match}
    </allMatches>
  return fts:normalizeAllMatches($res)
};
The ApplyFTOr
function creates a new AllMatches in which Matches are the union of those found in the input AllMatches. Each Match represents one possible result of the corresponding FTSelection. Thus, a Match from either of the AllMatches is a resukt.
For example, consider the FTSelection "Mustang" || "Honda"
. The AllMatches corresponding to "Mustang" and "Honda" are given below.
The AllMatches produced by ApplyFTOr
is given below.
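As a non-normative illustration, the following Python sketch models the union performed by ApplyFTOr. The data model is an assumption made for this sketch only: an AllMatches is a list of Matches, and a Match is a pair of frozensets holding included and excluded token positions. The helper m, the positions used, and the removal of duplicates (standing in for fts:normalizeAllMatches, which is not defined in this section) are likewise hypothetical.

```python
# Hypothetical model: AllMatches = list of Matches,
# Match = (included positions, excluded positions) as frozensets.
def apply_ft_or(all_matches1, all_matches2):
    """Return the Matches of both operands, dropping exact duplicates."""
    result = []
    for match in all_matches1 + all_matches2:
        if match not in result:  # assumed stand-in for normalization
            result.append(match)
    return result


def m(includes, excludes=()):
    """Build a Match from iterables of token positions."""
    return (frozenset(includes), frozenset(excludes))


mustang = [m({1}), m({27})]  # made-up positions of "Mustang"
honda = [m({5})]             # made-up position of "Honda"
combined = apply_ft_or(mustang, honda)
```

Under these assumptions, combined holds three Matches, one for each occurrence of either word.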
The parameters of the ApplyFTAnd
function are the two AllMatches corresponding to the results of the two nested FTSelections. The search context and the match options are not used by this function. The semantics are given below.
declare function fts:ApplyFTAnd ($allMatches1 as element(allMatches, fts:AllMatches), $allMatches2 as element(allMatches, fts:AllMatches)) as element(allMatches, fts:AllMatches) { let $res := <allMatches stokenNum="{fn:max(($allMatches1/@stokenNum, $allMatches2/@stokenNum))}" > {for $sm1 in $allMatches1/match for $sm2 in $allMatches2/match return <match> {$sm1/*, $sm2/*} </match> } </allMatches> return fts:normalizeAllMatches($res) }
The result of the conjunction is a new AllMatches that contains the "Cartesian product" of the Matches of the participating FTSelections. Every resulting Match is formed by combining the StringInclude and StringExclude components of one Match from each of the AllMatches of the nested FTSelections. Thus, every resulting Match contains the positions needed to satisfy a Match from both original FTSelections and excludes the positions that violate the same Matches.
For example, consider the FTSelection "Mustang" && "rust"
. The source AllMatches are given below.
The AllMatches produced by ApplyFTAnd
is given below.
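As a non-normative illustration, the Cartesian product formed by ApplyFTAnd can be sketched in Python. The representation of a Match as a pair of frozensets (included and excluded token positions) and the positions used are assumptions of this sketch; normalization is not modelled.

```python
from itertools import product


def apply_ft_and(all_matches1, all_matches2):
    """Pair every Match of one operand with every Match of the other.

    Each resulting Match carries the includes and excludes of both.
    """
    return [
        (inc1 | inc2, exc1 | exc2)
        for (inc1, exc1), (inc2, exc2) in product(all_matches1, all_matches2)
    ]


# Made-up positions: "Mustang" at 1 and 27, "rust" at 12.
mustang = [(frozenset({1}), frozenset()), (frozenset({27}), frozenset())]
rust = [(frozenset({12}), frozenset())]
both = apply_ft_and(mustang, rust)
```

Two Matches result, one per pairing. If either operand has no Matches, the product is empty, which corresponds to the conjunction having no result.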
The parameters of the ApplyFTUnaryNot
function are 1) the search context, 2) the list of match options, and 3) one AllMatches parameter corresponding to the result of the nested FTSelection to be negated. The search context and the match options are not used by this function. The semantics are given below.
declare function fts:InvertStringMatch($strm) { if ($strm instance of element(stringExclude)) then <stringInclude queryPos="{$strm/@queryPos}" queryString="{$strm/@queryString}"> {$strm/docPos} </stringInclude> else <stringExclude queryPos="{$strm/@queryPos}" queryString="{$strm/@queryString}"> {$strm/docPos} </stringExclude> } declare function fts:UnaryNotHelper($matches) { if (fn:empty($matches)) then <match /> else for $sm in $matches[1]/child::element() for $rest in fts:UnaryNotHelper(fn:subsequence($matches, 2)) return <match> {fts:InvertStringMatch($sm), $rest/*} </match> } declare function fts:ApplyFTUnaryNot($allMatches as element(allMatches, fts:AllMatches)) as element(allMatches, fts:AllMatches) { if ($allMatches/match) then let $res := <allMatches stokenNum="{$allMatches/@stokenNum}"> {fts:UnaryNotHelper($allMatches/match)} </allMatches> return fts:normalizeAllMatches($res) else <allMatches stokenNum="{$allMatches/@stokenNum}"> <match /> </allMatches> }
The generation of the resulting AllMatches of an FTUnaryNot resembles the transformation of the negation of a propositional formula in DNF back to DNF. The negation of an AllMatches requires the inversion of all the conditions on the nodes encoded by the AllMatches.
In the functions above,
this inversion occurs as follows.
The function fts:InvertStringMatch
inverts a stringInclude into a stringExclude and vice versa.
The function fts:UnaryNotHelper
transforms the source Matches into the resulting Matches by combining the inversion of one StringInclude or StringExclude component from every source Match into a new Match.
For example, consider the FTSelection ! ("Mustang" || "Honda")
. The source AllMatches is given below:
The FTUnaryNot transforms the StringIncludes to StringExcludes as illustrated below.
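The DNF analogy can be made concrete with a short, non-normative Python sketch. It assumes a simplified model in which a Match is a pair of frozensets (included and excluded token positions); the positions are made up, and normalization is not modelled. One condition is chosen from every source Match, and every chosen condition is inverted.

```python
from itertools import product


def apply_ft_unary_not(all_matches):
    """Negate an AllMatches read as a disjunction of conjunctions."""
    if not all_matches:
        # No Match could be satisfied, so the negation always holds:
        # a single Match without any condition.
        return [(frozenset(), frozenset())]
    tagged = [
        [("inc", p) for p in sorted(inc)] + [("exc", p) for p in sorted(exc)]
        for inc, exc in all_matches
    ]
    result = []
    for choice in product(*tagged):  # one condition per source Match
        new_inc = frozenset(p for kind, p in choice if kind == "exc")
        new_exc = frozenset(p for kind, p in choice if kind == "inc")
        result.append((new_inc, new_exc))
    return result


# "Mustang" at position 1 or "Honda" at position 5 (made-up positions).
source = [(frozenset({1}), frozenset()), (frozenset({5}), frozenset())]
negated = apply_ft_unary_not(source)
```

In this model the negation is a single Match that excludes both positions, and negating it again yields the two original Matches.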
The parameters of the ApplyFTMildNot
function are the two AllMatches parameters corresponding to the results of the two nested FTSelections. The search context and the match options stack are not used by this function. The semantics are given below.
declare function fts:ApplyFTMildNot($allMatches1 as element(allMatches, fts:AllMatches), $allMatches2 as element(allMatches, fts:AllMatches)) as element(allMatches, fts:AllMatches){ if (fn:count($allMatches2//stringExclude) gt 0) then fn:error("Invalid expression on the right-hand side of a not-in") else let $res := <allMatches stokenNum="{$allMatches1/@stokenNum}"> {let $posSet2 := $allMatches2/match/stringInclude/pos return $allMatches1/match[every $pos1 in ./stringInclude/pos, $pos2 in $posSet2 satisfies $pos1 ne $pos2] } </allMatches> return fts:normalizeAllMatches($res) }
The resulting AllMatches contains those Matches of the first operand whose StringInclude components mention no position that also appears in a StringInclude component of the AllMatches of the second operand.
For example, consider the FTSelection ("Ford" mildnot "Ford Mustang")
. The source AllMatches for the left-hand side argument is given below.
The source AllMatches for the right-hand side argument is given below.
The FTMildNot will transform these to an empty AllMatches because both position 1 and position 27 from the first AllMatches contain only TokenInfos from stringInclude components of the second AllMatches.
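As a non-normative illustration, the filtering can be sketched in Python. A Match is modelled as a pair of frozensets (included and excluded token positions). Positions 1 and 27 for "Ford" are taken from the example above; the positions 2 and 28 for "Mustang" and the additional occurrence at position 40 are assumptions added for this sketch.

```python
def apply_ft_mild_not(all_matches1, all_matches2):
    """Keep the Matches of the first operand that share no included
    position with any Match of the second operand."""
    if any(exc for _, exc in all_matches2):
        raise ValueError("right-hand side must not contain exclusions")
    blocked = set().union(*(inc for inc, _ in all_matches2))
    return [(inc, exc) for inc, exc in all_matches1 if not (inc & blocked)]


ford = [(frozenset({1}), frozenset()), (frozenset({27}), frozenset())]
ford_mustang = [
    (frozenset({1, 2}), frozenset()),
    (frozenset({27, 28}), frozenset()),
]
result = apply_ft_mild_not(ford, ford_mustang)

# A hypothetical further "Ford" at position 40, not followed by "Mustang".
ford_extra = ford + [(frozenset({40}), frozenset())]
result_extra = apply_ft_mild_not(ford_extra, ford_mustang)
```

Here result is empty, as in the example, whereas result_extra retains only the Match at position 40.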
The parameters of the ApplyFTOrder
function are 1) the search context, 2) the list of match options, and 3) one AllMatches parameter corresponding to the result of the nested FTSelections. The search context and the match options are not used by this function. The semantics are given below.
declare function fts:ApplyFTOrder($allMatches as element(allMatches, fts:AllMatches)) as element(allMatches, fts:AllMatches) { let $res := <allMatches stokenNum="{$allMatches/@stokenNum}"> { for $match in $allMatches/match where every $stringInclude1 in $match/stringInclude, $stringInclude2 in $match/stringInclude satisfies (($stringInclude1/tokenInfo/@pos <= $stringInclude2/tokenInfo/@pos) and ($stringInclude1/@queryPos <= $stringInclude2/@queryPos)) or (($stringInclude1/tokenInfo/@pos >= $stringInclude2/tokenInfo/@pos) and ($stringInclude1/@queryPos >= $stringInclude2/@queryPos)) return <match> {$match/stringInclude} { for $stringExcl in $match/stringExclude where every $stringIncl in $match/stringInclude satisfies (($stringExcl/tokenInfo/@pos <= $stringIncl/tokenInfo/@pos) and ($stringExcl/@queryPos <= $stringIncl/@queryPos)) or (($stringExcl/tokenInfo/@pos >= $stringIncl/tokenInfo/@pos) and ($stringExcl/@queryPos >= $stringIncl/@queryPos)) return $stringExcl } </match> } </allMatches> return fts:normalizeAllMatches($res) }
The resulting AllMatches contains the Matches for which the positions in the StringInclude elements are in the order of the query positions of their query strings. StringExcludes that preserve the order are also retained.
For example, consider the FTSelection ("great" && "condition") ordered
. The source AllMatches is given below.
The AllMatches for FTOrder are given below.
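As a non-normative illustration, the ordering condition can be sketched in Python. Each StringInclude or StringExclude is modelled as a (query position, token position) pair; this representation and the positions used are assumptions of this sketch.

```python
from itertools import combinations


def in_query_order(a, b):
    """True if two (query_pos, token_pos) pairs appear in the document
    in the same relative order as in the query."""
    (qa, pa), (qb, pb) = a, b
    return (pa <= pb and qa <= qb) or (pa >= pb and qa >= qb)


def apply_ft_order(all_matches):
    result = []
    for includes, excludes in all_matches:
        if all(in_query_order(a, b) for a, b in combinations(includes, 2)):
            kept = [
                e for e in excludes
                if all(in_query_order(e, i) for i in includes)
            ]
            result.append((includes, kept))
    return result


# "great" has query position 1, "condition" has query position 2.
matches = [
    ([(1, 10), (2, 11)], []),  # "great" precedes "condition"
    ([(1, 30), (2, 11)], []),  # "great" follows "condition"
]
ordered = apply_ft_order(matches)
```

Only the first Match survives, because in the second Match the token positions contradict the order of the query strings.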
The parameters of the ApplyFTScope
function are 1) the search context, 2) the list of match options, 3) the type of the scope (same or different), 4) the linguistic unit (sentence or paragraph), and 5) one AllMatches parameter corresponding to the result of the nested FTSelections. The search context and the match options are not used by this function. The function definitions depend on the type of the scope (paragraph, sentence) and the scope predicate (same, different).
The semantics of same sentence
are given below.
declare function fts:ApplyFTScopeSameSentence( $allMatches as element(allMatches, fts:AllMatches) ) as element(allMatches, fts:AllMatches) { let $res := <allMatches stokenNum="{$allMatches/@stokenNum}"> { for $match in $allMatches/match where every $stringInclude1 in $match/stringInclude, $stringInclude2 in $match/stringInclude satisfies $stringInclude1/tokenInfo/@sentence = $stringInclude2/tokenInfo/@sentence return <match> {$match/stringInclude} { for $stringExcl in $match/stringExclude where every $stringIncl in $match/stringInclude satisfies $stringIncl/tokenInfo/@sentence = $stringExcl/tokenInfo/@sentence return $stringExcl } </match> } </allMatches> return fts:normalizeAllMatches($res) }
The semantics of different sentence
are given below.
declare function fts:ApplyFTScopeDifferentSentence( $allMatches as element(allMatches, fts:AllMatches) ) as element(allMatches, fts:AllMatches) { let $res := <allMatches stokenNum="{$allMatches/@stokenNum}"> { for $match in $allMatches/match where every $stringInclude1 in $match/stringInclude, $stringInclude2 in $match/stringInclude satisfies $stringInclude1 is $stringInclude2 or $stringInclude1/tokenInfo/@sentence != $stringInclude2/tokenInfo/@sentence return <match> {$match/stringInclude} { for $stringExcl in $match/stringExclude where every $stringIncl in $match/stringInclude satisfies $stringIncl/tokenInfo/@sentence != $stringExcl/tokenInfo/@sentence return $stringExcl } </match> } </allMatches> return fts:normalizeAllMatches($res) }
The semantics of same paragraph
are given below.
declare function fts:ApplyFTScopeSameParagraph( $allMatches as element(allMatches, fts:AllMatches) ) as element(allMatches, fts:AllMatches) { let $res := <allMatches stokenNum="{$allMatches/@stokenNum}"> { for $match in $allMatches/match where every $stringInclude1 in $match/stringInclude, $stringInclude2 in $match/stringInclude satisfies $stringInclude1/tokenInfo/@para = $stringInclude2/tokenInfo/@para return <match> {$match/stringInclude} { for $stringExcl in $match/stringExclude where every $stringIncl in $match/stringInclude satisfies $stringIncl/tokenInfo/@para = $stringExcl/tokenInfo/@para return $stringExcl } </match> } </allMatches> return fts:normalizeAllMatches($res) }
The semantics of different paragraph
are given below.
declare function fts:ApplyFTScopeDifferentParagraph( $allMatches as element(allMatches, fts:AllMatches) ) as element(allMatches, fts:AllMatches) { let $res := <allMatches stokenNum="{$allMatches/@stokenNum}"> { for $match in $allMatches/match where every $stringInclude1 in $match/stringInclude, $stringInclude2 in $match/stringInclude satisfies $stringInclude1 is $stringInclude2 or $stringInclude1/tokenInfo/@para != $stringInclude2/tokenInfo/@para return <match> {$match/stringInclude} { for $stringExcl in $match/stringExclude where every $stringIncl in $match/stringInclude satisfies $stringIncl/tokenInfo/@para != $stringExcl/tokenInfo/@para return $stringExcl } </match> } </allMatches> return fts:normalizeAllMatches($res) }
For the scope same sentence
, the result retains only those Matches of the operand's AllMatches whose StringIncludes all occur in the same sentence; all other Matches are filtered out. StringExcludes that occur in that same sentence are also retained.
The semantics for scope type same paragraph
are analogous.
The semantics for the general case are given below.
declare function fts:ApplyFTScope( $type as fts:ScopeType, $selector as fts:ScopeSelector, $allMatches as element(allMatches, fts:AllMatches) ) as element(allMatches, fts:AllMatches) { if ($type eq "same" and $selector eq "sentence") then fts:ApplyFTScopeSameSentence($allMatches) else if ($type eq "different" and $selector eq "sentence") then fts:ApplyFTScopeDifferentSentence($allMatches) else if ($type eq "same" and $selector eq "paragraph") then fts:ApplyFTScopeSameParagraph($allMatches) else fts:ApplyFTScopeDifferentParagraph($allMatches) }
For example, consider the FTSelection ("Mustang" && "Honda") same paragraph
. The source AllMatches is given below.
The FTScope returns an empty AllMatches because neither Match contains TokenInfos from a single paragraph.
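As a non-normative illustration, the four scope cases can be sketched in Python with a single function. Each TokenInfo is modelled as a dictionary with the keys pos, sentence, and para; this representation, the helper tok, and the values used are assumptions of this sketch.

```python
def tok(pos, sentence, para):
    """Hypothetical TokenInfo: word position plus sentence/paragraph number."""
    return {"pos": pos, "sentence": sentence, "para": para}


def apply_ft_scope(all_matches, kind, unit):
    """kind is "same" or "different"; unit is "sentence" or "para"."""
    result = []
    for includes, excludes in all_matches:
        units = [t[unit] for t in includes]
        if kind == "same":
            ok = len(set(units)) <= 1
            kept = [e for e in excludes if all(e[unit] == u for u in units)]
        else:
            ok = len(set(units)) == len(units)  # pairwise different
            kept = [e for e in excludes if all(e[unit] != u for u in units)]
        if ok:
            result.append((includes, kept))
    return result


first = ([tok(1, 1, 1), tok(9, 2, 1)], [tok(4, 1, 1)])
second = ([tok(1, 1, 1), tok(40, 7, 3)], [])
same_para = apply_ft_scope([first, second], "same", "para")
diff_sentence = apply_ft_scope([first, second], "different", "sentence")
```

With these made-up values, only the first Match lies within a single paragraph, while both Matches span different sentences.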
The parameters of the ApplyFTContent
function are 1) the search context, 2) the match options, 3) the type of the content match (at the start of the current node, at the end of it, or its entire content), and 4) one AllMatches parameter corresponding to the result of the nested FTSelections. The semantics are given below.
declare function fts:ApplyFTContent( $searchContext as node(), $matchOptions as element(matchOptions, fts:FTMatchOptions), $type as fts:ContentMatchType, $allMatches as element(allMatches, fts:AllMatches) ) as element(allMatches, fts:AllMatches) { if ($type eq "entire content") then let $temp1 := fts:ApplyFTWordDistanceExactly( $matchOptions, $allMatches, 0) let $temp2 := fts:ApplyFTContent( $searchContext, $matchOptions, "at start", $temp1) let $temp3 := fts:ApplyFTContent( $searchContext, $matchOptions, "at end", $temp2) return $temp3 else let $res := <allMatches stokenNum="{$allMatches/@stokenNum}"> { for $match in $allMatches/match where if ($type eq "at start") then some $si in $match/stringInclude satisfies fts:isStartToken($searchContext, $si/tokenInfo) else (: $type eq "at end" :) some $si in $match/stringInclude satisfies fts:isEndToken($searchContext, $si/tokenInfo) return $match } </allMatches> return fts:normalizeAllMatches($res) }
The evaluation of the ApplyFTContent function depends on the type of the content match.
entire content
is evaluated as distance exactly 0 words at start at end
, i.e. all the StringIncludes must match every token in the content of the current search context node.
at start
retains only Matches that contain a StringInclude that matches the first token. This is checked using the semantic function fts:isStartToken
.
at end
retains the Matches that contain a StringInclude that matches the last token. This is checked using the semantic function fts:isEndToken
.
The parameters of the ApplyFTDistance
function are 1) the search context, 2) the list of match options, 3) one AllMatches parameter corresponding to the result of the nested FTSelections, 4) the unit of the distance (words, sentences, paragraphs), and 5) the range specified. The search context is not used by this function. The function definitions depend on the distance units and the range specifications.
The semantics of word distance exactly N
are given below.
declare function fts:ApplyFTWordDistanceExactly( $matchOptions as element(matchOptions, fts:FTMatchOptions), $allMatches as element(allMatches, fts:AllMatches), $n as xs:integer ) as element(allMatches, fts:AllMatches) { let $res := <allMatches stokenNum="{$allMatches/@stokenNum}"> { for $match in $allMatches/match let $sorted := for $si in $match/stringInclude order by $si/tokenInfo/@pos ascending return $si where every $idx in (1 to fn:count($sorted) - 1) satisfies fts:wordDistance( $sorted[$idx]/tokenInfo, $sorted[$idx+1]/tokenInfo, $matchOptions) = $n return <match> {$match/stringInclude} { for $stringExcl in $match/stringExclude where some $stringIncl in $match/stringInclude satisfies fts:wordDistance( $stringIncl/tokenInfo, $stringExcl/tokenInfo, $matchOptions) = $n return $stringExcl } </match> } </allMatches> return fts:normalizeAllMatches($res) }
The semantics of word distance at least N
are given below.
declare function fts:ApplyFTWordDistanceAtLeast( $matchOptions as element(matchOptions, fts:FTMatchOptions), $allMatches as element(allMatches, fts:AllMatches), $n as xs:integer ) as element(allMatches, fts:AllMatches) { let $res := <allMatches stokenNum="{$allMatches/@stokenNum}"> { for $match in $allMatches/match let $sorted := for $si in $match/stringInclude order by $si/tokenInfo/@pos ascending return $si where every $index in (1 to fn:count($sorted) - 1) satisfies fts:wordDistance( $sorted[$index]/tokenInfo, $sorted[$index+1]/tokenInfo, $matchOptions) >= $n return <match> {$match/stringInclude} { for $stringExcl in $match/stringExclude where some $stringIncl in $match/stringInclude satisfies fts:wordDistance( $stringIncl/tokenInfo, $stringExcl/tokenInfo, $matchOptions) >= $n return $stringExcl } </match> } </allMatches> return fts:normalizeAllMatches($res) }
The semantics of word distance at most N
are given below.
declare function fts:ApplyFTWordDistanceAtMost( $matchOptions as element(matchOptions, fts:FTMatchOptions), $allMatches as element(allMatches, fts:AllMatches), $n as xs:integer ) as element(allMatches, fts:AllMatches) { let $res := <allMatches stokenNum="{$allMatches/@stokenNum}"> { for $match in $allMatches/match let $sorted := for $si in $match/stringInclude order by $si/tokenInfo/@pos ascending return $si where every $index in (1 to fn:count($sorted) - 1) satisfies fts:wordDistance( $sorted[$index]/tokenInfo, $sorted[$index+1]/tokenInfo, $matchOptions) <= $n return <match> {$match/stringInclude} { for $stringExcl in $match/stringExclude where some $stringIncl in $match/stringInclude satisfies fts:wordDistance( $stringIncl/tokenInfo, $stringExcl/tokenInfo, $matchOptions) <= $n return $stringExcl } </match> } </allMatches> return fts:normalizeAllMatches($res) }
The semantics of word distance from M to N
are given below.
declare function fts:ApplyFTWordDistanceFromTo( $matchOptions as element(matchOptions, fts:FTMatchOptions), $allMatches as element(allMatches, fts:AllMatches), $m as xs:integer, $n as xs:integer ) as element(allMatches, fts:AllMatches) { let $res := <allMatches stokenNum="{$allMatches/@stokenNum}"> { for $match in $allMatches/match let $sorted := for $si in $match/stringInclude order by $si/tokenInfo/@pos ascending return $si where every $index in (1 to fn:count($sorted) - 1) satisfies fts:wordDistance( $sorted[$index]/tokenInfo, $sorted[$index+1]/tokenInfo, $matchOptions) >= $m and fts:wordDistance( $sorted[$index]/tokenInfo, $sorted[$index+1]/tokenInfo, $matchOptions) <= $n return <match> {$match/stringInclude} { for $stringExcl in $match/stringExclude where some $stringIncl in $match/stringInclude satisfies fts:wordDistance( $stringIncl/tokenInfo, $stringExcl/tokenInfo, $matchOptions) >= $m and fts:wordDistance( $stringIncl/tokenInfo, $stringExcl/tokenInfo, $matchOptions) <= $n return $stringExcl } </match> } </allMatches> return fts:normalizeAllMatches($res) }
The semantics of sentence distance exactly N
are given below.
declare function fts:ApplyFTSentenceDistanceExactly( $matchOptions as element(matchOptions, fts:FTMatchOptions), $allMatches as element(allMatches, fts:AllMatches), $n as xs:integer ) as element(allMatches, fts:AllMatches) { let $res := <allMatches stokenNum="{$allMatches/@stokenNum}"> { for $match in $allMatches/match let $sorted := for $si in $match/stringInclude order by $si/tokenInfo/@pos ascending return $si where every $index in (1 to fn:count($sorted) - 1) satisfies fts:sentenceDistance( $sorted[$index]/tokenInfo, $sorted[$index+1]/tokenInfo, $matchOptions) = $n return <match> {$match/stringInclude} { for $stringExcl in $match/stringExclude where some $stringIncl in $match/stringInclude satisfies fts:sentenceDistance( $stringIncl/tokenInfo, $stringExcl/tokenInfo, $matchOptions) = $n return $stringExcl } </match> } </allMatches> return fts:normalizeAllMatches($res) }
The semantics of sentence distance at least N
are given below.
declare function fts:ApplyFTSentenceDistanceAtLeast( $matchOptions as element(matchOptions, fts:FTMatchOptions), $allMatches as element(allMatches, fts:AllMatches), $n as xs:integer ) as element(allMatches, fts:AllMatches) { let $res := <allMatches stokenNum="{$allMatches/@stokenNum}"> { for $match in $allMatches/match let $sorted := for $si in $match/stringInclude order by $si/tokenInfo/@pos ascending return $si where every $index in (1 to fn:count($sorted) - 1) satisfies fts:sentenceDistance( $sorted[$index]/tokenInfo, $sorted[$index+1]/tokenInfo, $matchOptions) >= $n return <match> {$match/stringInclude} { for $stringExcl in $match/stringExclude where some $stringIncl in $match/stringInclude satisfies fts:sentenceDistance( $stringIncl/tokenInfo, $stringExcl/tokenInfo, $matchOptions) >= $n return $stringExcl } </match> } </allMatches> return fts:normalizeAllMatches($res) }
The semantics of sentence distance at most N
are given below.
declare function fts:ApplyFTSentenceDistanceAtMost( $matchOptions as element(matchOptions, fts:FTMatchOptions), $allMatches as element(allMatches, fts:AllMatches), $n as xs:integer ) as element(allMatches, fts:AllMatches) { let $res := <allMatches stokenNum="{$allMatches/@stokenNum}"> { for $match in $allMatches/match let $sorted := for $si in $match/stringInclude order by $si/tokenInfo/@pos ascending return $si where every $index in (1 to fn:count($sorted) - 1) satisfies fts:sentenceDistance( $sorted[$index]/tokenInfo, $sorted[$index+1]/tokenInfo, $matchOptions) <= $n return <match> {$match/stringInclude} { for $stringExcl in $match/stringExclude where some $stringIncl in $match/stringInclude satisfies fts:sentenceDistance( $stringIncl/tokenInfo, $stringExcl/tokenInfo, $matchOptions) <= $n return $stringExcl } </match> } </allMatches> return fts:normalizeAllMatches($res) }
The semantics of sentence distance from M to N
are given below.
declare function fts:ApplyFTSentenceDistanceFromTo( $matchOptions as element(matchOptions, fts:FTMatchOptions), $allMatches as element(allMatches, fts:AllMatches), $m as xs:integer, $n as xs:integer ) as element(allMatches, fts:AllMatches) { let $res := <allMatches stokenNum="{$allMatches/@stokenNum}"> { for $match in $allMatches/match let $sorted := for $si in $match/stringInclude order by $si/tokenInfo/@pos ascending return $si where every $index in (1 to fn:count($sorted) - 1) satisfies fts:sentenceDistance( $sorted[$index]/tokenInfo, $sorted[$index+1]/tokenInfo, $matchOptions) >= $m and fts:sentenceDistance( $sorted[$index]/tokenInfo, $sorted[$index+1]/tokenInfo, $matchOptions) <= $n return <match> {$match/stringInclude} { for $stringExcl in $match/stringExclude where some $stringIncl in $match/stringInclude satisfies fts:sentenceDistance( $stringIncl/tokenInfo, $stringExcl/tokenInfo, $matchOptions) >= $m and fts:sentenceDistance( $stringIncl/tokenInfo, $stringExcl/tokenInfo, $matchOptions) <= $n return $stringExcl } </match> } </allMatches> return fts:normalizeAllMatches($res) }
The semantics of paragraph distance exactly N
are given below.
declare function fts:ApplyFTParagraphDistanceExactly( $matchOptions as element(matchOptions, fts:FTMatchOptions), $allMatches as element(allMatches, fts:AllMatches), $n as xs:integer ) as element(allMatches, fts:AllMatches) { let $res := <allMatches stokenNum="{$allMatches/@stokenNum}"> { for $match in $allMatches/match let $sorted := for $si in $match/stringInclude order by $si/tokenInfo/@pos ascending return $si where every $index in (1 to fn:count($sorted) - 1) satisfies fts:paraDistance( $sorted[$index]/tokenInfo, $sorted[$index+1]/tokenInfo, $matchOptions) = $n return <match> {$match/stringInclude} { for $stringExcl in $match/stringExclude where some $stringIncl in $match/stringInclude satisfies fts:paraDistance( $stringIncl/tokenInfo, $stringExcl/tokenInfo, $matchOptions) = $n return $stringExcl } </match> } </allMatches> return fts:normalizeAllMatches($res) }
The semantics of paragraph distance at least N
are given below.
declare function fts:ApplyFTParagraphDistanceAtLeast( $matchOptions as element(matchOptions, fts:FTMatchOptions), $allMatches as element(allMatches, fts:AllMatches), $n as xs:integer ) as element(allMatches, fts:AllMatches) { let $res := <allMatches stokenNum="{$allMatches/@stokenNum}"> { for $match in $allMatches/match let $sorted := for $si in $match/stringInclude order by $si/tokenInfo/@pos ascending return $si where every $index in (1 to fn:count($sorted) - 1) satisfies fts:paraDistance( $sorted[$index]/tokenInfo, $sorted[$index+1]/tokenInfo, $matchOptions) >= $n return <match> {$match/stringInclude} { for $stringExcl in $match/stringExclude where some $stringIncl in $match/stringInclude satisfies fts:paraDistance( $stringIncl/tokenInfo, $stringExcl/tokenInfo, $matchOptions) >= $n return $stringExcl } </match> } </allMatches> return fts:normalizeAllMatches($res) }
The semantics of paragraph distance at most N
are given below.
declare function fts:ApplyFTParagraphDistanceAtMost( $matchOptions as element(matchOptions, fts:FTMatchOptions), $allMatches as element(allMatches, fts:AllMatches), $n as xs:integer ) as element(allMatches, fts:AllMatches) { let $res := <allMatches stokenNum="{$allMatches/@stokenNum}"> { for $match in $allMatches/match let $sorted := for $si in $match/stringInclude order by $si/tokenInfo/@pos ascending return $si where every $index in (1 to fn:count($sorted) - 1) satisfies fts:paraDistance( $sorted[$index]/tokenInfo, $sorted[$index+1]/tokenInfo, $matchOptions) <= $n return <match> {$match/stringInclude} { for $stringExcl in $match/stringExclude where some $stringIncl in $match/stringInclude satisfies fts:paraDistance( $stringIncl/tokenInfo, $stringExcl/tokenInfo, $matchOptions) <= $n return $stringExcl } </match> } </allMatches> return fts:normalizeAllMatches($res) }
The semantics of paragraph distance from M to N
are given below.
declare function fts:ApplyFTParagraphDistanceFromTo( $matchOptions as element(matchOptions, fts:FTMatchOptions), $allMatches as element(allMatches, fts:AllMatches), $m as xs:integer, $n as xs:integer ) as element(allMatches, fts:AllMatches) { let $res := <allMatches stokenNum="{$allMatches/@stokenNum}"> { for $match in $allMatches/match let $sorted := for $si in $match/stringInclude order by $si/tokenInfo/@pos ascending return $si where every $index in (1 to fn:count($sorted) - 1) satisfies fts:paraDistance( $sorted[$index]/tokenInfo, $sorted[$index+1]/tokenInfo, $matchOptions) >= $m and fts:paraDistance( $sorted[$index]/tokenInfo, $sorted[$index+1]/tokenInfo, $matchOptions) <= $n return <match> {$match/stringInclude} { for $stringExcl in $match/stringExclude where some $stringIncl in $match/stringInclude satisfies fts:paraDistance( $stringIncl/tokenInfo, $stringExcl/tokenInfo, $matchOptions) >= $m and fts:paraDistance( $stringIncl/tokenInfo, $stringExcl/tokenInfo, $matchOptions) <= $n return $stringExcl } </match> } </allMatches> return fts:normalizeAllMatches($res) }
The resulting AllMatches contains Matches of the operand that satisfy the condition that the distance (measured in words, sentences, or paragraphs) for every pair of consecutive positions in StringIncludes is within the specified interval. [Definition: Consecutive Positions in a Match are two positions from the same Match with no intervening StringIncludes.]
In the general case, the semantics are given below.
declare function fts:ApplyFTDistance( $matchOptions as element(matchOptions, fts:FTMatchOptions), $type as fts:DistanceType, $range as element(range, fts:FTRangeSpec), $allMatches as element(allMatches, fts:AllMatches) ) as element(allMatches, fts:AllMatches) { if ($type eq "word") then if ($range/@type eq "exactly") then fts:ApplyFTWordDistanceExactly($matchOptions, $allMatches, $range/@n) else if ($range/@type eq "at least") then fts:ApplyFTWordDistanceAtLeast($matchOptions, $allMatches, $range/@n) else if ($range/@type eq "at most") then fts:ApplyFTWordDistanceAtMost($matchOptions, $allMatches, $range/@n) else fts:ApplyFTWordDistanceFromTo($matchOptions, $allMatches, $range/@m, $range/@n) else if ($type eq "sentence") then if ($range/@type eq "exactly") then fts:ApplyFTSentenceDistanceExactly($matchOptions, $allMatches, $range/@n) else if ($range/@type eq "at least") then fts:ApplyFTSentenceDistanceAtLeast($matchOptions, $allMatches, $range/@n) else if ($range/@type eq "at most") then fts:ApplyFTSentenceDistanceAtMost($matchOptions, $allMatches, $range/@n) else fts:ApplyFTSentenceDistanceFromTo($matchOptions, $allMatches, $range/@m, $range/@n) else if ($range/@type eq "exactly") then fts:ApplyFTParagraphDistanceExactly($matchOptions, $allMatches, $range/@n) else if ($range/@type eq "at least") then fts:ApplyFTParagraphDistanceAtLeast($matchOptions, $allMatches, $range/@n) else if ($range/@type eq "at most") then fts:ApplyFTParagraphDistanceAtMost($matchOptions, $allMatches, $range/@n) else fts:ApplyFTParagraphDistanceFromTo($matchOptions, $allMatches, $range/@m, $range/@n) }
For example, consider the FTDistance selection ("Ford Mustang" && "excellent") word distance at most 3
. The Matches of the source AllMatches for ("Ford Mustang" && "excellent")
are given below.
The result for the FTDistance selection consists of only the first Match because the distances between its consecutive TokenInfos (distance 1 and distance 3) are less than or equal to 3.
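As a non-normative illustration, the word distance filter can be sketched in Python. The sketch makes several assumptions: a Match is a pair of lists of word positions, match options are ignored, and the hypothetical word_distance counts the words lying strictly between two positions (so that adjacent words are at distance zero), which stands in for fts:wordDistance. The positions are made up.

```python
def word_distance(pos1, pos2):
    """Assumed distance: number of words strictly between two positions."""
    return abs(pos2 - pos1) - 1


def apply_ft_word_distance(all_matches, low, high):
    """Keep Matches whose consecutive included positions are all at a
    distance between low and high (inclusive)."""
    result = []
    for includes, excludes in all_matches:
        ordered = sorted(includes)
        gaps = [word_distance(a, b) for a, b in zip(ordered, ordered[1:])]
        if all(low <= gap <= high for gap in gaps):
            kept = [
                e for e in excludes
                if any(low <= word_distance(i, e) <= high for i in includes)
            ]
            result.append((includes, kept))
    return result


matches = [
    ([1, 2, 5], [3, 40]),  # "Ford Mustang" at 1-2, "excellent" at 5
    ([27, 28, 5], []),     # "Ford Mustang" at 27-28, "excellent" at 5
]
at_most_3 = apply_ft_word_distance(matches, 0, 3)
```

Passing float("inf") as high would express "at least N", and equal bounds would express "exactly N".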
The parameters of the ApplyFTWindow
function are 1) the search context, 2) the list of match options, 3) the unit of type fts:DistanceType
, 4) a size, and 5) one AllMatches parameter corresponding to the result of the nested FTSelections. The search context is not used by this function. For each unit type a function is defined as follows.
The semantics of window N words
is given below.
declare function fts:ApplyFTWordWindow( $matchOptions as element(matchOptions, fts:FTMatchOptions), $allMatches as element(allMatches, fts:AllMatches), $n as xs:integer ) as element(allMatches, fts:AllMatches) { let $res := <allMatches stokenNum="{$allMatches/@stokenNum}"> { for $match in $allMatches/match let $minpos := fn:min($match/stringInclude/tokenInfo/@pos), $maxpos := fn:max($match/stringInclude/tokenInfo/@pos) for $windowStartPos in ($maxpos - $n + 1 to $minpos) let $windowEndPos := $windowStartPos + $n - 1 where fn:min($match/stringInclude/tokenInfo/@pos) >= $windowStartPos and fn:max($match/stringInclude/tokenInfo/@pos) <= $windowEndPos return <match> {$match/stringInclude} { for $stringExclude in $match/stringExclude where $stringExclude/tokenInfo/@pos >= $windowStartPos and $stringExclude/tokenInfo/@pos <= $windowEndPos return $stringExclude } </match> } </allMatches> return fts:normalizeAllMatches($res) }
The semantics of window N sentences
is given below.
declare function fts:ApplyFTSentenceWindow( $matchOptions as element(matchOptions, fts:FTMatchOptions), $allMatches as element(allMatches, fts:AllMatches), $n as xs:integer ) as element(allMatches, fts:AllMatches) { let $res := <allMatches stokenNum="{$allMatches/@stokenNum}"> { for $match in $allMatches/match let $minpos := fn:min($match/stringInclude/tokenInfo/@sentence), $maxpos := fn:max($match/stringInclude/tokenInfo/@sentence) for $windowStartPos in ($maxpos - $n + 1 to $minpos) let $windowEndPos := $windowStartPos + $n - 1 where fn:min($match/stringInclude/tokenInfo/@sentence) >= $windowStartPos and fn:max($match/stringInclude/tokenInfo/@sentence) <= $windowEndPos return <match> {$match/stringInclude} { for $stringExclude in $match/stringExclude where $stringExclude/tokenInfo/@sentence >= $windowStartPos and $stringExclude/tokenInfo/@sentence <= $windowEndPos return $stringExclude } </match> } </allMatches> return fts:normalizeAllMatches($res) }
The semantics of window N paragraphs
is given below.
declare function fts:ApplyFTParagraphWindow( $matchOptions as element(matchOptions, fts:FTMatchOptions), $allMatches as element(allMatches, fts:AllMatches), $n as xs:integer ) as element(allMatches, fts:AllMatches) { let $res := <allMatches stokenNum="{$allMatches/@stokenNum}"> { for $match in $allMatches/match let $minpos := fn:min($match/stringInclude/tokenInfo/@para), $maxpos := fn:max($match/stringInclude/tokenInfo/@para) for $windowStartPos in ($maxpos - $n + 1 to $minpos) let $windowEndPos := $windowStartPos + $n - 1 where fn:min($match/stringInclude/tokenInfo/@para) >= $windowStartPos and fn:max($match/stringInclude/tokenInfo/@para) <= $windowEndPos return <match> {$match/stringInclude} { for $stringExclude in $match/stringExclude where $stringExclude/tokenInfo/@para >= $windowStartPos and $stringExclude/tokenInfo/@para <= $windowEndPos return $stringExclude } </match> } </allMatches> return fts:normalizeAllMatches($res) }
The resulting AllMatches contains Matches of the operand that satisfy the condition that there exists a sequence of the specified number of consecutive (word, sentence, or paragraph) positions, such that all StringIncludes are within that window, and the StringExcludes retained are also within that window.
The semantics for the general function are given below.
declare function fts:ApplyFTWindow( $matchOptions as element(matchOptions, fts:FTMatchOptions), $type as fts:DistanceType, $size as xs:integer, $allMatches as element(allMatches, fts:AllMatches) ) as element(allMatches, fts:AllMatches) { if ($type eq "word") then fts:ApplyFTWordWindow($matchOptions, $allMatches, $size) else if ($type eq "sentence") then fts:ApplyFTSentenceWindow($matchOptions, $allMatches, $size) else fts:ApplyFTParagraphWindow($matchOptions, $allMatches, $size) }
For example, consider the FTWindow selection ("Ford Mustang" && "excellent") window 10 words
. The Matches of the source AllMatches for ("Ford Mustang" && "excellent")
are given below.
The result for the FTWindow selection consists of only the first, the fifth, and the sixth Matches because their respective window sizes are 5, 4, and 9.
The parameters of the ApplyFTTimes function are 1) the search context, 2) the list of match options, 3) a range specification, and 4) one AllMatches parameter corresponding to the result of the nested FTSelection. The search context and the match options stack are not used by this function.
The function definitions depend on the range specification FTRange to limit the number of occurrences.
The general semantics are given below.
declare function fts:FormCombinations($sms, $times) {
  if (fn:count($sms) lt $times) then ()
  else if (fn:count($sms) eq $times) then <match> {$sms/*} </match>
  else (
    fts:FormCombinations(fn:subsequence($sms, 2), $times),
    <match>
      {$sms[1]/*}
      {fts:FormCombinations(fn:subsequence($sms, 2), $times - 1)/*}
    </match>
  )
};

declare function fts:FormRange($sms, $l, $u, $stokenNum) {
  if ($l > $u) then ()
  else
    let $am1 := fts:normalizeAllMatches(
      <allMatches stokenNum="{$stokenNum}"> {fts:FormCombinations($sms, $l)} </allMatches>)
    let $am2 := fts:normalizeAllMatches(
      <allMatches> {fts:FormCombinations($sms, $u + 1)} </allMatches>)
    return fts:ApplyFTAnd($am1, fts:ApplyFTUnaryNot($am2))
};
The semantics of exactly N occurrences are given below.
declare function fts:ApplyFTTimesExactly( $allMatches as element(allMatches, fts:AllMatches), $n as xs:integer ) as element(allMatches, fts:AllMatches) { fts:FormRange($allMatches/match, $n, $n, $allMatches/@stokenNum) }
The semantics of at least N occurrences are given below.
declare function fts:ApplyFTTimesAtLeast(
    $allMatches as element(allMatches, fts:AllMatches),
    $n as xs:integer
) as element(allMatches, fts:AllMatches) {
  let $res :=
    <allMatches stokenNum="{$allMatches/@stokenNum}">
      {fts:FormCombinations($allMatches/match, $n)}
    </allMatches>
  return fts:normalizeAllMatches($res)
}
The semantics of at most N occurrences are given below.
declare function fts:ApplyFTTimesAtMost(
    $allMatches as element(allMatches, fts:AllMatches),
    $n as xs:integer
) as element(allMatches, fts:AllMatches) {
  fts:FormRange($allMatches/match, 0, $n, $allMatches/@stokenNum)
}
The semantics of from M to N occurrences are given below.
declare function fts:ApplyFTTimesFromTo(
    $allMatches as element(allMatches, fts:AllMatches),
    $m as xs:integer,
    $n as xs:integer
) as element(allMatches, fts:AllMatches) {
  fts:FormRange($allMatches/match, $m, $n, $allMatches/@stokenNum)
}
The way to ensure that there are at least N different matches of an FTSelection is to ensure that at least N of its Matches occur simultaneously. This is similar to forming their conjunction by combining N distinct Matches into one simple match. Therefore, the AllMatches for the selection condition specifying the range qualifier at least N contains the possible combinations of N simple matches of the operand and one Match for each combination negating the rest of the simple matches. This operation is performed in the function fts:FormCombinations.
The range [l, u] is represented by the condition at least l and not at least u+1. This transformation is performed in the function fts:FormRange.
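The two ideas above (combining N simple matches, and expressing a range as "at least l and not at least u+1") can be sketched in Python. This is a non-normative illustration under stated assumptions: each simple match is modeled as a hypothetical `frozenset` of token positions that actually occur in the document, and the helper names `at_least` and `in_range` are invented for this sketch.

```python
from itertools import combinations


def at_least(matches, n):
    """Merge every combination of n simple matches into one combined match."""
    return [frozenset().union(*combo) for combo in combinations(matches, n)]


def in_range(matches, l, u):
    """True when the number of occurrences lies in [l, u].

    Expressed as in the text: 'at least l' holds and 'at least u + 1' does not.
    """
    return bool(at_least(matches, l)) and not at_least(matches, u + 1)


if __name__ == "__main__":
    # Three occurrences of a search token, at positions 1, 7 and 12.
    occurrences = [frozenset({1}), frozenset({7}), frozenset({12})]
    for combined in at_least(occurrences, 2):
        print(sorted(combined))
    print(in_range(occurrences, 2, 3))
```

With three occurrences, "at least 2" yields the three possible pairs, which corresponds to the kind of result described for the "Mustang" at least 2 occurrences example.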
The semantics for the general case are given below.
declare function fts:ApplyFTTimes(
    $range as element(range, fts:FTRangeSpec),
    $allMatches as element(allMatches, fts:AllMatches)
) as element(allMatches, fts:AllMatches) {
  if ($range/@type eq "exactly") then
    fts:ApplyFTTimesExactly($allMatches, $range/@n)
  else if ($range/@type eq "at least") then
    fts:ApplyFTTimesAtLeast($allMatches, $range/@n)
  else if ($range/@type eq "at most") then
    fts:ApplyFTTimesAtMost($allMatches, $range/@n)
  else
    fts:ApplyFTTimesFromTo($allMatches, $range/@m, $range/@n)
}
For example, consider the FTTimes selection "Mustang" at least 2 occurrences. The source AllMatches of the FTWords selection "Mustang" is given below.
The result consists of the pairs of the Matches.
XQuery 1.0 functions are used to define the semantics of FTMatchOptions. These functions operate on an XML representation of the FTMatchOptions. The representation closely follows the syntax. Each FTMatchOption is represented by an XML element. Additional characteristics of the match option are represented as attributes. The schema is given below.
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified" attributeFormDefault="unqualified"> <xs:complexType name="FTMatchOptions"> <xs:sequence> <xs:element name="matchOption" type="fts:FTMatchOption"/> </xs:sequence> </xs:complexType> <xs:complexType name="FTMatchOption"> <xs:choice> <xs:element name="case" type="fts:FTCaseOption" /> <xs:element name="diacritics" type="fts:FTDiacriticsOption" /> <xs:element name="thesaurus" type="fts:FTThesaurusOption" /> <xs:element name="stem" type="fts:FTStemOption" /> <xs:element name="wildcard" type="fts:FTWildCardOption" /> <xs:element name="language" type="fts:FTLanguageOption" /> <xs:element name="stopWord" type="fts:FTStopwordOption" /> </xs:choice> </xs:complexType> <xs:complexType name="FTCaseOption"> <xs:attribute name="caseIndicator"> <xs:simpleType> <xs:restriction base="xs:string"> <xs:enumeration value="insensitive"/> <xs:enumeration value="sensitive"/> <xs:enumeration value="lowercase"/> <xs:enumeration value="uppercase"/> </xs:restriction> </xs:simpleType> </xs:attribute> <xs:attribute name="caseLanguage" type="xs:string"/> </xs:complexType> <xs:complexType name="FTDiacriticsOption"> <xs:attribute name="diacriticsIndicator"> <xs:simpleType> <xs:restriction base="xs:string"> <xs:enumeration value="insensitive"/> <xs:enumeration value="sensitive"/> <xs:enumeration value="with"/> <xs:enumeration value="without"/> </xs:restriction> </xs:simpleType> </xs:attribute> <xs:attribute name="language" type="xs:string"/> </xs:complexType> <xs:complexType name="FTThesaurusOption"> <xs:sequence> <xs:element name="thesaurusName" type="xs:string" minOccurs="0" maxOccurs="1"/> <xs:element name="relationship" type="xs:string" minOccurs="0" maxOccurs="1"/> <xs:element name="range" type="fts:FTRangeSpec" minOccurs="0" maxOccurs="1"/> </xs:sequence> <xs:attribute name="thesaurusIndicator"> <xs:simpleType> <xs:restriction base="xs:string"> <xs:enumeration value="with"/> <xs:enumeration value="without"/> 
</xs:restriction> </xs:simpleType> </xs:attribute> <xs:attribute name="language" type="xs:string"/> </xs:complexType> <xs:complexType name="FTStemOption"> <xs:attribute name="stemIndicator"> <xs:simpleType> <xs:restriction base="xs:string"> <xs:enumeration value="with"/> <xs:enumeration value="without"/> </xs:restriction> </xs:simpleType> </xs:attribute> <xs:attribute name="language" type="xs:string"/> </xs:complexType> <xs:complexType name="FTWildCardOption"> <xs:attribute name="wildcardIndicator"> <xs:simpleType> <xs:restriction base="xs:string"> <xs:enumeration value="with"/> <xs:enumeration value="without"/> </xs:restriction> </xs:simpleType> </xs:attribute> <xs:attribute name="language" type="xs:string"/> </xs:complexType> <xs:complexType name="FTLanguageOption"> <xs:attribute name="languageName" type="xs:string"/> </xs:complexType> <xs:complexType name="FTStopwordOption"> <xs:sequence> <xs:choice> <xs:element name="default-stopwords"> <xs:complexType /> </xs:element> <xs:element name="stop-word" type="xs:string" /> <xs:element name="uri" type="xs:anyURI" /> </xs:choice> <xs:element name="oper" minOccurs="0" maxOccurs="unbounded"> <xs:choice> <xs:element name="stop-word" type="xs:string" /> <xs:element name="uri" type="xs:anyURI" /> </xs:choice> <xs:attribute name="type"> <xs:simpleType> <xs:restriction base="xs:string"> <xs:enumeration value="union"/> <xs:enumeration value="except"/> </xs:restriction> </xs:simpleType> </xs:attribute> </xs:element> </xs:sequence> </xs:complexType> </xs:schema>
An additional schema supports the explicit representation of the concept of a phrase. We need this representation to support thesauri lookups. Each lookup produces a sequence of phrases. Each phrase is one possible alternative for the search string.
The schema for phrases is given below.
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified" attributeFormDefault="unqualified"> <xs:complexType name="TokenPhrase"> <xs:sequence> <xs:element name="token" type="xs:string" minOccurs="1" maxOccurs="unbounded"/> </xs:sequence> </xs:complexType> </xs:schema>
The previous section described FTSelections as if no FTMatchOptions were present in the language. In this section, the semantics are extended to support FTMatchOptions.
The extension is achieved by modifying the existing functions of FTSelections and adding functions that are specific to the FTMatchOptions.
Modifications in the semantics of existing functions
The semantics of most of the FTSelections remains unmodified. The modifications are to the method for matching search tokens.
1. The modifications are to the semantics of FTWords because it is most influenced by the FTMatchOptions. Under the extended semantics, search tokens are modified depending on the FTMatchOptions. For example, in the presence of FTThesaurusOption search tokens may be expanded by related tokens from a thesaurus lookup.
declare function fts:applySingleSearchToken(
    $searchContext as node(),
    $matchOptions as fts:FTMatchOptions,
    $searchToken as fts:TokenInfo,
    $queryPos as xs:integer
) as element(allMatches, fts:AllMatches) {
  let $withDiacriticsOption :=
    $matchOptions[(fn:local-name(.) eq "diacritics") and (./@type eq "with")][1]
  return
    if ($withDiacriticsOption) then
      let $newOption1 := <diacritics type="insensitive" />
      let $newOption2 := <diacritics type="without" />
      let $lhs := fts:applySingleSearchToken(
        $searchContext, ($newOption1, $matchOptions), $searchToken, $queryPos)
      let $rhs := fts:applySingleSearchToken(
        $searchContext, ($newOption2, $matchOptions), $searchToken, $queryPos)
      return fts:ApplyMildNot($lhs, $rhs)
    else
      let $thesaurusOption :=
        $matchOptions[(fn:local-name(.) eq "thesaurus") and (./@type eq "with")][1]
      return
        if ($thesaurusOption) then
          let $noThesaurusOptions :=
            (<thesaurus thesaurusIndicator="without" />, $matchOptions)
          let $lookupRes := fts:applyThesaurusOption($thesaurusOption, $searchToken)
          return fts:ApplyPhraseAlternatives(
            $searchContext, $noThesaurusOptions, $lookupRes, $queryPos)
        else
          let $res :=
            <allMatches stokenNum="{$queryPos}">
            {
              let $searchTokens :=
                if ($matchOptions//wildcard) then
                  fts:applyWildCardOption($searchContext, $matchOptions, $searchToken)
                else $searchToken
              let $effectiveOptions := $matchOptions except $matchOptions[self::wildcard]
              let $token_pos := fts:matchStr($searchContext, $effectiveOptions, $searchTokens)
              for $pos in $token_pos
              return
                <match>
                  <stringInclude queryPos="{$queryPos}" queryString="{$searchToken/@word}">
                    <tokenInfo>{$pos}</tokenInfo>
                  </stringInclude>
                </match>
            }
            </allMatches>
          return fts:normalizeAllMatches($res)
};

declare function fts:ApplyPhraseAlternatives(
    $searchContext as node(),
    $matchOptions as fts:FTMatchOptions,
    $searchPhrases as fts:TokenPhrase*,
    $queryPos as xs:integer
) as element(allMatches, fts:AllMatches) {
  if (fn:count($searchPhrases) eq 0) then <allMatches stokenNum="0" />
  else
    let $firstSearchPhrase := $searchPhrases[1]
    let $restSearchPhrases := fn:subsequence($searchPhrases, 2)
    let $firstAllMatches := fts:ApplyFTWordsPhrase(
      $searchContext, $matchOptions, $firstSearchPhrase/token, $queryPos)
    let $newQueryPos := fn:max($firstAllMatches//@queryPos) + 1
    let $restAllMatches := fts:ApplyPhraseAlternatives(
      $searchContext, $matchOptions, $restSearchPhrases, $newQueryPos)
    return fts:ApplyFTOr($firstAllMatches, $restAllMatches)
};
There are several major modifications to the semantics of the single-token search with the addition of match option processing.
Three FTMatchOptions are processed differently than the rest of the FTMatchOptions.
The FTDiacriticsOption with type with diacritics is processed as though the query is $searchToken/@word diacritics insensitive not in $searchToken/@word without diacritics. The desired Matches contain any version of the search token except the ones without any diacritics.
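The following non-normative Python sketch illustrates this rewriting on plain strings. It assumes, for illustration only, that diacritics-insensitive matching compares tokens after removing Unicode combining marks and that the "without diacritics" form of the search token is matched exactly; the function names `strip_diacritics` and `match_with_diacritics` are hypothetical and are not part of this specification.

```python
import unicodedata


def strip_diacritics(word):
    """Remove combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def match_with_diacritics(search_token, doc_tokens):
    """Positions matched insensitively, minus positions matched without diacritics."""
    base = strip_diacritics(search_token)
    insensitive = [i for i, t in enumerate(doc_tokens) if strip_diacritics(t) == base]
    without = {i for i, t in enumerate(doc_tokens) if t == base}
    return [i for i in insensitive if i not in without]


if __name__ == "__main__":
    tokens = ["resume", "r\u00e9sum\u00e9", "resum\u00e9", "car"]
    print(match_with_diacritics("r\u00e9sum\u00e9", tokens))
```

The unaccented token at position 0 matches insensitively but is then subtracted, leaving only the tokens that carry at least one diacritic.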
The semantics of the FTThesaurusOption cannot be represented simply in terms of search token expansion. Since the result of a thesaurus lookup is a sequence of alternatives, there must be a higher level of processing. The alternatives are connected in a disjunction using the fts:ApplyPhraseAlternatives function. This function is almost identical to fts:ApplyFTWordsAny but takes into consideration the specific representation of the search tokens in $searchPhrases. The matching of the alternatives is performed with FTThesaurusOption turned off to avoid double expansions, i.e., expansion of an already expanded token.
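As a rough, non-normative illustration of connecting thesaurus alternatives in a disjunction, the Python sketch below matches each alternative phrase independently and merges the hits. The data model (token lists, phrases as lists of words) and the names `find_phrase` and `match_alternatives` are assumptions made for this sketch; the thesaurus lookup itself is represented by a hand-written list of alternatives.

```python
def find_phrase(doc_tokens, phrase):
    """Start positions where the phrase occurs as consecutive tokens."""
    k = len(phrase)
    if k == 0:
        return []
    return [i for i in range(len(doc_tokens) - k + 1)
            if doc_tokens[i:i + k] == phrase]


def match_alternatives(doc_tokens, alternatives):
    """Union of the hits of all alternatives; no alternative is expanded again."""
    hits = []
    for phrase in alternatives:
        for start in find_phrase(doc_tokens, phrase):
            hits.append((start, tuple(phrase)))
    return sorted(hits)


if __name__ == "__main__":
    doc = "the car is an excellent motor vehicle".split()
    # Hypothetical result of a thesaurus lookup for "car".
    alternatives = [["car"], ["motor", "vehicle"], ["automobile"]]
    print(match_alternatives(doc, alternatives))
```

Each alternative is matched as given, which mirrors the idea of turning the thesaurus option off while matching the alternatives.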
FTWildCardOption commutes with all other options and therefore it is possible to ignore its position within the FTMatchOptions stack.
The remaining FTMatchOptions are processed in the fts:matchStr function.
The semantics are given below.
declare function fts:matchStr(
    $searchContext as node(),
    $matchOptions as fts:FTMatchOptions,
    $searchToken as fts:TokenInfo
) as element(tokenInfo, fts:TokenInfo)* {
  let $nonexpOptions := $matchOptions[self::language or self::ignore]
  let $expOptions := $matchOptions except $nonexpOptions
  let $searchTokens := fts:applyMatchOptions($expOptions, $searchToken)
  return getTokenInfo($searchContext, $nonexpOptions, $searchTokens)
}
The fts:matchStr function rewrites search tokens based on applied FTMatchOptions to obtain the resulting sequence of TokenInfos that match the search tokens. FTThesaurusOptions are removed to avoid repeated thesaurus expansion if the FTThesaurusOptions have already been applied to the search tokens.
Other FTMatchOptions transform the search tokens using the fts:applyMatchOptions function. Its structure is similar to the structure of the fts:evaluate function. It inspects the supplied FTMatchOptions and applies them using FTMatchOption functions, much like the FTSelection functions. These will be discussed later.
One last modification to search token matching is to treat phrases as search tokens when they are being processed against a thesaurus. This allows phrases as well as words to be modified using a thesaurus. The semantics to support multi-token thesaurus lookups are given below.
declare function fts:ApplyFTWordsPhrase(
    $searchContext as node(),
    $matchOptions as fts:FTMatchOptions,
    $searchStrings as xs:string*,
    $queryPos as xs:integer
) as element(allMatches, fts:AllMatches) {
  let $thesaurusOption := $matchOptions[fn:local-name(.) eq "thesaurus"][1]
  return
    if ($thesaurusOption and $thesaurusOption/@type eq "with") then
      let $noThesaurusOptions := $matchOptions[fn:local-name(.) ne "thesaurus"]
      let $lookupRes := fts:applyThesaurusOption($thesaurusOption, $searchStrings)
      return fts:ApplyPhraseAlternatives(
        $searchContext, $noThesaurusOptions, $lookupRes, $queryPos)
    else
      let $conj := fts:ApplyFTWordsAllWord(
        $searchContext, $matchOptions, $searchStrings, $queryPos)
      let $ordered := fts:ApplyFTOrder($conj)
      let $distance1 := fts:ApplyFTDistance(
        $matchOptions, $ordered, <fts:range type="exactly" n="0"/>)
      return $distance1
};
Before the fts:ApplyFTWordsPhrase function processes a phrase, it explicitly checks for the presence of an FTThesaurusOption. If the phrase is to be processed by an FTThesaurusOption, it is processed as in fts:applySingleSearchToken.
Semantics of new FTMatchOptions functions
The extension of FTSelections also adds functions that are specific to the FTMatchOptions.
declare function fts:applyMatchOptions(
    $matchOptions as fts:FTMatchOption*,
    $searchTokens as fts:TokenInfo*
) as element(tokenInfo, fts:TokenInfo)* {
  if ($matchOptions) then
    let $firstOption := $matchOptions[1]
    let $firstOptionType := fn:local-name($firstOption)
    let $restOptions := $matchOptions[fn:local-name(.) ne $firstOptionType]
    let $applyFirst := fts:applyMatchOption($firstOption, $searchTokens)
    return fts:applyMatchOptions($restOptions, $applyFirst)
  else $searchTokens
};

declare function fts:applyMatchOption(
    $matchOption as fts:FTMatchOption,
    $searchTokens as fts:TokenInfo*
) as element(tokenInfo, fts:TokenInfo)* {
  if (fn:local-name($matchOption) eq "stopWord") then
    fts:applyStopwordOption($matchOption, $searchTokens)
  else if (fn:local-name($matchOption) eq "case") then
    fts:applyCaseOption($matchOption, $searchTokens)
  else if (fn:local-name($matchOption) eq "diacritics") then
    fts:applyDiacriticsOption($matchOption, $searchTokens)
  else if (fn:local-name($matchOption) eq "stem") then
    fts:applyStemOption($matchOption, $searchTokens)
  else $searchTokens
};
The fts:applyMatchOptions function expands search tokens by the consecutive application of the specified FTMatchOptions. Once an FTMatchOption of a particular type has been applied, other options of the same type are ignored since the former overrides them. The application of FTMatchOptions is performed by the dispatcher function fts:applyMatchOption, which invokes the respective function implementing the semantics of the match option.
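The override rule and the dispatcher can be pictured with the following non-normative Python sketch. It assumes a simplified model in which options are hypothetical `(kind, setting)` pairs, tokens are plain strings, and only a case handler is registered; none of these names come from this specification.

```python
def apply_case(setting, tokens):
    """Toy case handler operating on plain strings."""
    if setting == "lowercase":
        return [t.lower() for t in tokens]
    if setting == "uppercase":
        return [t.upper() for t in tokens]
    return list(tokens)


# Dispatcher table: option kind -> function implementing that option.
HANDLERS = {"case": apply_case}


def apply_match_options(options, tokens, handlers=HANDLERS):
    """Apply options in order; the first option of each kind wins."""
    applied = set()
    for kind, setting in options:
        if kind in applied:
            continue  # a later option of an already applied kind is ignored
        applied.add(kind)
        handler = handlers.get(kind)
        if handler is not None:
            tokens = handler(setting, tokens)
    return list(tokens)


if __name__ == "__main__":
    opts = [("case", "lowercase"), ("case", "uppercase")]
    print(apply_match_options(opts, ["Ford", "Mustang"]))
```

Because the first `case` option is applied and the second is skipped, the tokens end up in lowercase.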
FTMatchOption functions which are necessary to support match option processing are given below.
function fts:lowerCase($token as fts:TokenInfo, $caseLanguage as xs:string) as fts:TokenInfo
function fts:upperCase($token as fts:TokenInfo, $caseLanguage as xs:string) as fts:TokenInfo
function fts:insensitiveCase($token as fts:TokenInfo, $caseLanguage as xs:string) as fts:TokenInfo
These character case functions convert the token in a TokenInfo to lowercase, uppercase, or case-insensitive form.
function fts:removeDiacritics( $token as fts:TokenInfo, $diacriticsLanguage as xs:string) as fts:TokenInfo
function fts:insensitiveDiacritics( $token as fts:TokenInfo, $diacriticsLanguage as xs:string) as fts:TokenInfo
These diacritics functions convert the token in a TokenInfo to a form without diacritics or to a diacritics-insensitive form.
function fts:lookupThesaurus($tokens as fts:TokenInfo*, $thesaurusName as xs:string, $thesaurusLanguage as xs:string, $relationship as xs:string, $range as fts:FTRangeSpec?) as element(tokenPhrase, fts:TokenPhrase)*
The thesaurus function finds all words related to $tokens in the thesaurus $thesaurusName for the language $thesaurusLanguage using the relationship $relationship within the optional number of levels $range. If $tokens consists of more than one TokenInfo, it is regarded as a phrase.
The thesaurus function returns a sequence of expansion alternatives. Each alternative is regarded as a new search phrase and is represented as a tokenized phrase. Alternatives are treated as though they are connected with a disjunction (FTOr).
function fts:stemmedForm($word as fts:TokenInfo, $stemLanguage as xs:string) as fts:TokenInfo
The stemming function converts the token in a TokenInfo object to a form that represents its stem.
function fts:wildcardForm($word as fts:TokenInfo, $wildcardLanguage as xs:string) as fts:TokenInfo*
The wildcard function converts the token in a TokenInfo object to a sequence of forms that can be used by the tokenizer to match document tokens.
The semantics for the FTCaseOption are given below.
declare function fts:applyCaseOption(
    $matchOption as fts:FTMatchOption,
    $searchTokens as fts:TokenInfo*
) as fts:TokenInfo* {
  (: an empty token sequence ends the recursion :)
  if (fn:empty($searchTokens)) then ()
  else
    let $searchToken := $searchTokens[1]
    let $nextTokens := $searchTokens[position() ge 2]
    return
      if ($matchOption/@caseIndicator = "lowercase") then
        (fts:lowerCase($searchToken, $matchOption/@caseLanguage),
         fts:applyCaseOption($matchOption, $nextTokens))
      else if ($matchOption/@caseIndicator = "uppercase") then
        (fts:upperCase($searchToken, $matchOption/@caseLanguage),
         fts:applyCaseOption($matchOption, $nextTokens))
      else if ($matchOption/@caseIndicator = "insensitive") then
        (fts:insensitiveCase($searchToken, $matchOption/@caseLanguage),
         fts:applyCaseOption($matchOption, $nextTokens))
      else $searchTokens
}
The semantics for the FTDiacriticsOption are given below.
declare function fts:applyDiacriticsOption(
    $matchOption as fts:FTMatchOption,
    $searchTokens as fts:TokenInfo*
) as fts:TokenInfo* {
  (: an empty token sequence ends the recursion :)
  if (fn:empty($searchTokens)) then ()
  else
    let $searchToken := $searchTokens[1]
    let $nextTokens := $searchTokens[position() ge 2]
    let $indicator := $matchOption/@diacriticsIndicator
    return
      if ($indicator eq "with") then
        (fts:addDiacritics($searchToken, $matchOption/@language),
         fts:applyDiacriticsOption($matchOption, $nextTokens))
      else if ($indicator eq "without") then
        (fts:removeDiacritics($searchToken, $matchOption/@language),
         fts:applyDiacriticsOption($matchOption, $nextTokens))
      else if ($indicator eq "insensitive") then
        (fts:insensitiveDiacritics($searchToken, $matchOption/@language),
         fts:applyDiacriticsOption($matchOption, $nextTokens))
      else (: $indicator eq "sensitive" :) $searchTokens
}
The semantics for the FTStemOption are given below.
declare function fts:applyStemOption(
    $matchOption as fts:FTMatchOption,
    $searchTokens as fts:TokenInfo*
) as fts:TokenInfo* {
  if ($matchOption/@stemIndicator = "with") then
    (: an empty token sequence ends the recursion :)
    if (fn:empty($searchTokens)) then ()
    else
      let $searchToken := $searchTokens[1]
      let $nextTokens := $searchTokens[position() ge 2]
      return
        (fts:stemmedForm($searchToken, $matchOption/@language),
         fts:applyStemOption($matchOption, $nextTokens))
  else if ($matchOption/@stemIndicator = "without") then $searchTokens
  else ()
}
Stop-words interact with FTDistance and FTWindow. The semantics for the FTStopWordOption are given below.
declare function fts:applyStopwordOption(
    $matchOption as fts:FTMatchOption,
    $searchTokens as fts:TokenInfo*
) as fts:TokenInfo* {
  let $rootElem := fn:local-name($matchOption/element()[1])
  let $rootWords := $matchOption/element()[1]/text()
  let $swords :=
    if ($rootElem eq "stop-word") then $rootWords
    else fts:resolveStopwordsUri($rootWords)
  let $tokenizedSwords :=
    for $sw in $swords
    return <tokenInfo word="{$sw}" pos="0" sentence="0" para="0" />
  let $restOpers := $matchOption/element()[position() ge 2]
  let $effectiveStopwords := fts:calcStopwords($tokenizedSwords, $restOpers)
  return fts:replaceStopwords($searchTokens, $effectiveStopwords)
};

declare function fts:replaceStopwords(
    $searchTokens as fts:TokenInfo*,
    $stopWords as fts:TokenInfo*
) as fts:TokenInfo* {
  for $stoken in $searchTokens
  let $replace := some $sw in $stopWords satisfies $stoken/@word eq $sw/@word
  return
    if ($replace) then
      (: a stop word is replaced by a wildcard token at the same position :)
      let $newTI := <tokenInfo word=".*" pos="{$stoken/@pos}" para="{$stoken/@para}" sentence="{$stoken/@sentence}" />
      let $wcOption := <wildcard wildcardIndicator="with" />
      return fts:applyWildCardOption($wcOption, $newTI)
    else $stoken
};

declare function fts:addStopwords(
    $stopWords as fts:TokenInfo*,
    $newStopwords as fts:TokenInfo*
) as fts:TokenInfo* {
  if ($newStopwords) then
    let $firstStopword := $newStopwords[1]
    let $restStopwords := $newStopwords[position() ge 2]
    let $temp :=
      if ($stopWords[@word eq $firstStopword/@word]) then $stopWords
      else ($stopWords, $firstStopword)
    return fts:addStopwords($temp, $restStopwords)
  else $stopWords
};

declare function fts:remStopwords(
    $stopWords as fts:TokenInfo*,
    $remStopwords as fts:TokenInfo*
) as fts:TokenInfo* {
  if ($remStopwords) then
    let $firstStopword := $remStopwords[1]
    let $restStopwords := $remStopwords[position() ge 2]
    let $temp :=
      if ($stopWords[@word eq $firstStopword/@word]) then
        $stopWords[@word ne $firstStopword/@word]
      else $stopWords
    return fts:remStopwords($temp, $restStopwords)
  else $stopWords
};

declare function fts:calcStopwords(
    $stopWords as fts:TokenInfo*,
    $opers
) as fts:TokenInfo* {
  if ($opers) then
    let $firstOper := $opers[1]
    let $restOpers := $opers[position() ge 2]
    let $operType := $firstOper/@type
    let $operElem := fn:local-name($firstOper/element())
    let $operWords := $firstOper/element()/text()
    let $swords :=
      if ($operElem eq "stop-word") then $operWords
      else fts:resolveStopwordsUri($operWords)
    return
      if ($operType eq "union") then
        fts:calcStopwords(fts:addStopwords($stopWords, $swords), $restOpers)
      else
        fts:calcStopwords(fts:remStopwords($stopWords, $swords), $restOpers)
  else $stopWords
};
The stop words set is computed using the fts:calcStopwords function. The function uses the function fts:resolveStopwordsUri to resolve any URI to a sequence of strings. Then, each search token that is a stop word is replaced by a wildcard token that occupies the same position.
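A non-normative Python sketch of the same two steps follows: first the effective stop-word list is computed by applying "union" and "except" operations in order, then stop words in the search tokens are replaced by a wildcard pattern. Plain strings stand in for TokenInfos, URI resolution is left out, and the names `calc_stopwords` and `replace_stopwords` are hypothetical.

```python
def calc_stopwords(initial, opers):
    """Apply ('union' | 'except', words) operations to the initial list, in order."""
    words = list(initial)
    for op, extra in opers:
        if op == "union":
            for w in extra:
                if w not in words:
                    words.append(w)
        else:  # "except"
            words = [w for w in words if w not in extra]
    return words


def replace_stopwords(search_tokens, stopwords):
    """Replace each stop word by a wildcard so that it still occupies a position."""
    return [".*" if t in stopwords else t for t in search_tokens]


if __name__ == "__main__":
    effective = calc_stopwords(["the", "a"],
                               [("union", ["of", "the"]), ("except", ["a"])])
    print(effective)
    print(replace_stopwords(["king", "of", "the", "hill"], effective))
```

Keeping a wildcard in place of each stop word preserves the positions of the remaining tokens, which matters for the interaction with FTDistance and FTWindow noted above.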
The semantics for the FTWildCardOption are given below.
declare function fts:applyWildCardOption(
    $matchOption as fts:FTMatchOption,
    $searchTokens as fts:TokenInfo*
) as fts:TokenInfo* {
  if ($matchOption/@wildcardIndicator = "with") then
    (: an empty token sequence ends the recursion :)
    if (fn:empty($searchTokens)) then ()
    else
      let $searchToken := $searchTokens[1]
      let $nextTokens := $searchTokens[position() ge 2]
      return
        (fts:wildcardForm($searchToken, $matchOption/@language),
         fts:applyWildCardOption($matchOption, $nextTokens))
  else if ($matchOption/@wildcardIndicator eq "without") then $searchTokens
  else ()
};
The FTContainsExpr function defining the semantics of FTContainsExpr takes the following parameters: 1) a search context consisting of a sequence of nodes (which is the result of a regular XQuery/XPath expression) and 2) an AllMatches corresponding to an FTSelection. The function returns an xs:boolean atomic value. This value is true if and only if some node in the search context satisfies the full-text condition given by the FTSelection. Since FTContainsExpr returns results in XDM (a sequence of items), it may be treated like XQuery 1.0 expressions and may be fully composed with other XQuery 1.0 expressions. In addition, since the FTContainsExpr function maps AllMatches to a sequence of items, it provides semantics for mapping from AllMatches to XDM.
Consider an FTContainsExpr expression of the form EvaluationContext ftcontains FTSelection, where EvaluationContext is an XQuery 1.0 expression that returns a sequence of nodes and FTSelection is an FTSelection that returns AllMatches. The FTContainsExpr returns true if and only if some node in the result of EvaluationContext satisfies the AllMatches returned by FTSelection.
If the FTContainsExpr is of the form EvaluationContext ftcontains FTSelection without content IgnoreExpr for some XQuery 1.0 expression IgnoreExpr, then that FTContainsExpr is evaluated as given below.
declare function reconstruct($n as node(), $ignore as node()*) as node()? {
  if (some $i in $ignore satisfies $n is $i) then ()
  else if ($n instance of element()) then
    let $nodeName := fn:node-name($n)
    let $nodeContent := for $nn in $n/node() return reconstruct($nn, $ignore)
    return element {$nodeName} {$nodeContent}
  else $n
};

let $ignoreNodes := EvaluationContext/IgnoreExpr/text()
let $newEvalContext :=
  for $n in EvaluationContext
  return reconstruct($n, $ignoreNodes)
return $newEvalContext ftcontains FTSelection
The EvaluationContext is rewritten so it does not include any text node descendants from nodes that should be ignored.
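The rewriting can be approximated by the following non-normative Python sketch built on the standard `xml.etree.ElementTree` module. It deliberately simplifies the definition above: instead of ignoring individual text nodes, it drops whole ignored elements (compared by identity) while keeping the text that follows them. The function name `reconstruct` is reused only for readability; the element names in the usage example are hypothetical.

```python
import xml.etree.ElementTree as ET


def reconstruct(node, ignore):
    """Copy `node`, leaving out every element that is in `ignore` (by identity)."""
    if any(node is i for i in ignore):
        return None
    copy = ET.Element(node.tag, dict(node.attrib))
    copy.text = node.text
    copy.tail = node.tail
    for child in node:
        rebuilt = reconstruct(child, ignore)
        if rebuilt is not None:
            copy.append(rebuilt)
        elif child.tail:
            # Text after a dropped child belongs to the parent; keep it.
            if len(copy):
                copy[-1].tail = (copy[-1].tail or "") + child.tail
            else:
                copy.text = (copy.text or "") + child.tail
    return copy


if __name__ == "__main__":
    offer = ET.fromstring("<offer>Ford <note>rust free</note> Mustang</offer>")
    cleaned = reconstruct(offer, [offer.find("note")])
    print("".join(cleaned.itertext()))
```

After the rewrite, a search for "rust" against the cleaned copy would find nothing, while the original element is left untouched.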
The XQuery 1.0 and XPath 2.0 FTContainsExpr function takes three parameters.
1. The sequence of nodes returned by EvaluationContext.
2. The XML node representation of FTSelection.
3. The XML representation of the set of default values for each of the FTMatchOptions as given by the static context.
The FTContainsExpr function returns true if and only if the corresponding FTContainsExpr returns true, and thus specifies the semantics of FTContainsExpr. Note that by using XQuery 1.0 and XPath 2.0 to specify the formal semantics, we avoid the need to introduce a new formalism. We simply reuse the formal semantics of XQuery 1.0 and XPath 2.0.
declare function FTContainsExpr(
    $searchContext as node()*,
    $ftSelection as fts:FTSelection,
    $defOptions as fts:FTMatchOptions
) as xs:boolean {
  some $node in $searchContext satisfies
    let $allMatches := fts:evaluate($ftSelection, $node, $defOptions, 0)
    return
      some $match in $allMatches/match
      satisfies fn:count($match/stringExclude) eq 0
}
The FTContainsExpr function returns true if and only if the AllMatches that is the result of the application of the FTSelection for some node in the search context contains a Match with no StringExcludes. In other words, there is a set of TokenInfos in that node which satisfy the condition of the FTSelection.
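The final test performed by the function can be restated in a few lines of non-normative Python. The sketch assumes that the AllMatches for each node of the search context has already been computed and is modeled as a hypothetical list of dictionaries with `includes` and `excludes` position lists; `ft_contains` is an invented name.

```python
def ft_contains(all_matches_per_node):
    """True if, for some node, some Match has no StringExcludes."""
    for matches in all_matches_per_node:
        for match in matches:
            if not match["excludes"]:
                return True
    return False


if __name__ == "__main__":
    node_a = [{"includes": [3, 9], "excludes": [14]}]
    node_b = [{"includes": [2], "excludes": [5]},
              {"includes": [2, 7], "excludes": []}]
    print(ft_contains([node_a]))
    print(ft_contains([node_a, node_b]))
```

A node whose every Match still carries a StringExclude does not satisfy the condition, which is the situation at the end of the worked example in this section.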
Consider this more complex example. This example uses the same sample document fragment and assigns it to $doc. Consider the following FTContainsExpr.
$doc ftcontains ( ( "mustang" && (("great" || "excellent") at least 2 occurrences) ) window 30 words && ! "rust" ) same paragraph
Begin by evaluating the FTSelection to AllMatches.
( ( "mustang" && (("great" || "excellent") at least 2 occurrences) ) window 30 words && ! "rust" ) same paragraph
Step 1: Evaluate the FTWords "Mustang".
Step 2: Evaluate the FTWords "great".
Step 3: Evaluate the FTWords "excellent".
Step 4: Apply the FTOr ("great" || "excellent"), forming a union of the Matches.
Step 5: Apply the FTTimes ("great" || "excellent") at least 2 occurrences, forming two pairs of Matches.
Step 6: Apply the FTAnd "Mustang" && (("great" || "excellent") at least 2 occurrences), forming all possible pairs of StringMatches.
Step 7: Apply the FTWindow ("Mustang" && (("great" || "excellent") at least 2 occurrences)) window 30 words, filtering out Matches whose window is larger than 30 words.
Step 8: Evaluate the FTWords "rust".
Step 9: Apply the FTUnaryNot ! "rust", transforming the StringInclude into a StringExclude.
Step 10: Apply the FTAnd (("Mustang" && (("great" || "excellent") at least 2 occurrences)) window 30 words) && ! "rust", forming all possible combinations of three StringMatches from the first AllMatches and one StringMatch from the second AllMatches.
Step 11: Apply the FTScope, filtering out Matches whose TokenInfos are not within the same paragraph (assuming the <offers> elements also determine paragraph boundaries).
The resulting AllMatches contains a Match that contains a StringExclude. Therefore, the sample FTContainsExpr returns false.
This section addresses the semantics of scoring variables in XQuery 1.0 for and let clauses and XPath 2.0 for expressions.
Scoring variables are a construct that allows the association of a numeric score with the result of the evaluation of XQuery 1.0 and XPath 2.0 expressions. This numeric score estimates the value of a result item with respect to the user's information need, as expressed by the XQuery 1.0 and XPath 2.0 expression. The numeric score is computed using an implementation-provided scoring algorithm.
There are numerous scoring algorithms used in practice. Most of the scoring algorithms take as inputs a query and a set of results to the query. In computing the score, these algorithms rely on the structure of the query to estimate the relevance of the results.
In the context of defining the semantics of XQuery 1.0 and XPath 2.0 Full-text, passing the structure of the query poses a problem. The query is an XQuery 1.0 and XPath 2.0 expression and, in particular, an XQuery 1.0 and XPath 2.0 Full-text expression. The semantics of XQuery 1.0 and XPath 2.0 expressions is expressed using functions that take sequences of items as arguments and return sequences of items. They are not aware of what expression produced a particular sequence, i.e., they are not aware of the expression structure.
To define the semantics of scoring in XQuery 1.0 and XPath 2.0 Full-text with the current approach of using XQuery 1.0 and XPath 2.0 itself, it is necessary that the XQuery 1.0 and XPath 2.0 expressions that produce the query result (or the functions that implement the expressions) can be passed as arguments. In other words, second-order functions are needed. XQuery 1.0 and XPath 2.0 currently do not provide such functions.
Nevertheless, in the interest of the exposition, assume that such second-order functions are present. In particular, assume that there are two semantic second-order functions, fts:score and fts:scoreSequence, that each take one argument (an expression). fts:score returns the score value of this expression, and fts:scoreSequence returns a sequence of score values, one for each item to which the expression evaluates. The scores must satisfy scoring properties.
A for clause containing a score variable
for $result score $score in Expr ...
is evaluated as though it is replaced by the following set of clauses:
let $scoreSeq := fts:scoreSequence(Expr)
for $result at $i in Expr
let $score := $scoreSeq[$i]
...
Here, $scoreSeq and $i are new variables, not appearing elsewhere, and fts:scoreSequence is the second-order function.
Similarly, a let clause containing a score variable
let $result score $score := Expr ...
is evaluated as though it is replaced by the following set of clauses:
let $result := Expr
let $score := fts:score(Expr)
...
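The two rewritings can be mimicked in a language that does have higher-order functions. The following Python sketch is illustrative only. An expression is modelled as a zero-argument callable, and score_sequence and score are hypothetical stand-ins for fts:scoreSequence and fts:score. The length-based scoring formula is a placeholder assumption and not a scoring algorithm defined by this document.

```python
from typing import Callable, List, Sequence, Tuple

Item = str
# An unevaluated expression: calling it yields the sequence of result items.
Expr = Callable[[], Sequence[Item]]


def score_sequence(expr: Expr) -> List[float]:
    """Hypothetical stand-in for fts:scoreSequence: one score per result item.

    The formula (item length relative to the longest item) is a toy
    placeholder; a real implementation would inspect the query structure.
    """
    items = expr()
    longest = max((len(item) for item in items), default=1) or 1
    return [len(item) / longest for item in items]


def score(expr: Expr) -> float:
    """Hypothetical stand-in for fts:score: one score for the whole result."""
    return max(score_sequence(expr), default=0.0)


def for_with_score(expr: Expr) -> List[Tuple[Item, float]]:
    """Bindings produced by: for $result score $score in Expr ..."""
    score_seq = score_sequence(expr)              # let $scoreSeq := fts:scoreSequence(Expr)
    bindings = []
    for i, result in enumerate(expr()):           # for $result at $i in Expr
        bindings.append((result, score_seq[i]))   # let $score := $scoreSeq[$i]
    return bindings


def let_with_score(expr: Expr) -> Tuple[Sequence[Item], float]:
    """Bindings produced by: let $result score $score := Expr ..."""
    result = expr()                               # let $result := Expr
    return result, score(expr)                    # let $score := fts:score(Expr)
```

Note that both helpers receive the expression itself rather than its value, which is exactly what a first-order function over sequences of items cannot do.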
The EBNF in this document and in this section is aligned with the current XML Query 1.0 grammar (see http://www.w3.org/TR/2005/CR-xquery-20051103/).
[1] | Module |
::= | VersionDecl? (LibraryModule | MainModule) |
|
[2] | VersionDecl |
::= | "xquery" "version" StringLiteral ("encoding" StringLiteral)? Separator |
|
[3] | MainModule |
::= | Prolog QueryBody |
|
[4] | LibraryModule |
::= | ModuleDecl Prolog |
|
[5] | ModuleDecl |
::= | "module" "namespace" NCName "=" URILiteral Separator |
|
[6] | Prolog |
::= | ((DefaultNamespaceDecl | Setter | NamespaceDecl | Import) Separator)* ((VarDecl | FunctionDecl | OptionDecl | FTOptionDecl) Separator)* |
|
[7] | Setter |
::= | BoundarySpaceDecl | DefaultCollationDecl | BaseURIDecl | ConstructionDecl | OrderingModeDecl | EmptyOrderDecl | CopyNamespacesDecl |
|
[8] | Import |
::= | SchemaImport | ModuleImport |
|
[9] | Separator |
::= | ";" |
|
[10] | NamespaceDecl |
::= | "declare" "namespace" NCName "=" URILiteral |
|
[11] | BoundarySpaceDecl |
::= | "declare" "boundary-space" ("preserve" | "strip") |
|
[12] | DefaultNamespaceDecl |
::= | "declare" "default" ("element" | "function") "namespace" URILiteral |
|
[13] | OptionDecl |
::= | "declare" "option" QName StringLiteral |
|
[14] | FTOptionDecl |
::= | "declare" "ft-option" FTMatchOption |
|
[15] | OrderingModeDecl |
::= | "declare" "ordering" ("ordered" | "unordered") |
|
[16] | EmptyOrderDecl |
::= | "declare" "default" "order" "empty" ("greatest" | "least") |
|
[17] | CopyNamespacesDecl |
::= | "declare" "copy-namespaces" PreserveMode "," InheritMode |
|
[18] | PreserveMode |
::= | "preserve" | "no-preserve" |
|
[19] | InheritMode |
::= | "inherit" | "no-inherit" |
|
[20] | DefaultCollationDecl |
::= | "declare" "default" "collation" URILiteral |
|
[21] | BaseURIDecl |
::= | "declare" "base-uri" URILiteral |
|
[22] | SchemaImport |
::= | "import" "schema" SchemaPrefix? URILiteral ("at" URILiteral ("," URILiteral)*)? |
|
[23] | SchemaPrefix |
::= | ("namespace" NCName "=") | ("default" "element" "namespace") |
|
[24] | ModuleImport |
::= | "import" "module" ("namespace" NCName "=")? URILiteral ("at" URILiteral ("," URILiteral)*)? |
|
[25] | VarDecl |
::= | "declare" "variable" "$" QName TypeDeclaration? ((":=" ExprSingle) | "external") |
|
[26] | ConstructionDecl |
::= | "declare" "construction" ("strip" | "preserve") |
|
[27] | FunctionDecl |
::= | "declare" "function" QName "(" ParamList? ")" ("as" SequenceType)? (EnclosedExpr | "external") |
|
[28] | ParamList |
::= | Param ("," Param)* |
|
[29] | Param |
::= | "$" QName TypeDeclaration? |
|
[30] | EnclosedExpr |
::= | "{" Expr "}" |
|
[31] | QueryBody |
::= | Expr |
|
[32] | Expr |
::= | ExprSingle ("," ExprSingle)* |
|
[33] | ExprSingle |
::= | FLWORExpr |
|
[34] | FLWORExpr |
::= | (ForClause | LetClause)+ WhereClause? OrderByClause? "return" ExprSingle |
|
[35] | ForClause |
::= | "for" "$" VarName TypeDeclaration? PositionalVar? FTScoreVar? "in" ExprSingle ("," "$" VarName TypeDeclaration? PositionalVar? FTScoreVar? "in" ExprSingle)* |
|
[36] | PositionalVar |
::= | "at" "$" VarName |
|
[37] | FTScoreVar |
::= | "score" "$" VarName |
|
[38] | LetClause |
::= | (("let" "$" VarName TypeDeclaration? FTScoreVar?) | ("let" "score" "$" VarName)) ":=" ExprSingle ("," (("$" VarName TypeDeclaration? FTScoreVar?) | FTScoreVar) ":=" ExprSingle)* |
|
[39] | WhereClause |
::= | "where" ExprSingle |
|
[40] | OrderByClause |
::= | (("order" "by") | ("stable" "order" "by")) OrderSpecList |
|
[41] | OrderSpecList |
::= | OrderSpec ("," OrderSpec)* |
|
[42] | OrderSpec |
::= | ExprSingle OrderModifier |
|
[43] | OrderModifier |
::= | ("ascending" | "descending")? ("empty" ("greatest" | "least"))? ("collation" URILiteral)? |
|
[44] | QuantifiedExpr |
::= | ("some" | "every") "$" VarName TypeDeclaration? "in" ExprSingle ("," "$" VarName TypeDeclaration? "in" ExprSingle)* "satisfies" ExprSingle |
|
[45] | TypeswitchExpr |
::= | "typeswitch" "(" Expr ")" CaseClause+ "default" ("$" VarName)? "return" ExprSingle |
|
[46] | CaseClause |
::= | "case" ("$" VarName "as")? SequenceType "return" ExprSingle |
|
[47] | IfExpr |
::= | "if" "(" Expr ")" "then" ExprSingle "else" ExprSingle |
|
[48] | OrExpr |
::= | AndExpr ( "or" AndExpr )* |
|
[49] | AndExpr |
::= | ComparisonExpr ( "and" ComparisonExpr )* |
|
[50] | ComparisonExpr |
::= | FTContainsExpr ( (ValueComp | GeneralComp | NodeComp) FTContainsExpr )? |
|
[51] | FTContainsExpr |
::= | RangeExpr ( "ftcontains" FTSelection FTIgnoreOption? )? |
|
[52] | RangeExpr |
::= | AdditiveExpr ( "to" AdditiveExpr )? |
|
[53] | AdditiveExpr |
::= | MultiplicativeExpr ( ("+" | "-") MultiplicativeExpr )* |
|
[54] | MultiplicativeExpr |
::= | UnionExpr ( ("*" | "div" | "idiv" | "mod") UnionExpr )* |
|
[55] | UnionExpr |
::= | IntersectExceptExpr ( ("union" | "|") IntersectExceptExpr )* |
|
[56] | IntersectExceptExpr |
::= | InstanceofExpr ( ("intersect" | "except") InstanceofExpr )* |
|
[57] | InstanceofExpr |
::= | TreatExpr ( "instance" "of" SequenceType )? |
|
[58] | TreatExpr |
::= | CastableExpr ( "treat" "as" SequenceType )? |
|
[59] | CastableExpr |
::= | CastExpr ( "castable" "as" SingleType )? |
|
[60] | CastExpr |
::= | UnaryExpr ( "cast" "as" SingleType )? |
|
[61] | UnaryExpr |
::= | ("-" | "+")* ValueExpr |
|
[62] | ValueExpr |
::= | ValidateExpr | PathExpr | ExtensionExpr |
|
[63] | GeneralComp |
::= | "=" | "!=" | "<" | "<=" | ">" | ">=" |
|
[64] | ValueComp |
::= | "eq" | "ne" | "lt" | "le" | "gt" | "ge" |
|
[65] | NodeComp |
::= | "is" | "<<" | ">>" |
|
[66] | ValidateExpr |
::= | "validate" ValidationMode? "{" Expr "}" |
|
[67] | ValidationMode |
::= | "lax" | "strict" |
|
[68] | ExtensionExpr |
::= | Pragma+ "{" Expr? "}" |
|
[69] | Pragma |
::= | "(#" S? QName PragmaContents "#)" |
/* ws: explicitXQ */ |
[70] | PragmaContents |
::= | (Char* - (Char* '#)' Char*)) |
|
[71] | PathExpr |
::= | ("/" RelativePathExpr?) |
/* gn: leading-lone-slashXQ */ |
[72] | RelativePathExpr |
::= | StepExpr (("/" | "//") StepExpr)* |
|
[73] | StepExpr |
::= | FilterExpr | AxisStep |
|
[74] | AxisStep |
::= | (ReverseStep | ForwardStep) PredicateList |
|
[75] | ForwardStep |
::= | (ForwardAxis NodeTest) | AbbrevForwardStep |
|
[76] | ForwardAxis |
::= | ("child" "::") |
|
[77] | AbbrevForwardStep |
::= | "@"? NodeTest |
|
[78] | ReverseStep |
::= | (ReverseAxis NodeTest) | AbbrevReverseStep |
|
[79] | ReverseAxis |
::= | ("parent" "::") |
|
[80] | AbbrevReverseStep |
::= | ".." |
|
[81] | NodeTest |
::= | KindTest | NameTest |
|
[82] | NameTest |
::= | QName | Wildcard |
|
[83] | Wildcard |
::= | "*" |
/* ws: explicitXQ */ |
[84] | FilterExpr |
::= | PrimaryExpr PredicateList |
|
[85] | PredicateList |
::= | Predicate* |
|
[86] | Predicate |
::= | "[" Expr "]" |
|
[87] | PrimaryExpr |
::= | Literal | VarRef | ParenthesizedExpr | ContextItemExpr | FunctionCall | OrderedExpr | UnorderedExpr | Constructor |
|
[88] | Literal |
::= | NumericLiteral | StringLiteral |
|
[89] | NumericLiteral |
::= | IntegerLiteral | DecimalLiteral | DoubleLiteral |
|
[90] | VarRef |
::= | "$" VarName |
|
[91] | VarName |
::= | QName |
|
[92] | ParenthesizedExpr |
::= | "(" Expr? ")" |
|
[93] | ContextItemExpr |
::= | "." |
|
[94] | OrderedExpr |
::= | "ordered" "{" Expr "}" |
|
[95] | UnorderedExpr |
::= | "unordered" "{" Expr "}" |
|
[96] | FunctionCall |
::= | QName "(" (ExprSingle ("," ExprSingle)*)? ")" |
/* gn: reserved-function-namesXQ */ |
/* gn: parensXQ */ | ||||
[97] | Constructor |
::= | DirectConstructor |
|
[98] | DirectConstructor |
::= | DirElemConstructor |
|
[99] | DirElemConstructor |
::= | "<" QName DirAttributeList ("/>" | (">" DirElemContent* "</" QName S? ">")) |
/* ws: explicitXQ */ |
[100] | DirAttributeList |
::= | (S (QName S? "=" S? DirAttributeValue)?)* |
/* ws: explicitXQ */ |
[101] | DirAttributeValue |
::= | ('"' (EscapeQuot | QuotAttrValueContent)* '"') |
/* ws: explicitXQ */ |
[102] | QuotAttrValueContent |
::= | QuotAttrContentChar |
|
[103] | AposAttrValueContent |
::= | AposAttrContentChar |
|
[104] | DirElemContent |
::= | DirectConstructor |
|
[105] | CommonContent |
::= | PredefinedEntityRef | CharRef | "{{" | "}}" | EnclosedExpr |
|
[106] | DirCommentConstructor |
::= | "<!--" DirCommentContents "-->" |
/* ws: explicitXQ */ |
[107] | DirCommentContents |
::= | ((Char - '-') | ('-' (Char - '-')))* |
/* ws: explicitXQ */ |
[108] | DirPIConstructor |
::= | "<?" PITarget (S DirPIContents)? "?>" |
/* ws: explicitXQ */ |
[109] | DirPIContents |
::= | (Char* - (Char* '?>' Char*)) |
/* ws: explicitXQ */ |
[110] | CDataSection |
::= | "<![CDATA[" CDataSectionContents "]]>" |
/* ws: explicitXQ */ |
[111] | CDataSectionContents |
::= | (Char* - (Char* ']]>' Char*)) |
/* ws: explicitXQ */ |
[112] | ComputedConstructor |
::= | CompDocConstructor |
|
[113] | CompDocConstructor |
::= | "document" "{" Expr "}" |
|
[114] | CompElemConstructor |
::= | "element" (QName | ("{" Expr "}")) "{" ContentExpr? "}" |
|
[115] | ContentExpr |
::= | Expr |
|
[116] | CompAttrConstructor |
::= | "attribute" (QName | ("{" Expr "}")) "{" Expr? "}" |
|
[117] | CompTextConstructor |
::= | "text" "{" Expr "}" |
|
[118] | CompCommentConstructor |
::= | "comment" "{" Expr "}" |
|
[119] | CompPIConstructor |
::= | "processing-instruction" (NCName | ("{" Expr "}")) "{" Expr? "}" |
|
[120] | SingleType |
::= | AtomicType "?"? |
|
[121] | TypeDeclaration |
::= | "as" SequenceType |
|
[122] | SequenceType |
::= | ("empty-sequence" "(" ")") |
|
[123] | OccurrenceIndicator |
::= | "?" | "*" | "+" |
/* gn: occurrence-indicatorsXQ */ |
[124] | ItemType |
::= | KindTest | ("item" "(" ")") | AtomicType |
|
[125] | AtomicType |
::= | QName |
|
[126] | KindTest |
::= | DocumentTest |
|
[127] | AnyKindTest |
::= | "node" "(" ")" |
|
[128] | DocumentTest |
::= | "document-node" "(" (ElementTest | SchemaElementTest)? ")" |
|
[129] | TextTest |
::= | "text" "(" ")" |
|
[130] | CommentTest |
::= | "comment" "(" ")" |
|
[131] | PITest |
::= | "processing-instruction" "(" (NCName | StringLiteral)? ")" |
|
[132] | AttributeTest |
::= | "attribute" "(" (AttribNameOrWildcard ("," TypeName)?)? ")" |
|
[133] | AttribNameOrWildcard |
::= | AttributeName | "*" |
|
[134] | SchemaAttributeTest |
::= | "schema-attribute" "(" AttributeDeclaration ")" |
|
[135] | AttributeDeclaration |
::= | AttributeName |
|
[136] | ElementTest |
::= | "element" "(" (ElementNameOrWildcard ("," TypeName "?"?)?)? ")" |
|
[137] | ElementNameOrWildcard |
::= | ElementName | "*" |
|
[138] | SchemaElementTest |
::= | "schema-element" "(" ElementDeclaration ")" |
|
[139] | ElementDeclaration |
::= | ElementName |
|
[140] | AttributeName |
::= | QName |
|
[141] | ElementName |
::= | QName |
|
[142] | TypeName |
::= | QName |
|
[143] | URILiteral |
::= | StringLiteral |
|
[144] | FTSelection |
::= | FTOr (FTMatchOption | FTProximity)* ("weight" DecimalLiteral)? |
|
[145] | FTOr |
::= | FTAnd ( "||" FTAnd )* |
|
[146] | FTAnd |
::= | FTMildnot ( "&&" FTMildnot )* |
|
[147] | FTMildnot |
::= | FTUnaryNot ( "not" "in" FTUnaryNot )* |
|
[148] | FTUnaryNot |
::= | ("!")? FTWordsSelection |
|
[149] | FTWordsSelection |
::= | FTWords | ("(" FTSelection ")") |
|
[150] | FTWords |
::= | (Literal | VarRef | ContextItemExpr | FunctionCall | ("{" Expr "}")) FTAnyallOption? |
|
[151] | FTProximity |
::= | FTOrderedIndicator | FTWindow | FTDistance | FTTimes | FTScope | FTContent |
|
[152] | FTOrderedIndicator |
::= | "ordered" |
|
[153] | FTMatchOption |
::= | FTCaseOption |
|
[154] | FTCaseOption |
::= | "lowercase" |
|
[155] | FTDiacriticsOption |
::= | ("with" "diacritics") |
|
[156] | FTStemOption |
::= | ("with" "stemming") | ("without" "stemming") |
|
[157] | FTThesaurusOption |
::= | ("with" "thesaurus" (FTThesaurusID | "default")) |
|
[158] | FTThesaurusID |
::= | "at" StringLiteral ("relationship" StringLiteral)? (FTRange "levels")? |
|
[159] | FTStopwordOption |
::= | ("with" "stop" "words" FTRefOrList FTInclExclStringLiteral*) |
|
[160] | FTRefOrList |
::= | ("at" StringLiteral) |
|
[161] | FTInclExclStringLiteral |
::= | ("union" | "except") FTRefOrList |
|
[162] | FTLanguageOption |
::= | "language" StringLiteral |
|
[163] | FTWildCardOption |
::= | ("with" "wildcards") | ("without" "wildcards") |
|
[164] | FTContent |
::= | ("at" "start") | ("at" "end") | ("entire" "content") |
|
[165] | FTAnyallOption |
::= | ("any" "word"?) | ("all" "words"?) | "phrase" |
|
[166] | FTRange |
::= | ("exactly" UnionExpr) |
|
[167] | FTDistance |
::= | "distance" FTRange FTUnit |
|
[168] | FTWindow |
::= | "window" UnionExpr FTUnit |
|
[169] | FTTimes |
::= | "occurs" FTRange "times" |
|
[170] | FTScope |
::= | ("same" | "different") FTBigUnit |
|
[171] | FTUnit |
::= | "words" | "sentences" | "paragraphs" |
|
[172] | FTBigUnit |
::= | "sentence" | "paragraph" |
|
[173] | FTIgnoreOption |
::= | "without" "content" UnionExpr |
[174] | IntegerLiteral |
::= | Digits |
|
[175] | DecimalLiteral |
::= | ("." Digits) | (Digits "." [0-9]*) |
/* ws: explicitXQ */ |
[176] | DoubleLiteral |
::= | (("." Digits) | (Digits ("." [0-9]*)?)) [eE] [+-]? Digits |
/* ws: explicitXQ */ |
[177] | StringLiteral |
::= | ('"' (PredefinedEntityRef | CharRef | EscapeQuot | [^"&])* '"') | ("'" (PredefinedEntityRef | CharRef | EscapeApos | [^'&])* "'") |
/* ws: explicitXQ */ |
[178] | PredefinedEntityRef |
::= | "&" ("lt" | "gt" | "amp" | "quot" | "apos") ";" |
/* ws: explicitXQ */ |
[179] | EscapeQuot |
::= | '""' |
|
[180] | EscapeApos |
::= | "''" |
|
[181] | ElementContentChar |
::= | Char - [{}<&] |
|
[182] | QuotAttrContentChar |
::= | Char - ["{}<&] |
|
[183] | AposAttrContentChar |
::= | Char - ['{}<&] |
|
[184] | Comment |
::= | "(:" (CommentContents | Comment)* ":)" |
/* ws: explicitXQ */ |
/* gn: commentsXQ */ | ||||
[185] | PITarget |
::= | [http://www.w3.org/TR/REC-xml#NT-PITarget]XML |
/* gn: xml-versionXQ */ |
[186] | CharRef |
::= | [http://www.w3.org/TR/REC-xml#NT-CharRef]XML |
/* gn: xml-versionXQ */ |
[187] | QName |
::= | [http://www.w3.org/TR/REC-xml-names/#NT-QName]Names |
/* gn: xml-versionXQ */ |
[188] | NCName |
::= | [http://www.w3.org/TR/REC-xml-names/#NT-NCName]Names |
/* gn: xml-versionXQ */ |
[189] | S |
::= | [http://www.w3.org/TR/REC-xml#NT-S]XML |
/* gn: xml-versionXQ */ |
[190] | Char |
::= | [http://www.w3.org/TR/REC-xml#NT-Char]XML |
/* gn: xml-versionXQ */ |
The following symbols are used only in the definition of terminal symbols; they are not terminal symbols in the grammar of A EBNF for XQuery 1.0 Grammar with Full-Text extensions.
[191] | Digits |
::= | [0-9]+ |
[192] | CommentContents |
::= | (Char+ - (Char* ('(:' | ':)') Char*)) |
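As an informal illustration of the numeric terminals in the grammar above (Digits, IntegerLiteral, DecimalLiteral, DoubleLiteral), the following Python sketch transcribes those productions into regular expressions. The function name classify_numeric_literal is hypothetical, and the sketch ignores the surrounding tokenization rules (such as the ws: explicit annotations).

```python
import re
from typing import Optional

DIGITS = r"[0-9]+"

# IntegerLiteral ::= Digits
INTEGER = re.compile(DIGITS)
# DecimalLiteral ::= ("." Digits) | (Digits "." [0-9]*)
DECIMAL = re.compile(rf"(?:\.{DIGITS}|{DIGITS}\.[0-9]*)")
# DoubleLiteral ::= (("." Digits) | (Digits ("." [0-9]*)?)) [eE] [+-]? Digits
DOUBLE = re.compile(rf"(?:\.{DIGITS}|{DIGITS}(?:\.[0-9]*)?)[eE][+-]?{DIGITS}")


def classify_numeric_literal(text: str) -> Optional[str]:
    """Return the name of the production that matches the whole text, if any."""
    if DOUBLE.fullmatch(text):
        return "DoubleLiteral"
    if DECIMAL.fullmatch(text):
        return "DecimalLiteral"
    if INTEGER.fullmatch(text):
        return "IntegerLiteral"
    return None
```

The three productions are mutually exclusive on complete strings, so the order of the checks affects only readability.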
The EBNF in this document and in this section is aligned with the current XPath 2.0 grammar (see http://www.w3.org/TR/2005/CR-xpath20-20051103/).
[1] | XPath |
::= | Expr |
|
[2] | Expr |
::= | ExprSingle ("," ExprSingle)* |
|
[3] | ExprSingle |
::= | ForExpr |
|
[4] | ForExpr |
::= | SimpleForClause "return" ExprSingle |
|
[5] | SimpleForClause |
::= | "for" "$" VarName FTScoreVar? "in" ExprSingle ("," "$" VarName FTScoreVar? "in" ExprSingle)* |
|
[6] | FTScoreVar |
::= | "score" "$" VarName |
|
[7] | QuantifiedExpr |
::= | ("some" | "every") "$" VarName "in" ExprSingle ("," "$" VarName "in" ExprSingle)* "satisfies" ExprSingle |
|
[8] | IfExpr |
::= | "if" "(" Expr ")" "then" ExprSingle "else" ExprSingle |
|
[9] | OrExpr |
::= | AndExpr ( "or" AndExpr )* |
|
[10] | AndExpr |
::= | ComparisonExpr ( "and" ComparisonExpr )* |
|
[11] | ComparisonExpr |
::= | FTContainsExpr ( (ValueComp | GeneralComp | NodeComp) FTContainsExpr )? |
|
[12] | FTContainsExpr |
::= | RangeExpr ( "ftcontains" FTSelection FTIgnoreOption? )? |
|
[13] | RangeExpr |
::= | AdditiveExpr ( "to" AdditiveExpr )? |
|
[14] | AdditiveExpr |
::= | MultiplicativeExpr ( ("+" | "-") MultiplicativeExpr )* |
|
[15] | MultiplicativeExpr |
::= | UnionExpr ( ("*" | "div" | "idiv" | "mod") UnionExpr )* |
|
[16] | UnionExpr |
::= | IntersectExceptExpr ( ("union" | "|") IntersectExceptExpr )* |
|
[17] | IntersectExceptExpr |
::= | InstanceofExpr ( ("intersect" | "except") InstanceofExpr )* |
|
[18] | InstanceofExpr |
::= | TreatExpr ( "instance" "of" SequenceType )? |
|
[19] | TreatExpr |
::= | CastableExpr ( "treat" "as" SequenceType )? |
|
[20] | CastableExpr |
::= | CastExpr ( "castable" "as" SingleType )? |
|
[21] | CastExpr |
::= | UnaryExpr ( "cast" "as" SingleType )? |
|
[22] | UnaryExpr |
::= | ("-" | "+")* ValueExpr |
|
[23] | ValueExpr |
::= | PathExpr |
|
[24] | GeneralComp |
::= | "=" | "!=" | "<" | "<=" | ">" | ">=" |
|
[25] | ValueComp |
::= | "eq" | "ne" | "lt" | "le" | "gt" | "ge" |
|
[26] | NodeComp |
::= | "is" | "<<" | ">>" |
|
[27] | PathExpr |
::= | ("/" RelativePathExpr?) |
/* gn: leading-lone-slashXP */ |
[28] | RelativePathExpr |
::= | StepExpr (("/" | "//") StepExpr)* |
|
[29] | StepExpr |
::= | FilterExpr | AxisStep |
|
[30] | AxisStep |
::= | (ReverseStep | ForwardStep) PredicateList |
|
[31] | ForwardStep |
::= | (ForwardAxis NodeTest) | AbbrevForwardStep |
|
[32] | ForwardAxis |
::= | ("child" "::") |
|
[33] | AbbrevForwardStep |
::= | "@"? NodeTest |
|
[34] | ReverseStep |
::= | (ReverseAxis NodeTest) | AbbrevReverseStep |
|
[35] | ReverseAxis |
::= | ("parent" "::") |
|
[36] | AbbrevReverseStep |
::= | ".." |
|
[37] | NodeTest |
::= | KindTest | NameTest |
|
[38] | NameTest |
::= | QName | Wildcard |
|
[39] | Wildcard |
::= | "*" |
/* ws: explicitXP */ |
[40] | FilterExpr |
::= | PrimaryExpr PredicateList |
|
[41] | PredicateList |
::= | Predicate* |
|
[42] | Predicate |
::= | "[" Expr "]" |
|
[43] | PrimaryExpr |
::= | Literal | VarRef | ParenthesizedExpr | ContextItemExpr | FunctionCall |
|
[44] | Literal |
::= | NumericLiteral | StringLiteral |
|
[45] | NumericLiteral |
::= | IntegerLiteral | DecimalLiteral | DoubleLiteral |
|
[46] | VarRef |
::= | "$" VarName |
|
[47] | VarName |
::= | QName |
|
[48] | ParenthesizedExpr |
::= | "(" Expr? ")" |
|
[49] | ContextItemExpr |
::= | "." |
|
[50] | FunctionCall |
::= | QName "(" (ExprSingle ("," ExprSingle)*)? ")" |
/* gn: reserved-function-namesXP */ |
/* gn: parensXP */ | ||||
[51] | SingleType |
::= | AtomicType "?"? |
|
[52] | SequenceType |
::= | ("empty-sequence" "(" ")") |
|
[53] | OccurrenceIndicator |
::= | "?" | "*" | "+" |
/* gn: occurrence-indicatorsXP */ |
[54] | ItemType |
::= | KindTest | ("item" "(" ")") | AtomicType |
|
[55] | AtomicType |
::= | QName |
|
[56] | KindTest |
::= | DocumentTest |
|
[57] | AnyKindTest |
::= | "node" "(" ")" |
|
[58] | DocumentTest |
::= | "document-node" "(" (ElementTest | SchemaElementTest)? ")" |
|
[59] | TextTest |
::= | "text" "(" ")" |
|
[60] | CommentTest |
::= | "comment" "(" ")" |
|
[61] | PITest |
::= | "processing-instruction" "(" (NCName | StringLiteral)? ")" |
|
[62] | AttributeTest |
::= | "attribute" "(" (AttribNameOrWildcard ("," TypeName)?)? ")" |
|
[63] | AttribNameOrWildcard |
::= | AttributeName | "*" |
|
[64] | SchemaAttributeTest |
::= | "schema-attribute" "(" AttributeDeclaration ")" |
|
[65] | AttributeDeclaration |
::= | AttributeName |
|
[66] | ElementTest |
::= | "element" "(" (ElementNameOrWildcard ("," TypeName "?"?)?)? ")" |
|
[67] | ElementNameOrWildcard |
::= | ElementName | "*" |
|
[68] | SchemaElementTest |
::= | "schema-element" "(" ElementDeclaration ")" |
|
[69] | ElementDeclaration |
::= | ElementName |
|
[70] | AttributeName |
::= | QName |
|
[71] | ElementName |
::= | QName |
|
[72] | TypeName |
::= | QName |
|
[73] | FTSelection |
::= | FTOr (FTMatchOption | FTProximity)* ("weight" DecimalLiteral)? |
|
[74] | FTOr |
::= | FTAnd ( "||" FTAnd )* |
|
[75] | FTAnd |
::= | FTMildnot ( "&&" FTMildnot )* |
|
[76] | FTMildnot |
::= | FTUnaryNot ( "not" "in" FTUnaryNot )* |
|
[77] | FTUnaryNot |
::= | ("!")? FTWordsSelection |
|
[78] | FTWordsSelection |
::= | FTWords | ("(" FTSelection ")") |
|
[79] | FTWords |
::= | (Literal | VarRef | ContextItemExpr | FunctionCall | ("{" Expr "}")) FTAnyallOption? |
|
[80] | FTProximity |
::= | FTOrderedIndicator | FTWindow | FTDistance | FTTimes | FTScope | FTContent |
|
[81] | FTOrderedIndicator |
::= | "ordered" |
|
[82] | FTMatchOption |
::= | FTCaseOption |
|
[83] | FTCaseOption |
::= | "lowercase" |
|
[84] | FTDiacriticsOption |
::= | ("with" "diacritics") |
|
[85] | FTStemOption |
::= | ("with" "stemming") | ("without" "stemming") |
|
[86] | FTThesaurusOption |
::= | ("with" "thesaurus" (FTThesaurusID | "default")) |
|
[87] | FTThesaurusID |
::= | "at" StringLiteral ("relationship" StringLiteral)? (FTRange "levels")? |
|
[88] | FTStopwordOption |
::= | ("with" "stop" "words" FTRefOrList FTInclExclStringLiteral*) |
|
[89] | FTRefOrList |
::= | ("at" StringLiteral) |
|
[90] | FTInclExclStringLiteral |
::= | ("union" | "except") FTRefOrList |
|
[91] | FTLanguageOption |
::= | "language" StringLiteral |
|
[92] | FTWildCardOption |
::= | ("with" "wildcards") | ("without" "wildcards") |
|
[93] | FTContent |
::= | ("at" "start") | ("at" "end") | ("entire" "content") |
|
[94] | FTAnyallOption |
::= | ("any" "word"?) | ("all" "words"?) | "phrase" |
|
[95] | FTRange |
::= | ("exactly" UnionExpr) |
|
[96] | FTDistance |
::= | "distance" FTRange FTUnit |
|
[97] | FTWindow |
::= | "window" UnionExpr FTUnit |
|
[98] | FTTimes |
::= | "occurs" FTRange "times" |
|
[99] | FTScope |
::= | ("same" | "different") FTBigUnit |
|
[100] | FTUnit |
::= | "words" | "sentences" | "paragraphs" |
|
[101] | FTBigUnit |
::= | "sentence" | "paragraph" |
|
[102] | FTIgnoreOption |
::= | "without" "content" UnionExpr |
[103] | IntegerLiteral |
::= | Digits |
|
[104] | DecimalLiteral |
::= | ("." Digits) | (Digits "." [0-9]*) |
/* ws: explicitXP */ |
[105] | DoubleLiteral |
::= | (("." Digits) | (Digits ("." [0-9]*)?)) [eE] [+-]? Digits |
/* ws: explicitXP */ |
[106] | StringLiteral |
::= | ('"' (EscapeQuot | [^"])* '"') | ("'" (EscapeApos | [^'])* "'") |
/* ws: explicitXP */ |
[107] | EscapeQuot |
::= | '""' |
|
[108] | EscapeApos |
::= | "''" |
|
[109] | Comment |
::= | "(:" (CommentContents | Comment)* ":)" |
/* ws: explicitXP */ |
/* gn: commentsXP */ | ||||
[110] | QName |
::= | [http://www.w3.org/TR/REC-xml-names/#NT-QName]Names |
/* gn: xml-versionXP */ |
[111] | NCName |
::= | [http://www.w3.org/TR/REC-xml-names/#NT-NCName]Names |
/* gn: xml-versionXP */ |
[112] | Char |
::= | [http://www.w3.org/TR/REC-xml#NT-Char]XML |
/* gn: xml-versionXP */ |
The following symbols are used only in the definition of terminal symbols; they are not terminal symbols in the grammar of B EBNF for XPath 2.0 Grammar with Full-Text extensions.
[113] | Digits |
::= | [0-9]+ |
[114] | CommentContents |
::= | (Char+ - (Char* ('(:' | ':)') Char*)) |
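The Comment and CommentContents productions allow comments to nest. A minimal Python sketch of a scanner that respects this nesting is shown here; the function name skip_comment and its error handling are assumptions made for illustration and are not part of either grammar.

```python
def skip_comment(text: str, start: int = 0) -> int:
    """Return the index just past the (possibly nested) comment at `start`.

    Raises ValueError if no comment starts at `start` or if it is unterminated.
    """
    if not text.startswith("(:", start):
        raise ValueError("no comment starts at this position")
    depth = 0
    i = start
    while i < len(text):
        if text.startswith("(:", i):      # opening delimiter: one level deeper
            depth += 1
            i += 2
        elif text.startswith(":)", i):    # closing delimiter: one level up
            depth -= 1
            i += 2
            if depth == 0:
                return i
        else:                             # ordinary CommentContents character
            i += 1
    raise ValueError("unterminated comment")
```

For example, applied to the string "(: a (: b :) c :) rest", the scanner stops only after the second closing delimiter, so the inner comment does not end the outer one.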
The following table describes the full-text components of the static context. The following aspects of each component are described:
Default initial value: This is the initial value of the component if it is not overridden or augmented by the implementation or by a query.
Can be overwritten or augmented by implementation: Indicates whether an XQuery implementation is allowed to replace the default initial value of the component by a different, implementation-defined value and/or to augment the default initial value by additional implementation-defined values.
Can be overwritten or augmented by a query: Indicates whether a query is allowed to replace and/or augment the initial value provided by default or by the implementation. If so, indicates how this is accomplished (for example, by a declaration in the prolog).
Scope: Indicates where the component is applicable. "Global" indicates that the component applies globally, throughout all the modules used in a query. "Module" indicates that the component applies throughout a module. "Lexical" indicates that the component applies within the expression in which it is defined (equivalent to "module" if the component is declared in a Prolog).
Consistency Rules: Indicates rules that must be observed in assigning values to the component.
Component | Default initial value | Can be overwritten or augmented by implementation? | Can be overwritten or augmented by a query? | Scope | Consistency rules |
---|---|---|---|---|---|
FTCaseOption | case insensitive | overwriteable | overwriteable by prolog | lexical | Value must be case insensitive or case sensitive. |
FTDiacriticsOption | diacritics insensitive | overwriteable | overwriteable by prolog | lexical | Value must be diacritics insensitive or diacritics sensitive. |
FTStemOption | without stemming | overwriteable | overwriteable by prolog | lexical | Value must be without stemming or with stemming. |
FTThesaurusOption | without thesaurus | overwriteable | overwriteable by prolog (refer to default to augment) | lexical | Value must be part of the statically known thesauri. |
Statically known thesauri | none | augmentable | cannot be augmented or overwritten by prolog | module | Each URI uniquely identifies a thesaurus list. |
FTStopWordOption | without stopwords | overwriteable | overwriteable by prolog (refer to default to augment) | lexical | Value must be part of the statically known stop word lists. |
Statically known stop word lists | none | augmentable | cannot be augmented or overwritten by prolog | module | Each URI uniquely identifies a stop word list. |
FTLanguageOption | no language is selected | overwriteable | overwriteable by prolog | lexical | Value must be castable to "xs:language" or "none". |
FTWildCardOption | without wildcards | no | overwriteable by prolog | lexical | Value must be with wildcards or without wildcards. |
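One way to picture the table is as a record of default values that a query prolog may override, subject to the consistency rules. The Python sketch below is only a model under stated assumptions: the class FTStaticContext, its field names, and the override helper are hypothetical, "none" stands for "no language is selected", and the thesaurus and stop word components are not validated because their allowed values depend on the statically known lists.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class FTStaticContext:
    """Default initial values taken from the table of static context components."""
    case: str = "case insensitive"
    diacritics: str = "diacritics insensitive"
    stemming: str = "without stemming"
    thesaurus: str = "without thesaurus"
    stop_words: str = "without stopwords"
    language: str = "none"
    wildcards: str = "without wildcards"


# Consistency rules for the components that have a closed set of values.
ALLOWED = {
    "case": {"case insensitive", "case sensitive"},
    "diacritics": {"diacritics insensitive", "diacritics sensitive"},
    "stemming": {"without stemming", "with stemming"},
    "wildcards": {"without wildcards", "with wildcards"},
}


def override(ctx: FTStaticContext, **options: str) -> FTStaticContext:
    """Return a new context with the given options replaced, as a prolog might do."""
    for name, value in options.items():
        allowed = ALLOWED.get(name)
        if allowed is not None and value not in allowed:
            raise ValueError(f"{value!r} violates the consistency rule for {name}")
    return replace(ctx, **options)
```

Because the record is immutable, an override yields a new context and leaves the defaults untouched, which loosely mirrors lexical scoping.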
It is a type error if, during the static analysis phase, an expression is found to have a static type that is not appropriate for the context in which the expression occurs, or during the dynamic evaluation phase, the dynamic type of a value does not match a required type as specified by the matching rules in Section 2.5.4 SequenceType MatchingXP.
It is a dynamic error if, in a function invocation, the argument corresponding to the specified function's collation parameter does not identify a supported collation.
We would like to thank the members of the XQuery and XPath Full-Text group for their fruitful discussions.
We would like to thank the following people for their contributions to earlier drafts of this document.
"Andrew Eisenberg" - IBM - andrew.eisenberg@us.ibm.com
"Roland Seiffert" - IBM - seiffert@de.ibm.com
"Andrew Cencini" - Microsoft - acencini@microsoft.com
"Nimish Khanolkar" - Microsoft - nimishk@exchange.microsoft.com
"Ashok Malhotra" - Oracle - ashok.malhotra@oracle.com
"Tapas Nayak" - Microsoft - tapasnay@exchange.microsoft.com
This section contains definitions of important terms in this document.
Certain aspects of language processing are described in this specification as implementation-defined or implementation-dependent.
[Definition: Implementation-defined indicates an aspect that may differ between implementations, but must be specified by the implementor for each particular implementation.]
[Definition: Implementation-dependent indicates an aspect that may differ between implementations, is not specified by this or any W3C specification, and is not required to be specified by the implementor for any particular implementation.]
[Definition: A module is a fragment of XQuery code that conforms to the module grammar defined in XQuery 1.0: An XML Query Language draft.] Each module is either a main module or a library module.
[Definition: A main module consists of a Prolog followed by a Query Body.] A query has exactly one main module. In a main module, the Query Body can be evaluated, and its value is the result of the query.
[Definition: A module that does not contain a Query Body is called a library module. A library module consists of a module declaration followed by a Prolog.] A library module cannot be evaluated directly; instead, it provides function and variable declarations that can be imported into other modules.
The XQuery syntax does not allow a module to contain both a module declaration and a Query Body.
[Definition: A Prolog is a series of declarations and imports that define the processing environment for the module that contains the Prolog.] Each declaration or import is followed by a semicolon. A Prolog is organized into two parts.
The first part of the Prolog consists of setters, imports, namespace declarations, and default namespace declarations. [Definition: Setters are declarations that set the value of some property that affects query processing, such as construction mode, ordering mode, or default collation.] Namespace declarations and default namespace declarations affect the interpretation of QNames within the query. Imports are used to import definitions from schemas and modules. [Definition: Each imported schema or module is identified by its target namespace, which is the namespace of the objects (such as elements or functions) that are defined by the schema or module.]
The second part of the Prolog consists of declarations of variables, functions, and options. These declarations appear at the end of the Prolog because they may be affected by declarations and imports in the first part of the Prolog.
[Definition: The Query Body, if present, consists of an expression that defines the result of the query]. A module can be evaluated only if it has a Query Body.
[Definition: A module declaration serves to identify a module as a library module. A module declaration begins with the keyword module and contains a namespace prefix and a URILiteral].
This appendix provides a summary of features defined in this specification whose effect is explicitly implementation-defined. The conformance rules require vendors to provide documentation that explains how these choices have been exercised.
Everything about tokenization, including the definition of the term "words", is implementation-defined, except that
each word consists of one or more consecutive characters;
the tokenizer must preserve the containment hierarchy (paragraphs contain sentences contain words); and
the tokenizer must, when tokenizing two equal strings, identify the same tokens in each.
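The three constraints above leave wide latitude to implementations. As a hedged illustration, the Python sketch below shows one possible tokenizer that satisfies them. The splitting rules (blank lines end paragraphs, the characters ".", "!" and "?" end sentences, runs of word characters form words) and the Token record are assumptions chosen for the example and are not mandated by this document.

```python
import re
from typing import List, NamedTuple


class Token(NamedTuple):
    word: str        # one or more consecutive characters
    paragraph: int   # index of the containing paragraph
    sentence: int    # index of the containing sentence (counted across paragraphs)
    position: int    # index of the word in the whole text


def tokenize(text: str) -> List[Token]:
    """Toy tokenizer: paragraphs contain sentences, sentences contain words."""
    tokens: List[Token] = []
    sentence_no = 0
    position = 0
    for para_no, para in enumerate(re.split(r"\n\s*\n", text.strip())):
        for sentence in re.split(r"(?<=[.!?])\s+", para):
            words = re.findall(r"\w+", sentence)
            if not words:
                continue  # skip fragments that contain no word
            for word in words:
                tokens.append(Token(word, para_no, sentence_no, position))
                position += 1
            sentence_no += 1
    return tokens
```

The function is pure, so tokenizing two equal strings yields the same tokens, and every sentence index maps to exactly one paragraph index, which preserves the containment hierarchy.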
Implementations are free to provide implementation-defined ways to differentiate between markup's effect on token boundaries during tokenization.
It is implementation-defined what a stem of a word is and whether stemming will be based on an algorithm, a dictionary, or a mixed approach.
When the option "with default stop words" is used, an implementation-defined collection of stop words is used.
The set of valid language identifiers is implementation-defined.
Certain values in the static context (see C Static Context Components) that can be overwritten or augmented by implementations are implementation-defined.
This section contains the current issues related to this document.
This list of issues is classified in clusters. Each cluster has a unique name that reflects its topic. Each issue has a unique number. Some issues are labelled VNext. The clusters are:
Cluster A: Scoring and Weighting
Cluster B: IgnoreOption, Markup vs. Structure
Cluster C: Wildcards, Regex, Match Anchoring
Cluster D: Thesaurus, Match Option Defaults and Policies
Cluster E: Other MatchOptions Details
Cluster F: Grammar Integration, Syntax Details, and Naming
Cluster G: Semantics Details
Cluster H: Extensions
Cluster I: Simplifications and Variations of Language Constructs
Cluster J: IgnoreOption, Markup vs. Structure
Cluster K: Issue closed before we started clustering
Issue: scoring-properties, priority: , status: closed
Scoring Properties (Cluster A, Issue 1)
Is it possible to specify anything other than range ? Examples: do we want to define scoring rules for efficient scoring, rules to guarantee score monotonicity?
Resolution:
CLOSED.
No changes required. Closed at FTTF Meeting 62: http://lists.w3.org/Archives/Member/member-query-fttf/2004Oct/0020.html
Issue: scoring-values, priority: , status: closed
Scoring Values (Cluster A, Issue 2)
Answers that do not contain a match (in the Boolean sense) are assigned a score value that depends on the scoring algorithm and that might be greater than 0.
The following implications should hold:
score = 0 implies ftcontains is false.
score <> 0 does not imply anything for ftcontains.
ftcontains is true implies score > 0.
ftcontains is false does not imply anything for score.
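The four implications above can be sketched as a single consistency check. This is only an illustration: the function name `consistent` is hypothetical, and the check assumes that score values are never negative.

```python
def consistent(score: float, ftcontains: bool) -> bool:
    """Return True if a (score, ftcontains) pair obeys the implications above."""
    if score < 0:
        return False  # Assumption: scores are never negative.
    if score == 0 and ftcontains:
        return False  # score = 0 implies ftcontains is false
    # A non-matching answer may carry any score >= 0, and a matching
    # answer with score > 0 is always acceptable.
    return True
```

Under the non-negativity assumption, "ftcontains is true implies score > 0" is the contrapositive of "score = 0 implies ftcontains is false", which is why one test covers both.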
This interpretation enables the use of query relaxation in the ftcontains expression and thus allows a score value greater than 0 to be returned for those nodes that do not match the ftcontains expression (in a Boolean sense).
For example, given the query:
for $b in //books score $score as $b//content ftcontains "usability && testing" where $score > 0 return {$b}
The scoring algorithm could rewrite it to:
for $b in //books score $score as $b//content ftcontains "usability || testing with stemming" where $score > 0 return {$b}
and thus, some of the books that are not returned by the first query will be returned by the second query.
Resolution:
CLOSED.
We discussed several alternatives in http://lists.w3.org/Archives/Member/member-query-fttf/2004Dec/0024.html and we would like to adopt the one described above.
However, this issue is still under discussion.
See resolution in Cluster A, Issue 60.
Issue: data-model, priority: , status: closed
Semantics Data Model (Cluster K, Issue 3)
Data model incorporates new names - TokenInfo, Match, AllMatches.
Resolution:
CLOSED.
All occurrences of FullMatch, SimpleMatch, and Position in the text, in the schemas, and in the XQuery implementations of the semantics have been replaced with AllMatches, Match, and TokenInfo respectively.
Issue: ftcontains-grammar, priority: , status: closed
FTContains Grammar (Cluster K, Issue 4)
Expr "ftcontains" FTSelection FTIgnoreCtxMod?. One production for FTSelection which includes FTIgnoreCtxMod?
Resolution:
CLOSED.
We replaced the previous grammar production Expr "ftcontains" FTSelection that allowed FTIgnoreCtxMod to be combined with any FTSelection with the new one that restricts the application of FTIgnoreCtxMod to the highest level.
Issue: ftcontextmodifiers, priority: , status: closed
FTContextModifiers (Cluster K, Issue 5)
Paul C.: Change the name of the FTContextModifier productions, which modify the operational semantics of the FTSelections they are applied to. Abandon the use of "ContextModifier" as in FTCaseCtxMod, FTStemCtxMod, FTIgnoreCtxMod. Issue raised at FTTF Feb 5-6, 2004 meeting. Find in the minutes at: http://lists.w3.org/Archives/Member/member-query-fttf/2004Feb/0010.html (Cntl-F on FTContextModifiers)
Resolution:
CLOSED.
Replaced FTContextModifiers with FTMatchOptions as in FTCaseOption, FTStemOption, FTIgnoreOption in the February 26, 2004 Editor's Draft.
CLOSED February 26, 2004.
Issue: grammar, priority: , status: closed
Grammar (Cluster K, Issue 6)
Grammar: Where does the ftcontains expression belong in the XQuery grammar: Boolean expression or comparison expression?
Resolution:
CLOSED.
The ftcontains expression plugs in to the XQuery grammar in the "FTComparisonExpr" production. This seems to give ftcontains the correct precedence among other XQuery operations, and it makes intuitive sense.
Issue: wildcards, priority: , status: closed
Wildcards (Cluster C, Issue 7)
Pat Case: There are a few inconsistencies between this document and the Use Cases Working Draft.
This document and the Use Cases Working Draft present different syntax in regex examples. I can find no syntax provided in this document for the starts-with and exact match functionality. Should we rename the Wildcard section in the Use Cases to Regex Section and possibly rethink the use cases?
Resolution:
CLOSED.
We dropped regular expression support in favor of wildcard support. Closed at Meeting 67: http://lists.w3.org/Archives/Member/member-query-fttf/2005Jan/0051.html
Issue: thesaurus, priority: , status: closed
Thesaurus (Cluster D, Issue 8)
Thesaurus names: "synonyms", "narrower terms", "soundex", "spellcheck" and "wordnet". We need to define Thesaurus operators. We need more options when specifying a thesaurus: Name, URI, Depth, Dimension. Standards: ISO 2788/ANSI Z39.19.
We need to discuss what the grammar of ThesaurusMatchOption is. Current grammar is:
FTThesaurusOption ::= ("with"? "thesaurus" Expr) | "without thesaurus".
Proposed grammar is:
FTThesaurusOption ::= ("with"? "thesaurus" Expr "operation" Expr) | "without thesaurus".
Resolution:
CLOSED.
Changed the syntax and semantics of thesaurus according to http://lists.w3.org/Archives/Member/member-query-fttf/2005Jan/0111.html
Issue: window, priority: , status: closed
Window (VNext, Cluster H, Issue 9)
Currently, FTDistanceSpec only permits a single distance specification for all of the terms specified by an FTSelection.
For example:
("dog" && "cat" && "bird") with word distance at most 10
In this scenario above, the terms "dog", "cat", and "bird" must all occur within 10 words of one another.
However, if one wanted to return documents where "dog" occurs within 10 words of "cat" and this SAME "cat" term occurs within 5 words of "bird", this is not possible with the current language specification. The best that could be done is the following:
(("dog" && "cat") with word distance at most 10) and (("cat" && "bird") with word distance at most 5)
But this will not lead to exactly the desired result, because the "cat" and "bird" comparison will not use only those "cat" terms which occurred within 10 positions of "dog"; it can use any "cat" term within the search context.
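The difference between the two readings can be sketched in Python. This is an illustration only, not the draft's semantics functions: the helper names are hypothetical, and distance is assumed to be the number of intervening words (so adjacent words are at distance zero).

```python
from itertools import product
from typing import Dict, List


def positions(tokens: List[str]) -> Dict[str, List[int]]:
    """Map each token to the list of word positions at which it occurs."""
    index: Dict[str, List[int]] = {}
    for pos, tok in enumerate(tokens):
        index.setdefault(tok, []).append(pos)
    return index


def near(a: int, b: int, limit: int) -> bool:
    # Assumption: distance counts intervening words, so adjacent words
    # have distance 0.
    return abs(a - b) - 1 <= limit


def conjunction_matches(tokens: List[str]) -> bool:
    """Two independent distance tests joined by 'and' (any "cat" will do)."""
    idx = positions(tokens)
    dogs, cats, birds = idx.get("dog", []), idx.get("cat", []), idx.get("bird", [])
    dog_cat = any(near(d, c, 10) for d, c in product(dogs, cats))
    cat_bird = any(near(c, b, 5) for c, b in product(cats, birds))
    return dog_cat and cat_bird


def same_cat_matches(tokens: List[str]) -> bool:
    """The desired reading: one and the same "cat" satisfies both tests."""
    idx = positions(tokens)
    dogs, cats, birds = idx.get("dog", []), idx.get("cat", []), idx.get("bird", [])
    return any(
        any(near(d, c, 10) for d in dogs) and any(near(c, b, 5) for b in birds)
        for c in cats
    )
```

For a token sequence such as "dog cat", thirty filler words, then "cat bird", `conjunction_matches` succeeds because each test is satisfied by a different "cat", while `same_cat_matches` fails.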
Resolution:
CLOSED.
The issue was closed on April 25, 2005 <http://lists.w3.org/Archives/Member/member-query-fttf/2005Apr/0072.html>. No changes are made to the language. Although the current language can express many of the queries in question, the group recognizes that the query expressions are clumsy and difficult to write. Therefore, this issue will be considered again for VNext.
Issue: mildnot, priority: , status: closed
MildNot (Cluster I, Issue 10)
Andrew E.: Should we remove the mild not? It has never been included in a query language before.
Pat Case has provided use cases to justify its inclusion at: http://lists.w3.org/Archives/Member/member-query-fttf/2003Dec/0034.html
Discussion followed. Michael Rys' reply: http://lists.w3.org/Archives/Member/member-query-fttf/2003Dec/0038.html
Pat Case's reply: http://lists.w3.org/Archives/Member/member-query-fttf/2003Dec/0043.html
Use case paraphrase (for non-members): Consider a collection of 3 documents:
The Delights of Mexico - a document that includes "Mexico" several times.
The Perils of New Mexico - a document that includes "New Mexico" several times.
Travel in North America - a document that includes both "Mexico" and "New Mexico" several times.
Suppose you are planning a trip to Mexico. You want documents 1 and 3, but not 2. You could search for "Mexico" and get documents 1, 2 and 3. Or you could search for "Mexico AND NOT 'New Mexico'" and get just document 1. But the "strong not" has ruled out document 3 - even though it contained the thing you were looking for - just because it contained the thing you were not looking for.
The "mild not" operator allows you to say "Mexico MILD NOT 'New Mexico'", which means "find me all the documents that contain 'Mexico'. Do not take any notice of occurrences of 'New Mexico', but do not rule out a document just because it contains 'New Mexico'".
There are many cases where you may want to search for a word, but NOT get documents just because they contain a common phrase that includes that word. e.g. "security" mildnot "social security", "house" mildnot "house of representatives", "estate tax" mildnot "real estate tax"
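The use case above can be sketched in Python over lists of tokens. The function names are hypothetical, and the sketch assumes that an occurrence of the wanted phrase is discarded whenever it overlaps an occurrence of the unwanted phrase; the draft's formal semantics may draw that line differently.

```python
from typing import List


def find_phrase(tokens: List[str], phrase: List[str]) -> List[int]:
    """Start positions at which `phrase` occurs in `tokens`."""
    n = len(phrase)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == phrase]


def strong_not(tokens: List[str], wanted: List[str], unwanted: List[str]) -> bool:
    """'wanted AND NOT unwanted': any occurrence of `unwanted` rules the document out."""
    return bool(find_phrase(tokens, wanted)) and not find_phrase(tokens, unwanted)


def mild_not(tokens: List[str], wanted: List[str], unwanted: List[str]) -> bool:
    """True if `wanted` occurs somewhere outside every occurrence of `unwanted`."""
    covered = set()
    for start in find_phrase(tokens, unwanted):
        covered.update(range(start, start + len(unwanted)))
    return any(
        not covered.intersection(range(start, start + len(wanted)))
        for start in find_phrase(tokens, wanted)
    )
```

Applied to token lists standing in for the three documents, `strong_not` keeps only the first document, whereas `mild_not` keeps the first and the third and still rejects the second.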
Resolution:
CLOSED.
Issues 10 and 41 are now closed. We add the mildnot functionality and FTMildNot is spelled as "not in". Closed at FTTF Meeting 80: http://lists.w3.org/Archives/Member/member-query-fttf/2005May/att-0030/fttf-20050503.txt
Issue: markup-vs-structure, priority: , status: closed
Markup vs Structure (Cluster J, Issue 11)
Some tags are "markup" - e.g. b - some are "structure" - e.g. title. We generally want to treat structure tags as word boundaries, but not markup tags. How do we distinguish between markup and structure?
Michael to provide reformulation.
Resolution:
CLOSED.
Closed on April 29, 2005 and updated Section 1.1 as in http://lists.w3.org/Archives/Member/member-query-fttf/2005Apr/0091.html.
Issue: matchoption-policy, priority: , status: open
MatchOption Policy (Cluster D, Issue 12)
We need some indirection to specify match context, defaults "Thesaurus name" gives us a way to define a thesaurus, then specify it in the query - an indirection. Steve Buxton proposes there are many classes of things that are needed for context-match (stoplist, special characters, etc.) that need an indirection. So we need an extra level of indirection - a named policy that refers to a set of named things.
Resolution:
None recorded.
Issue: loose-grammar, priority: , status: closed
Loose Grammar (Cluster I, Issue 13)
The grammar allows lots of queries that do not make sense. e.g. "(dog || cat) within word distance N", "dog within word distance N", "(dog || cat) ordered", "!dog 5 times" If the grammar does not provide a way of identifying these "nonsense queries", then the implementation still has to identify them - i.e. implementors will have to augment the grammar to identify nonsense queries, and augment the semantics to do something with them.
J. Doerre asks if we should allow nested FTNegations in the RHS of a FTMildNegation. From his email (http://lists.w3.org/Archives/Member/member-query-fttf/2004Apr/0019.html) point 3: "The ApplyFTSelection ignores all StringExcludes in the arguments of the FTMildNegation. I think, if we don't want to deal with StringExcludes in that function, we should explicitly forbid them to appear, i.e. require arguments of FTMildNegation to not include any FTNegation."
Resolution:
CLOSED.
Leave the grammar as it is for a couple of reasons. 1. We cannot solve this problem with a (context-free) grammar without complicating it unnecessarily. For example, apart from "(dog || cat) word distance N", the "no-op" rule can also be applied to "(dog with diacritics || cat) case-insensitive without stop words word distance N".
2. It is hard if not impossible to enumerate all "no-ops". Here are some additional ones: "a" && !"a", (dog && cat) distance at most 5 words distance at most 6 words, "To be or not to be" distance at least 10 words, etc. It should be left to the application to determine what constitutes a no-op and optimize if possible.
See F2F minutes in http://lists.w3.org/Archives/Member/member-query-fttf/2005Jul/0049.html
Issue: fttimesselection, priority: , status: closed
FTTimesSelection (Cluster G, Issue 14)
How do I count occurrences, where the query is NOT a single term? How many occurrences of "!dog" are there in "very very big"? Zero or very many?
Resolution:
CLOSED.
Issue: regexp-escape, priority: , status: closed
RegExp Escape (Cluster C, Issue 15)
Need to define some escaping mechanism for regexp characters, and for (||, ...).
Resolution:
CLOSED.
Closed on Feb. 14, 2005 because regular expressions are not part of the language anymore.
Issue: ftscopeselection, priority: , status: closed
FTScopeSelection (Cluster I, Issue 16)
Is there a need for both FTScopeSelection and FTDistance? For example, how is 'same sentence' or 'same paragraph' really different from an FTDistance of 'with sentence exactly 1' or 'with paragraph exactly 1'?
Resolution:
CLOSED.
We decided to keep both FTScopeSelection and FTDistance.
Issue: weighting, priority: , status: closed
Weighting (Cluster A, Issue 17)
Michael R.: What syntactic form should scoring take? How do we describe the constraints on the types of expressions that are allowed? Should scoring be expressed using a second-order function, a stand-alone operator, or as a clause in a FLWOR expression? Consider moving weighting to ftContains, something like the following: TreatExpr ("ftcontains" FTSelection ("weight" Expr)? )?
Options in presentation of full-text language proposal and some discussion at XQuery January meeting, Tampa at: http://www.w3.org/XML/Group/2004/01/xquery-minutes (Cntl-F on Report of Full-Text Task Force)
Resolution:
CLOSED.
Added weight to FTSelection inside a scoring expression.
Issue: weight-values, priority: , status: closed
Weight Values (Cluster A, Issue 18)
Valid values for weights must be defined.
Resolution:
CLOSED.
Weight values in scoring expressions are in the interval [0,1].
Issue: ftscopeselection-on-structure, priority: , status: closed
FTScopeSelection on structure (VNext, Cluster H, Issue 19)
Scoping based on structure (e.g. same node and different node) should be considered. Support for queries where distance is measured in terms of "number of intervening elements" where elements can be any markup including chapter, paragraph and sentence. Consider sentence/paragraph/node distance.
Resolution:
CLOSED.
Postponed to VNext.
Issue: languagematchoption, priority: , status: closed
LanguageMatchOption (Cluster E, Issue 20)
What is the default language? SA: Dana F.: does the language have to be a literal or an Expr that returns xs:string? Is there an implementation-defined list of valid languages ?
Resolution:
CLOSED.
1. Default language is "None".
The Working Draft states explicitly in Section 3.2.7 the possibility to have no language selected. I think this is a good choice for the default (and it is specified as the default in the Working Draft). A typical application that uses XQuery-FT will probably have logic in place to override the default by the language setting from the locale of the client, so the default is really unimportant.
2. The language is given by a UnionExpr that must return an xs:string, or an empty sequence. This is what the Working Draft specifies. Let us keep it like that.
3. Yes, there is an implementation-defined list of valid languages. We added a statement on this to Section 3.2.7. See http://lists.w3.org/Archives/Member/member-query-fttf/2005Jan/0083.html
Issue: casematchoption-specialcharmatchoption, priority: , status: closed
CaseMatchOption and SpecialCharMatchOption (Cluster E, Issue 21)
Paul C. raised the question of whether "lowercase", "uppercase", "case sensitive" and "case insensitive" should be defined in the context of Unicode. J. Doerre provided this link to the Unicode standard: http://www.unicode.org. The current version is 4.0.0. Case folding is described in Chapter 3.13. Please note that the case folding operations, like toUppercase(X), only depend on the characters to be folded, not on additional information, like language.
Resolution:
CLOSED.
There will be no syntax for special character handling in the current draft. Issues to consider for v. next are in this list of issues.
Issue: diacriticsmatchoption, priority: , status: closed
DiacriticsMatchOption (Cluster E, Issue 22)
Paul C.: We need to define what a diacritic is. Steve B. asked whether "with diacritics" and "without diacritics" are needed or not.
Resolution:
CLOSED.
We removed the special character match option as instructed in http://lists.w3.org/Archives/Member/member-query-fttf/2005Jan/0051.html
Issue: tokenizers, priority: , status: closed
Tokenizers (Cluster J, Issue 23)
Darin/Paul C.: What is the most general behavior for tokenizers?
Michael Kay: Can we define a set of rules that apply regardless of which tokenizer we are using, in the same manner as the rules we defined for scoring? For example, we could impose constraints on words, sentences and paragraphs.
Resolution:
CLOSED.
Modified item 7 in Section 1.1 to reflect conditions on tokenizers.
Issue: specialcharmatchoption, priority: , status: closed
SpecialCharMatchOption (Cluster E, Issue 24)
We need to say more about special characters, what kind of special characters do we want to consider, what is their impact on the ability to use a given index, their impact on tokenization.
Resolution:
CLOSED.
We decided to remove this match option from the current WD and create new issues to be considered for v. next.
Issue: matchoption-syntax, priority: , status: closed
MatchOption Syntax (Cluster E, Issue 25)
Paul C.: It may be that we should reconsider the syntax and allow modifiers to be applied to individual words.
Resolution:
CLOSED.
Issue: stopwordsmatchoption, priority: , status: closed
StopWordsMatchOption (Cluster E, Issue 26)
We need to say more about stop words: what kind of stop words do we want to consider, what is their impact on the ability to use a given index, and what is their impact on tokenization? Should we allow the URI of a StopWords list to be specified? Paul C.: What would a single search with a stop word return?
Resolution:
CLOSED.
We changed the syntax of the stop words specification to allow for using a URI as a stop word list. The new syntax is given in: http://lists.w3.org/Archives/Member/member-query-fttf/2005Jan/0109.html
Issue: matchoptionstokenization, priority: , status: closed
MatchOption and Tokenization (Cluster C, Issue 27)
Does the language document clearly state the impact of match options on tokenization? Consider regex * when does it get applied? What effect does it have on word breaks? Example: expr ftcontains "brown .ox" with regex, expr ftcontains "brown .*ox" with regex.
Resolution:
CLOSED.
Closed, on Feb. 17, 2005, because no longer an issue.
The only impact of match options on tokenization that needs to be addressed in the specification is the impact of the wildcard match option. Other match options, like "language", are allowed to impact tokenization in an implementation-dependent way.
For the wildcard match option its implication on tokenization is now clearly stated in its description, namely that wildcards, i.e., the character sequences ".", ".*", ".+", etc., are to be interpreted as token-internal character sequences when within an FTWords that is inside the scope of the wildcard match option.
Issue: ignoresyntax, priority: , status: closed
IGNORE Syntax (Cluster B, Issue 28)
Do we need special syntax for IGNORE in case of level by level search?
Resolution:
CLOSED.
We already have a syntax for this.
Issue: scoping, priority: , status: closed
Scoping (Cluster I, Issue 29)
Do we need same sentence, same paragraph search? * in semantics, not in requirements.
Resolution:
CLOSED.
Closed by Pat Case in http://lists.w3.org/Archives/Member/member-query-fttf/2005Mar/0230.html
This recommendation should focus on functionality which serves all languages. It should also selectively include functionalities useful within families of languages. Searching within sentences and paragraphs is useful to many western languages and some non-western languages. They should remain in the recommendation.
Issue: precedencexqueryfulltext, priority: , status: closed
Precedence of XQuery and full-text (Cluster F, Issue 30)
We need to distinguish between XQuery expressions embedded in full-text expressions and FTSelections themselves. S. Buxton suggests that we use different kinds of parentheses to distinguish between these two expressions. See his message in http://lists.w3.org/Archives/Member/member-query-fttf/2004Apr/0042.html and subsequent messages. A simple example is to distinguish between ("cat") as an XQuery expression that builds an XQuery sequence and ("cat") as an FTSelection.
In the current draft of the document, we are using lookahead
Other possibilities include the use of "{}" to switch from full-text to XQuery when XQuery expressions are embedded in full-text expressions. This is similar to element construction in XQuery and has been pointed out by Mary H in her email at http://lists.w3.org/Archives/Member/member-query-fttf/2004May/0163.html
Resolution:
CLOSED.
We decided to use {} to delimit XQuery expressions inside XQuery Full-Text ones according to the discussion in http://lists.w3.org/Archives/Member/member-query-fttf/2005Mar/0019.html
Issue: ftdistancewith, priority: , status: closed
Optional Keyword "with" in FTDistance (Cluster F, Issue 31)
In 3.1.9 FTDistance: Do we need "with" in FTDistance?
Raised by Pat Case by email April 28, 2004
Resolution:
CLOSED.
We removed the optional keyword "with" from FTDistance. Closed at FTTF Meeting 69, by accepting text at http://lists.w3.org/Archives/Member/member-query-fttf/2005Jan/0112.html as amended.
Issue: ftwindowwithin, priority: , status: closed
Optional Keyword "within" in FTWindow (Cluster F, Issue 32)
In 3.1.20 FTWindow: Do we need "within" in FTWindow?
Raised by Pat Case by email April 28, 2004
Resolution:
CLOSED.
We removed the optional keyword "within" from FTWindow. Closed at FTTF Meeting 69, by accepting text at http://lists.w3.org/Archives/Member/member-query-fttf/2005Jan/0112.html as amended.
Issue: ftspecialcharoption-issue, priority: , status: closed
FTSpecialCharOption (Cluster E, Issue 33)
In 3.2.3 FTSpecialCharOption: Should we have to or be able to specify which special characters are to be matched or not? Should the following syntax be allowed: 'without special characters "-"' or 'with special characters "-"'?
Raised by Pat Case by email April 28, 2004
Resolution:
CLOSED.
Closed on 14 Feb. 2005 because the special character match option is not part of the language anymore.
Issue: ftnegationunarynot, priority: , status: closed
FTNegation Includes Unary Not (Cluster F, Issue 34)
In 3.1.5 FTNegation: If we are supporting the unary not which is shown in the production, please add text and examples to show that both the "unary not" and the "and not" are supported.
Raised by Pat Case by email April 28, 2004
Resolution:
CLOSED.
Closed by Pat Case in http://lists.w3.org/Archives/Member/member-query-fttf/2005Mar/0230.html
Issue: ftorderunordered, priority: , status: closed
FTOrder Unordered Option (Cluster F, Issue 35)
In 3.1.7 FTOrder: [30] FTOrder ::= FTSelection "ordered" should we have an explicit "unordered" for the default?
Raised by Pat Case by email April 28, 2004
Resolution:
CLOSED.
We don't introduce an explicit "unordered" operator. This would require the semantics to deal with partial orders inside ALLMATCHES. There are no use cases warranting such complications in the semantics. Closed at FTTF Meeting 68: http://lists.w3.org/Archives/Member/member-query-fttf/2005Feb/0020.html
Issue: ftignoreoptionnaming, priority: , status: closed
FTIgnoreOption Naming (Cluster F, Issue 36)
Would FTFilterOption be a better name than FTIgnoreOption?
Raised by Pat Case by email April 28, 2004
Resolution:
CLOSED.
Closed by Pat Case in http://lists.w3.org/Archives/Member/member-query-fttf/2005Mar/0208.html
Since filter and skip are already used in the XQuery recommendation, the name of this functionality should remain FTIgnore.
Issue: ftrangespecsyntax, priority: , status: closed
FTRangeSpec Syntax for 1 to 4 (Cluster F, Issue 37)
We should consider aligning the syntax for the FTRangeSpec with an upper and lower boundary in 3.1.9 FTDistance (from 1 to 4) with the syntax for using range expressions to construct sequences in XQuery and XPath (1 to 4), See the XQuery/XPath language document Section 3.3.1 Constructing Sequences.
Raised by Pat Case by email April 28, 2004
Resolution:
CLOSED.
Closed by Pat Case in http://lists.w3.org/Archives/Member/member-query-fttf/2005Mar/0227.html
The document will continue to use the syntax (from 1 to 4) for number ranges in FTRange. This syntax for number ranges is the most user-friendly. There is no need to align this syntax with the XML Schema/XQuery regular expression syntax for number ranges.
Issue: booleannaming, priority: , status: closed
Boolean (&& || !) Naming (Cluster F, Issue 38)
Is it not possible and maybe preferable to use ftand ftor ftnot instead of && || ! following the lead of ftcontains?
Raised by Pat Case by email April 28, 2004
Resolution:
CLOSED.
Issue: exactelementcontent, priority: , status: closed
Exact Element Content (Cluster C, Issue 39)
We have a use case for an exact element content query which finds the exact words or phrases being queried, no more and no less in an element and allows variations on case, diacritics, and special characters. Should this functionality be in XQuery full-text? If so, should we use the keywords "exact content"?
Raised by Pat Case by email April 28, 2004
Resolution:
CLOSED.
We added an FTSelection in Section 3.1.12 to express exact content.
Issue: startswith, priority: , status: closed
Starts With (Cluster C, Issue 40)
We have a use case for a starts with query which finds the words or phrases being queried as the first content of an element. Should this functionality be in XQuery full-text? If so, should we use the keywords "starts with"?
Raised by Pat Case by email April 28, 2004
Resolution:
CLOSED.
We added an FTSelection in Section 3.1.12 to express starts with.
Issue: mild-not-naming, priority: , status: closed
What should we call the mild not (Cluster F, Issue 41)
The name "mild not" or "mild negation" is not really helpful in understanding what we want it to denote. We should try hard to find a better name for this construct. Since it is used to exclude certain matches, why not call it "FTMatchExclude" or just "FTExclude"? Keeping "mild not" as the name makes it recognizable as a form of "not". If it remains as "mild not" and the ! continues as the syntax for "not", consider using mild! as the syntax for "mild not".
Raised by Jochen by email April 21, 2004; Additional comments by Pat Case May 4, 2004
Resolution:
CLOSED.
Issues 10 and 41 are now closed. We add the mildnot functionality and FTMildNot is spelled as "not in". Closed at FTTF Meeting 80: http://lists.w3.org/Archives/Member/member-query-fttf/2005May/att-0030/fttf-20050503.txt
Issue: multi-word-phrases-thesauri-lookup, priority: , status: closed
Thesauri lookup for multi-word phrases (Cluster D, Issue 42)
It should be decided whether thesauri lookups can be performed only on single words or whether it is possible to apply it on multi-word phrases. For example, should we allow the thesaurus to replace "bells and whistles" with "frills"?
In the latter case, should thesauri lookup be applied only to the FTWord "bells and whistles", or should it also be applied to the ("bells" "and" "whistles") phrase? Another question is, if the thesauri expansion can be applied to a phrase and to a word in the phrase, which one takes precedence.
Resolution:
CLOSED.
The semantics has been modified so that thesauri lookups can be performed on multi-token phrases. They are applied only on phrases that are explicitly specified by the user; e.g., they will be applied in FTWord selections "('bells', 'and', 'whistles') phrase" or "'bells and whistles' any/all/phrase". Multi-token phrase lookup for "bells and whistles" will not be performed for "('bells', 'and', 'whistles') all word" or "'various bells and whistles' phrase". Multi-token phrase lookups take precedence over single-token lookups: once a multi-token phrase lookup is performed, no more thesauri lookups will be performed.
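The precedence rule in this resolution can be sketched as follows. This is a simplified illustration, not the draft's semantics: the dictionary-based `Thesaurus` type and the function `expand_phrase` are hypothetical, and keeping the original phrase among the alternatives is an assumption.

```python
from typing import Dict, List

# Hypothetical thesaurus: maps a word or a space-separated phrase to
# its replacement phrases.
Thesaurus = Dict[str, List[str]]


def expand_phrase(tokens: List[str], thesaurus: Thesaurus) -> List[List[str]]:
    """Return alternative token sequences for a user-specified phrase."""
    phrase = " ".join(tokens)
    if len(tokens) > 1 and phrase in thesaurus:
        # A multi-token lookup succeeded: it takes precedence, and no
        # single-token lookups are performed afterwards.
        return [tokens] + [alt.split() for alt in thesaurus[phrase]]
    # Otherwise fall back to single-token lookups, one token at a time.
    alternatives: List[List[str]] = [[]]
    for tok in tokens:
        choices = [tok] + thesaurus.get(tok, [])
        alternatives = [prefix + c.split() for prefix in alternatives for c in choices]
    return alternatives
```

With a thesaurus containing entries for both "bells and whistles" and "bells", the phrase `["bells", "and", "whistles"]` expands only through the multi-token entry; the single-token entry for "bells" is consulted only when no multi-token entry matches.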
Issue: exactly, priority: , status: closed
Exactly in FTRangeSpec (Cluster F, Issue 43)
Should "exactly" be optional? Should we allow both "word distance 6" and "word distance exactly 6"? Raised at Redmond May 2004 by Steve Buxton and Pat Case.
Resolution:
CLOSED.
Exactly is required and is not optional.
Issue: ftcontains-semantics, priority: , status: closed
FTContains Semantics (VNext, Cluster H, Issue 44)
FTContains operates on a sequence of nodes. Strings cannot be searched.
Raised at Redmond May 2004 by Steve Buxton. See also: http://lists.w3.org/Archives/Member/member-query-fttf/2004May/0085.html
Resolution:
CLOSED.
Postponed to VNext.
Issue: matchoptions-defaults, priority: , status: closed
MatchOptions Default (Cluster D, Issue 45)
We need to specify defaults for MatchOptions. We should align this default with the static context for XQuery/XPath and add to the XQuery prolog corresponding declarations to set query-wide defaults.
Resolution:
CLOSED.
Added match option declarations to prolog (see Section 2.3 Extensions to the Static Context) and static context components for match options (see Appendix C Static Context Components). Closed at F2F Meeting 80: http://lists.w3.org/Archives/Member/member-query-fttf/2005May/att-0030/fttf-20050503.txt.
Issue: ftnegation-semantics, priority: , status: closed
FTNegation Semantics (Cluster K, Issue 46)
We need to specify the semantics of FTNegation.
Raised by Jochen Doerre. See http://lists.w3.org/Archives/Member/member-query-fttf/2004May/0082.html
Resolution:
CLOSED.
We decided to use <allMatches/> to denote false. See answer to http://lists.w3.org/Archives/Member/member-query-fttf/2004May/0082.html
Issue: zero-length-phrase, priority: , status: closed
Zero-length phrase (Cluster G, Issue 47)
If Expr in FTWords results in the empty sequence or the tokenization results in a zero-length phrase, the result is? Always a match, never a match? Depending on the keyword?
Resolution:
CLOSED.
As agreed by the FTTF, an FTWords with an empty list of search tokens returns an empty AllMatches. This applies for both the search tokens supplied directly by the user (as an XQuery expression) and the final search tokens after the application of all match options. Closed at FTTF Meeting 70: http://lists.w3.org/Archives/Member/member-query-fttf/2005Feb/0105.html
Issue: stopwordsoptions, priority: , status: closed
Stop words option (Cluster E, Issue 48)
The syntax and semantics of stop words are still under discussion.
3.2.5 FTStopwordOption is inconsistent with the grammar and semantics.
the second example includes "without stop words" NOT followed by an expression, which is not valid according to the EBNF (see also the default options query in 3.2 FTMatchOptions)
the keyword "additional" is not part of the current grammar
the text and examples in 3.2.5 FTStopwordOption imply that queries work as though stop words were removed from documents before positions are calculated, which is inconsistent with the description in 4.2.4 FTStopWordOption
Resolution:
CLOSED.
We changed the syntax of the stop words specification to allow for using a URI as a stop word list. The new syntax is given in: http://lists.w3.org/Archives/Member/member-query-fttf/2005Jan/0109.html
Issue: grammar-precedence, priority: , status: closed
Grammar Precedence and Lookahead (Cluster F, Issue 49)
When integrating the XQuery Full-Text grammar with the XQuery 1.0 grammar, there were a number of challenges. Challenges include (using pseudo-code for examples):
The Full-Text operators must have the correct precedence (binding order) with respect to XQuery operators
It must be possible to override the default precedence of the Full-text operators - e.g. you must be able to express "(cat and dog) or mouse" as well as "cat and (dog or mouse)"
You must be able to embed XQuery expressions in the Full-Text expression, e.g. "cat and $i"
You must be able to embed the XQuery Full-Text expression in an arbitrarily-complex XQuery expression, e.g. "where title ftcontains ('dog' and 'cat') and price/dollars < 3 or disclaimer ftcontains 'buy this'"
The Working Groups discussed a number of ways of achieving this. The current grammar satisfies these requirements at the cost of introducing ambiguity in one place. The current XQuery 1.0 grammar is LL(1), i.e. it is possible to write a parser that reads a query from left to right and only looks 1 token ahead. But the XQuery Full-Text grammar is NOT LL(1). At [PROD: 149] the parser must look ahead a full non-terminal: it must try to expand FTWords, and if that fails it must try to expand (FTSelection).
This is still under discussion - the Working Groups may remove the requirement for lookahead in a future publication.
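The lookahead problem described above can be pictured with a small backtracking parser. This is only a sketch over a toy grammar invented for illustration: the productions `Selection`, `Primary` and `Words`, the comma-separated word list standing in for an embedded XQuery expression, and the equal, left-associative precedence of `&&` and `||` are all assumptions, not the grammar of this specification.

```python
import re

# Toy grammar (hypothetical, for illustration only):
#   Selection ::= Primary (("&&" | "||") Primary)*
#   Primary   ::= Words | "(" Selection ")"
#   Words     ::= String | "(" String ("," String)* ")"
TOKEN = re.compile(r'\s*("[^"]*"|&&|\|\||[(),])')


def tokenize(text):
    """Split the query text into string literals, operators and punctuation."""
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if not m:
            raise SyntaxError(f"unexpected input at offset {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens


def parse_words(toks, i):
    """Return (tree, next_index), or None if Words does not match at i."""
    if i < len(toks) and toks[i].startswith('"'):
        return ("words", [toks[i]]), i + 1
    if i < len(toks) and toks[i] == "(":
        j, items = i + 1, []
        while j < len(toks) and toks[j].startswith('"'):
            items.append(toks[j])
            j += 1
            if j < len(toks) and toks[j] == ")":
                return ("words", items), j + 1
            if j < len(toks) and toks[j] == ",":
                j += 1
            else:
                return None
    return None


def parse_primary(toks, i):
    # A "(" cannot be classified from one token: try Words first ...
    attempt = parse_words(toks, i)
    if attempt is not None:
        return attempt
    # ... and only if that fails, fall back to "(" Selection ")".
    if i < len(toks) and toks[i] == "(":
        tree, j = parse_selection(toks, i + 1)
        if j < len(toks) and toks[j] == ")":
            return tree, j + 1
    raise SyntaxError(f"cannot parse a primary at token {i}")


def parse_selection(toks, i):
    tree, i = parse_primary(toks, i)
    while i < len(toks) and toks[i] in ("&&", "||"):
        op = toks[i]
        right, i = parse_primary(toks, i + 1)
        tree = (op, tree, right)
    return tree, i


def parse(text):
    toks = tokenize(text)
    tree, i = parse_selection(toks, 0)
    if i != len(toks):
        raise SyntaxError("trailing input")
    return tree
```

In this toy setting, `parse('("cat" && "dog") || "mouse"')` first attempts `Words` at the opening parenthesis, abandons it on reaching `&&`, and then re-reads the same tokens as a parenthesized selection. That re-reading is the kind of unbounded lookahead an LL(1) parser cannot perform.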
Resolution:
CLOSED.
We decided to use {} to delimit XQuery expressions inside XQuery Full-Text ones according to the discussion in http://lists.w3.org/Archives/Member/member-query-fttf/2005Mar/0019.html
Issue: ignore-queries, priority: , status: closed
IGNORE Queries (VNext, Cluster B, Issue 50)
There are 3 main issues with IGNORE queries:
Do we need to specify the UnionExpr that follows IGNORE in the grammar?
Yes, we do.
This issue has been resolved in http://lists.w3.org/Archives/Member/member-query-fttf/2004Aug/0059.html
Should IGNORE be made composable with other FTSelections or should it be kept at the top level in the grammar?
Does the semantics of level-by-level IGNORE (used in the Use Cases document) differ from the semantics of IGNORE in the language document?
Resolution:
CLOSED.
Point 1. We are using UnionExpr in the current syntax.
Point 2. Composing FTIgnore at any level of FTSelections is too complex at this stage and should be postponed to VNext after we have some implementation experience. See examples where semantics of composing ignore with FTSelection is not clear.
Point 3. No, the semantics of FTIgnore is now the level by level semantics. See minutes in http://lists.w3.org/Archives/Member/member-query-fttf/2005May/0007.html
Issue: ftwindow-alternative-semantics, priority: , status: closed
Alternative Semantics for FTWindow (Cluster I, Issue 51)
The current semantics of FTWindow does not capture the most intuitive notion of window as a matching constraint. Suppose we have a simple query like:
"Internet" && "Cafe"
,
and we want to restrict a match to, say, a window of 5. The interpretation I think is most natural for this query is that it restricts each match so that it is required to "lie" within a "window of 5 (word) positions" (but we could also use sentence or paragraph as the position unit). Note that this does not imply that the search terms have to be a certain distance apart in any way. The window in which a match can be found exists independently of the match. In our current semantics, however, the "window" is defined by the first and last stringInclude (matching term) position. This allows us to constrain the window size using "exactly", "at most" and "at least". I find this notion of window counter-intuitive and confusing. Finding a match in a larger window should always be a weaker condition than finding a match in a smaller window!
The difference in the notion of window also comes to bear when looking at queries with negative parts. A query like:
"Internet" && ! "Cafe" within window 5
,
has the very intuitive meaning of searching for any occurrence of "Internet" such that inside some window of 5 positions that includes that occurrence there is not an occurrence of "Cafe". With our current notion of window, such a query simply cannot be expressed.
Here is the formalization of the proposed window semantics.
define function fts:ApplyFTWordWindow(
    $matchOptions as element(matchOptions, fts:FTMatchOptions),
    $allMatches as element(allMatches, fts:AllMatches),
    $n as xs:integer
) as element(allMatches, fts:AllMatches)
{
  <allMatches>
  {
    for $match in $allMatches/match
    let $minpos := fn:min($match/*/tokenInfo/@pos),
        $maxpos := fn:max($match/*/tokenInfo/@pos)
    for $windowStartPos in ($minpos to $maxpos - $n + 1)
    let $windowEndPos := $windowStartPos + $n - 1
    where fn:min($match/stringInclude/tokenInfo/@pos) >= $windowStartPos
      and fn:max($match/stringInclude/tokenInfo/@pos) <= $windowEndPos
    return
      <match>
        {$match/stringInclude}
        {
          for $stringExclude in $match/stringExclude
          where $stringExclude/tokenInfo/@pos >= $windowStartPos
            and $stringExclude/tokenInfo/@pos <= $windowEndPos
          return $stringExclude
        }
      </match>
  }
  </allMatches>
}
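The intended behaviour of the proposed window can also be sketched outside the AllMatches model. The Python below follows the prose description of the proposal rather than transcribing the function above. It assumes whitespace-separated tokens with zero-based word positions, and it lets a window extend past either end of the document. The function names are hypothetical.

```python
def positions(tokens, word):
    """Zero-based positions at which `word` occurs."""
    return [i for i, tok in enumerate(tokens) if tok == word]


def both_within_window(tokens, a, b, n):
    """True if some window of n consecutive positions holds an
    occurrence of `a` and an occurrence of `b`."""
    return any(
        abs(p - q) + 1 <= n
        for p in positions(tokens, a)
        for q in positions(tokens, b)
    )


def present_without(tokens, wanted, unwanted, n):
    """True if some occurrence of `wanted` lies in a window of n
    positions that contains no occurrence of `unwanted`."""
    bad = set(positions(tokens, unwanted))
    for p in positions(tokens, wanted):
        # Every window of size n that includes position p.
        # Assumption: windows may run past the document boundaries.
        for start in range(p - n + 1, p + 1):
            if not bad.intersection(range(start, start + n)):
                return True
    return False


doc = "the Internet has a Cafe".split()
print(both_within_window(doc, "Internet", "Cafe", 3))
print(both_within_window(doc, "Internet", "Cafe", 5))
print(present_without("Cafe Internet Cafe".split(), "Internet", "Cafe", 2))
```

Under these assumptions a positive query that succeeds for a window of size n also succeeds for every larger window, which is the monotonicity the proposal asks for.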
Raised by Jochen.
Resolution:
CLOSED.
The proposed new semantics has been accepted. Closed at F2F Meeting 84: http://lists.w3.org/Archives/Member/member-query-fttf/2005Jul/0061.html.
Issue: with-stop-words-UnionExpr, priority: , status: open
UnionExpr in StopWords (Cluster D, Issue 52)
The change from "UnionExpr" to "some complicated rewrite of UnionExpr that only includes literals" makes the grammar more complex, makes the language less clear and comprehensible, and adds only some questionable optimization possibilities (the query may be optimizable statically instead of at runtime).
Raised by Steve Buxton in http://lists.w3.org/Archives/Member/member-query-fttf/2004Dec/0065.html
Resolution:
None recorded.
Issue: matchoptions-default-functions, priority: , status: closed
Functions returning defaults for match options (VNext, Cluster D, Issue 53)
We would like to create functions that return the defaults for match options. Each implementation may choose different default values for match options. The purpose of these functions is to query and return those defaults.
This issue was raised at the Dec. 2004 F2F in http://lists.w3.org/Archives/Member/member-query-fttf/2004Dec/0072.html
Resolution:
CLOSED.
If this functionality is pursued, it will be pursued in VNext, and most likely in XQuery rather than in XQuery and XPath Full-Text, because users who want to query for defaults will be interested in the defaults for both XQuery and XQuery and XPath Full-Text.
See F2F minutes in http://lists.w3.org/Archives/Member/member-query-fttf/2005Jul/0049.html
Issue: weight-granularity, priority: , status: closed
Weight Granularity in Scoring (Cluster A, Issue 54)
Michael Rys: Should we permit weights to be expressed at the level of FTContainsExpr and FTSelection or should we only permit them at the level of individual terms (FTWords)?
Resolution:
CLOSED.
Resolved by Cluster A, Issue 17
Issue: specialcharacters, priority: , status: closed
Special Characters (VNext, Cluster E, Issue 55)
We removed the special characters match option from the current draft and we will consider it for VNext.
Discussion initiated in http://lists.w3.org/Archives/Member/member-query-fttf/2004Dec/0072.html
Resolution:
CLOSED.
Postponed to VNext.
Issue: scoring-corpus, priority: , status: closed
Scoring Corpus (VNext, Cluster A, Issue 56)
Do we want to allow users to specify a scoring corpus such as in:
Discussion initiated in http://lists.w3.org/Archives/Member/member-query-fttf/2004Dec/0072.html
Resolution:
CLOSED.
Postponed to VNext.
Issue: collations-match-option, priority: , status: closed
Collations Match Option (VNext, Cluster D, Issue 57)
Currently, XQuery 1.0 and XPath 2.0 Full-Text depends on the collation chosen in XQuery. It can be modified by the FTCaseOption and the FTDiacriticsOption match options. We need to explore the interaction of the collation with FTLanguageOption. Presumably, the latter will change the collation. What if more than one collation is available for a given language? Moreover, if we decide to reintroduce the FTSpecialCharsOption (see Issue 24), there might be different collations that treat special characters differently.
One approach is to have a FTCollationOption:
FTCollationOption ::= "using"? "collation" CollationUri
Another option is to have a collation only associated with FTLanguageOption (and possibly a future version of FTSpecialCharsOption):
FTLanguageOption ::= "language" UnionExpr ("collation" CollationUri)?
Resolution:
CLOSED.
Postponed to VNext.
Issue: ft-about-operator, priority: , status: closed
Free-text search operator ft-about (VNext, Cluster H, Issue 58)
While 'ftcontains' is aimed at supporting full-text queries in XPath/XQuery, it lacks Internet-style IR searches, in which users cannot express their needs precisely and would like the search engine to return close matches as well. For example, a user need such as "+cat dog", which asks for documents that contain 'cat' but ranks those that also contain 'dog' higher, is hard to express with 'ftcontains'. With 'ftcontains' one can express either ftcontains(cat and dog), which returns only documents that contain both cat and dog, or ftcontains(cat or dog), which returns documents that contain cat or dog. Neither is what the user needs.
In order to express the above user need in XQuery-FT one needs to separate the user need into a filtering part and a scoring part. For example, the XQuery syntax to find all books with a title that contains a 'cat', but give those that contain also a 'dog' a higher ranking is shown below.
for $book in /books/book
where $book/title ftcontains "cat"
let $score := ft:score($book/title ftcontains "cat" || "dog")
where $score > 0
order by $score
return $book
The first 'ftcontains' is needed for filtering while the second 'ftcontains' is for scoring. We see that the search arguments of the two 'ftcontains' predicates are quite similar with the difference that the first contains 'cat' while the second contains also 'dog'. In general, this results in rather complex queries that redundantly have to repeat the same query terms in different query parts, even for simple user needs.
A proposal to overcome this problem has been put forward by Yosi Mass (IBM Research) and Jochen Doerre: a free-text operator 'ft-about' that allows Internet-style searches to be specified directly. For instance, the query above could be expressed without duplication as:
for $book in /books/book let $score := ft:score($book/title ft-about "+cat dog") where $score > 0 order by $score return $book
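The "+cat dog" reading described above (terms marked with "+" filter, all terms contribute to ranking) can be sketched as follows. Everything here is an assumption made for illustration: the function names, the whitespace tokenization, the case folding, and the toy score (the fraction of query terms present in the title). The proposal itself does not prescribe any of these, and results are returned best first.

```python
def parse_about(query):
    """Split an Internet-style query into required ("+term") and optional terms."""
    terms = query.lower().split()
    required = {t[1:] for t in terms if t.startswith("+")}
    optional = {t for t in terms if not t.startswith("+")}
    return required, optional


def about(titles, query):
    """Return the titles containing every required term, best score first."""
    required, optional = parse_about(query)
    all_terms = required | optional
    if not all_terms:
        return []
    ranked = []
    for title in titles:
        words = set(title.lower().split())
        if not required <= words:
            continue  # filtering part: a required term is missing
        # Scoring part (toy formula): share of query terms found in the title.
        score = len(all_terms & words) / len(all_terms)
        ranked.append((score, title))
    ranked.sort(key=lambda pair: pair[0], reverse=True)
    return [title for _, title in ranked]


titles = ["Dog training", "The cat", "Cat and dog stories"]
print(about(titles, "+cat dog"))
```

The query terms appear only once, whereas the two-predicate formulation has to repeat "cat" in both the filtering and the scoring expression.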
For more details see http://lists.w3.org/Archives/Member/member-query-fttf/2004Nov/0019.html
Open question: how can ft-about be integrated more tightly with ftcontains?
Raised by Jochen.
Resolution:
CLOSED.
No change to the document. The operator is to be considered for VNext, because it is not clear how it would fit with the grammar and the data model, what an ft-about search would do, or how tightly we could define it.
See F2F minutes in http://lists.w3.org/Archives/Member/member-query-fttf/2005Jul/0049.html
Issue: error-codes, priority: , status: closed
Error Codes (Cluster J, Issue 59)
XPath 2.0, XQuery 1.0, and associated documents use 8-character error codes. The Full-Text spec uses 6-character error codes. Full-Text must be brought up to date and use 8-character error codes.
Resolution:
CLOSED.
Error codes have been adapted.
Issue: extended-scoring, priority: , status: closed
Extended Scoring (Cluster A, Issue 60)
Motivation:
The proposal extends the previous (SCORE AS) scoring proposal in two ways.
It provides an iterator that can iterate over an item *and* its score in a single construct.
It allows users to relax the semantics of XPath/XQuery expressions so that users can obtain "fuzzy" results along with their scores.
The benefit of (1) is that it makes queries easier to write since the score is directly attached to an item in a single construct. This is in contrast to the original scoring proposal, where a FOR clause is required to iterate over items and a separate SCORE AS clause is required to score the items.
The benefit of (2) is that it generalizes traditional Information Retrieval (IR) and can capture the class of queries used by the XML IR community (notably INEX). In traditional IR, end-users who ask keyword queries are perfectly happy when the system returns documents that contain those keywords or stemmed versions of those keywords or synonyms. By analogy, XQuery FT (like INEX) should allow the possibility of interpreting XQuery expressions (i.e., queries on both content and structure) in a fuzzy way on behalf of users. This should make the query specification less cumbersome for users since they do not have to (and may not be able to) explicitly specify all the query variants.
The proposal can be divided into three separate parts that can be decided upon independently.
Extend FOR clause with a score binding option.
Proposed syntax:
for $res scored $score in EXPR
Semantics:
Let S be the result of evaluating the XQuery expression EXPR.
Like a normal FOR clause, this clause iterates over each item in S and binds $res to the item, but it also binds $score to the score of the item.
As in the SCORE AS clause, we need to assume a second-order function for the evaluation of EXPR. E.g., it makes a difference whether EXPR is just a function call which evaluates to some sequence, or is the equivalent body of the function. Only in the latter case can the scoring take the evaluation of the function body into account.
We might want to restrict the kinds of expressions allowed in EXPR as discussed below in 3.
Add FUZZY keyword to scoring constructs.
Proposed syntax (when combined with extended for):
for $res scored $score in fuzzy EXPR
(when combined with SCORE AS):
score $s as fuzzy EXPR
Motivation: allow for query relaxation based on relevance; find also items relevant to the query that are near matches.
Semantics: Based on super-sequence (as specified in http://lists.w3.org/Archives/Member/member-query-fttf/2005Mar/0152.html). Any result of the non-scoring variant of the expression is a result, but there may be more.
Should we restrict expressions over which scoring is done to, say, Boolean combinations of FTContainsExpr?
This applies to both the proposed new FOR syntax, as well as the SCORE AS clause.
Note: Because scoring semantics is completely implementation-dependent, implementations are free to simply ignore the embedding of search expression inside XPaths, for instance.
In combination with 1., extend LET clauses with a score binding option, just as 1. extends FOR clauses.
This syntax replaces the SCORE AS clause.
Proposed syntax:
let $v := EXPR scored $score
In contrast to the SCORE AS clause, where EXPR is evaluated only to calculate a score but not a result, this syntax allows a pair (result, score) to be calculated, as is done in the FOR clause extension above.
Benefits of extended scoring:
Can score over all of XQuery (note that Expr can be an arbitrary XQuery expression).
Can support query relaxation using the FUZZY keyword.
Makes the syntax of queries simpler (illustrated below).
Can be integrated with XPath.
Can be combined with the ORDER BY clause of FLWOR to sort results based on their scores.
Examples:
One of the main motivations for the scoring FOR proposal is the ability to express XML information retrieval queries such as the INEX queries (see http://inex.is.informatik.uni-duisburg.de:2004/). INEX is an effort that collects XML documents to assess scoring methods for XML in the same way as TREC was defined for assessing keyword search.
We give some examples below and explain the syntax/semantics of the new scoring construct.
Query 1: Find articles on "Usability"
Expressed using scoring FOR:
for $result scored $score in //article[. ftcontains "Usability"] return <result score="{$score}">{$result}</result>
The above query returns all articles and their scores, where the score is computed with respect to the predicate: . ftcontains "Usability".
Expressed using SCORE AS:
for $result in //article score $score as $result ftcontains "Usability" return <result score="{$score}">{$result}</result>
This illustrates how "scoring FOR" has a more compact syntax than SCORE AS when scoring over Boolean expressions such as ftcontains.
Query 2: Find articles and other documents on "Usability".
Expressed using scoring FOR:
for $result scored $score in fuzzy //article [. ftcontains "Usability"] order by $score return <result score = "{$score}">{$result}</result>
The above query returns articles and other documents along with their scores, ordered by score. Note that //article can be interpreted in a fuzzy way since scoring FOR can return a super-sequence of the corresponding XQuery sequence.
Expressed using SCORE AS (Version 1):
for $result in //article score $score as $result ftcontains "Usability" order by $score return <result score = "{$score}">{$result}</result>
The above query only returns articles (and not other documents) that are relevant to "Usability". In this sense, this SCORE AS query is not semantically the same as the above scoring FOR query.
Expressed using SCORE AS (Version 2):
for $result in //*
score $score as tagname($result) = "article" and $result ftcontains "Usability"
order by $score
return <result score = "{$score}">{$result}</result>
The above query returns all elements (not just articles) ordered by the score of how well the element's tag name matches "article" and how relevant it is to "Usability". However, this syntax has two disadvantages compared to the scoring FOR. First, it is clumsier to write than the scoring FOR syntax. Second, the system has to return *all* elements (not just those closely related to article) unless the user performs some explicit filtering based on scores; in contrast, the scoring FOR query only returns elements that are related to articles.
Query 3 (topic 128 in INEX): Find discussions about on-board route planning or navigation systems which are in publications about intelligent transport systems for automobiles.
Expressed using scoring FOR:
for $result scored $score in fuzzy
    //article[. ftcontains "intelligent transport systems"]
    /sec[. ftcontains "on-board route planning navigation system for automobiles"]
return <result score = "{$score}">{$result}</result>
Since scoring FOR interprets the entire expression in a fuzzy way, it can relax the tag names of //article and /sec. In addition, it can relax /sec to //sec to find sections that may be indirectly contained in article.
Expressed using SCORE AS (Version 1):
for $a in //article, $s in $a/sec
score $score as $a ftcontains "intelligent transport systems"
    and $s ftcontains "on-board route planning navigation system for automobiles"
return <result score = "{$score}">{$s}</result>
Using this version of SCORE AS, we cannot support relaxations of the tag names of //article and /sec. Further, SCORE AS cannot explicitly support the relaxation of /sec to //sec. Note that simply replacing $a/sec with $a//sec is not semantically equivalent because, in this case, $a/sec will *not* be ranked higher than $a//sec (which it will be in the case of scoring FOR).
Expressed using SCORE AS (Version 2):
for $a in //*, $s in $a/*
score $score as tagname($a) = "article" and tagname($s) = "section"
    and $a ftcontains "intelligent transport systems"
    and $s ftcontains "on-board route planning navigation system for automobiles"
return <result score = "{$score}">{$s}</result>
This version of SCORE AS is less readable than the version using scoring FOR. Also, it is not possible to relax $a/* to $a//* (as in Version 1) without losing some scoring semantics.
Raised by Sihem and Jai in http://lists.w3.org/Archives/Member/member-query-fttf/2005Mar/0152.html
Resolution:
CLOSED.
1. Use in definition of score: Score is between 0 and 1 regardless of whether the ftcontains expression returns true or false. Score is inherently fuzzy. Can compute a score independently of computing the Boolean value.
2. Use two syntaxes for score, replacing the current syntax:
1) Use to return exactly what the Boolean returns.
for $b score $s in //books[. ftcontains "dog"] return <r>{$b, $s}</r>
2.a) Use to return more or less than the Boolean returns. To use fuzzy within score. Must have one let clause, could have more than one.
for $b in //books let score $s := $b ftcontains "dog" let $t := $b ftcontains "dog" return <r>{$b, $s}</r>
2.b) Use to return more or less than the Boolean returns. To use fuzzy within score.
for $b in //books let $t score $s := $b ftcontains "dog" return <r>{$b, $s}</r>
3. Use this semantics:
for $res score $s in Expr
has the semantics of
let $scoreSeq := fts:scoreSequence(Expr) for $res at $i in Expr let $s := $scoreSeq[$i]
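The equivalence stated in point 3 amounts to pairing each item of the result with the entry at the same index of a score sequence. A minimal Python sketch of that pairing follows. The `score_sequence` helper stands in for fts:scoreSequence, and the toy scorer (the share of tokens equal to "dog") is purely hypothetical, since scoring is left to the implementation. The only property the sketch takes from the resolution is that a score lies between 0 and 1.

```python
def score_sequence(items, scorer):
    """Stand-in for fts:scoreSequence: one score per item, in item order."""
    return [scorer(item) for item in items]


def for_score(items, scorer):
    """Yield (item, score) pairs, pairing by position as in point 3.

    XQuery's "at $i" is 1-based; Python indices here are 0-based.
    """
    scores = score_sequence(items, scorer)
    for i, res in enumerate(items):
        yield res, scores[i]


def toy_scorer(text):
    """Hypothetical score in [0, 1]: share of tokens equal to "dog"."""
    tokens = text.lower().split()
    return tokens.count("dog") / len(tokens) if tokens else 0.0


books = ["dog", "big dog", "cat"]
matching = [b for b in books if "dog" in b.split()]  # Boolean filter
for book, score in for_score(matching, toy_scorer):
    print(book, score)
```

Because the items come from the filtered sequence, this variant returns exactly what the Boolean predicate returns, with a score attached to each item.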
See F2F minutes in http://lists.w3.org/Archives/Member/member-query-fttf/2005Jul/0049.html
Issue: desired-fttimes-semantics, priority: , status: open
Desired semantics of FTTimes (Cluster G, Issue 61)
Consider the document
cat cat
and the query
("cat" && "cat") occurrences exactly 4
Currently, it returns true. Is this the desired semantics for FTTimes? If yes, how do we explain it in the language document?
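One way to see where a count of 4 could come from is sketched below in Python. It assumes that the conjunction pairs every match of its left operand with every match of its right operand and that FTTimes counts the resulting pairs without removing duplicates. This is an illustrative reading of the reported behaviour, not a statement of the normative semantics, and the function names are hypothetical.

```python
from itertools import product


def positions(tokens, word):
    """Zero-based positions at which `word` occurs."""
    return [i for i, tok in enumerate(tokens) if tok == word]


def and_matches(tokens, left, right):
    # Assumption: each match of the left operand is combined with each
    # match of the right operand, and no duplicates are removed.
    return list(product(positions(tokens, left), positions(tokens, right)))


def occurrences_exactly(tokens, left, right, n):
    """Count the combined matches and compare the count with n."""
    return len(and_matches(tokens, left, right)) == n


def distinct_position_sets(tokens, left, right):
    # Hypothetical alternative: identify a match with its set of positions.
    return {frozenset(pair) for pair in and_matches(tokens, left, right)}


tokens = "cat cat".split()
print(and_matches(tokens, "cat", "cat"))
print(occurrences_exactly(tokens, "cat", "cat", 4))
print(len(distinct_position_sets(tokens, "cat", "cat")))
```

With two occurrences of "cat", the pairing yields 2 x 2 = 4 combinations, while counting distinct position sets would yield 3. The gap between the two counts is the kind of question this issue raises.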
Resolution:
None recorded.
Issue: doublenegation-semantics, priority: , status: closed
Precise semantics of double negation (Cluster G, Issue 62)
Currently, (! ! Q) does not produce the same AllMatches as (Q). There seem to be two reasons for that. First, there are duplicate StringIncludes, StringExcludes, and Matches. Second, there are Matches that are subsumed by other Matches (i.e. the former are a logical consequence of the latter). How do we handle these situations? It seems reasonable to expect that !!Q produces the same result as Q.
Resolution:
CLOSED.
AllMatches returned for FTSelections are now subject to a Normal Form, which ensures that (! ! Q) behaves equivalently to (Q) in all contexts (see Section 4.3.1.4 Match and AllMatches Normal Form).
Issue: phrases-with-distance, priority: , status: open
Distance constraints do not work on phrases (Cluster G, Issue 63)
It is not possible to combine the distance operation with searching for phrases.
Example:
[. ftcontains "Redmond-based" && "company" distance at least 2]
The problem is that a phrase is itself resolved internally into a distance operation, which can impose a requirement that contradicts the explicit distance operation used in the query. Here is the resolved query (with some assumptions on tokenization):
[. ftcontains ("Redmond" && "-" && "based" ordered with distance 0) && "company" distance at least 2]
The second distance constraint is then imposed on all individual tokens (including those from the phrase) and hence cannot be satisfied, so the query will return false.
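The contradiction can be checked mechanically. The Python sketch below makes three assumptions: the distance between two tokens is the number of tokens lying between them, a phrase is a run of adjacent positions (distance 0), and a "distance at least k" constraint is tested on each pair of neighbouring matched positions. The sample sentence and the function names are hypothetical.

```python
def word_distance(p, q):
    """Number of tokens strictly between positions p and q."""
    return abs(p - q) - 1


def is_phrase(positions):
    """Phrase tokens must be in order with distance 0 between neighbours."""
    return all(b - a == 1 for a, b in zip(positions, positions[1:]))


def distance_at_least(positions, k):
    """Apply "distance at least k" to every pair of neighbouring positions."""
    ordered = sorted(positions)
    return all(word_distance(a, b) >= k for a, b in zip(ordered, ordered[1:]))


tokens = ["Redmond", "-", "based", "software", "and", "services", "company"]
phrase = [0, 1, 2]   # "Redmond", "-", "based"
company = 6

print(is_phrase(phrase))
# Constraint applied to all individual tokens, as the current semantics does:
print(distance_at_least(phrase + [company], 2))
# Constraint applied between the end of the phrase and "company" only:
print(distance_at_least([phrase[-1], company], 2))
```

Any match that includes the three adjacent phrase tokens contains a neighbouring pair at distance 0, so under these assumptions a token-level "at least 2" constraint fails even though the phrase and "company" are far enough apart when the phrase is treated as a unit.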
Resolution:
None recorded.
Issue: relativedefaults, priority: , status: open
System Relative Operator Defaults (Cluster E, Issue 64)
Do we want to add system relative operator defaults? Do we want to add "closer" and "farther" to FTDistance for novice users who do not want to enter specific numbers of intervening words, sentences, and paragraphs? Similar relative defaults might also be added to other operators which call FTRange and score weighting.
Raised by Dana Florescu at Redmond Face to Face Meeting on July 15, 2005.
Resolution:
None recorded.
Issue: nested-ftnegations-right-side-ftmildnegation, priority: , status: closed
Nested FTNegations on Right Side of FTMildNegation (VNext, Cluster I, Issue 65)
Resolution:
CLOSED.
Raise a dynamic error semantically if there is a StringExclude on the right side of an FTMildNegation. Users can replace "&& not" with FTMildNegation. Possibly reconsider for VNext.
No changes required. Closed at F2F Meeting 84: http://lists.w3.org/Archives/Member/member-query-fttf/2005Jul/0061.html
Sihem Amer-Yahia | 2005-04-08 | Updated case matrix | Updated case matrix row "sensitive", column "CCI" from "case-insensitive variant of CCI if it exists, else error" to "case-sensitive variant of CCI if it exists, else error". |
Sihem Amer-Yahia | 2005-05-02 | Closed issues with no changes | Closed Cluster B, Issue 28 IGNORE Syntax with no change to the document. Closed Cluster B, Issue 50 IGNORE Queries with no change to the document. |
Sihem Amer-Yahia | 2005-05-02 | Updated FTTimes syntax | Closed Cluster G, Issue 14 FTTimesSelection and added a related bullet item in Section 3. |
Sihem Amer-Yahia | 2005-05-02 | Updated FTWildCard syntax | Updated FTWildCardOption in Section 3. |
Sihem Amer-Yahia | 2005-05-03 | Updated introduction | Replaced "semantic element" with "semantic markup" and "tag" with "element" in the introduction. |
Sihem Amer-Yahia | 2005-05-03 | Added issue on error codes | Added Cluster J, Issue 59 Error Codes. |
Sihem Amer-Yahia | 2005-05-03 | Closed issues with no change | Closed Cluster A, Issue 54 Weight Granularity in Scoring with same resolution as for Cluster A, Issue 5 Score Weighting, no further change to document. Closed Cluster H, Issue 9 Window with no change to the document. Closed Cluster H, Issue 19 FTScopeSelection on structure with no change to the document. Closed Cluster E, Issue 25 MatchOption Syntax with no change to the document. Closed Cluster H, Issue 44 FTContains Semantics with no change to the document. |
Sihem Amer-Yahia | 2005-05-03 | Updated FTContent syntax | Updated FTContent adding "entire content", Closed Cluster C, Issue 39 Exact Element Content. |
Sihem Amer-Yahia | 2005-05-03 | Closed issue on Boolean Naming | Closed Cluster F, Issue 38 Boolean Naming. Changes to the document are pending awaiting a decision on whether it is OK to use "and", "or", "not" for full-text. If so change existing symbols to "and", "or", "not". If not change existing symbols to "ftand", "ftor", "ftnot". |
Chavdar Botev | 2005-05-03 | Updated FTDistance semantics | Updated the semantics for distance. |
Sihem Amer-Yahia | 2005-05-03 | Updated FTRange syntax | Made "exactly" required before an exact number in FTRange. Closed Cluster F, Issue 43 Exactly in FTRangeSpec. |
Sihem Amer-Yahia | 2005-05-04 | Closed issue on collations | Closed Cluster D, Issue 57 Collations Match Option. |
Jochen Doerre | 2005-05-19 | Added issue on scoring | Added Cluster A, Issue 60 Extended Scoring. |
Chavdar Botev | 2005-06-29 | Added issue on FTNegation | Added Cluster G, Issue 62 Precise semantics of double negation. |
Chavdar Botev | 2005-06-29 | Added issue on FTTimes | Added Cluster G, Issue 61 Desired semantics of FTTimes. |
Sihem Amer-Yahia | 2005-07-11 | Updated FTMildNegation syntax | Updated the mild not syntax from "mild not" to "not in". Closed Cluster I, Issue 10 MildNot and Cluster F, Issue 41 Mildnot Naming. |
Chavdar Botev | 2005-07-12 | Updated FTIgnore semantics | Changed semantics of FTIgnoreOption. |
Sihem Amer-Yahia | 2005-07-18 | Corrected error codes | Corrected and added error codes, closing and implementing the resolution for Cluster J Issue 59 Error Codes. |
Sihem Amer-Yahia | 2005-07-18 | Closed issues with no changes | Closed Cluster I, Issue 13 "loose-grammar" leaving the grammar as it is. Closed Cluster D, Issue 53 "matchoptions-default" with no change to the document. Closed Cluster H, Issue 58 "ft-about-operator" with no change to the document. |
Sihem Amer-Yahia | 2005-07-21 | Updated score syntax | Closed Cluster A, Issue 60 "new-scoring-proposal" and Issue 2 "scoring-values" and updated Section 2.2 Score Clause to reflect new score syntaxes. There are now syntaxes for scored queries 1) returning the same results as queries with Boolean predicates and 2) for returning more or fewer results. |
Sihem Amer-Yahia | 2005-07-21 | Added appendix for defaults | Added appendix for defaults in the query prolog analogous to C.1 in the XQuery language document. |
Sihem Amer-Yahia | 2005-07-21 | Updated FTThesaurus section | Aligned description in Section 3.2.4 FTThesaurusOption with current grammar. |
Sihem Amer-Yahia | 2005-07-21 | Opened and closed issue on nested FTNegation | Opened and closed Cluster I, Issue 65 Nested FTNegations on the right side of an FTMildNegation. |
Chavdar Botev | 2005-07-25 | Updated FTMildNegation semantics | Changed the semantics of MildNot. |
Sihem Amer-Yahia | 2005-08-10 | Added Change Log | Added Change Log harvesting back entries from CVS change log. |
Jochen Doerre | 2005-08-17 | Grammar changes | Changed XQuery/XPath grammar for new scoring syntax (resolution of Issue 60), for match option defaults in query prolog (resolution of Issue 45), for simplified window operator (resolution to Issue 51), renamed "mild not" to "not in" (resolution of Issue 41), modified FTThesaurusOption, FTStopwordOption and FTLanguageOption to require StringLiterals as decided in May 05 F2F. |
Jochen Doerre | 2005-08-17 | Changes to Section 2 | Introduced new scoring syntax; rewrote most of 2.2. Corrected use of weights in 2.2.1 (wrong default, wrong use of 1.5). |
Jochen Doerre | 2005-08-17 | Changes to Section 3 | Adapted the explanations to the changed syntax for FTWindow, FTThesaurusOption, FTStopwordOption and FTLanguageOption. Also corrected a couple of example explanations. Removed FTIgnoreOption from the list of match option defaults in 3.2. Corrected explanation and example of FTLanguageOption (neither diacritics nor case is language-specific!). Commented out last two examples of FTDistance, because distance 15 does not work for phrases. |
Jochen Doerre | 2005-08-17 | Appendices A+B | Adapted introductory comment about which version of the XQuery/XPath grammars we are aligned to. |
Jochen Doerre | 2005-08-17 | Dates in Header | Adapted current date and previous date and links in full-text-query-language-semantics.xml and in tqheader.xml. |
Jochen Doerre | 2005-08-19 | Added Section 2.3, Changes in 3+4 | Added Section 2.3 Extension to Static Context. Changed Sections 3.2 and 4.4.1.1 to refer to match option settings in the static context. |
Jochen Doerre | 2005-08-19 | Added Issue 63 | Added Cluster G Issue 63: Distance constraints do not work on phrases. |
Jochen Doerre | 2005-08-19 | Changes in Section 4 | Adapted semantics to new scoring feature (resolution of Issue 60), changed FTWindow semantics according to resolution of Issue 51, and cleaned examples. |
Jochen Doerre | 2005-08-19 | Appendix G | Added lines for statically known thesauri and stop lists. |
Jochen Doerre | 2005-08-25 | Added Issue 64 | Added Cluster E Issue 64: System Relative Operator Defaults (using wording proposed by Pat Case). |
Jochen Doerre | 2005-10-10 | Changes in Section 3 | Rephrased Section 3.2.7 FTIgnoreOption. Explanation and example adapted to simple (non-recursive) use of "ignore". |
Jochen Doerre | 2005-10-10 | Changes in Section 4 | Incorporated Section 4.3.1.4 Match and AllMatches Normal Form. |
Sihem Amer-Yahia | 2005-10-12 | Incorporated comments | Incorporated Pat's comments at http://lists.w3.org/Archives/Member/member-query-fttf/2005Sep/0068.html |
Jim Melton | 2005-10-20 | Changes in Sections 3 and 4 | Properly marked up errors and inserted error summary appendix. Re-ordered appendices so normative appendices precede non-normative appendices. |
Jochen Doerre | 2005-10-24 | Final editings | Included corrections to examples in Section 3. Changed meaning of distance 0 for sentences (paragraphs) to mean adjacent. Rework of Appendix H Checklist of Implementation-Defined Features. Resolution texts to issues 45, 59, and 62. |