Is tokenization implementation-defined or implementation-dependent? Wouldn't customers expect to know how tokenization works (and thus want it defined)?
Some vendors traditionally regard details of their algorithms as high-value proprietary information.
I surveyed the existing text and copied out all the instances where we call tokenization implementation-defined or implementation-dependent. I propose the following changes to reflect the FTTF decision that tokenization SHOULD be implementation-defined. I was not sure whether the SHOULD statement belonged in 2.1 Processing Model or in 4.1 Tokenization, so I have placed it in both; however, one of the two could be shortened to refer to the details in the other.

>>In 2.1 Processing Model

As part of the External Processing that is described in the XQuery Processing Model, when an XML document is parsed into an Infoset/PSVI and ultimately into an XQuery Data Model instance, an implementation-defined full-text process called tokenization is usually executed.

>>>>Replace "an implementation-defined full-text process" with "a full-text process".

The tokenization process is implementation-dependent. For example, the tokenization may differ from domain to domain and from language to language. This specification will only impose a very small number of constraints on the semantics of a correct tokenizer. As a consequence, all the examples in this document are only given for explanatory purposes but they are not mandatory, i.e. the result of such full-text queries will of course depend on the tokenizer that is being used.

>>>>Replace with: Tokenization, including the definition of the term "words", SHOULD be implementation-defined. Implementations SHOULD expose the rules and sample results of tokenization as much as possible, to enable users to predict and interpret the results of tokenization. Tokenization MUST conform only to these constraints:
a. Each word MUST consist of one or more consecutive characters;
b. The tokenizer MUST preserve the containment hierarchy (paragraphs contain sentences contain words); and
c. The tokenizer MUST, when tokenizing two equal strings, identify the same tokens in each.
A sample tokenization is used for the examples in this document. The results might be different for other tokenizations.

3. Apply the tokenization algorithm to query string(s). (FT2.1 -- this is implementation-dependent)

>>>>Delete "(FT2.1 -- this is implementation-dependent)".

4. For each search context item:
a. Apply the tokenization algorithm in order to extract potentially matching terms together with their positional information. This step results in a sequence of token occurrences. (FT2.2 -- this is implementation-dependent)

>>>>Delete "(FT2.2 -- this is implementation-dependent)".

>>In 4.1 Tokenization

[Definition: Formally, tokenization is the process of converting the string value of a node to a sequence of token occurrences, taking the structural information of the node into account to identify token, sentence, and paragraph boundaries.] Tokenization is subject to the following constraint: Attribute values are not tokenized.

4.1.1 Examples
The following document fragment is the source document for examples in this section. Tokenization is implementation-defined. A sample tokenization is used for the examples in this section. The results might be different for other tokenizations.

>>>>Replace with: [Definition: Formally, tokenization is the process of converting the string value of a node to a sequence of token occurrences, taking the structural information of the node into account to identify token, sentence, and paragraph boundaries.] Tokenization, including the definition of the term "words", SHOULD be implementation-defined. Implementations SHOULD expose the rules and sample results of tokenization as much as possible, to enable users to predict and interpret the results of tokenization. Tokenization MUST conform only to these constraints:
a. Each word MUST consist of one or more consecutive characters;
b. The tokenizer MUST preserve the containment hierarchy (paragraphs contain sentences contain words); and
c. The tokenizer MUST, when tokenizing two equal strings, identify the same tokens in each.

4.1.1 Examples
The following document fragment is the source document for examples in this section. A sample tokenization is used for the examples in this document. The results might be different for other tokenizations.

>>>>>>Please note that I removed the "Attribute values are not tokenized" constraint, because we do allow attribute values to be tokenized and queried explicitly.

Note: While this matching function assumes a tokenized representation of the search strings, it does not assume a tokenized representation of the input items in $searchContext, i.e. the texts in which the search happens. Hence, the tokenization of the search context is implicit in this function and coupled to the retrieval of matches. Of course, this does not imply that tokenization of the search context cannot be done a priori. Because tokenization is implementation-defined, the tokenization of each item in $searchContext does not necessarily take into account the match options in $matchOptions or the search tokens in $searchTokens. This allows implementations to tokenize and index input data without the knowledge of particular match options used in full-text queries.

>>>>Replace with: Note: While this matching function assumes a tokenized representation of the search strings, it does not assume a tokenized representation of the input items in $searchContext, i.e. the texts in which the search happens. Hence, the tokenization of the search context is implicit in this function and coupled to the retrieval of matches. Of course, this does not imply that tokenization of the search context cannot be done a priori. The tokenization of each item in $searchContext does not necessarily take into account the match options in $matchOptions or the search tokens in $searchTokens. This allows implementations to tokenize and index input data without the knowledge of particular match options used in full-text queries.

>>In Appendix I Checklist of Implementation-Defined Features (Non-Normative)

1. Everything about tokenization, including the definition of the term "words", is implementation-defined, except that
a. each word consists of one or more consecutive characters;
b. the tokenizer must preserve the containment hierarchy (paragraphs contain sentences contain words); and
c. the tokenizer must, when tokenizing two equal strings, identify the same tokens in each.

>>>>Replace with: Tokenization, including the definition of the term "words", SHOULD be implementation-defined. Implementations SHOULD expose the rules and sample results of tokenization as much as possible, to enable users to predict and interpret the results of tokenization. Tokenization MUST conform only to these constraints:
a. Each word MUST consist of one or more consecutive characters;
b. The tokenizer MUST preserve the containment hierarchy (paragraphs contain sentences contain words); and
c. The tokenizer MUST, when tokenizing two equal strings, identify the same tokens in each.
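To make constraints (a)-(c) concrete, here is a minimal, purely illustrative sketch in Python of one possible tokenization strategy that happens to satisfy them. The names TokenOccurrence and tokenize, and the whitespace/punctuation rules used, are hypothetical examples and not part of the proposed wording; the whole point of the proposal is that an implementation may choose any tokenization that meets the three constraints.

import re
from dataclasses import dataclass

@dataclass(frozen=True)
class TokenOccurrence:
    token: str      # the word itself: one or more consecutive characters (constraint a)
    paragraph: int  # index of the containing paragraph
    sentence: int   # index of the containing sentence within its paragraph
    position: int   # overall word position in the input

def tokenize(text: str) -> list:
    """Naive sample tokenizer: blank lines delimit paragraphs, '.', '!' and '?'
    delimit sentences, and runs of word characters are the words."""
    occurrences = []
    position = 0
    for p_idx, paragraph in enumerate(re.split(r"\n\s*\n", text)):
        for s_idx, sentence in enumerate(re.split(r"[.!?]+", paragraph)):
            for match in re.finditer(r"\w+", sentence):
                # The containment hierarchy is preserved: every word records the
                # paragraph and sentence that contain it (constraint b).
                occurrences.append(TokenOccurrence(match.group(), p_idx, s_idx, position))
                position += 1
    return occurrences

# Constraint (c): the function is deterministic and stateless, so tokenizing
# two equal strings always identifies the same tokens.
sample = "Cats sit quietly. Dogs run.\n\nA second paragraph."
assert tokenize(sample) == tokenize(sample)
for occ in tokenize(sample):
    print(occ)

A different implementation could just as well split on hyphens, fold case, or use a language-specific word breaker; as long as the three constraints hold, all of those are acceptable under the proposed wording, which is why the examples in the spec are explicitly tied to one sample tokenization.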
This completes ACTION FTTF-128-04 on Pat: To provide the wording for having Tokenization as implementation-defined be a "should".
The TF accepted the proposed solution and resolved to close this bug. Because you were present when the TF made this decision, we presume that you are satisfied with the action.