This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019.
Hi, I would like to know how the StopWords and Thesaurus options are to be handled in the XQFT TestSuite. Currently, I have no clue how to evaluate the URLs "http://bstore1.example.com/...." in the existing XQuery files. One solution might be to use a relative/local path here, but I do not know whether this would be supported by all implementations.

Christian, BaseX Team
http://www.basex.org
The WG discussed this issue and agreed we need to augment the testsuite. Please note that we have not yet completely implemented the use of this new system throughout the testsuite. If you are satisfied with this resolution, please mark the bug as closed.

Please note the following addition to the instructions:

<quote>
Special Sources: Stop Word List, Thesaurus, and Stemming Dictionary

The stopwords, thesaurus, and stemming-dictionary sources are not intended to be used directly in the form in which they are given, but to provide information to those running the test suite about the expectations a particular test has about various implementation-specific aspects of the execution context. Implementations are expected to provide equivalent information to the query, but in whatever form is appropriate in their context.

A stopwords source is a plain text file containing a list of stop words, one per line. When a query references this stop word list, the implementation is expected to provide that list of stop words to the query.

A thesaurus source is an XML document defined against the thesaurus.xsd XML Schema. When a query references this thesaurus, the implementation is expected to provide equivalent thesaurus information to the query.

The stemming-dictionary is a plain text file containing lines of whitespace-separated tokens. Each token on the line should stem to the first token on the line. When the catalog entry for a query references a stemming dictionary, the implementation is expected to provide stemming equivalent to the rules given in the stemming dictionary.
</quote>

The basic idea is that there are three new kinds of sources: a stop word list, which is just a text file with one stop word per line; a thesaurus, which is an XML file as per the schema; and a stemming dictionary, which is a text file with one stem per line, followed by the words that stem to it. The catalog descriptions for stop word lists and thesauri include a URI that matches up with the one in the query. This is similar to the handling of schemas.
The stemming dictionary has no URI: it is the resource ID that matters, and it is used to define the relevant stem equivalents when it makes a difference for stemmed search.

** Changes to XQFTTSCatalog.xsd/xml:

Add three new kinds of source roles: stopwords, thesaurus, and stemming-dictionary, and corresponding elements in the sources part of the catalog.

Add an aux-URI element to the test-case itself. Queries that use a URI for a stop word list should have an aux-URI with role="stopwords"; queries that use a URI for a thesaurus should have an aux-URI with role="thesaurus". Queries that rely on particular stemming behaviour should have an aux-URI with role="stemming-dictionary".

** Examples:

* Stop words:

TestSources/stopwords.txt:

and
the
then
it
of
in

Catalog description:

<stopwords ID="stopwords1"
           uri="http://bstore1.example.com/StopWordList.xml"
           FileName="stopwords.txt"
           Creator="Full-Text Task Force">
  <description last-mod="2008-11-10">Stop word list for use cases</description>
</stopwords>

Query description using stopwords (with stop words at "http://bstore1.example.com/StopWordList.xml"):

<test-case is-XPath2="true" name="stopwords-1"
           FilePath="Expressions/Operators/CompExpr/FTContainsExpr/FTSelection/MatchOptions/FTStopWord/"
           scenario="standard" Creator="Full-Text Task Force">
  <description>Example using stop words</description>
  <spec-citation spec="XQueryFullText" section-number="3.4.7"
                 section-title="Stop Word Option" section-pointer="ftstopwordoption"/>
  <query name="stopwords-1" date="2008-11-10"/>
  <aux-URI role="stopwords">stopwords1</aux-URI>
  <input-file role="principal-data" variable="input-context">ftusecases</input-file>
  <output-file role="principal" compare="XML">stopwords-1.xml</output-file>
</test-case>

* Thesaurus:

(Schema is TestSources/thesaurus.xsd)

TestSources/soundex.xml:

<thesaurus xmlns="http://www.w3.org/xqftts/thesarus">
  <entry>
    <term>Marigold</term>
    <synonym>
      <term>Merrygould</term>
      <relationship>sounds like</relationship>
    </synonym>
  </entry>
</thesaurus>

Catalog description:

<thesaurus ID="soundex"
           uri="http://bstore1.example.com/UsabilitySoundex.xml"
           FileName="soundex.xml"
           Creator="Full-Text Task Force">
  <description last-mod="2008-11-10">Soundex thesaurus for examples</description>
</thesaurus>

Query using thesaurus (with thesaurus at "http://bstore1.example.com/UsabilitySoundex.xml"):

<test-case is-XPath2="true" name="thesaurus-1"
           FilePath="Expressions/Operators/CompExpr/FTContainsExpr/FTSelection/MatchOptions/FTThesaurus/"
           scenario="standard" Creator="Full-Text Task Force">
  <description>Example using a thesaurus</description>
  <spec-citation spec="XQueryFullText" section-number="3.4.3"
                 section-title="Thesaurus Option" section-pointer="ftthesaurusoption"/>
  <query name="thesaurus-1" date="2008-11-10"/>
  <aux-URI role="thesaurus">soundex</aux-URI>
  <input-file role="principal-data" variable="input-context">ftusecases</input-file>
  <output-file role="principal" compare="XML">thesaurus-1.xml</output-file>
</test-case>

* Stemming:

TestSources/english-stems.txt:

improve improves improving improved
dog dogs
cat cats
train trains training trained
error errors

Catalog description:

<stemming-dictionary ID="english-stems"
                     FileName="english-stems.txt"
                     Creator="Full-Text Task Force">
  <description last-mod="2008-11-10">English stems</description>
</stemming-dictionary>

Query using stemming:

<test-case is-XPath2="true" name="stemming-1"
           FilePath="Expressions/Operators/CompExpr/FTContainsExpr/FTSelection/MatchOptions/FTStemming/"
           scenario="standard" Creator="Full-Text Task Force">
  <description>Example using stemming</description>
  <spec-citation spec="XQueryFullText" section-number="3.4.4"
                 section-title="Stemming Option" section-pointer="ftstemoption"/>
  <query name="stemming-1" date="2008-11-10"/>
  <aux-URI role="stemming-dictionary">english-stems</aux-URI>
  <input-file role="principal-data" variable="input-context">ftusecases</input-file>
  <output-file role="principal" compare="XML">stemming-1.xml</output-file>
</test-case>
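For anyone wiring this into a test harness, the lookup described above (the aux-URI role and text select a source entry, which in turn gives the local file and, for stop words and thesauri, the URI used in the query) might be sketched as follows. This is only an illustration in Python: the <catalog> and <sources> wrapper elements, the absence of namespaces, and the function name resolve_aux_sources are assumptions, not the actual layout of XQFTTSCatalog.xml, and the aux-URI text is assumed to equal the ID of the source.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical catalog fragment modeled on the examples in this bug.
# The <catalog>/<sources> wrappers are assumptions; the real catalog differs.
CATALOG = """
<catalog>
  <sources>
    <stopwords ID="stopwords1"
               uri="http://bstore1.example.com/StopWordList.xml"
               FileName="stopwords.txt"/>
    <stemming-dictionary ID="english-stems" FileName="english-stems.txt"/>
  </sources>
  <test-case name="stopwords-1">
    <aux-URI role="stopwords">stopwords1</aux-URI>
  </test-case>
  <test-case name="stemming-1">
    <aux-URI role="stemming-dictionary">english-stems</aux-URI>
  </test-case>
</catalog>
"""


def resolve_aux_sources(catalog_xml, test_name):
    """Return the special sources a test case refers to via aux-URI."""
    root = ET.fromstring(catalog_xml)
    # Index source entries by (element name, ID); the element name doubles
    # as the aux-URI role (stopwords, thesaurus, stemming-dictionary).
    sources = {(el.tag, el.get("ID")): el for el in root.find("sources")}
    for case in root.iter("test-case"):
        if case.get("name") != test_name:
            continue
        resolved = []
        for aux in case.findall("aux-URI"):
            role, source_id = aux.get("role"), aux.text.strip()
            src = sources[(role, source_id)]  # KeyError if the ID is unknown
            resolved.append({
                "role": role,
                "id": source_id,
                "file": src.get("FileName"),
                "uri": src.get("uri"),  # None for a stemming dictionary
            })
        return resolved
    raise KeyError(test_name)
```

With this fragment, resolve_aux_sources(CATALOG, "stopwords-1") yields the stop word file together with the URI the query uses, while the entry for "stemming-1" carries no URI, only the resource ID and file name.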
Mary, thank you for the detailed presentation and discussion of the proposed test suite extensions. As far as I can judge, the solution for stop words and thesauri should fully meet expectations.

I'm wondering, however, whether the stemming option should be defined in the test suite. In contrast to the stop words and thesaurus options, the currently available version of the XQFT specification does not allow a query to specify a particular stemming file:

'x' ftcontains 'y' with stop words at 'z'
'x' ftcontains 'y' with thesaurus at 'z'
'x' ftcontains 'y' with stemming ...?

This is of course what you are stating (the resource ID is what we are interested in), but, as the specification states that "it is implementation-defined whether the stemming is based on an algorithm, dictionary, or mixed approach", I would prefer to regard the stemming dictionary as an optional choice.

However, these are details; I'm glad to see some more issues solved.

Christian, BaseX Team
http://www.basex.org
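One possible reading of the exchange above is that an implementation with an algorithmic stemmer need not load the dictionary at all, as long as its stemmer treats the tokens on each dictionary line as equivalent. A minimal Python sketch of that check follows. It assumes the dictionary layout quoted in the instructions (whitespace-separated tokens, first token is the stem) and the line breaks shown for english-stems.txt; the names parse_stemming_dictionary, agrees_with_dictionary and toy_suffix_stemmer are hypothetical and not taken from the test suite or any implementation.

```python
ENGLISH_STEMS = """\
improve improves improving improved
dog dogs
cat cats
train trains training trained
error errors
"""


def parse_stemming_dictionary(text):
    """Map every token on a line to the first token on that line (its stem)."""
    stems = {}
    for line in text.splitlines():
        tokens = line.split()
        for token in tokens:
            stems[token] = tokens[0]
    return stems


def agrees_with_dictionary(stem, text):
    """True if `stem` maps all tokens of each dictionary line to one value.

    The stems themselves may differ from the dictionary's first tokens;
    only the equivalence of the words on a line is checked.
    """
    for line in text.splitlines():
        if len({stem(token) for token in line.split()}) > 1:
            return False
    return True


def toy_suffix_stemmer(word):
    """Deliberately naive suffix stripper, for illustration only."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


STEMS = parse_stemming_dictionary(ENGLISH_STEMS)


def dictionary_stemmer(word):
    """Dictionary-based stemmer: unknown words stem to themselves."""
    return STEMS.get(word, word)
```

Here the dictionary-based stemmer agrees with the dictionary by construction, whereas the toy suffix stripper does not, because it leaves "improve" unchanged but reduces "improves" to "improv". A test that relies on the stemming dictionary would be expected to give different results for such an implementation, which is one argument for treating those tests as optional.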