This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.
The file "english-stems.txt" contains stemming rules only for lower case text. However, the specification clearly states that the "Stemming Option must be applied before the Case Option and the Diacritics Option". So when tokenizing the string "Dogs and Cats" with stemming, the tokens presented to the tokenizer must be "Dogs", "and", "Cats". The guidelines for running XQFTTS state that the "stemming-dictionary is a plain text file containing lines of whitespace-separated tokens. Each token on the line should stem to the first token on the line." Note that it is conceivable that the stemming dictionary might stem "AIDS" to "AIDS" but "aids" to "aid". This would be a useful test of the order of application of stemming and case options. Presumably the test suite doesn't currently test this.
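To make the ordering concrete, here is a minimal sketch in Python of a case-sensitive dictionary lookup followed by the case option. The names load_stems and normalize, and the sample dictionary contents, are assumptions made for illustration; they are not taken from XQFTTS or from any implementation.

```python
def load_stems(text):
    """Map every token on a line to the first token on that line.

    The lookup is case-sensitive, so "Dogs" and "dogs" are distinct keys.
    """
    stems = {}
    for line in text.splitlines():
        tokens = line.split()
        if not tokens:
            continue  # skip blank lines
        head = tokens[0]
        for token in tokens:
            stems[token] = head
    return stems


def normalize(token, stems, case_insensitive=False):
    """Apply stemming first, then (optionally) the case option."""
    stemmed = stems.get(token, token)  # unknown tokens stem to themselves
    return stemmed.lower() if case_insensitive else stemmed


# Hypothetical dictionary contents, chosen to show the AIDS/aids example.
SAMPLE = """\
Dog Dogs
dog dogs
AIDS
aid aids
"""
STEMS = load_stems(SAMPLE)

for word in ("Dogs", "AIDS", "aids"):
    print(word, "->", normalize(word, STEMS, case_insensitive=True))
```

With this sample dictionary, "AIDS" and "aids" normalize to different values even under case-insensitive matching, which is the behaviour that distinguishes stemming-before-case from case-before-stemming.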
Correct, stemming should be case-sensitive. I have added two additional tests ft-matchoptions-q5 and ft-matchoptions-q6 to test this case. Please indicate your satisfaction with this resolution by closing this bug.
So how would one do case-insensitive stemming?
(In reply to comment #1)
> Correct, stemming should be case-sensitive. I have added two additional tests
> ft-matchoptions-q5 and ft-matchoptions-q6 to test this case. Please indicate
> your satisfaction with this resolution by closing this bug.

Thanks. However, for some tests to pass, the english-stems.txt needs to have the following lines added.

Dog Dogs
Cat Cats
Improve Improves Improving Improved
Test Tests Testing Tested
Plan Planning
Conduct Conducting
(In reply to comment #2)
> So how would one do case-insensitive stemming?

Assuming that

lowercase(AB) = ab
lowercase(Ab) = ab
lowercase(aB) = ab
lowercase(ab) = ab

one would ensure that if the implementation's stemming algorithm was such that

stem(AB) = AB

then

stem(Ab) = Ab
stem(aB) = aB
stem(ab) = ab

Thus when the case option is case-insensitive, applying the case option to the stem would always return 'ab' for each of AB, Ab, aB and ab.
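The condition described in this comment can be checked mechanically for a dictionary-based stemmer. The following Python sketch is only an illustration: the function name case_conflicts and the sample mappings are assumptions, and it only inspects tokens that actually appear in the dictionary (tokens that are not listed are assumed to stem to themselves).

```python
from collections import defaultdict


def case_conflicts(stems):
    """Find tokens whose case variants do not agree after lowercasing.

    `stems` maps token -> stem (case-sensitive). The result maps each
    lowercased token to the sorted list of distinct lowercased stems,
    for tokens where more than one such stem exists.
    """
    seen = defaultdict(set)
    for token, stem in stems.items():
        seen[token.lower()].add(stem.lower())
    return {tok: sorted(vals) for tok, vals in seen.items() if len(vals) > 1}


# Hypothetical mappings: the first behaves case-insensitively once the
# case option is applied, the second does not.
consistent = {"Dog": "Dog", "Dogs": "Dog", "dog": "dog", "dogs": "dog"}
inconsistent = {"AIDS": "AIDS", "aid": "aid", "aids": "aid"}

print(case_conflicts(consistent))
print(case_conflicts(inconsistent))
```

An empty result means that applying the case option after stemming gives the same value for every case variant listed in the dictionary; a non-empty result points at entries such as "AIDS" and "aids" where stemming is deliberately case-sensitive.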
Made the additions to english-stems.txt.
english-stems.txt is still missing

Test Tests Testing Tested

which is required for the following tests.

ftstaticcontext-q2
ftstaticcontext-q4
ftstaticcontext-q5
stemming-queries-results-q1
stemming-queries-results-q1b
xquery-xpath-composability-queries-results-q2
Done, per comment 6.
Confirmed fixed. Thanks.