-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Lucene-10008: Respect ignoreCase flag in CommonGramsFilterFactory #188
Conversation
This PR is ready for review. It addresses JIRA - https://issues.apache.org/jira/browse/LUCENE-10008 |
Thanks @vigyasharma, I'll try to have a look soon. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks awesome! Thank you for the refactoring/cleanup, bug fix and new test cases @vigyasharma
I left a few small comments. I think this is super close.
|
||
/** Default word set implementation. */ | ||
protected CharArraySet createDefaultWords() { | ||
return new CharArraySet(EnglishAnalyzer.ENGLISH_STOP_WORDS_SET, ignoreCase); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's kinda weird to default to English stop words here? This base class is a generic "this thing needs to load words from a file" sort of deal ... maybe make this method abstract and force all subclasses to implement it and move this impl down to StopFilterFactory
? It is separately weird that we default to English there too! English is just one (weird!) language! But we don't need to solve that one here.
And I guess CommonGramsFilterFactory
would also default to English stop words, as it does already today.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, I've moved the default stop word impl. to subclasses and made this method abstract.
/** | ||
* Test that ignoreCase flag is honored when no words are provided and default stopwords are used. | ||
*/ | ||
public void testIgnoreCase() throws Exception { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test case failed before the refactoring? Perfect :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it failed before the fix and refactor, and passes after.
commonWords = getWordSet(loader, commonWordFiles, ignoreCase); | ||
} | ||
} else { | ||
commonWords = EnglishAnalyzer.ENGLISH_STOP_WORDS_SET; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh, this was the bug? Because this (default) path ignores ignoreCase
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's right.
throw new IllegalArgumentException( | ||
"'format' can not be specified w/o an explicit 'words' file: " + format); | ||
} | ||
words = createDefaultWords(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And this fixes the bug, because we now always dynamically create the CharArraySet words
, taking ignoreCase
into account...
CommonGramsFilterFactory should respect the ignoreCase flag passed in args even when the default stop word set is used.
9657ebe
to
f4b0558
Compare
Thanks for the review, Michael! I've updated this PR with suggested changes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @vigyasharma -- this looks awesome! I love all the removed redundant code, pulled out into the base class.
I'll try to push soon!
} | ||
|
||
/** Default word set implementation. */ | ||
protected abstract CharArraySet createDefaultWords(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Woot!
Oh, I think this one can/should be backported to 8.10 as well, @vigyasharma could you please open a PR against |
Thanks for the review, @mikemccand. Created PR - apache/lucene-solr#2573 to backport these changes. |
Description
CommonGramsFilterFactory
should respect the ignoreCase flag passed in args even when the default stop word set is used. It currently ignores the flag ifcommonWordFiles
are not specified.Solution
Ensure the flag is respected in even when default stop word set is used.
Tests
Added test to ensure that bigrams get constructed with common words that are not in lower case, when
ignoreCase
is passed astrue
to theCommonGramsFilterFactory
.Checklist
Please review the following and check all that apply:
main
branch../gradlew check
.