Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Lucene-10008: Respect ignoreCase flag in CommonGramsFilterFactory #188

Merged
merged 7 commits into from
Aug 13, 2021

Conversation

vigyasharma
Copy link
Contributor

@vigyasharma vigyasharma commented Jun 17, 2021

Description

CommonGramsFilterFactory should respect the ignoreCase flag passed in args even when the default stop word set is used. It currently ignores the flag if commonWordFiles are not specified.

Solution

Ensure the flag is respected in even when default stop word set is used.

Tests

Added test to ensure that bigrams get constructed with common words that are not in lower case, when ignoreCase is passed as true to the CommonGramsFilterFactory.

Checklist

Please review the following and check all that apply:

  • I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
  • I have created a Jira issue and added the issue ID to my pull request title.
  • I have given Lucene maintainers access to contribute to my PR branch. (optional but recommended)
  • I have developed this patch against the main branch.
  • I have run ./gradlew check.
  • I have added tests for my changes.

@vigyasharma
Copy link
Contributor Author

This PR is ready for review. It addresses JIRA - https://issues.apache.org/jira/browse/LUCENE-10008

@mikemccand
Copy link
Member

Thanks @vigyasharma, I'll try to have a look soon.

Copy link
Member

@mikemccand mikemccand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks awesome! Thank you for the refactoring/cleanup, bug fix and new test cases @vigyasharma

I left a few small comments. I think this is super close.


/** Default word set implementation. */
protected CharArraySet createDefaultWords() {
return new CharArraySet(EnglishAnalyzer.ENGLISH_STOP_WORDS_SET, ignoreCase);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's kinda weird to default to English stop words here? This base class is a generic "this thing needs to load words from a file" sort of deal ... maybe make this method abstract and force all subclasses to implement it and move this impl down to StopFilterFactory? It is separately weird that we default to English there too! English is just one (weird!) language! But we don't need to solve that one here.

And I guess CommonGramsFilterFactory would also default to English stop words, as it does already today.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, I've moved the default stop word impl. to subclasses and made this method abstract.

/**
* Test that ignoreCase flag is honored when no words are provided and default stopwords are used.
*/
public void testIgnoreCase() throws Exception {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test case failed before the refactoring? Perfect :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it failed before the fix and refactor, and passes after.

commonWords = getWordSet(loader, commonWordFiles, ignoreCase);
}
} else {
commonWords = EnglishAnalyzer.ENGLISH_STOP_WORDS_SET;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh, this was the bug? Because this (default) path ignores ignoreCase?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's right.

throw new IllegalArgumentException(
"'format' can not be specified w/o an explicit 'words' file: " + format);
}
words = createDefaultWords();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And this fixes the bug, because we now always dynamically create the CharArraySet words, taking ignoreCase into account...

@vigyasharma
Copy link
Contributor Author

Thanks for the review, Michael! I've updated this PR with suggested changes.

Copy link
Member

@mikemccand mikemccand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @vigyasharma -- this looks awesome! I love all the removed redundant code, pulled out into the base class.

I'll try to push soon!

}

/** Default word set implementation. */
protected abstract CharArraySet createDefaultWords();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Woot!

@mikemccand mikemccand merged commit cb4c8ae into apache:main Aug 13, 2021
@mikemccand
Copy link
Member

Oh, I think this one can/should be backported to 8.10 as well, @vigyasharma could you please open a PR against branch_8x in the lucene-solr GitHub repo? Thanks!

@vigyasharma
Copy link
Contributor Author

Oh, I think this one can/should be backported to 8.10 as well, @vigyasharma could you please open a PR against branch_8x in the lucene-solr GitHub repo? Thanks!

Thanks for the review, @mikemccand. Created PR - apache/lucene-solr#2573 to backport these changes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants