Lucene-10008: Respect ignoreCase flag in CommonGramsFilterFactory #188

vigyasharma · 2021-06-17T22:05:46Z

Description

CommonGramsFilterFactory should respect the ignoreCase flag passed in args even when the default stop word set is used. It currently ignores the flag if commonWordFiles are not specified.

Solution

Ensure the flag is respected even when the default stop word set is used.

Tests

Added a test to ensure that bigrams are constructed for common words that are not in lower case when ignoreCase is passed as true to the CommonGramsFilterFactory.
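To make the expected behavior concrete, here is a minimal sketch in plain Java of the kind of output such a test checks. It is a toy model, not the Lucene filter: the class `CommonGramsSketch`, its three-word `COMMON` list, and the helper names are assumptions made for illustration, and only the idea of joining a common word with its neighbor using `_` is taken from how common-grams bigrams are usually written.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Toy model of common-grams bigram construction; not the Lucene implementation. */
class CommonGramsSketch {

  /** Small illustrative word list (hypothetical; not Lucene's actual default set). */
  static final Set<String> COMMON = Set.of("the", "of", "a");

  /** Returns true if the token counts as a common word under the given case policy. */
  static boolean isCommon(String token, boolean ignoreCase) {
    return COMMON.contains(ignoreCase ? token.toLowerCase(Locale.ROOT) : token);
  }

  /** Emits each token, plus a "left_right" bigram whenever either neighbor is common. */
  static List<String> commonGrams(List<String> tokens, boolean ignoreCase) {
    List<String> out = new ArrayList<>();
    for (int i = 0; i < tokens.size(); i++) {
      String current = tokens.get(i);
      out.add(current);
      if (i + 1 < tokens.size()) {
        String next = tokens.get(i + 1);
        if (isCommon(current, ignoreCase) || isCommon(next, ignoreCase)) {
          out.add(current + "_" + next);
        }
      }
    }
    return out;
  }

  public static void main(String[] args) {
    List<String> tokens = List.of("The", "quick", "fox");
    System.out.println(commonGrams(tokens, false)); // [The, quick, fox]
    System.out.println(commonGrams(tokens, true));  // [The, The_quick, quick, fox]
  }
}
```

With `ignoreCase` set to false the capitalized "The" is not recognized as common, so no bigram appears; with it set to true the bigram `The_quick` is produced, which is the difference the new test is meant to catch.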

Checklist

Please review the following and check all that apply:

I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
I have created a Jira issue and added the issue ID to my pull request title.
I have given Lucene maintainers access to contribute to my PR branch. (optional but recommended)
I have developed this patch against the main branch.
I have run ./gradlew check.
I have added tests for my changes.

vigyasharma · 2021-08-10T17:39:25Z

This PR is ready for review. It addresses JIRA - https://issues.apache.org/jira/browse/LUCENE-10008

mikemccand · 2021-08-12T10:43:42Z

Thanks @vigyasharma, I'll try to have a look soon.

mikemccand

This looks awesome! Thank you for the refactoring/cleanup, bug fix and new test cases @vigyasharma

I left a few small comments. I think this is super close.

mikemccand · 2021-08-12T12:08:29Z

...e/analysis/common/src/java/org/apache/lucene/analysis/en/AbstractWordsFileFilterFactory.java

+
+  /** Default word set implementation. */
+  protected CharArraySet createDefaultWords() {
+    return new CharArraySet(EnglishAnalyzer.ENGLISH_STOP_WORDS_SET, ignoreCase);


It's kinda weird to default to English stop words here? This base class is a generic "this thing needs to load words from a file" sort of deal ... maybe make this method abstract and force all subclasses to implement it and move this impl down to StopFilterFactory? It is separately weird that we default to English there too! English is just one (weird!) language! But we don't need to solve that one here.

And I guess CommonGramsFilterFactory would also default to English stop words, as it does already today.

Good point, I've moved the default stop word impl. to subclasses and made this method abstract.
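The design agreed on here (a generic base class with an abstract hook, and concrete factories that choose their own defaults) can be sketched in plain Java as follows. This is only an illustration under stated assumptions: `BaseFactory`, `EnglishDefaultFactory`, `resolveWords`, and `buildSet` are hypothetical names, and a `TreeSet` with `String.CASE_INSENSITIVE_ORDER` stands in for a case-insensitive word set. Only `createDefaultWords` mirrors a name from the diff.

```java
import java.util.Collection;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Toy template-method sketch; class and method names are illustrative, not Lucene's. */
class WordsFactorySketch {

  /** Generic base: knows how to build a word set, not which words are the default. */
  abstract static class BaseFactory {
    protected final boolean ignoreCase;

    BaseFactory(boolean ignoreCase) {
      this.ignoreCase = ignoreCase;
    }

    /** Each concrete factory must say what its default words are. */
    protected abstract Set<String> createDefaultWords();

    /** Uses explicit words if given, otherwise falls back to the subclass default. */
    Set<String> resolveWords(Collection<String> explicitWords) {
      return explicitWords == null ? createDefaultWords() : buildSet(explicitWords);
    }

    /** Builds a fresh set whose lookups honor ignoreCase. */
    protected Set<String> buildSet(Collection<String> words) {
      Set<String> set =
          ignoreCase
              ? new TreeSet<String>(String.CASE_INSENSITIVE_ORDER)
              : new TreeSet<String>();
      set.addAll(words);
      return set;
    }
  }

  /** Hypothetical subclass that opts in to an English-like default list. */
  static class EnglishDefaultFactory extends BaseFactory {
    EnglishDefaultFactory(boolean ignoreCase) {
      super(ignoreCase);
    }

    @Override
    protected Set<String> createDefaultWords() {
      return buildSet(List.of("the", "of", "a")); // illustrative words only
    }
  }

  public static void main(String[] args) {
    Set<String> words = new EnglishDefaultFactory(true).resolveWords(null);
    System.out.println(words.contains("The")); // true
  }
}
```

Keeping the hook abstract means the base class never hard-codes a language; any factory that wants English defaults has to say so explicitly.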

mikemccand · 2021-08-12T12:10:29Z

...sis/common/src/test/org/apache/lucene/analysis/commongrams/TestCommonGramsFilterFactory.java

+  /**
+   * Test that ignoreCase flag is honored when no words are provided and default stopwords are used.
+   */
+  public void testIgnoreCase() throws Exception {


This test case failed before the refactoring? Perfect :)

Yes, it failed before the fix and refactor, and passes after.

mikemccand · 2021-08-12T12:11:30Z

...nalysis/common/src/java/org/apache/lucene/analysis/commongrams/CommonGramsFilterFactory.java

-        commonWords = getWordSet(loader, commonWordFiles, ignoreCase);
-      }
-    } else {
-      commonWords = EnglishAnalyzer.ENGLISH_STOP_WORDS_SET;


Ahh, this was the bug? Because this (default) path ignores ignoreCase?

That's right.

mikemccand · 2021-08-12T12:12:15Z

...e/analysis/common/src/java/org/apache/lucene/analysis/en/AbstractWordsFileFilterFactory.java

+        throw new IllegalArgumentException(
+            "'format' can not be specified w/o an explicit 'words' file: " + format);
+      }
+      words = createDefaultWords();


And this fixes the bug, because we now always dynamically create the CharArraySet words, taking ignoreCase into account...
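The before/after behavior discussed in these two review threads can be sketched without Lucene on the classpath. This is a simplified model, not the patch itself: `DEFAULT_WORDS` is a hypothetical stand-in for a shared, lower-case default set, the method names are invented for illustration, and a `TreeSet` with `String.CASE_INSENSITIVE_ORDER` plays the role of a word set built with `ignoreCase` set to true.

```java
import java.util.Set;
import java.util.TreeSet;

/** Toy before/after model of the default-words path; not the actual Lucene code. */
class DefaultWordsBugSketch {

  /** Stand-in for a shared, case-sensitive, lower-case default word set. */
  static final Set<String> DEFAULT_WORDS = Set.of("the", "of", "a");

  /** Before: the default path hands back the shared set and never looks at ignoreCase. */
  static Set<String> wordsBeforeFix(boolean ignoreCase) {
    return DEFAULT_WORDS;
  }

  /** After: the default path builds a new set each time, taking ignoreCase into account. */
  static Set<String> wordsAfterFix(boolean ignoreCase) {
    Set<String> copy =
        ignoreCase
            ? new TreeSet<String>(String.CASE_INSENSITIVE_ORDER)
            : new TreeSet<String>();
    copy.addAll(DEFAULT_WORDS);
    return copy;
  }

  public static void main(String[] args) {
    System.out.println(wordsBeforeFix(true).contains("The")); // false: flag ignored
    System.out.println(wordsAfterFix(true).contains("The"));  // true: flag honored
  }
}
```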

vigyasharma · 2021-08-12T20:48:10Z

Thanks for the review, Michael! I've updated this PR with suggested changes.

mikemccand

Thanks @vigyasharma -- this looks awesome! I love all the removed redundant code, pulled out into the base class.

I'll try to push soon!

mikemccand · 2021-08-13T18:34:27Z

...e/analysis/common/src/java/org/apache/lucene/analysis/en/AbstractWordsFileFilterFactory.java

+  }
+
+  /** Default word set implementation. */
+  protected abstract CharArraySet createDefaultWords();


mikemccand · 2021-08-23T15:47:13Z

Oh, I think this one can/should be backported to 8.10 as well, @vigyasharma could you please open a PR against branch_8x in the lucene-solr GitHub repo? Thanks!

vigyasharma · 2021-09-14T18:10:43Z

> Oh, I think this one can/should be backported to 8.10 as well, @vigyasharma could you please open a PR against branch_8x in the lucene-solr GitHub repo? Thanks!

Thanks for the review, @mikemccand. Created PR - apache/lucene-solr#2573 to backport these changes.

mikemccand reviewed Aug 12, 2021

vigyasharma and others added 7 commits August 12, 2021 10:17

LUCENE-10008: Respect ignoreCase flag in CommonGramsFilterFactory

609a63f

CommonGramsFilterFactory should respect the ignoreCase flag passed in args even when the default stop word set is used.

LUCENE-10008: Styling fixes from precommit check

6131ed1

Spotless violations fix

eeb1fe3

Add common base class for Common/Stop/KeepWords filter factories

17e93a6

Linting errors

31fff40

Add license header

b5d06e3

Move default stop word implementation to concrete subclasses

f4b0558

vigyasharma force-pushed the lucene-10008 branch from 9657ebe to f4b0558 on August 12, 2021 20:43

mikemccand approved these changes Aug 13, 2021

mikemccand merged commit cb4c8ae into apache:main Aug 13, 2021

vigyasharma mentioned this pull request Sep 14, 2021

LUCENE-10008: Respect ignoreCase flag in CommonGramsFilterFactory apache/lucene-solr#2573

Merged

paulirwin mentioned this pull request Oct 24, 2024

Respect ignoreCase flag in CommonGramsFilterFactory apache/lucenenet#781

Merged
