Getting filters, stemmers, ... to work #278

Open
WesleyMConner opened this issue Oct 19, 2021 · 4 comments
Labels
documentation Documentation request

Comments

@WesleyMConner

Initially, I was successful:

  • Using Node v16.9.0 ES6 modules to create, export, import and search FlexSearch indexes
  • Using CSJS ES6 modules (Chrome) to import and search the exported indexes above.
$ which node
/home/gitpod/.nvm/versions/node/v16.9.0/bin/node 
$ npm --version
7.21.1
$ npm list flexsearch
[email protected] /workspace/hub
└── [email protected]

I spent the better part of the morning trying to understand why filters, stemmers, etc. were being ignored. From experimentation, I have two takeaways:

  1. We must download local copies of ./dist/module/lang/latin/advanced.js, ./dist/module/lang/en.js and their dependencies into our projects and import them separately from FlexSearch itself.
./dist/
   |
   +-- module/
         |
         +-- lang/
         |     |
         |     +-- en.js
         |     +-- latin
         |           |
         |           +-- advanced.js
         |
         +-- common.js
         +-- lang
         +-- lang.js
         +-- type.js

import FlexSearch from 'flexsearch';
import {encode, rtl} from './dist/module/lang/latin/advanced.js';
import {stemmer, filter, matcher} from './dist/module/lang/en.js';
  2. We SHOULD NOT use the performance preset, which seems to disable these options.
let fsi = new FlexSearch.Index(
  //'performance',
  {
    encode: encode,
    rtl: rtl,
    stemmer: stemmer,
    matcher: matcher,
    filter: filter
  }
);

With the local ./dist/** directory (see above) and the performance preset commented out,
the filters, stemmers, ... worked as expected.

Is this the correct and expected approach?

@peterbe

peterbe commented Oct 27, 2021

@WesleyMConner Does this explain why I can't get stemming to work at all? #280

@WesleyMConner
Author

@peterbe

Apologies for the delayed response. I have been tied up with jury duty this week.

I have only worked with FlexSearch.Index thus far; so, cannot comment on Document or Worker. At the moment, I do not have my FlexSearch elements isolated enough to publish a standalone complete example. Let me outline what is working for me - with minor caveats.

For context, I attempt to restrict my code to relatively pure ES6 and very up-to-date / minimal modules (i.e., avoiding Babel, Gulp, …). I make heavy use of Gitpod, VS Code and Jest, which requires more than a little experimentation to stabilize given my ES6 purity objectives and pre-releases of both FlexSearch (https://www.npmjs.com/package/flexsearch/v/0.7.21) and Antora 3 (3.0.0-alpha.9). My application requires generating an Index on the server (extending Antora 3) and having a client use that Index with zero server interaction once the code is pulled down.

I have an ES6 base class that wraps/extends a FlexSearch.Index instance - e.g., overloading add(), search(), and introducing application specific extensions, etc. My base class DOES NOT create the FlexSearch.Index nor does it include export or import methods.

I have a server-specific base class extension that creates a FlexSearch.Index instance and adds both export and import methods. [The import methods exist for server-side Jest tests only.]

I have a client-specific base class extension that creates a FlexSearch.Index and adds an import method. [Currently, the server and client import implementations are the same; so, I could refactor to avoid repeating the same import code.]
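
The three-class layout described above might look roughly like the following sketch. All names here (SearchBase, ServerSearch, ClientSearch, importInto, StubIndex) are hypothetical, and StubIndex is only a stand-in for a FlexSearch.Index so that the sketch runs without the library. It is an illustration of the structure, not the actual code from this project.

```javascript
// Stand-in for a FlexSearch.Index (assumption: only add() and search() matter here).
class StubIndex {
  constructor() { this.docs = new Map(); }
  add(id, text) { this.docs.set(id, text.toLowerCase()); }
  search(query) {
    const q = query.toLowerCase();
    return [...this.docs].filter(([, text]) => text.includes(q)).map(([id]) => id);
  }
}

// Base class: wraps an index instance that it did not create itself.
class SearchBase {
  constructor(fsi) { this.fsi = fsi; }
  add(id, text) { this.fsi.add(id, text.trim()); }
  search(query) { return this.fsi.search(query.trim()); }
}

// Shared import logic, so the server and client extensions do not repeat it.
function importInto(fsi, entries) {
  for (const [id, text] of entries) fsi.add(id, text);
}

// Server extension: creates the index and adds export and import.
class ServerSearch extends SearchBase {
  constructor() { super(new StubIndex()); }
  exportEntries() { return [...this.fsi.docs]; }
  importEntries(entries) { importInto(this.fsi, entries); }
}

// Client extension: creates the index and adds import only.
class ClientSearch extends SearchBase {
  constructor() { super(new StubIndex()); }
  importEntries(entries) { importInto(this.fsi, entries); }
}
```

Moving the import logic into one shared function is one way to remove the duplication mentioned above; a mixin or a method on the base class would work as well.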

I create and use the following object instance to help ensure consistency when initializing new FlexSearch.Index instances across server and client implementations. The object references local copies of referenced filters, stemmers, etc. which must be present in both server and client contexts. Since the server class extension and client class extension both use this object instance, I do not have to worry about slight discrepancies emerging between how the server and client FlexSearch.Index instances are initialized.

import {encode, rtl} from './flex-dist/module/lang/latin/advanced.js';
import {stemmer, filter, matcher} from './flex-dist/module/lang/en.js';

export const fsIndexOptions = {
  encode: encode,
  rtl: rtl,
  stemmer: stemmer,
  matcher: matcher,
  filter: filter
};

The server class extension creates a FlexSearch.Index instance using super(new FlexSearch.Index(fsIndexOptions));, where FlexSearch is installed via package.json.

The client class extension creates a FlexSearch instance using super(new FlexSearch.Index(fsIndexOptions));, where FlexSearch is pulled using the following:

<script
  src="https://rawcdn.githack.com/nextapps-de/flexsearch/0.7.2/dist/flexsearch.es5.js"
  integrity_no="sha256-9SoR+2Y1K3kQJn9R3oiMtFuA/vF3WSmumWWaVPpPxVs="
  crossorigin="anonymous"></script>
</html>

The server class has an exportHandler() member function:

  async exportHandler (fileName, content) {
    // Fix v0.7.2 bug: https://github.com/nextapps-de/flexsearch/issues/273
    if (fileName.lastIndexOf('.') > 0) {
      fileName = fileName.substring(fileName.lastIndexOf('.') + 1);
    };

    await fsp.writeFile(path.resolve(this.exportDir, fileName), content);
  };
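
The key handling in exportHandler() can be isolated as a small pure function for illustration. The name normalizeExportKey and the sample keys below are assumptions made for this sketch; the exact keys FlexSearch passes to the export callback depend on the version (see #273).

```javascript
// Keep only the part of the key after its last dot, mirroring the
// substring() logic in exportHandler(). Keys without a dot, or with a
// dot only at position 0, are returned unchanged.
function normalizeExportKey(key) {
  const lastDot = key.lastIndexOf('.');
  return lastDot > 0 ? key.substring(lastDot + 1) : key;
}

console.log(normalizeExportKey('reg'));        // reg
console.log(normalizeExportKey('prefix.cfg')); // cfg
```

Keeping this logic in a standalone function makes it easy to unit test and to delete once the upstream issue is resolved.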

The server class has a companion writeIndexToFile() member method that (a) triggers a FlexSearch export, passing in a bound reference to exportHandler(), and (b) writes application-specific elements that need to be exported/imported alongside the captured FlexSearch index.

async writeIndexToFile () {
  await fsp.mkdir(this.exportDir, {mode: '0755', recursive: true})
        .catch((reason) => console.log(reason));

  let boundExportHandler = this.exportHandler.bind(this);

  await Promise.all([
    this.fsi.export(boundExportHandler)  // this.fsi is the captured FlexSearch instance
    .then(() => {
      // Return the sleep Promise so that Promise.all() actually waits for the delay.
      return sleep(1000);
    }),
          :
    // Additional Promises to persist any application-specific state patterned along
    // the following lines:
    fsp.writeFile(   // See the Caveats below
      path.resolve(this.exportDir, '…'),
      customJsonStringify(this…)
    )
  ]);
};

The server/client import method(s) leverage a Promise.all() which wraps both the import of the captured FlexSearch instance (this.fsi) and import of the application-specific components.

async loadIndexFromFile () {
  await Promise.all([
    // Return each read Promise and spread them, so Promise.all() waits for every file.
    ...['cfg', 'ctx', 'map', 'reg'].map(file =>
      fsp.readFile(path.resolve(this.exportDir, file))
      .then((buf) => this.fsi.import(file, buf.toString()))
    ),
      :
    // Patterned import of custom components.
    fsp.readFile(path.resolve(this.exportDir, '...'))
    .then((buf) => {
      this.... = customJsonParse(buf);
      Promise.resolve(undefined);
    })
  ]);
};
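
One detail of this pattern is easy to get wrong: Promise.all() only waits for Promises that are direct elements of the array it receives, so the per-file reads need to be returned from the map() callback and spread into the outer array. The sketch below shows the difference using a hypothetical fakeRead() helper in place of fsp.readFile(); the function names are assumptions for illustration.

```javascript
// Hypothetical async reader standing in for fsp.readFile().
function fakeRead(name, log) {
  return new Promise((resolve) =>
    setTimeout(() => { log.push(name); resolve(name); }, 20));
}

// Nested array: the inner array is treated as a plain value, so
// Promise.all() resolves before any read has finished.
async function loadNested(files) {
  const log = [];
  await Promise.all([files.map((f) => fakeRead(f, log))]);
  return log.length; // number of reads finished when Promise.all() resolved
}

// Spread: each read is a direct element and is awaited.
async function loadSpread(files) {
  const log = [];
  await Promise.all([...files.map((f) => fakeRead(f, log))]);
  return log.length;
}

const files = ['cfg', 'ctx', 'map', 'reg'];
loadNested(files).then((n) => console.log('nested:', n)); // nested: 0
loadSpread(files).then((n) => console.log('spread:', n)); // spread: 4
```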

Caveats

I noticed some cases where the FlexSearch export Promise appeared to resolve BEFORE the files were in an import-ready state. Typically my small-index Jest tests would pass even if a server-side import occurred immediately after a server-side export. Once I started using application-scale FlexSearch indexes (i.e., longer write times), I noticed that the FlexSearch export would resolve its Promise too quickly, before the exported files were sufficiently stable for import.

I added an artificial delay (e.g., sleep promise) to the FlexSearch export promise which resolved the issue. I noticed that recent FlexSearch commits have tweaked the export-callback code; so, I eagerly look forward to a prospective FlexSearch 0.7.22 release which I suspect would clean up the Promise logic and perhaps address #273.
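
A minimal sketch of such a delay, assuming a Promise-based sleep() helper built on setTimeout (both sleep and withDelay are hypothetical names, not part of FlexSearch). The delay only takes effect if the sleep Promise is returned from the then() callback; otherwise the chain resolves immediately.

```javascript
// Hypothetical helper: resolves after `ms` milliseconds.
function sleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Resolve with the original value, but only after an extra delay.
function withDelay(promise, ms) {
  return promise.then((value) => sleep(ms).then(() => value));
}

// Usage idea (not executed here): withDelay(this.fsi.export(handler), 1000)
```

A fixed delay is a workaround rather than a guarantee: it assumes the export files are written within the chosen interval, which may not hold for larger indexes.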

Net: I am very happy with the latest from both the FlexSearch and Antora projects and look forward to using the combination even in their preliminary release forms.

@peterbe

peterbe commented Oct 29, 2021

That's a very long story :)
I'm not sure how it relates to specifically getting stemmers to work in Node.
I managed to get it to work by moving en.js file into ./en.mjs and importing it that way.
But it's weird because that file contains an object called stemmer with lots of keys. But it still doesn't cope with simple plural form. I.e. finding "Procurements is a word" when searching for "procurement". Or finding "Repository is a word" when searching for "repositories". So I did this:

import { stemmer, filter } from "./vendored/en.mjs";
const ownStemmer = Object.assign(
  {
    ies: "y",
    s: "",
  },
  stemmer
);
const index = new FlexSearch.Index({
  tokenize: "forward",
  filter: filter,
  stemmer: ownStemmer,
});

that solves it, but it feels hacky and weird to have to do.
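
To make the intent of the two extra rules concrete, here is a simplified suffix-replacement sketch. It is not FlexSearch's implementation; applySuffixRules and baseRules are hypothetical and only illustrate the idea of a suffix-to-replacement map like the one in en.js.

```javascript
// Replace the longest matching suffix of `word` using a {suffix: replacement} map.
function applySuffixRules(word, rules) {
  const suffixes = Object.keys(rules).sort((a, b) => b.length - a.length);
  for (const suffix of suffixes) {
    // Require at least one character before the suffix.
    if (word.length > suffix.length && word.endsWith(suffix)) {
      return word.slice(0, -suffix.length) + rules[suffix];
    }
  }
  return word;
}

// Hypothetical stand-in for the imported `stemmer` object.
const baseRules = { ational: 'ate' };
const ownRules = Object.assign({ ies: 'y', s: '' }, baseRules);

console.log(applySuffixRules('repositories', ownRules)); // repository
console.log(applySuffixRules('procurements', ownRules)); // procurement
```

Note that Object.assign() lets later sources overwrite earlier ones, so with this argument order any key that also exists in the imported stemmer takes precedence over the custom rule.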

@mmm8955405

The reason programmers are confused is that FlexSearch runs very fast, but its documentation is confusing. Unfortunately, I thought it would work, but it didn't work as expected.

@ts-thomas ts-thomas self-assigned this Oct 2, 2022
@ts-thomas ts-thomas added the documentation Documentation request label Oct 2, 2022