Documentation 0.7.0-rev2

Language Handler

Handling languages was completely replaced by a more generic approach. All language-specific definitions have been excluded from the core and are optimized for maximum dead-code elimination when using a compiler/bundler. Each language consists of 5 definitions, which are divided into two groups:

  1. Charset
    1. encode, type: function(string):string[]
    2. rtl, type: boolean
  2. Language
    1. matcher, type: {string: string}
    2. stemmer, type: {string: string}
    3. filter, type: string[]

The charset contains the encoding logic; the language contains the stemmer, stopword filter and matchers. Multiple language definitions can use the same charset encoder. This separation also lets you manage different language definitions for special use cases (e.g. names, cities, dialects/slang, etc.).

To fully describe a custom language on the fly, you need to pass:

const index = FlexSearch({
    // mandatory:
    encode: (str) => [str],
    // optionally:
    rtl: false,
    stemmer: {},
    matcher: {},
    filter: []
});

When passing no parameters, the latin:default schema is used by default.
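
For illustration, these two initializations should behave the same (assuming the default build, which ships the latin charset):

const index_implicit = FlexSearch();

const index_explicit = FlexSearch({
    charset: "latin:default"
});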

Field | Category | Description
--- | --- | ---
encode | charset | The encoder function. Has to return an array of separated words (or an empty string).
rtl | charset | A boolean property which indicates right-to-left encoding.
filter | language | Filters are also known as "stopwords"; they completely exclude words from being indexed.
stemmer | language | A stemmer removes word endings and is a kind of "partial normalization". A word ending is only matched when the word length is greater than the matched partial.
matcher | language | A matcher replaces all occurrences of a given string regardless of its position and is also a kind of "partial normalization".
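
For illustration, here is a rough sketch of what such language definitions could contain (the entries below are made-up examples, not the shipped English pack):

// stopwords are removed entirely from indexing:
const filter = ["a", "an", "the"];

const stemmer = {
    // word endings are replaced, e.g. "organization" -> "organize":
    "ization": "ize"
};

const matcher = {
    // every occurrence is replaced regardless of position, e.g. "encyclopaedia" -> "encyclopedia":
    "ae": "e"
};

const index = FlexSearch({
    charset: "latin",
    filter: filter,
    stemmer: stemmer,
    matcher: matcher
});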

1. Language Packs: ES6 Modules

The simplest way to assign charset/language-specific encoding via modules is:

import charset from "./dist/module/lang/latin/soundex.js";
import lang from "./dist/module/lang/en.js";

const index = FlexSearch({
    charset: charset,
    lang: lang
});

Just import the default export of each module and assign them accordingly.

The fully qualified example from above is:

import { encode, rtl, tokenize } from "./dist/module/lang/latin/soundex.js";
import { stemmer, filter, matcher } from "./dist/module/lang/en.js";

const index = FlexSearch({
    encode: encode,
    // assign forced tokenizer first:
    tokenize: tokenize || "forward",
    rtl: rtl,
    stemmer: stemmer,
    matcher: matcher,
    filter: filter
});

The example above shows the standard interface which is at least exported by each charset/language.

Note: Some of the encoder variants limit the use of the built-in tokenizer (e.g. soundex). To be safe, prioritize the forced tokenizer and fall back to your choice, e.g. tokenize || "forward".

Encoder Variants

Remember the encoding variants like simple, advanced, extra, or balance? These are also supported and provide several variants of encoding (which differ in performance and degree of normalization).

It is pretty straightforward when using an encoder variant:

import advanced from "./dist/module/lang/latin/advanced.js";
import { encode } from "./dist/module/lang/latin/extra.js";

const index_advanced = FlexSearch({
    // apply all definitions:
    charset: advanced
});

const index_extra = FlexSearch({
    // just apply the encoder:
    encode: encode
});

Available Latin Encoders

  1. default
  2. simple
  3. advanced
  4. extra
  5. balance
  6. soundex

You can assign a charset by passing the charset during initialization, e.g. charset: "latin" for the default charset encoder or charset: "latin:soundex" for an encoder variant.
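
A minimal sketch of both assignments:

const index_default = FlexSearch({
    charset: "latin"
});

const index_soundex = FlexSearch({
    charset: "latin:soundex"
});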

Dialect / Slang

Language definitions (especially matchers) can also be used to normalize dialect and slang of a specific language.
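
As a rough sketch (the matcher entries are hypothetical and not part of any shipped pack), slang spellings could be mapped to a normalized form:

const index = FlexSearch({
    charset: "latin",
    matcher: {
        // every occurrence is replaced before indexing:
        "gonna": "going to",
        "wanna": "want to"
    }
});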

2. Language Packs: ES5 Modules

You need to make the charset and/or language definitions available in one of the following ways:

  1. All charset definitions are included in the flexsearch.min.js build by default, but no language-specific definitions are included
  2. You can load packages located in /dist/lang/ (files refer to languages, folders are charsets)
  3. You can make a custom build

When loading language packs, make sure that the library was loaded before:

<script src="dist/flexsearch.min.js"></script>
<script src="dist/lang/latin/default.min.js"></script>
<script src="dist/lang/en.min.js"></script>

Because you are loading packs as external packages (non-ES6 modules), you have to initialize them via shortcuts:

const index = FlexSearch({
    charset: "latin:soundex",
    lang: "en"
});

Use the charset:variant notation to assign a charset and its variant. Passing just the charset without a variant automatically resolves to charset:default.

You can also override existing definitions, e.g.:

const index = FlexSearch({
    charset: "latin",
    lang: "en",
    matcher: {}
});

Passed definitions will not extend the default definitions; they will replace them. If you would like to extend a definition, create a new language file and put all the content into it.
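
For example (with hypothetical matcher entries), the following replaces the shipped English matcher completely rather than extending it, so include everything you still need:

const index = FlexSearch({
    charset: "latin",
    lang: "en",
    // this matcher replaces the shipped one entirely:
    matcher: {
        "colour": "color",
        "ae": "e"
    }
});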

Encoder Variants

It is pretty straightforward when using an encoder variant:

<script src="dist/flexsearch.min.js"></script>
<script src="dist/lang/latin/advanced.min.js"></script>
<script src="dist/lang/latin/extra.min.js"></script>
<script src="dist/lang/en.min.js"></script>
const index_advanced = FlexSearch({
    charset: "latin:advanced"
});

const index_extra = FlexSearch({
    charset: "latin:extra"
});

Again, use the charset:variant notation to define the charset and its variant.

Partial Tokenizer

In FlexSearch you can't provide your own partial tokenizer, because it is a direct dependency of the core unit. The built-in tokenizer of FlexSearch splits each word into chunks by different patterns (a short sketch of selecting one follows the list below):

  1. strict (supports contextual index)
  2. forward
  3. reverse / both
  4. full
  5. ngram (supports contextual index, coming soon)
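
A rough sketch of selecting one of the built-in patterns (the resulting partials are illustrative, based on the descriptions above):

const index = FlexSearch({
    tokenize: "forward"
});

// roughly, for the word "cats":
// "strict"  indexes the whole word only
// "forward" additionally indexes forward partials ("c", "ca", "cat", ...)
// "reverse" additionally indexes partials from both ends
// "full"    indexes every possible partial combination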

Language Processing Pipeline

This is the default pipeline provided by FlexSearch:

[pipeline diagram]

Custom Pipeline

First, take a look at the default pipeline in src/common.js. It is very simple and straightforward. The pipeline works as some sort of inversion of control: the final encoder implementation has to handle charset and also language-specific transformations. This workaround is left over from many tests.

Inject the default pipeline, e.g.:

this.pipeline(
    /* string: */ str.toLowerCase(),
    /* normalize: */ false,
    /* split: */ split,
    /* collapse: */ false
);

Use the pipeline schema from above to understand the iteration and the difference between pre-encoding and post-encoding. Stemmers and matchers need to be applied after charset normalization but before language transformations; the same applies to filters.

Here is a good example of extending pipelines: src/lang/latin/extra.js → src/lang/latin/advanced.js → src/lang/latin/simple.js.

How to contribute?

Search for your language in src/lang/. If it exists, you can extend it or provide variants (like dialect/slang). If the language doesn't exist, create a new file and check whether any of the existing charsets (e.g. latin) fits your language. When no charset exists, you need to provide a charset as a base for the language.

A new charset should provide at least:

  1. encode A function which normalizes the charset of the passed text content (removes special chars, applies lingual transformations, etc.) and returns an array of separated words. Stemmers, matchers or stopword filters also need to be applied here. When the language has no words, make sure to provide something similar, e.g. each Chinese character could also be a "word". Don't return the whole text content without splitting it.
  2. rtl A boolean flag which indicates right-to-left encoding.

Basically, the charset just needs to provide an encoder function along with an indicator for right-to-left encoding:

export function encode(str){ return [str] }
export const rtl = false;
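
A slightly more complete sketch of a custom charset file (the normalization rules are just an example, not a shipped definition):

// strip everything except latin letters and digits, then split into words:
export function encode(str){
    const normalized = ("" + str).toLowerCase()
                                 .replace(/[^a-z0-9]+/g, " ")
                                 .trim();
    return normalized ? normalized.split(/\s+/) : [];
}
// latin-based scripts are written left-to-right:
export const rtl = false;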