refactor: generate inverted indexs for each blocks #15150
Does this PR affect the memory usage of plan?
Yes.
Some tests need to be fixed:
I hereby agree to the terms of the CLA available at: https://docs.databend.com/dev/policies/cla/
Summary
In the previous PR #14997, we implemented the inverted index so that one index file covered the data of multiple blocks. In actual testing, we found that this design leads to larger index files, which take more time to read and hurt query performance. This PR therefore redesigns and refactors the index structure. The detailed design is as follows.
Index Design
Generate one index file for each block. During filtering, the index file is used to determine which rows in the block match the query keywords; if no row matches, the block can be pruned, as shown in the following figure:
Compared with the previous design, it has the following benefits.
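A minimal sketch of the per-block idea, assuming a toy whitespace tokenizer and an in-memory dictionary instead of the real Tantivy index. The names `tokenize`, `build_block_index`, `match_term` and `match_phrase` are hypothetical and do not come from this PR. Each block gets its own index mapping a term to the rows (and token positions) where it occurs; a block with no matching row is pruned.

```python
from collections import defaultdict

def tokenize(text):
    # Hypothetical stand-in for the real tokenizer: lowercase, split on whitespace.
    return text.lower().split()

def build_block_index(rows):
    """Build one index per block: term -> {row_id: [token positions]}."""
    index = defaultdict(dict)
    for row_id, text in enumerate(rows):
        for pos, term in enumerate(tokenize(text)):
            index[term].setdefault(row_id, []).append(pos)
    return index

def match_term(index, term):
    """Return the row ids in this block that contain the term."""
    return sorted(index.get(term.lower(), {}))

def match_phrase(index, phrase):
    """Return the row ids where the phrase terms occur at consecutive positions."""
    terms = tokenize(phrase)
    if not terms or any(t not in index for t in terms):
        return []
    # Only rows that contain every term can contain the phrase.
    candidate_rows = set(index[terms[0]])
    for t in terms[1:]:
        candidate_rows &= set(index[t])
    matched = []
    for row in sorted(candidate_rows):
        starts = index[terms[0]][row]
        if any(all(s + i in index[t][row] for i, t in enumerate(terms)) for s in starts):
            matched.append(row)
    return matched

blocks = [
    ["the quick brown fox", "lazy dog sleeps"],
    ["databases store data", "index files speed up search"],
]
block_indexes = [build_block_index(rows) for rows in blocks]

# Keep only blocks with at least one matching row; the others are pruned.
surviving = {}
for block_id, idx in enumerate(block_indexes):
    rows = match_term(idx, "index")
    if rows:
        surviving[block_id] = rows
```

Here block 0 has no row containing the keyword, so it is dropped without touching its data, while block 1 reports exactly which rows matched.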
Index file structure
The index data generated by Tantivy is stored in a directory containing several files. We merge those files into one large file as the index data. The index file contains the following parts:
- `terms` stores the term dictionary in an FST (Finite State Transducer) structure. The key is the term after tokenizer processing; the value is an address in `postings` and `positions`.
- `postings` stores the lists of document ids and term frequencies.
- `positions` stores the positions of terms in each document.
- `field norms` stores the sum of the lengths of the terms in each field.
- `fast fields` stores column-oriented documents (not used).
- `store` stores row-oriented documents (not used).
- `meta.json` stores the meta information associated with the index, for example:
- `managed.json` stores the names of the files that the index contains, for example:
- `offsets` stores the offset of each part in the index file.

When querying, we first read `terms` to get the addresses of `postings` and `positions`. `postings` contains the matched rows and the frequency of the term in each row; we can use the frequency and `field norms` to calculate the match score of each row. If the query keyword is a phrase, we also need to use the `positions` of each word in the phrase to determine whether the terms can form the phrase.
Block pruner
Databend supports several types of pruners, which are applied in the following order to determine whether a block can be pruned.
- `range pruner` uses the minimum and maximum values of the fields in the block to determine whether a range query can prune a block.
- `bloom pruner` uses the bloom filter to determine whether a point query can prune a block.
- `limit pruner` uses the query's limit value to determine whether the block can be pruned.
- `inverted index pruner` uses the inverted index to determine whether a search query can prune a block.

If the query condition specifies more than one filter field, such as a time range, some blocks are pruned before the inverted index pruner runs. The inverted index data of those blocks does not need to be read, which can speed up the query.
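The effect of this ordering can be sketched as follows. This is an illustration under stated assumptions, not Databend's implementation: `Block`, `range_pruner`, `inverted_index_pruner` and `prune` are hypothetical names, the bloom and limit pruners are left out for brevity, and a plain set of terms stands in for the per-block index file. The `index_reads` counter shows that blocks rejected by the range pruner never have their index read.

```python
from dataclasses import dataclass

@dataclass
class Block:
    name: str
    min_ts: int
    max_ts: int
    terms: set            # stand-in for the contents of the block's index file
    index_reads: int = 0  # how many times the index "file" was read

    def read_index(self):
        self.index_reads += 1
        return self.terms

def range_pruner(block, lo, hi):
    # Keep the block only if [min_ts, max_ts] overlaps the queried range [lo, hi].
    return block.max_ts >= lo and block.min_ts <= hi

def inverted_index_pruner(block, keyword):
    # The index is read only for blocks that reach this stage.
    return keyword in block.read_index()

def prune(blocks, lo, hi, keyword):
    """Apply the cheap range check first, then the inverted index check."""
    kept = []
    for b in blocks:
        if not range_pruner(b, lo, hi):
            continue
        if not inverted_index_pruner(b, keyword):
            continue
        kept.append(b.name)
    return kept

blocks = [
    Block("b0", 0, 99, {"fox", "dog"}),
    Block("b1", 100, 199, {"fox"}),
    Block("b2", 100, 199, {"cat"}),
]
kept = prune(blocks, 100, 150, "fox")
```

Block `b0` contains the keyword but falls outside the time range, so it is pruned by the range check and its index is never read.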
Tokenizer
Currently, we support two tokenizers, English and Chinese, to split the input sentences. Users can specify the tokenizer with an option when creating the index; if it is not specified, the English tokenizer is used by default, for example:
Examples
Let's use the pmc dataset as an example for testing.
The test data can be downloaded from this URL: https://s3.amazonaws.com/benchmarks.redislabs/redisearch/datasets/pmc/documents.json.bz2
Other changes
- `TableIndex` meta adds the fields `version`, `options` and `refreshed_on`.
- Remove the `index_info_locations` field from `Snapshot`.
- Add the `indexes` field to `Snapshot` to record the indexed segments of each index.

Part of #14825
Tests
Type of change