bitset codec for off heap filters [LUCENE-5052] #6116
Comments
Stefan Pohl (migrated from JIRA) The following paper might be informative in regard to this ticket (you can even go beyond maxDoc/8, if compared against VInt coding): A. Moffat and J. S. Culpepper. Hybrid Bitvector Index Compression. In Proceedings of the 12th Australasian Document Computing Symposium (ADCS 2007), December 2007, pp. 25-37. More generally, it would be nice to determine the PostingsListFormat depending on statistics of individual terms, not only per-field.
Michael McCandless (@mikemccand) (migrated from JIRA) Couldn't this just be a PostingsFormat, such that for DOCS_ONLY fields with high enough docFreq, it stores them as a dense bitset on disk? Maybe it could wrap another PostingsBaseFormat (like Pulsing) and 'steal' the high-freq terms...
David Smiley (@dsmiley) (migrated from JIRA) Yeah, cool idea. But I feel that to truly take the performance to the next level, there should be a way to intersect the bit vector with another. The spatial Lucene filters have loops that work by populating a FixedBitSet by looping over a DocsEnum. But if behind the scenes it's just another bitset, I would love to efficiently union the bitsets. Example snippet of existing code:

```java
int docid;
while ((docid = docsEnum.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
  bitSet.set(docid);
}
```

I'd bet there's a lot of code like this around.
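For what it's worth, the loop above can already be collapsed with FixedBitSet.or(DocIdSetIterator), though that still advances doc by doc under the hood; the interesting win would be if the codec exposed its on-disk bitset so the union became word-wise ORs. A minimal sketch, assuming that or(DocIdSetIterator) overload is available:

```java
import java.io.IOException;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.util.FixedBitSet;

// Equivalent to the nextDoc()/set(docid) loop above: drain the iterator
// into the bit set. bitSet must be sized to the segment's maxDoc.
void collect(DocIdSetIterator docsEnum, FixedBitSet bitSet) throws IOException {
  bitSet.or(docsEnum);
}
```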
Mikhail Khludnev (@mkhludnev) (migrated from JIRA) It's worth starting from trivial bitset and sorted-ints encodings, and only later considering the more comprehensive Elias-Fano encoding #6148.
Yury Pakhomov (migrated from JIRA) This is a very simple implementation of a codec which stores posting lists as bitsets. This implementation passes the BasePostingsFormatTestCase.testDocsOnly() test.
Nina Gracheva (migrated from JIRA) This is an adapted version of BitSetCodec. It uses the BlockTermsWriter/Reader infrastructure to build a postings format with a custom postings writer/reader and the standard terms writer/reader. TODO: use Long.numberOfTrailingZeros() in advance() and nextDoc().
Dr Oleg Savrasov (migrated from JIRA) Methods nextDoc() and advance() have been implemented using the Long.numberOfTrailingZeros() approach taken from FixedBitSetIterator.
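For reference, a minimal sketch of that technique (not the patch code itself): it assumes the term's postings have already been read into a long[] with one bit per doc id, and scans from the current position to the lowest set bit.

```java
import org.apache.lucene.search.DocIdSetIterator;

// Minimal sketch of a bitset-backed doc id iterator in the style of
// FixedBitSetIterator: 'bits' holds one bit per document in the segment.
class BitSetDocsEnumSketch {
  private final long[] bits;
  private int doc = -1;

  BitSetDocsEnumSketch(long[] bits) {
    this.bits = bits;
  }

  int nextDoc() {
    int next = doc + 1;
    int i = next >> 6; // index of the word containing 'next'
    if (i >= bits.length) {
      return doc = DocIdSetIterator.NO_MORE_DOCS;
    }
    // Long shifts use only the low 6 bits of the count, so this drops
    // exactly the already-returned bits of the current word.
    long word = bits[i] >>> next;
    if (word != 0) {
      return doc = next + Long.numberOfTrailingZeros(word);
    }
    while (++i < bits.length) {
      word = bits[i];
      if (word != 0) {
        return doc = (i << 6) + Long.numberOfTrailingZeros(word);
      }
    }
    return doc = DocIdSetIterator.NO_MORE_DOCS;
  }

  // advance(target) is the same scan, starting at 'target' instead of doc + 1.
}
```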
Ahmet Arslan (@iorixxx) (migrated from JIRA) Hi Stefan Pohl, the paper link seems broken? It gives a 404 for me.
Stefan Pohl (migrated from JIRA) Among other locations, the paper can now be found here:
Michael McCandless (@mikemccand) (migrated from JIRA) This patch looks like a great start! Using BlockTermsDict makes sense; no need to reimplement a terms dictionary (it's not easy!). One problem I see is that every term is written as a bitset? This may be OK for some applications, but I think for wider usage it'd be better if the postings format wrapped another postings format, and then only used the bitset when the docFreq was high enough, and otherwise delegated to the wrapped postings format? Maybe have a look at PulsingPostingsFormat as an example of how to wrap postings formats?
Mikhail Khludnev (@mkhludnev) (migrated from JIRA)
There are two orthogonal conceptions:
I'd like to clarify the use case for this issue (the issue summary might need to be improved). It aims at Solr's fq, or even Heliosearch's GC-lightness. I suppose the user can decide which fields to index with a "no-tf" format; these are "string" fields. Then the user requests filtering on these fields, and no scoring is needed, for sure. @mikemccand Thanks!
Michael McCandless (@mikemccand) (migrated from JIRA) OK, I agree: if we use a sparse bitset, then we could use the format for all postings. I guess we'd switch up the bitset impl depending on docFreq of each term. We already have FieldInfo.IndexOptions.DOCS_ONLY to express that you want to index only the docIDs. E.g. StringField sets this. And our default codec already makes it easy to switch up the postings format by field: just subclass it and override getPostingsFormatForField. |
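A minimal sketch of that per-field routing, assuming a Lucene 4.x-era codec; BitSetPostingsFormat is the hypothetical format being developed in this issue, and routing by a field-name prefix is purely illustrative:

```java
import org.apache.lucene.codecs.PostingsFormat;
import org.apache.lucene.codecs.lucene46.Lucene46Codec;

// Subclass the default codec and override getPostingsFormatForField, as
// described above. Only the routing mechanism is real Lucene API;
// BitSetPostingsFormat is the format from this issue's patch.
public class BitSetRoutingCodec extends Lucene46Codec {
  private final PostingsFormat bitSetFormat = new BitSetPostingsFormat();

  @Override
  public PostingsFormat getPostingsFormatForField(String field) {
    // DOCS_ONLY "filter" fields get the bitset format; everything else
    // keeps the default postings format.
    if (field.startsWith("filter_")) {
      return bitSetFormat;
    }
    return super.getPostingsFormatForField(field);
  }
}
```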
Mikhail Khludnev (@mkhludnev) (migrated from JIRA) @mikemccand let's agree on the desired functionality:
I wonder what the correct behavior is if a docsEnum is requested with FLAG_FREQS: should it silently return 1 from freq(), or throw an exception? Let me ask one off-topic question about switching to PulsingPF. I've heard that it's enabled automatically for ID-like fields. Can you point me to where exactly that's done? Does it work for Lucene only, or for Solr as well?
Robert Muir (@rmuir) (migrated from JIRA)
See #5564: if there is only one document in the postings list for a term, we just store that document id instead of a pointer to a list ... of only one document. The freq() for that one document is redundant as well: it's the totalTermFreq() for the term, so there is no frequency data recorded either. It still has a pointer for positions/payloads/offsets if you have those enabled, but in most cases with an ID-like field you do not.
Michael McCandless (@mikemccand) (migrated from JIRA)
I don't think we need to do that (I was convinced, above)? I think it should just be its own PF, and the app picks it to store all postings as bitsets.
I think it should ONLY accept DOCS_ONLY? I.e., throw an exception if it gets anything else, because that's misuse.
I think lie (return 1 from freq()).
Dr Oleg Savrasov (migrated from JIRA) Only the DOCS_ONLY index option is supported; an IllegalArgumentException is thrown for anything else.
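A sketch of what those two decisions could look like; the enclosing writer and enum classes are hypothetical, but FieldInfo.IndexOptions.DOCS_ONLY is the real (4.x-era) option being checked:

```java
import org.apache.lucene.index.FieldInfo;

// In the (hypothetical) bitset postings writer: reject anything richer than
// DOCS_ONLY up front, since freqs/positions cannot be represented in a bitset.
void setField(FieldInfo fieldInfo) {
  if (fieldInfo.getIndexOptions() != FieldInfo.IndexOptions.DOCS_ONLY) {
    throw new IllegalArgumentException(
        "bitset postings only support IndexOptions.DOCS_ONLY, got "
            + fieldInfo.getIndexOptions() + " for field " + fieldInfo.name);
  }
}

// In the (hypothetical) docs enum: "lie" as agreed above, since no
// frequency data is stored.
public int freq() {
  return 1;
}
```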
Mikhail Khludnev (@mkhludnev) (migrated from JIRA) Colleagues, |
Otis Gospodnetic (@otisg) (migrated from JIRA) Is this aiming to do the same thing Yonik did for Heliosearch or something different? |
Yonik Seeley (@yonik) (migrated from JIRA)
What I've done in Heliosearch is: when you do have to allocate memory for a filter, it's allocated off-heap.
Michael McCandless (@mikemccand) (migrated from JIRA) I think the patch looks like a good start! Seems like we need to support a sparse bitset form to make it more general-purpose? Do all Lucene tests pass if you run with -Dtests.codec=BitSetCodec? Why did you use the older BlockTerms dict instead of BlockTree?
Mikhail Khludnev (@mkhludnev) (migrated from JIRA)
Agreed. I wonder what the shortest path is. I see the WAH8 doc-id-set impl. Is it a good idea to take it and move it to ByteBuffer? Or just create it in heap as-is and persist it on disk? Is it worth looking at the Elias-Fano doc id set, which is not committed AFAIK? Or researching other formats like RLE?
There is a codec test for DOCS_ONLY, which passes. How can other tests pass if the format doesn't support freqs and positions? Or do we need to go through all the failures and triage them?
Let's check whether we can move to it.
Bit sets can be faster at advancing and more storage-efficient on dense blocks of postings. This is not a new idea, @mkhludnev proposed something similar a long time ago #6116. @msokolov recently brought up (#14080) that such an encoding has become especially appealing with the introduction of the `DocIdSetIterator#loadIntoBitSet` API, and the fact that non-scoring disjunctions and dense conjunctions now take advantage of it. Indeed, if postings are stored in a bit set, `#loadIntoBitSet` would just need to OR the postings bits into the bits that are used as an intermediate representation of matches of the query.
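A toy illustration (not Lucene's actual internals) of why that OR is cheap: when a block of postings is already stored as a bit set, loading it into the accumulator bits of a query becomes a word-wise OR instead of a per-document loop. All names here are illustrative.

```java
import java.util.Arrays;

public class OrWordsDemo {
  // OR a bitset-encoded postings block into the accumulator used as the
  // intermediate representation of the query's matches.
  static void orInto(long[] acc, long[] postingsBits, int wordOffset) {
    for (int i = 0; i < postingsBits.length; i++) {
      acc[wordOffset + i] |= postingsBits[i];
    }
  }

  public static void main(String[] args) {
    long[] acc = new long[4];           // 256 docs worth of match bits
    long[] block = {0b1011L, 1L << 63}; // docs 0, 1, 3 and 127 of the block
    orInto(acc, block, 1);              // block covers docs 64..191
    System.out.println(Arrays.toString(acc));
  }
}
```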
Colleagues,
When we filter, we don't care about any of the scoring factors (norms, positions, tf), but it should be fast. The obvious way to handle this is to decode the postings list and cache it in heap (CachingWrapperFilter, Solr's DocSet). Both consuming heap and decoding are expensive.
Let's write a postings list as a bitset if its df is greater than the segment's maxDoc/8; at that point the bitset (maxDoc/8 bytes) is no larger than a vInt-coded list at a minimum of one byte per doc (what about skip lists? and overall performance?).
Besides the codec implementation, the trickiest part to me is designing an API for this. How can we let the app know that a term query doesn't need to be cached in heap, but can be held as an mmapped bitset?
WDYT?
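For intuition on the maxDoc/8 threshold above, here is a toy calculation (not patch code, all names illustrative): a fixed bitset always costs maxDoc/8 bytes regardless of df, while a vInt-coded postings list costs at least one byte per document, so the bitset is never larger once df exceeds maxDoc/8.

```java
public class ThresholdDemo {
  public static void main(String[] args) {
    int maxDoc = 1_000_000;
    long bitsetBytes = maxDoc / 8; // 125,000 bytes regardless of df
    for (int df : new int[] {10_000, 125_000, 500_000}) {
      long vIntLowerBound = df;    // >= 1 byte per delta-coded doc id
      System.out.printf("df=%d: bitset=%dB, vInt>=%dB -> %s%n",
          df, bitsetBytes, vIntLowerBound,
          df > maxDoc / 8 ? "bitset" : "vInt");
    }
  }
}
```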
Migrated from LUCENE-5052 by Mikhail Khludnev (@mkhludnev), 3 votes, resolved Mar 18 2015
Attachments: bitsetcodec.zip (versions: 2), LUCENE-5052.patch, LUCENE-5052-1.patch
Linked issues: