bitset codec for off heap filters [LUCENE-5052] #6116
Comments
Stefan Pohl (migrated from JIRA) The following paper might be informative in regard to this ticket (you can even go beyond maxDoc/8, if compared against VInt coding): A. Moffat and J. S. Culpepper. Hybrid Bitvector Index Compression. In Proceedings of the 12th Australasian Document Computing Symposium (ADCS 2007), December 2007, pp. 25-37. More generally, it would be nice to determine the PostingsListFormat depending on statistics of individual terms, not only per-field.
Michael McCandless (@mikemccand) (migrated from JIRA) Couldn't this just be a PostingsFormat, such that for DOCS_ONLY fields with high enough docFreq, it stores them as a dense bitset on disk? Maybe it could wrap another PostingsBaseFormat (like Pulsing) and 'steal' the high-freq terms...
David Smiley (@dsmiley) (migrated from JIRA) Yeah, cool idea. But I feel that to truly take the performance to the next level, there should be a way to intersect the bit vector with another. The spatial Lucene filters have loops that work by populating a FixedBitSet by looping over a DocsEnum. But if behind the scenes it's just another bitset, I would love to efficiently union the bitsets. Example snippet of existing code:

```java
int docid;
while ((docid = docsEnum.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
  bitSet.set(docid);
}
```

I'd bet there's a lot of code like this around.
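For what it's worth, the loop above can already be collapsed with FixedBitSet.or(DocIdSetIterator), though that still advances doc by doc under the hood; the interesting win would be if the codec exposed its on-disk bitset so the union became word-wise ORs. A minimal sketch, assuming that or(DocIdSetIterator) overload is available:

```java
import java.io.IOException;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.util.FixedBitSet;

// Equivalent to the nextDoc()/set(docid) loop above: drain the iterator
// into the bit set. bitSet must be sized to the segment's maxDoc.
void collect(DocIdSetIterator docsEnum, FixedBitSet bitSet) throws IOException {
  bitSet.or(docsEnum);
}
```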
Mikhail Khludnev (@mkhludnev) (migrated from JIRA) It's worth starting from trivial bitset and sorted-ints encodings, and only later considering the more comprehensive Elias-Fano encoding #6148.
Yury Pakhomov (migrated from JIRA) This is a very simple implementation of a codec which stores posting lists as bitsets. This implementation passes the BasePostingsFormatTestCase.testDocsOnly() test.
Nina Gracheva (migrated from JIRA) This is an adapted version of BitSetCodec. It uses the BlockTermsWriter/Reader infrastructure to build a postings format with a custom postings writer/reader and the standard terms writer/reader. TODO: use Long.numberOfTrailingZeros() in advance() and nextDoc().
Dr Oleg Savrasov (migrated from JIRA) Methods nextDoc() and advance() have been implemented using the Long.numberOfTrailingZeros() approach taken from FixedBitSetIterator.
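For reference, a minimal sketch of that technique (not the patch code itself): it assumes the term's postings have already been read into a long[] with one bit per doc id, and scans from the current position to the lowest set bit.

```java
import org.apache.lucene.search.DocIdSetIterator;

// Minimal sketch of a bitset-backed doc id iterator in the style of
// FixedBitSetIterator: 'bits' holds one bit per document in the segment.
class BitSetDocsEnumSketch {
  private final long[] bits;
  private int doc = -1;

  BitSetDocsEnumSketch(long[] bits) {
    this.bits = bits;
  }

  int nextDoc() {
    int next = doc + 1;
    int i = next >> 6; // index of the word containing 'next'
    if (i >= bits.length) {
      return doc = DocIdSetIterator.NO_MORE_DOCS;
    }
    // Long shifts use only the low 6 bits of the count, so this drops
    // exactly the already-returned bits of the current word.
    long word = bits[i] >>> next;
    if (word != 0) {
      return doc = next + Long.numberOfTrailingZeros(word);
    }
    while (++i < bits.length) {
      word = bits[i];
      if (word != 0) {
        return doc = (i << 6) + Long.numberOfTrailingZeros(word);
      }
    }
    return doc = DocIdSetIterator.NO_MORE_DOCS;
  }

  // advance(target) is the same scan, starting at 'target' instead of doc + 1.
}
```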
Ahmet Arslan (@iorixxx) (migrated from JIRA) Hi Stefan Pohl, the paper link seems broken? It gives a 404 for me.
Stefan Pohl (migrated from JIRA) Among other locations, the paper can now be found here:
Michael McCandless (@mikemccand) (migrated from JIRA) This patch looks like a great start! Using BlockTermsDict makes sense; no need to reimplement a terms dictionary (it's not easy!). One problem I see is that every term is written as a bitset? This may be OK for some applications, but I think for wider usage it'd be better if the postings format wrapped another postings format, and then only used the bitset when the docFreq was high enough, and otherwise delegated to the wrapped postings format? Maybe have a look at PulsingPostingsFormat as an example of how to wrap postings formats?
Mikhail Khludnev (@mkhludnev) (migrated from JIRA)
There are two orthogonal conceptions:
I'd like to clarify the use case for this issue (the issue summary might need to be improved). It aims at Solr's fq, or even Heliosearch's GC-lightness. I suppose the user can decide which fields to index with a "no-tf" format; these are "string" fields. Then the user requests filtering on these fields, and no scoring is needed, for sure. @mikemccand Thanks!
Michael McCandless (@mikemccand) (migrated from JIRA) OK, I agree: if we use a sparse bitset, then we could use the format for all postings. I guess we'd switch up the bitset impl depending on docFreq of each term. We already have FieldInfo.IndexOptions.DOCS_ONLY to express that you want to index only the docIDs. E.g. StringField sets this. And our default codec already makes it easy to switch up the postings format by field: just subclass it and override getPostingsFormatForField. |
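A minimal sketch of that per-field routing, assuming a Lucene 4.x-era codec; BitSetPostingsFormat is the hypothetical format being developed in this issue, and routing by a field-name prefix is purely illustrative:

```java
import org.apache.lucene.codecs.PostingsFormat;
import org.apache.lucene.codecs.lucene46.Lucene46Codec;

// Subclass the default codec and override getPostingsFormatForField, as
// described above. Only the routing mechanism is real Lucene API;
// BitSetPostingsFormat is the format from this issue's patch.
public class BitSetRoutingCodec extends Lucene46Codec {
  private final PostingsFormat bitSetFormat = new BitSetPostingsFormat();

  @Override
  public PostingsFormat getPostingsFormatForField(String field) {
    // DOCS_ONLY "filter" fields get the bitset format; everything else
    // keeps the default postings format.
    if (field.startsWith("filter_")) {
      return bitSetFormat;
    }
    return super.getPostingsFormatForField(field);
  }
}
```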
Mikhail Khludnev (@mkhludnev) (migrated from JIRA) @mikemccand let's agree on the desired functionality:
I wonder what the correct behavior is if a docsEnum is requested with FLAG_FREQS: should it silently return 1 from freq(), or throw an exception? Let me ask one off-topic question about switching to PulsingPF. I've heard that it's enabled automatically for ID-like fields. Can you point me to where exactly that's done? Does it work for Lucene only, or for Solr as well?
Robert Muir (@rmuir) (migrated from JIRA)
See #5564: if there is only one document in the postings list for a term, we just store that document id instead of a pointer to a list ... of only one document. The freq() for that one document is redundant as well: it's the totalTermFreq() for the term, so there is no frequency data recorded either. It still has a pointer for positions/payloads/offsets if you have those enabled, but in most cases with an ID-like field you do not.
Michael McCandless (@mikemccand) (migrated from JIRA)
I don't think we need to do that (I was convinced, above)? I think it should just be its own PF, and the app picks it to store all postings as bitsets.
I think it should ONLY accept DOCS_ONLY? I.e., throw an exception if it gets anything else, because that's misuse.
I think lie (return 1 from freq()).
Dr Oleg Savrasov (migrated from JIRA) Only the DOCS_ONLY index option is supported; an IllegalArgumentException is thrown for anything else.
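A sketch of what those two decisions could look like; the enclosing writer and enum classes are hypothetical, but FieldInfo.IndexOptions.DOCS_ONLY is the real (4.x-era) option being checked:

```java
import org.apache.lucene.index.FieldInfo;

// In the (hypothetical) bitset postings writer: reject anything richer than
// DOCS_ONLY up front, since freqs/positions cannot be represented in a bitset.
void setField(FieldInfo fieldInfo) {
  if (fieldInfo.getIndexOptions() != FieldInfo.IndexOptions.DOCS_ONLY) {
    throw new IllegalArgumentException(
        "bitset postings only support IndexOptions.DOCS_ONLY, got "
            + fieldInfo.getIndexOptions() + " for field " + fieldInfo.name);
  }
}

// In the (hypothetical) docs enum: "lie" as agreed above, since no
// frequency data is stored.
public int freq() {
  return 1;
}
```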
Mikhail Khludnev (@mkhludnev) (migrated from JIRA) Colleagues, |
Otis Gospodnetic (@otisg) (migrated from JIRA) Is this aiming to do the same thing Yonik did for Heliosearch or something different? |
Yonik Seeley (@yonik) (migrated from JIRA)
What I've done in Heliosearch is: when you do have to allocate memory for a filter, it's allocated off-heap.
Michael McCandless (@mikemccand) (migrated from JIRA) I think the patch looks like a good start! Seems like we need to support a sparse bitset form to make it more general-purpose? Do all Lucene tests pass if you run with -Dtests.codec=BitSetCodec? Why did you use the older BlockTerms dict instead of BlockTree?
Mikhail Khludnev (@mkhludnev) (migrated from JIRA)
Agreed. I wonder what the shortest path is. I see the WAH8 doc-id-set impl. Is it a good idea to take it and move it to ByteBuffer? Or just create it in heap as-is and persist it on disk? Is it worth looking at the Elias-Fano doc id set, which is not committed AFAIK? Or researching other formats like RLE?
There is a codec test for DOCS_ONLY, which passes. How can other tests pass if the format doesn't support freqs and positions? Or do we need to go through all the failures and triage them?
Let's check whether we can move to it.
Bit sets can be faster at advancing and more storage-efficient on dense blocks of postings. This is not a new idea, @mkhludnev proposed something similar a long time ago #6116. @msokolov recently brought up (#14080) that such an encoding has become especially appealing with the introduction of the `DocIdSetIterator#loadIntoBitSet` API, and the fact that non-scoring disjunctions and dense conjunctions now take advantage of it. Indeed, if postings are stored in a bit set, `#loadIntoBitSet` would just need to OR the postings bits into the bits that are used as an intermediate representation of matches of the query.
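A toy illustration (not Lucene's actual internals) of why that OR is cheap: when a block of postings is already stored as a bit set, loading it into the accumulator bits of a query becomes a word-wise OR instead of a per-document loop. All names here are illustrative.

```java
import java.util.Arrays;

public class OrWordsDemo {
  // OR a bitset-encoded postings block into the accumulator used as the
  // intermediate representation of the query's matches.
  static void orInto(long[] acc, long[] postingsBits, int wordOffset) {
    for (int i = 0; i < postingsBits.length; i++) {
      acc[wordOffset + i] |= postingsBits[i];
    }
  }

  public static void main(String[] args) {
    long[] acc = new long[4];           // 256 docs worth of match bits
    long[] block = {0b1011L, 1L << 63}; // docs 0, 1, 3 and 127 of the block
    orInto(acc, block, 1);              // block covers docs 64..191
    System.out.println(Arrays.toString(acc));
  }
}
```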
Colleagues,
When we filter, we don't care about any of the scoring factors (norms, positions, tf), but it should be fast. The obvious way to handle this is to decode the postings list and cache it in heap (CachingWrapperFilter, Solr's DocSet). Both consuming heap and decoding are expensive.
Let's write a postings list as a bitset if its df is greater than the segment's maxDoc/8; at that point the bitset (maxDoc/8 bytes) is no larger than a vInt-coded list at a minimum of one byte per doc (what about skip lists? and overall performance?).
Besides the codec implementation, the trickiest part to me is designing an API for this. How can we let the app know that a term query doesn't need to be cached in heap, but can be held as an mmapped bitset?
WDYT?
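For intuition on the maxDoc/8 threshold above, here is a toy calculation (not patch code, all names illustrative): a fixed bitset always costs maxDoc/8 bytes regardless of df, while a vInt-coded postings list costs at least one byte per document, so the bitset is never larger once df exceeds maxDoc/8.

```java
public class ThresholdDemo {
  public static void main(String[] args) {
    int maxDoc = 1_000_000;
    long bitsetBytes = maxDoc / 8; // 125,000 bytes regardless of df
    for (int df : new int[] {10_000, 125_000, 500_000}) {
      long vIntLowerBound = df;    // >= 1 byte per delta-coded doc id
      System.out.printf("df=%d: bitset=%dB, vInt>=%dB -> %s%n",
          df, bitsetBytes, vIntLowerBound,
          df > maxDoc / 8 ? "bitset" : "vInt");
    }
  }
}
```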
Migrated from LUCENE-5052 by Mikhail Khludnev (@mkhludnev), 3 votes, resolved Mar 18 2015
Attachments: bitsetcodec.zip (versions: 2), LUCENE-5052.patch, LUCENE-5052-1.patch
Linked issues: