bitset codec for off heap filters [LUCENE-5052] #6116

Closed
asfimport opened this issue Jun 11, 2013 · 21 comments · Fixed by #14133
asfimport commented Jun 11, 2013

Colleagues,

When we filter, we don’t care about any scoring factors (norms, positions, tf), but filtering should be fast. The obvious way to handle this is to decode the postings list and cache it on the heap (CachingWrapperFilter, Solr’s DocSet). Both the heap consumption and the decoding are expensive.
Let’s write a postings list as a bitset if df is greater than the segment's maxDoc/8 (what about skip lists, and overall performance?).
Besides the codec implementation, the trickiest part to me is designing the API for this. How can we let the app know that a term query doesn’t need to be cached on the heap, but can instead be held as an mmapped bitset?

WDYT?
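The maxDoc/8 threshold above falls out of simple space arithmetic: a dense bitset always costs maxDoc/8 bytes per term, while delta-VInt postings cost at least one byte per document, so the bitset is never larger once df exceeds maxDoc/8. A minimal self-contained sketch of that decision (class and method names are hypothetical, not Lucene API):

```java
// Sketch of the space reasoning behind the maxDoc/8 threshold. A dense bitset
// costs a fixed (maxDoc+7)/8 bytes per term; a delta-VInt postings list costs
// at least one byte per delta-encoded doc.
final class EncodingChoice {
    enum Encoding { BITSET, VINT_DELTAS }

    static Encoding choose(int docFreq, int maxDoc) {
        long bitsetBytes = (maxDoc + 7) / 8; // one bit per doc in the segment
        long vintMinBytes = docFreq;         // >= 1 byte per delta-encoded doc
        return vintMinBytes > bitsetBytes ? Encoding.BITSET : Encoding.VINT_DELTAS;
    }

    public static void main(String[] args) {
        // 1M-doc segment: a term matching 200k docs is cheaper as a bitset.
        System.out.println(choose(200_000, 1_000_000)); // BITSET
        System.out.println(choose(50_000, 1_000_000));  // VINT_DELTAS
    }
}
```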


Migrated from LUCENE-5052 by Mikhail Khludnev (@mkhludnev), 3 votes, resolved Mar 18 2015
Attachments: bitsetcodec.zip (versions: 2), LUCENE-5052.patch, LUCENE-5052-1.patch


Stefan Pohl (migrated from JIRA)

The following paper might be informative with regard to this ticket (you can even go beyond maxDoc/8 if compared against VInt coding):

A. Moffat and J. S. Culpepper. Hybrid Bitvector Index Compression. In Proceedings of the 12th Australasian Document Computing Symposium (ADCS 2007), December 2007. pp 25-37.
available from: http://goanna.cs.rmit.edu.au/~e76763/publications.html

More generally, it would be nice to determine the PostingsListFormat depending on statistics of individual terms, not only per-field.


Michael McCandless (@mikemccand) (migrated from JIRA)

Couldn't this just be a PostingsFormat, such that for DOCS_ONLY fields with high enough docFreq, it stores them as a dense bitset on disk?

Maybe it could wrap another PostingsBaseFormat (like Pulsing) and 'steal' the high freq terms...


David Smiley (@dsmiley) (migrated from JIRA)

Yeah, cool idea. But I feel that to truly take the performance to the next level, there should be a way to intersect the bit vector with another. The spatial Lucene filters have loops that populate a FixedBitSet by iterating over a DocsEnum. But if behind the scenes it's just another bitset, I would love to efficiently union the bitsets. Example snippet of existing code:

int docid;
// Copy every matching doc from the postings iterator into the bitset, one at a time.
while ((docid = docsEnum.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
  bitSet.set(docid);
}

I'd bet there's a lot of code like this around.
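If the filter's postings were exposed as raw bitset words, the per-document loop above could collapse into a word-at-a-time OR, touching 64 documents per operation instead of one. A self-contained sketch (hypothetical helpers, not the Lucene API):

```java
// Word-wise union of two long[] bitsets covering the same maxDoc, replacing
// a nextDoc()/set() loop with one OR per 64 documents.
final class BitsetUnion {
    // OR every word of `source` into `dest`.
    static void orInto(long[] dest, long[] source) {
        for (int i = 0; i < source.length; i++) {
            dest[i] |= source[i];
        }
    }

    static void set(long[] bits, int docId) {
        bits[docId >> 6] |= 1L << (docId & 63);
    }

    static boolean get(long[] bits, int docId) {
        return (bits[docId >> 6] & (1L << (docId & 63))) != 0;
    }

    public static void main(String[] args) {
        long[] acc = new long[2];      // covers docs 0..127
        long[] postings = new long[2];
        set(postings, 3);
        set(postings, 70);
        orInto(acc, postings);
        System.out.println(get(acc, 3) && get(acc, 70)); // true
    }
}
```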


asfimport commented Jul 10, 2013

Mikhail Khludnev (@mkhludnev) (migrated from JIRA)

It's worth starting with trivial bitset and sorted-ints encodings, and later considering the more comprehensive Elias-Fano encoding #6148


Yury Pakhomov (migrated from JIRA)

This is a very simple implementation of a codec that stores postings lists as bitsets. The implementation passes the BasePostingsFormatTestCase.testDocsOnly() test.
I also found it difficult to implement the term dictionary, and feel it's better to somehow combine this postings format with one of the standard ones.


Nina Gracheva (migrated from JIRA)

This is an adapted version of BitSetCodec. It uses the BlockTermsWriter/Reader infrastructure to build a postings format with a custom postings writer/reader and the standard terms writer/reader. TODO: use Long.numberOfTrailingZeros() in advance() and nextDoc().


Dr Oleg Savrasov (migrated from JIRA)

Methods nextDoc() and advance() have been implemented using the Long.numberOfTrailingZeros() approach taken from FixedBitSetIterator.
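For reference, the numberOfTrailingZeros technique can be sketched in a self-contained form like this (a simplified model of the approach, not the patch itself; the NO_MORE_DOCS constant mimics DocIdSetIterator):

```java
// Iterate set bits of a long[] bitset as docIDs. advance() masks off bits
// below the target in the first word, then uses Long.numberOfTrailingZeros()
// to jump straight to the next set bit instead of testing docs one by one.
final class BitsetDocIdIterator {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    private final long[] bits;
    private int doc = -1;

    BitsetDocIdIterator(long[] bits) { this.bits = bits; }

    int nextDoc() {
        if (doc == NO_MORE_DOCS) return doc;
        return advance(doc + 1);
    }

    int advance(int target) {
        if (target < 0) target = 0;
        int wordIndex = target >> 6;
        if (wordIndex >= bits.length) return doc = NO_MORE_DOCS;
        // Clear bits below `target` in the first word, then scan forward.
        long word = bits[wordIndex] & (-1L << (target & 63));
        while (true) {
            if (word != 0) {
                return doc = (wordIndex << 6) + Long.numberOfTrailingZeros(word);
            }
            if (++wordIndex == bits.length) return doc = NO_MORE_DOCS;
            word = bits[wordIndex];
        }
    }

    public static void main(String[] args) {
        long[] bits = new long[2];
        bits[0] |= 1L << 5; bits[0] |= 1L << 40; bits[1] |= 1L << 1; // docs 5, 40, 65
        BitsetDocIdIterator it = new BitsetDocIdIterator(bits);
        System.out.println(it.nextDoc());   // 5
        System.out.println(it.advance(6));  // 40
        System.out.println(it.nextDoc());   // 65
        System.out.println(it.nextDoc());   // NO_MORE_DOCS
    }
}
```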


Ahmet Arslan (@iorixxx) (migrated from JIRA)

Hi Stefan Pohl, the paper link seems broken? It gives me a 404.


Stefan Pohl (migrated from JIRA)

Besides other locations, the paper can now be found here:
http://www.culpepper.io/publications.html


Michael McCandless (@mikemccand) (migrated from JIRA)

This patch looks like a great start!

Using BlockTermsDict makes sense; no need to reimplement a terms dictionary (it's not easy!).

One problem I see is that every term is written as a bitset. This may be OK for some applications, but for wider usage it'd be better if the postings format wrapped another postings format, only used the bitset when the docFreq was high enough, and otherwise delegated to the wrapped postings format. Maybe have a look at PulsingPostingsFormat as an example of how to wrap postings formats?


Mikhail Khludnev (@mkhludnev) (migrated from JIRA)

it'd be better if the postings format wrapped another postings format, and then only used the bitset when the docFreq was high enough

There are two orthogonal concerns:

  • particular format - let's generalize the "bitset format" to a "no-tf format", and use WAH8 or Elias-Fano with off-heap access (TODO). Thus, it also works for sparse postings;
  • API - how can a consumer express the intention to use the "no-tf" format? e.g. TermFilter or TermsEnum.docs() with a special flag;

I'd like to clarify the use case for this issue (the issue summary might need to be improved). It targets Solr's fq, or even Heliosearch's GC-lightness. I suppose the user can decide which fields to index with the "no-tf" format; these are "string" fields. Then, when the user requests filtering on these fields, no scoring is needed, for sure.

@mikemccand
Hence, I don't think that conditional triggering is a good choice, though I don't know how to do it otherwise. I might not understand well how the pulsing codec is used (the impl idea is clear, though); can you point me to its usage?

Thanks!


Michael McCandless (@mikemccand) (migrated from JIRA)

OK, I agree: if we use a sparse bitset, then we could use the format for all postings. I guess we'd switch up the bitset impl depending on docFreq of each term.

We already have FieldInfo.IndexOptions.DOCS_ONLY to express that you want to index only the docIDs. E.g. StringField sets this.

And our default codec already makes it easy to switch up the postings format by field: just subclass it and override getPostingsFormatForField.


Mikhail Khludnev (@mkhludnev) (migrated from JIRA)

@mikemccand let's agree on the desired functionality:

  • this postings format should wrap the standard one, like pulsing;
  • if IndexOptions.DOCS_ONLY is provided, this codec suppresses the standard postings format and writes the bitset file (<<should-be>> explore fancier formats then);
  • the same logic applies on reading;

I wonder what's the correct behavior if a DocsEnum is requested with FLAG_FREQS: should it silently return 1 from freq(), or throw an exception?

Let me ask one off-topic question about switching to PulsingPF. I've heard that it's enabled automatically for ID-like fields. Can you point to where that's done, exactly? Does it work for Lucene only, or for Solr also?
Thanks


asfimport commented Mar 25, 2014

Robert Muir (@rmuir) (migrated from JIRA)

Let me ask one off-top question about switching to PulsingPF. I've heard that it's enabled automatically for id-like field. Can you point on where it's done particularly?

See #5564

if there is only one document in the postings list for a term, we just store that document id instead of a pointer to a list ... of only one document.

The freq() for that one document is redundant as well: it's the totalTermFreq() for the term, so there is no frequency data recorded either. It still has a pointer for positions/payloads/offsets if you have those enabled, but in most cases with an ID-like field you do not.
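The inlining Robert describes can be sketched as follows (a hypothetical, simplified model of the pulsing optimization, not Lucene's actual data layout): the term dictionary entry reuses one long either as the single docID or as a postings file pointer.

```java
// Simplified model of "pulsing" a docFreq == 1 term: instead of writing a
// one-entry postings list and storing a pointer to it, store the docID
// itself in the slot where the pointer would go.
final class TermEntry {
    final int docFreq;
    final long value; // docFreq == 1: the docID itself; otherwise: postings file pointer

    TermEntry(int docFreq, long value) {
        this.docFreq = docFreq;
        this.value = value;
    }

    boolean isPulsed() { return docFreq == 1; }

    long inlinedDocId() {
        if (!isPulsed()) throw new IllegalStateException("not a pulsed term");
        return value;
    }

    public static void main(String[] args) {
        TermEntry idTerm = new TermEntry(1, 42L);       // unique ID term: doc 42 inlined
        TermEntry common = new TermEntry(100, 9_999L);  // common term: value is a file pointer
        System.out.println(idTerm.isPulsed()); // true
        System.out.println(common.isPulsed()); // false
    }
}
```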


Michael McCandless (@mikemccand) (migrated from JIRA)

this posting format should wrap the standard one, like pulsing;

I don't think we need to do that (I was convinced, above)? I think it should just be its own PF, and the app picks it to store all postings as bitsets.

if IndexOptions.DOCS_ONLY is provided, this codec suppresses standard posting format and write the bitset file (<<should-be>> explore fancy formats then);

I think it should ONLY accept DOCS_ONLY? Ie, throw an exc if it gets anything else, because it's mis-use.

I wonder what’s the correct behavior if docEnum is requested with FLAG_FREQS, should it silently returns 1 on freq() or throwing exception?

I think lie (return 1 from freq).


Dr Oleg Savrasov (migrated from JIRA)

Only the DOCS_ONLY index option is supported. An IllegalArgumentException is thrown for anything else.


Mikhail Khludnev (@mkhludnev) (migrated from JIRA)

Colleagues,
Would you mind providing feedback on the last patch? What next steps do you expect, to move it forward?


Otis Gospodnetic (@otisg) (migrated from JIRA)

Is this aiming to do the same thing Yonik did for Heliosearch or something different?


Yonik Seeley (@yonik) (migrated from JIRA)

Is this aiming to do the same thing Yonik did for Heliosearch or something different?

What I've done in Heliosearch is: when you do have to allocate memory for a filter, it's allocated off-heap.
This issue is more about avoiding any memory allocation at all for simple term filters (since it's more like just memory-mapping part of the index). It's probably a bit misleading to describe it as "off-heap"; it's more like how the rest of the Lucene index works (a better description would probably be "on-disk").


Michael McCandless (@mikemccand) (migrated from JIRA)

I think the patch looks like a good start! Seems like we need to support a sparse bitset form to make it more general purpose? Do all lucene tests pass if you run with -Dtests.codec=BitSetCodec?

Why did you use the older BlockTerms dict instead of BlockTree?


Mikhail Khludnev (@mkhludnev) (migrated from JIRA)

I think the patch looks like a good start! Seems like we need to support a sparse bitset form to make it more general purpose?

Agreed. I wonder what the shortest path is. I see the WAH8 doc-id-set impl. Is it a good idea to take it and move it to ByteBuffer? Or just create it on the heap as-is and persist it to disk? Is it worth looking at the Elias-Fano doc-id set, which is not committed AFAIK? Or researching other formats like RLE?

Do all lucene tests pass if you run with -Dtests.codec=BitSetCodec?

There is a codec test for docs-only, which passes. But how can other tests pass if the format doesn't support freqs and positions? Or do we need to go through all the failures and triage them?

Why did you use the older BlockTerms dict instead of BlockTree?

Let's check whether we can move to it.

jpountz added a commit to jpountz/lucene that referenced this issue Jan 13, 2025
Bit sets can be faster at advancing and more storage-efficient on dense blocks
of postings. This is not a new idea, @mkhludnev proposed something similar a
long time ago apache#6116.

@msokolov recently brought up (apache#14080) that such an encoding has become
especially appealing with the introduction of the
`DocIdSetIterator#loadIntoBitSet` API, and the fact that non-scoring
disjunctions and dense conjunctions now take advantage of it. Indeed, if
postings are stored in a bit set, `#loadIntoBitSet` would just need to OR the
postings bits into the bits that are used as an intermediate representation of
matches of the query.
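The OR-based loading described in the commit message can be sketched in a simplified, self-contained form (hypothetical names; real postings blocks would also need to handle offsets that are not 64-aligned):

```java
// Sketch of loadIntoBitSet-style loading for bitset-encoded postings: when a
// block of postings is already stored as bitset words and its base doc is a
// multiple of 64 relative to the accumulator, "loading" the block into the
// query's intermediate match bits is a plain word-wise OR.
final class LoadIntoBitSetSketch {
    // OR `blockWords` (postings for docs [blockBase, blockBase + 64*len)) into
    // `acc`, assuming blockBase is a multiple of 64.
    static void loadIntoBitSet(long[] acc, long[] blockWords, int blockBase) {
        int baseWord = blockBase >> 6;
        for (int i = 0; i < blockWords.length; i++) {
            acc[baseWord + i] |= blockWords[i];
        }
    }

    public static void main(String[] args) {
        long[] acc = new long[4];                 // docs 0..255
        long[] block = { 1L | (1L << 10), 1L };   // docs 128, 138, 192 (base 128)
        loadIntoBitSet(acc, block, 128);
        System.out.println((acc[2] & 1L) != 0);   // true: doc 128 is set
        System.out.println((acc[3] & 1L) != 0);   // true: doc 192 is set
    }
}
```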
jpountz added a commit that referenced this issue Jan 14, 2025
jpountz added a commit that referenced this issue Jan 14, 2025