Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Explore within-block skipping for postings #12486

Closed
mikemccand opened this issue Aug 2, 2023 · 2 comments
Closed

Explore within-block skipping for postings #12486

mikemccand opened this issue Aug 2, 2023 · 2 comments

Comments

@mikemccand
Copy link
Member

Description

One of the differences between Tantivy and Lucene is that Tantivy supports within-block skipping (branchless binary search to locate a target docid inside a block of 128 postings), but Lucene skips only to block boundaries, and then does a linear scan within. Lucene is also doing the "accumulate docid deltas into the absolute docid" too in this loop, but I guess Tantivy does this separately somehow?

Anyway, we could explore within-block skipping as well -- it might make conjunctive queries where both terms have roughly the same cardinality, quite a bit faster?

@jpountz
Copy link
Contributor

jpountz commented Oct 31, 2023

Lucene is also doing the "accumulate docid deltas into the absolute docid" too in this loop, but I guess Tantivy does this separately somehow?

I believe Tantivy does the same, except that it can take advantage of SIMD to accumulate docid deltas into the absolute docid (if it did not accumulate deltas up-front, it could not run a branchless binary search later on). I tried to look into whether we can do the same with Panama, but last time I checked it doesn't give ways to use _mm_slli_si128, which prevents us from making a faster prefix sum through vectorization: https://github.com/jpountz/vectorized-prefix-sum.

For reference, there is also this old @mkhludnev idea about encoding dense postings lists as bitsets, which would naturally help with skipping: #6116 (or can we do it on a per-block basis?). And more generally, there are some formats that are better at skipping within blocks like Elias-Fano.

@jpountz
Copy link
Contributor

jpountz commented Jan 13, 2025

I thinkwe can consider this issue as closed via #13958, and possibly further improved via #14133.

@jpountz jpountz closed this as completed Jan 13, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants