Inverted indices should be used to speedup string filters. #3415

Open
3 tasks
westonpace opened this issue Jan 24, 2025 · 3 comments

Comments

@westonpace
Contributor

westonpace commented Jan 24, 2025

If a string column has an FTS index, then we should have enough information to speed up a variety of string-based filters. Here is a listing (currently very partial, as I don't know what's possible):

westonpace changed the title from "Inverted indices should be used to speedup string queries." to "Inverted indices should be used to speedup string filters." on Jan 24, 2025
@wjones127
Contributor

Is there any worry that tokenization could interfere with this? I think in general it only casts a wider net, through:

  1. Lowercasing
  2. Stemming (running, run -> same token)
  3. ASCII folding (café, cafe -> same token)
  4. Stop word removal -> fewer words to match on.

It should be fine, but worth being aware of these transformations.
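
To make the "wider net" concrete, here is a minimal Python sketch of the four transformations listed above. It is not the tokenizer the FTS index actually uses: the `STOP_WORDS` list, the `toy_stem` rule, and the whitespace-only splitting are all assumptions made up for illustration.

```python
import unicodedata

# Hypothetical, deliberately tiny stop word list (an assumption, not the real one).
STOP_WORDS = {"the", "a", "an", "is", "of"}


def ascii_fold(text: str) -> str:
    """Drop combining marks after NFKD decomposition, e.g. 'café' -> 'cafe'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def toy_stem(token: str) -> str:
    """Toy stemmer: strip a trailing 'ing' and collapse a doubled final letter."""
    if token.endswith("ing") and len(token) > 5:
        token = token[:-3]
        if len(token) >= 2 and token[-1] == token[-2]:
            token = token[:-1]
    return token


def tokenize(text: str) -> list[str]:
    """Lowercase, ASCII-fold, stem, and drop stop words (splits on whitespace only)."""
    tokens = []
    for raw in text.split():
        token = toy_stem(ascii_fold(raw.lower()))
        if token and token not in STOP_WORDS:
            tokens.append(token)
    return tokens
```

With these rules, `tokenize("Running")` and `tokenize("run")` yield the same token, so an index lookup alone cannot tell the two apart.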

@westonpace
Contributor Author

Hmm, yeah, it would be a problem if contains('run', 'running') returned true. Maybe a specialized index then. A GIN index like label list could work.

@wjones127
Contributor

> Hmm, yeah, it would be a problem if contains('run', 'running') returned true. Maybe a specialized index then. A GIN index like label list could work.

I was thinking you could still use the FTS index, but add a "refine" step that takes the index results and runs the exact contains test on them afterward. Not optimal in all cases, but it could potentially work.
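
A rough Python sketch of that two-step idea, under stated assumptions: `normalize`, `build_index`, and `contains_with_refine` are hypothetical names, the in-memory dict stands in for the real FTS index, and the toy normalizer only lowercases and strips "ing". It also assumes the needle is made of whole words; a needle that is only part of a word would not be found by the candidate step.

```python
from collections import defaultdict


def normalize(token: str) -> str:
    """Hypothetical stand-in for the FTS tokenizer: lowercase plus a toy stemmer."""
    token = token.lower()
    if token.endswith("ing") and len(token) > 5:
        token = token[:-3]
        if token[-1] == token[-2]:
            token = token[:-1]
    return token


def build_index(rows: list[str]) -> dict[str, set[int]]:
    """Toy inverted index: normalized token -> ids of the rows containing it."""
    index = defaultdict(set)
    for row_id, text in enumerate(rows):
        for word in text.split():
            index[normalize(word)].add(row_id)
    return index


def contains_with_refine(rows: list[str], index: dict[str, set[int]], needle: str) -> list[int]:
    """Return ids of rows whose text contains `needle` as an exact substring."""
    # Step 1 (index): rows that hold every normalized token of the needle.
    token_sets = [index.get(normalize(word), set()) for word in needle.split()]
    if token_sets:
        candidates = set.intersection(*token_sets)
    else:
        # No tokens to look up (e.g. empty needle): fall back to scanning all rows.
        candidates = set(range(len(rows)))
    # Step 2 (refine): run the exact test on the candidates only.
    return sorted(row_id for row_id in candidates if needle in rows[row_id])
```

For example, with `rows = ["I went for a run", "She is running late", "Nothing here"]`, the index returns rows 0 and 1 as candidates for the needle `"running"`, and the refine step drops row 0 because it does not contain the exact string.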
