Generated Columns #3368

Open
chenkovsky opened this issue Jan 11, 2025 · 5 comments
Comments

@chenkovsky (Contributor)

We are using Lance to store a text corpus, but some lightweight normalization (for example, removing sensitive words) must be applied to the text before training. Currently we have to store both the normalized and the unnormalized text, so storage is doubled.

Maybe Lance could implement generated columns, similar to https://www.sqlite.org/gencol.html.

The benefits are:

  • saves storage
  • better IO performance
  • no need to keep the normalized text up to date
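For reference, the SQLite feature linked above looks like the following; a minimal sketch using Python's built-in sqlite3 module (requires SQLite ≥ 3.31, and `lower()` stands in for a real normalization function):

```python
import sqlite3

# Demo of SQLite generated columns (https://www.sqlite.org/gencol.html).
# "norm" is computed from "raw" on read (VIRTUAL), so only "raw" is stored.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE corpus("
    "  raw TEXT,"
    "  norm TEXT GENERATED ALWAYS AS (lower(raw)) VIRTUAL"
    ")"
)
con.execute("INSERT INTO corpus(raw) VALUES ('Hello World')")
print(con.execute("SELECT raw, norm FROM corpus").fetchone())
# ('Hello World', 'hello world')
```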
@LuQQiu (Contributor) commented Jan 14, 2025

@westonpace FYI

@chenkovsky (Contributor, Author) commented Jan 14, 2025

Another scenario: we have a multimodal corpus for training, but the videos are too large, and we want to share videos and images between different datasets. So the videos and large images are stored separately, and we only store a URL in Lance. With generated columns, users could get the videos and images as if they were stored in Lance; they wouldn't need to care about where the images and videos actually live.

Sometimes images are also extracted from videos. With this feature, we could extract images on the fly.

@westonpace (Contributor) commented Jan 14, 2025

This seems reasonable. We'd need to track the function itself as well as which columns are source columns for the function. Most of the work would end up in the scanner, keeping track of the various schemas (but I believe you are also the one that suggested we introduce metadata columns? So maybe some of that schema refactor is overdue).

Today, lightweight functions can be applied as a projection when you query the data:

>>> import pyarrow as pa
>>> tab = pa.table({"a": [-3, -2, -1, 0, 1, 2, 3]})
>>> import lance
>>> ds = lance.write_dataset(tab, "/tmp/foo.lance")
>>> ds.to_table(columns={"a_sq": "a*a"})
pyarrow.Table
a_sq: int64
----
a_sq: [[9,4,1,0,1,4,9]]

> Another scenario: we have a multimodal corpus for training, but the videos are too large, and we want to share videos and images between different datasets. So the videos and large images are stored separately, and we only store a URL in Lance. With generated columns, users could get the videos and images as if they were stored in Lance; they wouldn't need to care about where the images and videos actually live.
>
> Sometimes images are also extracted from videos. With this feature, we could extract images on the fly.

This might be doable, but for more expensive tasks it might make more sense to either do this as a dedicated feature and/or do this later in the pipeline (e.g. in a Python iterator). Though for video → image you could maybe get away with a DataFusion UDF (I feel like memory might get a bit tricky, but maybe that's solved with batch sizes).
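The "later in the pipeline" option could be a plain Python generator wrapped around the scan; a hedged sketch, where the batches-of-dicts shape and the `str.upper` stand-in transform are illustrative only (a real pipeline would scan Lance batches and run a video decoder):

```python
# Sketch: apply an expensive per-row transform (e.g. fetching a video by URL
# and extracting a frame) in a Python iterator over scanned batches, rather
# than inside the storage layer itself.
def with_derived_column(batches, source_key, derive, out_key):
    """Yield each batch (a list of row dicts) with a derived field added."""
    for batch in batches:
        for row in batch:
            row[out_key] = derive(row[source_key])
        yield batch

# Usage with a cheap stand-in transform in place of a video decoder:
batches = [[{"url": "s3://bucket/a.mp4"}], [{"url": "s3://bucket/b.mp4"}]]
out = list(with_derived_column(batches, "url", str.upper, "url_upper"))
print(out[0][0]["url_upper"])  # S3://BUCKET/A.MP4
```

Because the iterator pulls one batch at a time, peak memory is bounded by the batch size, which matches the batch-size point above.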

@chenkovsky (Contributor, Author)

Yes, we can use

ds.to_table(columns={"a_sq": "a*a"})

The only difference is that the data producer can be decoupled from the data consumers. Otherwise, when the SQL logic changes, the producer has to notify every user.

@chenkovsky (Contributor, Author)

> (but I believe you are also the one that suggested we introduce metadata columns? So maybe some of that schema refactor is overdue)

After my PR for DataFusion is merged, I can help refactor the schema.
