Generated Columns #3368
Comments
@westonpace FYI |
Another scenario: we have a multimodal corpus for training, but the videos are too large, and we want to share videos and images between different datasets. So videos and large images are stored separately, and we only store the URL in Lance. With generated columns, users can get the videos or images as if they were stored in Lance; they don't need to care where the images or videos actually live. Sometimes images are also extracted from videos. With this feature, we could extract the images on the fly. |
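A minimal sketch of the URL-backed case described above, in plain Python rather than Lance. Everything here is hypothetical and only for illustration: `BLOB_STORE`, `fetch`, `read_with_generated_image`, and the `image_url` / `image` column names are not part of any Lance API. The stored rows hold only the URL, and the `image` column is produced at read time.

```python
from typing import Callable, Dict, Iterable, Iterator

# Hypothetical stand-in for the external location where the shared
# images or videos actually live (e.g. an object store).
BLOB_STORE: Dict[str, bytes] = {
    "s3://bucket/a.jpg": b"image-a",
    "s3://bucket/b.jpg": b"image-b",
}


def fetch(url: str) -> bytes:
    """Hypothetical resolver; a real one would call an object store client."""
    return BLOB_STORE[url]


def read_with_generated_image(
    rows: Iterable[dict], resolver: Callable[[str], bytes] = fetch
) -> Iterator[dict]:
    """Yield rows with an extra `image` column computed from `image_url`.

    The bytes are resolved lazily, one row at a time, so nothing is
    duplicated in the stored dataset.
    """
    for row in rows:
        yield {**row, "image": resolver(row["image_url"])}


# Only the URL is stored; several datasets can point at the same blob.
stored_rows = [
    {"id": 1, "image_url": "s3://bucket/a.jpg"},
    {"id": 2, "image_url": "s3://bucket/b.jpg"},
]

materialized = list(read_with_generated_image(stored_rows))
```

The consumer iterates over `materialized` as if `image` were a regular column, while the stored rows stay unchanged.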
This seems reasonable. We'd need to track the function itself as well as which columns are source columns for the function. Most of the work would end up in the scanner, keeping track of the various schemas (but I believe you are also the one who suggested we introduce metadata columns? So maybe some of that schema refactor is overdue). Today, lightweight functions can be applied as a projection when you query the data:
This might be doable, but for more expensive tasks it might make more sense to either do this as a dedicated feature and/or do this later in the pipeline (e.g. in a Python iterator). Though for video -> image you could maybe get away with a DataFusion UDF (I feel like memory might get a bit tricky, but that could maybe be solved with batch sizes). |
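To make the bookkeeping mentioned above concrete, here is a rough sketch of tracking a function together with its source columns, and of a scanner-like step that maps a requested projection onto the stored columns it has to read. This is plain Python over dict-of-lists batches, not Lance's scanner; `GeneratedColumn`, `physical_columns_needed`, and `project` are hypothetical names made up for this example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence

Batch = Dict[str, list]  # column name -> column values


@dataclass(frozen=True)
class GeneratedColumn:
    """Metadata for one generated column: its function and its inputs."""
    name: str
    sources: Sequence[str]     # stored columns the function reads
    fn: Callable[..., object]  # applied row by row to the source values


def physical_columns_needed(
    requested: List[str], generated: Dict[str, GeneratedColumn]
) -> List[str]:
    """Translate a user projection into the stored columns to read."""
    needed: List[str] = []
    for col in requested:
        cols = generated[col].sources if col in generated else [col]
        for c in cols:
            if c not in needed:
                needed.append(c)
    return needed


def project(
    batch: Batch, requested: List[str], generated: Dict[str, GeneratedColumn]
) -> Batch:
    """Build the output batch, computing generated columns on the fly."""
    out: Batch = {}
    for col in requested:
        if col in generated:
            g = generated[col]
            inputs = [batch[s] for s in g.sources]
            out[col] = [g.fn(*vals) for vals in zip(*inputs)]
        else:
            out[col] = batch[col]
    return out


generated = {"text_lower": GeneratedColumn("text_lower", ["text"], str.lower)}
batch = {"id": [1, 2], "text": ["Hello", "WORLD"]}

needed = physical_columns_needed(["id", "text_lower"], generated)
result = project(batch, ["id", "text_lower"], generated)
```

The user asks for `text_lower` by name, the scanner-like step reads `text`, and the output schema differs from the stored schema, which is the part that would need the schema refactor.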
yes, we can use
The only difference is that the data producer can be decoupled from the data consumer. Otherwise, when the SQL logic changes, the data producer has to notify every user. |
After my PR for DataFusion is merged, I can help refactor the schema. |
We are using Lance to store a text corpus, but before training, some lightweight normalizations should be applied to the text, for example, removing sensitive words. Currently, we have to store both the normalized and the unnormalized text, so the storage is doubled.
Maybe Lance could implement generated columns, similar to https://www.sqlite.org/gencol.html.
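For reference, this is roughly what the SQLite feature looks like, using Python's built-in `sqlite3` module (it needs SQLite 3.31.0 or newer). The table, the column names, and the `'secret'` placeholder word are made up for illustration. A `VIRTUAL` generated column is computed when it is read, so only the raw text is stored.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# normalized_text is a VIRTUAL generated column: it is derived from
# raw_text on read and takes no extra storage.
conn.execute(
    """
    CREATE TABLE corpus (
        id INTEGER PRIMARY KEY,
        raw_text TEXT NOT NULL,
        normalized_text TEXT
            GENERATED ALWAYS AS (replace(raw_text, 'secret', '***')) VIRTUAL
    )
    """
)

# Producers only write the raw text.
conn.executemany(
    "INSERT INTO corpus (raw_text) VALUES (?)",
    [("the secret plan",), ("nothing to hide",)],
)

# Consumers select the generated column like any other column.
rows = conn.execute(
    "SELECT raw_text, normalized_text FROM corpus ORDER BY id"
).fetchall()
```

The normalization lives in the table definition, so readers do not have to repeat the expression in their own queries.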
The benefits are: