Generated Columns #3368

Open
chenkovsky opened this issue Jan 11, 2025 · 5 comments
Comments

@chenkovsky (Contributor)

We are using Lance to store a text corpus, but some lightweight normalization (for example, removing sensitive words) must be applied to the text before training. Currently we have to store both the normalized and the unnormalized text, so storage is doubled.

Maybe Lance could implement generated columns, similar to https://www.sqlite.org/gencol.html.

The benefits are:

  • saves storage
  • better IO performance
  • no need to keep the normalized text up to date
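For reference, the SQLite feature linked above looks like the following; a minimal sketch using Python's built-in sqlite3 module (requires SQLite ≥ 3.31, and `lower()` stands in for a real normalization function):

```python
import sqlite3

# Demo of SQLite generated columns (https://www.sqlite.org/gencol.html).
# "norm" is computed from "raw" on read (VIRTUAL), so only "raw" is stored.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE corpus("
    "  raw TEXT,"
    "  norm TEXT GENERATED ALWAYS AS (lower(raw)) VIRTUAL"
    ")"
)
con.execute("INSERT INTO corpus(raw) VALUES ('Hello World')")
print(con.execute("SELECT raw, norm FROM corpus").fetchone())
# ('Hello World', 'hello world')
```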
@LuQQiu (Contributor) commented Jan 14, 2025

@westonpace FYI

@chenkovsky (Contributor, Author) commented Jan 14, 2025

Another scenario: we have a multimodal corpus for training, but the videos are too large, and we want to share videos and images between different datasets. So the videos and large images are stored separately, and we only store a URL in Lance. With generated columns, users could get the videos and images as if they were stored in Lance; they wouldn't need to care about where the images and videos actually live.

Sometimes images are also extracted from videos. With this feature, we could extract images on the fly.

@westonpace (Contributor) commented Jan 14, 2025

This seems reasonable. We'd need to track the function itself as well as which columns are source columns for the function. Most of the work would end up in the scanner, keeping track of the various schemas (but I believe you are also the one that suggested we introduce metadata columns? So maybe some of that schema refactor is overdue).

Today, lightweight functions can be applied as a projection when you query the data:

>>> import pyarrow as pa
>>> tab = pa.table({"a": [-3, -2, -1, 0, 1, 2, 3]})
>>> import lance
>>> ds = lance.write_dataset(tab, "/tmp/foo.lance")
>>> ds.to_table(columns={"a_sq": "a*a"})
pyarrow.Table
a_sq: int64
----
a_sq: [[9,4,1,0,1,4,9]]

> Another scenario: we have a multimodal corpus for training, but the videos are too large, and we want to share videos and images between different datasets. So the videos and large images are stored separately, and we only store a URL in Lance. With generated columns, users could get the videos and images as if they were stored in Lance; they wouldn't need to care about where the images and videos actually live.
>
> Sometimes images are also extracted from videos. With this feature, we could extract images on the fly.

This might be doable, but for more expensive tasks it might make more sense to either do this as a dedicated feature and/or do this later in the pipeline (e.g. in a Python iterator). Though for video → image you could maybe get away with a DataFusion UDF (I feel like memory might get a bit tricky, but maybe that's solved with batch sizes).
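The "later in the pipeline" option could be a plain Python generator wrapped around the scan; a hedged sketch, where the batches-of-dicts shape and the `str.upper` stand-in transform are illustrative only (a real pipeline would scan Lance batches and run a video decoder):

```python
# Sketch: apply an expensive per-row transform (e.g. fetching a video by URL
# and extracting a frame) in a Python iterator over scanned batches, rather
# than inside the storage layer itself.
def with_derived_column(batches, source_key, derive, out_key):
    """Yield each batch (a list of row dicts) with a derived field added."""
    for batch in batches:
        for row in batch:
            row[out_key] = derive(row[source_key])
        yield batch

# Usage with a cheap stand-in transform in place of a video decoder:
batches = [[{"url": "s3://bucket/a.mp4"}], [{"url": "s3://bucket/b.mp4"}]]
out = list(with_derived_column(batches, "url", str.upper, "url_upper"))
print(out[0][0]["url_upper"])  # S3://BUCKET/A.MP4
```

Because the iterator pulls one batch at a time, peak memory is bounded by the batch size, which matches the batch-size point above.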

@chenkovsky (Contributor, Author)

Yes, we can use

ds.to_table(columns={"a_sq": "a*a"})

The only difference is that the data producer can be decoupled from the data consumers. Otherwise, when the SQL logic changes, the producer has to notify every user.

@chenkovsky (Contributor, Author)

> (but I believe you are also the one that suggested we introduce metadata columns? So maybe some of that schema refactor is overdue)

After my PR for DataFusion is merged, I can help refactor the schema.
