projection design #95

changhiskhan · 2022-08-11T22:02:51Z

when we project annotations.label for example, the output schema is still a list of struct, so we're still faced as query syntax issues as before.

Would it make sense for the output schema to become just a list of string instead?

Expected Behavior

For a schema of list of struct, annotations.label returns list<string> instead of list<struct<label: string>>

The text was updated successfully, but these errors were encountered:

eddyxu · 2022-09-14T23:41:52Z

Returnning annotations.label as a list has better performance characteristics and less code in training DataLoader.

For example, a pytorch dataset can directly return {"labels": [1, 2, 3, 4], "boxes": [[...], [...]]} instead of [{"label": 1, "box": [....]}, {"label": 2, "box": [...]}.

However, it is unclear to me how to differentiate the intent that wants to extract the field (i.e., to a list[int]) or project a subset of the structure.
For example, with a column list<struct<foo, bar, zoo>> , what syntax would that be to get list<struct<foo, bar>>.

What do you think @changhiskhan

changhiskhan · 2022-09-15T00:12:59Z

Can we do something Xpath-like...but not shitty ? Lol

eddyxu · 2022-09-15T06:54:15Z

Let's try to formalize the requirement here.

Ignoring the actual SQL engine implementation, let us just discuss what kind of queries we need support:

Assuming we have a dataset with schema:

id: int64
image: image<binary>
split: string
annotations: list<struct<
   label: dict<string, int8>,
   box: box2d,
   comment: string,
   object_id: int64
>>

Query 1: Label distribution globally

SELECT count(1), label FROM (
   SELECT unnest(annotations.label) as label FROM t
) GROUP BY 2

Query 2: Calculate of the label distribution between splits

SELECT split, label, count(1) as label_count
FROM (
  SELECT split, unnest(annotations.label) as label
  FROM lance_dataset
) AS unnested_labels
GROUP BY 1, 2

Query 3: Find images that contain the specific labels.

option 1:

SELECT distinct image_id
FROM (
  SELECT image_id, unnest(annotations.label) as label
  FROM lance_dataset
) AS unnested_labels
WHERE  label in ('cat', 'dog')

option 2:

SELECT image_id
FROM lance_dataset
WHERE list_contains(annotations.label, 'cat') or list_contains(annotations.label, 'dog')

option 3 (requires additional function array_intersection):

SELECT image_id
FROM lance_dataset
WHERE len(array_intersection(annotations.label, ['cat', 'dog'])) > 0

Query 4: Find the images and bounding box of specific labels (i.e., "cat")

option 1:

SELECT image_id, list(ann.bbox) as boxes
FROM (
  SELECT image_id, unnest(annotations) as ann
  FROM lance_dataset
) AS unnested_labels
WHERE  ann.label in ('cat', 'dog')
GROUP BY 1

option 2:

SELECT image_id, struct_extract(list_filter(annotations, ann -> ann.label = "cat" or ann.label = "dog"), bbox) as boxes
FROM lance_dataset
WHERE list_contains(annotations.label, 'cat') or list_contains(annotations.label, 'dog')

option 3:

SELECT image_id, struct_extract(list_filter(annotations, ann -> ann.label = "cat" or ann.label = "dog"), bbox) as boxes
FROM lance_dataset
WHERE len(list_intersection(annotations.label, ['cat', 'dog'])) > 0

Query 5: Training with datasets with batch size = N

dataset = lance.torch.Dataset(
    uri, 
    columns=["image", "annotations.label", "annotations.box"],
    batch_size=4,
)

It'd expect that annotations.label returns a list of labels for each row. instead of a list of [{"label": 1, "box": [1, 2, 3, 4]}].

Conclusion:

Discussed offline, it appears that other than query #4, the rest of queries are expected "SELECT annotations.label FROM table" returns schema of list<string> instead of list<struct<label: string>>.

This behavior does exist in Spark SQL as well.

For query 4, we can provide / extend SQL function to achieve it. For example, duckdb has struct_extract(struct, "entry") method (link), which we could probably extend it to support struct_extract(struct, entry1, entry2, ....)

changhiskhan mentioned this issue Aug 11, 2022

Coco benchmarks for lance and parquet formats #97

Merged

5 tasks

eddyxu self-assigned this Aug 13, 2022

eddyxu added c++ C++ issues arrow Apache Arrow related issues labels Aug 13, 2022

eddyxu mentioned this issue Sep 15, 2022

Convert coco annotations from the list of structs to struct of lists #166

Merged

changhiskhan closed this as completed Feb 14, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

projection design #95

projection design #95

changhiskhan commented Aug 11, 2022 •

edited by eddyxu

Loading

eddyxu commented Sep 14, 2022 •

edited

Loading

changhiskhan commented Sep 15, 2022

eddyxu commented Sep 15, 2022 •

edited

Loading

projection design #95

projection design #95

Comments

changhiskhan commented Aug 11, 2022 • edited by eddyxu Loading

Expected Behavior

eddyxu commented Sep 14, 2022 • edited Loading

changhiskhan commented Sep 15, 2022

eddyxu commented Sep 15, 2022 • edited Loading

Conclusion:

changhiskhan commented Aug 11, 2022 •

edited by eddyxu

Loading

eddyxu commented Sep 14, 2022 •

edited

Loading

eddyxu commented Sep 15, 2022 •

edited

Loading