Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

projection design #95

Closed
Tracked by #97
changhiskhan opened this issue Aug 11, 2022 · 3 comments
Closed
Tracked by #97

projection design #95

changhiskhan opened this issue Aug 11, 2022 · 3 comments
Assignees
Labels
arrow Apache Arrow related issues c++ C++ issues

Comments

@changhiskhan
Copy link
Contributor

changhiskhan commented Aug 11, 2022

when we project annotations.label for example, the output schema is still a list of struct, so we're still faced as query syntax issues as before.

Would it make sense for the output schema to become just a list of string instead?

Expected Behavior

For a schema of list of struct, annotations.label returns list<string> instead of list<struct<label: string>>

@eddyxu eddyxu self-assigned this Aug 13, 2022
@eddyxu eddyxu added c++ C++ issues arrow Apache Arrow related issues labels Aug 13, 2022
@eddyxu
Copy link
Contributor

eddyxu commented Sep 14, 2022

Returnning annotations.label as a list has better performance characteristics and less code in training DataLoader.

For example, a pytorch dataset can directly return {"labels": [1, 2, 3, 4], "boxes": [[...], [...]]} instead of [{"label": 1, "box": [....]}, {"label": 2, "box": [...]}.

However, it is unclear to me how to differentiate the intent that wants to extract the field (i.e., to a list[int]) or project a subset of the structure.
For example, with a column list<struct<foo, bar, zoo>> , what syntax would that be to get list<struct<foo, bar>>.

What do you think @changhiskhan

@changhiskhan
Copy link
Contributor Author

Can we do something Xpath-like...but not shitty ? Lol

@eddyxu
Copy link
Contributor

eddyxu commented Sep 15, 2022

Let's try to formalize the requirement here.

Ignoring the actual SQL engine implementation, let us just discuss what kind of queries we need support:

Assuming we have a dataset with schema:

id: int64
image: image<binary>
split: string
annotations: list<struct<
   label: dict<string, int8>,
   box: box2d,
   comment: string,
   object_id: int64
>>

Query 1: Label distribution globally

SELECT count(1), label FROM (
   SELECT unnest(annotations.label) as label FROM t
) GROUP BY 2

Query 2: Calculate of the label distribution between splits

SELECT split, label, count(1) as label_count
FROM (
  SELECT split, unnest(annotations.label) as label
  FROM lance_dataset
) AS unnested_labels
GROUP BY 1, 2

Query 3: Find images that contain the specific labels.

option 1:

SELECT distinct image_id
FROM (
  SELECT image_id, unnest(annotations.label) as label
  FROM lance_dataset
) AS unnested_labels
WHERE  label in ('cat', 'dog')

option 2:

SELECT image_id
FROM lance_dataset
WHERE list_contains(annotations.label, 'cat') or list_contains(annotations.label, 'dog')

option 3 (requires additional function array_intersection):

SELECT image_id
FROM lance_dataset
WHERE len(array_intersection(annotations.label, ['cat', 'dog'])) > 0

Query 4: Find the images and bounding box of specific labels (i.e., "cat")

option 1:

SELECT image_id, list(ann.bbox) as boxes
FROM (
  SELECT image_id, unnest(annotations) as ann
  FROM lance_dataset
) AS unnested_labels
WHERE  ann.label in ('cat', 'dog')
GROUP BY 1

option 2:

SELECT image_id, struct_extract(list_filter(annotations, ann -> ann.label = "cat" or ann.label = "dog"), bbox) as boxes
FROM lance_dataset
WHERE list_contains(annotations.label, 'cat') or list_contains(annotations.label, 'dog')

option 3:

SELECT image_id, struct_extract(list_filter(annotations, ann -> ann.label = "cat" or ann.label = "dog"), bbox) as boxes
FROM lance_dataset
WHERE len(list_intersection(annotations.label, ['cat', 'dog'])) > 0

Query 5: Training with datasets with batch size = N

dataset = lance.torch.Dataset(
    uri, 
    columns=["image", "annotations.label", "annotations.box"],
    batch_size=4,
)

It'd expect that annotations.label returns a list of labels for each row. instead of a list of [{"label": 1, "box": [1, 2, 3, 4]}].

Conclusion:

Discussed offline, it appears that other than query #4, the rest of queries are expected "SELECT annotations.label FROM table" returns schema of list<string> instead of list<struct<label: string>>.

This behavior does exist in Spark SQL as well.

For query 4, we can provide / extend SQL function to achieve it. For example, duckdb has struct_extract(struct, "entry") method (link), which we could probably extend it to support struct_extract(struct, entry1, entry2, ....)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Apache Arrow related issues c++ C++ issues
Projects
None yet
Development

No branches or pull requests

2 participants