-
Notifications
You must be signed in to change notification settings - Fork 251
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
projection design #95
Comments
Returnning For example, a pytorch dataset can directly return However, it is unclear to me how to differentiate the intent that wants to extract the field (i.e., to a What do you think @changhiskhan |
Can we do something Xpath-like...but not shitty ? Lol |
Let's try to formalize the requirement here. Ignoring the actual SQL engine implementation, let us just discuss what kind of queries we need support: Assuming we have a dataset with schema:
Query 1: Label distribution globally SELECT count(1), label FROM (
SELECT unnest(annotations.label) as label FROM t
) GROUP BY 2 Query 2: Calculate of the label distribution between splits SELECT split, label, count(1) as label_count
FROM (
SELECT split, unnest(annotations.label) as label
FROM lance_dataset
) AS unnested_labels
GROUP BY 1, 2 Query 3: Find images that contain the specific labels. option 1: SELECT distinct image_id
FROM (
SELECT image_id, unnest(annotations.label) as label
FROM lance_dataset
) AS unnested_labels
WHERE label in ('cat', 'dog') option 2: SELECT image_id
FROM lance_dataset
WHERE list_contains(annotations.label, 'cat') or list_contains(annotations.label, 'dog') option 3 (requires additional function array_intersection): SELECT image_id
FROM lance_dataset
WHERE len(array_intersection(annotations.label, ['cat', 'dog'])) > 0 Query 4: Find the images and bounding box of specific labels (i.e., "cat") option 1: SELECT image_id, list(ann.bbox) as boxes
FROM (
SELECT image_id, unnest(annotations) as ann
FROM lance_dataset
) AS unnested_labels
WHERE ann.label in ('cat', 'dog')
GROUP BY 1 option 2: SELECT image_id, struct_extract(list_filter(annotations, ann -> ann.label = "cat" or ann.label = "dog"), bbox) as boxes
FROM lance_dataset
WHERE list_contains(annotations.label, 'cat') or list_contains(annotations.label, 'dog') option 3: SELECT image_id, struct_extract(list_filter(annotations, ann -> ann.label = "cat" or ann.label = "dog"), bbox) as boxes
FROM lance_dataset
WHERE len(list_intersection(annotations.label, ['cat', 'dog'])) > 0 Query 5: Training with datasets with batch size = N dataset = lance.torch.Dataset(
uri,
columns=["image", "annotations.label", "annotations.box"],
batch_size=4,
) It'd expect that Conclusion:Discussed offline, it appears that other than query #4, the rest of queries are expected "SELECT annotations.label FROM table" returns schema of This behavior does exist in Spark SQL as well. For query 4, we can provide / extend SQL function to achieve it. For example, |
when we project
annotations.label
for example, the output schema is still a list of struct, so we're still faced as query syntax issues as before.Would it make sense for the output schema to become just a list of string instead?
Expected Behavior
For a schema of list of struct,
annotations.label
returnslist<string>
instead oflist<struct<label: string>>
The text was updated successfully, but these errors were encountered: