Coco benchmarks for lance and parquet formats #97
Merged
Commits (9)
e98a29c  Coco benchmarks for lance and parquet formats  (changhiskhan)
a877939  run multiple times  (changhiskhan)
debd435  minor fix  (changhiskhan)
ed196ca  rebase  (changhiskhan)
2d3290e  Apples-to-apples comparison  (changhiskhan)
e060614  Add embedded images  (changhiskhan)
96c53e6  create oxford_pet benchmarks  (changhiskhan)
4191279  add xmls for pet  (changhiskhan)
d7393d8  add histogram  (changhiskhan)
@@ -1,81 +1,106 @@
 #!/usr/bin/env python3
 
-import argparse
 import json
 import os
+from typing import Union
 
 import duckdb
 import pandas as pd
 import pyarrow as pa
 import pyarrow.fs
 
-from bench_utils import download_uris, timeit
-
-
-def get_metadata(base_uri: str, split: str = "val"):
-    annotation_uri = os.path.join(base_uri, f"annotations/instances_{split}2017.json")
-    fs, path = pa.fs.FileSystem.from_uri(annotation_uri)
-    with fs.open_input_file(path) as fobj:
-        annotation_json = json.load(fobj)
-    df = pd.DataFrame(annotation_json["annotations"])
-    category_df = pd.DataFrame(annotation_json["categories"])
-    annotations_df = df.merge(category_df, left_on="category_id", right_on="id").rename(
-        {"id": "category_id"}
-    )
-    anno_df = (
-        pd.DataFrame(
-            {
-                "image_id": df.image_id,
-                "annotations": annotations_df.drop(
-                    columns=["image_id"], axis=1
-                ).to_dict(orient="records"),
-            }
-        )
-        .groupby("image_id")
-        .agg(list)
-    )
-    # print(anno_df, anno_df.columns)
-    images_df = pd.DataFrame(annotation_json["images"])
-    images_df["split"] = split
-    images_df["image_uri"] = images_df["file_name"].apply(
-        lambda fname: os.path.join(base_uri, f"{split}2017", fname)
-    )
-    return images_df.merge(anno_df, left_on="id", right_on="image_id")
-
-
-@timeit
-def get_label_distribution(base_uri: str):
+
+import lance
+import pyarrow.compute as pc
+import pyarrow.dataset as ds
+from bench_utils import download_uris, get_uri, get_dataset, BenchmarkSuite
+from parse_coco import CocoConverter
+
+coco_benchmarks = BenchmarkSuite("coco")
+
+
+@coco_benchmarks.benchmark("label_distribution", key=['fmt', 'flavor'])
+def label_distribution(base_uri: str, fmt: str, flavor: str = None):
+    if fmt == 'raw':
+        return _label_distribution_raw(base_uri)
+    elif fmt == 'lance':
+        uri = get_uri(base_uri, "coco", fmt, flavor)
+        dataset = get_dataset(uri)
+        return _label_distribution_lance(dataset)
+    elif fmt == 'parquet':
+        uri = get_uri(base_uri, "coco", fmt, flavor)
+        dataset = get_dataset(uri)
+        return _label_distribution_duckdb(dataset)
+    raise NotImplementedError()
+
+
+@coco_benchmarks.benchmark("filter_data", key=['fmt', 'flavor'])
+def filter_data(base_uri: str, fmt: str, flavor: str = None):
+    if fmt == 'raw':
+        return _filter_data_raw(base_uri)
+    elif fmt == 'lance':
+        return _filter_data_lance(base_uri, flavor=flavor)
+    elif fmt == 'parquet':
+        return _filter_data_parquet(base_uri, flavor=flavor)
+    raise NotImplementedError()
+
+
+def _label_distribution_raw(base_uri: str):
     """Mimic
     SELECT label, count(1) FROM coco_dataset GROUP BY 1
     """
-    metadata = get_metadata(base_uri)
-    exploded_series = (
-        metadata["annotations"].explode("annotations").apply(lambda r: r["name"])
-    )
-    return exploded_series.value_counts()
+    c = CocoConverter(base_uri)
+    df = c.read_metadata()
+    return pd.json_normalize(df.annotations.explode()).name.value_counts()
 
 
-@timeit
-def get_filtered_data(url: str, klass="cat", offset=20, limit=50):
+def _filter_data_raw(base_uri: str, klass="cat", offset=20, limit=50):
     """SELECT image, annotations FROM coco WHERE annotations.label = 'cat' LIMIT 50 OFFSET 20"""
     # %time rs = bench.get_pets_filtered_data(url, "pug", 20, 50)
-    df = get_metadata(url)
-    print(df["annotations"])
-    filtered = df[["image_uri", "annotations"]].loc[df["annotations"].apply(
-        lambda annos: any([a["name"] == "cat" for a in annos])
-    )]
+    c = CocoConverter(base_uri)
+    df = c.read_metadata()
+    mask = df.annotations.apply(lambda ann: any([a["name"] == klass for a in ann]))
+    filtered = df.loc[mask, ["image_uri", "annotations"]]
     limited = filtered[offset:offset + limit]
-    limited["image"] = download_uris(limited.image_uri)
+    limited.assign(image=download_uris(limited.image_uri))
     return limited
 
 
-def main():
-    parser = argparse.ArgumentParser(description="Benchmarks on COCO dataset")
-    parser.add_argument("uri", help="base uri for coco dataset")
-    args = parser.parse_args()
+def _filter_data_lance(base_uri: str, klass="cat", offset=20, limit=50, flavor=None):
+    uri = get_uri(base_uri, "coco", "lance", flavor)
+    index_scanner = lance.scanner(uri, columns=['image_id', 'annotations.name'])
+    query = (f"SELECT distinct image_id FROM ("
+             f"  SELECT image_id, UNNEST(annotations) as ann FROM index_scanner"
+             f") WHERE ann.name == '{klass}'")
+    filtered_ids = duckdb.query(query).arrow().column("image_id").combine_chunks()
+    scanner = lance.scanner(uri, ['image_id', 'image', 'annotations.name'],
+                            # filter=pc.field("image_id").isin(filtered_ids),
+                            limit=50, offset=20)
+    return scanner.to_table().to_pandas()
+
+
+def _filter_data_parquet(base_uri: str, klass="cat", offset=20, limit=50, flavor=None):
+    uri = get_uri(base_uri, "coco", "parquet", flavor)
+    dataset = ds.dataset(uri)
+    query = (f"SELECT distinct image_id FROM ("
+             f"  SELECT image_id, UNNEST(annotations) as ann FROM dataset"
+             f") WHERE ann.name == '{klass}'")
+    filtered_ids = duckdb.query(query).arrow().column("image_id").to_numpy().tolist()
+    id_string = ','.join([f"'{x}'" for x in filtered_ids])
+    return duckdb.query(f"SELECT image, annotations "
+                        f"FROM dataset "
+                        f"WHERE image_id in ({id_string}) "
+                        f"LIMIT 50 OFFSET 20").to_arrow_table()
+
+
-    get_label_distribution(args.uri)
-    get_filtered_data(args.uri)
+def _label_distribution_lance(dataset: ds.Dataset):
+    scanner = lance.scanner(dataset, columns=['annotations.name'])
+    return _label_distribution_duckdb(scanner)
+
+
+def _label_distribution_duckdb(arrow_obj: Union[ds.Dataset, ds.Scanner]):
+    query = """\
+    SELECT ann.name, COUNT(1) FROM (
+      SELECT UNNEST(annotations) as ann FROM arrow_obj
+    ) GROUP BY 1
+    """
+    return duckdb.query(query).to_df()
 
 
 if __name__ == "__main__":
+    main = coco_benchmarks.create_main()
     main()
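All three label-distribution paths above funnel into the same DuckDB query shape: UNNEST the per-image annotations list, then GROUP BY the category name. Below is a minimal, self-contained sketch of that pattern against a toy in-memory Arrow table; the table, its values, and the alias n are invented for illustration, and only the query shape mirrors _label_distribution_duckdb.

import duckdb
import pyarrow as pa

# Toy stand-in for the COCO metadata: one row per image, with "annotations"
# as a list of structs carrying the category name.
arrow_obj = pa.table({
    "image_id": [1, 2, 3],
    "annotations": [
        [{"name": "cat"}, {"name": "dog"}],
        [{"name": "cat"}],
        [{"name": "dog"}, {"name": "dog"}],
    ],
})

# Same shape as the benchmark's query: flatten the list column, then count
# how often each category name appears.
query = """
    SELECT ann.name, COUNT(1) AS n FROM (
        SELECT UNNEST(annotations) AS ann FROM arrow_obj
    ) GROUP BY 1
"""
print(duckdb.query(query).to_df())  # expect dog -> 3, cat -> 2

DuckDB resolves arrow_obj through its replacement scans over local Python variables, which is why the benchmark can hand the same query a pyarrow dataset or a Lance scanner without materializing it first.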
Is there something similar in pytest or another library?
Yeah, there are a bunch; it was easier to just write something quick for our use case here. We could switch to asv at some point: https://asv.readthedocs.io/en/stable/index.html
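For comparison, the same benchmark expressed under asv would be a parameterized class with time_* methods. This is only a rough sketch: the import path, BASE_URI constant, and parameter values are placeholders, not anything defined by this PR or required by asv.

# Hypothetical benchmarks/bench_coco_asv.py for an asv setup.
from bench_coco import label_distribution  # assumed import path for the PR's module

BASE_URI = "s3://example-bucket/coco"  # placeholder dataset location


class CocoLabelDistribution:
    # asv runs the benchmark once per parameter combination, similar to
    # BenchmarkSuite's key=['fmt', 'flavor'] above.
    params = (["raw", "lance", "parquet"], [None])
    param_names = ["fmt", "flavor"]

    def time_label_distribution(self, fmt, flavor):
        label_distribution(BASE_URI, fmt, flavor=flavor)

The main thing asv would add over the hand-rolled suite is automatic repetition and result tracking across commits, at the cost of its own config and runner.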