Build Scanner with nested column projection and limit / offset push down. #61

eddyxu · 2022-07-29T23:30:45Z

No description provided.

eddyxu · 2022-07-31T04:54:26Z

A code snippet to run:

import time

import duckdb
import lance

ds = lance.dataset("s3://eto-ops-testing/coco.lance")

start = time.time()
# Read only the nested `annotations.label` field and push limit=10 down to the scanner.
scan = lance.scanner(ds, columns=["annotations.label"], limit=10)
# DuckDB resolves `scan` in the SQL below from the local Python variable of the same name.
print(duckdb.query(
    "SELECT annotations.label, count(1) FROM (SELECT UNNEST(annotations) as annotations FROM scan) GROUP BY 1"))
end = time.time()
print(f"Query time: {end - start}")
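For readers without access to the dataset, here is a rough pure-Python sketch of what the snippet above asks for: project the nested `annotations.label` field, stop after `limit` rows, then count labels across the unnested annotations. The `project` helper and the sample rows are hypothetical and only illustrate the idea; they are not how the Lance scanner is implemented.

```python
from collections import Counter


def project(record, path):
    """Keep only the field named by a dotted path (hypothetical helper).

    Descends through structs (dicts) and lists of structs.
    """
    head, _, rest = path.partition(".")
    value = record[head]
    if rest:
        if isinstance(value, list):
            value = [project(item, rest) for item in value]
        else:
            value = project(value, rest)
    return {head: value}


# Made-up rows shaped loosely like a COCO-style dataset.
rows = [
    {"image_id": 1, "annotations": [{"label": "cat", "bbox": [0, 0, 1, 1]},
                                    {"label": "dog", "bbox": [1, 1, 2, 2]}]},
    {"image_id": 2, "annotations": [{"label": "cat", "bbox": [2, 2, 3, 3]}]},
    {"image_id": 3, "annotations": []},
]

limit = 2
# Apply the limit first, then keep only the projected nested field.
projected = [project(r, "annotations.label") for r in rows[:limit]]
# Equivalent of UNNEST(annotations) followed by GROUP BY label.
label_counts = Counter(a["label"] for r in projected for a in r["annotations"])
print(label_counts)
```

The point of pushing projection and limit into the scanner is that the other fields (`bbox`, `image_id` in this sketch) and the rows past the limit never need to be read at all.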

changhiskhan · 2022-07-31T05:38:51Z

python/lance/_lib.pyx

+        builder.get().Project([tobytes(c) for c in columns])
+    if filter is not None:
+        builder.get().Filter(_bind(filter, dataset.schema()))
+    if limit is not None:


Silly edge case, but offset is ignored if limit isn't specified. May want to document that.

IIRC, the SQL standard does not support OFFSET without LIMIT.
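The edge case discussed above can be shown with a small stand-in. This is only a sketch, assuming the offset is consulted solely inside the `if limit is not None:` branch as the diff suggests; `scan_rows` is a hypothetical function, not part of the lance API.

```python
from itertools import islice


def scan_rows(rows, limit=None, offset=0):
    """Hypothetical stand-in for the scanner's limit/offset handling."""
    if limit is not None:
        # offset only takes effect together with a limit
        return list(islice(rows, offset, offset + limit))
    # no limit given: offset is silently ignored
    return list(rows)


data = list(range(10))
with_limit = scan_rows(data, limit=3, offset=2)
without_limit = scan_rows(data, offset=2)
print(with_limit)
print(without_limit)
```

With a limit, the offset skips the first two rows; without one, the full input comes back unchanged, which is the behavior worth documenting.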

eddyxu added 9 commits July 29, 2022 15:58

found a bug

973b870

simplify scanner to arrow RecordBatchReader

443f922

just use arrow schedule

9544e5e

simplify scanner

ce353a8

scanner builder

31cee91

add

e817eab

add c++ test

1c99884

pass cpp test

ba07aae

use fragement scan options to pass limit / offset

ac39d05

eddyxu self-assigned this Jul 29, 2022

eddyxu added the c++ C++ issues label Jul 29, 2022

eddyxu added 2 commits July 29, 2022 16:32

cleanup

5c401e4

at least can pass filter?

ed657f9

eddyxu mentioned this pull request Jul 30, 2022

Pyarrow Dataset Scanner has no public cython definiation #62

Closed

eddyxu and others added 4 commits July 30, 2022 01:44

cython attemp

ff074ff

add arrow-python to manyliux

50b3047

compiles

616e6aa

trick dataset scanner

89ad3cc

eddyxu requested a review from changhiskhan July 31, 2022 04:50

Merge branch 'main' into lei/py_filter

5e2c1c9

eddyxu marked this pull request as ready for review July 31, 2022 04:51

changhiskhan approved these changes Jul 31, 2022

fix limt w/o filters

5b512e2

eddyxu merged commit 2586532 into main Jul 31, 2022

eddyxu deleted the lei/py_filter branch July 31, 2022 07:07

eddyxu mentioned this pull request Aug 1, 2022

Expose filter APIs via Python #53

Closed
