
fix: respect bigquery limit #2754

Merged
merged 14 commits into main from tycho/bigquer-limit-respect on Mar 18, 2024
Conversation

@tychoish (Contributor) commented Mar 7, 2024

Closes #716


I found this as I was looking through the issues earlier today.

One concern with this (and limit pushdowns in general) is that we could under-report data in cases like this:

Imagine we only partially push down a predicate but we completely push down the limit. If we get n rows back from the data source and then filter them further, we've returned fewer than the target number of rows. Right?

This isn't new, and I'm not sure it should stop us from doing this; also, DataFusion could pass in a different limit than what the user specified.

@@ -342,7 +349,7 @@ impl TableProvider for BigQueryTableProvider {
     };
     if let Some(stream) = stream_opt {
         match send.send(stream).await {
-            Ok(_) => {}
+            Ok(_) => count += 1,
@tychoish (Contributor, Author) commented on the diff:

This counts batches, not records; will keep looking.

@tychoish (Contributor, Author) commented Mar 8, 2024

(To be clear, this doesn't push the limit down to BigQuery; the assumption is that for big results, the client's consumption of result batches from BQ will put some back pressure on the service.)

@universalmind303 (Contributor) commented:

I don't think we should need to write our own limit execs. If we are unable to push down the limit, we should use DataFusion's GlobalLimitExec or LocalLimitExec instead. They do the same thing, but they're less opaque within the plan and have native support.

@universalmind303 (Contributor) left a comment:

I'd prefer to see us rewrite the plan to instead wrap the execution node in a limit node. This makes the plan more reflective of the actual operations (we can't push down limits for BigQuery).
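
A minimal sketch of that approach, assuming DataFusion's physical-plan API (the apply_limit helper is hypothetical, not code from this PR): wrap whatever exec node the provider builds in a GlobalLimitExec when a limit is supplied but can't be pushed to BigQuery.

```rust
// Hedged sketch, not the GlareDB code: make the limit visible in the plan by
// wrapping the provider's exec node in DataFusion's GlobalLimitExec.
use std::sync::Arc;
use datafusion::physical_plan::{limit::GlobalLimitExec, ExecutionPlan};

fn apply_limit(
    input: Arc<dyn ExecutionPlan>,
    limit: Option<usize>,
) -> Arc<dyn ExecutionPlan> {
    match limit {
        // skip = 0, fetch = Some(n): the limit node stops pulling from its
        // input once n rows have been produced.
        Some(n) => Arc::new(GlobalLimitExec::new(input, 0, Some(n))),
        None => input,
    }
}
```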

@scsmithr (Member) commented:

I don't think this is needed.

  • limit isn't passed down if we don't return Exact for the filter support method.
  • The global limit will stop pulling the stream once it gets the required number of rows, and so will stop execution of this node when it stops pulling.

There is the issue of joins, though, which may try to get the entire table in some cases. I think the best solution here would be to see if we can report statistics for the purposes of join ordering.
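
As a rough illustration of the statistics idea (hedged: this assumes DataFusion's Statistics and Precision types plus a hypothetical cached row count; it is not something this PR implements), a table provider could report an inexact row count for the optimizer to use when ordering joins:

```rust
// Hypothetical sketch: surface a cached (inexact) row count through
// DataFusion's Statistics so the optimizer has something to go on when
// ordering joins that involve a BigQuery table.
use datafusion::arrow::datatypes::Schema;
use datafusion::common::stats::Precision;
use datafusion::common::Statistics;

fn bigquery_table_statistics(schema: &Schema, cached_row_count: Option<usize>) -> Statistics {
    // Start from "everything unknown" and fill in what we do have.
    let mut stats = Statistics::new_unknown(schema);
    if let Some(rows) = cached_row_count {
        stats.num_rows = Precision::Inexact(rows);
    }
    stats
}
```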

@tychoish (Contributor, Author) commented:

> I don't think this is needed.

Fair enough. Given this, the above, and the semantics of the BQ integration, is the original ticket actionable?

> I think the best solution here would be to see if we can report statistics for the purposes of join ordering.

Do we do that for any other datasource?

@scsmithr (Member) commented:

> Fair enough. Given this, the above, and the semantics of the BQ integration, is the original ticket actionable?

I think we should include the limit in the query we send to BQ if it's supplied, even though we know it's never actually passed down with how things currently are.

I think a second step would be to actually check the filters provided so that we can return Exact or Inexact as necessary. We can go off of the expressions used when building the query string for this. If the predicate only contains those, then it's Exact. Otherwise just return Inexact.
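
A rough sketch of that check, assuming DataFusion's supports_filters_pushdown API (&self omitted; can_push_to_bigquery_sql is a hypothetical stand-in for whatever logic builds the query string, not code from this PR):

```rust
// Hedged sketch, not the GlareDB implementation: report Exact only for
// filters we know end up in the generated BigQuery SQL, and Inexact
// otherwise so DataFusion re-applies the predicate after the scan.
use datafusion::error::Result;
use datafusion::logical_expr::{Expr, Operator, TableProviderFilterPushDown};

fn supports_filters_pushdown(
    filters: &[&Expr],
) -> Result<Vec<TableProviderFilterPushDown>> {
    Ok(filters
        .iter()
        .map(|f| {
            if can_push_to_bigquery_sql(f) {
                TableProviderFilterPushDown::Exact
            } else {
                TableProviderFilterPushDown::Inexact
            }
        })
        .collect())
}

// Hypothetical helper: treat only simple `column <op> literal` comparisons
// as translatable into the BigQuery query string.
fn can_push_to_bigquery_sql(expr: &Expr) -> bool {
    match expr {
        Expr::BinaryExpr(b) => {
            matches!(
                b.op,
                Operator::Eq
                    | Operator::NotEq
                    | Operator::Lt
                    | Operator::LtEq
                    | Operator::Gt
                    | Operator::GtEq
            ) && matches!(
                (b.left.as_ref(), b.right.as_ref()),
                (Expr::Column(..), Expr::Literal(..))
            )
        }
        _ => false,
    }
}
```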

> Do we do that for any other datasource?

No, but we should at some point. We'll want to look at what a statistics cache might look like for data sources. I think BigQuery has a 10 MB minimum of processed data per query, so querying the information schema every time wouldn't be great.

@tychoish (Contributor, Author) commented:

I updated this to just include the limit in the query string.
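
A minimal sketch of that final approach, assuming the query is assembled as a SQL string before being sent to BigQuery (with_limit is a hypothetical helper, not the PR's actual code):

```rust
// Hedged sketch: append the limit to the generated BigQuery query string
// when one is supplied, otherwise leave the query untouched.
fn with_limit(query: String, limit: Option<usize>) -> String {
    match limit {
        Some(n) => format!("{query} LIMIT {n}"),
        None => query,
    }
}
```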

@tychoish merged commit f1b558a into main on Mar 18, 2024
25 checks passed
@tychoish deleted the tycho/bigquer-limit-respect branch on March 18, 2024 at 15:15
Successfully merging this pull request may close these issues.

Limit bigquery records returned