
fix: respect bigquery limit #2754

Merged
merged 14 commits into main from tycho/bigquer-limit-respect on Mar 18, 2024
Conversation

@tychoish (Contributor) commented Mar 7, 2024

Closes #716


I found this as I was looking through the issues earlier today.

One concern with this (and limit pushdowns in general) is that we could under-report data in cases like this:

Imagine we only partially push down a predicate but we completely push down the limit. If we get n rows back from the data source and then filter them further, we've returned fewer than the target number of rows. Right?

This isn't new, and I'm not sure it should stop us from doing this; also, DataFusion could pass in a different limit than what the user specified.

@@ -342,7 +349,7 @@ impl TableProvider for BigQueryTableProvider {
     };
     if let Some(stream) = stream_opt {
         match send.send(stream).await {
-            Ok(_) => {}
+            Ok(_) => count += 1,
@tychoish (Contributor, Author) commented on the diff:

This counts batches, not records; will keep looking.

@tychoish (Contributor, Author) commented Mar 8, 2024

(To be clear, this doesn't push the limit down to BigQuery; the assumption is that for big results, the client's consumption of result batches from BQ will put some back pressure on the service.)

@universalmind303 (Contributor) commented:

I don't think we should need to write our own limit execs. If we are unable to push down the limit, we should use DataFusion's GlobalLimitExec or LocalLimitExec instead. They do the same thing, but they're less opaque within the plan and have native support.

@universalmind303 (Contributor) left a comment:

I'd prefer to see us rewrite the plan to instead wrap the execution node in a limit node. This makes the plan more reflective of the actual operations (we can't push down limits for BigQuery).
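
A minimal sketch of that approach, assuming DataFusion's physical-plan API (the apply_limit helper is hypothetical, not code from this PR): wrap whatever exec node the provider builds in a GlobalLimitExec when a limit is supplied but can't be pushed to BigQuery.

```rust
// Hedged sketch, not the GlareDB code: make the limit visible in the plan by
// wrapping the provider's exec node in DataFusion's GlobalLimitExec.
use std::sync::Arc;
use datafusion::physical_plan::{limit::GlobalLimitExec, ExecutionPlan};

fn apply_limit(
    input: Arc<dyn ExecutionPlan>,
    limit: Option<usize>,
) -> Arc<dyn ExecutionPlan> {
    match limit {
        // skip = 0, fetch = Some(n): the limit node stops pulling from its
        // input once n rows have been produced.
        Some(n) => Arc::new(GlobalLimitExec::new(input, 0, Some(n))),
        None => input,
    }
}
```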

@scsmithr (Member) commented:

I don't think this is needed.

  • limit isn't passed down if we don't return Exact for the filter support method.
  • The global limit will stop pulling the stream once it gets the required number of rows, and so will stop execution of this node when it stops pulling.

There is the issue of joins, though, which may try to get the entire table in some cases. I think the best solution here would be to see if we can report statistics for the purposes of join ordering.
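
As a rough illustration of the statistics idea (hedged: this assumes DataFusion's Statistics and Precision types plus a hypothetical cached row count; it is not something this PR implements), a table provider could report an inexact row count for the optimizer to use when ordering joins:

```rust
// Hypothetical sketch: surface a cached (inexact) row count through
// DataFusion's Statistics so the optimizer has something to go on when
// ordering joins that involve a BigQuery table.
use datafusion::arrow::datatypes::Schema;
use datafusion::common::stats::Precision;
use datafusion::common::Statistics;

fn bigquery_table_statistics(schema: &Schema, cached_row_count: Option<usize>) -> Statistics {
    // Start from "everything unknown" and fill in what we do have.
    let mut stats = Statistics::new_unknown(schema);
    if let Some(rows) = cached_row_count {
        stats.num_rows = Precision::Inexact(rows);
    }
    stats
}
```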

@tychoish (Contributor, Author) commented:

> I don't think this is needed.

Fair enough. Given this, the above, and the semantics of the BQ integration, is the original ticket actionable?

> I think the best solution here would be to see if we can report statistics for the purposes of join ordering.

Do we do that for any other datasource?

@scsmithr (Member) commented:

> Fair enough. Given this, the above, and the semantics of the BQ integration, is the original ticket actionable?

I think we should include the limit in the query we send to BQ if it's supplied, even though we know it's never actually passed down with how things currently are.

I think a second step would be to actually check the filters provided so that we can return Exact or Inexact as necessary. We can go off of the expressions used when building the query string for this. If the predicate only contains those, then it's Exact. Otherwise just return Inexact.
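
A rough sketch of that check, assuming DataFusion's supports_filters_pushdown API (&self omitted; can_push_to_bigquery_sql is a hypothetical stand-in for whatever logic builds the query string, not code from this PR):

```rust
// Hedged sketch, not the GlareDB implementation: report Exact only for
// filters we know end up in the generated BigQuery SQL, and Inexact
// otherwise so DataFusion re-applies the predicate after the scan.
use datafusion::error::Result;
use datafusion::logical_expr::{Expr, Operator, TableProviderFilterPushDown};

fn supports_filters_pushdown(
    filters: &[&Expr],
) -> Result<Vec<TableProviderFilterPushDown>> {
    Ok(filters
        .iter()
        .map(|f| {
            if can_push_to_bigquery_sql(f) {
                TableProviderFilterPushDown::Exact
            } else {
                TableProviderFilterPushDown::Inexact
            }
        })
        .collect())
}

// Hypothetical helper: treat only simple `column <op> literal` comparisons
// as translatable into the BigQuery query string.
fn can_push_to_bigquery_sql(expr: &Expr) -> bool {
    match expr {
        Expr::BinaryExpr(b) => {
            matches!(
                b.op,
                Operator::Eq
                    | Operator::NotEq
                    | Operator::Lt
                    | Operator::LtEq
                    | Operator::Gt
                    | Operator::GtEq
            ) && matches!(
                (b.left.as_ref(), b.right.as_ref()),
                (Expr::Column(..), Expr::Literal(..))
            )
        }
        _ => false,
    }
}
```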

> Do we do that for any other datasource?

No, but we should at some point. We'll want to look at what a statistics cache might look like for data sources. I think BigQuery has a 10 MB minimum of processed data per query, so querying the information schema every time wouldn't be great.

@tychoish (Contributor, Author) commented:

I updated this to just include the limit in the query string.
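
A minimal sketch of that final approach, assuming the query is assembled as a SQL string before being sent to BigQuery (with_limit is a hypothetical helper, not the PR's actual code):

```rust
// Hedged sketch: append the limit to the generated BigQuery query string
// when one is supplied, otherwise leave the query untouched.
fn with_limit(query: String, limit: Option<usize>) -> String {
    match limit {
        Some(n) => format!("{query} LIMIT {n}"),
        None => query,
    }
}
```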

@tychoish merged commit f1b558a into main on Mar 18, 2024
25 checks passed
@tychoish deleted the tycho/bigquer-limit-respect branch on March 18, 2024 at 15:15
Successfully merging this pull request may close these issues.

Limit bigquery records returned