fix: respect bigquery limit #2754
Conversation
```diff
@@ -342,7 +349,7 @@ impl TableProvider for BigQueryTableProvider {
         };
         if let Some(stream) = stream_opt {
             match send.send(stream).await {
-                Ok(_) => {}
+                Ok(_) => count += 1,
```
This is batches, not records; will keep looking.

(To be clear, this doesn't push the limit down to BigQuery. But the assumption is that for big results, the client's consumption of result batches from BQ will put some back pressure on the service.)
I don't think we should need to write our own limit execs. If we are unable to push down the limit, we should use DataFusion's GlobalLimitExec or LocalLimitExec instead. It does the same thing, but it'll be less opaque within the plan, and it has native support.
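A rough sketch of what that could look like (a hypothetical wrapper, not the provider's actual code; the `GlobalLimitExec::new(input, skip, fetch)` signature shown is from recent DataFusion versions and differs in older ones):

```rust
use std::sync::Arc;

use datafusion::physical_plan::limit::GlobalLimitExec;
use datafusion::physical_plan::ExecutionPlan;

/// Wrap the provider's plan in DataFusion's built-in limit exec so the
/// limit stays visible in the physical plan instead of being hidden in
/// the provider's stream handling.
fn wrap_with_limit(
    plan: Arc<dyn ExecutionPlan>,
    limit: Option<usize>,
) -> Arc<dyn ExecutionPlan> {
    match limit {
        // skip = 0, fetch = Some(n): return at most `n` rows overall.
        Some(n) => Arc::new(GlobalLimitExec::new(plan, 0, Some(n))),
        None => plan,
    }
}
```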
I'd prefer to see us rewrite the plan to wrap the execution node in a limit node instead. This makes the plan more reflective of the actual operations (we can't push down limits for BigQuery).
I don't think this is needed.
There is the issue of joins, though, which may try to fetch the entire table in some cases. I think the best solution here would be to see if we can report statistics for the purposes of join ordering.
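For the statistics piece, something like this sketch is what I have in mind (`cached_row_count` is a hypothetical field, and the `Statistics` shape shown matches older DataFusion releases; newer ones wrap each value in a `Precision`):

```rust
use datafusion::physical_plan::Statistics;

/// Hypothetical holder for a row count fetched once, e.g. from
/// BigQuery's INFORMATION_SCHEMA or the tables API, rather than on
/// every query.
struct BigQueryTableStats {
    cached_row_count: Option<usize>,
}

impl BigQueryTableStats {
    /// Report whatever we know so the optimizer can order joins.
    fn statistics(&self) -> Statistics {
        Statistics {
            num_rows: self.cached_row_count,
            total_byte_size: None,
            column_statistics: None,
            // Cached counts may be stale, so never claim exactness.
            is_exact: false,
        }
    }
}
```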
Fair enough. Given this, the above, and the semantics of the BQ integration, is the original ticket actionable?
Do we do that for any other datasource?
I think we should include the limit in the query we send to BQ if supplied, even though we know it's never actually pushed down with how things currently are. A second step would be to actually check the filters provided so that we can return Exact or Inexact as necessary. We can go off of the expressions used when building the query string: if the predicate only contains those, then it's Exact; otherwise just return Inexact.
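A hedged sketch of that Exact/Inexact idea (this assumes the plural `supports_filters_pushdown` API from newer DataFusion versions; `expr_supported_in_sql` is a hypothetical helper standing in for the query-string builder's expression check):

```rust
use datafusion::error::Result;
use datafusion::logical_expr::{Expr, TableProviderFilterPushDown};

/// For each filter, report Exact only when the whole expression can be
/// rendered into the BigQuery query string; otherwise report Inexact so
/// DataFusion re-applies the filter after the scan.
fn supports_filters_pushdown(
    filters: &[&Expr],
) -> Result<Vec<TableProviderFilterPushDown>> {
    Ok(filters
        .iter()
        .map(|f| {
            if expr_supported_in_sql(f) {
                TableProviderFilterPushDown::Exact
            } else {
                TableProviderFilterPushDown::Inexact
            }
        })
        .collect())
}

/// Hypothetical: true only for expressions the provider knows how to
/// translate to BigQuery SQL.
fn expr_supported_in_sql(_expr: &Expr) -> bool {
    false
}
```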
No, but we should at some point. We'll want to look at what a statistics cache might look like for data sources. I think BigQuery has a 10MB processed-per-query minimum, so querying the information schema every time wouldn't be great.
I updated this to just include the limit in the query string.
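Roughly along these lines (a minimal sketch; `base_query` stands in for however the provider already renders the SELECT):

```rust
/// Append a LIMIT clause to the generated BigQuery SQL when the scan
/// was given a limit, so BQ can stop producing rows early.
fn apply_limit(base_query: String, limit: Option<usize>) -> String {
    match limit {
        Some(n) => format!("{base_query} LIMIT {n}"),
        None => base_query,
    }
}
```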
Closes #716
I found this as I was looking through the issues earlier today.

One concern with this (and limit pushdowns in general) is that we could under-report data in these cases: imagine we only partially push down a predicate but we completely push down the limit. If we get `n` rows back from the data source but then filter them down further, we've returned fewer than the target number of rows, since rows that would have passed the full predicate later in the scan were never fetched. Right?
This isn't new, and I'm not sure that's a reason not to do this. Also, I could imagine DataFusion passing in limits different from what the user specified.