Add project_id to Bigquery::Table#extract and #extract_job #2692
Conversation
Would this argument be added for other kinds of jobs? I think this issue applies to more than just extract jobs. I also like the original idea of adding a project as an optional parameter when constructing a dataset (though it will not solve this issue alone). In the other languages, the "client" object is constructed with the project you use to run jobs. It looks weird that in Ruby you have to construct a client with the project that owns the data instead.
Yes, I believe it can also be added to the other job types.
This library has a hierarchical, OOP-style organization with `Project`, `Dataset`, and `Table` classes. In keeping with the OOP style, operations with the scope of one or two tables are located on the `Table` class. I tried the "optional parameter when constructing a dataset" solution, but I found that it leads to errors or surprising behavior for a number of methods in the `Dataset` and `Table` classes.
OK, that makes sense. @tswast, do you think it would make sense to specify a project for tabledata.list as well?
Yes, this is because if we provide a way to override the project for tabledata.list, we can avoid having to recreate the client (and repeat doing authorization) to list data in a different project from the one we bill queries to. This is the reason I commented on your sample (GoogleCloudPlatform/ruby-docs-samples#370 (comment)) that it's unfortunate that we can't set the project for tabledata.list.
Should we add a […]?
LGTM. This makes sense to do for all job types where the "job project" can differ from the "data project": copy & extract for sure. Possibly query, too, since the default dataset could be in a different project than the project you want to charge for the query.
I'm late to this discussion, and I am going to take a slightly different approach. The core problem here is that the Job is created on whatever project the client object was constructed with. The original code example is:

```ruby
require "google/cloud/bigquery"

bucket_name = "my-bucket"
bigquery = Google::Cloud::Bigquery.new project: "bigquery-public-data"
dataset = bigquery.dataset "samples"
table = dataset.table "shakespeare"

destination_uri = "gs://#{bucket_name}/shakespeare.csv"
extract_job = table.extract_job(destination_uri) do |updater|
  # Location must match that of the source table.
  updater.location = "US"
end
extract_job.wait_until_done! # Waits for the job to complete
puts "Exported #{table.id} to #{destination_uri}"
```

The problem here is that the Job is being created on the bigquery-public-data project. What if we could do something like this instead?

```ruby
require "google/cloud/bigquery"

bucket_name = "my-bucket"
bigquery = Google::Cloud::Bigquery.new project: "my-project-id"
shakespeare_table_id = "bigquery-public-data:samples.shakespeare"

destination_uri = "gs://#{bucket_name}/shakespeare.csv"
extract_job = bigquery.extract_job(
  from: shakespeare_table_id,
  to: destination_uri
) do |updater|
  # Location must match that of the source table.
  updater.location = "US"
end
extract_job.wait_until_done! # Waits for the job to complete
puts "Exported #{shakespeare_table_id} to #{destination_uri}"
```

In this example it seems more obvious to me that the Job is going to be built on the my-project-id project. The […]
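The key design point of this proposal can be sketched without the service calls. The names below (`job_project:`, `table_full_id`, and the method body itself) are hypothetical, used only to illustrate the shape of the dispatch; this is not the library's actual implementation:

```ruby
# Sketch of a project-level extract_job: the job is created on the
# client's own project, while `from` names the source table, which may
# live in a different project. Accepts either a table-like object or a
# fully-qualified table ID string.
def extract_job(from:, to:, job_project:)
  source = from.respond_to?(:table_full_id) ? from.table_full_id : from.to_s
  # A real implementation would submit a jobs.insert request under
  # job_project; here we only return where the job would run.
  { job_project: job_project, source: source, destination: to }
end
```

The point is that the billing project comes from the client, not from whichever project happens to own the source table.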
I was also thinking that versions of `extract` and `extract_job` could be added to `Project`.
@blowmage I like your suggestion. It avoids the problematic behavior of having to create a client associated with a project that you can't actually do many operations with. Some thoughts:
To parse the […]
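Parsing a table ID string like the `"bigquery-public-data:samples.shakespeare"` used above could look roughly like this. A minimal sketch with a hypothetical method name, not the library's actual parsing code:

```ruby
# Split a BigQuery table ID into its project, dataset, and table parts.
# Supports the legacy "project:dataset.table" form used in the example
# above, and allows the project (or project and dataset) to be omitted,
# falling back to defaults supplied by the caller.
def parse_table_id(str, default_project: nil, default_dataset: nil)
  if str.include?(":")
    # Legacy form: project:dataset.table
    project, rest = str.split(":", 2)
    dataset, table = rest.split(".", 2)
  else
    parts = str.split(".")
    case parts.length
    when 3 then project, dataset, table = parts
    when 2 then project, dataset, table = default_project, *parts
    when 1 then project, dataset, table = default_project, default_dataset, parts[0]
    else raise ArgumentError, "unrecognized table ID: #{str.inspect}"
    end
  end
  { project: project, dataset: dataset, table: table }
end
```

Allowing the defaults to come from the client keeps the common same-project case terse while still supporting cross-project references.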
I also think it would be wise to allow both Dataset and Table objects to be created on an external project, similar to how this is allowed in both Pub/Sub and Storage. To revisit the previous code example, you would be able to create a Dataset object by specifying its project:

```ruby
require "google/cloud/bigquery"

bucket_name = "my-bucket"
bigquery = Google::Cloud::Bigquery.new project: "my-project-id"
samples_dataset = bigquery.dataset "samples",
                                   project: "bigquery-public-data",
                                   skip_lookup: true
shakespeare_table = samples_dataset.table "shakespeare",
                                          skip_lookup: true

destination_uri = "gs://#{bucket_name}/shakespeare.csv"
extract_job = bigquery.extract_job(
  from: shakespeare_table,
  to: destination_uri
) do |updater|
  # Location must match that of the source table.
  updater.location = "US"
end
extract_job.wait_until_done! # Waits for the job to complete
puts "Exported #{shakespeare_table.id} to #{destination_uri}"
```

Again, the […]
As I wrote above, I tried the "optional parameter when constructing a dataset" solution, but I found that it leads to errors or surprising behavior for a number of methods in the Dataset and Table classes.
Yes, that is my thinking. The […] We also already have code to extract a table from a string, which allows omitting the project and dataset. It is currently being used in methods like […].
As long as you're OK with runtime errors on the operations those objects don't support.
Correct. I don't know what those issues are yet, but as I stated earlier, adding new […]
Raising when operations are not allowed on these resource objects is consistent with how Pub/Sub and Storage behave. I think that would be a fair tradeoff, but again, it should probably be done in a separate PR.
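The Pub/Sub and Storage pattern being referenced here, where a `skip_lookup` resource is a reference-only object that raises on operations needing server-side metadata, might look roughly like this. A sketch with hypothetical class and method names, not the library's code:

```ruby
# A reference-only dataset, in the style of skip_lookup resources in the
# google-cloud Pub/Sub and Storage libraries: operations that only need
# the resource's identity work, while operations that need metadata from
# the service raise until the resource is actually loaded.
class DatasetReference
  attr_reader :project_id, :dataset_id

  def initialize(project_id, dataset_id)
    @project_id = project_id
    @dataset_id = dataset_id
  end

  def reference?
    true
  end

  # Identity-based operations are fine on a bare reference.
  def dataset_full_id
    "#{project_id}:#{dataset_id}"
  end

  # Metadata-based operations fail loudly instead of returning nil.
  def description
    raise "dataset not loaded; reload it from the service first" if reference?
  end
end
```

Failing loudly here is the tradeoff under discussion: a reference built with `skip_lookup` can be handed to job methods, but anything requiring metadata raises rather than silently misbehaving.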
I don't think that is necessary if we can pass […].
I am OK with adding versions of `extract` and `extract_job` to `Project`.
This is less necessary if these methods are added to `Project`. We will use the `Project` methods in the examples on cloud.google.com.
OK, sounds good. I'll close this PR and add versions of `extract` and `extract_job` to `Project`.
[fixes #2609]
/cc @alixhami