Add project_id to Bigquery::Table#extract and #extract_job #2692

Closed

Conversation

quartzmo
Member

[fixes #2609]

/cc @alixhami

@quartzmo quartzmo added the api: bigquery Issues related to the BigQuery API. label Nov 27, 2018
@quartzmo quartzmo self-assigned this Nov 27, 2018
@googlebot googlebot added the cla: yes This human has signed the Contributor License Agreement. label Nov 27, 2018
@alixhami
Contributor

Would this argument be added for other kinds of jobs? I think this issue applies to more than just extract jobs.

I also like the original idea of adding a project as an optional parameter when constructing a dataset (though it will not solve this issue alone). In the other languages, the "client" object is constructed with the project you use to run jobs. It looks weird that in Ruby you have to construct a client with the bigquery-public-data project for certain operations. @tswast and I noticed this when I wrote a Ruby version (still in PR) of our Python sample for listing table rows. I don't think a client should ever need to be constructed with the bigquery-public-data project.

@quartzmo
Member Author

Would this argument be added for other kinds of jobs? I think this issue applies to more than just extract jobs.

Yes, I believe it can also be added to Table#copy and Table#copy_job.

@quartzmo
Member Author

In the other languages, the "client" object is constructed with the project you use to run jobs... I don't think a client should ever need to be constructed with the bigquery-public-data project.

This library has a hierarchical, OOP-style organization with Project (not "Client") at the root of the hierarchy. The creation syntax that makes it appear that a BigQuery client is being created (bigquery = Google::Cloud::Bigquery.new) was bolted on later for consistency with other platforms, but this hierarchy remains.

In keeping with the OOP style, operations with the scope of one or two tables are located on the Table class. As @tswast mentioned, tabledata.list is a free operation, which is fortunate. For the extract and copy operations, a different project may be needed.

I tried the "optional parameter when constructing a dataset" solution, but I found that it leads to errors or surprising behavior for a number of methods in the Dataset and Table classes.

@alixhami
Contributor

OK, that makes sense. @tswast do you think it would make sense to specify a project for tabledata.list for consistency, even though it's a free operation?

@tswast
Contributor

tswast commented Nov 27, 2018

do you think it would make sense to specify a project for tabledata.list for consistency even though it's a free operation?

Yes. If we provide a way to override the project for tabledata.list, we can avoid having to recreate the client (and repeat authorization) in order to list data in a project other than the one we bill queries to. This is why I commented on your sample (GoogleCloudPlatform/ruby-docs-samples#370 (comment)) that it's unfortunate that we can't set the project for tabledata.list.

@tswast
Contributor

tswast commented Nov 27, 2018

Should we add a Project.project method to switch projects but keep the same credentials?
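For illustration, a minimal sketch of that idea, assuming a hypothetical method (neither the name nor the signature exists in the library at this point): the returned object would target a different project while reusing the credentials already loaded.

require "google/cloud/bigquery"

bigquery = Google::Cloud::Bigquery.new project: "my-project-id"

# Hypothetical: get a handle on another project, keeping the same credentials,
# instead of constructing and authorizing a second client.
public_data = bigquery.project "bigquery-public-data"

table = public_data.dataset("samples").table("shakespeare")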

@tswast
Contributor

@tswast left a comment

LGTM. This makes sense to do for all job types where the "job project" can differ from the "data project": copy & extract for sure. Possibly query, too, since the default dataset could be in a different project than the project you want to charge for the query.

@blowmage
Contributor

blowmage commented Nov 27, 2018

I'm late to this discussion, and I am going to take a slightly different approach. The core problem here is that the Job is created on whatever resource the #extract or #extract_job method is called on. Specifying project_id would solve this, but I think that seems a bit counter-intuitive and may not solve the core problem. Please allow me to illustrate.

The original code example is:

require "google/cloud/bigquery"

bucket_name = "my-bucket"
bigquery = Google::Cloud::Bigquery.new project: "bigquery-public-data"
dataset = bigquery.dataset "samples"
table = dataset.table "shakespeare"
destination_uri = "gs://#{bucket_name}/shakespeare.csv"

extract_job = table.extract_job(destination_uri) do |updater|
  # Location must match that of the source table.
  updater.location = "US"
end
extract_job.wait_until_done!  # Waits for the job to complete

puts "Exported #{table.id} to #{destination_uri}"

The problem here is that the Job is being created on the bigquery-public-data project. What if we could do something like this instead?

require "google/cloud/bigquery"

bucket_name = "my-bucket"
bigquery = Google::Cloud::Bigquery.new project: "my-project-id"

shakespeare_table_id = "bigquery-public-data:samples.shakespeare"
destination_uri = "gs://#{bucket_name}/shakespeare.csv"

extract_job = bigquery.extract_job(
  from: shakespeare_table_id,
  to: destination_uri
) do |updater|
  # Location must match that of the source table.
  updater.location = "US"
end
extract_job.wait_until_done!  # Waits for the job to complete

puts "Exported #{shakespeare_table_id} to #{destination_uri}"

In this example it seems more obvious to me that the Job is going to be built on the my-project-id project. The from argument can be any of the ways we represent a table. Thoughts?

@quartzmo
Member Author

I was also thinking that versions of #extract_job and #extract that accept a source table parameter could be added to Project, just as we have Project#query and Project#query_job. I also think this could be done in addition to adding project_id to the job operations in Table.
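For concreteness, hedged sketches of both shapes (neither is a final API; the method signatures and the project_id option are assumptions for illustration):

require "google/cloud/bigquery"

bigquery = Google::Cloud::Bigquery.new project: "my-project-id"
destination_uri = "gs://my-bucket/shakespeare.csv"

# Sketch 1 (assumed signature): a Project-level method that accepts the source
# table, analogous to Project#query_job, so the job is created in my-project-id.
extract_job = bigquery.extract_job "bigquery-public-data.samples.shakespeare",
                                   destination_uri

# Sketch 2 (assumed option, the approach in this PR): keep the method on Table,
# but let the caller name the project that should own and be billed for the job.
public_data = Google::Cloud::Bigquery.new project: "bigquery-public-data"
table = public_data.dataset("samples").table("shakespeare")
extract_job = table.extract_job destination_uri, project_id: "my-project-id"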

@tswast
Contributor

tswast commented Nov 27, 2018

@blowmage I like your suggestion. It avoids the problematic behavior of having to create a client associated with a project that you can't actually do many operations with.

Some thoughts:

to: will need to optionally accept a list of URIs, since BigQuery can split the output across several files when extracting a table.

To parse the from: string, you'll probably want similar logic to Python's TableReference.from_string which allows omitting the project if it's the same as the client's default project.

https://github.com/googleapis/google-cloud-python/blob/a029c82312431c9b162dace76082bcafe110ef86/bigquery/google/cloud/bigquery/table.py#L188-L244
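A rough Ruby analogue of that parsing, as a hypothetical helper (not the library's actual implementation), might look like:

# Hypothetical helper: parse "project.dataset.table" or "dataset.table",
# falling back to the client's default project when the project segment is
# omitted. (The legacy "project:dataset.table" form is not handled here.)
def table_ref_from_string str, default_project:
  parts = str.split "."
  case parts.length
  when 3
    { project_id: parts[0], dataset_id: parts[1], table_id: parts[2] }
  when 2
    { project_id: default_project, dataset_id: parts[0], table_id: parts[1] }
  else
    raise ArgumentError, "expected [project.]dataset.table, got #{str.inspect}"
  end
end

table_ref_from_string "bigquery-public-data.samples.shakespeare",
                      default_project: "my-project-id"
# => {:project_id=>"bigquery-public-data", :dataset_id=>"samples", :table_id=>"shakespeare"}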

@blowmage
Contributor

I also think it would be wise to allow both Dataset and Table objects to be created on an external project, similar to how this is allowed in both Pub/Sub and Storage. To revisit the previous code example, you would be able to create a Dataset object by specifying its project and pass #extract_job a Table object instead of a table ID string:

require "google/cloud/bigquery"

bucket_name = "my-bucket"
bigquery = Google::Cloud::Bigquery.new project: "my-project-id"

samples_dataset = bigquery.dataset "samples",
                                   project: "bigquery-public-data",
                                   skip_lookup: true
shakespeare_table = samples_dataset.table "shakespeare",
                                          skip_lookup: true

destination_uri = "gs://#{bucket_name}/shakespeare.csv"

extract_job = bigquery.extract_job(
  from: shakespeare_table,
  to: destination_uri
) do |updater|
  # Location must match that of the source table.
  updater.location = "US"
end
extract_job.wait_until_done!  # Waits for the job to complete

puts "Exported #{shakespeare_table.id} to #{destination_uri}"

Again, the #extract_job method is called on an object whose project is known. With this, users could still call #extract_job on the table object if they wanted to create the Job resource on that table's project, just as they would do today.

@quartzmo
Member Author

As I wrote above, I tried the "optional parameter when constructing a dataset" solution, but I found that it leads to errors or surprising behavior for a number of methods in the Dataset and Table classes.

@blowmage
Contributor

Some thoughts:

to: will need to optionally accept a list of URIs, since BigQuery can split the output across several files when extracting a table.

To parse the from: string, you'll probably want similar logic to Python's TableReference.from_string which allows omitting the project if it's the same as the client's default project.

Yes, that is my thinking. The to argument would be treated the same as the extract_url argument on the existing Table#extract_job, which allows the value to be a single string or an array of strings.
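For example, a quick sketch of the existing Table#extract_job flexibility described above (bucket and file names are placeholders):

require "google/cloud/bigquery"

bigquery = Google::Cloud::Bigquery.new project: "bigquery-public-data"
table = bigquery.dataset("samples").table("shakespeare")

# A single destination URI string...
table.extract_job "gs://my-bucket/shakespeare.csv"

# ...or an array of URIs; a wildcard lets BigQuery shard the export across
# multiple files when the table is too large for a single file.
table.extract_job ["gs://my-bucket/shakespeare-*.csv"]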

We also already have code to parse a table reference from a string, allowing the project and dataset to be omitted. It is currently being used in methods like Dataset::Access#add_reader_view.

@quartzmo
Member Author

As long as you're OK with runtime errors on Dataset and Table operations that are not allowed on the external project, I am happy enough to add project_id options to #dataset and #table.

@blowmage
Contributor

As I wrote above, I tried the "optional parameter when constructing a dataset" solution, but I found that it leads to errors or surprising behavior for a number of methods in the Dataset and Table classes.

Correct. I don't know what those issues are yet, but as I stated earlier, adding new extract and extract_job methods that take the table as an argument (instead of the table object owning the method call) should satisfy this request without the additional step of allowing project to be provided when creating a Dataset object. I think that would make the code example read better, but it can be implemented separately.

@blowmage
Contributor

As long as you're OK with runtime errors on Dataset and Table operations that are not allowed on the external project, I am happy enough to add project_id options to #dataset and #table.

Raising when operations are not allowed on these resource objects is consistent with how Pub/Sub and Storage behave. I think that would be a fair tradeoff, but again, it should probably be done in a separate PR.

@blowmage
Contributor

Should we add a Project.project method to switch projects but keep the same credentials?

I don't think that is necessary if we can pass project_id when creating a Dataset object.

@quartzmo
Member Author

quartzmo commented Nov 27, 2018

I am OK with adding versions of #extract, #extract_job, #copy and #copy_job to Project. Do we also want to continue to add project_id to those methods in Table?

@tswast
Contributor

tswast commented Nov 27, 2018

Do we also want to continue to add project_id to those methods in Table?

This is less necessary if these methods are added to Project. We will use the Project methods in the examples on cloud.google.com.

@quartzmo
Member Author

OK, sounds good. I'll close this PR and add versions of #extract, #extract_job, #copy and #copy_job to Project in a new PR.

@quartzmo quartzmo closed this Nov 27, 2018