Add project_id to Bigquery::Table#extract and #extract_job #2692

Closed

Conversation

quartzmo
Member

[fixes #2609]

/cc @alixhami

@quartzmo quartzmo added the api: bigquery Issues related to the BigQuery API. label Nov 27, 2018
@quartzmo quartzmo self-assigned this Nov 27, 2018
@googlebot googlebot added the cla: yes This human has signed the Contributor License Agreement. label Nov 27, 2018
@alixhami
Contributor

Would this argument be added for other kinds of jobs? I think this issue applies to more than just extract jobs.

I also like the original idea of adding a project as an optional parameter when constructing a dataset (though it will not solve this issue alone). In the other languages, the "client" object is constructed with the project you use to run jobs. It looks weird that in Ruby you have to construct a client with the bigquery-public-data project for certain operations. @tswast and I noticed this when I wrote a Ruby version (still in PR) of our Python sample for listing table rows. I don't think a client should ever need to be constructed with the bigquery-public-data project.

@quartzmo
Member Author

Would this argument be added for other kinds of jobs? I think this issue applies to more than just extract jobs.

Yes, I believe it can also be added to Table#copy and Table#copy_job.

@quartzmo
Member Author

In the other languages, the "client" object is constructed with the project you use to run jobs... I don't think a client should ever need to be constructed with the bigquery-public-data project.

This library has a hierarchical, OOP-style organization with Project (not "Client") at the root of the hierarchy. The creation syntax that makes it appear that a BigQuery client is being created (bigquery = Google::Cloud::Bigquery.new) was bolted on later for consistency with other platforms, but this hierarchy remains.

In keeping with the OOP style, operations with the scope of one or two tables are located on the Table class. As @tswast mentioned, tabledata.list is a free operation, which is fortunate. For the extract and copy operations, a different project may be needed.

I tried the "optional parameter when constructing a dataset" solution, but I found that it leads to errors or surprising behavior for a number of methods in the Dataset and Table classes.

@alixhami
Contributor

OK, that makes sense. @tswast do you think it would make sense to specify a project for tabledata.list for consistency, even though it's a free operation?

@tswast
Contributor

tswast commented Nov 27, 2018

do you think it would make sense to specify a project for tabledata.list for consistency even though it's a free operation?

Yes. If we provide a way to override the project for tabledata.list, we can avoid having to recreate the client (and repeat authorization) in order to list data in a project other than the one we bill queries to. This is why I commented on your sample (GoogleCloudPlatform/ruby-docs-samples#370 (comment)) that it's unfortunate that we can't set the project for tabledata.list.

@tswast
Contributor

tswast commented Nov 27, 2018

Should we add a Project.project method to switch projects but keep the same credentials?
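For illustration, a minimal sketch of that idea, assuming a hypothetical method (neither the name nor the signature exists in the library at this point): the returned object would target a different project while reusing the credentials already loaded.

require "google/cloud/bigquery"

bigquery = Google::Cloud::Bigquery.new project: "my-project-id"

# Hypothetical: get a handle on another project, keeping the same credentials,
# instead of constructing and authorizing a second client.
public_data = bigquery.project "bigquery-public-data"

table = public_data.dataset("samples").table("shakespeare")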

@tswast
Contributor

@tswast left a comment

LGTM. This makes sense to do for all job types where the "job project" can differ from the "data project": copy & extract for sure. Possibly query, too, since the default dataset could be in a different project than the project you want to charge for the query.

@blowmage
Contributor

blowmage commented Nov 27, 2018

I'm late to this discussion, and I am going to take a slightly different approach. The core problem here is that the Job is created on whatever resource the #extract or #extract_job method is called on. Specifying project_id would solve this, but I think that seems a bit counter-intuitive and may not solve the core problem. Please allow me to illustrate.

The original code example is:

require "google/cloud/bigquery"

bucket_name = "my-bucket"
bigquery = Google::Cloud::Bigquery.new project: "bigquery-public-data"
dataset = bigquery.dataset "samples"
table = dataset.table "shakespeare"
destination_uri = "gs://#{bucket_name}/shakespeare.csv"

extract_job = table.extract_job(destination_uri) do |updater|
  # Location must match that of the source table.
  updater.location = "US"
end
extract_job.wait_until_done!  # Waits for the job to complete

puts "Exported #{table.id} to #{destination_uri}"

The problem here is that the Job is being created on the bigquery-public-data project. What if we could do something like this instead?

require "google/cloud/bigquery"

bucket_name = "my-bucket"
bigquery = Google::Cloud::Bigquery.new project: "my-project-id"

shakespeare_table_id = "bigquery-public-data:samples.shakespeare"
destination_uri = "gs://#{bucket_name}/shakespeare.csv"

extract_job = bigquery.extract_job(
  from: shakespeare_table_id,
  to: destination_uri
) do |updater|
  # Location must match that of the source table.
  updater.location = "US"
end
extract_job.wait_until_done!  # Waits for the job to complete

puts "Exported #{shakespeare_table_id} to #{destination_uri}"

In this example it seems more obvious to me that the Job is going to be built on the my-project-id project. The from argument can be any of the ways we represent a table. Thoughts?

@quartzmo
Member Author

I was also thinking that versions of #extract_job and #extract that accept a source table parameter could be added to Project, just as we have Project#query and Project#query_job. I also think this could be done in addition to adding project_id to the job operations in Table.
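For concreteness, hedged sketches of both shapes (neither is a final API; the method signatures and the project_id option are assumptions for illustration):

require "google/cloud/bigquery"

bigquery = Google::Cloud::Bigquery.new project: "my-project-id"
destination_uri = "gs://my-bucket/shakespeare.csv"

# Sketch 1 (assumed signature): a Project-level method that accepts the source
# table, analogous to Project#query_job, so the job is created in my-project-id.
extract_job = bigquery.extract_job "bigquery-public-data.samples.shakespeare",
                                   destination_uri

# Sketch 2 (assumed option, the approach in this PR): keep the method on Table,
# but let the caller name the project that should own and be billed for the job.
public_data = Google::Cloud::Bigquery.new project: "bigquery-public-data"
table = public_data.dataset("samples").table("shakespeare")
extract_job = table.extract_job destination_uri, project_id: "my-project-id"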

@tswast
Contributor

tswast commented Nov 27, 2018

@blowmage I like your suggestion. It avoids the problematic behavior of having to create a client associated with a project that you can't actually do many operations with.

Some thoughts:

to: will need to optionally accept a list of URIs, since BigQuery can split the output across several files when extracting a table.

To parse the from: string, you'll probably want similar logic to Python's TableReference.from_string which allows omitting the project if it's the same as the client's default project.

https://github.com/googleapis/google-cloud-python/blob/a029c82312431c9b162dace76082bcafe110ef86/bigquery/google/cloud/bigquery/table.py#L188-L244
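A rough Ruby analogue of that parsing, as a hypothetical helper (not the library's actual implementation), might look like:

# Hypothetical helper: parse "project.dataset.table" or "dataset.table",
# falling back to the client's default project when the project segment is
# omitted. (The legacy "project:dataset.table" form is not handled here.)
def table_ref_from_string str, default_project:
  parts = str.split "."
  case parts.length
  when 3
    { project_id: parts[0], dataset_id: parts[1], table_id: parts[2] }
  when 2
    { project_id: default_project, dataset_id: parts[0], table_id: parts[1] }
  else
    raise ArgumentError, "expected [project.]dataset.table, got #{str.inspect}"
  end
end

table_ref_from_string "bigquery-public-data.samples.shakespeare",
                      default_project: "my-project-id"
# => {:project_id=>"bigquery-public-data", :dataset_id=>"samples", :table_id=>"shakespeare"}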

@blowmage
Contributor

I also think it would be wise to allow both Dataset and Table objects to be created on an external project, similar to how this is allowed in both Pub/Sub and Storage. To revisit the previous code example, you would be able to create a Dataset object by specifying its project and pass #extract_job a Table object instead of a table ID string:

require "google/cloud/bigquery"

bucket_name = "my-bucket"
bigquery = Google::Cloud::Bigquery.new project: "my-project-id"

samples_dataset = bigquery.dataset "samples",
                                   project: "bigquery-public-data",
                                   skip_lookup: true
shakespeare_table = samples_dataset.table "shakespeare",
                                          skip_lookup: true

destination_uri = "gs://#{bucket_name}/shakespeare.csv"

extract_job = bigquery.extract_job(
  from: shakespeare_table,
  to: destination_uri
) do |updater|
  # Location must match that of the source table.
  updater.location = "US"
end
extract_job.wait_until_done!  # Waits for the job to complete

puts "Exported #{shakespeare_table.id} to #{destination_uri}"

Again, the #extract_job method is called on an object whose project is known. With this, users could still call #extract_job on the table object if they wanted to create the Job resource on that table's project, just as they would do today.

@quartzmo
Member Author

As I wrote above, I tried the "optional parameter when constructing a dataset" solution, but I found that it leads to errors or surprising behavior for a number of methods in the Dataset and Table classes.

@blowmage
Contributor

Some thoughts:

to: will need to optionally accept a list of URIs, since BigQuery can split the output across several files when extracting a table.

To parse the from: string, you'll probably want similar logic to Python's TableReference.from_string which allows omitting the project if it's the same as the client's default project.

Yes, that is my thinking. The to argument would be treated the same as the extract_url argument on the existing Table#extract_job, which allows the value to be a single string or an array of strings.
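For example, a quick sketch of the existing Table#extract_job flexibility described above (bucket and file names are placeholders):

require "google/cloud/bigquery"

bigquery = Google::Cloud::Bigquery.new project: "bigquery-public-data"
table = bigquery.dataset("samples").table("shakespeare")

# A single destination URI string...
table.extract_job "gs://my-bucket/shakespeare.csv"

# ...or an array of URIs; a wildcard lets BigQuery shard the export across
# multiple files when the table is too large for a single file.
table.extract_job ["gs://my-bucket/shakespeare-*.csv"]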

We also already have code to parse a table reference from a string, allowing the project and dataset to be omitted. It is currently being used in methods like Dataset::Access#add_reader_view.

@quartzmo
Member Author

As long as you're OK with runtime errors on Dataset and Table operations that are not allowed on the external project, I am happy enough to add project_id options to #dataset and #table.

@blowmage
Contributor

As I wrote above, I tried the "optional parameter when constructing a dataset" solution, but I found that it leads to errors or surprising behavior for a number of methods in the Dataset and Table classes.

Correct. I don't know what those issues are yet, but as I stated earlier, adding new extract and extract_job methods that take the table as an argument (instead of the table object owning the method call) should satisfy this request without the additional step of allowing project to be provided when creating a Dataset object. I think that would make the code example read better, but it can be implemented separately.

@blowmage
Contributor

As long as you're OK with runtime errors on Dataset and Table operations that are not allowed on the external project, I am happy enough to add project_id options to #dataset and #table.

Raising when operations are not allowed on these resource objects is consistent with how Pub/Sub and Storage behave. I think that would be a fair tradeoff, but again, it should probably be done in a separate PR.

@blowmage
Contributor

Should we add a Project.project method to switch projects but keep the same credentials?

I don't think that is necessary if we can pass project_id when creating a Dataset object.

@quartzmo
Member Author

quartzmo commented Nov 27, 2018

I am OK with adding versions of #extract, #extract_job, #copy and #copy_job to Project. Do we also want to continue to add project_id to those methods in Table?

@tswast
Contributor

tswast commented Nov 27, 2018

Do we also want to continue to add project_id to those methods in Table?

This is less necessary if these methods are added to Project. We will use the Project methods in the examples on cloud.google.com.

@quartzmo
Member Author

OK, sounds good. I'll close this PR and add versions of #extract, #extract_job, #copy and #copy_job to Project in a new PR.

@quartzmo quartzmo closed this Nov 27, 2018