Adding upload_file adapter #121

Merged: 11 commits, Mar 10, 2022

Conversation

@pgoslatara (Contributor) commented Feb 13, 2022

resolves #102

Description

Adding an upload_file adapter that uses the load_table_from_file method. This adapter accepts multiple keyword arguments, allowing maximum customisation of the LoadJobConfig class.
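
For context, the underlying call pattern looks roughly like the sketch below. This is a simplified illustration, not the implementation in this PR; the function name, signature, and defaults are assumptions, and only the google-cloud-bigquery pieces (load_table_from_file, LoadJobConfig) come from the description above.

from google.cloud import bigquery

def upload_file_sketch(client, local_file_path, database, table_schema, table_name, **kwargs):
    # Hypothetical helper: build a LoadJobConfig and copy every supplied kwarg
    # onto it (e.g. autodetect, skip_leading_rows, write_disposition,
    # source_format); a JSON 'schema' argument would need to be parsed first.
    job_config = bigquery.LoadJobConfig()
    for option, value in kwargs.items():
        setattr(job_config, option, value)

    table_id = f"{database}.{table_schema}.{table_name}"
    with open(local_file_path, "rb") as file_obj:
        load_job = client.load_table_from_file(file_obj, table_id, job_config=job_config)
    load_job.result()  # block until the load job completes, raising on failure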

Example files:

  • ./samples/three_columns.csv:
id,name,created_at
1,john,2022-02-06 19:24:23.614 UTC
  • ./samples/id_only.csv:
id
1
  • ./samples/users.ndjson:
{"id":1,"name":"Alice"}
{"id":2,"name":"Bob"}
{"id":3,"name":"Carol"}

How to use:

  • Via command line:
dbt -d run-operation bigquery__upload_file --args "{local_file_path: './samples/three_columns.csv', database: '<GCP_PROJECT>', table_schema: 'dbt', table_name: 'three_columns', skip_leading_rows: 1, autodetect: True, write_disposition: 'WRITE_TRUNCATE'}"

dbt -d run-operation bigquery__upload_file --args "{local_file_path: './samples/id_only.csv', database: '<GCP_PROJECT>', table_schema: 'dbt', table_name: 'upload_id', skip_leading_rows: 1, write_disposition: 'WRITE_TRUNCATE', schema: '[{\"name\": \"employee_id\", \"type\": \"INTEGER\"}]'}"

dbt -d run-operation bigquery__upload_file --args "{local_file_path: './samples/users.ndjson', database: '<GCP_PROJECT>', table_schema: 'dbt', table_name: 'users', autodetect: True, write_disposition: 'WRITE_TRUNCATE', source_format: 'NEWLINE_DELIMITED_JSON'}"
  • Via a pre_hook:
{{
    config(
        pre_hook=[
            "{{ bigquery__upload_file(
                './samples/three_columns.csv',
                model['database'],
                model['schema'],
                'three_columns',
                autodetect=True,
                skip_leading_rows=1,
                write_disposition='WRITE_TRUNCATE',
                source_format='CSV'
            ) }}",
            "{{ bigquery__upload_file(
                './samples/id_only.csv',
                model['database'],
                model['schema'],
                'id_only',
                skip_leading_rows=1,
                write_disposition='WRITE_TRUNCATE',
                source_format='CSV',
                schema='[{\"name\": \"employee_id\", \"type\": \"INTEGER\"}]'
            ) }}",
            "{{ bigquery__upload_file(
                './samples/users.ndjson',
                model['database'],
                model['schema'],
                'users',
                autodetect=True,
                write_disposition='WRITE_TRUNCATE',
                source_format='NEWLINE_DELIMITED_JSON'
            ) }}"
        ]
    )
}}

Open questions from my side:

  • Currently the adapter only logs the kwargs; do we want to log anything else?
  • The most similar adapter method is load_dataframe, which does not have tests. Should an issue be opened to add tests for both load_dataframe and upload_file?

Comment:
Neither manifest.json nor run_results.json can be uploaded using this adapter, as they do not conform to the NDJSON specification. If this PR is merged, an issue can be opened to address this (possibly by altering these files before calling the upload_file adapter).
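
As a rough illustration of that follow-up idea (an assumption, not code from this PR), a dbt artifact could be reshaped into NDJSON by writing one JSON object per line, for example:

import json

def artifact_to_ndjson(src_path, dest_path, key="nodes"):
    # Hypothetical helper: flatten one top-level mapping of a dbt artifact
    # (e.g. the "nodes" entry of manifest.json) into newline-delimited JSON.
    with open(src_path) as src:
        artifact = json.load(src)
    with open(dest_path, "w") as dest:
        for unique_id, node in artifact.get(key, {}).items():
            dest.write(json.dumps({"unique_id": unique_id, **node}) + "\n")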

Checklist

  • I have signed the CLA
  • I have run this code in development and it appears to resolve the stated issue
  • This PR includes tests, or tests are not required/relevant for this PR
  • I have updated the CHANGELOG.md and added information about my change to the "dbt-bigquery next" section.

@cla-bot added the cla:yes label Feb 13, 2022
@pgoslatara (Contributor, Author) commented:

@McKnight-42 Can I get some eyes on this please? Eager to see this functionality added to dbt-bigquery (and hopefully move closer to having dbt_artifacts as an option).

@McKnight-42 self-requested a review February 28, 2022 20:43
@McKnight-42 (Contributor) commented:

Hi @pgoslatara, I'm taking a look today and will pass along some questions sometime tomorrow. Sorry for the delay.

@McKnight-42 (Contributor) commented Mar 2, 2022

Currently the adapter only logs the kwargs; do we want to log anything else?

@jtcohen6 I'm curious if you have any thoughts on this one way or the other?

@McKnight-42 (Contributor) commented:

Should an issue be opened to add tests for both load_dataframe and upload_file?

@pgoslatara good catch on the missing load_dataframe test. I will open an issue to create a test for that.

For the upload_file test, I think we should try, if we're able, to write up a test along with this PR.

@pgoslatara (Contributor, Author) commented:

For the upload_file test, I think we should try, if we're able, to write up a test along with this PR.

@McKnight-42 I've added a basic test for uploading three different types of files (CSV, NDJSON, and Parquet) in bfe11ba. I've never previously worked with tests, so this is all new to me. Can you take a look and see if these make sense?

@McKnight-42 self-requested a review March 7, 2022 21:10
@McKnight-42 (Contributor) commented:

@pgoslatara Great start on the tests, though I feel we need to go a little further: possibly querying the newly created tables and running some simple checks that they actually contain data, in case an error occurs and the database only creates an empty table.

@McKnight-42 self-requested a review March 8, 2022 20:29
@pgoslatara (Contributor, Author) commented:

This is a great suggestion.

I've added checks for the number of rows, distinct id values, the maximum updated_at value, and that the data types are as expected, in 1ed8a9d. I've done a bit of refactoring here to avoid having to define the checks for each uploaded table; let me know if this doesn't make sense or has any downsides. I also had to re-create the parquet file so that updated_at is a timestamp rather than a string.
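
For readers following along, checks of that kind can be expressed against BigQuery roughly as sketched below. This is an illustrative assumption using the google-cloud-bigquery client directly, not the test code added in this PR; names and parameters are hypothetical.

from google.cloud import bigquery

def check_uploaded_table(client, table_id, expected_rows, expected_types):
    # Row-count and distinct-id checks via a query against the uploaded table.
    query = f"SELECT COUNT(*) AS num_rows, COUNT(DISTINCT id) AS num_ids FROM `{table_id}`"
    result = list(client.query(query).result())[0]
    assert result.num_rows == expected_rows
    assert result.num_ids == expected_rows

    # Data-type checks via the table schema, e.g. {"updated_at": "TIMESTAMP"}.
    actual_types = {field.name: field.field_type for field in client.get_table(table_id).schema}
    for column, expected_type in expected_types.items():
        assert actual_types[column] == expected_type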

@McKnight-42 (Contributor) commented:

@pgoslatara Sorry about the delay, and great work on this. I've tested the updates locally and they are looking good. I believe all that's left is for you to update the changelog and mark off the rest of the checklist.

@pgoslatara
Copy link
Contributor Author

@McKnight-42 Done!

Thanks for your input on this one, I learned a lot about how to think about tests.

@McKnight-42 (Contributor) left a comment:

Tests both the success of the load into the database and the pull-down of table information to make sure it's all there, and covers many forms of upload. Well done!

@pgoslatara mentioned this pull request Apr 3, 2022
siephen pushed a commit to AgencyPMG/dbt-bigquery that referenced this pull request May 16, 2022
* Adding upload_file adapter

* flake8 formatting

* Replacing get_timeout with newer get_job_execution_timeout_seconds

* Adding integration tests for upload_file macro

* Removing conn arg from table_ref method

* Updating schema method to upload_file

* Adding checks on created tables

* Correcting class name

* Updating CHANGELOG.md

Co-authored-by: Matthew McKnight <[email protected]>

Successfully merging this pull request may close these issues:

Add macro/run-operation for uploading files from local filesystem to BigQuery.