Adding upload_file adapter #121

Merged: 11 commits, Mar 10, 2022

Conversation

@pgoslatara (Contributor) commented Feb 13, 2022

resolves #102

Description

Adding an upload_file adapter that uses the load_table_from_file method. This adapter accepts multiple keyword arguments, allowing maximum customisation of the LoadJobConfig class.
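
For context, the underlying call pattern looks roughly like the sketch below. This is a simplified illustration, not the implementation in this PR; the function name, signature, and defaults are assumptions, and only the google-cloud-bigquery pieces (load_table_from_file, LoadJobConfig) come from the description above.

from google.cloud import bigquery

def upload_file_sketch(client, local_file_path, database, table_schema, table_name, **kwargs):
    # Hypothetical helper: build a LoadJobConfig and copy every supplied kwarg
    # onto it (e.g. autodetect, skip_leading_rows, write_disposition,
    # source_format); a JSON 'schema' argument would need to be parsed first.
    job_config = bigquery.LoadJobConfig()
    for option, value in kwargs.items():
        setattr(job_config, option, value)

    table_id = f"{database}.{table_schema}.{table_name}"
    with open(local_file_path, "rb") as file_obj:
        load_job = client.load_table_from_file(file_obj, table_id, job_config=job_config)
    load_job.result()  # block until the load job completes, raising on failure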

Example files:

  • ./samples/three_columns.csv:
id,name,created_at
1,john,2022-02-06 19:24:23.614 UTC
  • ./samples/id_only.csv:
id
1
  • ./samples/users.ndjson:
{"id":1,"name":"Alice"}
{"id":2,"name":"Bob"}
{"id":3,"name":"Carol"}

How to use:

  • Via command line:
dbt -d run-operation bigquery__upload_file --args "{local_file_path: './samples/three_columns.csv', database: '<GCP_PROJECT>', table_schema: 'dbt', table_name: 'three_columns', skip_leading_rows: 1, autodetect: True, write_disposition: 'WRITE_TRUNCATE'}"

dbt -d run-operation bigquery__upload_file --args "{local_file_path: './samples/id_only.csv', database: '<GCP_PROJECT>', table_schema: 'dbt', table_name: 'upload_id', skip_leading_rows: 1, write_disposition: 'WRITE_TRUNCATE', schema: '[{\"name\": \"employee_id\", \"type\": \"INTEGER\"}]'}"

dbt -d run-operation bigquery__upload_file --args "{local_file_path: './samples/users.ndjson', database: '<GCP_PROJECT>', table_schema: 'dbt', table_name: 'users', autodetect: True, write_disposition: 'WRITE_TRUNCATE', source_format: 'NEWLINE_DELIMITED_JSON'}"
  • Via a pre_hook:
{{
    config(
        pre_hook=[
            "{{ bigquery__upload_file(
                './samples/three_columns.csv',
                model['database'],
                model['schema'],
                'three_columns',
                autodetect=True,
                skip_leading_rows=1,
                write_disposition='WRITE_TRUNCATE',
                source_format='CSV'
            ) }}",
            "{{ bigquery__upload_file(
                './samples/id_only.csv',
                model['database'],
                model['schema'],
                'id_only',
                skip_leading_rows=1,
                write_disposition='WRITE_TRUNCATE',
                source_format='CSV',
                schema='[{\"name\": \"employee_id\", \"type\": \"INTEGER\"}]'
            ) }}",
            "{{ bigquery__upload_file(
                './samples/users.ndjson',
                model['database'],
                model['schema'],
                'users',
                autodetect=True,
                write_disposition='WRITE_TRUNCATE',
                source_format='NEWLINE_DELIMITED_JSON'
            ) }}"
        ]
    )
}}

Open questions from my side:

  • Currently the adapter only logs the kwargs; do we want to log anything else?
  • The most similar adapter method is load_dataframe, which does not have tests. Should an issue be opened to add tests for both load_dataframe and upload_file?

Comment:
Neither manifest.json nor run_results.json can be uploaded using this adapter, as they do not conform to the NDJSON specification. If this PR is merged, an issue can be opened to address this (possibly by altering these files before calling the upload_file adapter).
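
As a rough illustration of that follow-up idea (an assumption, not code from this PR), a dbt artifact could be reshaped into NDJSON by writing one JSON object per line, for example:

import json

def artifact_to_ndjson(src_path, dest_path, key="nodes"):
    # Hypothetical helper: flatten one top-level mapping of a dbt artifact
    # (e.g. the "nodes" entry of manifest.json) into newline-delimited JSON.
    with open(src_path) as src:
        artifact = json.load(src)
    with open(dest_path, "w") as dest:
        for unique_id, node in artifact.get(key, {}).items():
            dest.write(json.dumps({"unique_id": unique_id, **node}) + "\n")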

Checklist

  • I have signed the CLA
  • I have run this code in development and it appears to resolve the stated issue
  • This PR includes tests, or tests are not required/relevant for this PR
  • I have updated the CHANGELOG.md and added information about my change to the "dbt-bigquery next" section.

@cla-bot added the cla:yes label Feb 13, 2022
@pgoslatara (Contributor, Author) commented:

@McKnight-42 Can I get some eyes on this please? Eager to see this functionality added to dbt-bigquery (and hopefully move closer to having dbt_artifacts as an option).

@McKnight-42 self-requested a review February 28, 2022 20:43
@McKnight-42 (Contributor) commented:

Hi @pgoslatara, I'm taking a look today and will pass along some questions sometime tomorrow. Sorry for the delay.

@McKnight-42 (Contributor) commented Mar 2, 2022

Currently the adapter only logs the kwargs; do we want to log anything else?

@jtcohen6 I'm curious if you have any thoughts on this one way or the other?

@McKnight-42 (Contributor) commented:

Should an issue be opened to add tests for both load_dataframe and upload_file?

@pgoslatara good catch on the missing load_dataframe test. I will open an issue to create a test for that.

For the upload_file test, I think we should try, if we're able, to write up a test along with this PR.

@pgoslatara (Contributor, Author) commented:

For the upload_file test, I think we should try, if we're able, to write up a test along with this PR.

@McKnight-42 I've added a basic test for uploading three different types of files (CSV, NDJSON, and Parquet) in bfe11ba. I've never previously worked with tests, so this is all new to me. Can you take a look and see if these make sense?

@McKnight-42 self-requested a review March 7, 2022 21:10
@McKnight-42 (Contributor) commented:

@pgoslatara Great start on the tests, though I feel we need to go a little further: possibly querying the newly created tables and running some simple checks that they actually contain data, in case an error occurs and the database only creates an empty table.

@McKnight-42 self-requested a review March 8, 2022 20:29
@pgoslatara (Contributor, Author) commented:

This is a great suggestion.

I've added checks for the number of rows, distinct id values, the maximum updated_at value, and that the data types are as expected, in 1ed8a9d. I've done a bit of refactoring here to avoid having to define the checks for each uploaded table; let me know if this doesn't make sense or has any downsides. I also had to re-create the parquet file so that updated_at is a timestamp rather than a string.
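
For readers following along, checks of that kind can be expressed against BigQuery roughly as sketched below. This is an illustrative assumption using the google-cloud-bigquery client directly, not the test code added in this PR; names and parameters are hypothetical.

from google.cloud import bigquery

def check_uploaded_table(client, table_id, expected_rows, expected_types):
    # Row-count and distinct-id checks via a query against the uploaded table.
    query = f"SELECT COUNT(*) AS num_rows, COUNT(DISTINCT id) AS num_ids FROM `{table_id}`"
    result = list(client.query(query).result())[0]
    assert result.num_rows == expected_rows
    assert result.num_ids == expected_rows

    # Data-type checks via the table schema, e.g. {"updated_at": "TIMESTAMP"}.
    actual_types = {field.name: field.field_type for field in client.get_table(table_id).schema}
    for column, expected_type in expected_types.items():
        assert actual_types[column] == expected_type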

@McKnight-42 (Contributor) commented:

@pgoslatara Sorry about the delay, and great work on this. I've tested the updates locally and they are looking good. I believe all that's left is for you to update the changelog and mark off the rest of the checklist.

@pgoslatara
Copy link
Contributor Author

@McKnight-42 Done!

Thanks for your input on this one, I learned a lot about how to think about tests.

@McKnight-42 (Contributor) left a comment:

Tests both the success of the load into the database and the pull-down of table information to make sure it's all there, and covers many forms of upload. Well done!

@pgoslatara mentioned this pull request Apr 3, 2022
siephen pushed a commit to AgencyPMG/dbt-bigquery that referenced this pull request May 16, 2022
* Adding upload_file adapter

* flake8 formatting

* Replacing get_timeout with newer get_job_execution_timeout_seconds

* Adding integration tests for upload_file macro

* Removing conn arg from table_ref method

* Updating schema method to upload_file

* Adding checks on created tables

* Correcting class name

* Updating CHANGELOG.md

Co-authored-by: Matthew McKnight <[email protected]>

Successfully merging this pull request may close these issues:

Add macro/run-operation for uploading files from local filesystem to BigQuery.