
Support BigQuery #6

Closed
yu-iskw opened this issue Mar 11, 2021 · 20 comments
Labels
enhancement New feature or request

Comments

@yu-iskw

yu-iskw commented Mar 11, 2021

The dbt package is awesome. I would like to do the same thing for BigQuery. However, I am not sure we can achieve the same with run_query alone because, in the case of BigQuery, we would have to upload the artifact files to GCS first. As far as I know, there is no SQL statement type that can put a local file into GCS or BigQuery directly.

NOTE

Unfortunately, there is no DDL-like statement to load data from a local file. It would be worthwhile raising that with Google Cloud.
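
For reference, staging a local artifact in GCS from Python is straightforward with the google-cloud-storage client. A minimal sketch (the bucket and object names are placeholders, not part of any package):

```python
from google.cloud import storage

# Stage a dbt artifact in GCS so BigQuery could load it from there.
client = storage.Client()
bucket = client.bucket("my-dbt-artifacts")         # placeholder bucket name
blob = bucket.blob("run_artifacts/manifest.json")  # placeholder object path
blob.upload_from_filename("target/manifest.json")  # dbt's default artifact location
```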

@aaronsteers

aaronsteers commented Jun 10, 2021

I'm also curious about supporting other sources. Is copying the raw data the primary blocker? Or, said another way, is there any other platform-specific code that could not be written generically?

And for the copy operation, is there any chance we could trick/hijack the seed/snapshot capability to get a similar result? There must be a generic API layer for data ingestion in order to make seed capabilities work across platforms. (I know this is a long shot, and probably not a critical priority, but I'm curious to hear others' thoughts on this.) Thanks!

@NiallRees
Contributor

My initial thoughts are that we could extend the adapter for each warehouse to add an upload function which is accessible through a
{% do adapter.upload_file(file_path, destination) %}
or similar. That function can then just call any API methods needed to load the data into the warehouse.

From @jtcohen6:

dbt-bigquery connects using Google's Python clients, and I know those also support uploading JSON files. I tried doing this way back when, hacking into some methods intended for seeds, and got pretty close. I do think it needs a few (probably simple) changes to the adapter code.

Brooklyn Data can take this on in the next few weeks, or contributions are very welcome! See https://github.com/dbt-labs/dbt-bigquery.
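
A minimal sketch of what such an adapter-level upload could look like with the google-cloud-bigquery client (upload_file is the name proposed above; the file path, schema autodetection, and table ID are assumptions for illustration):

```python
from google.cloud import bigquery

def upload_file(file_path: str, destination: str) -> None:
    """Load a local newline-delimited JSON file into a BigQuery table."""
    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
        autodetect=True,  # infer the table schema from the file (an assumption here)
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )
    with open(file_path, "rb") as f:
        job = client.load_table_from_file(f, destination, job_config=job_config)
    job.result()  # block until the load job completes

# Hypothetical usage; the project, dataset, and table are placeholders.
upload_file("target/manifest.json", "my-project.dbt_artifacts.manifest")
```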

@pgoslatara

From @jtcohen6:

dbt-bigquery connects using Google's Python clients, and I know those also support uploading JSON files. I tried doing this way back when, hacking into some methods intended for seeds, and got pretty close. I do think it needs a few (probably simple) changes to the adapter code.

@NiallRees Do you have a link for this quote? I'm interested in seeing if there is more background to this statement and my googling skills have failed me when I tried to find this quote.

@NiallRees
Contributor

@NiallRees Do you have a link for this quote? I'm interested in seeing if there is more background to this statement and my googling skills have failed me when I tried to find this quote.

Hi, this was in a message so I don't. You could definitely ask for some pointers over on the issue in dbt-bigquery though! dbt-labs/dbt-bigquery#102

@tuftkyle

It looks like the macro to upload files to BigQuery has been completed and will be included in dbt v1.1. Maybe we can start working on a port now?

@pgoslatara

@tuftkyle Thanks for the interest in this! I've been looking into this issue for a few weeks (I'm the author of the MR you reference). I think the next step is still within dbt-bigquery: the artifacts produced by dbt are JSON, whereas BigQuery expects NDJSON files for uploads (mentioned in this comment). I'm not sure how to convert JSON to NDJSON using Python. I'm also not sure how to handle periods (".") in the keys in manifest.json, as BigQuery does not support column names with periods. Maybe we can replace periods with double underscores, or remove them? Opinions are welcome as I don't know the best approach!
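
On the conversion question, one possible approach in Python (a minimal sketch, assuming each artifact becomes a single NDJSON record; json.dumps emits no internal newlines by default):

```python
import json

def artifact_to_ndjson(src_path: str, dst_path: str) -> None:
    """Rewrite a dbt artifact as NDJSON: one JSON object per line."""
    with open(src_path) as src:
        doc = json.load(src)  # each artifact is a single JSON document
    with open(dst_path, "w") as dst:
        dst.write(json.dumps(doc) + "\n")  # one compact object per line

artifact_to_ndjson("target/manifest.json", "target/manifest.ndjson")
```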

@NiallRees
Contributor

Hey @pgoslatara - the artifacts produced by dbt can already be considered newline delimited; they just have one object per file, so we should be good to go there. As for the periods (".") in the keys, those are fine as long as we use the JSON subscript operator in BigQuery, e.g. manifest['key.with.periods'].

To go about this, I'd suggest first attempting to upload the artifacts into a BigQuery table (or table for each artifact type if needed), and from there working out how to get them into the format required by https://github.com/brooklyn-data/dbt_artifacts/blob/main/models/staging/stg_dbt__artifacts.sql. After that, it should just be a case of making the models compatible with both BigQuery and Snowflake.

I'm happy to be as involved as required so please let me know if you'd like more help :)
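
A quick way to confirm the newline-delimited point for a local artifact (a sketch; the path assumes dbt's default target directory):

```python
import json

with open("target/manifest.json") as f:
    lines = f.read().splitlines()

# dbt writes each artifact as one compact JSON object on a single line,
# which is exactly one valid NDJSON record.
assert len(lines) == 1
record = json.loads(lines[0])
print(type(record))  # <class 'dict'>
```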

@pgoslatara

@NiallRees I've submitted MR153 for this and am currently working through some suggested changes.

Can you elaborate on the "as long as we use the JSON subscript operator" element of your comment? I'm not sure I follow it fully. If an artifact is uploaded as a STRING, I understand that it can be converted to a JSON data type (although this is still in preview and requires enrollment to access). The blocker I see is that after this conversion there is still a period in the key name, and I'm not sure whether BigQuery can handle that (as I don't have access to these features yet).

NiallRees added a commit that referenced this issue Jul 21, 2022
Add tests, exposures and snapshots verticals
@NiallRees
Contributor

Hey again all - we've been busy reimplementing the package, opening it up to be compatible with more adapters without having to come up with a warehouse-specific way of uploading the artifacts - by avoiding the artifacts altogether. We are now uploading the graph and results context variables instead.

In the new world, adding BigQuery compatibility involves implementing a BigQuery version of each warehouse-dispatchable macro defined in https://github.com/brooklyn-data/dbt_artifacts/tree/main/macros. Let us know if you'd be interested in contributing those changes!

@charles-astrafy
Contributor

Awesome, I will be working on making it compatible with BigQuery, starting tomorrow. If anyone is up for pair programming on this, please let me know.

@NiallRees
Contributor

How are you getting on here @charles-astrafy? Shout if I can help at all!

@charles-astrafy
Contributor

Sorry, I have been sidetracked a bit with other things. I have time to work on it tomorrow and will give an update by EOD tomorrow.

@NiallRees
Contributor

All good @charles-astrafy no pressure from over here!

@jecolvin jecolvin added the enhancement New feature or request label Aug 2, 2022
@charles-astrafy
Contributor

@NiallRees I started to work on it as of today. I will keep you posted and will make time on a daily basis in the coming days. I might need your guidance/help if I encounter some blockers, but all good at the moment.

@NiallRees
Contributor

Awesome @charles-astrafy!

@charles-astrafy
Contributor

@NiallRees Just a small update. I have been making good progress, but quite some refactoring is needed to make it work with BigQuery. For instance, the upload macros are currently generic, using "SELECT ... FROM VALUES (...), (...)". That syntax does not work in BigQuery, so I am adding a dispatch abstraction layer for those macros. I should have a pull request ready by Friday.

@NiallRees
Contributor

Sounds great @charles-astrafy, appreciate your efforts! Really looking forward to seeing how you get it working.

@NiallRees NiallRees reopened this Aug 10, 2022
@NiallRees
Contributor

clicked the wrong button 🙈

@charles-astrafy
Contributor

charles-astrafy commented Aug 12, 2022

@NiallRees --- Pull request done.

#172

@adrpino

adrpino commented Aug 24, 2022

Hi all! Sorry for jumping in! This could be very helpful for me as well :D

Is this only needing review? Is it fully functional already, @charles-astrafy?

Thanks!
