add new bq job timeout and retry config #50

Merged

Conversation

hui-zheng
Contributor

@hui-zheng hui-zheng commented Oct 30, 2021

resolves #45

Description

This PR provides fine-grained control over the timeout and retry behavior of BigQuery queries through four dbt profile configs:

job_creation_timeout_seconds     # for initiating the BQ job; controls the timeout of step 1, query()
job_execution_timeout_seconds    # for awaiting the job result; controls the timeout of step 2, result()

job_retry_deadline_seconds       # for the overall query; controls the retry deadline of _query_and_results()
job_retries                      # for the overall query; controls the number of retries of _query_and_results()

These settings allow us to control the timeouts of step 1 and step 2 of a BigQuery query independently (see Context below), which maximizes the chance of mitigating different kinds of intermittent errors.

For example, we could set the configs below to fail fast at the BQ job creation step while still allowing long-running queries.

job_creation_timeout_seconds=30
job_execution_timeout_seconds=1200
job_retry_deadline_seconds=1500
job_retries=3
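
How a retry count and an overall retry deadline can interact is sketched below in plain Python. This is not the adapter's implementation: the function name `run_with_retries`, the `is_retryable` predicate, the injectable `clock` and `sleep` arguments, and the backoff policy are all assumptions made for illustration. Only the two config names come from this PR.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_retries(
    fn: Callable[[], T],
    job_retries: int,
    job_retry_deadline_seconds: float,
    is_retryable: Callable[[Exception], bool],
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn until it succeeds, retries run out, or the deadline passes."""
    start = clock()
    attempt = 0
    while True:
        try:
            return fn()
        except Exception as exc:
            attempt += 1
            out_of_retries = attempt > job_retries
            past_deadline = clock() - start >= job_retry_deadline_seconds
            # Give up on non-retryable errors, exhausted retries, or a passed deadline.
            if not is_retryable(exc) or out_of_retries or past_deadline:
                raise
            # Hypothetical backoff policy: exponential, capped at 10 seconds.
            sleep(min(2 ** attempt, 10))
```

With `job_retries=3` and `job_retry_deadline_seconds=1500`, a call that fails twice with a retryable error and then succeeds would return normally on the third attempt, as long as the total elapsed time stays under the deadline.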

NOTE:
job_execution_timeout_seconds is the new name for the previous timeout_seconds config.
job_retries is the new name for the previous retries config.

Context

At its core, dbt runs a BigQuery query in two steps:

def _query_and_results(self, client, sql, conn, job_params, timeout=None):
    """Query the client and wait for results."""
    # Cannot reuse job_config if destination is set and ddl is used
    job_config = google.cloud.bigquery.QueryJobConfig(**job_params)
    query_job = client.query(sql, job_config=job_config)               # <--- Step 1
    iterator = query_job.result(timeout=timeout)                       # <--- Step 2
    return query_job, iterator

In the first step, client.query() submits a query to the BQ JobInsert API server. When it succeeds, the BQ server creates a new BigQuery query job and returns the query job ID to the client as part of the query_job object. This step should be very quick, normally under a few seconds. However, in some rare cases it takes much longer, possibly up to 4 minutes according to the BigQuery engineering team.

In the second step, query_job.result() waits for BigQuery to execute the query and return the results to the client as an iterator. Depending on the complexity of the query, this step can take a long time, from tens of seconds to tens of minutes.
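
The two steps, each with its own timeout, can be sketched as follows. This is a minimal illustration, not the exact code in this PR: the standalone function `query_and_results` and the `FakeClient` and `FakeJob` stand-ins are hypothetical, and exist only so the sketch runs without a BigQuery connection. It assumes that both `client.query()` and `query_job.result()` accept a `timeout` keyword.

```python
from typing import Any, Optional, Tuple


def query_and_results(
    client: Any,
    sql: str,
    job_config: Any,
    job_creation_timeout: Optional[float] = None,
    job_execution_timeout: Optional[float] = None,
) -> Tuple[Any, Any]:
    # Step 1: submit the query; this timeout bounds job creation only.
    query_job = client.query(sql, job_config=job_config, timeout=job_creation_timeout)
    # Step 2: wait for the job to finish; this timeout bounds the wait for results.
    iterator = query_job.result(timeout=job_execution_timeout)
    return query_job, iterator


class FakeJob:
    """Hypothetical stand-in for a query job; records the timeout it was given."""

    def __init__(self) -> None:
        self.result_timeout: Any = "unset"

    def result(self, timeout: Optional[float] = None):
        self.result_timeout = timeout
        return iter([("row",)])


class FakeClient:
    """Hypothetical stand-in for a BigQuery client; records the timeout it was given."""

    def __init__(self) -> None:
        self.query_timeout: Any = "unset"

    def query(self, sql: str, job_config: Any = None, timeout: Optional[float] = None):
        self.query_timeout = timeout
        return FakeJob()


client = FakeClient()
job, rows = query_and_results(
    client, "select 1", None, job_creation_timeout=30, job_execution_timeout=1200
)
```

Keeping the two timeouts separate is what lets job creation fail after 30 seconds while a slow query is still allowed to run for 20 minutes.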


Checklist

  • I have signed the CLA
  • I have run this code in development and it appears to resolve the stated issue
  • This PR includes tests, or tests are not required/relevant for this PR
  • I have updated the CHANGELOG.md and added information about my change to the "dbt-bigquery next" section.

@cla-bot

cla-bot bot commented Oct 30, 2021

Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: Hui Zheng.
This is most likely caused by a git client misconfiguration; please make sure to:

  1. check if your git client is configured with an email to sign commits git config --list | grep email
  2. If not, set it up using git config --global user.email [email protected]
  3. Make sure that the git commit email is configured in your GitHub account settings, see https://github.com/settings/emails

@hui-zheng hui-zheng force-pushed the enhancement/bq-job-retry-timeout-configs branch from c363bb1 to c5f8874 Compare October 30, 2021 04:57
@cla-bot cla-bot bot added the cla:yes label Oct 30, 2021
@hui-zheng
Contributor Author

hello @jtcohen6,
I've submitted the first draft of the PR and am going to add some tests later. This is my first time contributing, so I'd appreciate any feedback and suggestions for next steps.

Contributor

@jtcohen6 jtcohen6 left a comment

@hui-zheng These changes look good to me.

As with all connection-related issues, it's difficult to stand up integration tests that replicate behavior in the wild. If you've been running this branch locally / in your deployment environment with success, that's a pretty strong vote of confidence for me. Is that something you've been able to do?

If so, could you also add an entry to the changelog (dbt-bigquery 1.0.0 (Release TBD) > Features), and add yourself to the list of contributors?

After merging, the last step is to update the prerelease docs: https://next.docs.getdbt.com/reference/warehouse-profiles/bigquery-profile#optional-configurations

Comment on lines +119 to +120
'retries': 'job_retries',
'timeout_seconds': 'job_execution_timeout_seconds',
Contributor

This is clever! Aliases also affect node-level configs, but I don't see a reason why we can't do it this way
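
The effect of such an alias mapping can be sketched as below. The two dictionary entries are the lines under review; the helper `apply_aliases` and its duplicate-key check are hypothetical and are not dbt's actual alias handling.

```python
from typing import Any, Dict

# Old profile key -> new profile key, as in the two lines under review.
ALIASES: Dict[str, str] = {
    "retries": "job_retries",
    "timeout_seconds": "job_execution_timeout_seconds",
}


def apply_aliases(config: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of config with old key names rewritten to the new ones."""
    translated: Dict[str, Any] = {}
    for key, value in config.items():
        new_key = ALIASES.get(key, key)
        # Hypothetical safeguard: refuse a config that sets both the old and new name.
        if new_key in translated:
            raise ValueError(f"'{new_key}' was set more than once (possibly via an alias)")
        translated[new_key] = value
    return translated


old_style_profile = {"method": "oauth", "retries": 3, "timeout_seconds": 300}
new_style_profile = apply_aliases(old_style_profile)
```

Under this scheme an existing profile that still uses `retries` and `timeout_seconds` keeps working, while the rest of the code only has to read the new names.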

@hui-zheng
Contributor Author

hui-zheng commented Nov 5, 2021

@jtcohen6

I haven't tested out this exact PR in our deployment environment. We currently use dbt v0.20.2, in which the dbt-bigquery is still part of dbt-core. We have a similar local patch fix to it.

I am happy to test this PR out. Could you let me know which stable version of dbt-core I could pair it with for testing?

Could you give me some direction on how to install this dbt-bigquery with dbt-core from source?

@jtcohen6
Contributor

jtcohen6 commented Nov 7, 2021

@hui-zheng This change will be going into dbt-bigquery v1.0.0, which is still in prerelease (and pairs with the dbt-core v1.0.0 prereleases). So there isn't a stable version (yet) to test in production.

By installing this package locally (pip install -e .), it will also install the latest compatible prerelease of dbt-core, which is 1.0.0-b2. For the sake of the changes you're making, that should work fine.

As it is, these changes look good. Given that we're coming up on v1.0.0-rc1, we may merge to include and test more in the release candidate.

@jtcohen6
Contributor

jtcohen6 commented Nov 7, 2021

Going to close and reopen just to trigger adapter integration tests

@jtcohen6 jtcohen6 closed this Nov 7, 2021
@jtcohen6 jtcohen6 reopened this Nov 7, 2021
@hui-zheng
Contributor Author

hui-zheng commented Nov 9, 2021

@jtcohen6

I tried pip install -e . in /dbt-bigquery and got this error:

The conflict is caused by:
    dbt-core 1.0.0b2 depends on dbt-extractor==0.4.0
    dbt-core 1.0.0b1 depends on dbt-extractor==0.4.0
Full error message:
Collecting dbt-extractor==0.4.0
  Using cached dbt_extractor-0.4.0.tar.gz (21 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... error
    ERROR: Command errored out with exit status 1:
     command: /Users/hui-zheng/.pyenv/versions/3.8.12/envs/dbt-lab/bin/python3.8 /Users/hui-zheng/.pyenv/versions/3.8.12/envs/dbt-lab/lib/python3.8/site-packages/pip/_vendor/pep517/in_process/_in_process.py prepare_metadata_for_build_wheel /var/folders/g6/3qpc1f9533j6hgcwjzcp__cr0000gn/T/tmp4iygtpl0
         cwd: /private/var/folders/g6/3qpc1f9533j6hgcwjzcp__cr0000gn/T/pip-install-5u3uc0sp/dbt-extractor_4897a4d46c244cb7a180c2507be2db66
    Complete output (6 lines):

    Cargo, the Rust package manager, is not installed or is not on PATH.
    This package requires Rust and Cargo to compile extensions. Install it through
    the system's package manager or via https://rustup.rs/

    Checking for Rust toolchain....
    ----------------------------------------
WARNING: Discarding https://files.pythonhosted.org/packages/a7/5c/609f02383178208612d6ac21228ca256337d3c18afb13b29f122720a26ad/dbt_extractor-0.4.0.tar.gz#sha256=58672e36fab988c849a693405920ee18421f27245c48e5f9ecf496369ed31a85 (from https://pypi.org/simple/dbt-extractor/). Command errored out with exit status 1: /Users/hui-zheng/.pyenv/versions/3.8.12/envs/dbt-lab/bin/python3.8 /Users/hui-zheng/.pyenv/versions/3.8.12/envs/dbt-lab/lib/python3.8/site-packages/pip/_vendor/pep517/in_process/_in_process.py prepare_metadata_for_build_wheel /var/folders/g6/3qpc1f9533j6hgcwjzcp__cr0000gn/T/tmp4iygtpl0 Check the logs for full command output.
Collecting dbt-core~=1.0.0b1
  Using cached dbt_core-1.0.0b1-py3-none-any.whl (789 kB)
INFO: pip is looking at multiple versions of <Python from Requires-Python> to determine which version is compatible with other requirements. This could take a while.
INFO: pip is looking at multiple versions of dbt-bigquery to determine which version is compatible with other requirements. This could take a while.
ERROR: Cannot install dbt-bigquery because these package versions have conflicting dependencies.

The conflict is caused by:
    dbt-core 1.0.0b2 depends on dbt-extractor==0.4.0
    dbt-core 1.0.0b1 depends on dbt-extractor==0.4.0

To fix this you could try to:
1. loosen the range of package versions you've specified
2. remove package versions to allow pip attempt to solve the dependency conflict

ERROR: ResolutionImpossible: for help visit https://pip.pypa.io/en/latest/user_guide/#fixing-conflicting-dependencies

My branch is up to date with dbt-bigquery main branch.

My python versions

❯ python --version
Python 3.8.12
❯ pip --version
pip 21.1.1 from /Users/hui-zheng/.pyenv/versions/3.8.12/envs/dbt-lab/lib/python3.8/site-packages/pip (python 3.8)

Do you know how to resolve it?

Contributor

@jtcohen6 jtcohen6 left a comment

@hui-zheng I don't know exactly what went wrong there, but it seems like an issue with trying to install dbt-bigquery in an environment that already has dbt-core==1.0.0b1. The simplest solution might be to clear away the virtual environment, create a fresh one, and try again from there.

@@ -323,17 +330,24 @@ def open(cls, connection):
         return connection

     @classmethod
-    def get_timeout(cls, conn):
+    def get_job_execution_timeout_seconds(cls, conn):
Contributor

There are some integration test failures like:

Unhandled error while executing ...
'BigQueryConnectionManager' object has no attribute 'get_timeout'

We need to replace get_timeout with get_job_execution_timeout_seconds within load_dataframe, to fix dbt seed:

timeout = self.connections.get_timeout(conn)

Plus one unit test:

self.connections.get_timeout = lambda x: 100.0

Contributor Author

I could look into that

Contributor Author

done

Contributor Author

@hui-zheng hui-zheng left a comment

updated Changelog

@@ -323,17 +330,24 @@ def open(cls, connection):
         return connection

     @classmethod
-    def get_timeout(cls, conn):
+    def get_job_execution_timeout_seconds(cls, conn):
Contributor Author

done

@hui-zheng
Contributor Author

> the last step is to update the prerelease docs: https://next.docs.getdbt.com/reference/warehouse-profiles/bigquery-profile#optional-configurations

@jtcohen6 , How do I update the docs?

@jtcohen6
Contributor

I triggered the integration test to re-run. (We had some intermittent failures over the past few days due to our sandbox BQ environment.)

It looks like there are some flake8 failures (simple), and failures in the unit tests related to mock code (less simple). One of the failures looks like:

unsupported type for timedelta seconds component: MagicMock

Unfortunately, I can't dig in deeper right now. It should be somewhat easier to run and reproduce those unit test failures locally. Let me know if you're still stuck later this week, and I can try to dive in.
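
One plausible way this kind of failure arises in mock-based tests is sketched below. The getter name `get_job_retry_deadline_seconds` is an assumption made for illustration; the thread does not show which mocked call produced the error.

```python
from datetime import timedelta
from unittest.mock import MagicMock

connections = MagicMock()

# An unconfigured mock method returns another MagicMock, not a number,
# so building a timedelta from it fails.
try:
    timedelta(seconds=connections.get_job_retry_deadline_seconds(None))
    error_message = ""
except TypeError as exc:
    error_message = str(exc)

# Giving the mocked getter a numeric return value avoids the error.
connections.get_job_retry_deadline_seconds.return_value = 1500
deadline = timedelta(seconds=connections.get_job_retry_deadline_seconds(None))
```

If this is the cause, the fix in the unit tests would be to set explicit numeric return values on any mocked getters whose results feed into time arithmetic.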

> How do I update the docs?

Docs updates happen in this repo: https://github.com/dbt-labs/docs.getdbt.com

I recommend opening an issue, and then (if you feel up to it) a PR targeting next to resolve it as well. As a shortcut, you can click the "Edit this page" link at the bottom of each page in the docs. Clicking that for the page I linked would take you here: https://github.com/dbt-labs/docs.getdbt.com/edit/next/website/docs/reference/warehouse-profiles/bigquery-profile.md

@jtcohen6
Contributor

@hui-zheng Sorry about that, I tried merging in the recent changes from main, but it looks like I may have added a few flake8 errors of my own. Feel free to remove those commits from your branch.

There's still a handful of unit tests failing:

FAILED tests/unit/test_bigquery_adapter.py::TestBigQueryConnectionManager::test_copy_bq_table_appends
FAILED tests/unit/test_bigquery_adapter.py::TestBigQueryConnectionManager::test_copy_bq_table_truncates
FAILED tests/unit/test_bigquery_adapter.py::TestBigQueryConnectionManager::test_drop_dataset
FAILED tests/unit/test_bigquery_adapter.py::TestBigQueryConnectionManager::test_query_and_results
FAILED tests/unit/test_bigquery_adapter.py::TestBigQueryConnectionManager::test_retry_and_handle
FAILED tests/unit/test_bigquery_adapter.py::TestBigQueryConnectionManager::test_retry_connection_reset

The issue is that this PR has changed the function signature of _query_and_results (job_execution_timeout + job_creation_timeout replaced timeout), and the expected return value of _retry_and_handle (deadline is no longer just None). That has implications for these unit tests, which "mock" the anticipated inputs and outputs of those methods. I don't think it would be tremendously difficult to debug those; e.g., to fix test_query_and_results I just needed to change this line to:

          query='sql', job_config=mock_bq.QueryJobConfig(), timeout=None)
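
The reason the expected call needs the extra keyword can be shown with a small self-contained sketch. The function body below is a hypothetical stand-in with the two-timeout signature described above, not the adapter's actual method.

```python
from unittest.mock import MagicMock


def query_and_results(client, sql, job_config,
                      job_creation_timeout=None, job_execution_timeout=None):
    """Hypothetical stand-in that passes one timeout to each step."""
    query_job = client.query(query=sql, job_config=job_config,
                             timeout=job_creation_timeout)
    return query_job, query_job.result(timeout=job_execution_timeout)


mock_client = MagicMock()
job_config = MagicMock(name="QueryJobConfig")

query_and_results(mock_client, "sql", job_config)

# Mock call assertions compare arguments exactly, so the expected call must
# now include the timeout keyword even when its value is None.
mock_client.query.assert_called_once_with(
    query="sql", job_config=job_config, timeout=None)
mock_client.query.return_value.result.assert_called_once_with(timeout=None)
```

An assertion written for the old call shape, without `timeout=None`, would fail here even though the behavior under test is unchanged.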

@hui-zheng
Contributor Author

> which "mock" the anticipated inputs and outputs of those methods. I don't think it would be tremendously difficult to debug those

@jtcohen6 Sorry for the delay in responding; I've been a bit busy lately. I'll fix those tests once I have time.

@hui-zheng
Contributor Author

@jtcohen6
I fixed all failed unit tests and passed the local tox run. Hope it resolves everything.

I haven't updated the docs, let me know if it's something I shall do before this PR gets merged or after.

hui-zheng added a commit to hui-zheng/docs.getdbt.com that referenced this pull request Dec 14, 2021
update docs according the changes in dbt-labs/dbt-bigquery#50
@hui-zheng
Contributor Author

@jtcohen6

I also updated the docs. Please see it here. dbt-labs/docs.getdbt.com#962

Please let me know if anything else is required.

@McKnight-42
Contributor

Hi @hui-zheng, it seems this branch has fallen behind main and is now failing some test checks. If you could update it to main, that would be very much appreciated.

@hui-zheng
Contributor Author

Hi @McKnight-42, I just brought it up to date.

@hui-zheng hui-zheng requested a review from jtcohen6 January 11, 2022 03:32
@McKnight-42 McKnight-42 self-requested a review January 12, 2022 06:04
Contributor

@McKnight-42 McKnight-42 left a comment

great work on this!

@McKnight-42 McKnight-42 merged commit 141b867 into dbt-labs:main Jan 12, 2022
jtcohen6 pushed a commit to dbt-labs/docs.getdbt.com that referenced this pull request Mar 30, 2022
update docs according the changes in dbt-labs/dbt-bigquery#50
jtcohen6 pushed a commit to dbt-labs/docs.getdbt.com that referenced this pull request Apr 7, 2022
update docs according the changes in dbt-labs/dbt-bigquery#50
jtcohen6 added a commit to dbt-labs/docs.getdbt.com that referenced this pull request Apr 7, 2022
* Update bigquery-profile.md

update docs according the changes in dbt-labs/dbt-bigquery#50

* Update upgrading-to-1-0-0.md

* Add versioning logic. Edit

* Update migration guides

* PR feedback, corrections

Co-authored-by: Hui Zheng <[email protected]>
siephen pushed a commit to AgencyPMG/dbt-bigquery that referenced this pull request May 16, 2022
* add new bq job timeout and retry config

* fix

* fix

* update changelog

* fix

* Fix merge

* fix test_query_and_results

* 2nd attempt to fix test_query_and_results

* fixed other test cases

* fix aliases

* update changelog

* update changelog

* fix linting

Co-authored-by: Jeremy Cohen <[email protected]>

Successfully merging this pull request may close these issues.

[Feature] dbt-Bigquery retry for DML job BQ API errors (503 errors, etc. )