Task crash_report_parquet job in socorro_import is failing #1198

Closed

willkg opened this issue Dec 7, 2020 · 7 comments

willkg commented Dec 7, 2020

The create_dataproc_cluster task is failing at some point after dataproc_init.sh is called. You can see the requirements getting installed in the logs and then nothing after that:

https://workflow.telemetry.mozilla.org/log?task_id=crash_report_parquet&dag_id=socorro_import&execution_date=2020-12-06T00%3A00%3A00%2B00%3A00

DAG ID                               Task ID                  Execution date               Try number
-----------------------------------  -----------------------  -------------------------  ------------
socorro_import.crash_report_parquet  create_dataproc_cluster  2020-12-06 00:00:00+00:00             4
socorro_import.crash_report_parquet  run_dataproc_pyspark     2020-12-06 00:00:00+00:00             1
...
[2020-12-07 02:20:58,017] {taskinstance.py:1194} INFO - Marking task as FAILED. dag_id=socorro_import, task_id=crash_report_parquet, execution_date=20201206T000000, start_date=20201207T022052, end_date=20201207T022058
$ gsutil cat gs://moz-fx-data-prod-dataproc-scratch/google-cloud-dataproc-metainfo/21a4a832-667d-4b80-85c2-72d4076d6736/socorro-import-dataproc-cluster-m/dataproc-initialization-script-0_output
...
Requirement already satisfied: python-dateutil in /opt/conda/anaconda/lib/python3.6/site-packages (from arrow==0.10.0->-r /tmp/requirements.txt (line 1)) (2.8.1)
Requirement already satisfied: pytz>=2011k in /opt/conda/anaconda/lib/python3.6/site-packages (from pandas==0.23.4->-r /tmp/requirements.txt (line 8)) (2020.1)
Requirement already satisfied: setuptools in /opt/conda/anaconda/lib/python3.6/site-packages (from protobuf==3.6.1->-r /tmp/requirements.txt (line 15)) (50.3.0.post20201103)
Requirement already satisfied: py4j==0.10.7 in /opt/conda/anaconda/lib/python3.6/site-packages (from -r /tmp/requirements.txt (line 16)) (0.10.7)
Requirement already satisfied: python-dateutil in /opt/conda/anaconda/lib/python3.6/site-packages (from arrow==0.10.0->-r /tmp/requirements.txt (line 1)) (2.8.1)
Requirement already satisfied: urllib3<1.26,>=1.25.4 in /opt/conda/anaconda/lib/python3.6/site-packages (from botocore->-r /tmp/requirements.txt (line 4)) (1.25.11)
Requirement already satisfied: click==6.7 in /opt/conda/anaconda/lib/python3.6/site-packages (from -r /tmp/requirements.txt (line 5)) (6.7)
Requirement already satisfied: python-dateutil in /opt/conda/anaconda/lib/python3.6/site-packages (from arrow==0.10.0->-r /tmp/requirements.txt (line 1)) (2.8.1)
Requirement already satisfied: pytz>=2011k in /opt/conda/anaconda/lib/python3.6/site-packages (from pandas==0.23.4->-r /tmp/requirements.txt (line 8)) (2020.1)
Requirement already satisfied: setuptools in /opt/conda/anaconda/lib/python3.6/site-packages (from protobuf==3.6.1->-r /tmp/requirements.txt (line 15)) (50.3.0.post20201103)
Requirement already satisfied: py4j==0.10.7 in /opt/conda/anaconda/lib/python3.6/site-packages (from -r /tmp/requirements.txt (line 16)) (0.10.7)

It's been failing for 4 days now.

willkg commented Dec 7, 2020

Two things:

  1. I don't really know enough about what I'm looking at to move further.
  2. Socorro expires the objects in the relevant S3 bucket after 21 days, so that's how long we've got before we lose data.

willkg self-assigned this Dec 7, 2020

willkg commented Dec 7, 2020

@jklukas walked me through looking at the Airflow DAG, logs, and related things.

The output above is from the third try. The log for the first try looks like this:

https://workflow.telemetry.mozilla.org/log?task_id=create_dataproc_cluster&dag_id=socorro_import.crash_report_parquet&execution_date=2020-12-03T00%3A00%3A00%2B00%3A00

Exception: Google Dataproc Operation projects/airflow-dataproc/regions/us-west1/operations/db31efa6-3419-36aa-9a0a-06230f7e64c1 failed: Multiple Errors:
 - Initialization action timed out. Failed action 'gs://moz-fx-data-prod-airflow-dataproc-artifacts/bootstrap/dataproc_init.sh', see output in: gs://moz-fx-data-prod-dataproc-scratch/google-cloud-dataproc-metainfo/bde63a29-5040-4fcb-a67d-e52434abf975/socorro-import-dataproc-cluster-m/dataproc-initialization-script-0_output
 - Initialization action timed out. Failed action 'gs://moz-fx-data-prod-airflow-dataproc-artifacts/bootstrap/dataproc_init.sh', see output in: gs://moz-fx-data-prod-dataproc-scratch/google-cloud-dataproc-metainfo/bde63a29-5040-4fcb-a67d-e52434abf975/socorro-import-dataproc-cluster-w-0/dataproc-initialization-script-0_output
 - Initialization action timed out. Failed action 'gs://moz-fx-data-prod-airflow-dataproc-artifacts/bootstrap/dataproc_init.sh', see output in: gs://moz-fx-data-prod-dataproc-scratch/google-cloud-dataproc-metainfo/bde63a29-5040-4fcb-a67d-e52434abf975/socorro-import-dataproc-cluster-w-1/dataproc-initialization-script-0_output
 - Initialization action timed out. Failed action 'gs://moz-fx-data-prod-airflow-dataproc-artifacts/bootstrap/dataproc_init.sh', see output in: gs://moz-fx-data-prod-dataproc-scratch/google-cloud-dataproc-metainfo/bde63a29-5040-4fcb-a67d-e52434abf975/socorro-import-dataproc-cluster-w-7/dataproc-initialization-script-0_output

The second and third retries fail with a 409 conflict error:

https://workflow.telemetry.mozilla.org/log?task_id=create_dataproc_cluster&dag_id=socorro_import.crash_report_parquet&execution_date=2020-12-03T00%3A00%3A00%2B00%3A00

[2020-12-04 01:18:53,721] {taskinstance.py:1150} ERROR - <HttpError 409 when requesting https://dataproc.googleapis.com/v1beta2/projects/airflow-dataproc/regions/us-west1/clusters?requestId=a76aaad5-4b66-4c1a-9ad7-b3a406000898&alt=json returned "Already exists: Failed to create cluster: Cluster projects/airflow-dataproc/regions/us-west1/clusters/socorro-import-dataproc-cluster">

The first attempt fails with a timeout, but it's not clear what's timing out.

Jeff suggested I reach out and ask for some SRE help, so I'm going to do that next.

BenWu commented Dec 7, 2020

The Dataproc initialization actions have a default timeout of ten minutes (https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/init-actions). This should be configurable through the Airflow operator, but we may need to add some parameters to the dataproc utils (https://github.com/mozilla/telemetry-airflow/blob/master/dags/utils/dataproc.py).

I took a look at the logs in the GCS bucket and they just show a long pip install, so my guess is that there's no error here; the install just happened to run past the ten-minute threshold.

For the 409: from what I understand, the cluster is still created even if the initialization times out. So the cluster sticks around until the end of its TTL, or maybe until the subdag fails after the retries; I'm not sure how that's set up.

That's my understanding of it, and it looks to me like all that needs to be done is to increase the initialization timeout.
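
For reference, a hedged sketch of what those settings look like from the gcloud CLI (the DAG actually creates the cluster through the operator wrappers in dags/utils/dataproc.py, so this is only to illustrate the knobs involved; the cluster name and region are taken from the error messages above):

# delete the half-initialized cluster left behind by the timed-out attempt
# (that leftover cluster is what triggers the 409 "Already exists" on retry)
gcloud dataproc clusters delete socorro-import-dataproc-cluster --region=us-west1

# recreate it with a longer init-action timeout (the default is 10m)
gcloud dataproc clusters create socorro-import-dataproc-cluster \
  --region=us-west1 \
  --initialization-actions=gs://moz-fx-data-prod-airflow-dataproc-artifacts/bootstrap/dataproc_init.sh \
  --initialization-action-timeout=30m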

scholtzan commented Dec 7, 2020

I did take a look at the logs out of curiosity. It looks to me like pip goes into infinite dependency resolution, similar to the bug in pypa/pip#9011.

For example, the line Requirement already satisfied: py4j==0.10.7 in /opt/conda/anaconda/lib/python3.6/site-packages (from -r /tmp/requirements.txt (line 16)) (0.10.7) appears 450 times in the log. That's why it runs into the timeout Ben mentioned (because it just keeps on resolving dependencies for eternity).
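
A quick way to confirm that, assuming read access to the scratch bucket, is to count the repeated line directly in the init-script output quoted earlier in this issue:

# count how many times pip re-checked py4j during dependency resolution
gsutil cat gs://moz-fx-data-prod-dataproc-scratch/google-cloud-dataproc-metainfo/21a4a832-667d-4b80-85c2-72d4076d6736/socorro-import-dataproc-cluster-m/dataproc-initialization-script-0_output \
  | grep -c "Requirement already satisfied: py4j"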

It seems to be using pip 20.3.1:

Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 20.2.4
    Uninstalling pip-20.2.4:
      Successfully uninstalled pip-20.2.4
Successfully installed pip-20.3.1

Using 20.2.4, which gets uninstalled for some reason, might solve the issue.

BenWu commented Dec 7, 2020

That's interesting, so changing the timeout alone definitely won't solve it. I saw that some other Dataproc jobs are affected as well, but not all. The ones that aren't affected set init_actions_uris=["gs://dataproc-initialization-actions/python/pip-install.sh"] instead of using the default gs://moz-fx-data-prod-airflow-dataproc-artifacts/bootstrap/dataproc_init.sh. So maybe using that init script would fix this job, but dataproc_init.sh should probably be changed to fix the other jobs as well. I think that's SRE's thing.

willkg commented Dec 8, 2020

Thank you! This makes so much more sense now.

pip 20.3.0 and later use the new package dependency resolver. Anna pointed to issue 9011 in the pip issue tracker, but that was fixed before 20.3.0 came out. That got me looking at resolver issues.

I think we're hitting this issue, which has similar symptoms but is different:

pypa/pip#9187

dataproc_init.sh upgrades pip to the latest version. Then the next step installs the requirements file from the ARTIFACTS bucket:

/opt/conda/default/bin/pip install --upgrade pip
/opt/conda/default/bin/pip install -r $PIP_REQUIREMENTS_FILE

With pip-install.sh, it tries to install pip with easy_install and then with the system package manager, so it's almost assuredly using an older version of pip. In my opinion, pip-install.sh should be removed; nothing should use it. That's all really old, creaky stuff. (Is it even using Python 3?)

I think the short-term fix is to change dataproc_init.sh to install pip<20.3.0, which puts it back on the old resolver and will work for now.
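
A minimal sketch of that change, assuming the two lines quoted above are the relevant part of dataproc_init.sh:

# pin pip below 20.3 so the old resolver keeps being used until the
# requirements file is untangled
/opt/conda/default/bin/pip install --upgrade "pip<20.3"
/opt/conda/default/bin/pip install -r $PIP_REQUIREMENTS_FILE

Another stopgap would be to stay on pip 20.3.x and pass --use-deprecated=legacy-resolver to the requirements install, but pinning pip is the smaller change.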

Then we can spin off a new issue to fix our requirements specification, which is what's causing the excessive backtracking during dependency resolution.

willkg added a commit to willkg/telemetry-airflow that referenced this issue Dec 8, 2020
Currently, `dataproc_init.sh` updates pip to 20.3.1 which tries to
install the dependencies, but takes > 10 minutes to figure out a
solution set that satisfies the version dependencies.

This has us go back to the previous pip which has the old resolver which
works while we figure out how to straighten out the dependency
requirements.
willkg added a commit to willkg/telemetry-airflow that referenced this issue Dec 10, 2020

willkg added a commit that referenced this issue Dec 10, 2020
acmiyaguchi commented:

The failing job has been resolved (for quite some time now), but it looks like #1200 is still relevant.
