This repository has been archived by the owner on Aug 4, 2023. It is now read-only.

Bump Airflow to 2.4.0, standardize version bump process #737

Merged
merged 10 commits into from
Sep 26, 2022

Conversation

AetherUnbound
Contributor

Fixes

Fixes WordPress/openverse#1582 by @AetherUnbound

This also addresses some security issues that dependabot identified.

Description

This PR updates our Airflow version to 2.4.0. I've also borrowed the pattern set up in #656 and applied it to the Airflow version, so we only have to update it in one place now!

As part of the version update, I've also renamed the schedule_interval parameter to schedule. Unfortunately, the airflow.models.dag.DAG model does not have a schedule attribute; you must still use schedule_interval when referencing it on the DAG object. So there are a few instances where we could not update that value.
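
The asymmetry described above can be illustrated without importing Airflow. The sketch below uses a hypothetical stand-in class, `StandInDag`, that only mimics the behavior reported in this PR (the constructor accepts `schedule`, while the instance exposes `schedule_interval`). It is not Airflow's actual implementation, and `read_schedule` is likewise a name made up for illustration.

```python
class StandInDag:
    """Hypothetical stand-in that mimics the behavior described in this PR.

    This is NOT Airflow's DAG class. It only models "accepts `schedule`,
    exposes `schedule_interval`".
    """

    def __init__(self, dag_id, schedule=None):
        self.dag_id = dag_id
        # The constructor argument is named `schedule`, but the value is
        # stored under the older attribute name.
        self.schedule_interval = schedule


def read_schedule(dag):
    """Read the schedule from a DAG-like object via `schedule_interval`."""
    return dag.schedule_interval


dag = StandInDag("example_dag", schedule="@daily")
print(read_schedule(dag))        # the value that was passed as `schedule`
print(hasattr(dag, "schedule"))  # whether the new name exists as an attribute
```

Under that assumption, any code or template that reads the value back from the DAG object keeps using `schedule_interval`, even though the DAG is declared with `schedule`.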

I added a bunch of deprecation warning filters for our tests to make the warning summary more relevant (presently there should be no warnings). Most of the warnings were either 1) irrelevant & unfixable or 2) due to upstream dependencies.
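
As a rough illustration of what such a filter does, here is a minimal sketch using Python's standard `warnings` module. The warning message and the helper names are hypothetical, and the filters in this PR may well be declared in the test configuration rather than in Python code.

```python
import warnings


def emit_upstream_warning():
    # Hypothetical stand-in for a deprecation warning raised by a dependency.
    warnings.warn("hypothetical upstream API is deprecated", DeprecationWarning)


def count_warnings(with_filter):
    """Return how many warnings are recorded, optionally with a filter."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # record everything by default
        if with_filter:
            # Ignore only DeprecationWarnings whose message starts with
            # this pattern; everything else is still reported.
            warnings.filterwarnings(
                "ignore",
                message="hypothetical upstream API",
                category=DeprecationWarning,
            )
        emit_upstream_warning()
    return len(caught)


print(count_warnings(with_filter=False))
print(count_warnings(with_filter=True))
```

Matching on a specific message and category, rather than ignoring every `DeprecationWarning`, keeps new and relevant warnings visible in the summary.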

Lastly, and probably most notably:

⚠️ Developer note ⚠️
The extra argument for AIRFLOW_CONN_AWS_DEFAULT has been updated from host to endpoint_url:
Before: AIRFLOW_CONN_AWS_DEFAULT=aws://test_key:test_secret@?region_name=us-east-1&host=http://s3:5000
After: AIRFLOW_CONN_AWS_DEFAULT=aws://test_key:test_secret@?region_name=us-east-1&endpoint_url=http://s3:5000

You will continue to receive warnings about extra["host"] unless this value is changed in your .env file!
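
For anyone updating several .env files, the rename can be sketched with the standard library. The helper name `migrate_aws_conn_uri` is hypothetical and not part of this repository; the sketch assumes the extras are plain query parameters, as in the example above.

```python
from urllib.parse import parse_qsl, urlencode


def migrate_aws_conn_uri(uri):
    """Rename the `host` extra to `endpoint_url` in a connection URI."""
    base, sep, query = uri.partition("?")
    if not sep:
        return uri  # no extras to migrate
    pairs = [
        ("endpoint_url" if key == "host" else key, value)
        for key, value in parse_qsl(query, keep_blank_values=True)
    ]
    # Keep ":" and "/" unescaped so the result matches the form shown above.
    return base + "?" + urlencode(pairs, safe=":/")


before = "aws://test_key:test_secret@?region_name=us-east-1&host=http://s3:5000"
print(migrate_aws_conn_uri(before))
```

Because the query string is decoded and re-encoded, values containing other special characters may come out escaped differently, so check the result before saving it.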

Testing Instructions

  1. just build
  2. just test --extended (and ensure there are no warnings in the warnings summary)
  3. Run a few DAGs to make sure everything functions as expected
  4. just airflow-version should report 2.4.0

Checklist

  • My pull request has a descriptive title (not a vague title like Update index.md).
  • My pull request targets the default branch of the repository (main) or a parent feature branch.
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added or updated tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no visible errors.

Developer Certificate of Origin

Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@AetherUnbound AetherUnbound requested a review from a team as a code owner September 23, 2022 22:57
@openverse-bot openverse-bot added 💻 aspect: code Concerns the software code in the repository 🟨 priority: medium Not blocking but should be addressed soon 🧰 goal: internal improvement Improvement that benefits maintainers, not users labels Sep 23, 2022
@AetherUnbound AetherUnbound added 🟥 priority: critical Must be addressed ASAP and removed 🟨 priority: medium Not blocking but should be addressed soon labels Sep 23, 2022
Collaborator

@rwidom rwidom left a comment

I followed the testing instructions and ran phylopic, and all went well.

@@ -92,7 +92,7 @@
 _DATE_RANGE_INNER_TEMPLATE = "macros.ds_add(ds, -{} )"
 DATE_RANGE_ARG_TEMPLATE = "{{{{" + _DATE_RANGE_INNER_TEMPLATE + "}}}}"
 DATE_PARTITION_ARG_TEMPLATE = Template(
-    "$media_type/$provider_name/{{ date_partition_for_prefix(dag.schedule, dag_run.logical_date, $reingestion_date ) }}"  # noqa
+    "$media_type/$provider_name/{{ date_partition_for_prefix(dag.schedule_interval, dag_run.logical_date, $reingestion_date ) }}"  # noqa
Collaborator

This is so odd! I thought maybe dag.params['schedule'] might work, but no:

ERROR - Failed to execute task: 'airflow.models.param.ParamsDict object' has no attribute 'schedule'.

Contributor Author

It definitely made me do a double take at first 😞

Member

@krysal krysal left a comment

Built and ran the legacy smk script and everything looks good.

Contributor

@stacimc stacimc left a comment

This looks good to me. I tested with Cleveland, Jamendo, and Wikimedia Reingestion. I tried to test the audio data refresh but am unable to get it working due to environment problems; if I get those sorted I'll test that as well, but I see no reason it shouldn't work 🙂

The additions to ignore warnings are very nice (and individually well documented)! I got no warnings, after I updated my local .env. Shame about dag.schedule_interval.

Edit: forgot to add that the only other instances of schedule_interval I found were in several DAGs in the retired dir. They're retired so we certainly don't need to change them. What do you think @AetherUnbound?

# Print the current Airflow version
@airflow-version:
    echo $PROJECT_AIRFLOW_VERSION

Contributor

Nice 💯

@AetherUnbound
Contributor Author

Ah, great thought on the retired DAGs. I'll revert those changes; we definitely don't need to change them. I used too aggressive a find-and-replace 😅

@AetherUnbound AetherUnbound merged commit b4ef93c into main Sep 26, 2022
@AetherUnbound AetherUnbound deleted the hotfix/airflow-2.4 branch September 26, 2022 20:24
Labels
💻 aspect: code Concerns the software code in the repository 🧰 goal: internal improvement Improvement that benefits maintainers, not users 🟥 priority: critical Must be addressed ASAP
Development

Successfully merging this pull request may close these issues.

Upgrade to Airflow 2.4.0
6 participants