This repository has been archived by the owner on Aug 4, 2023. It is now read-only.

Handle the upper case licenses in the add_license_dag #1049

Merged 4 commits on Mar 23, 2023

Contributor

@obulat obulat commented Mar 16, 2023

Description

After the last run of the add_license_url DAG, we still have more than 8000 records with NULL in meta_data. Most of them are due to the license name being upper case (CC0 instead of cc0).
This PR converts the license name to lower case before using it to create the URL. It also updates both the meta_data and the license.
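
As a rough illustration of the lowercasing step, here is a minimal sketch rather than the code from this PR. The names `KNOWN_PATHS`, `normalize_license`, and `license_url` are hypothetical, and the mapping is only a small illustrative subset of license pairs; the actual DAG may build the URL differently.

```python
from typing import Optional

CC_BASE = "https://creativecommons.org"

# Hypothetical, illustrative subset of (license, version) pairs and their URL paths.
KNOWN_PATHS = {
    ("by", "4.0"): "licenses/by/4.0/",
    ("by-sa", "4.0"): "licenses/by-sa/4.0/",
    ("cc0", "1.0"): "publicdomain/zero/1.0/",
    ("pdm", "1.0"): "publicdomain/mark/1.0/",
}


def normalize_license(name: str) -> str:
    """Lowercase the license name so that 'CC0' and 'cc0' are treated the same."""
    return name.strip().lower()


def license_url(name: str, version: str) -> Optional[str]:
    """Return the license URL for a known pair, or None if the pair is not recognized."""
    path = KNOWN_PATHS.get((normalize_license(name), version))
    return f"{CC_BASE}/{path}" if path else None
```

With this sketch, `license_url("CC0", "1.0")` and `license_url("cc0", "1.0")` resolve to the same URL, while an unrecognized pair returns `None` instead of leaving the lookup to fail on case alone.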

To make this DAG faster, I've removed the first step, which only repeated the first part of the second task: selecting the items where meta_data is NULL.

For invalid license pairs that cannot be fixed, this DAG also saves the data (identifier, license, license_version) to S3 for further investigation.
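
One way to picture the collection of unfixable pairs is the sketch below. It is an assumption-laden illustration, not the DAG's code: `VALID_PAIRS` and `invalid_pairs_report` are hypothetical names, the CSV layout is a guess, and the upload to S3 is deliberately left out.

```python
import csv
import io
from typing import Iterable, Tuple

# Hypothetical, illustrative subset of valid (license, version) pairs.
VALID_PAIRS = {("by", "4.0"), ("cc0", "1.0")}


def invalid_pairs_report(rows: Iterable[Tuple[str, str, str]]) -> str:
    """Build a CSV of (identifier, license, license_version) rows whose
    license pair is still invalid after lowercasing the license name."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["identifier", "license", "license_version"])
    for identifier, license_name, version in rows:
        if (license_name.lower(), version) not in VALID_PAIRS:
            writer.writerow([identifier, license_name, version])
    # The resulting string could then be written to S3 (upload not shown here).
    return buffer.getvalue()
```

Rows such as `("a1", "CC0", "1.0")` are fixable by lowercasing and are skipped, so only pairs that remain unrecognized end up in the report.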

@obulat obulat requested a review from a team as a code owner March 16, 2023 17:57
@obulat obulat requested review from krysal and stacimc March 16, 2023 17:57
@openverse-bot openverse-bot added the 🚦 status: awaiting triage (Has not been triaged & therefore, not ready for work) label Mar 16, 2023
@obulat obulat force-pushed the update/add_license_url_dag branch from 67fb983 to 9709355 Compare March 16, 2023 18:03
@obulat obulat added 🛠 goal: fix (Bug fix), 💻 aspect: code (Concerns the software code in the repository), 🧱 stack: catalog (Related to the catalog and Airflow DAGs), 🟨 priority: medium (Not blocking but should be addressed soon) and removed 🚦 status: awaiting triage (Has not been triaged & therefore, not ready for work) labels Mar 16, 2023
Contributor

@AetherUnbound AetherUnbound left a comment

LGTM! One suggestion but not a blocking one.

openverse_catalog/dags/maintenance/add_license_url.py (outdated, resolved)
Contributor

@stacimc stacimc left a comment

LGTM :)

@obulat obulat merged commit bfc3b02 into main Mar 23, 2023
@obulat obulat deleted the update/add_license_url_dag branch March 23, 2023 04:09
Labels
💻 aspect: code (Concerns the software code in the repository), 🛠 goal: fix (Bug fix), 🟨 priority: medium (Not blocking but should be addressed soon), 🧱 stack: catalog (Related to the catalog and Airflow DAGs)
4 participants