Adjust Flickr reingestion schedule #1285

Open
stacimc opened this issue Feb 18, 2023 · 2 comments
Assignees
Labels
✨ goal: improvement Improvement to an existing user-facing feature 🟨 priority: medium Not blocking but should be addressed soon 🧱 stack: catalog Related to the catalog and Airflow DAGs ⛔ status: blocked Blocked & therefore, not ready for work

Comments

@stacimc
Collaborator

stacimc commented Feb 18, 2023

Problem

Per the discussion in WordPress/openverse-catalog#995, we will soon be enabling the Flickr DAG, but we expect that the current implementation will skip some unique results during ingestion. To mitigate this, we want to run reingestion more often so that each day is reingested more frequently and we get greater coverage.

Description

At minimum, we should update the Flickr reingestion workflow to run on a @daily schedule (it is currently @weekly). Before making the change, we should wait until the current implementation has run successfully a few times and we have data showing that a run can complete in under 24 hours.

Depending on how long the reingestion DAG takes to run, we may also be able to increase the number of reingestion days.
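As a rough illustration of the check described above, here is a minimal sketch in plain Python. It is not code from the catalog: the function names, the `min_runs=3` reading of "a few times", and the assumption that run time scales linearly with the number of reingestion days are all hypothetical and would need to be validated against real run data.

```python
from datetime import timedelta

# The 24-hour window a @daily run has to fit inside.
DAILY_BUDGET = timedelta(hours=24)


def can_run_daily(run_durations, budget=DAILY_BUDGET, min_runs=3):
    """Return True only if we have at least `min_runs` observed runs
    (hypothetical threshold) and every one finished within the budget."""
    if len(run_durations) < min_runs:
        return False
    return all(duration < budget for duration in run_durations)


def scaled_day_count(current_days, run_durations, budget=DAILY_BUDGET):
    """Estimate how many reingestion days would fit in the budget.

    Assumes (hypothetically) that run time grows linearly with the number
    of reingestion days, and sizes against the slowest observed run.
    """
    if not run_durations or current_days <= 0:
        return current_days
    slowest = max(run_durations)
    if slowest <= timedelta(0):
        return current_days
    # timedelta / timedelta yields a plain float ratio.
    return int(current_days * (budget / slowest))
```

For example, `can_run_daily` stays `False` until enough successful runs have been recorded, which matches the intent of waiting for data before changing the schedule.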

Alternatives

Alongside this effort, we also plan to explore other ways of modifying the Flickr DAG to avoid the problems with duplicates and missing records. This work is meant to improve the DAG in the meantime.

@stacimc stacimc added 🚦 status: awaiting triage Has not been triaged & therefore, not ready for work ✨ goal: improvement Improvement to an existing user-facing feature 🟨 priority: medium Not blocking but should be addressed soon and removed 🚦 status: awaiting triage Has not been triaged & therefore, not ready for work labels Feb 18, 2023
@obulat obulat added the 🧱 stack: catalog Related to the catalog and Airflow DAGs label Feb 23, 2023
@stacimc stacimc self-assigned this Mar 28, 2023
@stacimc
Collaborator Author

stacimc commented Mar 30, 2023

Reingestion has been enabled in production. It is on a @weekly schedule at the moment. I'll adjust parameters after collecting some data about how this initial run goes. At minimum I think the schedule will be updated and I suspect max_active_tasks can be updated, once I get a sense of how close we are to the rate limit.
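For anyone following along, one way to turn "how close we are to the rate limit" into a number is sketched below. This is only an illustration: the function names, the 0.8 safety margin, and the assumption that request rate scales linearly with max_active_tasks are hypothetical, and the actual rate limit has to be supplied by the caller.

```python
def requests_per_hour(request_count, elapsed_seconds):
    """Convert an observed request count over a time window to an hourly rate."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return request_count * 3600 / elapsed_seconds


def scaled_max_active_tasks(current_tasks, observed_rph, limit_rph, safety=0.8):
    """Estimate the largest task count that keeps the projected request rate
    under `safety * limit_rph`.

    Assumes (hypothetically) that the request rate is proportional to the
    number of concurrently active tasks.
    """
    if current_tasks <= 0 or observed_rph <= 0:
        return current_tasks
    per_task_rph = observed_rph / current_tasks
    # Never suggest fewer than one task.
    return max(1, int((limit_rph * safety) / per_task_rph))
```

If the estimate comes out lower than the current setting, that would suggest lowering max_active_tasks rather than raising it.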

@stacimc stacimc added the ⛔ status: blocked Blocked & therefore, not ready for work label Apr 3, 2023
@stacimc
Collaborator Author

stacimc commented Apr 3, 2023

This is briefly blocked (and reingestion is disabled) while we ensure that reingestion does not cause rate limiting issues.

@github-project-automation github-project-automation bot moved this to 📋 Backlog in Openverse Backlog Apr 17, 2023
@obulat obulat transferred this issue from WordPress/openverse-catalog Apr 17, 2023
Projects
Status: 📋 Backlog
Development

No branches or pull requests

2 participants