-
Notifications
You must be signed in to change notification settings - Fork 54
Halt ingestion when WordPress Photo Directory reaches last page #916
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This LGTM and tests fine locally. Thank you for the good instructions.
I was wondering if you had considered any alternative solutions to this, perhaps ones that would add the ability to get the headers to the base class? I had one idea for how to do it so that it would always be available as part of response_json
but didn't want to get into it if you thought it was a YAGNI type situation.
Anyway, LGTM either way and the solution makes sense for this DAG.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! I was thinking about the general solution for the response headers as well but that seemed to require much refactorization of other DAGs for just using it here. I like this.
Good code coverage in testing too! It's appreciated.
openverse_catalog/dags/providers/provider_api_scripts/wordpress.py
Outdated
Show resolved
Hide resolved
I have at least one idea that wouldn't require any refactors of other DAGs, but would be a little obscure: subclass |
We definitely can, but I wasn't aware of any other cases where we needed it (although I could be wrong). I'd say I have a slight preference to refactoring more fully if/when another use case arises... although the dict subclassing idea is pretty slick @sarayourfriend 🤩 |
I like this as well 🚀 |
Fixes
Fixes WordPress/openverse#1372 by @AetherUnbound
Description
The WordPress Photo Directory DAG was failing to halt ingestion gracefully when it reached the end of its data. Instead, it raises a 400 when attempting to access a page number beyond the total page count. This is especially bad if
skip_ingestion_errors
is enabled, because it will retry infinitely.To fix this we need to make a HEAD request to get the total page count. This information is only available in the headers and not
response_json
, so we are not able to fetch it inget_batch_data
. This PR makes this additional request on the first batch, and then halts ingestion duringget_should_continue
when we reach the final page.Testing Instructions
At the time this PR was written, this DAG detects 53 pages of results. To test it locally, I recommend editing the initial
current_page
variable in the__init__
method to52
, so the DAG will run quickly and hit the last page:After making the edit, run the DAG and verify that it stops gracefully when the last page has been reached. You should be able to see a log that looks like:
Checklist
Update index.md
).main
) or a parent feature branch.Developer Certificate of Origin
Developer Certificate of Origin