Finnish DAG: Dynamically generate timeslices depending on the amount of records #934
This is fantastic! Even though there are a lot of heuristics built in here, the solution is simpler than I thought. I'm really glad the API supplies us with a result count 😌 I was able to run the tests successfully locally! I have a few nits but nothing to block a merge.
mock_count.side_effect = [
    150_000,  # Getting total count for the entire day
    0,  # Get count for first hour, count == 0
    10,  # Get count for second hour, count < 10_000
    101_000,  # Get count for third hour, count > 100_000
    49_090,  # Get count for fourth hour, 10_000 < count < 100_000
] + list(
    repeat(0, 20)
)  # Fill list with count == 0 for the remaining hours
Nice! This rocks 🚀
Based on the high urgency of this PR, the following reviewers are being gently reminded to review this PR: @krysal. Excluding weekend days, this PR was updated 2 day(s) ago. PRs labelled with high urgency are expected to be reviewed within 2 weekday(s). @stacimc, if this PR is not ready for a review, please draft it to prevent reviewers from getting further unnecessary pings.
Awesome solution. Thank you for including so many details in comments. Looks great to me!
Aside from #975, I also noticed the images are watermarked with what seems to be the name of the building they belong to. Maybe this is the reason for the watermarked column's existence? Interesting finding!
Fixes
Fixes WordPress/openverse#1325
WordPress/openverse#1325 is low priority because it just tracks optimizing the DAG for days with small amounts of data. Since the DAG is now also raising errors for days with very large amounts (see 12-16-22), I'm increasing the PR to high priority.
Description
In #879 I split the ingestion of data for the Finnish DAG into 48 half-hour increments over the ingestion day. This attempts to avoid some buggy behavior we're seeing with the Finnish API where, if the query is sufficiently large, it will keep returning pages of duplicate data indefinitely and never halt ingestion.
This has a few problems:
This PR:
How the timeslices are generated
The code itself is heavily documented, but to reiterate for convenience:
The result is one extra request for the majority of ingestion days, when there is only a small amount of data, and twenty-five extra requests (to get the record counts) for days with very large amounts of data.
At the very most this could result in 20*24 = 480 time slices for a particular building, but this would only happen for an ingestion day with over 100k records every single hour. Based on my observations I expect to see far, far fewer.
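To make the request and slice arithmetic above concrete, here is a minimal sketch of the slicing heuristic. It is not the code from this PR: the function names, the `get_record_count` callback, and the constant names are hypothetical. The 10,000 and 100,000 thresholds come from the test comments, and the 20 divisions per hour come from the 20*24 = 480 worst case; the 12 divisions for the middle tier and the empty result for a day with no records are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone
from typing import Callable

# Thresholds taken from the test comments; the names are hypothetical.
MAX_RECORDS = 10_000
DIVISION_THRESHOLD = 100_000
# 20 matches the 20*24 = 480 worst case; 12 is an assumed value.
MAX_DIVISIONS = 20
MIN_DIVISIONS = 12

Slice = tuple[datetime, datetime]


def split_interval(start: datetime, end: datetime, divisions: int) -> list[Slice]:
    """Split [start, end) into `divisions` equal, contiguous slices."""
    step = (end - start) / divisions
    return [(start + i * step, start + (i + 1) * step) for i in range(divisions)]


def generate_timeslices(
    day_start: datetime,
    get_record_count: Callable[[datetime, datetime], int],
) -> list[Slice]:
    """Build time slices for one ingestion day based on record counts."""
    day_end = day_start + timedelta(days=1)

    # One extra request: the record count for the whole day.
    total = get_record_count(day_start, day_end)
    if total == 0:
        return []  # assumption: nothing to ingest
    if total < MAX_RECORDS:
        return [(day_start, day_end)]  # small day, a single slice suffices

    # Large day: 24 more requests, one count per hour.
    slices: list[Slice] = []
    for hour_start, hour_end in split_interval(day_start, day_end, 24):
        count = get_record_count(hour_start, hour_end)
        if count == 0:
            continue  # skip empty hours entirely
        if count < MAX_RECORDS:
            slices.append((hour_start, hour_end))
        elif count < DIVISION_THRESHOLD:
            slices.extend(split_interval(hour_start, hour_end, MIN_DIVISIONS))
        else:
            slices.extend(split_interval(hour_start, hour_end, MAX_DIVISIONS))
    return slices
```

Under these assumed constants, feeding the sketch the same counts as the mocked test (150,000 for the day, then 0, 10, 101,000, 49,090, and zeros for the remaining hours) makes 25 count requests and yields 1 + 20 + 12 = 33 slices.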
A quick comparison for requests/slices from the recent failed 12-30-22 run:
Potential Future Work
I'm going to be looking at whether some of this logic can be pulled out of Finnish and into either a utility class or the base class itself, as we have similar logic needed in multiple providers. In particular I'm going to test this slice generation against Flickr 🤔
Testing Instructions
Run just test, and read through the new tests to make sure they make sense. Also try running the Finnish DAG locally and make sure it works.
If you want to reproduce the bug, 12-30-2022 is an example of a DagRun that failed in production even with the 48-slice approach on main. Locally, I verified that it now passes on this branch by triggering a DagRun with the config:
It passed in 3 hrs 39 minutes.
Developer Certificate of Origin