This repository has been archived by the owner on Aug 4, 2023. It is now read-only.

Finnish DAG: Dynamically generate timeslices depending on the amount of records #934

Merged: 9 commits into main from fix/finnish-date-intervals on Jan 31, 2023

Conversation


@stacimc stacimc commented Dec 23, 2022

Fixes

Fixes WordPress/openverse#1325

WordPress/openverse#1325 is low priority because it only tracks optimizing the DAG for days with small amounts of data. Since the DAG is now also raising errors for days with very large amounts of data (see the 12-16-22 run), I'm increasing this PR's priority to high.

Description

In #879 I split the ingestion of data for the Finnish DAG into 48 half-hour increments over the ingestion day. This attempts to avoid some buggy behavior we're seeing with the Finnish API where, if the query is sufficiently large, it will keep returning pages of duplicate data indefinitely and never halt ingestion.

This has a few problems:

  • Most days, the DAG returns 0 records but we still make 48 requests per building.
  • Even on days with large amounts of data, the data is rarely distributed evenly across the day. In my tests I saw that the vast majority of the data was usually confined to one or two hours. So, much of the day is split into unnecessarily small intervals.
  • On days with very large amounts of data, it looks like my half-hour intervals were not small enough, and we still see the bug. (See Dec 30 2022 run in production).

This PR:

  • handles all of the above problems by taking the record count into account when generating timeslices (described in the section below)
  • since it's possible my new slices are still not small enough, also adds detection for the bug (when the fetched count exceeds the reported result count) and fails the task early
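The early-failure check can be sketched roughly as follows. This is an illustrative sketch, not the PR's actual code; the function and parameter names are assumptions:

```python
# Hedged sketch of the early-failure detection described above. The names
# `check_for_duplicate_pages`, `fetched_count`, and `expected_count` are
# illustrative assumptions, not the PR's actual identifiers.
def check_for_duplicate_pages(fetched_count: int, expected_count: int) -> None:
    """Fail the task early if the API appears to be looping on duplicate pages.

    A well-behaved endpoint can never hand us more records than the result
    count it reported, so exceeding that count signals the duplicate-data bug.
    """
    if fetched_count > expected_count:
        raise ValueError(
            f"Fetched {fetched_count} records, but the API reported only "
            f"{expected_count}. The duplicate-page bug has likely occurred; "
            "failing early instead of ingesting duplicates indefinitely."
        )
```

In the real DAG this check would run as records accumulate, so a looping query fails within one page of exceeding the reported count rather than running forever.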

How the timeslices are generated

The code itself is heavily documented, but to reiterate for convenience:

  • If there are no records for the day (true for the majority of runs), it returns immediately and does not ingest at all
  • If there are fewer than 10,000 records, it does not split the day up at all and runs ingestion in one large chunk
  • If there are more than 10,000 records for the day, it gets the recordCount for each hour and only adds timeslices for hours that contain data.
    • hours with < 10k records are processed in a single slice
    • hours with < 100k records are processed in 12 slices (5 min intervals)
    • hours with > 100k records are processed in 20 slices (3 min intervals)
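The rules above can be sketched as follows. The thresholds match the description, but the function names and the way slice boundaries are represented are assumptions for illustration, not the PR's actual implementation:

```python
from datetime import datetime, timedelta

# Illustrative sketch of the per-hour slicing rules described above.
def slices_for_hour(record_count: int) -> int:
    """Return how many timeslices an hour with `record_count` records gets."""
    if record_count == 0:
        return 0       # No data: skip the hour entirely
    if record_count < 10_000:
        return 1       # Small hour: a single slice
    if record_count < 100_000:
        return 12      # Medium hour: 5-minute slices
    return 20          # Large hour: 3-minute slices

def build_timeslices(hour_start: datetime, record_count: int):
    """Split one hour into (start, end) pairs according to its record count."""
    n = slices_for_hour(record_count)
    if n == 0:
        return []
    step = timedelta(hours=1) / n
    return [
        (hour_start + i * step, hour_start + (i + 1) * step) for i in range(n)
    ]
```

For example, an hour containing 101,000 records would be split into twenty 3-minute slices, while an hour with 10 records would be ingested in one slice.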

The result is one extra request on the majority of ingestion days (when there is little or no data), and twenty-five extra requests (to get the record counts) on days with very large amounts of data.

At the very most this could result in 20*24 = 480 time slices for a particular building, but this would only happen for an ingestion day with over 100k records every single hour. Based on my observations I expect to see far, far fewer.
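The worst case above is simple arithmetic (the constant names are illustrative):

```python
# Worst case from the slicing rules: every one of the 24 hours in the
# ingestion day exceeds 100k records, so each hour gets the maximum 20 slices.
MAX_SLICES_PER_HOUR = 20
HOURS_PER_DAY = 24
worst_case_slices_per_building = MAX_SLICES_PER_HOUR * HOURS_PER_DAY  # 480
```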

A quick comparison for requests/slices from the recent failed 12-30-22 run:

  • On main, we did 48 slices * 4 buildings = 192 total "slices"
  • On this branch, 1 building was processed in a single slice, 2 had no data and halted early, and 1 had >100k records in a single hour. That means all four buildings were processed in just 21 slices, with only 28 extra requests made to generate those slices (1 for each of the 3 buildings with little or no data, 25 for the large one).
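The extra-request tally can be sketched like this; the per-building record counts below are illustrative stand-ins consistent with the description (one building with a small amount of data, two with none, one very large), and the helper name is an assumption:

```python
# Illustrative tally of the extra count requests: one request for the day's
# total record count, plus 24 hourly recordCount requests only when the day
# exceeds the 10,000-record threshold and must be split by hour.
def extra_count_requests(day_record_count: int) -> int:
    requests = 1                 # one request for the whole day's count
    if day_record_count >= 10_000:
        requests += 24           # one recordCount request per hour of the day
    return requests

# Stand-in counts for the four buildings of the failed 12-30-22 run:
# two with no data, one small, one with a very large amount of data.
building_counts = [0, 0, 5_000, 150_000]
total_extra = sum(extra_count_requests(c) for c in building_counts)  # 28
```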

Potential Future Work

I'm going to be looking at whether some of this logic can be pulled out of the Finnish provider and into either a utility class or the base class itself, as multiple providers need similar logic. In particular I'm going to test this slice generation against Flickr 🤔

Testing Instructions

Run `just test`, and read through the new tests to make sure they make sense.

Also try running the Finnish DAG locally and make sure it works.

If you want to reproduce the bug, 12-30-2022 is an example of a DagRun that failed in production even with the 48-slice approach on main. Locally, I verified that it now passes on this branch by triggering a DagRun with the config:

{"date": "2022-12-30"}

It passed in 3 hrs 39 minutes.

Checklist

  • My pull request has a descriptive title (not a vague title like
    Update index.md).
  • My pull request targets the default branch of the repository (main) or
    a parent feature branch.
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added or updated tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no visible
    errors.

Developer Certificate of Origin

Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@stacimc stacimc added 🟧 priority: high Stalls work on the project or its dependents ✨ goal: improvement Improvement to an existing user-facing feature 💻 aspect: code Concerns the software code in the repository labels Dec 23, 2022
@stacimc stacimc self-assigned this Dec 23, 2022
@stacimc stacimc added 🟧 priority: high Stalls work on the project or its dependents 🛠 goal: fix Bug fix 💻 aspect: code Concerns the software code in the repository and removed 🟧 priority: high Stalls work on the project or its dependents ✨ goal: improvement Improvement to an existing user-facing feature 💻 aspect: code Concerns the software code in the repository labels Jan 6, 2023
@stacimc stacimc marked this pull request as ready for review January 6, 2023 22:18
@stacimc stacimc requested a review from a team as a code owner January 6, 2023 22:18

@AetherUnbound AetherUnbound left a comment


This is fantastic! Even though there are a lot of heuristics built in here, the solution is simpler than I thought. I'm really glad the API supplies us with a result count 😌 I was able to run the tests successfully locally! I have a few nits but nothing to block a merge.

Comment on lines +125 to +133
mock_count.side_effect = [
150_000, # Getting total count for the entire day
0, # Get count for first hour, count == 0
10, # Get count for second hour, count < 10_000
101_000, # Get count for third hour, count > 100_000
49_090, # Get count for fourth hour, 10_000 < count < 100_000
] + list(
repeat(0, 20)
) # Fill list with count == 0 for the remaining hours

Nice! This rocks 🚀

@openverse-bot

Based on the high urgency of this PR, the following reviewers are being gently reminded to review this PR:

@krysal
This reminder is being automatically generated due to the urgency configuration.

Excluding weekend days1, this PR was updated 2 day(s) ago. PRs labelled with high urgency are expected to be reviewed within 2 weekday(s)2.

@stacimc, if this PR is not ready for a review, please draft it to prevent reviewers from getting further unnecessary pings.

Footnotes

  1. Specifically, Saturday and Sunday.

  2. For the purpose of these reminders we treat Monday - Friday as weekdays. Please note that the workflow that generates these reminders runs at midnight UTC on Monday - Friday. This means that depending on your timezone, you may be pinged outside of the expected range.


@krysal krysal left a comment


Awesome solution. Thank you for including so many details in comments. Looks great to me!


@krysal krysal left a comment


Aside from #975, I also noticed the images are watermarked with what seems to be the name of the building they belong to. Maybe this is the reason for the `watermarked` column's existence? Interesting finding!

@stacimc stacimc merged commit 106e256 into main Jan 31, 2023
@stacimc stacimc deleted the fix/finnish-date-intervals branch January 31, 2023 18:37
Labels
💻 aspect: code Concerns the software code in the repository 🛠 goal: fix Bug fix 🟧 priority: high Stalls work on the project or its dependents
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Conditionally break down Finnish ingestion into increments
4 participants