This repository has been archived by the owner on Aug 4, 2023. It is now read-only.

Finnish DAG: Dynamically generate timeslices depending on the amount of records #934

Merged: 9 commits into main from fix/finnish-date-intervals on Jan 31, 2023

Conversation


@stacimc stacimc commented Dec 23, 2022

Fixes

Fixes WordPress/openverse#1325

WordPress/openverse#1325 is low priority because it only tracks optimizing the DAG for days with small amounts of data. Since the DAG is now also raising errors for days with very large amounts of data (see the 12-16-22 run), I'm increasing this PR's priority to high.

Description

In #879 I split the ingestion of data for the Finnish DAG into 48 half-hour increments over the ingestion day. This attempts to avoid some buggy behavior we're seeing with the Finnish API where, if the query is sufficiently large, it will keep returning pages of duplicate data indefinitely and never halt ingestion.

This has a few problems:

  • Most days, the DAG returns 0 records but we still make 48 requests per building.
  • Even on days with large amounts of data, the data is rarely distributed evenly across the day. In my tests I saw that the vast majority of the data was usually confined to one or two hours. So, much of the day is split into unnecessarily small intervals.
  • On days with very large amounts of data, it looks like my half-hour intervals were not small enough, and we still see the bug. (See Dec 30 2022 run in production).

This PR:

  • handles all of the above problems by taking the record count into account when generating timeslices (described in the section below)
  • since it's possible my new slices are still not small enough, also adds detection for the bug (when the fetched count exceeds the reported result count) and fails the task early
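The early-failure check can be sketched roughly as follows. This is an illustrative sketch, not the PR's actual code; the function and parameter names are assumptions:

```python
# Hedged sketch of the early-failure detection described above. The names
# `check_for_duplicate_pages`, `fetched_count`, and `expected_count` are
# illustrative assumptions, not the PR's actual identifiers.
def check_for_duplicate_pages(fetched_count: int, expected_count: int) -> None:
    """Fail the task early if the API appears to be looping on duplicate pages.

    A well-behaved endpoint can never hand us more records than the result
    count it reported, so exceeding that count signals the duplicate-data bug.
    """
    if fetched_count > expected_count:
        raise ValueError(
            f"Fetched {fetched_count} records, but the API reported only "
            f"{expected_count}. The duplicate-page bug has likely occurred; "
            "failing early instead of ingesting duplicates indefinitely."
        )
```

In the real DAG this check would run as records accumulate, so a looping query fails within one page of exceeding the reported count rather than running forever.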

How the timeslices are generated

The code itself is heavily documented, but to reiterate for convenience:

  • If there are no records for the day (true for the majority of runs), it returns immediately and does not ingest at all
  • If there are fewer than 10,000 records, it does not split the day up at all and runs ingestion in one large chunk
  • If there are more than 10,000 records for the day, it gets the recordCount for each hour and only adds timeslices for hours that contain data.
    • hours with < 10k records are processed in a single slice
    • hours with < 100k records are processed in 12 slices (5 min intervals)
    • hours with > 100k records are processed in 20 slices (3 min intervals)
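The rules above can be sketched as follows. The thresholds match the description, but the function names and the way slice boundaries are represented are assumptions for illustration, not the PR's actual implementation:

```python
from datetime import datetime, timedelta

# Illustrative sketch of the per-hour slicing rules described above.
def slices_for_hour(record_count: int) -> int:
    """Return how many timeslices an hour with `record_count` records gets."""
    if record_count == 0:
        return 0       # No data: skip the hour entirely
    if record_count < 10_000:
        return 1       # Small hour: a single slice
    if record_count < 100_000:
        return 12      # Medium hour: 5-minute slices
    return 20          # Large hour: 3-minute slices

def build_timeslices(hour_start: datetime, record_count: int):
    """Split one hour into (start, end) pairs according to its record count."""
    n = slices_for_hour(record_count)
    if n == 0:
        return []
    step = timedelta(hours=1) / n
    return [
        (hour_start + i * step, hour_start + (i + 1) * step) for i in range(n)
    ]
```

For example, an hour containing 101,000 records would be split into twenty 3-minute slices, while an hour with 10 records would be ingested in one slice.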

The result is one extra request on the majority of ingestion days (when there is little or no data), and twenty-five extra requests (to get the record counts) on days with very large amounts of data.

At the very most this could result in 20*24 = 480 time slices for a particular building, but this would only happen for an ingestion day with over 100k records every single hour. Based on my observations I expect to see far, far fewer.
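The worst case above is simple arithmetic (the constant names are illustrative):

```python
# Worst case from the slicing rules: every one of the 24 hours in the
# ingestion day exceeds 100k records, so each hour gets the maximum 20 slices.
MAX_SLICES_PER_HOUR = 20
HOURS_PER_DAY = 24
worst_case_slices_per_building = MAX_SLICES_PER_HOUR * HOURS_PER_DAY  # 480
```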

A quick comparison for requests/slices from the recent failed 12-30-22 run:

  • On main, we did 48 slices * 4 buildings = 192 total "slices"
  • On this branch, 1 building was processed in a single slice, 2 had no data and halted early, and 1 had >100k records in a single hour. That means all four buildings were processed in just 21 slices, with only 28 extra requests made to generate those slices (1 for each of the 3 buildings with little or no data, 25 for the large one).
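The extra-request tally can be sketched like this; the per-building record counts below are illustrative stand-ins consistent with the description (one building with a small amount of data, two with none, one very large), and the helper name is an assumption:

```python
# Illustrative tally of the extra count requests: one request for the day's
# total record count, plus 24 hourly recordCount requests only when the day
# exceeds the 10,000-record threshold and must be split by hour.
def extra_count_requests(day_record_count: int) -> int:
    requests = 1                 # one request for the whole day's count
    if day_record_count >= 10_000:
        requests += 24           # one recordCount request per hour of the day
    return requests

# Stand-in counts for the four buildings of the failed 12-30-22 run:
# two with no data, one small, one with a very large amount of data.
building_counts = [0, 0, 5_000, 150_000]
total_extra = sum(extra_count_requests(c) for c in building_counts)  # 28
```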

Potential Future Work

I'm going to be looking at whether some of this logic can be pulled out of the Finnish provider and into either a utility class or the base class itself, as multiple providers need similar logic. In particular I'm going to test this slice generation against Flickr 🤔

Testing Instructions

Run `just test`, and read through the new tests to make sure they make sense.

Also try running the Finnish DAG locally and make sure it works.

If you want to reproduce the bug, 12-30-2022 is an example of a DagRun that failed in production even with the 48-slice approach on main. Locally, I verified that it now passes on this branch by triggering a DagRun with the config:

{"date": "2022-12-30"}

It passed in 3 hrs 39 minutes.

Checklist

  • My pull request has a descriptive title (not a vague title like
    Update index.md).
  • My pull request targets the default branch of the repository (main) or
    a parent feature branch.
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added or updated tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no visible
    errors.

Developer Certificate of Origin

Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@stacimc stacimc added 🟧 priority: high Stalls work on the project or its dependents ✨ goal: improvement Improvement to an existing user-facing feature 💻 aspect: code Concerns the software code in the repository labels Dec 23, 2022
@stacimc stacimc self-assigned this Dec 23, 2022
@stacimc stacimc added 🟧 priority: high Stalls work on the project or its dependents 🛠 goal: fix Bug fix 💻 aspect: code Concerns the software code in the repository and removed 🟧 priority: high Stalls work on the project or its dependents ✨ goal: improvement Improvement to an existing user-facing feature 💻 aspect: code Concerns the software code in the repository labels Jan 6, 2023
@stacimc stacimc marked this pull request as ready for review January 6, 2023 22:18
@stacimc stacimc requested a review from a team as a code owner January 6, 2023 22:18

@AetherUnbound AetherUnbound left a comment


This is fantastic! Even though there are a lot of heuristics built in here, the solution is simpler than I thought. I'm really glad the API supplies us with a result count 😌 I was able to run the tests successfully locally! I have a few nits but nothing to block a merge.

Comment on lines +125 to +133
mock_count.side_effect = [
150_000, # Getting total count for the entire day
0, # Get count for first hour, count == 0
10, # Get count for second hour, count < 10_000
101_000, # Get count for third hour, count > 100_000
49_090, # Get count for fourth hour, 10_000 < count < 100_000
] + list(
repeat(0, 20)
) # Fill list with count == 0 for the remaining hours

Nice! This rocks 🚀

@openverse-bot

Based on the high urgency of this PR, the following reviewers are being gently reminded to review this PR:

@krysal
This reminder is being automatically generated due to the urgency configuration.

Excluding weekend days1, this PR was updated 2 day(s) ago. PRs labelled with high urgency are expected to be reviewed within 2 weekday(s)2.

@stacimc, if this PR is not ready for a review, please draft it to prevent reviewers from getting further unnecessary pings.

Footnotes

  1. Specifically, Saturday and Sunday.

  2. For the purpose of these reminders we treat Monday - Friday as weekdays. Please note that the workflow that generates these reminders runs at midnight UTC on Monday - Friday. This means that depending on your timezone, you may be pinged outside of the expected range.


@krysal krysal left a comment


Awesome solution. Thank you for including so many details in comments. Looks great to me!


@krysal krysal left a comment


Aside from #975, I also noticed the images are watermarked with what seems to be the name of the building they belong to. Maybe this is the reason for the `watermarked` column's existence? Interesting finding!

@stacimc stacimc merged commit 106e256 into main Jan 31, 2023
@stacimc stacimc deleted the fix/finnish-date-intervals branch January 31, 2023 18:37
Labels
💻 aspect: code Concerns the software code in the repository 🛠 goal: fix Bug fix 🟧 priority: high Stalls work on the project or its dependents
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Conditionally break down Finnish ingestion into increments
4 participants