This repository has been archived by the owner on Aug 4, 2023. It is now read-only.

Refactor Flickr to use ProviderDataIngester #809

Merged
merged 9 commits into from
Oct 25, 2022

Conversation

stacimc
Contributor

@stacimc stacimc commented Oct 18, 2022

Fixes

Fixes WordPress/openverse#1522 by @stacimc

Description

Refactors Flickr to use the ProviderDataIngester base class! Also hooks up the Flickr reingestion workflow with the new ingester.

There is one significant change to the provider script in this refactor: the temporary removal of filesize information. Previously, we made an additional request for each record to obtain it. When I tested with this in place, it took ~2 minutes to get 100 records, versus ~4 seconds for 100 records without it. That is not tenable for a provider that ingests so much data, especially for the reingestion workflow. I've opened WordPress/openverse#1388 to track finding a better way to do this.
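For a sense of scale, the two timings above can be extrapolated linearly. This is only a rough sketch: it assumes the per-100-record timings hold at larger volumes, and the one-million-record figure is a made-up illustration, not a measured Flickr volume.

```python
# Rough timings quoted in the description above.
SECONDS_WITH_FILESIZE = 120    # ~2 minutes per 100 records
SECONDS_WITHOUT_FILESIZE = 4   # ~4 seconds per 100 records
BATCH_SIZE = 100


def estimated_hours(record_count: int, seconds_per_batch: float) -> float:
    """Linearly extrapolate ingestion time from a 100-record sample."""
    return record_count / BATCH_SIZE * seconds_per_batch / 3600


# Hypothetical volume, chosen only for illustration.
records = 1_000_000
slow = estimated_hours(records, SECONDS_WITH_FILESIZE)
fast = estimated_hours(records, SECONDS_WITHOUT_FILESIZE)

print(f"with per-record filesize requests: {slow:.1f} h")
print(f"without them: {fast:.1f} h")
print(f"slowdown: {SECONDS_WITH_FILESIZE / SECONDS_WITHOUT_FILESIZE:.0f}x")
```

Under these assumptions the extra request per record makes ingestion about 30 times slower.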

The Flickr provider script also has a few pre-existing issues, which are not resolved in this simple refactor. These are tracked in WordPress/openverse#1789. Briefly: we're getting enormous numbers (hundreds of thousands) of duplicate records from Flickr and possibly missing some unique ones, due to strange behavior from the Flickr API. I did some investigation but ultimately tabled it to work on as part of the Stability milestone.

Unfortunately, this means that I cannot recommend turning on either Flickr or its reingestion workflow until at least that issue is addressed.

Testing Instructions

just test

Try running the workflow and reingestion workflow locally. I strongly recommend setting an ingestion_limit Airflow variable!

Checklist

  • My pull request has a descriptive title (not a vague title like Update index.md).
  • My pull request targets the default branch of the repository (main) or a parent feature branch.
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added or updated tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no visible errors.

Developer Certificate of Origin

Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@openverse-bot openverse-bot added ✨ goal: improvement Improvement to an existing user-facing feature 💻 aspect: code Concerns the software code in the repository 🟨 priority: medium Not blocking but should be addressed soon labels Oct 18, 2022
@zackkrida
Member

@stacimc the frontend does infer the image width, height, and even filetype from the image file itself, after loading the image on the frontend. It tries to read the values from the API and then falls back to the values read in "real time" from the image.

Here's the relevant code:

https://github.com/WordPress/openverse-frontend/blob/6a70b968fe3679db2898270cb0161574b255b0c6/src/pages/image/_id.vue#L186-L187

@stacimc stacimc force-pushed the refactor/flickr-to-use-provider-data-ingester branch from b4b66de to d8ee3ca Compare October 18, 2022 22:38
@stacimc
Contributor Author

stacimc commented Oct 18, 2022

I tested the full reingestion workflow with an ingestion_limit of 100 records set. It worked well and took 13 min 43 sec with that ingestion limit. However, I can't recommend turning on the reingestion workflow until we resolve WordPress/openverse#1789.

WordPress/openverse#1789 tracks a weird bug where the Flickr API appears to be giving us a huge number of duplicates -- and strangely, this is exacerbated by requesting data for a larger period of time. (That is, the problem is worse if you request photos for an entire day than if you split the day into hour-long chunks and get the photos for each hour.)

Currently (not an addition in this PR), the Flickr script splits the ingestion day into 48 half-hour chunks. Unfortunately I still saw this buggy behavior locally when I ran the provider script. I tried running it for about 25 minutes and observed:

  • The script made it through 15 of these half-hour chunks in that time (7.5 hours' worth of data)
  • The first 14 half-hour chunks processed very quickly, and the bulk of processing time was spent on the last chunk.
  • 649,342 records were ingested
  • 634,295 of these were duplicates. Only ~15k unique records were ingested
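The bookkeeping behind figures like these can be sketched as follows. This is a hypothetical helper, not code from the provider script, and it assumes each record carries an identifier that is repeated when the API returns the same photo again.

```python
from collections import Counter


def duplicate_stats(record_ids):
    """Count total, unique, and duplicate records in a batch of IDs."""
    counts = Counter(record_ids)
    total = sum(counts.values())
    unique = len(counts)
    return {"total": total, "unique": unique, "duplicates": total - unique}


# Tiny made-up sample: "a" appears three times and "b" twice.
stats = duplicate_stats(["a", "b", "a", "c", "a", "b"])
print(stats)

# Applying the same arithmetic to the figures reported above.
total, duplicates = 649_342, 634_295
unique = total - duplicates
print(f"{unique} unique records, {duplicates / total:.1%} duplicates")
```

With the reported numbers this works out to 15,047 unique records, which matches the "~15k" estimate.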

I'm going to try increasing the number of chunks. It's currently unclear to me whether we can turn Flickr on until this has been resolved 👀
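As a rough illustration of what changing the number of chunks means, here is a minimal sketch of splitting one day into equal intervals. The function name, the UTC assumption, and the use of Unix timestamps are all assumptions made for this example; the actual script may build its intervals differently.

```python
from datetime import datetime, timedelta, timezone

SECONDS_PER_DAY = 24 * 60 * 60


def day_chunks(date: str, chunks_per_day: int = 48):
    """Split the UTC day `date` (YYYY-MM-DD) into equal (start, end) timestamp pairs."""
    if SECONDS_PER_DAY % chunks_per_day != 0:
        raise ValueError("chunks_per_day must divide a day into whole seconds")
    day_start = datetime.strptime(date, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    step = timedelta(seconds=SECONDS_PER_DAY // chunks_per_day)
    return [
        (
            int((day_start + i * step).timestamp()),
            int((day_start + (i + 1) * step).timestamp()),
        )
        for i in range(chunks_per_day)
    ]


half_hours = day_chunks("2022-10-18")           # 48 chunks of 30 minutes
quarter_hours = day_chunks("2022-10-18", 96)    # 96 chunks of 15 minutes
print(len(half_hours), half_hours[0])
print(len(quarter_hours), quarter_hours[0])
```

Doubling the chunk count halves each interval, so each request window covers fewer photos.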

Contributor

@obulat obulat left a comment

I tried running the script on its own, without Airflow (python .../flickr.py), and I think something is wrong with how the arguments are passed to the base class __init__:
We call main with a single argument, date, as a string. It becomes part of args and is passed as such to the base class. However, the base class expects the first parameter to be a conf dictionary:

def __init__(self, conf: dict = None, date: str = None):

I've seen the same thing in some other scripts as well (Europeana refactoring and others).
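The pitfall can be reproduced in isolation. In the sketch below, both classes are hypothetical stand-ins; only the base __init__ signature is taken from the line quoted above, and the subclass forwarding *args is an assumption about how a provider ingester might be written.

```python
class BaseIngester:
    """Hypothetical stand-in; only the signature matches the one quoted above."""

    def __init__(self, conf: dict = None, date: str = None):
        self.conf = conf
        self.date = date


class ExampleIngester(BaseIngester):
    """Hypothetical subclass that forwards its arguments unchanged."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)


# Passed positionally, the date string lands in `conf`, not `date`.
positional = ExampleIngester("2022-10-18")
print(positional.conf, positional.date)

# Passed by keyword, it reaches the intended parameter.
keyword = ExampleIngester(date="2022-10-18")
print(keyword.conf, keyword.date)
```

Passing the date by keyword avoids depending on the parameter order of the base class.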

Comment on lines 288 to 290
def main(date):
    ingester = FlickrDataIngester()
    ingester.ingest_records(date=date)
Contributor

Suggested change
- def main(date):
-     ingester = FlickrDataIngester()
-     ingester.ingest_records(date=date)
+ def main(date):
+     ingester = FlickrDataIngester(date=date)
+     ingester.ingest_records()

Contributor

I think I've figured this out. We need to pass the date to the ingester itself, not its ingest_records method, right? At least, this way it works for me locally when I run the script using python flickr.py

Contributor Author

Oh wow, good catch! It also makes me realize I haven't actually been testing the scripts via the CLI. Thank you for being more thorough!

Contributor

@AetherUnbound AetherUnbound left a comment

This is looking good! Though it's disappointing we won't be able to turn it on until we get the duplicates issue resolved 😞

"""
Returns the key for the largest image size available.
"""
for size in ["l", "m", "s"]:
Contributor

We're also including url_t in the query params, should t (I'm assuming this means "tiny") be another option here as well?

Contributor Author

I'm not positive why we wouldn't, but we did not accept "t" in the original implementation and we also had a test that asserted that we get no image_url when only url_t is provided (using this resource: image_data_no_image_url.json), so I'm at least cautious about changing it. Maybe in a separate issue?
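The behavior described here can be sketched roughly as below. The function names and the url_<size> key pattern are assumptions for illustration, inferred from the url_t parameter and the size list in the snippet under review; this is not the code from the PR.

```python
def get_largest_image_size(image_data: dict):
    """Return the key for the largest image size available, or None."""
    # "t" is deliberately left out, mirroring the existing behavior.
    for size in ["l", "m", "s"]:
        if f"url_{size}" in image_data:
            return size
    return None


def get_image_url(image_data: dict):
    """Return the URL for the largest accepted size, or None if there is none."""
    size = get_largest_image_size(image_data)
    return image_data[f"url_{size}"] if size else None


# Made-up records with placeholder URLs.
only_tiny = {"url_t": "https://example.com/photo_t.jpg"}
medium_and_small = {
    "url_s": "https://example.com/photo_s.jpg",
    "url_m": "https://example.com/photo_m.jpg",
}

print(get_image_url(only_tiny))
print(get_image_url(medium_and_small))
```

A record that only offers url_t yields no image URL, which is the case the existing test asserts.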

Contributor

That works! And good find 🧐

start_date=datetime(2004, 2, 1),
schedule_string="@daily",
dated=True,
),
ProviderWorkflow(
Contributor

Oop, thanks for alphabetizing these 😄

@stacimc stacimc force-pushed the refactor/flickr-to-use-provider-data-ingester branch from 065b22b to 8c26510 Compare October 24, 2022 23:05
Contributor

@AetherUnbound AetherUnbound left a comment

Looks great! I tested this locally and was able to get it to ingest a few days' worth of data.

Interestingly, we have thumbnails for some Flickr records in production and it also looks like we could collect thumbnail URLs going forward (I replaced the _b suffix in a few URLs with _m and got a much smaller image, I'm assuming those are "big" and "medium" respectively, especially because _s gives a very small image!). That said, we haven't decided yet on how to collect those so I don't think it's worth trying to add that as part of this PR 🙂
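The suffix swap I tried by hand could be sketched like this. The helper name and the example URL are made up, and it assumes the size is always a single lowercase letter right before the file extension, which I have not verified for all Flickr URLs.

```python
import re


def with_size_suffix(url: str, size: str) -> str:
    """Replace a trailing _<letter> size suffix (before the extension) with _<size>."""
    return re.sub(r"_[a-z](\.\w+)$", rf"_{size}\1", url)


# Made-up URL in the same shape as the ones described above.
big = "https://live.staticflickr.com/0000/00000000000_0123456789_b.jpg"
medium = with_size_suffix(big, "m")
small = with_size_suffix(big, "s")
print(medium)
print(small)
```

URLs without such a suffix are returned unchanged, so the helper would not invent a thumbnail URL where the pattern does not apply.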

Contributor

@obulat obulat left a comment

I tested the refactored version locally, and it ingests data well. The requests for image dimensions were taking so much time! The dimensions are very important for the frontend, but I think we can handle them in a separate process, together with other similar providers.

@stacimc stacimc force-pushed the refactor/flickr-to-use-provider-data-ingester branch from 8c26510 to a66f520 Compare October 25, 2022 20:46
@stacimc
Contributor Author

stacimc commented Oct 25, 2022

Rebased to fix merge conflicts

@stacimc stacimc merged commit 5ae98b0 into main Oct 25, 2022
@stacimc stacimc deleted the refactor/flickr-to-use-provider-data-ingester branch October 25, 2022 23:24
Development

Successfully merging this pull request may close these issues.

Refactor Flickr to use ProviderDataIngester
5 participants