Refactor Flickr to use ProviderDataIngester #809
Conversation
@stacimc the frontend does infer the image width, height, and even filetype from the image file itself, after loading the image on the frontend. It tries to read the values from the API and then falls back to the values read in 'real time' from the image. Here's the relevant code:
Force-pushed from b4b66de to d8ee3ca
I tested the full reingestion workflow.
WordPress/openverse#1789 tracks a weird bug with the Flickr API where the API appears to be giving us a huge number of duplicates -- and strangely, this is exacerbated by requesting data for a larger period of time. (Meaning the problem is worse if you request photos for an entire day versus splitting the data into hour chunks and getting the photos for each hour.) Currently (not an addition in this PR), the Flickr script splits the ingestion day into 48 half-hour chunks. Unfortunately, I still saw this buggy behavior locally when I ran the provider script for about 25 minutes.
I'm going to try increasing the number of chunks. It's currently unclear to me whether we can turn Flickr on until this has been resolved 👀
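For context, here is a minimal sketch of that chunking, assuming the day is simply divided into equal windows that are then sent to the Flickr search endpoint as min_upload_date / max_upload_date; the helper name and structure are illustrative, not taken from flickr.py:

```python
from datetime import datetime, timedelta


def build_time_slices(date_str: str, n_divisions: int = 48):
    """Split an ingestion day into `n_divisions` equal (start, end) windows.

    Illustrative only: the real script's helper and parameters may differ.
    """
    start = datetime.strptime(date_str, "%Y-%m-%d")
    step = timedelta(days=1) / n_divisions
    return [(start + i * step, start + (i + 1) * step) for i in range(n_divisions)]


# 48 divisions -> half-hour windows; raising n_divisions shrinks each window.
slices = build_time_slices("2022-06-01")
assert slices[0][1] - slices[0][0] == timedelta(minutes=30)
```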
Force-pushed from 7afdfd7 to c66bc97
I was trying to run it by running the script separately, without Airflow (python .../flickr.py), and I think there's something wrong with how the arguments are passed to the base class __init__:
We call main with a single argument, date, as a string. It becomes part of the args and is passed as such to the base class. However, the base class expects the first parameter to be a conf dictionary:
openverse-catalog/openverse_catalog/dags/providers/provider_api_scripts/provider_data_ingester.py
Line 84 in 065b22b
def __init__(self, conf: dict = None, date: str = None):
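To illustrate the mismatch (a sketch only, assuming FlickrDataIngester forwards the positional argument straight to this base __init__):

```python
# With the signature above, a date passed positionally binds to `conf`:
FlickrDataIngester("2022-06-01")       # conf == "2022-06-01", date is None

# Passing it by keyword reaches the intended parameter:
FlickrDataIngester(date="2022-06-01")  # conf is None, date == "2022-06-01"
```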
I've seen the same thing in some other scripts as well (Europeana refactoring and others).
def main(date):
    ingester = FlickrDataIngester()
    ingester.ingest_records(date=date)
Suggested change — from:

def main(date):
    ingester = FlickrDataIngester()
    ingester.ingest_records(date=date)

to:

def main(date):
    ingester = FlickrDataIngester(date=date)
    ingester.ingest_records()
I think I've figured this out. We need to pass the date to the ingester itself, not to its ingest_records method, right? At least, this way it works for me locally when I run the script using python flickr.py.
Oh wow, good catch! And makes me realize I haven't actually been testing running the scripts via the CLI. Thank you for being more thorough!
This is looking good! Though it's disappointing we won't be able to turn it on until we get the duplicates issue resolved 😞
""" | ||
Returns the key for the largest image size available. | ||
""" | ||
for size in ["l", "m", "s"]: |
We're also including url_t in the query params; should t (I'm assuming this means "tiny") be another option here as well?
I'm not positive why we wouldn't, but we did not accept "t" in the original implementation, and we also had a test asserting that we get no image_url when only url_t is provided (using this resource: image_data_no_image_url.json), so I'm at least cautious about changing it. Maybe in a separate issue?
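To make the discussed behavior concrete, here is a hedged sketch of the selection logic against a Flickr photo record (the function name is illustrative; the url_<size> keys mirror the extras requested in the query params):

```python
def get_largest_image_size(photo: dict):
    """Return the key of the largest size present on the record, or None.

    "t" (tiny) is intentionally not considered, matching the original
    implementation and the image_data_no_image_url.json fixture.
    """
    for size in ["l", "m", "s"]:
        if f"url_{size}" in photo:
            return size
    return None


# A record exposing only url_t therefore yields no image_url downstream:
assert get_largest_image_size({"url_t": "https://live.staticflickr.com/..._t.jpg"}) is None
assert get_largest_image_size({"url_m": "https://live.staticflickr.com/..._m.jpg"}) == "m"
```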
That works! And good find 🧐
        start_date=datetime(2004, 2, 1),
        schedule_string="@daily",
        dated=True,
    ),
    ProviderWorkflow(
Oop, thanks for alphabetizing these 😄
Force-pushed from 065b22b to 8c26510
Looks great! I tested this locally and was able to get it to ingest a few days' worth of data.
Interestingly, we have thumbnails for some Flickr records in production, and it also looks like we could collect thumbnail URLs going forward (I replaced the _b suffix in a few URLs with _m and got a much smaller image; I'm assuming those are "big" and "medium" respectively, especially because _s gives a very small image!). That said, we haven't decided yet on how to collect those, so I don't think it's worth trying to add that as part of this PR 🙂
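A minimal sketch of that suffix swap (the helper and the example URL are illustrative, and the size-letter meanings are the guess above, not confirmed Flickr documentation):

```python
def to_thumbnail_url(image_url: str) -> str:
    """Swap Flickr's size suffix to get a smaller rendition, e.g. "_b" -> "_m"."""
    return image_url.replace("_b.", "_m.")


print(to_thumbnail_url("https://live.staticflickr.com/65535/12345678901_abcdef1234_b.jpg"))
# -> https://live.staticflickr.com/65535/12345678901_abcdef1234_m.jpg
```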
I tested the refactored version locally, and it ingests data well. The requests for image dimensions were taking so much time! The dimensions are very important for the frontend, but I think we can handle them in a separate process, together with other similar providers.
Force-pushed from 8c26510 to a66f520
Rebased to fix merge conflicts
Fixes
Fixes WordPress/openverse#1522 by @stacimc
Description
Refactors Flickr to use the ProviderDataIngester base class! Also hooks up the Flickr reingestion workflow with the new ingester.
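Very roughly, the refactor moves Flickr into the base class's hook-based shape. The sketch below is illustrative only: the abstract-method names are recalled from provider_data_ingester.py and should be checked against that file, the import path and class attributes are assumptions, and the bodies are placeholders rather than the actual Flickr implementation:

```python
from providers.provider_api_scripts.provider_data_ingester import ProviderDataIngester


class FlickrDataIngester(ProviderDataIngester):
    # `providers` maps media type -> provider name; the literal "flickr" is a
    # simplification of the provider-details constant used in the repo.
    providers = {"image": "flickr"}
    endpoint = "https://api.flickr.com/services/rest"
    batch_limit = 500

    def get_next_query_params(self, prev_query_params, **kwargs):
        # Build the params for the next request (api key, extras, time slice, page).
        ...

    def get_batch_data(self, response_json):
        # Pull the list of photo records out of one API response.
        ...

    def get_record_data(self, data):
        # Map a single photo record to the fields the catalog stores.
        ...

    def get_media_type(self, record):
        return "image"
```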
There is one significant change made to the provider script in this refactor, which is the temporary removal of filesize information. Previously, we were making an additional request for each record to obtain this info. When I tested with this in place, it took ~2 minutes to get 100 records, versus ~4 seconds for 100 records without it -- roughly 1.2 seconds per record instead of 0.04, which would add well over a day to a run of even 100k records. This is not tenable for a provider that ingests so much data, especially for the reingestion workflow. I've opened WordPress/openverse#1388 to track finding a better way to do this.
The Flickr provider script also has a few pre-existing issues, which are not resolved in this simple refactor. This is tracked in WordPress/openverse#1789. Briefly: we're getting enormous numbers (100ks) of duplicate records from Flickr and possibly missing some unique ones, due to strange behavior from the Flickr API. I did some investigation but ultimately tabled it to work on as part of the Stability milestone.
Unfortunately this means that I cannot recommend turning on either Flickr or its reingestion workflow until after at least that issue is addressed.
Testing Instructions
just test
Try running the workflow and reingestion workflow locally. I strongly recommend setting an ingestion_limit Airflow variable!
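For example, the limit can be set programmatically inside the local Airflow environment (Variable.set is the standard Airflow API; the value 100 is an arbitrary small cap), or via the Airflow UI:

```python
from airflow.models import Variable

# Cap the number of records ingested per run while testing.
Variable.set("ingestion_limit", 100)
```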