Refactor Phylopic to use ProviderDataIngester #747
Conversation
def get_media_type(self, record: dict) -> str:
    return constants.IMAGE

def get_response_json(
Would it make sense for `get_next_query_params` to still just return a configuration object that `get_response_json` then converts into a usable endpoint to call `super().get_response_json` with?
I think that might tighten up the implementation and make it seem a little more "normal" and easier to understand. At least I think it would from my perspective, coming into this out of pure curiosity 🙂
Another thing that crossed my mind was whether some flag other than empty query params (maybe a sentinel object?) could be used to indicate that the batch operation was complete. It would make the process more explicit for other scripts as well, and it would allow this one to avoid the awkward dict copying that requires a whole extra comment to explain.
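A minimal sketch of the sentinel idea, purely illustrative; the class and method below are stand-ins and only loosely mirror the real `ProviderDataIngester` API:

```python
# Illustrative sketch of using a sentinel to signal "no more batches".
INGESTION_COMPLETE = object()  # distinct from {} or None, so it can't be confused with real params


class FakeIngester:
    def __init__(self, total_batches: int):
        self.total_batches = total_batches
        self.current_batch = 0

    def get_next_query_params(self, prev_query_params):
        # Returning an explicit sentinel avoids overloading "empty dict" to
        # mean "we're done", so no defensive dict copying is needed.
        if self.current_batch >= self.total_batches:
            return INGESTION_COMPLETE
        params = {"page": self.current_batch}
        self.current_batch += 1
        return params

    def ingest(self):
        params = None
        while (params := self.get_next_query_params(params)) is not INGESTION_COMPLETE:
            print(f"fetching batch with {params}")


FakeIngester(total_batches=3).ingest()
```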
I've used a totally different method of constructing the endpoints, let me know if this new approach makes more sense!
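For anyone following along, here is a rough sketch of the "build the endpoint from a date window" idea being discussed; the URL pattern below is an assumption made for illustration, not necessarily the real PhyloPic API:

```python
from datetime import date, timedelta


# Hypothetical helper: the base URL and path format are assumptions for this
# sketch, not pulled from the provider script.
def build_list_endpoint(start: date, days: int) -> str:
    end = start + timedelta(days=days)
    # The variable part of each request is the path itself (a modified-date
    # window) rather than a set of query parameters.
    return (
        "http://phylopic.org/api/a/image/list/modified/"
        f"{start.isoformat()}/{end.isoformat()}"
    )


print(build_list_endpoint(date(2020, 11, 15), 7))
# http://phylopic.org/api/a/image/list/modified/2020-11-15/2020-11-22
```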
I haven't gotten a chance to review this closely yet, but I'm curious if you know why the script supports PhyloPic being run as a dated DAG and as "non-dated"? My assumption is that this is run weekly as a dated DAG and processes 7 days at a time, but the intention is that a reingestion workflow would need to be able to run it for individual days, as normal.
I'm not sure why it's doing that as opposed to just running on a `@daily` schedule for a single day, like other dated DAGs 🤔 It seems like this introduces a lot of complexity. Were you able to find a reason that it was originally implemented that way?
Obviously this isn't introduced by your refactor! But curious if most of the complexity is coming from that, or if that's incidental and it's mostly the way the endpoints are structured?
After looking at the Stocksnap refactor (#601) for some other reasons, I think Rebecca's approach of having
I should be able to get to this tomorrow, but I'm going to go ahead and draft it until it's ready!
I tested this locally as a dated DAG and it worked great. I also set
@stacimc It's not clear to me why this was set up to run in both manners; it seems like that's how it was from the beginning: 1518383. I chose

I think once we get a reingestion workflow set up for this provider, we can do away with the reprocessing altogether! That, OR, our "reingestion" workflow can just reconsume the entire dataset using the offset + limit approach 🤔 🤷🏼♀️
Based on the medium urgency of this PR, the following reviewers are being gently reminded to review this PR: @obulat

Excluding weekend days, this PR was updated 4 day(s) ago. PRs labelled with medium urgency are expected to be reviewed within 4 weekday(s). @AetherUnbound, if this PR is not ready for a review, please draft it to prevent reviewers from getting further unnecessary pings.
Just a couple comments but the endpoint handling makes way more sense, nice work 🤠
    return data

@staticmethod
def _image_url(_uuid: str) -> str:
Is `uuid` reserved? Why does it have a leading `_`? I don't see it imported in this module, but maybe it's being defined globally somehow? Is this an Airflow-specific thing?
That is strange. Looks like this comes from the original code and was also applied to some other variables like `_endpoint`, which definitely shouldn't be reserved. Maybe it was meant to denote an internal variable, but it's not done consistently 🤔

I don't see any reason we couldn't omit the leading `_`; we have variables named `uuid` elsewhere in provider scripts. Probably a leftover artifact of refactoring multiple times!
Oop, definitely a holdover from the original code. I think it's more convention than anything, since `uuid` is a module and it might be ideal to distinguish between this variable and that module. That said, we aren't using that module here, as y'all point out, so I'll change it 🙂
Nice one, I think this might be the weirdest one to refactor! I ran this locally while reviewing the code and forgot that `catchup` was turned on, so it ended up running quite a backfill with no problems 😆
I made #813 to track pulling out the logic for the 'non-dated' version. I have some reservations about the unusual setup with running `@weekly`, and how we will tweak this for reingestion -- but either way I think it's out of scope here. This looks good to me!
@@ -33,6 +33,7 @@
 WORDPRESS_DEFAULT_PROVIDER = "wordpress"
 FREESOUND_DEFAULT_PROVIDER = "freesound"
 INATURALIST_DEFAULT_PROVIDER = "inaturalist"
+PHYLOPIC_DEFAULT_PROVIDER = "phylopic"
Funny this one wasn't set up here!
default_process_days = 7

def __init__(self, *args, **kwargs):
    super().__init__(*args, **kwargs)
Can we make Phylopic accept `process_days` as an argument here instead of overriding the default when we want to change it (as is done in the `main` method)?

This will be necessary for the reingestion workflows, where we want to be able to run Phylopic for individual days instead of a week at a time. Even with that change, this still introduces a complication for reingestion, since those workflows are factory-generated. We would need to add an `args` option to the `ProviderReingestionWorkflow` configuration objects, or something similar, that we could pass in through the factory 😬

Outside of the scope of this PR, I'd like to at least consider making this a daily DAG. If it is giving us a small enough amount of data that this is unreasonable, then alternatively maybe it shouldn't be dated?
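A rough sketch of the shape being suggested, building on the quoted snippet above; the keyword-argument plumbing and stub base class are assumptions for illustration, not the real base-class signature:

```python
# Hypothetical sketch only; in the real module this would subclass
# ProviderDataIngester. A stub base class stands in here so the snippet runs.
class ProviderDataIngesterStub:
    def __init__(self, *args, **kwargs):
        pass


class PhylopicDataIngester(ProviderDataIngesterStub):
    def __init__(self, *args, process_days: int = 7, **kwargs):
        super().__init__(*args, **kwargs)
        # A reingestion workflow could pass process_days=1 to process a single
        # day, while the weekly DAG keeps the 7-day default.
        self.process_days = process_days


print(PhylopicDataIngester().process_days)                # 7
print(PhylopicDataIngester(process_days=1).process_days)  # 1
```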
I tried doing this initially, and I ran into some issues with handling `args` vs `kwargs` vs this parameter in the various places that the provider ingester class is passed around. I think having it as `@weekly` was fine initially, but if that's going to create more problems for the reingestion workflow, then maybe this would just be better as `@daily` 😄 It'd prevent us from having to figure out all these additional workarounds for `process_days` as well. I'll go ahead and make that change as a commit, but if we do want to pursue just keeping this as a weekly DAG before merging, I can revert that commit.
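To illustrate why `process_days` only matters under the `@weekly` setup, here is a tiny sketch; the dates and window math are assumptions for the example, not the script's actual logic:

```python
from datetime import date, timedelta


def date_window(logical_date: date, process_days: int) -> tuple[date, date]:
    # A dated run only receives a single logical date, so the ingester has to
    # decide how wide a window to pull starting from that date.
    return logical_date, logical_date + timedelta(days=process_days)


print(date_window(date(2022, 7, 1), 7))  # @weekly-style run covering 7 days
print(date_window(date(2022, 7, 1), 1))  # @daily-style run covering a single day
```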
I've updated this to be an `@daily` DAG. Here are some local runs; you can see the logical date is every 7 days until the 21st, where it starts running daily after that!
Looks great to me! Switching to

Kudos for thinking to test this, what a good catch!
submitter = result.get("submitter", {})
first_name = submitter.get("firstName")
last_name = submitter.get("lastName")
if first_name and last_name:
I think it would be nice to add the submitter if there is a first_name OR last_name. Does that ever happen?
I'm not sure! I can make an issue for it though 😄
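A quick sketch of the tweak being suggested; the field names come from the quoted snippet, but the helper itself is hypothetical:

```python
def get_creator(result: dict) -> str | None:
    # Hypothetical helper: join whichever name parts are present instead of
    # requiring both firstName and lastName.
    submitter = result.get("submitter", {})
    parts = [
        name
        for name in (submitter.get("firstName"), submitter.get("lastName"))
        if name
    ]
    return " ".join(parts) or None


print(get_creator({"submitter": {"firstName": "Ada", "lastName": "Lovelace"}}))  # Ada Lovelace
print(get_creator({"submitter": {"lastName": "Lovelace"}}))  # Lovelace
print(get_creator({"submitter": {}}))  # None
```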
It makes sense to standardize all the DAGs to be daily when the date is provided.
I've added a couple of nitty comments inline. Other than that, everything works and looks great!
Co-authored-by: sarayourfriend <[email protected]>
Fixes
Fixes WordPress/openverse#1516 by @stacimc
Description
This PR refactors the Phylopic provider script to use the new
ProviderDataIngester
class.This refactor is actually pretty significant because the query params are not the defining feature for determining the values to retrieve, the
endpoint
is! I had to usurp existing functions a bit in order to get this all to work with the ingester class. Let me know if the current setup is too confusing or difficult to work with. Frankly we can strip out the logic for processing the entire dataset if we'd like and just make this a dated DAG.Testing Instructions
- `just recreate && just test`
- Set the `delay` attribute to `0.5` and turn the DAG on. It will take about 5 runs (due to catchup), but the 5th run should actually capture data.
- `just run python openverse_catalog/dags/providers/provider_api_scripts/phylopic.py --date-start 2020-11-15`, this should ingest 6 records
- `just run python openverse_catalog/dags/providers/provider_api_scripts/phylopic.py`, this should attempt to ingest everything, but let it go for only a few iterations before stopping

Checklist
Developer Certificate of Origin