
Refactor Phylopic to use ProviderDataIngester #747

Merged · 15 commits · Oct 24, 2022

Conversation

AetherUnbound (Contributor):

Fixes

Fixes WordPress/openverse#1516 by @stacimc

Description

This PR refactors the Phylopic provider script to use the new ProviderDataIngester class.

This refactor is actually pretty significant, because the query params are not the defining feature for determining which values to retrieve; the endpoint is! I had to usurp existing functions a bit in order to get this all to work with the ingester class. Let me know if the current setup is too confusing or difficult to work with. Frankly, we could strip out the logic for processing the entire dataset if we'd like and just make this a dated DAG.
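
For context, here is a quick sketch of the two endpoint shapes in play. The full-dataset form matches the URLs in the test logs further down; the dated "modified" form is shorthand for the date-windowed listing, so treat the exact path shapes as assumptions:

```python
BASE_URL = "http://phylopic.org/api/a"

def full_dataset_endpoint(offset: int, limit: int) -> str:
    # Paging lives in the URL path, not in query params:
    # /image/list/<offset>/<limit>
    return f"{BASE_URL}/image/list/{offset}/{limit}"

def dated_endpoint(start_date: str, end_date: str) -> str:
    # Records modified within a window (assumed path shape):
    # /image/list/modified/<start>/<end>
    return f"{BASE_URL}/image/list/modified/{start_date}/{end_date}"

print(full_dataset_endpoint(0, 25))
print(dated_endpoint("2020-11-15", "2020-11-16"))
```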

Testing Instructions

  1. just recreate && just test
  2. Set the class's delay attribute to 0.5 and turn the DAG on. It will take about 5 runs (due to catchup), but the 5th run should actually capture data.
  3. Run just run python openverse_catalog/dags/providers/provider_api_scripts/phylopic.py --date-start 2020-11-15; this should ingest 6 records.
  4. (Optionally) Run just run python openverse_catalog/dags/providers/provider_api_scripts/phylopic.py; this should attempt to ingest everything, but let it go for only a few iterations before stopping.

Checklist

  • My pull request has a descriptive title (not a vague title like Update index.md).
  • My pull request targets the default branch of the repository (main) or a parent feature branch.
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added or updated tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no visible errors.

Developer Certificate of Origin

Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@AetherUnbound AetherUnbound requested a review from a team as a code owner September 29, 2022 21:51
@openverse-bot openverse-bot added ✨ goal: improvement Improvement to an existing user-facing feature 💻 aspect: code Concerns the software code in the repository 🟨 priority: medium Not blocking but should be addressed soon labels Sep 29, 2022
    def get_media_type(self, record: dict) -> str:
        return constants.IMAGE

    def get_response_json(
Contributor:

Would it make sense for get_next_query_params to still just return a configuration object that get_response_json then converts into a usable endpoint to call super().get_response_json with?

I think that might tighten up the implementation and make it seem a little more "normal" and easier to understand. At least I think it would from my perspective, coming into this out of pure curiosity 🙂

Another thing that crossed my mind was whether some flag other than empty query params (maybe a sentinel object?) could be used to indicate that the batch operation was complete. It would make the process more explicit for other scripts as well, and would allow this one to avoid the awkward dict copying that requires a whole extra comment to explain.
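
For illustration, a standalone sketch of that sentinel idea (names and totals are hypothetical, not code from this PR):

```python
_DONE = object()  # sentinel: an explicit "no more batches" marker

def get_next_query_params(prev_params, total, limit=25):
    if prev_params is None:
        return {"offset": 0, "limit": limit}  # first batch
    next_offset = prev_params["offset"] + limit
    if next_offset >= total:
        return _DONE  # explicit end-of-data, no dict-copying convention
    return {**prev_params, "offset": next_offset}

params, total = None, 60
while (params := get_next_query_params(params, total)) is not _DONE:
    print(params)  # offsets 0, 25, 50, then the loop exits
```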

AetherUnbound (Author):

I've used a totally different method of constructing the endpoints; let me know if this new approach makes more sense!

@stacimc (Contributor) left a comment:

I haven't gotten a chance to review this closely yet, but I'm curious whether you know why the script supports PhyloPic being run both as a dated DAG and as "non-dated"? My assumption is that this is run weekly as a dated DAG and processes 7 days at a time, but the intention is that a reingestion workflow would need to be able to run it for individual days as normal.

I'm not sure why it's doing that as opposed to just running on a @daily schedule, for a single day like other dated DAGs 🤔 It seems like this introduces a lot of complexity. Were you able to find a reason that it was originally implemented that way?

Obviously this isn't introduced by your refactor! But I'm curious whether most of the complexity is coming from that, or if that's incidental and it's mostly down to the way the endpoints are structured.

AetherUnbound (Author):

After looking at the Stocksnap refactor (#601) for some other reasons, I think Rebecca's approach of having endpoint be a computed property is probably the best way to go! I also want to incorporate some of the other feedback provided, so I'll be taking a second look at this next week and making further changes to the endpoint setup 🙂
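
For reference, a minimal standalone sketch of the computed-property approach (attribute names are illustrative; the real class derives from ProviderDataIngester):

```python
class PhylopicDataIngester:
    batch_limit = 25

    def __init__(self):
        self.offset = 0

    @property
    def endpoint(self) -> str:
        # Rebuilt on every access, so bumping self.offset after a batch
        # automatically yields the next page's URL.
        return f"http://phylopic.org/api/a/image/list/{self.offset}/{self.batch_limit}"

ingester = PhylopicDataIngester()
print(ingester.endpoint)  # .../image/list/0/25
ingester.offset += ingester.batch_limit
print(ingester.endpoint)  # .../image/list/25/25
```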

AetherUnbound (Author):

I should be able to get to this tomorrow but I'm going to go ahead and draft it until it's ready!

@AetherUnbound AetherUnbound marked this pull request as draft October 7, 2022 00:28
AetherUnbound (Author):

I tested this locally as a dated DAG and it worked great; I also set dated=False and ran again, and the iteration behaved as expected!

[2022-10-07, 20:16:08 UTC] {provider_data_ingester.py:151} INFO - Begin ingestion for PhylopicDataIngester
[2022-10-07, 20:16:08 UTC] {phylopic.py:76} INFO - Constructed endpoint: http://phylopic.org/api/a/image/list/0/25
[2022-10-07, 20:16:21 UTC] {provider_data_ingester.py:165} INFO - 25 records ingested so far.
[2022-10-07, 20:16:21 UTC] {phylopic.py:76} INFO - Constructed endpoint: http://phylopic.org/api/a/image/list/25/25
[2022-10-07, 20:16:34 UTC] {provider_data_ingester.py:165} INFO - 50 records ingested so far.
[2022-10-07, 20:16:34 UTC] {phylopic.py:76} INFO - Constructed endpoint: http://phylopic.org/api/a/image/list/50/25
[2022-10-07, 20:16:48 UTC] {provider_data_ingester.py:165} INFO - 74 records ingested so far.
[2022-10-07, 20:16:48 UTC] {phylopic.py:76} INFO - Constructed endpoint: http://phylopic.org/api/a/image/list/75/25
[2022-10-07, 20:17:01 UTC] {provider_data_ingester.py:165} INFO - 99 records ingested so far.
[2022-10-07, 20:17:01 UTC] {phylopic.py:76} INFO - Constructed endpoint: http://phylopic.org/api/a/image/list/100/25
[2022-10-07, 20:17:02 UTC] {media.py:219} INFO - Writing 100 lines from buffer to disk.
[2022-10-07, 20:17:14 UTC] {provider_data_ingester.py:165} INFO - 124 records ingested so far.
[2022-10-07, 20:17:14 UTC] {phylopic.py:76} INFO - Constructed endpoint: http://phylopic.org/api/a/image/list/125/25
[2022-10-07, 20:17:27 UTC] {provider_data_ingester.py:165} INFO - 149 records ingested so far.
[2022-10-07, 20:17:27 UTC] {phylopic.py:76} INFO - Constructed endpoint: http://phylopic.org/api/a/image/list/150/25
[2022-10-07, 20:17:40 UTC] {provider_data_ingester.py:165} INFO - 171 records ingested so far.
[2022-10-07, 20:17:40 UTC] {phylopic.py:76} INFO - Constructed endpoint: http://phylopic.org/api/a/image/list/175/25
[2022-10-07, 20:17:53 UTC] {provider_data_ingester.py:165} INFO - 196 records ingested so far.
[2022-10-07, 20:17:53 UTC] {phylopic.py:76} INFO - Constructed endpoint: http://phylopic.org/api/a/image/list/200/25

@AetherUnbound AetherUnbound marked this pull request as ready for review October 7, 2022 20:21
AetherUnbound (Author):

@stacimc It's not clear to me why this was set up to run in both manners; it seems like that's how it was from the beginning: 1518383

I chose @weekly for this DAG rather than @daily because there were many days with no updates, so those runs would skip. Running this once a week rather than daily meant we'd have a smaller impact on the API while ensuring that more of the runs actually picked up data.

I think once we get a reingestion workflow set up for this provider, we can do away with the reprocessing altogether! That, OR, our "reingestion" workflow can just reconsume the entire dataset using the offset + limit approach 🤔 🤷🏼‍♀️

openverse-bot (Contributor):

Based on the medium urgency of this PR, the following reviewers are being gently reminded to review this PR:

@obulat
This reminder is being automatically generated due to the urgency configuration.

Excluding weekend[1] days, this PR was updated 4 day(s) ago. PRs labelled with medium urgency are expected to be reviewed within 4 weekday(s)[2].

@AetherUnbound, if this PR is not ready for a review, please draft it to prevent reviewers from getting further unnecessary pings.

Footnotes

  1. Specifically, Saturday and Sunday.

  2. For the purpose of these reminders we treat Monday - Friday as weekdays. Please note that the job that generates these reminders runs at midnight UTC on Monday - Friday. This means that, depending on your timezone, you may be pinged outside of the expected range.

@sarayourfriend (Contributor) left a comment:

Just a couple comments but the endpoint handling makes way more sense, nice work 🤠

        return data

    @staticmethod
    def _image_url(_uuid: str) -> str:
Contributor:

Is uuid reserved? Why does it have a leading _? I don't see it imported in this module, but maybe it's being defined globally somehow? Is this an Airflow specific thing?

Contributor:

That is strange. Looks like this comes from the original code and was also applied to some other variables like _endpoint, which definitely shouldn't be reserved. Maybe it was meant to denote an internal variable, but it's not done consistently 🤔

I don't see any reason we couldn't omit the leading _, we have variables named uuid elsewhere in provider scripts. Probably a leftover artifact of refactoring multiple times!

AetherUnbound (Author):

Oop, definitely a holdover from the original code. I think it's more convention than anything: uuid is a stdlib module, and the underscore makes the distinction between this variable and that module. That said, we aren't using that module here, as y'all point out, so I'll change it 🙂
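
For anyone curious, a tiny illustration of the shadowing that the convention guards against (the URL path here is hypothetical):

```python
import uuid

def _image_url(uuid: str) -> str:
    # The parameter shadows the stdlib module inside this scope, so a
    # call like uuid.uuid4() here would raise AttributeError on the str.
    return f"http://phylopic.org/api/a/image/{uuid}"

# Outside the function, the module is unaffected:
print(_image_url(str(uuid.uuid4())))
```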

@stacimc (Contributor) left a comment:

Nice one! I think this might be the weirdest one to refactor. I ran this locally while reviewing the code and forgot that catchup was turned on, so it ended up running quite a backfill with no problems 😆

I made #813 to track pulling out the logic for the 'non-dated' version. I have some reservations about the unusual setup with running @weekly, and how we will tweak this for reingestion -- but either way I think it's out of scope here. This looks good to me!

@@ -33,6 +33,7 @@
 WORDPRESS_DEFAULT_PROVIDER = "wordpress"
 FREESOUND_DEFAULT_PROVIDER = "freesound"
 INATURALIST_DEFAULT_PROVIDER = "inaturalist"
+PHYLOPIC_DEFAULT_PROVIDER = "phylopic"
Contributor:

Funny this one wasn't set up here!

    default_process_days = 7

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
Contributor:

Can we make Phylopic accept process_days as an argument here instead of overriding the default when we want to change it (as is done in the main method)?

This will be necessary for the reingestion workflows, where we want to be able to run Phylopic for individual days instead of a week at a time. Even with that change, this still introduces a complication for reingestion, since those workflows are factory-generated. We would need to add an args option to the ProviderReingestionWorkflow configuration objects, or something similar that we could pass in through the factory 😬

Outside of the scope of this PR, I'd like to at least consider making this a daily DAG. If it is giving us a small enough amount of data that this is unreasonable, then alternatively maybe it shouldn't be dated?
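
A rough standalone sketch of what accepting process_days could look like (signature simplified; the real ingester takes other arguments and derives from ProviderDataIngester):

```python
class PhylopicDataIngester:
    default_process_days = 7

    def __init__(self, process_days=None):
        # Fall back to the class default unless the caller overrides it.
        self.process_days = process_days or self.default_process_days

weekly = PhylopicDataIngester()                  # normal weekly run
reingest = PhylopicDataIngester(process_days=1)  # single-day reingestion run
print(weekly.process_days, reingest.process_days)  # 7 1
```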

AetherUnbound (Author):

I tried doing this initially, and I ran into some issues with handling args vs. kwargs vs. this parameter in the various places the provider ingester class is passed around. I think having it as @weekly was fine initially, but if that's going to create more problems for the reingestion workflow, then maybe this would just be better as @daily 😄 It'd prevent us from having to figure out all these additional workarounds for process_days as well. I'll go ahead and make that change as a commit, but if we do want to keep this as a weekly DAG before merging, I can revert that commit.

AetherUnbound commented Oct 21, 2022:

I've updated this to be an @daily DAG and removed all the logic for processing more than one day at a time. I've also tested this locally to ensure that the catchup=True nature of this DAG won't result in it trying to run daily versions for all of the days which are not Sunday (i.e. when the @weekly runs would kick off).
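
For reference, a minimal Airflow sketch of the knobs involved (dag_id and start_date are placeholders, not the catalog's factory code). With catchup=True, the scheduler backfills a run for every schedule interval since start_date, which is why the switch from @weekly needed checking:

```python
from datetime import datetime

from airflow import DAG

dag = DAG(
    dag_id="phylopic_workflow",
    schedule_interval="@daily",  # previously "@weekly"
    start_date=datetime(2022, 10, 1),
    catchup=True,  # backfills missed intervals once the DAG is turned on
)
```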

Here are some local runs, you can see the logical date is every 7 days until the 21st, where it starts running daily after that!

[Screenshot of local DAG runs: logical dates advance 7 days at a time until the 21st, then daily thereafter.]

stacimc commented Oct 21, 2022:

Looks great to me! Switching to @daily seems like the easiest path forward and will definitely make reingestion easier 👍

> I've also tested this locally to ensure that the catchup=True nature of this DAG won't result in it trying to run daily versions for all of the days which are not Sunday (i.e. when the @weekly runs would kick off).

Kudos for thinking to test this, what a good catch!

        submitter = result.get("submitter", {})
        first_name = submitter.get("firstName")
        last_name = submitter.get("lastName")
        if first_name and last_name:
Contributor:

I think it would be nice to add the submitter if there is a first_name OR last_name. Does that ever happen?
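
A quick sketch of what the OR handling could look like (hypothetical helper, not code from this PR):

```python
def build_creator(submitter: dict):
    # Accept the submitter when either name part is present.
    first_name = submitter.get("firstName")
    last_name = submitter.get("lastName")
    parts = [part for part in (first_name, last_name) if part]
    return " ".join(parts) or None

print(build_creator({"firstName": "Ada", "lastName": "Lovelace"}))  # Ada Lovelace
print(build_creator({"lastName": "Lovelace"}))                      # Lovelace
print(build_creator({}))                                            # None
```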

AetherUnbound (Author):

I'm not sure! I can make an issue for it though 😄

@obulat (Contributor) left a comment:

It makes sense to standardize all the DAGs to be daily when the date is provided.
I've added a couple of nitty comments inline. Other than that, everything works and looks great!
