
Refactor Phylopic to use ProviderDataIngester #747

Merged · 15 commits · Oct 24, 2022

Conversation

AetherUnbound (Contributor):

Fixes

Fixes WordPress/openverse#1516 by @stacimc

Description

This PR refactors the Phylopic provider script to use the new ProviderDataIngester class.

This refactor is actually pretty significant, because the query params are not the defining feature for determining which values to retrieve; the endpoint is! I had to usurp existing functions a bit in order to get this all to work with the ingester class. Let me know if the current setup is too confusing or difficult to work with. Frankly, we could strip out the logic for processing the entire dataset if we'd like and just make this a dated DAG.
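
For context, here is a quick sketch of the two endpoint shapes in play. The full-dataset form matches the URLs in the test logs further down; the dated "modified" form is shorthand for the date-windowed listing, so treat the exact path shapes as assumptions:

```python
BASE_URL = "http://phylopic.org/api/a"

def full_dataset_endpoint(offset: int, limit: int) -> str:
    # Paging lives in the URL path, not in query params:
    # /image/list/<offset>/<limit>
    return f"{BASE_URL}/image/list/{offset}/{limit}"

def dated_endpoint(start_date: str, end_date: str) -> str:
    # Records modified within a window (assumed path shape):
    # /image/list/modified/<start>/<end>
    return f"{BASE_URL}/image/list/modified/{start_date}/{end_date}"

print(full_dataset_endpoint(0, 25))
print(dated_endpoint("2020-11-15", "2020-11-16"))
```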

Testing Instructions

  1. just recreate && just test
  2. Set the class's delay attribute to 0.5 and turn the DAG on. It will take about 5 runs (due to catchup), but the 5th run should actually capture data.
  3. Run just run python openverse_catalog/dags/providers/provider_api_scripts/phylopic.py --date-start 2020-11-15; this should ingest 6 records.
  4. (Optionally) Run just run python openverse_catalog/dags/providers/provider_api_scripts/phylopic.py; this should attempt to ingest everything, but let it go for only a few iterations before stopping.

Checklist

  • My pull request has a descriptive title (not a vague title like Update index.md).
  • My pull request targets the default branch of the repository (main) or a parent feature branch.
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added or updated tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no visible errors.

Developer Certificate of Origin

Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@AetherUnbound AetherUnbound requested a review from a team as a code owner September 29, 2022 21:51
@openverse-bot openverse-bot added ✨ goal: improvement Improvement to an existing user-facing feature 💻 aspect: code Concerns the software code in the repository 🟨 priority: medium Not blocking but should be addressed soon labels Sep 29, 2022
    def get_media_type(self, record: dict) -> str:
        return constants.IMAGE

    def get_response_json(
Contributor:

Would it make sense for get_next_query_params to still just return a configuration object that get_response_json then converts into a usable endpoint to call super().get_response_json with?

I think that might tighten up the implementation and make it seem a little more "normal" and easier to understand. At least I think it would from my perspective, coming into this out of pure curiosity 🙂

Another thing that crossed my mind was whether some flag other than empty query params (maybe a sentinel object?) could be used to indicate that the batch operation was complete. It would make the process more explicit for other scripts as well, and would allow this one to avoid the awkward dict copying that requires a whole extra comment to explain.
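
For illustration, a standalone sketch of that sentinel idea (names and totals are hypothetical, not code from this PR):

```python
_DONE = object()  # sentinel: an explicit "no more batches" marker

def get_next_query_params(prev_params, total, limit=25):
    if prev_params is None:
        return {"offset": 0, "limit": limit}  # first batch
    next_offset = prev_params["offset"] + limit
    if next_offset >= total:
        return _DONE  # explicit end-of-data, no dict-copying convention
    return {**prev_params, "offset": next_offset}

params, total = None, 60
while (params := get_next_query_params(params, total)) is not _DONE:
    print(params)  # offsets 0, 25, 50, then the loop exits
```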

AetherUnbound (Author):

I've used a totally different method of constructing the endpoints; let me know if this new approach makes more sense!

@stacimc (Contributor) left a comment:

I haven't gotten a chance to review this closely yet, but I'm curious whether you know why the script supports PhyloPic being run both as a dated DAG and as "non-dated"? My assumption is that this is run weekly as a dated DAG and processes 7 days at a time, but the intention is that a reingestion workflow would need to be able to run it for individual days as normal.

I'm not sure why it's doing that as opposed to just running on a @daily schedule, for a single day like other dated DAGs 🤔 It seems like this introduces a lot of complexity. Were you able to find a reason that it was originally implemented that way?

Obviously this isn't introduced by your refactor! But I'm curious whether most of the complexity is coming from that, or if that's incidental and it's mostly down to the way the endpoints are structured.

AetherUnbound (Author):

After looking at the Stocksnap refactor (#601) for some other reasons, I think Rebecca's approach of having endpoint be a computed property is probably the best way to go! I also want to incorporate some of the other feedback provided, so I'll be taking a second look at this next week and making further changes to the endpoint setup 🙂
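
For reference, a minimal standalone sketch of the computed-property approach (attribute names are illustrative; the real class derives from ProviderDataIngester):

```python
class PhylopicDataIngester:
    batch_limit = 25

    def __init__(self):
        self.offset = 0

    @property
    def endpoint(self) -> str:
        # Rebuilt on every access, so bumping self.offset after a batch
        # automatically yields the next page's URL.
        return f"http://phylopic.org/api/a/image/list/{self.offset}/{self.batch_limit}"

ingester = PhylopicDataIngester()
print(ingester.endpoint)  # .../image/list/0/25
ingester.offset += ingester.batch_limit
print(ingester.endpoint)  # .../image/list/25/25
```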

AetherUnbound (Author):

I should be able to get to this tomorrow but I'm going to go ahead and draft it until it's ready!

@AetherUnbound AetherUnbound marked this pull request as draft October 7, 2022 00:28
AetherUnbound (Author):

I tested this locally as a dated DAG and it worked great; I also set dated=False and ran again, and the iteration behaved as expected!

[2022-10-07, 20:16:08 UTC] {provider_data_ingester.py:151} INFO - Begin ingestion for PhylopicDataIngester
[2022-10-07, 20:16:08 UTC] {phylopic.py:76} INFO - Constructed endpoint: http://phylopic.org/api/a/image/list/0/25
[2022-10-07, 20:16:21 UTC] {provider_data_ingester.py:165} INFO - 25 records ingested so far.
[2022-10-07, 20:16:21 UTC] {phylopic.py:76} INFO - Constructed endpoint: http://phylopic.org/api/a/image/list/25/25
[2022-10-07, 20:16:34 UTC] {provider_data_ingester.py:165} INFO - 50 records ingested so far.
[2022-10-07, 20:16:34 UTC] {phylopic.py:76} INFO - Constructed endpoint: http://phylopic.org/api/a/image/list/50/25
[2022-10-07, 20:16:48 UTC] {provider_data_ingester.py:165} INFO - 74 records ingested so far.
[2022-10-07, 20:16:48 UTC] {phylopic.py:76} INFO - Constructed endpoint: http://phylopic.org/api/a/image/list/75/25
[2022-10-07, 20:17:01 UTC] {provider_data_ingester.py:165} INFO - 99 records ingested so far.
[2022-10-07, 20:17:01 UTC] {phylopic.py:76} INFO - Constructed endpoint: http://phylopic.org/api/a/image/list/100/25
[2022-10-07, 20:17:02 UTC] {media.py:219} INFO - Writing 100 lines from buffer to disk.
[2022-10-07, 20:17:14 UTC] {provider_data_ingester.py:165} INFO - 124 records ingested so far.
[2022-10-07, 20:17:14 UTC] {phylopic.py:76} INFO - Constructed endpoint: http://phylopic.org/api/a/image/list/125/25
[2022-10-07, 20:17:27 UTC] {provider_data_ingester.py:165} INFO - 149 records ingested so far.
[2022-10-07, 20:17:27 UTC] {phylopic.py:76} INFO - Constructed endpoint: http://phylopic.org/api/a/image/list/150/25
[2022-10-07, 20:17:40 UTC] {provider_data_ingester.py:165} INFO - 171 records ingested so far.
[2022-10-07, 20:17:40 UTC] {phylopic.py:76} INFO - Constructed endpoint: http://phylopic.org/api/a/image/list/175/25
[2022-10-07, 20:17:53 UTC] {provider_data_ingester.py:165} INFO - 196 records ingested so far.
[2022-10-07, 20:17:53 UTC] {phylopic.py:76} INFO - Constructed endpoint: http://phylopic.org/api/a/image/list/200/25

@AetherUnbound AetherUnbound marked this pull request as ready for review October 7, 2022 20:21
AetherUnbound (Author):

@stacimc It's not clear to me why this was set up to run in both manners; it seems like that's how it was from the beginning: 1518383

I chose @weekly for this DAG rather than @daily because there were many days with no updates, so those runs would skip. Running this once a week rather than daily meant we'd have a smaller impact on the API while ensuring that more of the runs actually picked up data.

I think once we get a reingestion workflow set up for this provider, we can do away with the reprocessing altogether! That, OR, our "reingestion" workflow can just reconsume the entire dataset using the offset + limit approach 🤔 🤷🏼‍♀️

openverse-bot (Contributor):

Based on the medium urgency of this PR, the following reviewers are being gently reminded to review this PR:

@obulat
This reminder is being automatically generated due to the urgency configuration.

Excluding weekend[1] days, this PR was updated 4 day(s) ago. PRs labelled with medium urgency are expected to be reviewed within 4 weekday(s)[2].

@AetherUnbound, if this PR is not ready for a review, please draft it to prevent reviewers from getting further unnecessary pings.

Footnotes

  1. Specifically, Saturday and Sunday.

  2. For the purpose of these reminders we treat Monday - Friday as weekdays. Please note that the job that generates these reminders runs at midnight UTC on Monday - Friday. This means that, depending on your timezone, you may be pinged outside of the expected range.

@sarayourfriend (Contributor) left a comment:

Just a couple comments but the endpoint handling makes way more sense, nice work 🤠

        return data

    @staticmethod
    def _image_url(_uuid: str) -> str:
Contributor:

Is uuid reserved? Why does it have a leading _? I don't see it imported in this module, but maybe it's being defined globally somehow? Is this an Airflow specific thing?

Contributor:

That is strange. Looks like this comes from the original code and was also applied to some other variables like _endpoint, which definitely shouldn't be reserved. Maybe it was meant to denote an internal variable, but it's not done consistently 🤔

I don't see any reason we couldn't omit the leading _, we have variables named uuid elsewhere in provider scripts. Probably a leftover artifact of refactoring multiple times!

AetherUnbound (Author):

Oop, definitely a holdover from the original code. I think it's more convention than anything: uuid is a stdlib module, and the underscore makes the distinction between this variable and that module. That said, we aren't using that module here, as y'all point out, so I'll change it 🙂
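
For anyone curious, a tiny illustration of the shadowing that the convention guards against (the URL path here is hypothetical):

```python
import uuid

def _image_url(uuid: str) -> str:
    # The parameter shadows the stdlib module inside this scope, so a
    # call like uuid.uuid4() here would raise AttributeError on the str.
    return f"http://phylopic.org/api/a/image/{uuid}"

# Outside the function, the module is unaffected:
print(_image_url(str(uuid.uuid4())))
```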

@stacimc (Contributor) left a comment:

Nice one! I think this might be the weirdest one to refactor. I ran this locally while reviewing the code and forgot that catchup was turned on, so it ended up running quite a backfill with no problems 😆

I made #813 to track pulling out the logic for the 'non-dated' version. I have some reservations about the unusual setup with running @weekly, and how we will tweak this for reingestion -- but either way I think it's out of scope here. This looks good to me!

@@ -33,6 +33,7 @@
 WORDPRESS_DEFAULT_PROVIDER = "wordpress"
 FREESOUND_DEFAULT_PROVIDER = "freesound"
 INATURALIST_DEFAULT_PROVIDER = "inaturalist"
+PHYLOPIC_DEFAULT_PROVIDER = "phylopic"
Contributor:

Funny this one wasn't set up here!

    default_process_days = 7

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
Contributor:

Can we make Phylopic accept process_days as an argument here instead of overriding the default when we want to change it (as is done in the main method)?

This will be necessary for the reingestion workflows, where we want to be able to run Phylopic for individual days instead of a week at a time. Even with that change, this still introduces a complication for reingestion, since those workflows are factory-generated. We would need to add an args option to the ProviderReingestionWorkflow configuration objects, or something similar that we could pass in through the factory 😬

Outside of the scope of this PR, I'd like to at least consider making this a daily DAG. If it is giving us a small enough amount of data that this is unreasonable, then alternatively maybe it shouldn't be dated?
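
A rough standalone sketch of what accepting process_days could look like (signature simplified; the real ingester takes other arguments and derives from ProviderDataIngester):

```python
class PhylopicDataIngester:
    default_process_days = 7

    def __init__(self, process_days=None):
        # Fall back to the class default unless the caller overrides it.
        self.process_days = process_days or self.default_process_days

weekly = PhylopicDataIngester()                  # normal weekly run
reingest = PhylopicDataIngester(process_days=1)  # single-day reingestion run
print(weekly.process_days, reingest.process_days)  # 7 1
```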

AetherUnbound (Author):

I tried doing this initially, and I ran into some issues with handling args vs. kwargs vs. this parameter in the various places the provider ingester class is passed around. I think having it as @weekly was fine initially, but if that's going to create more problems for the reingestion workflow, then maybe this would just be better as @daily 😄 It'd prevent us from having to figure out all these additional workarounds for process_days as well. I'll go ahead and make that change as a commit, but if we do want to keep this as a weekly DAG before merging, I can revert that commit.

AetherUnbound commented Oct 21, 2022:

I've updated this to be an @daily DAG and removed all the logic for processing more than one day at a time. I've also tested this locally to ensure that the catchup=True nature of this DAG won't result in it trying to run daily versions for all of the days which are not Sunday (i.e. when the @weekly runs would kick off).
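
For reference, a minimal Airflow sketch of the knobs involved (dag_id and start_date are placeholders, not the catalog's factory code). With catchup=True, the scheduler backfills a run for every schedule interval since start_date, which is why the switch from @weekly needed checking:

```python
from datetime import datetime

from airflow import DAG

dag = DAG(
    dag_id="phylopic_workflow",
    schedule_interval="@daily",  # previously "@weekly"
    start_date=datetime(2022, 10, 1),
    catchup=True,  # backfills missed intervals once the DAG is turned on
)
```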

Here are some local runs, you can see the logical date is every 7 days until the 21st, where it starts running daily after that!

[Screenshot of local DAG runs: logical dates advance 7 days at a time until the 21st, then daily thereafter.]

stacimc commented Oct 21, 2022:

Looks great to me! Switching to @daily seems like the easiest path forward and will definitely make reingestion easier 👍

> I've also tested this locally to ensure that the catchup=True nature of this DAG won't result in it trying to run daily versions for all of the days which are not Sunday (i.e. when the @weekly runs would kick off).

Kudos for thinking to test this, what a good catch!

        submitter = result.get("submitter", {})
        first_name = submitter.get("firstName")
        last_name = submitter.get("lastName")
        if first_name and last_name:
Contributor:

I think it would be nice to add the submitter if there is a first_name OR last_name. Does that ever happen?
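
A quick sketch of what the OR handling could look like (hypothetical helper, not code from this PR):

```python
def build_creator(submitter: dict):
    # Accept the submitter when either name part is present.
    first_name = submitter.get("firstName")
    last_name = submitter.get("lastName")
    parts = [part for part in (first_name, last_name) if part]
    return " ".join(parts) or None

print(build_creator({"firstName": "Ada", "lastName": "Lovelace"}))  # Ada Lovelace
print(build_creator({"lastName": "Lovelace"}))                      # Lovelace
print(build_creator({}))                                            # None
```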

AetherUnbound (Author):

I'm not sure! I can make an issue for it though 😄

@obulat (Contributor) left a comment:

It makes sense to standardize all the DAGs to be daily when the date is provided.
I've added a couple of nitty comments inline. Other than that, everything works and looks great!
