Add docs and template ProviderDataIngester #790

stacimc · 2022-10-10T22:42:52Z

Fixes

Fixes WordPress/openverse#1424 by @stacimc

Description

Adds a create_provider_ingester script that generates a templated provider API script and its test file. The generated files include extensive documentation and TODOs to help a new contributor flesh out an ingestion class. Credit to an earlier template script in this repository, from which I lifted quite a bit of code!
Adds the following docs (would love better title suggestions 😅):
- adding_a_new_provider.md: Gives a very brief overview of what it means to add a provider to Openverse, explains how to use the script, and briefly explains how to add a ProviderWorkflow to generate the actual DAG
- provider_data_ingester_faq.md: Describes "advanced options" for implementing some common non-standard use cases in a ProviderDataIngester, framed as an FAQ
- data_models.md: Extremely temporary documentation of our columns, meant to be replaced by the work in WordPress/openverse#1410 (Document the use of each column and guideline for selection from sources).
Expands some of the documentation in the actual ProviderDataIngester class

The intention is to avoid too much duplication of documentation from the code. If deemed necessary, we can look into auto-generating some from the doc strings, but my strategy was:

- Script makes it easy to generate a stubbed out new ingester class
- Comments and TODOs within the generated class lead the developer through implementing a "straightforward" provider
- "Advanced" options are documented separately with examples

There's definitely a balancing act of where to document things in the code vs the docfile vs the template. My goal is mostly for this to be a good starting point!

Testing Instructions

Read the docs, make sure links work, etc. :)
Test the new script
- just test
  - Verify that the test files (foobar_industries.py & test_foobar_industries.py) were deleted after the tests finished
- Run it with some sample data, e.g.: just add-provider "Foo Museum" "https://foo-museum.org/api/v1/" image audio
  - Verify the foo_museum.py provider script and test_foo_museum.py files were generated
  - Inspect the files to make sure they make sense and that the name/endpoint/media types were templated in properly
  - Try again with different media types, including a single media type

Checklist

My pull request has a descriptive title (not a vague title like Update index.md).
My pull request targets the default branch of the repository (main) or a parent feature branch.
My commit messages follow best practices.
My code follows the established code style of the repository.
I added or updated tests for the changes I made (if applicable).
I added or updated documentation (if applicable).
I tried running the project locally and verified that there are no visible errors.

Developer Certificate of Origin

Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.


krysal

This is fantastic 🤩 I started using the template to convert the Smithsonian script so I found some minor typos and details but this is looking so good and useful.

Thank you for writing such good documentation @stacimc! I'll get back after looking at the test files.

openverse_catalog/dags/templates/template_provider.py_template

openverse_catalog/dags/templates/template_test.py_template

openverse_catalog/dags/templates/create_provider_ingester.py

openverse_catalog/docs/adding_a_new_provider.md

zackkrida

We might want to consider sanitizing the provided provider name in these scripts, or alternatively excluding unsupported characters. In my test I used the following configuration:

python3 openverse_catalog/dags/templates/create_provider_ingester.py "Nappy.co" "https://api.nappy.co/v1/openverse/images" -m "image"

Which resulted in nappy.co.py filenames and a class Nappy.coDataIngester.

openverse_catalog/dags/templates/template_provider.py_template

zackkrida

@stacimc, this is amazing! I was able to use it to write my first DAG (and first python commit of my life, I believe 😅). Other than my inline comments, one thing I noticed was that by using

python3 openverse_catalog/dags/templates/create_provider_ingester.py "Nappy" "https://api.nappy.co/v1/openverse/images" -m "image"

it created a constant called NAPPY_IMAGE_PROVIDER and I was only able to get things working by renaming to NAPPY_DEFAULT_PROVIDER.

Edit: The above might actually be wrong. It may be because the docs comment is required at the top of the provider script and isn't part of the template. Here's the example of what I added to the Nappy PR:

"""
Content Provider:       Nappy

ETL Process:            Use the API to identify all CC0-licensed images.

Output:                 TSV file containing the image meta-data.

Notes:                  This api was written specially for Openverse.
                        There are no known limits or restrictions.

"""

I don't feel qualified to fully review the PR but this is really awesome!

zackkrida · 2022-10-14T15:18:18Z

One other thought. As I recall get_should_continue wasn't included in the template. That seems common enough to me that perhaps it could be included, but commented out?
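For context, a minimal sketch of what a get_should_continue override could look like. The class below is a hypothetical stand-in rather than the real ProviderDataIngester, and both the method signature (a single parsed JSON response) and the "next_page" field are assumptions made for illustration only.

```python
class ExampleIngester:
    """Hypothetical stand-in; a real ingester would subclass ProviderDataIngester."""

    def get_should_continue(self, response_json: dict) -> bool:
        # Assumed response shape: the API sets "next_page" to None
        # (or omits it) once the last page has been served.
        return response_json.get("next_page") is not None


ingester = ExampleIngester()
keep_going = ingester.get_should_continue({"next_page": 2})
stop_now = ingester.get_should_continue({"next_page": None})
```

Shipping something like this commented out in the template would let contributors opt in only when their API needs an explicit stop condition.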

krysal

I'm loving these docs! ✨
Noticing we can add syntax highlight 🐍

openverse_catalog/docs/provider_data_ingester_faq.md

openverse_catalog/docs/adding_a_new_provider.md


stacimc · 2022-10-17T21:29:02Z

> We might want to consider sanitizing the provided provider name in these scripts,

Added some extra sanitization to the provider name. It now replaces both spaces and periods with underscores, and then removes excluded chars (anything non-alphanumeric except hyphen and underscore). I think underscore replacement makes sense for period.
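A rough sketch of the two sanitization steps described above, for anyone following along. The function name sanitize_provider is hypothetical and this is not the script's actual code; it also assumes "alphanumeric" means ASCII letters and digits.

```python
import re


def sanitize_provider(provider: str) -> str:
    """Hypothetical helper mirroring the two steps described in the comment."""
    # Step 1: spaces and periods both become underscores.
    with_underscores = re.sub(r"[ .]", "_", provider)
    # Step 2: drop anything that is not alphanumeric, a hyphen, or an underscore.
    return re.sub(r"[^0-9A-Za-z_-]", "", with_underscores)


sanitized = sanitize_provider("Nappy.co")
```

With this approach the "Nappy.co" input from the earlier comment would no longer carry a period into the filename or class name.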

> it created a constant called NAPPY_IMAGE_PROVIDER and I was only able to get things working by renaming to NAPPY_DEFAULT_PROVIDER.
>
> Edit: The above might actually be wrong. It may be because the docs comment is required at the top of the provider script and isn't part of the template. Here's the example of what I added to the Nappy PR:

@zackkrida I'm not sure what's happening here! I don't think the constant name should be an issue, since as noted in the TODOs you need to actually define the constant yourself in provider_details.py. The docs comment you reference also shouldn't be required (our other refactored DAGs don't have that comment)¹.

I tried playing around with this locally with a fake provider and it all works fine for me. Did you run just up?

¹ The fact that our refactored DAGs don't have this string is actually a problem though! We use it to generate our DAG documentation. I've created openverse#1391 (Add DAG documentation for refactored DAGs) to make sure we add it back in, and I'll go add it to the template now too 😄 I'm really glad you mentioned it!


krysal

These docs and templates are very helpful. Great work here! ⭐️

I found some minor non-blocking observations, tiny details.

openverse_catalog/templates/template_provider.py_template

openverse_catalog/templates/template_test.py_template

AetherUnbound

This is SO GREAT! 🥳 I haven't yet tried the testing steps, but here's a lot of small format suggestions and other potential improvements 🚀

Great job on this impressive documentation!

AetherUnbound · 2022-10-20T19:59:26Z

openverse_catalog/dags/providers/provider_api_scripts/provider_data_ingester.py

@@ -105,7 +117,7 @@ def __init__(self, conf: dict = None, date: str = None):
        self.delayed_requester = DelayedRequester(
            delay=self.delay, headers=self.headers
        )
-        self.media_stores = self.init_media_stores()
+        self.media_stores = self._init_media_stores()


Really good call formalizing the public/private nature of these methods!

openverse_catalog/docs/adding_a_new_provider.md

AetherUnbound · 2022-10-20T20:11:42Z

openverse_catalog/docs/adding_a_new_provider.md

+* `provider_script`: the name of the file where you defined your `ProviderDataIngester` class
+* `ingestion_callable`: the `ProviderDataIngester` class itself
+* `media_types`: the media types your provider handles


I'm realizing that we should make an issue for removing the legacy wrapper handling logic, which also means that we should be able to remove provider_script here eventually. I can make that issue 🙂 Also, I think we could even determine media_types automatically by using ProviderDataIngester::providers.keys() 😮 Lots of paring down we could do on these configs once the refactors are all done! 🥳
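To make the media_types idea concrete, here is a small sketch under stated assumptions. SketchProviderWorkflow and FooMuseumDataIngester are hypothetical stand-ins (not the real ProviderWorkflow or any real ingester), and the shape of the providers mapping (media type to provider string) is assumed from the providers.keys() reference above.

```python
from dataclasses import dataclass


class FooMuseumDataIngester:
    """Hypothetical ingester; only the `providers` mapping matters here."""

    # Assumed shape: media type -> provider string.
    providers = {"image": "foo_museum", "audio": "foo_museum"}


@dataclass
class SketchProviderWorkflow:
    """Stand-in with the three fields listed in the docs diff above."""

    provider_script: str
    ingestion_callable: type
    media_types: tuple = ()

    def __post_init__(self):
        # If media_types is not given, derive it from the ingester class.
        if not self.media_types:
            self.media_types = tuple(self.ingestion_callable.providers.keys())


workflow = SketchProviderWorkflow(
    provider_script="foo_museum",
    ingestion_callable=FooMuseumDataIngester,
)
```

An explicit media_types argument would still take precedence, so existing configurations could keep working while the field is phased out.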

AetherUnbound · 2022-10-20T20:43:03Z

openverse_catalog/templates/create_provider_ingester.py

+
+
+def main():
+    parser = argparse.ArgumentParser(


Just a note, definitely doesn't need a change: Airflow comes with Click, so we could use that here for these cases if this becomes more cumbersome 🙂
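For reference, the command-line shape used elsewhere in this thread (two positional arguments plus -m for media types) can be expressed with argparse roughly as below. This is a sketch, not the script's actual parser: the --media long option, the help strings, and the default of ["image"] are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Generate a templated provider script and test file.",
    )
    parser.add_argument("provider", help="Provider name, e.g. 'Foo Museum'")
    parser.add_argument("endpoint", help="Base URL of the provider API")
    parser.add_argument(
        "-m",
        "--media",  # assumed long name; only -m appears in this thread
        nargs="+",
        choices=["image", "audio"],
        default=["image"],  # assumed default
        help="One or more media types handled by the provider",
    )
    return parser


args = build_parser().parse_args(
    ["Foo Museum", "https://foo-museum.org/api/v1/", "-m", "image", "audio"]
)
```

If the option handling grows beyond this, moving to Click as suggested would mostly be a matter of replacing the add_argument calls with decorators.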

openverse_catalog/templates/template_provider.py_template

AetherUnbound · 2022-10-20T20:45:30Z

openverse_catalog/templates/template_provider.py_template

+                "limit": self.batch_limit,
+                "cc": 1,
+                "offset": 0,
+                "api_key": Variable.get("API_KEY_{screaming_snake_provider}")


😂 🐍 📣

openverse_catalog/templates/template_provider.py_template

tests/templates/test_create_provider_ingester.py

AetherUnbound · 2022-10-20T23:22:52Z

Oh also, +1 to Zack's comment/request for a just recipe for running the provider creation script!

obulat · 2022-10-21T14:25:02Z

openverse_catalog/docs/adding_a_new_provider.md

+
+At a high level the steps are:
+
+1. `generate_filename`: Generates a TSV filename used in later steps


It's really difficult to gauge what level of background knowledge our readers have :)
Do you think we can expect the readers of this guide to know what a TSV filename is, or would it be good to add some explanation here? Something like 'Generates the name of a TSV (tab-separated values) text file that will be used for saving the data to the disk in later steps.'

I love this!

openverse_catalog/docs/adding_a_new_provider.md

openverse_catalog/docs/data_models.md

obulat · 2022-10-21T14:38:27Z

openverse_catalog/docs/data_models.md

+| *foreign_identifier* | Unique identifier for the record on the source site. |
+| *thumbnail_url* | Direct link to a thumbnail-sized version of the record. |
+| *filesize* | Size of the main file in bytes. |
+| *filetype* | The filetype of the main file, eg. 'mp3', 'jpg', etc. |


Should we add something about the fact that filetype can be extracted from the file URL? And will be 'validated', that is, different names of the same filetype (jpeg - jpg) will be unified?

openverse-catalog/openverse_catalog/dags/common/storage/media.py

Lines 293 to 304 in 4a9c008

    def _validate_filetype(self, filetype: str | None, url: str) -> str | None:
        """
        Extracts filetype from the media URL if filetype is None.
        Unifies filetypes that have variants such as jpg/jpeg and tiff/tif.
        :param filetype: Optional filetype string.
        :return: filetype string or None
        """
        if filetype is None:
            filetype = extract_filetype(url, self.media_type)
        if self.media_type != "image":
            return filetype
        return FILETYPE_EQUIVALENTS.get(filetype, filetype)

obulat · 2022-10-21T14:40:06Z

openverse_catalog/docs/data_models.md

+| field_name | description |
+| --- | --- |
+| *duration* | Audio duration in milliseconds. |
+| *bit_rate* | Audio bit rate as int. |


I think we should add the measurement units here, too.

obulat · 2022-10-21T14:42:43Z

openverse_catalog/docs/provider_data_ingester_faq.md

+
+**Example**: You're pulling data from a Museum database, and each "record" in a batch contains multiple photos of a single physical object.
+
+**Solution**: The `get_record_data` method takes a `data` object representing a single record from the provider API. Typically, it extracts required data and returns it as a single dict. However, it can also return a **list of dictionaries** for cases like the one described, where multiple Openverse records can be extracted.


It would be nice to add links to the examples of each of the solutions.

I wanted to do this but was worried about code drift and making this document difficult to maintain 🤔

Ooo, that's a really good point!

The problem with code drift and maintainability is really a good point :( I wonder if we could automate it somehow. Add a link to some regex search string that would lead to the example of, say, an overridden method in one of the solutions?...

Not necessary for this PR, but would be really nice to have. I usually learn by example and links to the examples of the solutions would be really helpful for the way I learn :)
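Until linked examples exist, an inline sketch may cover part of this need without pointing at code that can drift. The function below only illustrates the list-returning form of get_record_data from the FAQ excerpt; the response shape ("id", "title", "photos", "url") and the image_url key are hypothetical, and only foreign_identifier is taken from the data_models diff in this PR.

```python
def get_record_data(data: dict) -> list[dict]:
    """Return one Openverse record per photo of a single museum object."""
    records = []
    for photo in data.get("photos", []):
        image_url = photo.get("url")
        if not image_url:
            # Skip photos that lack the data needed to build a record.
            continue
        records.append(
            {
                # Combine object and photo ids so each record stays unique.
                "foreign_identifier": f"{data['id']}-{photo['id']}",
                "image_url": image_url,  # assumed field name
                "title": data.get("title"),
            }
        )
    return records


sample = {
    "id": "obj1",
    "title": "Vase",
    "photos": [{"id": "a", "url": "https://example.com/a.jpg"}, {"id": "b"}],
}
records = get_record_data(sample)
```

In a real ingester this would be a method on the class, with the same body apart from the self parameter.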

obulat · 2022-10-21T14:45:32Z

openverse_catalog/templates/create_provider_ingester.py

+):
+    with template_path.open("r", encoding="utf-8") as template:
+        camel_provider = inflection.camelize(provider)
+        screaming_snake_provider = inflection.underscore(provider).upper()


I've never heard about screaming snake case before 😆
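For anyone else new to the term: screaming snake case is snake_case in all capitals, e.g. FOO_MUSEUM. The sketch below shows how the two name variants could feed a template. It uses plain string methods as a stand-in for the inflection calls in the diff (so edge cases may behave differently), and the template text and the {camel_provider} placeholder are assumptions; only {screaming_snake_provider} appears in the template diff in this PR.

```python
def to_camel(provider: str) -> str:
    # "Foo Museum" -> "FooMuseum"
    return "".join(word[:1].upper() + word[1:] for word in provider.split())


def to_screaming_snake(provider: str) -> str:
    # "Foo Museum" -> "FOO_MUSEUM"
    return "_".join(provider.split()).upper()


# Hypothetical two-line template; the real template is much longer.
TEMPLATE = (
    "class {camel_provider}DataIngester:\n"
    '    api_key_variable = "API_KEY_{screaming_snake_provider}"\n'
)

rendered = TEMPLATE.format(
    camel_provider=to_camel("Foo Museum"),
    screaming_snake_provider=to_screaming_snake("Foo Museum"),
)
```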

stacimc · 2022-10-21T17:03:17Z

Thank you so much for all the feedback! I've also added a just recipe, so we can now run:

just add-provider "Foo Museum" https://foo.museum.org/api/v1 image audio

It does not use the run command since I don't think there's any reason to ever run this using the webserver image, but it does save folks from needing to find the path to the script, and it keeps all our commands in one place :)

AetherUnbound

How exciting!! 💃🏼 I tried this out locally and it ran great. There's just one syntax issue in the generated script, otherwise this is good to go IMO

$ j add-provider "The Word For World Is Forest" http://the.forest image
python3 openverse_catalog/templates/create_provider_ingester.py "The Word For World Is Forest" "http://the.forest" -m image
Creating files in /home/madison/git-a8c/openverse-catalog
API script:        openverse_catalog/dags/providers/provider_api_scripts/the_word_for_world_is_forest.py
API script test:   tests/dags/providers/provider_api_scripts/test_the_word_for_world_is_forest.py

NOTE: You will also need to add a new ProviderWorkflow dataclass configuration to the PROVIDER_WORKFLOWS list in `openverse-catalog/dags/providers/provider_workflows.py`.

openverse_catalog/templates/template_provider.py_template


stacimc added 3 commits October 10, 2022 15:18

_-prefix methods that should not be overridden

6cb6d7f

Initial template

a507cd7

Add initial docs

98fef3c

stacimc added the documentation, 🟧 priority: high, and ✨ goal: improvement labels Oct 10, 2022

stacimc self-assigned this Oct 10, 2022

stacimc added 6 commits October 11, 2022 15:39

Update template, add test template file

277ecab

Add script to generate template files

6e5fd04

Update docs to reference script

a208913

Moving more documentation into the code

b073485

Reformat docs

eeaebfd

- Breaks out into several files
- Removes documentation that is redundant (copied from code)
- Prefers documentation within the template
- Explicitly documents advanced options as FAQ
- Some small updates to the templating

Small tweaks

8f1ad6d

stacimc marked this pull request as ready for review October 13, 2022 01:13

stacimc requested a review from a team as a code owner October 13, 2022 01:13

stacimc requested review from krysal and AetherUnbound October 13, 2022 01:13

krysal reviewed Oct 13, 2022

zackkrida reviewed Oct 13, 2022

openverse_catalog/docs/adding_a_new_provider.md (outdated, resolved)

zackkrida reviewed Oct 13, 2022

openverse_catalog/docs/adding_a_new_provider.md (outdated, resolved)

zackkrida reviewed Oct 13, 2022

openverse_catalog/dags/templates/template_provider.py_template (outdated, resolved)

zackkrida reviewed Oct 13, 2022

openverse_catalog/dags/templates/template_provider.py_template (outdated, resolved)

zackkrida reviewed Oct 14, 2022

zackkrida mentioned this pull request Oct 14, 2022

Add a Nappy provider DAG using ProviderDataIngester #796

Merged

9 tasks

krysal reviewed Oct 17, 2022

Address feedback, sanitize provider string

ab3dcbe

stacimc and others added 2 commits October 17, 2022 14:13

Fix defaults for media types, add tests

1f28f06

Adjust wording

1f53df1

Co-authored-by: Zack Krida <[email protected]>

stacimc and others added 3 commits October 17, 2022 14:29

Add syntax highlighting

a19460b

Co-authored-by: Krystle Salazar <[email protected]>

Add syntax highlighting to the rest of the code snippets

0060215

Add DAG doc to the template

bdf1315

krysal approved these changes Oct 20, 2022

openverse_catalog/templates/template_provider.py_template (outdated, resolved)

openverse_catalog/templates/template_test.py_template (outdated, resolved)

openverse_catalog/templates/template_test.py_template (outdated, resolved)

AetherUnbound reviewed Oct 20, 2022

obulat reviewed Oct 21, 2022

openverse_catalog/docs/adding_a_new_provider.md (outdated, resolved)

obulat reviewed Oct 21, 2022

openverse_catalog/docs/data_models.md (outdated, resolved)

obulat reviewed Oct 21, 2022

stacimc added 2 commits October 21, 2022 09:15

Add just recipe

8a04b40

Address feedback

75c87b1

AetherUnbound approved these changes Oct 21, 2022

openverse_catalog/templates/template_provider.py_template (outdated, resolved)

Fix syntax

d2d8a48

Co-authored-by: Madison Swain-Bowden <[email protected]>

stacimc mentioned this pull request Oct 21, 2022

Document the use of each column and guideline for selection from sources WordPress/openverse#1410

Closed

2 tasks

stacimc merged commit a3f4a42 into main Oct 24, 2022

stacimc deleted the add/provider-data-ingester-docs branch October 24, 2022 20:35

This was referenced Apr 17, 2023

Remove legacy provider logic for DAG wrapper functions WordPress/openverse#1398

Closed

Refactor Europeana to use ProviderDataIngester #821

Merged
