This repository has been archived by the owner on Aug 4, 2023. It is now read-only.

Add docs and template ProviderDataIngester #790

Merged
merged 18 commits into from
Oct 24, 2022

Contributor

@stacimc stacimc commented Oct 10, 2022

Fixes

Fixes WordPress/openverse#1424 by @stacimc

Description

  • Adds a create_provider_ingester script that can be used to generate a templated provider api script and test file. The generated files include a lot of documentation and TODOs aimed at helping a new contributor flesh out an ingestion class. Credit to an earlier template script in this repository, from which I lifted quite a bit of code!
  • Adds the following docs (would love better title suggestions 😅):
    • adding_a_new_provider.md: Gives a brief overview of what it means to add a provider to Openverse, explains how to use the script, and briefly explains how to add a ProviderWorkflow to generate the actual DAG
    • provider_data_ingester_faq.md: Describes "advanced options" for implementing some common non-standard use cases in a ProviderDataIngester, framed as an FAQ
    • data_models.md: Extremely temporary documentation of our columns, meant to be replaced by the work in "Document the use of each column and guideline for selection from sources" (openverse#1410).
  • Expands some of the documentation in the actual ProviderDataIngester class

The intention is to avoid too much duplication of documentation from the code. If deemed necessary, we can look into auto-generating some from the docstrings, but my strategy was:

  • Script makes it easy to generate a stubbed out new ingester class
  • Comments and TODOs within the generated class lead the developer through implementing a "straightforward" provider
  • "Advanced" options are documented separately with examples

There's definitely a balancing act of where to document things in the code vs the docfile vs the template. My goal is mostly for this to be a good starting point!

Testing Instructions

  • Read the docs, make sure links work, etc. :)
  • Test the new script
    • just test
      • Verify that the test files (foobar_industries.py & test_foobar_industries.py) were deleted after the tests finished
    • Run it with some sample data, e.g.: just add-provider "Foo Museum" "https://foo-museum.org/api/v1/" image audio
      • Verify the foo_museum.py provider script and test_foo_museum.py files were generated
      • Inspect the files to make sure they make sense and that the name/endpoint/media types were templated in properly
      • Try again with different media types, including a single media type

Checklist

  • My pull request has a descriptive title (not a vague title like Update index.md).
  • My pull request targets the default branch of the repository (main) or a parent feature branch.
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added or updated tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no visible errors.

Developer Certificate of Origin

Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@stacimc stacimc added the documentation, 🟧 priority: high, and ✨ goal: improvement labels Oct 10, 2022
@stacimc stacimc self-assigned this Oct 10, 2022
- Breaks out into several files
- Removes documentation that is redundant (copied from code)
- Prefers documentation within the template
- Explicitly documents advanced options as FAQ
- Some small updates to the templating
@stacimc stacimc marked this pull request as ready for review October 13, 2022 01:13
@stacimc stacimc requested a review from a team as a code owner October 13, 2022 01:13
Member

@krysal krysal left a comment

This is fantastic 🤩 I started using the template to convert the Smithsonian script so I found some minor typos and details but this is looking so good and useful.

Thank you for writing such good documentation @stacimc! I'll get back after looking at the test files.

Member

@zackkrida zackkrida left a comment

We might want to consider sanitizing the provided provider name in these scripts, or alternatively excluding unsupported characters. In my test I used the following configuration:

python3 openverse_catalog/dags/templates/create_provider_ingester.py "Nappy.co" "https://api.nappy.co/v1/openverse/images" -m "image"

Which resulted in nappy.co.py filenames and a class Nappy.coDataIngester.

Member

@zackkrida zackkrida left a comment

@stacimc, this is amazing! I was able to use it to write my first DAG (and first python commit of my life, I believe 😅). Other than my inline comments, one thing I noticed was that by using

python3 openverse_catalog/dags/templates/create_provider_ingester.py "Nappy" "https://api.nappy.co/v1/openverse/images" -m "image"

it created a constant called NAPPY_IMAGE_PROVIDER and I was only able to get things working by renaming to NAPPY_DEFAULT_PROVIDER.

Edit: The above might actually be wrong. It may be because the docs comment is required at the top of the provider script and isn't part of the template. Here's the example of what I added to the Nappy pr:

"""
Content Provider:       Nappy

ETL Process:            Use the API to identify all CC0-licensed images.

Output:                 TSV file containing the image meta-data.

Notes:                  This api was written specially for Openverse.
                        There are no known limits or restrictions.

"""

I don't feel qualified to fully review the PR but this is really awesome!

@zackkrida
Member

One other thought. As I recall get_should_continue wasn't included in the template. That seems common enough to me that perhaps it could be included, but commented out?
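As a rough illustration of what such a commented-out hook might look like once enabled, here is a minimal sketch. It uses a stand-in class rather than the real `ProviderDataIngester`; the method signature, the `has_more` field, and the `ingest` loop are assumptions made for the example, not the project's actual API.

```python
from typing import Iterator


class SketchIngester:
    """Stand-in for illustration only; not the real ProviderDataIngester."""

    def __init__(self, pages: list):
        # Canned responses standing in for successive API calls.
        self.pages = pages

    def get_should_continue(self, response_json: dict) -> bool:
        # Hypothetical rule: keep going only while the API reports another page.
        return bool(response_json.get("has_more"))

    def ingest(self) -> Iterator[dict]:
        for response_json in self.pages:
            yield from response_json.get("results", [])
            if not self.get_should_continue(response_json):
                break


pages = [
    {"results": [{"id": 1}], "has_more": True},
    {"results": [{"id": 2}], "has_more": False},
    {"results": [{"id": 3}], "has_more": True},  # never reached
]
records = list(SketchIngester(pages).ingest())
```

The point of the hook is that the stopping rule lives in one small overridable method, so a provider with an unusual pagination signal only has to change that method.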

Member

@krysal krysal left a comment

I'm loving these docs! ✨
Noticing we can add syntax highlight 🐍

openverse_catalog/docs/provider_data_ingester_faq.md Outdated Show resolved Hide resolved
openverse_catalog/docs/provider_data_ingester_faq.md Outdated Show resolved Hide resolved
openverse_catalog/docs/provider_data_ingester_faq.md Outdated Show resolved Hide resolved
openverse_catalog/docs/provider_data_ingester_faq.md Outdated Show resolved Hide resolved
openverse_catalog/docs/provider_data_ingester_faq.md Outdated Show resolved Hide resolved
openverse_catalog/docs/provider_data_ingester_faq.md Outdated Show resolved Hide resolved
openverse_catalog/docs/adding_a_new_provider.md Outdated Show resolved Hide resolved
@stacimc
Copy link
Contributor Author

stacimc commented Oct 17, 2022

We might want to consider sanitizing the provided provider name in these scripts,

Added some extra sanitization to the provider name. It now replaces both spaces and periods with underscores, and then removes excluded chars (anything non-alphanumeric except hyphen and underscore). I think underscore replacement makes sense for period.
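The rules described above could be sketched roughly as follows. This is not the code from the script: `sanitize_provider` is a hypothetical name, and treating "alphanumeric" as ASCII letters and digits is an assumption.

```python
import re


def sanitize_provider(provider: str) -> str:
    """Hypothetical helper mirroring the sanitization rules described above."""
    # Step 1: replace spaces and periods with underscores.
    replaced = re.sub(r"[ .]", "_", provider)
    # Step 2: drop anything that is not a letter, digit, hyphen, or underscore
    # (ASCII-only here, which is an assumption).
    return re.sub(r"[^A-Za-z0-9_-]", "", replaced)


nappy = sanitize_provider("Nappy.co")
museum = sanitize_provider("Foo Museum!")
```

Doing the underscore replacement before the removal step matters: if periods were stripped first, "Nappy.co" would collapse to "Nappyco" instead of keeping the word boundary.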

it created a constant called NAPPY_IMAGE_PROVIDER and I was only able to get things working by renaming to NAPPY_DEFAULT_PROVIDER.

Edit: The above might actually be wrong. It may be because the docs comment is required at the top of the provider script and isn't part of the template. Here's the example of what I added to the Nappy pr:

@zackkrida I'm not sure what's happening here! I don't think the constant name should be an issue, since as noted in the TODOs you need to actually define the constant yourself in provider_details.py. The docs comment you reference also shouldn't be required (our other refactored DAGs don't have that comment) [1].

I tried playing around with this locally with a fake provider and it all works fine for me. Did you run just up?

Footnotes

  1. The fact that our refactored DAGs don't have this string is actually a problem though! We use it to generate our DAG documentation. I've created "Add DAG documentation for refactored DAGs" (openverse#1391) to make sure we add it back in, and I'll go add it to the template now too 😄 I'm really glad you mentioned it!

Member

@krysal krysal left a comment

These docs and templates are very helpful. Great work here! ⭐️

I found some minor non-blocking observations, tiny details.

Contributor

@AetherUnbound AetherUnbound left a comment

This is SO GREAT! 🥳 I haven't yet tried the testing steps, but here's a lot of small format suggestions and other potential improvements 🚀

Great job on this impressive documentation!

@@ -105,7 +117,7 @@ def __init__(self, conf: dict = None, date: str = None):
         self.delayed_requester = DelayedRequester(
             delay=self.delay, headers=self.headers
         )
-        self.media_stores = self.init_media_stores()
+        self.media_stores = self._init_media_stores()
Contributor

Really good call formalizing the public/private nature of these methods!

Comment on lines +67 to +69
* `provider_script`: the name of the file where you defined your `ProviderDataIngester` class
* `ingestion_callable`: the `ProviderDataIngester` class itself
* `media_types`: the media types your provider handles
Contributor

I'm realizing that we should make an issue for removing the legacy wrapper handling logic, which also means that we should be able to remove provider_script here eventually. I can make that issue 🙂 Also, I think we could even determine media_types automatically by using ProviderDataIngester::providers.keys() 😮 Lots of paring down we could do on these configs once the refactors are all done! 🥳
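The idea of deriving `media_types` from the ingester class could look something like the sketch below. `FooMuseumDataIngester`, its `providers` values, and `derive_media_types` are all hypothetical; the only thing taken from the comment is that a `providers` mapping is keyed by media type.

```python
class FooMuseumDataIngester:
    """Hypothetical ingester; only the `providers` mapping matters here."""

    # Assumed shape: media type -> provider name string.
    providers = {"image": "foo_museum", "audio": "foo_museum_audio"}


def derive_media_types(ingester_class) -> tuple:
    # The keys of `providers` already enumerate the handled media types,
    # so a separate `media_types` setting would be redundant.
    return tuple(ingester_class.providers.keys())


media_types = derive_media_types(FooMuseumDataIngester)
```

With something like this, a workflow configuration would only need to name the ingester class, and the media types would follow from it.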



def main():
    parser = argparse.ArgumentParser(
Contributor

Just a note, definitely doesn't need a change: Airflow comes with Click, so we could use that here for these cases if this becomes more cumbersome 🙂

"limit": self.batch_limit,
"cc": 1,
"offset": 0,
"api_key": Variable.get("API_KEY_{screaming_snake_provider}")
Contributor

😂 🐍 📣

@AetherUnbound
Contributor

Oh also, +1 to Zack's comment/request for a just recipe for running the provider creation script!


At a high level the steps are:

1. `generate_filename`: Generates a TSV filename used in later steps
Contributor

It's really difficult to gauge what level of background knowledge our readers have :)
Do you think we can expect the readers of this guide to know what a TSV filename is, or would it be good to add some explanation here? Something like 'Generates the name of a TSV (tab-separated values) text file that will be used for saving the data to disk in later steps.'

Contributor Author

I love this!

| *foreign_identifier* | Unique identifier for the record on the source site. |
| *thumbnail_url* | Direct link to a thumbnail-sized version of the record. |
| *filesize* | Size of the main file in bytes. |
| *filetype* | The filetype of the main file, eg. 'mp3', 'jpg', etc. |
Contributor

@obulat obulat Oct 21, 2022

Should we add something about the fact that filetype can be extracted from the file URL? And will be 'validated', that is, different names of the same filetype (jpeg - jpg) will be unified?

def _validate_filetype(self, filetype: str | None, url: str) -> str | None:
    """
    Extracts filetype from the media URL if filetype is None.
    Unifies filetypes that have variants such as jpg/jpeg and tiff/tif.
    :param filetype: Optional filetype string.
    :return: filetype string or None
    """
    if filetype is None:
        filetype = extract_filetype(url, self.media_type)
    if self.media_type != "image":
        return filetype
    return FILETYPE_EQUIVALENTS.get(filetype, filetype)
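To make the behaviour of the method above concrete, here is a self-contained sketch of the same logic. The contents of `FILETYPE_EQUIVALENTS` and the URL-based `extract_filetype` helper are assumptions written for this example; the real helpers may differ.

```python
from pathlib import PurePosixPath
from typing import Optional
from urllib.parse import urlparse

# Assumed contents: map each variant spelling to one canonical filetype.
FILETYPE_EQUIVALENTS = {"jpeg": "jpg", "tif": "tiff"}


def extract_filetype(url: str) -> Optional[str]:
    # Hypothetical stand-in: take the extension of the URL path, if any.
    suffix = PurePosixPath(urlparse(url).path).suffix
    return suffix.lstrip(".").lower() or None


def validate_filetype(
    filetype: Optional[str], url: str, media_type: str = "image"
) -> Optional[str]:
    if filetype is None:
        filetype = extract_filetype(url)
    if media_type != "image":
        # Variant unification only applies to images in this sketch.
        return filetype
    return FILETYPE_EQUIVALENTS.get(filetype, filetype)


from_url = validate_filetype(None, "https://example.org/a/photo.JPEG?size=l")
explicit = validate_filetype("tif", "https://example.org/a/scan")
audio = validate_filetype(None, "https://example.org/a/track.mp3", "audio")
```

An explicitly supplied filetype always wins over the URL; the URL is consulted only when the provider gives no filetype at all.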

| field_name | description |
| --- | --- |
| *duration* | Audio duration in milliseconds. |
| *bit_rate* | Audio bit rate as int. |
Contributor

I think we should add the measurement units here, too.


**Example**: You're pulling data from a Museum database, and each "record" in a batch contains multiple photos of a single physical object.

**Solution**: The `get_record_data` method takes a `data` object representing a single record from the provider API. Typically, it extracts required data and returns it as a single dict. However, it can also return a **list of dictionaries** for cases like the one described, where multiple Openverse records can be extracted.
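A minimal sketch of the list-returning case, written as a standalone function rather than a method on a real ingester. The shape of `data` (a `photos` list plus `title`) and the `url` key are invented for illustration; only `foreign_identifier` appears in the column documentation quoted in this thread.

```python
def get_record_data(data: dict) -> list:
    """Return one record dict per photo of a single physical object."""
    records = []
    for photo in data.get("photos", []):
        records.append(
            {
                # One Openverse record per photo, each with its own identifier.
                "foreign_identifier": photo["id"],
                "url": photo["url"],  # illustrative key name
                # Object-level metadata is shared by every photo.
                "title": data.get("title"),
            }
        )
    return records


museum_object = {
    "title": "Bronze figurine",
    "photos": [
        {"id": "obj1-front", "url": "https://example.org/obj1-front.jpg"},
        {"id": "obj1-back", "url": "https://example.org/obj1-back.jpg"},
    ],
}
records = get_record_data(museum_object)
```

Here a single provider "record" yields two dictionaries, one per photo, while metadata that belongs to the object itself is copied onto each.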
Contributor

It would be nice to add links to the examples of each of the solutions.

Contributor Author

I wanted to do this but was worried about code drift and making this document difficult to maintain 🤔

Contributor

Ooo, that's a really good point!

Contributor

@obulat obulat Oct 22, 2022

The problem with code drift and maintainability is really a good point :( I wonder if we could automate it somehow. Add a link to some regex search string that would lead to the example of, say, an overridden method in one of the solutions?...

Not necessary for this PR, but would be really nice to have. I usually learn by example and links to the examples of the solutions would be really helpful for the way I learn :)

):
    with template_path.open("r", encoding="utf-8") as template:
        camel_provider = inflection.camelize(provider)
        screaming_snake_provider = inflection.underscore(provider).upper()
Contributor

I've never heard about screaming snake case before 😆
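For readers who have not met the term either: "screaming snake case" is snake_case in all capitals, as used for constants. The sketch below approximates the two conversions in the snippet above with the standard library only; these are simplified stand-ins, not the `inflection` implementations, and they assume the provider name has already been reduced to a snake_case string such as `foo_museum`.

```python
import re


def camelize(snake: str) -> str:
    # "foo_museum" -> "FooMuseum": capitalize each underscore-separated part.
    return "".join(word[:1].upper() + word[1:] for word in snake.split("_"))


def screaming_snake(name: str) -> str:
    # Insert an underscore at each lower-to-upper boundary, then upper-case.
    with_breaks = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name)
    return with_breaks.replace("-", "_").upper()


class_name = f"{camelize('foo_museum')}DataIngester"
variable_name = f"API_KEY_{screaming_snake('foo_museum')}"
```

The camel-cased form feeds the generated class name, and the screaming-snake form feeds constant-style names such as the `API_KEY_...` Airflow Variable seen in the template.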

Contributor Author

stacimc commented Oct 21, 2022

Thank you so much for all the feedback! I've also added a just recipe, so we can now run:

just add-provider "Foo Museum" https://foo.museum.org/api/v1 image audio

It does not use the run command since I don't think there's any reason to ever run this using the webserver image, but it does save folks from needing to find the path to the script, and it keeps all our commands in one place :)

Contributor

@AetherUnbound AetherUnbound left a comment

How exciting!! 💃🏼 I tried this out locally and it ran great. There's just one syntax issue in the generated script, otherwise this is good to go IMO

$ j add-provider "The Word For World Is Forest" http://the.forest image
python3 openverse_catalog/templates/create_provider_ingester.py "The Word For World Is Forest" "http://the.forest" -m image
Creating files in /home/madison/git-a8c/openverse-catalog
API script:        openverse_catalog/dags/providers/provider_api_scripts/the_word_for_world_is_forest.py
API script test:   tests/dags/providers/provider_api_scripts/test_the_word_for_world_is_forest.py

NOTE: You will also need to add a new ProviderWorkflow dataclass configuration to the PROVIDER_WORKFLOWS list in `openverse-catalog/dags/providers/provider_workflows.py`.

Co-authored-by: Madison Swain-Bowden <[email protected]>
Labels
documentation, ✨ goal: improvement, 🟧 priority: high
Development

Successfully merging this pull request may close these issues.

Create a ProviderDataIngester template/documentation
5 participants