-
Notifications
You must be signed in to change notification settings - Fork 54
Add docs and template ProviderDataIngester #790
Conversation
- Breaks out into several files - Removes documentation that is redundant (copied from code) - Prefers documentation within the template - Explicitly documents advanced options as FAQ - Some small updates to the templating
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is fantastic 🤩 I started using the template to convert the Smithsonian script so I found some minor typos and details but this is looking so good and useful.
Thank you for writing such good documentation @stacimc! I'll get back after looking at the test files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We might want to consider sanitizing the provided provider name in these scripts, or alternatively excluding unsupported characters. In my test I used the following configuration:
python3 openverse_catalog/dags/templates/create_provider_ingester.py "Nappy.co" "https://api.nappy.co/v1/openverse/images" -m "image"`
Which resulted in nappy.co.py
filenames and a class Nappy.coDataIngester
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@stacimc, this is amazing! I was able to use it to write my first DAG (and first python commit of my life, I believe 😅). Other than my inline comments, one thing I noticed was that by using
python3 openverse_catalog/dags/templates/create_provider_ingester.py "Nappy" "https://api.nappy.co/v1/openverse/images" -m "image"
it created a constant called NAPPY_IMAGE_PROVIDER
and I was only able to get things working by renaming to NAPPY_DEFAULT_PROVIDER
.
Edit: The above might actually be wrong. It may be because the docs comment is required at the top of the provider script and isn't part of the template. Here's the example of what I added to the Nappy pr:
"""
Content Provider: Nappy
ETL Process: Use the API to identify all CC0-licensed images.
Output: TSV file containing the image meta-data.
Notes: This api was written specially for Openverse.
There are no known limits or restrictions.
"""
I don't feel qualified to fully review the PR but this is really awesome!
One other thought. As I recall |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm loving these docs! ✨
Noticing we can add syntax highlight 🐍
Co-authored-by: Zack Krida <[email protected]>
Added some extra sanitization to the provider name. It now replaces both spaces and periods with underscores, and then removes excluded chars (anything non-alphanumeric except hyphen and underscore). I think underscore replacement makes sense for period.
@zackkrida I'm not sure what's happening here! I don't think the constant name should be an issue, since as noted in the TODOs you need to actually define the constant yourself in I tried playing around with this locally with a fake provider and it all works fine for me. Did you run Footnotes
|
Co-authored-by: Krystle Salazar <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These docs and templates are very helpful. Great work here! ⭐️
I found some minor non-blocking observations, tiny details.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is SO GREAT! 🥳 I haven't yet tried the testing steps, but here's a lot of small format suggestions and other potential improvements 🚀
Great job on this impressive documentation!
@@ -105,7 +117,7 @@ def __init__(self, conf: dict = None, date: str = None): | |||
self.delayed_requester = DelayedRequester( | |||
delay=self.delay, headers=self.headers | |||
) | |||
self.media_stores = self.init_media_stores() | |||
self.media_stores = self._init_media_stores() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really good call formalizing the public/private nature of these methods!
* `provider_script`: the name of the file where you defined your `ProviderDataIngester` class | ||
* `ingestion_callable`: the `ProviderDataIngester` class itself | ||
* `media_types`: the media types your provider handles |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm realizing that we should make an issue for removing the legacy wrapper handling logic, which also means that we should be able to remove provider_script
here eventually. I can make that issue 🙂 Also, I think we could even determine media_types
automatically by using ProviderDataIngester::providers.keys()
😮 Lots of pairing down we could do on these configs once the refactors are all done! 🥳
|
||
|
||
def main(): | ||
parser = argparse.ArgumentParser( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a note, definitely doesn't need a change: Airflow comes with Click, so we could use that here for these cases if this becomes more cumbersome 🙂
"limit": self.batch_limit, | ||
"cc": 1, | ||
"offset": 0, | ||
"api_key": Variable.get("API_KEY_{screaming_snake_provider}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
😂 🐍 📣
Oh also, +1 to Zack's comment/request for a |
|
||
At a high level the steps are: | ||
|
||
1. `generate_filename`: Generates a TSV filename used in later steps |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's really difficult to gauge what level of background knowledge our readers have :)
Do you think we can expect the readers of this guide to know what TSV filename
is, or would it be good to add some explanation here? Something like 'Generates the name of a TSV (tab-separated values) text file that will be used for saving the data to the disk in later steps.'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I love this!
| *foreign_identifier* | Unique identifier for the record on the source site. | | ||
| *thumbnail_url* | Direct link to a thumbnail-sized version of the record. | | ||
| *filesize* | Size of the main file in bytes. | | ||
| *filetype* | The filetype of the main file, eg. 'mp3', 'jpg', etc. | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we add something about the fact that filetype can be extracted from the file URL? And will be 'validated', that is, different names of the same filetype (jpeg - jpg) will be unified?
openverse-catalog/openverse_catalog/dags/common/storage/media.py
Lines 293 to 304 in 4a9c008
def _validate_filetype(self, filetype: str | None, url: str) -> str | None: | |
""" | |
Extracts filetype from the media URL if filetype is None. | |
Unifies filetypes that have variants such as jpg/jpeg and tiff/tif. | |
:param filetype: Optional filetype string. | |
:return: filetype string or None | |
""" | |
if filetype is None: | |
filetype = extract_filetype(url, self.media_type) | |
if self.media_type != "image": | |
return filetype | |
return FILETYPE_EQUIVALENTS.get(filetype, filetype) |
| field_name | description | | ||
| --- | --- | | ||
| *duration* | Audio duration in milliseconds. | | ||
| *bit_rate* | Audio bit rate as int. | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should add the measurement units here, too.
|
||
**Example**: You're pulling data from a Museum database, and each "record" in a batch contains multiple photos of a single physical object. | ||
|
||
**Solution**: The `get_record_data` method takes a `data` object representing a single record from the provider API. Typically, it extracts required data and returns it as a single dict. However, it can also return a **list of dictionaries** for cases like the one described, where multiple Openverse records can be extracted. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice to add links to the examples of each of the solutions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wanted to do this but was worried about code drift and making this document difficult to maintain 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ooo, that's a really good point!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem with code drift and maintainability is really a good point :( I wonder if we could automate it somehow. Add a link to some regex search string that would lead to the example of, say, an overridden method in one of the solutions?...
Not necessary for this PR, but would be really nice to have. I usually learn by example and links to the examples of the solutions would be really helpful for the way I learn :)
): | ||
with template_path.open("r", encoding="utf-8") as template: | ||
camel_provider = inflection.camelize(provider) | ||
screaming_snake_provider = inflection.underscore(provider).upper() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've never heard about screaming snake case before 😆
Thank you so much for all the feedback! I've also added a
It does not use the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How exciting!! 💃🏼 I tried this out locally and it ran great. There's just one syntax issue in the generated script, otherwise this is good to go IMO
$ j add-provider "The Word For World Is Forest" http://the.forest image
python3 openverse_catalog/templates/create_provider_ingester.py "The Word For World Is Forest" "http://the.forest" -m image
Creating files in /home/madison/git-a8c/openverse-catalog
API script: openverse_catalog/dags/providers/provider_api_scripts/the_word_for_world_is_forest.py
API script test: tests/dags/providers/provider_api_scripts/test_the_word_for_world_is_forest.py
NOTE: You will also need to add a new ProviderWorkflow dataclass configuration to the PROVIDER_WORKFLOWS list in `openverse-catalog/dags/providers/provider_workflows.py`.
Co-authored-by: Madison Swain-Bowden <[email protected]>
Fixes
Fixes WordPress/openverse#1424 by @stacimc
Description
create_provider_ingester
script that can be used to generate a templated provider api script and test file. The generated files include a lot of documentation and TODOs aimed at helping a new contributor flesh out an ingestion class. Credit to an earlier template script in this repository, from which I lifted quite a bit of code!adding_a_new_provider.md
: Details a very brief overview of what it even means to add a provider to Openverse, explains how to use the script, and also a brief explanation of how to add aProviderWorkflow
to generate the actual DAGprovider_data_ingester_faq.md
: Describes "advanced options" for implementing some common non-standard use cases in a ProviderDataIngester, framed as an FAQdata_models.md
: Extremely temporary documentation of our columns, meant to be replaced by Document the use of each column and guideline for selection from sources openverse#1410.ProviderDataIngester
classThe intention is to avoid too much duplication of documentation from the code. If deemed necessary, we can look into auto-generating some from the doc strings, but my strategy was:
There's definitely a balancing act of where to document things in the code vs the docfile vs the template. My goal is mostly for this to be a good starting point!
Testing Instructions
just test
foobar_industries.py
&test_foobar_industries.py
) were deleted after the tests finishedjust add-provider "Foo Museum" "https://foo-museum.org/api/v1/" image audio
foo_museum.py
provider script andtest_foo_museum.py
files were generatedChecklist
Update index.md
).main
) or a parent feature branch.Developer Certificate of Origin
Developer Certificate of Origin