-
Notifications
You must be signed in to change notification settings - Fork 54
Conversation
for field in [ | ||
"foreign_identifier", | ||
"foreign_landing_url", | ||
f"{self.media_type}_url", | ||
]: | ||
if media_data.get(field) is None: | ||
logger.debug(f"Discarding media due to missing {field}") | ||
return None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't believe this wasn't checked before 😱 Well noticed. However, I wonder whether just ignoring items when missing fields at this point, in the media store class, is the right approach. I am slightly inclined to raise an exception here and have each provider DAG handle theses cases explicitly. Sometimes, if a field is not found another from the source can be used in replacement, e.g. this is common in the creator_url
. These errors have also helped me realize some gaps in the logic in a few cases!
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sometimes, if a field is not found another from the source can be used in replacement, e.g. this is common in the creator_url
That's a really great point -- raising an exception here might help us spot these opportunities in some cases, and even in cases where the record should ultimately be discarded, it'll be an easy fix in the provider class. I'll update!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Following my previous comment, shouldn't we also add the field validations in the Metropolitan DAG? It could skip processing for more than one image, cutting times a bit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like what y'all have proposed & discussed! Given how rapidly we are able to iterate on ingestion DAGs, I like the idea of failing sooner and addressing issues on a per-provider basis. That, in its own way, can serve as a data health check - if the data changes and our assumptions no longer hold, we want to know about it.
I have one logging suggestion which is blocking, but otherwise this looks great!
openverse_catalog/dags/providers/provider_api_scripts/metropolitan_museum.py
Outdated
Show resolved
Hide resolved
openverse_catalog/dags/providers/provider_api_scripts/metropolitan_museum.py
Outdated
Show resolved
Hide resolved
@@ -63,6 +63,13 @@ | |||
], | |||
) | |||
|
|||
TEST_REQUIRED_FIELDS = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great!
Fixes
Fixes WordPress/openverse#1281 by @stacimc
Description
This PR:
ValueError
in the MediaStore'sclean_media_metadata
if any of the required fieldsforeign_landing_url
,url
, andforeign_identifier
are missing.isPublicDomain
key (in addition to explicitFalse
)Testing Instructions
View the new tests.
The Metropolitan issue is hard to reproduce organically. You can manually update the metropolitan DAG code to add a
None
image here:Then run the DAG and ensure that you do not see any errors (because no records with
None
url are passed to the MediaStore).You can test that the error handling in the media store is working by hard-coding
None
for one of the required fields in the results here. Then you should see the task fail when you run the DAG, with an error message like:Checklist
Update index.md
).main
) ora parent feature branch.
errors.
Developer Certificate of Origin
Developer Certificate of Origin