-
Notifications
You must be signed in to change notification settings - Fork 54
Discard audio file when preview 404s fetching filesize #973
Conversation
TODO(obulat): move filesize detection to the polite crawler | ||
""" | ||
return int(requests.head(url).headers["content-length"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like the TODO task is done 😄 and we can remove the comment 👏
Aside from that, LGTM!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well not quite 😅 Since we don't have the polite crawler running at all. But I agree, we can remove the comment 🙂
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good! One important thing this changes - we stop processing if we're unable to find the filesize, regardless of whether or not its because the preview 404'd or if it was a different reason that prevented us from doing so. I think that's fine, but I wanted to make it explicit that this is our new approach for this DAG since we're not differentiating on a 404 specifically (and I think it'd be particularly difficult to do with the delayed requester 😅).
TODO(obulat): move filesize detection to the polite crawler | ||
""" | ||
return int(requests.head(url).headers["content-length"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well not quite 😅 Since we don't have the polite crawler running at all. But I agree, we can remove the comment 🙂
@AetherUnbound Oh, so I see the other cases you mentioned in code comments:
But what kind of handling was you expecting here to happen? I think we already do the retry with the use of the delayed requester (that I confused with the crawler thing, my bad!) and don't see what else can be done in the different situations, other than discarding the track. |
That's a good point of clarification @krysal, there's quite a few things happening with this particular request! In the cases you point out, the The point with my comment was merely that it'd be possible for Freesound to return a 500 (or some other non-200 error code) consistently for all retries, but still potentially reference a valid record. In this case we'd discard the track entirely since the preview was not available (at the time). Previously this would raise an exception and potentially allow us to retry, now in that case the ingester will just move on. I think it's a very small edge case that we're not likely to encounter, but I wanted to point it out nonetheless 🙂 |
Fixes
Fixes WordPress/openverse#1321 by @AetherUnbound
Fixes WordPress/openverse#1578 by @AetherUnbound
Description
DelayedRequester::head
callNone
early and skip ingestion for these records.Testing Instructions
Check out the new unit tests.
Run the Freesound DAG locally and verify that it still works.
One way to force a 404 for testing is to update the
_get_audio_file_size
method to append some nonsense to the end of the URL before making the request:Then re-run the DAG. You should see a lot of logs that look like this:
Manually
Mark Success
thepull_data
step once you have verified these logs, or else it will try and fail for all records. Then verify inreport_load_completion
that no records were actually ingested.Checklist
Update index.md
).main
) ora parent feature branch.
errors.
Developer Certificate of Origin
Developer Certificate of Origin