Use Python to group items by license to speed up the query #1045
Signed-off-by: Olga Bulat <[email protected]>
I've got a few comments for format improvements, but I was able to test this locally and it ran great!
updated_by_license = {}

for license_pair, identifiers in records_to_update.items():
    license_, license_version = license_pair.split(",")
If we're saving this as a string only to split it out again, we could probably use the license & license version pair in tuple form as the key itself, e.g. `records_to_update[license_, license_version]`. Then we can unpack it directly here too, e.g. `for (license_, license_version), identifiers in records_to_update.items()`.
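A minimal sketch of the tuple-key suggestion above, for illustration only. The sample rows and their shape are hypothetical and not taken from the PR; only `records_to_update`, `license_`, `license_version`, and `identifiers` come from the discussion.

```python
from collections import defaultdict

# Hypothetical sample rows: (identifier, license, license_version).
rows = [
    ("id-1", "by", "4.0"),
    ("id-2", "by-nc", "2.0"),
    ("id-3", "by", "4.0"),
]

# Key the mapping by the (license, version) tuple instead of a joined string.
records_to_update = defaultdict(list)
for identifier, license_, license_version in rows:
    records_to_update[license_, license_version].append(identifier)

# The tuple key unpacks directly in the loop header, so no split(",") is needed.
for (license_, license_version), identifiers in records_to_update.items():
    print(license_, license_version, identifiers)
```

Because tuples are hashable, `records_to_update[license_, license_version]` works as a plain dictionary lookup, and the key never has to be parsed back out of a string.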
Co-authored-by: Madison Swain-Bowden <[email protected]>
Signed-off-by: Olga Bulat <[email protected]>
Looks and works great!
Heh, what a great test set to have! 😄
Thank you for your testing, it's more than what I could hope for, @stacimc 😆 I would not have guessed that the most popular license would be by-nd-nc/2.0/jp
I ran the DAG in prod, and this is the log output:
From the logs, here are some problems with these 9382 records:
Fixes
Fixes WordPress/openverse#1270
Description
This PR updates the second step in the DAG: instead of running many `SELECT` queries, it `SELECT`s all items with `NULL` in `meta_data`, groups the items to update using a Python function, and then runs one `UPDATE` query per license pair (if necessary). Hopefully, this should be much faster than the previous run.

This PR also sets the trigger rule for the last step, because otherwise it was skipped if the second step was skipped.
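The grouping approach described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's actual code: the function names, the `image` table name, the `identifier` column, the `%s` parameter style, and the placeholder `meta_data` payload are all hypothetical.

```python
import json


def group_by_license(rows):
    """Group identifiers by (license, license_version) in Python."""
    groups = {}
    for identifier, license_, license_version in rows:
        groups.setdefault((license_, license_version), []).append(identifier)
    return groups


def build_update_statements(groups, table="image"):
    """Build one parameterized UPDATE per license pair (hypothetical schema)."""
    statements = []
    for (license_, license_version), identifiers in groups.items():
        if not identifiers:
            continue  # nothing to update for this pair
        # Placeholder payload; the real meta_data contents are not shown here.
        payload = json.dumps(
            {"license": license_, "license_version": license_version}
        )
        sql = f"UPDATE {table} SET meta_data = %s WHERE identifier = ANY(%s)"
        statements.append((sql, (payload, identifiers)))
    return statements


# Hypothetical result of the single SELECT for rows with NULL meta_data.
rows = [
    ("id-1", "by", "4.0"),
    ("id-2", "by-nc", "2.0"),
    ("id-3", "by", "4.0"),
]
statements = build_update_statements(group_by_license(rows))
```

With this shape, the number of `UPDATE` statements depends on the number of distinct license pairs rather than on the number of rows, which is where the expected speed-up comes from.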
Testing instruction
Same as #1005.