DEV-1381 use delta computation for dailies with hathifiles-database & report on those changes #22
- Change `hathifiles_database_full_update` to use `DeltaUpdate` for monthly and update hathifiles
- Add `statistics` method for evaluating actual work done for a given hathifile
- Excise `ENV["HATHIFILES_MYSQL_CONNECTION"]` from `Hathifiles` initializer
- Add idempotency and daily "no deletions" test to delta update spec
- Change all `HATHIFILES_MYSQL_*` env vars to `MARIADB_HATHIFILES_RW_*`
I have a reservation about allowing database credentials from ENV to be overridden by the DB::Connection. Maybe just add a note to the README that relying on ENV is the preferred way to go.
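For reference, the override order under discussion could look roughly like the sketch below. This is not the gem's actual code: `connection_settings`, the `env:` parameter, and the variable suffixes (`USERNAME`, `PASSWORD`, `HOST`, `DATABASE`) are all hypothetical; only the `MARIADB_HATHIFILES_RW_` prefix comes from this PR.

```ruby
# Hypothetical helper, for illustration only (not the hathifiles_database API).
# Only the MARIADB_HATHIFILES_RW_ prefix is taken from the PR; the suffixes are assumed.
ENV_PREFIX = "MARIADB_HATHIFILES_RW_"

# Build connection settings from the environment, letting explicit keyword
# arguments override individual values. `env` is injectable so the behavior
# can be exercised without touching the real ENV.
def connection_settings(env: ENV, **kwargs)
  defaults = {
    user: env["#{ENV_PREFIX}USERNAME"],
    password: env["#{ENV_PREFIX}PASSWORD"],
    host: env["#{ENV_PREFIX}HOST"],
    database: env["#{ENV_PREFIX}DATABASE"]
  }
  # nil kwargs are dropped, so an unset argument falls back to the env value.
  defaults.merge(kwargs.compact)
end
```

With this shape, calling `connection_settings` with no arguments relies entirely on ENV, which matches the "ENV is the preferred way" note suggested for the README; passing `host: "..."` overrides just that one value.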
Haven't yet reviewed; based on the description I think before deploying (even in testing) we need to get https://github.com/hathitrust/ht_tanka/tree/database-secrets merged.
The database env var / connection changes make sense to me. I need to spend some more time looking at the delta computation.
This all looks great to me; tests make sense; I appreciate all the comments/documentation in the delta_update class.
As previously mentioned, we will need to make sure that the hathifiles pod has sufficient working space before doing this, and also add the new secrets/env vars (https://github.com/hathitrust/ht_tanka/pull/129).
This should be useful for holdings in terms of looking at possible new/changed items and updating its mapping from ocn to clusters as needed; thinking about the best way to get that data to holdings is future work.
Reminders:
FYI @niquerio This is a significant change to how hathifiles-database works on a daily basis. No rush to update to this version right away when merged, but you may want to take a look, especially after we test this out on our end and make sure it all works.
- Change `MonthlyUpdate` class to `DeltaUpdate`
- Change `hathifiles_database_full_update` to use `DeltaUpdate` for monthly and update hathifiles
- Add `statistics` method for evaluating actual work done for a given hathifile
- Use a separate `tempdir` for each file instead of one for all of the files to be processed (may mitigate excessive disk usage on the first of the month).
- With `comm` tricks we can get a count of records added vs updated, at the expense of additional time and storage. Fortunately, these additional operations won't happen if you don't call `delta_update_obj.statistics`.
- `HATHIFILES_MYSQL_CONNECTION` replaced with `kwargs` defaulting to `ENV`
- `HATHIFILES_MYSQL_*` vars migrated to `MARIADB_HATHIFILES_RW_*`
- `exe/` files removed (`hathifiles_database_full_update` is the only survivor).
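To make the delta idea concrete, here is a minimal in-memory sketch of a `comm`-style comparison between the previous and current hathifile. It is an illustration under assumptions, not the `DeltaUpdate` implementation: the method names `delta` and `delta_statistics` are hypothetical, records are assumed to be tab-separated lines whose first field is the item id, and reporting a `deleted` count is my own addition (the description above only mentions added vs updated).

```ruby
require "set"

# Hypothetical stand-in for running comm(1) on two sorted files.
# Returns lines that appear in only one of the two inputs.
def delta(old_lines, new_lines)
  old_set = old_lines.to_set
  new_set = new_lines.to_set
  {
    additions: new_lines.reject { |line| old_set.include?(line) }, # like `comm -13 old new`
    deletions: old_lines.reject { |line| new_set.include?(line) }  # like `comm -23 old new`
  }
end

# Assumed record layout: first tab-separated field is the item id.
def record_id(line)
  line.split("\t", 2).first
end

# Classify changed lines: an "addition" whose id already existed is an update;
# a "deletion" whose id still exists is just the old half of an update.
def delta_statistics(old_lines, new_lines)
  d = delta(old_lines, new_lines)
  old_ids = old_lines.map { |line| record_id(line) }.to_set
  new_ids = new_lines.map { |line| record_id(line) }.to_set
  updated, added = d[:additions].partition { |line| old_ids.include?(record_id(line)) }
  deleted = d[:deletions].reject { |line| new_ids.include?(record_id(line)) }
  {added: added.size, updated: updated.size, deleted: deleted.size}
end
```

Comparing a file against itself yields an empty delta, which is the property the idempotency spec relies on. The extra id bookkeeping in `delta_statistics` is the kind of additional time and storage that is only paid when statistics are actually requested.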