Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DEV-1381 use delta computation for dailies with hathifiles-database & report on those changes #22

Merged
merged 12 commits into from
Jan 24, 2025

Conversation

moseshll
Copy link
Contributor

@moseshll moseshll commented Jan 2, 2025

  • Rename MonthlyUpdate class to DeltaUpdate
  • Change hathifiles_database_full_update to use DeltaUpdate for monthly and update hathifiles
  • Add statistics method for evaluating actual work done for a given hathifile
  • Each hathifile is now processed in its own tempdir instead of one for all of the files to be processed (may mitigate excessive disk usage on the first of month).
  • Using additional comm tricks we can get a count of records added vs updated, at the expense of additional time and storage. Fortunately, these additional operations won't happen if you don't call delta_update_obj.statistics.
  • HATHIFILES_MYSQL_CONNECTION replaced with kwargs defaulting to ENV
  • All HATHIFILES_MYSQL_* vars migrated to MARIADB_HATHIFILES_RW_*
  • New section of README documenting ENV variables.
  • Deprecated exe/ files removed (hathifiles_database_full_update is the only survivor).

- Change `hathifiles_database_full_update` to use `DeltaUpdate` for monthly and update hathifiles
- Add `statistics` method for evaluating actual work done for a given hathifile
@moseshll moseshll marked this pull request as draft January 2, 2025 18:58
@moseshll moseshll marked this pull request as ready for review January 10, 2025 18:56
@moseshll moseshll requested a review from aelkiss January 10, 2025 18:56
@moseshll
Copy link
Contributor Author

moseshll commented Jan 10, 2025

I have a reservation about allowing database credentials from ENV to be overridden by the DB::Connection kwargs since those values are not available to the Dumper class: it always uses ENV.

Maybe just add a note to the README that relying on ENV is the preferred way to go.

@aelkiss
Copy link
Member

aelkiss commented Jan 13, 2025

Haven't yet reviewed; based on the description I think before deploying (even in testing) we need to get https://github.com/hathitrust/ht_tanka/tree/database-secrets merged.

Copy link
Member

@aelkiss aelkiss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The database env var / connection changes make sense to me. I need to spend some more time looking at the delta computation.

Copy link
Member

@aelkiss aelkiss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This all looks great to me; tests make sense; I appreciate all the comments/documentation in the delta_update class.

As previously mentioned we will need to make sure that the hathifiles pod has sufficient working space before doing this and also add in the new secrets/env vars (https://github.com/hathitrust/ht_tanka/pull/129).

This should be useful for holdings in terms of looking at possible new/changed items and updating its mapping from ocn to clusters as needed; thinking about the best way to get that data to holdings is future work.

@aelkiss
Copy link
Member

aelkiss commented Jan 16, 2025

Reminders:

@aelkiss
Copy link
Member

aelkiss commented Jan 16, 2025

FYI @niquerio This is a significant change to how hathifiles-database works on a daily basis. No rush to update to this version right away when merged, but you may want to take a look especially after we test this out on our end and make sure it all works.

@aelkiss aelkiss merged commit 0939c84 into main Jan 24, 2025
2 checks passed
@aelkiss aelkiss deleted the DEV-1381_daily_delta branch January 24, 2025 15:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants