Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HT-2835: Daily Hathifile loading #78

Merged
merged 4 commits into from
Apr 23, 2021
Merged

Conversation

aelkiss
Copy link
Member

@aelkiss aelkiss commented Apr 19, 2021

This orchestrates loading hathifiles on a daily basis. In particular, it adds:

  • a database table, currently called holdings_files, to track what files we have loaded on a daily basis or not
  • a FileMutex class to ensure processes aren't trying to run reports while we're loading data

It also adds the notion of top-level configuration separately from Services, which should help some to reduce the number of environment variables and generally make configuration more visible.

TODO before this is fully ready for review:

aelkiss added 2 commits April 19, 2021 17:15
- Always revalidate connection
- Ensure if a test fails that any injections are reset
@aelkiss aelkiss changed the title Hathifile loading HT-2835: Daily Hathifile loading Apr 19, 2021
@aelkiss aelkiss force-pushed the HT-2835-hathifile-loading branch 2 times, most recently from b54b270 to ff45f17 Compare April 20, 2021 13:29
aelkiss added 2 commits April 20, 2021 09:30
- ensure MONGOID_ENV is set in specs
- ensure specs use spec_helper to load settings
- ensure environment gets set in a usable way by default
- remove environment variables from places other than config
- makes getting config for Sequel database match keycard/checkpoint
@aelkiss aelkiss force-pushed the HT-2835-hathifile-loading branch from ff45f17 to 3ab6f06 Compare April 20, 2021 13:31
@aelkiss aelkiss marked this pull request as ready for review April 20, 2021 13:31
@aelkiss
Copy link
Member Author

aelkiss commented Apr 20, 2021

Given the changes for config here we will need to adjust our Kubernetes cronjob / pod configuration to include appropriate configuration. I can work on that, including adding config for a cron job for this, once we think this is headed in the right direction.

unrestrict_primary_key

def self.latest(**kwargs)
where(**kwargs).order_by(Sequel.desc(:produced)).limit(1).first
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The splat here and below on line 16 are unnecessary, I think

@@ -1,4 +1,4 @@
GRANT USAGE ON *.* TO 'ht_repository'@'%' IDENTIFIED BY 'ht_repository';
GRANT SELECT ON `ht_rights`.* TO 'ht_repository'@'%';
GRANT SELECT, INSERT, UPDATE, DELETE, LOCK TABLES ON `ht_repository`.* TO 'ht_repository'@'%';
GRANT SELECT, INSERT, UPDATE, DELETE, LOCK TABLES ON `ht`.* TO 'ht_repository'@'%';
GRANT ALL ON `ht_repository`.* TO 'ht_repository'@'%';
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This probably isn't necessary -- I had changed this because at one point it was doing a TRUNCATE in the tests, which needed drop table privilege, but now it's doing it with transactions.

# Responsible for locating and loading a single Hathifile, a file containing
# HtItems represented as tab-separated values.
#
# Usage: Hathifile.new(date).load
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add expected date format to this documentation.

**kwargs)
# Connection connects to the database using the connection information
# specified by Settings.database, which should contain keyword parameters
# matching those taken by Sequel.connect.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not accurate

Copy link
Contributor

@mwarin mwarin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the Ettin/Settings stuff.
Haven't seen a lot of the rspec lingo you are using.
FileMutex currently allows any path as lockfile. I think we should specify a list of allowed paths, perhaps in a specified dir, just to make sure we don't proliferate lockfiles all over the file system.

No reason not to approve.

@@ -0,0 +1,7 @@
# frozen_string_literal: true

require "ettin"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a real objection, but it does not appear there is a current maintainer of Ettin.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True. The canonical homes are http://github.com/mlibrary/ettin and http://github.com/mlibrary/canister and we can certainly update it as needed. @botimer will also reach out to @malakai97 and make sure that we (as MLibrary) can publish new versions on rubygems as needed.

@aelkiss
Copy link
Member Author

aelkiss commented Apr 20, 2021

Haven't seen a lot of the rspec lingo you are using.

Is there a specific rspec thing you're seeing that isn't clear? I'm certainly open to changing it especially in cases where what a test is testing or where its data is coming from isn't obvious

Copy link
Contributor

@jsteverman jsteverman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we expect to use FileMutex and LoadedFile for Holdings/OCNResolutions as well?

This all looks reasonable to me. HoldingsDB is an ugliness I would like to deal with at some point, but it's not new and is not a pressing matter.

@aelkiss
Copy link
Member Author

aelkiss commented Apr 20, 2021

Yes, I'm expecting we'll want to use these for the other kinds of data as well, although it's not 100% clear to me when we need to use the FileMutex.

@aelkiss aelkiss merged commit cda7bdf into master Apr 23, 2021
@aelkiss aelkiss deleted the HT-2835-hathifile-loading branch April 23, 2021 14:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants