Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HT-2771 combine n_chron/n_enum for matching purposes #69

Merged
merged 3 commits into from
Feb 11, 2021

Conversation

jsteverman
Copy link
Contributor

n_enum_chron = [n_enum, n_chron].join(" ").sub(/^ $/, '')

Uses of n_enum for comparison have been replaced by n_enum_chron.

Copy link
Member

@aelkiss aelkiss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is n_enum_chron populated? Does this require re-loading all data, or will it be automatically computed from n_enum and n_chron if it is missing?

@jsteverman
Copy link
Contributor Author

How is n_enum_chron populated? Does this require re-loading all data, or will it be automatically computed from n_enum and n_chron if it is missing?

Some matching operations use database searches so a dynamic creationg of n_enum_chron won't work. It uses the EnumChron module which populates n_enum, n_chron, and now n_enum_chron when h.enum_chron is set. A migration would look something like holdings.map {|h| h.normalize_enum_chron.tap(&:save)

@aelkiss
Copy link
Member

aelkiss commented Jan 25, 2021

How is n_enum_chron populated? Does this require re-loading all data, or will it be automatically computed from n_enum and n_chron if it is missing?

Some matching operations use database searches so a dynamic creationg of n_enum_chron won't work. It uses the EnumChron module which populates n_enum, n_chron, and now n_enum_chron when h.enum_chron is set. A migration would look something like holdings.map {|h| h.normalize_enum_chron.tap(&:save)

OK, thanks for the clarification. Let's think about whether it makes more sense to do that migration or reload fresher data -- I think it depends a bit to me on how close you think we are to identifying all the kinds of issues in matching. If you think we're close, then it probably makes sense to do the migration & verify that we don't see any mismatches. If you suspect there's still significantly more issues to uncover, then perhaps it would be better to load fresher data.

@jsteverman jsteverman force-pushed the HT-2771_n_chron_enum_matching branch from 48629a3 to 623e5b4 Compare January 26, 2021 20:09
@aelkiss aelkiss requested a review from billdueber January 29, 2021 15:19
@jsteverman jsteverman force-pushed the HT-2771_n_chron_enum_matching branch from 623e5b4 to 0bc1635 Compare February 2, 2021 15:42
Copy link
Contributor

@billdueber billdueber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple comments, one a note about the query getting too much stuff in a way this code doesn't care, and the other about having to deal with trailing spaces (where, I think, there's an enum but no chron?). That latter feels messy to me, like we should put in a hook on save that strips off leading/trailing spaces.

def records_with_enum_chrons
return enum_for(:records_with_enum_chrons) unless block_given?

Cluster.where("$or": [{ "holdings.enum_chron": { "$ne": "" } },
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did some experiments and I'm pretty sure this will match clusters that have any of the following:

  • fulfill the criteria you want
  • have ht_items or ht_holdings unset (e.g., one ht_item with no enum_chron, and no holdings at all)
  • have ht_items or ht_holdings set to the empty array (because something got moved?)

Not a huge deal in this case because they'll be no-opped in the loop, but useful to know in other contexts, I would think.

@@ -97,21 +97,21 @@
describe "#item_enum_chrons" do
it "collects all item enum_chrons in the cluster" do
c = described_class.first
expect(c.item_enum_chrons).to eq(["3"])
expect(c.item_enum_chrons).to eq(["3 "])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the extra space? It feels like we should strip the values (before saving?) so we don't have to worry about this stuff. Or am I missing something?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stripping means an enum or a chron can match an enum_chron, which we don't want.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If enums or chrons could have spaces in the middle, then a normalized enum_chron could match an enum or chron.

@mwarin believes spaces are switched out with colons in scrub process / normalization

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jsteverman is going to use \t as the separator and add a spec to https://github.com/hathitrust/holdings-backend/blob/master/spec/enum_chron_spec.rb to ensure that tabs are stripped out by normalization.

@billdueber billdueber merged commit b4ff724 into master Feb 11, 2021
@aelkiss aelkiss deleted the HT-2771_n_chron_enum_matching branch April 15, 2021 01:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants