-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HT-2771 combine n_chron/n_enum for matching purposes #69
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How is n_enum_chron
populated? Does this require re-loading all data, or will it be automatically computed from n_enum
and n_chron
if it is missing?
Some matching operations use database searches so a dynamic creationg of n_enum_chron won't work. It uses the EnumChron module which populates n_enum, n_chron, and now n_enum_chron when h.enum_chron is set. A migration would look something like |
OK, thanks for the clarification. Let's think about whether it makes more sense to do that migration or reload fresher data -- I think it depends a bit to me on how close you think we are to identifying all the kinds of issues in matching. If you think we're close, then it probably makes sense to do the migration & verify that we don't see any mismatches. If you suspect there's still significantly more issues to uncover, then perhaps it would be better to load fresher data. |
48629a3
to
623e5b4
Compare
623e5b4
to
0bc1635
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple comments, one a note about the query getting too much stuff in a way this code doesn't care, and the other about having to deal with trailing spaces (where, I think, there's an enum but no chron?). That latter feels messy to me, like we should put in a hook on save that strips off leading/trailing spaces.
def records_with_enum_chrons | ||
return enum_for(:records_with_enum_chrons) unless block_given? | ||
|
||
Cluster.where("$or": [{ "holdings.enum_chron": { "$ne": "" } }, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did some experiments and I'm pretty sure this will match clusters that have any of the following:
- fulfill the criteria you want
- have ht_items or ht_holdings unset (e.g., one ht_item with no enum_chron, and no holdings at all)
- have ht_items or ht_holdings set to the empty array (because something got moved?)
Not a huge deal in this case because they'll be no-opped in the loop, but useful to know in other contexts, I would think.
spec/cluster_spec.rb
Outdated
@@ -97,21 +97,21 @@ | |||
describe "#item_enum_chrons" do | |||
it "collects all item enum_chrons in the cluster" do | |||
c = described_class.first | |||
expect(c.item_enum_chrons).to eq(["3"]) | |||
expect(c.item_enum_chrons).to eq(["3 "]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why the extra space? It feels like we should strip
the values (before saving?) so we don't have to worry about this stuff. Or am I missing something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Stripping means an enum or a chron can match an enum_chron, which we don't want.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If enums or chrons could have spaces in the middle, then a normalized enum_chron could match an enum or chron.
@mwarin believes spaces are switched out with colons in scrub process / normalization
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jsteverman is going to use \t
as the separator and add a spec to https://github.com/hathitrust/holdings-backend/blob/master/spec/enum_chron_spec.rb to ensure that tabs are stripped out by normalization.
n_enum_chron = [n_enum, n_chron].join(" ").sub(/^ $/, '')
Uses of n_enum for comparison have been replaced by n_enum_chron.