HT-2898: Format and Overlap determination documentation #77
format_and_overlap_determination.md
Outdated
: Volume ID and associated metadata taken from the Hathifiles.

**Print Serials List**
: List of Record Ids determined to be serials by some unknown UMich process. List is limited to those held by UMich.
How do we get this list? How often?
Margaret Kelly emails this as a .txt file to Martin & Josh at the end of each month. We could instead request that it be uploaded to a certain dropbox location.
I think we should say:
- we get the file from AIM (do not specify individuals)
- how the file is generated is not in scope for this document
We were told that the latest file (end of May 2021) would be the last file Margaret would be able to produce, as UMich is switching from Aleph to Alma (and presumably this report was based on Aleph).
**Cost Per Volume**
: Target Cost / Total Number of HT Items

**Member Weight**
Is this the same as the "tier"? If not we should add what a tier is.
It's the same thing.
Well. The Member Tier is a label that exists outside the system.
Melissa will tell me "Please make an estimate for member xyz, they are a US institution and tier 1". She determines tier based on the IPEDS. From her boilerplate that goes out to prospective members: "Each member is assigned to one of three tiers based on the member's Total Library Expenditures as reported in public sources."
Each tier (1,2,3) has a weight associated. This association is not codified, other than perhaps in Confluence, but as far as I'm concerned lives in our minds.
Tier 1: 0.67
Tier 2: 1.00
Tier 3: 1.33
Then there are other members that fall outside the tier system, with a weight of 0.00 or above 1.33. These are:
+------------+--------+
| member_id | weight |
+------------+--------+
| ucmerced | 0.00 |
| ucsf | 0.00 |
| hathitrust | 0.00 |
| utexas | 3.00 |
| flbog | 5.00 |
| usg | 3.00 |
+------------+--------+
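To make the mapping concrete, here is a minimal sketch of the weight lookup as described in this comment. The names `TIER_WEIGHTS`, `NON_TIER_WEIGHTS` and `member_weight` are hypothetical and do not come from the repository; only the numbers are taken from the lists above.

```python
# Tier weights as stated in this thread; the names are hypothetical.
TIER_WEIGHTS = {1: 0.67, 2: 1.00, 3: 1.33}

# Members that fall outside the tier system, copied from the table above.
NON_TIER_WEIGHTS = {
    "ucmerced": 0.00,
    "ucsf": 0.00,
    "hathitrust": 0.00,
    "utexas": 3.00,
    "flbog": 5.00,
    "usg": 3.00,
}


def member_weight(member_id, tier=None):
    """Return a member's weight: the explicit override first, else the tier weight."""
    if member_id in NON_TIER_WEIGHTS:
        return NON_TIER_WEIGHTS[member_id]
    if tier not in TIER_WEIGHTS:
        raise ValueError(f"no tier or override known for {member_id!r}")
    return TIER_WEIGHTS[tier]
```

For example, `member_weight("xyz", tier=1)` gives 0.67, while `member_weight("flbog")` gives 5.00 regardless of tier.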
Those non-tier weights were, I believe, rather arbitrarily cobbled together, probably by Mike.
ooookay, this is where the system thing from my comment above gets worked out. As in, this is where the "special processing" for USG and FLBOG (and utexas, I guess these are the only 3 "systems") happens.
Is this really the right way to do this??
My impression was that the weights (including for the system members, which I think is based on the number of R1 research schools in the system) are encoded in the billing model which has been approved by the Board, so I would not describe them as "arbitrary". I'm checking on this with Mike, but yes, I think that given our current billing model this is the right way to do it.
The system weights come from this: https://www.hathitrust.org/sites/www.hathitrust.org/files/member-criteria-formalize.pdf which Natalie quoted from below
The tier to weight mapping for Tier 1, 2, 3 is in a policy here:
https://www.hathitrust.org/sites/www.hathitrust.org/files/fee-model-change-proposal.pdf
@nfulkers will add info about systems & specific R1 schools in Confluence
**SPM**
: Is not MPM, SER/SPM, or SER.

## Cost Allocation ##
I'm not sure how or where to address it, but isn't there something funky about how costs are allocated for university Systems like FLBOG and the Georgia State system?
There is no unique processing for FLBOG or the Georgia State system in this repository.
does that mean we need the system to be doing something that it currently doesn't do?
I don't think so. I think the system handles those appropriately. If there are some separate requirements documents for those systems we can walk through the logic and check.
The old system does not do anything different for these.
From its point of view they are 2 normal members.
So how does this get worked out in the legacy system? (from https://www.hathitrust.org/eligibility_agreements):
Systems with more than one institution classified as Carnegie R1 will pay a public domain fee for each R1 institution, tiered according to its total library expenditures. For systems without an R1 institution, the member will pay a public domain fee based on the system’s assignment to a tier that corresponds with the sum total expenditures of all institutions in the system. In-copyright fees are assessed for the entire system.
They have a weight in the old system, the same as in the new system. There is no special handling for system members in that the fee is always scaled by the weight; the thing that differs for system members (in both old and new systems) is the weight.
See above discussion about systems
format_and_overlap_determination.md
Outdated
For items in **clusters** that are not MPM, all organizations with a holding in the cluster and the billing entity for the item are allocated a share.

For items in **clusters** that are MPM, the process is more complicated. Organizations with holdings:
I don't get this section at all, which is a bummer since it seems like this is the crux of the document ¯_(ツ)_/¯
A topic of conversation for our weekly meeting. @mwarin @billdueber @aelkiss
What about:
For items in clusters that are not MPM (i.e. none of the items in the cluster have the item format multi-part monograph), an institution is considered to hold the item and is allocated a share for it if:
- It is the billing entity for the item (i.e. the depositor of an item is always assumed to hold it)
- It submitted a holding that is in the cluster
For items in clusters that are MPM, the process is more complicated. An institution is considered to hold the item and is allocated a share if:
- It is the billing entity for the item (i.e. the depositor of an item is always assumed to hold it)
- It has a holding in the cluster with an n_enum matching the n_enum of an item.
- It has a holding in the cluster that has an empty n_enum. The same is not true of items with empty n_enums: holdings with empty n_enums match all items in an MPM cluster; but items with empty n_enums match only holdings with n_enums.
- It has holdings in the cluster, but none of the reported n_enums match any of the Item n_enums: that is, if no holdings n_enums match, then this is assumed to be a data problem, and the institution is considered to hold all the items in the cluster. [^2]
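One way to check the wording is to write the rules down as code. The following is only a sketch of the rules as listed above, not the repository's implementation: `Item`, `Holding` and `holders` are hypothetical names, an empty string stands for an empty n_enum, and the sketch reads the empty-item case as "an item with an empty n_enum is not a wildcard".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    item_id: str
    n_enum: str          # "" means the item has no enumeration
    billing_entity: str


@dataclass(frozen=True)
class Holding:
    organization: str
    n_enum: str          # "" means the holding has no enumeration


def holders(item, cluster_items, cluster_holdings, is_mpm):
    """Return the set of institutions considered to hold `item`."""
    # The depositor (billing entity) is always assumed to hold its own item.
    result = {item.billing_entity}

    if not is_mpm:
        # Non-MPM cluster: any holding in the cluster counts.
        result |= {h.organization for h in cluster_holdings}
        return result

    item_enums = {i.n_enum for i in cluster_items}
    enums_by_org = {}
    for h in cluster_holdings:
        enums_by_org.setdefault(h.organization, set()).add(h.n_enum)

    for org, enums in enums_by_org.items():
        if "" in enums:
            # A holding with an empty n_enum matches every item in the cluster.
            result.add(org)
        elif item.n_enum in enums:
            # A holding n_enum matches this item's n_enum.
            result.add(org)
        elif not (enums & item_enums):
            # No reported n_enum matches any item: assumed data problem,
            # so the institution is treated as holding everything.
            result.add(org)
    return result
```

With items `v.1` and `v.2` deposited by `umich`, and holdings `a: v.1`, `b: (empty)`, `c: v.9`, this gives `a`, `b`, `c` and `umich` for `v.1`, but only `b`, `c` and `umich` for `v.2`.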
@nfulkers Does that address any of the confusion? I think it is important this section be as clear as possible. I don't know whether my option is any better. It may be helpful for you to try stating the behavior in your own words as you understand it, and then seeing if there are any spots we need to correct or clarify.
I edited the above to replace 'organization' with 'institution'.
Might need some more introductory text to make it clear that we are not actually matching holdings -- we are figuring out who holds an item, which includes (but is not limited to) matching holdings to HT items
@nfulkers believes these revisions address the confusion
format_and_overlap_determination.md
Outdated
**NB**: Billing entities apply only to the Item they are on. For example, a billing entity on an HT Item with an empty n_enum will **not** match other items in the cluster.

## Frequency Table ##
Item shares are held in a frequency table separated by member and **item** format.[^3] This gets compiled into per format per member cost allocations.
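As an illustration only, the frequency table could be pictured as below. The layout is an assumption, not taken from the repository: for each (member, item format) pair it counts items by H, the number of members holding the item, and each counted item then contributes Cost Per Volume / H to that member. The function names are hypothetical, and member weights are not applied here.

```python
from collections import Counter, defaultdict


def build_frequency_table(item_holders):
    """Build {(member, item_format): Counter({H: number_of_items})}.

    item_holders is an iterable of (item_format, set_of_member_ids) pairs,
    one pair per HT Item.
    """
    table = defaultdict(Counter)
    for item_format, members in item_holders:
        h = len(members)  # number of members sharing this item
        for member in members:
            table[(member, item_format)][h] += 1
    return table


def cost_allocations(table, cost_per_volume):
    """Sum cost_per_volume / H over every counted item, per member and format."""
    return {
        key: sum(count * cost_per_volume / h for h, count in counts.items())
        for key, counts in table.items()
    }
```

For instance, a member holding one SPM item alone and another SPM item together with one other member would be allocated 1.5 times the cost per volume for SPM.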
shares as in Hscore? If yes might be good to use the standard term, though I admit the sentence sounds weirder that way. If not, we should clarify what "share" means in this context.
"HScore" is terminology taken from the existing system, and it's a little awkward when "share" seems to be perfectly adequate. Changed to HScore all the same.
## HT Item Overlap ##
Determining who holds a particular in-copyright HT Item.
In the legacy system, I think, we somehow allocate a share for each copy of an item, if a member reports holding more than one copy?
Is that right and if so does the new system also do this?
@mwarin I know I've been confused on this point in the past. I don't think the old system does this, though. As far as I can tell this test addresses this in the new system: https://github.com/hathitrust/holdings-backend/blob/master/spec/cost_report_spec.rb#L222-L233 -- multiple reported holdings of the same OCN lead to only one allocated share of each matching HTItem
For fees it doesn't matter how many copies you hold (>0).
This makes sense considering that the fee for each member associated with a volume is multiplied by 1/H, where H is the number of members associated with that volume, not the total number of copies (of the title) reported held.
The number of copies (of a title) reported held only matters for access (e.g. ETAS).
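A small sketch of the 1/H rule, assuming the input is simply the list of member ids reported against one volume (with repeats when a member reports several copies). `shares_for_volume` is a hypothetical name.

```python
from fractions import Fraction


def shares_for_volume(member_ids):
    """Give each distinct member associated with a volume a 1/H share."""
    members = set(member_ids)   # extra copies from the same member collapse here
    h = len(members)            # H counts members, not copies
    return {member: Fraction(1, h) for member in members}
```

So `shares_for_volume(["a", "a", "b"])` allocates 1/2 each to `a` and `b`, even though `a` reported two copies.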
format_and_overlap_determination.md
Outdated
For HT Items in **clusters** that are not MPM, all organizations with a holding in the cluster and the billing entity for the HT Item are allocated a share.

For HT Items in **clusters** that are MPM, the process is more complicated. Organizations with holdings:
- Organizations with holdings with empty **n_enum**. The reverse is **not** true; i.e. Items with empty n_enum don't match everything.
"Organizations with holdings with empty n_enum" ... what? What do they match to? Is there some kind of logic implied by the syntax or notation of this sentence that I'm missing?
In natural language, line 108 feels like it should be written "If a member reports a holding that the system determines to be MPM AND the reported holding lacks enumeration, that holding will match to all other holdings in the same cluster that also lack n_enum" (or whatever the true behavior actually is).
Or something like
"For HT items in clusters that are MPM, if the HT item lacks enumchron, each organization that submitted a record also lacking enumchron is allocated a share"
???
Does my proposed revision above #77 (comment) make this point any clearer?
@nfulkers believes my revisions above are sufficient
format_and_overlap_determination.md
Outdated
For HT Items in **clusters** that are not MPM, all organizations with a holding in the cluster and the billing entity for the HT Item are allocated a share.

For HT Items in **clusters** that are MPM, the process is more complicated. Organizations with holdings:
What is the "Organizations with holdings:" phrase meant to designate in line 107? I think I'm not understanding how the word "organizations" applies here.
Probably should be written as "members with holdings" or perhaps "institutions with holdings"?
We will add these links to the documentation