Code to take data from the hathifiles and keep an up-to-date set of tables in MySQL for querying by HT staff.
git clone <URL/protocol of choice>
cd hathifiles_database
docker compose build
docker compose run --rm test bin/setup
docker compose run --rm test
docker compose run --rm test bundle exec standardrb
There are six tables: one (`hf`) with all the data in the hathifiles (some of it normalized) in the same order it appears there; four where we break out and index normalized versions of the standard identifiers; and `hf_log`, which records each hathifile as it is successfully loaded (along with a timestamp) so that updates can occur in batches, as needed, for date independence:
- hf
- hf_isbn
- hf_issn
- hf_lccn
- hf_oclc
- hf_log
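The four identifier tables key on `htid`, the same key as `hf`. A minimal sketch of linking one of them back to `hf`, using the same join the sample queries below rely on:

select hf.htid, hf.title, oclc.value as oclc_num
from hf
-- hf_oclc.value holds the normalized OCLC number
join hf_oclc oclc on oclc.htid = hf.htid
limit 5;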
ISBNs, ISSNs, and OCLC numbers are indexed after normalization (just runs of digits and potentially an upper-case 'X'). In addition, ISBNs are indexed in both their 10- and 13-character forms.
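For example (with a made-up ISBN), both of these lookups should return the same `htid`s, since the 10- and 13-digit forms are both indexed:

-- 10- and 13-digit forms of the same (hypothetical) ISBN
select htid from hf_isbn where value = '0262510871';
select htid from hf_isbn where value = '9780262510875';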
LCCNs are more of a mess. They're stored twice, too -- once normalized and once with whatever string was in the MARC record.
bin
├── console
└── setup
These are intended to be run under Docker for development purposes.
- `console` invokes `irb` with `hathifiles_database` pre-`require`d
- `setup` is just a shell wrapper around `bundle install` (see Developer Setup)
exe
├── catchup
├── daily_run
├── hathifiles_database_clear_everything_out
├── hathifiles_database_convert
├── hathifiles_database_full
├── hathifiles_database_full_update
├── hathifiles_database_update
└── swap_production_and_reindex
These are exported by the gemspec as the gem's executables.
- `catchup`: deprecated; loads multiple `upd` files
- `daily_run`: deprecated (contains hardcoded paths); loads today's `upd` file
- `hathifiles_database_clear_everything_out`: interactive script to reinitialize the database
- `hathifiles_database_convert`: deprecated; interactive script to dump the `hathifiles` database to tab-delimited files
- `hathifiles_database_full`: deprecated; loads a single `full` hathifile
- `hathifiles_database_full_update`: the preferred date-independent method for loading `full` and `upd` hathifiles
- `hathifiles_database_update`: deprecated; loads a single `upd` hathifile
- `swap_production_and_reindex`: deprecated; swaps tables between the `hathifiles` and `hathifiles_reindex` databases
`swap_production_and_reindex` used to be part of the workflow for clearing and rebuilding the production database from an auxiliary database. With Argo Workflows we should no longer need to do this, as `hathifiles_database_full_update` should be touching only the changed/deleted rows in the `full` monthly hathifile.
The `hf` table does not record exactly the same data as the hathifiles. In particular, standard numbers like ISBN and ISSN are normalized, and because of the `library_stdnums` normalization it is not possible to convert from the database back to a hathifile, only the reverse. Furthermore, `access` is a Boolean in the database, but in the hathifiles it appears as `allow` or `deny`.
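So, for instance, a query that wants hathifile-style values back out has to reconstruct them, as in this sketch for `access`:

select htid, if(access, 'allow', 'deny') as access
from hf
limit 5;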
As a result of this, the monthly update (which computes a diff before making changes to the database) dumps the `hathi_full_*` file into an intermediate "DB-ized" dialect for comparison.
The `push_metrics` gem, which is required for running `exe/hathifiles_database_full_update`, is not part of the gemspec because it is currently unpublished. Code which uses `hathifiles_database` as a gem should also declare a `push_metrics` dependency or use its own implementation of `hathifiles_database_full_update`.
Get info for records with the given ISSN
select hf.htid, title from hf
join hf_issn on hf.htid=hf_issn.htid
where hf_issn.value='0134045X';
Find govdocs whose rights were added/changed in the last two years. Note, sadly, that you can't say "3 days", but must use "3 day" or "2 year" or whatnot.
select count(*) from hf where us_gov_doc_flag = 1 AND
rights_timestamp > date_sub(now(), interval 2 year);
What's the rights breakdown for stuff sent by Tufts?
select rights_code, count(*) from hf where
content_provider_code='tufts'
group by rights_code;
Who's giving us stuff from the 18th century?
select content_provider_code, count(*) from hf
where rights_date_used between 1700 and 1799
group by content_provider_code
order by count(*) desc;
Reconstitute a full hathifile line (given some query criterion) with the fields as in the Hathifiles Description
select
hf.htid, access, rights_code as rights, bib_num as ht_bib_key, description, source, source_bib_num,
group_concat(DISTINCT oclc.value SEPARATOR ',') oclc_num,
group_concat(DISTINCT isbn.value SEPARATOR ',') isbn,
group_concat(DISTINCT issn.value SEPARATOR ',') issn,
group_concat(DISTINCT lccn.value SEPARATOR ',') lccn,
title, imprint, rights_reason as rights_reason_code, rights_timestamp, us_gov_doc_flag, rights_date_used,
pub_place, lang_code as lang, bib_fmt, collection_code, content_provider_code, responsible_entity_code,
digitization_agent_code, access_profile_code, author
from hf
left outer join hf_oclc oclc on oclc.htid = hf.htid
left outer join hf_isbn isbn on isbn.htid = hf.htid
left outer join hf_issn issn on issn.htid = hf.htid
left outer join hf_lccn lccn on lccn.htid = hf.htid
where ...
group by hf.htid
Because of... well, because, not all tables use the same conventions for identifying a volume.
If you want to, you can use the `hf.htid` column to link into other tables as follows:
- `rights_current`: `hf.htid = concat(rights_current.namespace, '.', rights_current.id)`
- `holdings_htitem_htmember` (will be sloooooooow): `hf.htid = holdings_htitem_htmember.volume_id`
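For example, a sketch of the `rights_current` join above (the selected columns and the `where` clause are placeholders, as in the reconstitution query earlier):

select hf.htid, rights_current.*
from hf
join rights_current
  on hf.htid = concat(rights_current.namespace, '.', rights_current.id)
where ...;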