Add BagIt sample data #19

Merged
merged 9 commits into from
Nov 16, 2018

j-panzer
Contributor

@j-panzer j-panzer commented Nov 2, 2018

based on PPN595930174 (VD18 10246916)
file groups are reduced to PRESENTATION and GDZOCR to keep the sample data simple

@kba
Member

kba commented Nov 2, 2018

Thanks!

ZIP contains a _MACOSX folder

Can we add Ocrd-Identifier and Ocrd-Manifestation-Depth (= 'partial') to the bag-info.txt (cf. OCR-D/spec#70)? @VolkerHartmann can the repository metadata ingestor handle this?
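
As a rough illustration of the request above, the two tags could be appended to bag-info.txt along these lines. This is a minimal sketch and not OCR-D/core code: the helper name `add_bag_info_tags` is hypothetical, the identifier value in the usage note is an assumption, and a real bag would also need its tag manifest checksums refreshed afterwards.

```python
from pathlib import Path


def add_bag_info_tags(bag_dir, tags):
    """Append 'Label: value' lines to bag-info.txt, skipping labels already present.

    Hypothetical helper for illustration; not part of OCR-D/core.
    Returns the list of lines that were actually added.
    """
    info_path = Path(bag_dir) / "bag-info.txt"
    existing = info_path.read_text(encoding="utf-8") if info_path.exists() else ""
    # Labels are the text before the first colon; indented lines are
    # continuations of the previous value and carry no label.
    present = {
        line.split(":", 1)[0].strip()
        for line in existing.splitlines()
        if ":" in line and not line[:1].isspace()
    }
    new_lines = [
        f"{label}: {value}" for label, value in tags.items() if label not in present
    ]
    if existing and not existing.endswith("\n"):
        existing += "\n"
    info_path.write_text(
        existing + "".join(line + "\n" for line in new_lines), encoding="utf-8"
    )
    return new_lines
```

A call might look like `add_bag_info_tags("sample_bag", {"Ocrd-Identifier": "PPN595930174", "Ocrd-Manifestation-Depth": "partial"})`, where both the directory name and the use of the PPN as identifier are assumptions made for the example.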

I see the point of fetch.txt for test assets but for the serialization format we explicitly do not want fetch.txt. Maybe we could document this.

@kba
Member

kba commented Nov 15, 2018

I'm getting HTTP 500 errors for some of the xml files, e.g. http://gdz.sub.uni-goettingen.de/gdzocr/PPN595930174/00000035.xml. @j-panzer Any idea what that is about?

(this has basically been my argument against fetch.txt 😆)
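
The fragility mentioned here comes from fetch.txt listing payload files by URL instead of shipping them in the bag. A minimal sketch of reading such a file, assuming the BagIt layout of one `URL LENGTH FILENAME` entry per line (with `-` standing for an unknown length); the names `FetchEntry` and `parse_fetch_txt` are hypothetical and nothing is downloaded:

```python
from typing import List, NamedTuple, Optional


class FetchEntry(NamedTuple):
    url: str
    length: Optional[int]  # None when the length field is "-"
    filename: str


def parse_fetch_txt(text: str) -> List[FetchEntry]:
    """Parse the contents of a fetch.txt into a list of entries."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        # Split into at most three fields so spaces in the filename survive.
        url, length, filename = line.split(None, 2)
        entries.append(
            FetchEntry(url, None if length == "-" else int(length), filename.strip())
        )
    return entries
```

Every entry returned this way is a file the bag depends on but does not contain, so a single URL answering with an error leaves the bag incomplete.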

@kba
Member

kba commented Nov 15, 2018

@j-panzer I updated the sample with SHA512 checksums for the files I could load (the XML files are empty) and extended the bag-info.txt a bit.

You can try to validate it against our BagIt profile by installing OCR-D/core and then running

ocrd zip validate --skip-unzip /path/to/assets/sample_bagit-with-fetch
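
For reference, the kind of SHA512 payload manifest mentioned above can be produced with the standard library alone. A minimal sketch, assuming the usual BagIt convention of `checksum  path` lines with paths relative to the bag root; `sha512_manifest_lines` is a hypothetical helper and not necessarily how OCR-D/core does it internally:

```python
import hashlib
from pathlib import Path


def sha512_manifest_lines(bag_dir):
    """Return manifest lines for every regular file below <bag_dir>/data."""
    bag = Path(bag_dir)
    lines = []
    for path in sorted((bag / "data").rglob("*")):
        if path.is_file():
            digest = hashlib.sha512(path.read_bytes()).hexdigest()
            # Paths are written relative to the bag root with forward slashes.
            lines.append(f"{digest}  {path.relative_to(bag).as_posix()}")
    return lines
```

Note that an empty file still hashes without error, so a matching checksum alone does not reveal that a download produced an empty body.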

@j-panzer
Contributor Author

j-panzer commented Nov 15, 2018 via email

@j-panzer
Contributor Author

j-panzer commented Nov 16, 2018 via email

@kba
Member

kba commented Nov 16, 2018

We solved the fulltext download problem :-)

Thx, it's loading now.

@kba
Member

kba commented Nov 16, 2018

I still do get empty response bodies for these files:

# find data -size 0
data/GDZOCR/00000004.xml
data/GDZOCR/00000318.xml
data/GDZOCR/00000312.xml
data/GDZOCR/00000316.xml
data/GDZOCR/00000322.xml
data/GDZOCR/00000326.xml
data/GDZOCR/00000324.xml
data/GDZOCR/00000314.xml
data/GDZOCR/00000320.xml
data/GDZOCR/00000310.xml
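
The same check can be scripted, for example as part of a test for the sample bag. A minimal sketch that is roughly equivalent to the `find` call above but restricted to regular files; the function name `empty_files` is hypothetical:

```python
from pathlib import Path


def empty_files(root):
    """Return the sorted paths of all zero-byte regular files below root."""
    return sorted(
        p.as_posix()
        for p in Path(root).rglob("*")
        if p.is_file() and p.stat().st_size == 0
    )
```

Called as `empty_files("data")` from the bag root, it returns paths in the same `data/GDZOCR/...` form as the listing above.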

@kba kba merged commit 012f2c4 into OCR-D:master Nov 16, 2018
@j-panzer
Contributor Author

j-panzer commented Nov 16, 2018 via email

@kba
Member

kba commented Nov 16, 2018

Ok, the API should export the OCR as TEI XML

I think these are HTML5 snippets:

head data/GDZOCR/00000014.xml

<p id="gdz-ID82">
  <span data-function="162,107,204,136">14</span>
  <span data-function="375,99,405,134">@</span>
  <span data-function="485,100,513,134">Ф</span>
  <span data-function="592,99,622,135">@</span>
</p>
<p id="gdz-ID83">
....
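
To make the structure of these snippets concrete, here is a minimal sketch that pulls the text and the four integers out of each `span`. It assumes the `data-function` values are word box coordinates, which the thread does not confirm; `SpanBoxParser` and `extract_words` are hypothetical names.

```python
from html.parser import HTMLParser


class SpanBoxParser(HTMLParser):
    """Collect (text, numbers) pairs from <span data-function="a,b,c,d"> elements."""

    def __init__(self):
        super().__init__()
        self.words = []    # list of (text, (a, b, c, d))
        self._box = None   # numbers of the span currently open, if any

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            value = dict(attrs).get("data-function")
            if value:
                self._box = tuple(int(n) for n in value.split(","))

    def handle_data(self, data):
        # Only keep text that sits inside a span carrying data-function.
        if self._box is not None and data.strip():
            self.words.append((data.strip(), self._box))

    def handle_endtag(self, tag):
        if tag == "span":
            self._box = None


def extract_words(html_text):
    parser = SpanBoxParser()
    parser.feed(html_text)
    parser.close()
    return parser.words
```

The lenient `html.parser` module is used instead of an XML parser because the files contain several top-level `p` elements and so are not well-formed XML documents on their own.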

@kba
Member

kba commented Nov 16, 2018

MIMETYPE="text/xml" in the sample METS is not ideal. Preferably, application/tei+xml for TEI, text/html for HTML, application/vnd.prima.page+xml for PAGE.
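
One way to assign the more specific media types is to look at the root element of each file. A minimal sketch, assuming that TEI files have a `TEI` root element and PAGE files a `PcGts` root element; the heuristic and the name `guess_media_type` are illustrative, not something the sample METS or OCR-D/core prescribes:

```python
import xml.etree.ElementTree as ET

# Root element local name -> media type, following the preferences listed above.
MEDIA_TYPES = {
    "TEI": "application/tei+xml",
    "PcGts": "application/vnd.prima.page+xml",
    "html": "text/html",
}


def guess_media_type(xml_text, default="text/xml"):
    """Guess a media type from the root element, falling back to default."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        # Empty files and fragments without a single root element end up here.
        return default
    local_name = root.tag.rsplit("}", 1)[-1]  # drop a '{namespace}' prefix if present
    return MEDIA_TYPES.get(local_name, default)
```

HTML fragments with several top-level elements do not parse as XML, so they fall back to the default and would need a separate check.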

I would like to check in at least the METS file for the fileGrp and mimetypes. Can we get this pretty-printed from https://gdz.sub.uni-goettingen.de/mets/PPN595930174.mets.xml to make it easier to track changes and edit it? I could reformat it locally but that would break the fetch/checksum mechanism.
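
For reviewing the file groups and media types without reformatting the METS file itself, the relevant attributes can be read out directly. A minimal sketch, assuming the standard METS namespace and the `USE` and `MIMETYPE` attributes; `file_group_mimetypes` is a hypothetical helper:

```python
import xml.etree.ElementTree as ET

METS_NS = {"mets": "http://www.loc.gov/METS/"}


def file_group_mimetypes(mets_xml):
    """Map each fileGrp USE value to the sorted MIMETYPE values of its files."""
    root = ET.fromstring(mets_xml)
    result = {}
    for grp in root.iterfind(".//mets:fileGrp", METS_NS):
        types = {f.get("MIMETYPE") for f in grp.iterfind("mets:file", METS_NS)}
        # Files without a MIMETYPE attribute are left out of the summary.
        result[grp.get("USE")] = sorted(t for t in types if t)
    return result
```

Because it only reads the document, this leaves the bytes of the METS file untouched, so checksums recorded for it stay valid.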

@cneud BTW do we have a media type for ABBYY FineReader XML?

@cneud
Member

cneud commented Nov 22, 2018

We do not have a defined media type for FineReader Engine XML yet. How about using application/vnd.abbyy.fre+xml?
