Add BagIt sample data #19

j-panzer · 2018-11-02T14:31:29Z

Based on PPN595930174 (VD18 10246916).
File groups are reduced to PRESENTATION and GDZOCR to keep the sample data simple.


kba · 2018-11-02T14:42:04Z

Thanks!

ZIP contains a _MACOSX folder

Can we add Ocrd-Identifier and Ocrd-Manifestation-Depth (= 'partial') to the bag-info.txt (cf. OCR-D/spec#70)? @VolkerHartmann, can the repository metadata ingestor handle this?

I see the point of fetch.txt for test assets but for the serialization format we explicitly do not want fetch.txt. Maybe we could document this.
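To illustrate the bag-info.txt request above, here is a minimal sketch of a check for the two tags. It assumes the usual `Label: value` layout of bag-info.txt with indented continuation lines; the helper names `parse_bag_info` and `missing_tags` and the sample values are made up for this example and are not part of OCR-D/core.

```python
from typing import Dict, List, Sequence


def parse_bag_info(text: str) -> Dict[str, str]:
    """Parse 'Label: value' lines of a bag-info.txt into a dict.

    Simplification for this sketch: a repeated label overwrites the
    earlier value instead of being collected into a list.
    """
    tags: Dict[str, str] = {}
    last_label = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if line[0] in " \t" and last_label is not None:
            # Indented line: continuation of the previous value.
            tags[last_label] += " " + line.strip()
            continue
        label, sep, value = line.partition(":")
        if not sep:
            continue  # not a tag line; ignored here
        last_label = label.strip()
        tags[last_label] = value.strip()
    return tags


def missing_tags(
    tags: Dict[str, str],
    required: Sequence[str] = ("Ocrd-Identifier", "Ocrd-Manifestation-Depth"),
) -> List[str]:
    """Return the required tag names that are absent from `tags`."""
    return [name for name in required if name not in tags]


# Hypothetical bag-info.txt content, for illustration only.
sample = "Ocrd-Identifier: PPN595930174\nOcrd-Manifestation-Depth: partial\n"
print(missing_tags(parse_bag_info(sample)))
```

The actual rules live in the OCR-D BagIt profile; this sketch only tests for the presence of the two tags named in this comment.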

kba · 2018-11-15T15:54:27Z

I'm getting HTTP 500 errors for some of the xml files, e.g. http://gdz.sub.uni-goettingen.de/gdzocr/PPN595930174/00000035.xml. @j-panzer Any idea what that is about?

(this has basically been my argument against fetch.txt 😆)


kba · 2018-11-15T16:45:45Z

@j-panzer I updated the sample with SHA512 checksums for the files I could load (the XML files are empty) and extended the bag-info.txt a bit.

You can try to validate it against our BagIt profile by installing OCR-D/core and then running

ocrd zip validate --skip-unzip /path/to/assets/sample_bagit-with-fetch
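For reference, the SHA512 checksums mentioned above could be produced roughly as follows. This is only a sketch using the Python standard library: the function name `sha512_manifest_lines` is hypothetical, and it assumes the common BagIt layout where payload files sit under `data/` and each manifest line is `<checksum>  <path relative to the bag>`.

```python
import hashlib
from pathlib import Path
from typing import List


def sha512_manifest_lines(bag_dir: str) -> List[str]:
    """Build manifest-style lines for every file below <bag_dir>/data."""
    bag = Path(bag_dir)
    lines = []
    for path in sorted((bag / "data").rglob("*")):
        if path.is_file():
            # Reads each file fully into memory; fine for small samples.
            digest = hashlib.sha512(path.read_bytes()).hexdigest()
            lines.append(f"{digest}  {path.relative_to(bag).as_posix()}")
    return lines
```

Writing the result to `manifest-sha512.txt` is then a matter of joining the lines with newlines. Note that an empty download still gets a valid checksum, so the manifest alone does not reveal the empty XML files.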

j-panzer · 2018-11-15T16:46:58Z

The references are correct. There is a problem with the service (routing). We will correct this.


j-panzer · 2018-11-16T10:28:16Z

We solved the fulltext download problem :-)


kba · 2018-11-16T11:22:01Z

> We solved the fulltext download problem :-)

Thx, it's loading now.

kba · 2018-11-16T11:29:34Z

I still do get empty response bodies for these files:

# find data -size 0
data/GDZOCR/00000004.xml
data/GDZOCR/00000318.xml
data/GDZOCR/00000312.xml
data/GDZOCR/00000316.xml
data/GDZOCR/00000322.xml
data/GDZOCR/00000326.xml
data/GDZOCR/00000324.xml
data/GDZOCR/00000314.xml
data/GDZOCR/00000320.xml
data/GDZOCR/00000310.xml
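The same check can be scripted, for example to run it after every download of the sample. The sketch below roughly mirrors `find data -size 0` for regular files; the function name `empty_files` is made up for this example.

```python
from pathlib import Path
from typing import List


def empty_files(root: str) -> List[str]:
    """Return the paths of all zero-byte regular files below `root`, sorted."""
    return sorted(
        p.as_posix()
        for p in Path(root).rglob("*")
        if p.is_file() and p.stat().st_size == 0
    )
```

Calling `empty_files("data")` from the bag directory should list the same files as the `find` command above, in sorted order.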

j-panzer · 2018-11-16T13:03:35Z

Ok, the API should export the OCR as TEI XML, but exports it preprocessed. We need to discuss and change this.


kba · 2018-11-16T13:47:32Z

> Ok, the API should export the OCR as TEI XML

I think these are HTML5 snippets:

head data/GDZOCR/00000014.xml

<p id="gdz-ID82">
  <span data-function="162,107,204,136">14</span>
  <span data-function="375,99,405,134">@</span>
  <span data-function="485,100,513,134">Ф</span>
  <span data-function="592,99,622,135">@</span>
</p>
<p id="gdz-ID83">
....

kba · 2018-11-16T13:51:27Z

MIMETYPE="text/xml" in the sample METS is not ideal. Preferably, application/tei+xml for TEI, text/html for HTML, application/vnd.prima.page+xml for PAGE.

I would like to check in at least the METS file for the fileGrp and mimetypes. Can we get this pretty-printed from https://gdz.sub.uni-goettingen.de/mets/PPN595930174.mets.xml to make it easier to track changes and edit it? I could reformat it locally but that would break the fetch/checksum mechanism.

@cneud BTW, do we have a media type for ABBYY FineReader XML?
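To review the fileGrp and MIMETYPE values discussed above without reading the raw METS, something like the following sketch could help. It uses only the standard library and the METS namespace `http://www.loc.gov/METS/`; the function name `mimetypes_by_filegrp` and the tiny METS fragment are invented for this example and do not reproduce the real file for PPN595930174.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict
from typing import Dict, List

METS_NS = {"mets": "http://www.loc.gov/METS/"}


def mimetypes_by_filegrp(mets_xml: str) -> Dict[str, List[str]]:
    """Map each mets:fileGrp USE value to the sorted MIME types of its files."""
    root = ET.fromstring(mets_xml)
    result = defaultdict(set)
    for grp in root.iterfind(".//mets:fileGrp", METS_NS):
        for file_el in grp.iterfind("mets:file", METS_NS):
            # Missing attributes are reported as empty strings.
            result[grp.get("USE", "")].add(file_el.get("MIMETYPE", ""))
    return {use: sorted(types) for use, types in result.items()}


# Invented METS fragment, for illustration only.
sample_mets = (
    '<mets:mets xmlns:mets="http://www.loc.gov/METS/"><mets:fileSec>'
    '<mets:fileGrp USE="GDZOCR">'
    '<mets:file ID="f1" MIMETYPE="text/xml"/></mets:fileGrp>'
    '<mets:fileGrp USE="PRESENTATION">'
    '<mets:file ID="f2" MIMETYPE="image/jpeg"/></mets:fileGrp>'
    "</mets:fileSec></mets:mets>"
)
print(mimetypes_by_filegrp(sample_mets))
```

A group that still reports `text/xml` would be a candidate for one of the more specific media types proposed in this comment.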

cneud · 2018-11-22T18:02:23Z

We do not have a defined media type for FineReader Engine XML yet. How about using application/vnd.abbyy.fre+xml?

Add BagIt sample data

7f40948

based on PPN595930174 (VD18 10246916) file groups are reduced to PRESENTATION and GDZOCR to keep the data base simple

j-panzer added 3 commits November 2, 2018 15:59

Rename Zip file

0777fa4

Rename Zip file and remove _MACOSX folder

3bedb40

Merge branch 'master' of github.com:subugoe/assets

15190eb

wrznr requested review from cneud and VolkerHartmann November 6, 2018 09:46

kba force-pushed the master branch from 1a77b66 to 0b2a489 Compare November 15, 2018 11:27

kba added 2 commits November 15, 2018 16:06

Merge branch 'master' into subugoe-master

c8ae0ed

unpack sample by @j-panzer

475619d

sample_bagit-with-fetch: add sha512-manifest, download files, extend bag-info

71392c6

sample_bagit-with-fetch: add Ocrd-Mets to bag-info

042f623

complete manifests

012f2c4

kba merged commit 012f2c4 into OCR-D:master Nov 16, 2018
