Add BagIt sample data #19
Based on PPN595930174 (VD18 10246916). File groups are reduced to PRESENTATION and GDZOCR to keep the data set simple.
Thanks! ZIP contains a Can we add I see the point of |
I'm getting (this has basically been my argument against |
@j-panzer I updated the sample with SHA512 checksums for the files I could load (the XML files are empty) and extended the bag-info.txt a bit. You can try to validate it against our BagIt profile by installing OCR-D/core and then running ocrd zip validate --skip-unzip /path/to/assets/sample_bagit-with-fetch |
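For readers who want to see what the SHA512 check amounts to, here is a minimal sketch of verifying a bag's payload against `manifest-sha512.txt`. It is not OCR-D's implementation and does not check the BagIt profile; the function name `verify_manifest` is made up for illustration. It assumes the standard BagIt manifest layout (one `<hex digest> <relative path>` pair per line) and that any files listed in `fetch.txt` have already been downloaded.

```python
import hashlib
from pathlib import Path


def verify_manifest(bag_dir, manifest_name="manifest-sha512.txt"):
    """Return (path, problem) tuples for payload files that fail verification.

    Hypothetical helper: checks only digests and file presence.
    """
    bag = Path(bag_dir)
    problems = []
    for line in (bag / manifest_name).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        # Each manifest line is "<hex digest> <relative path>".
        expected, rel_path = line.split(None, 1)
        rel_path = rel_path.strip()
        target = bag / rel_path
        if not target.is_file():
            problems.append((rel_path, "missing"))
            continue
        actual = hashlib.sha512(target.read_bytes()).hexdigest()
        if actual != expected.lower():
            problems.append((rel_path, "checksum mismatch"))
    return problems
```

An empty list means every listed file is present and matches its digest. Note that a zero-byte file still has a valid SHA512 digest, so a manifest generated from empty downloads would pass this check.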
The references are correct. There is a problem with the service (routing). We will correct this.
On 15.11.2018 at 16:54, Konstantin Baierer wrote:
> I'm getting HTTP 500 errors for some of the xml files, e.g. http://gdz.sub.uni-goettingen.de/gdzocr/PPN595930174/00000035.xml. @j-panzer Any idea what that is about?
|
We solved the fulltext download problem :-)
|
Thx, it's loading now. |
I still do get empty response bodies for these files:
# find data -size 0
data/GDZOCR/00000004.xml
data/GDZOCR/00000318.xml
data/GDZOCR/00000312.xml
data/GDZOCR/00000316.xml
data/GDZOCR/00000322.xml
data/GDZOCR/00000326.xml
data/GDZOCR/00000324.xml
data/GDZOCR/00000314.xml
data/GDZOCR/00000320.xml
data/GDZOCR/00000310.xml |
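To map empty files like the ones listed above back to the URLs they were fetched from, something along these lines could help. This is only a sketch: `empty_fetched_files` is a made-up name, and it assumes the bag carries a standard BagIt `fetch.txt` with one `<url> <length> <path>` entry per line.

```python
from pathlib import Path


def empty_fetched_files(bag_dir):
    """Return (relative path, source URL) pairs for zero-byte files listed in fetch.txt.

    Hypothetical helper; files that were never downloaded are ignored.
    """
    bag = Path(bag_dir)
    results = []
    for line in (bag / "fetch.txt").read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        # fetch.txt entries are "<url> <length> <path>"; length may be "-".
        url, _length, rel_path = line.split(None, 2)
        rel_path = rel_path.strip()
        target = bag / rel_path
        if target.is_file() and target.stat().st_size == 0:
            results.append((rel_path, url))
    return results
```

The returned URLs are the ones worth requesting again or reporting to the service maintainers.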
Ok, the API should export the OCR as TEI XML, but exports it preprocessed. We need to discuss and change this.
|
I think these are HTML5 snippets:
# head data/GDZOCR/00000014.xml
<p id="gdz-ID82">
<span data-function="162,107,204,136">14</span>
<span data-function="375,99,405,134">@</span>
<span data-function="485,100,513,134">Ф</span>
<span data-function="592,99,622,135">@</span>
</p>
<p id="gdz-ID83">
.... |
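A quick way to inspect such snippets is to pull the text and the `data-function` values out of each `span`. The sketch below uses only the standard library; the class and function names are made up, and treating the four comma-separated integers as some kind of bounding box is a guess, since the thread does not say what they mean.

```python
from html.parser import HTMLParser


class SpanCollector(HTMLParser):
    """Collect (text, numbers) for every <span> carrying a data-function attribute."""

    def __init__(self):
        super().__init__()
        self.tokens = []
        self._numbers = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            value = dict(attrs).get("data-function")
            if value:
                # Assumption: the attribute holds comma-separated integers.
                self._numbers = tuple(int(n) for n in value.split(","))
                self._text = []

    def handle_data(self, data):
        if self._numbers is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._numbers is not None:
            self.tokens.append(("".join(self._text), self._numbers))
            self._numbers = None


def parse_snippet(markup):
    """Return a list of (text, numbers) tuples found in the markup."""
    collector = SpanCollector()
    collector.feed(markup)
    collector.close()
    return collector.tokens
```

Calling `parse_snippet` on the contents of one of the GDZOCR files gives a flat token list that is easier to compare against the page image than the raw markup.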
I would like to check in at least the METS file for the fileGrp and mimetypes. Can we get this pretty-printed from https://gdz.sub.uni-goettingen.de/mets/PPN595930174.mets.xml to make it easier to track changes and edit it? I could reformat it locally, but that would break the fetch/checksum mechanism. @cneud BTW do we have a media type for ABBYY FineReader XML? |
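The checksum concern can be illustrated with a small sketch: any local reformatting changes the bytes, so the digest no longer matches what the manifest records for the fetched file. The tiny XML document below is a made-up stand-in, not the real METS file, and `sha512_hex` is a hypothetical helper.

```python
import hashlib
from xml.dom import minidom


def sha512_hex(data: bytes) -> str:
    """Return the hex SHA512 digest of the given bytes."""
    return hashlib.sha512(data).hexdigest()


# Stand-in for the compact XML as served by the remote endpoint.
compact = b'<mets><fileGrp USE="GDZOCR"><file ID="f1"/></fileGrp></mets>'

# Locally pretty-printed version: same XML content, different bytes.
pretty = minidom.parseString(compact).toprettyxml(indent="  ", encoding="utf-8")

# The digests differ, so a manifest entry computed from the served
# file would no longer validate against the reformatted copy.
print(sha512_hex(compact) == sha512_hex(pretty))
```

This is why having the server deliver the pretty-printed form is preferable to reformatting after download.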
We do not have a defined media type for FineReader Engine XML yet. How about using |