Replace OCRD-ZIP with BagIt-based spec #70

Merged
merged 14 commits into from
Nov 6, 2018

Member

@kba kba commented Aug 3, 2018

First stab at defining an exchange format that is

  • standards-compliant (BagIt)
  • easy for programmers (simple directory conventions)
  • long-term preservation minded (URL-filepath mapping, checksums)

Feedback appreciated @VolkerHartmann @krvoigt

Member Author

kba commented Sep 12, 2018

@cneud cneud self-requested a review October 24, 2018 16:58
kba added a commit to kba/ocrd-assets that referenced this pull request Oct 25, 2018
Contributor

@VolkerHartmann VolkerHartmann left a comment


For versioning, the Oxford Common File Layout makes a good impression. It is simple(?) and file-based. Git is unbeatable as long as you stay on one computer, but if you want to transfer the data, you always have to transfer it completely unless you use Git as the repository.

  default: 'data/mets.xml'
X-Ocrd-Manifestation-Depth:
  default: partial
  values: ["partial", "full"]
Contributor


X-Ocrd-Manifestation-Depth:
  default: partial
  values: ["partial", "full", "diff"]
X-Ocrd-Identifier:
  required: false
X-Ocrd-Version:
  required: false
X-Ocrd-Md5:
  required: false


An OCRD-ZIP MUST be a valid ZIP file.
Specify whether the bag contains the full manifestation of the data referenced in the METS (`full`)
or only those files that were `file://` URLs before (`partial`). Default: `partial`.
Contributor


X-Ocrd-Manifestation-Depth

Specify whether the bag contains the full manifestation of the data referenced in the METS (full)
or only those files that were file:// URLs before (partial). In the case of diff, X-Ocrd-Identifier and
X-Ocrd-Version must be defined to identify the base. For safety, a checksum of the base
(X-Ocrd-Md5) may also be given. The diff value may be used to ingest new versions of an existing document into the LTA.

X-Ocrd-Identifier

A unique identifier is required for the LTA. This should be fetched from mets.xml.

X-Ocrd-Version

Positive integer holding the version number of the base. The version number will be incremented during ingest into the LTA.

X-Ocrd-Md5

Checksum of the file manifest-md5.txt, as recorded in tagmanifest-md5.txt.
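
To make the proposed rules concrete, here is a minimal sketch of how a validator might check these tags. It assumes the bag-info tags have already been read into a plain dictionary; the function name `check_manifestation_tags` is hypothetical and not part of any OCR-D or BagIt library.

```python
def check_manifestation_tags(tags):
    """Return a list of problems with the proposed X-Ocrd-* tags (empty if none).

    `tags` is assumed to be a dict of bag-info tag names to string values.
    """
    problems = []
    # "partial" is the default proposed in the profile.
    depth = tags.get("X-Ocrd-Manifestation-Depth", "partial")
    if depth not in ("partial", "full", "diff"):
        problems.append(f"unknown manifestation depth: {depth!r}")
    if depth == "diff":
        # A diff bag has to name the base it applies to.
        for key in ("X-Ocrd-Identifier", "X-Ocrd-Version"):
            if key not in tags:
                problems.append(f"{key} is required when the depth is 'diff'")
    version = tags.get("X-Ocrd-Version")
    if version is not None:
        try:
            is_positive = int(version) > 0
        except (TypeError, ValueError):
            is_positive = False
        if not is_positive:
            problems.append("X-Ocrd-Version must be a positive integer")
    return problems
```

An empty result means the tags are consistent with the proposal; X-Ocrd-Md5 stays optional here because the comment only says it "may" be given.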

Contributor


Ocrd-Checksum
Checksum of file manifest-sha512.txt
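
As an illustration of what such a checksum tag would hold, the following sketch computes the checksum of a manifest file and looks up the value a tag manifest records for it. The helper names `file_checksum` and `tagmanifest_entry` are hypothetical; the only assumption about the file format is the usual BagIt manifest line layout of a checksum followed by whitespace and a file path.

```python
import hashlib
from pathlib import Path


def file_checksum(path, algorithm="sha512"):
    """Hex digest of the file at `path`, read in chunks."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def tagmanifest_entry(bag_dir, filename, algorithm="sha512"):
    """Checksum recorded for `filename` in tagmanifest-<algorithm>.txt, or None."""
    tagmanifest = Path(bag_dir) / f"tagmanifest-{algorithm}.txt"
    for line in tagmanifest.read_text(encoding="utf-8").splitlines():
        parts = line.split(None, 1)  # "<checksum> <filepath>"
        if len(parts) == 2 and parts[1].strip() == filename:
            return parts[0]
    return None
```

Comparing `file_checksum(bag_dir / "manifest-sha512.txt")` with `tagmanifest_entry(bag_dir, "manifest-sha512.txt")` would show whether the value stored in the tag still matches the payload manifest.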

Manifests-Required:
- md5
- sha512
Allow-Fetch.txt: false
Contributor


What exactly does this mean?
Does it mean that files must not be fetched, or that no fetch.txt file is allowed?
We could use fetch.txt for referencing original files.

Member Author

@kba kba Oct 26, 2018


This disallows both the fetch.txt mechanism and the file itself, cf. https://github.com/bagit-profiles/bagit-profiles

I don't see the value in delivering incomplete packages to be then completed via HTTP requests by either the research data or long term preservation repository. Ingestion packages should be self-contained.
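
A minimal sketch of how a consumer might enforce this rule, assuming the bag has been unpacked to a directory; `violates_no_fetch` is a hypothetical helper, not an existing API.

```python
from pathlib import Path


def violates_no_fetch(bag_dir, allow_fetch=False):
    """True if the bag ships a fetch.txt although the profile forbids it."""
    has_fetch_file = (Path(bag_dir) / "fetch.txt").exists()
    return has_fetch_file and not allow_fetch
```

With `Allow-Fetch.txt: false`, any bag for which this returns `True` would be rejected as not self-contained.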

Every mapping must be on a new line.

Every line should have the format `URL FILENAME`, i.e. a single space character between the two.
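
The line format above could be parsed roughly as follows. This is only a sketch: the function name `parse_url_mappings` is hypothetical, skipping empty lines is an assumption, and rejecting a second leading space is one possible reading of "a single space character".

```python
def parse_url_mappings(text):
    """Parse 'URL FILENAME' lines into a dict of URL -> filename."""
    mappings = {}
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line:
            continue  # assumption: empty lines are ignored
        url, separator, filename = line.partition(" ")
        # Exactly one space must separate the URL from a non-empty filename.
        if not separator or not filename or filename.startswith(" "):
            raise ValueError(f"line {lineno}: expected 'URL FILENAME'")
        mappings[url] = filename
    return mappings
```

Because several URLs may be aliases for the same file, the URL is used as the dictionary key rather than the filename.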

Contributor


Why not use fetch.txt for this, or at least the same format:
url length filepath

  1. If the file is already there it should be ignored.
  2. Use Allow-Fetch.txt to control behaviour.

As fetch.txt should be already handled by the libraries this should avoid additional work.
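
For reference, the suggested handling could look like the sketch below. It parses fetch.txt-style lines (`url length filepath`, where BagIt allows `-` for an unknown length) and skips files that are already present, as in point 1. The function name `pending_fetches` is hypothetical, and no download is performed.

```python
from pathlib import Path


def pending_fetches(fetch_text, bag_dir):
    """List (url, size_or_None, filepath) entries whose file is not yet in the bag."""
    pending = []
    for line in fetch_text.splitlines():
        if not line.strip():
            continue
        # The file path comes last and may itself contain spaces.
        url, length, filepath = line.split(None, 2)
        if (Path(bag_dir) / filepath).exists():
            continue  # point 1: a file that is already there is ignored
        size = None if length == "-" else int(length)
        pending.append((url, size, filepath))
    return pending
```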

Member Author


> We could use fetch.txt for referencing original files.

If you mean the mechanism discussed in OCR-D/core#176 (url-sources): This is explicitly not about fetching data but different "aliases" for the same URL to allow using different names e.g. in PAGE for a file in METS. Since we're not describing different manifestations of the same data but just aliases I don't see the point of having the redundant file size information in there.

- md5
- sha512
Allow-Fetch.txt: false
Serialization: required
Contributor


See comments in bagit_ocrd_profile.yml.

ocrd_zip.md Outdated
- 0.97
- 0.96
Tag-Files-Required:
- url-sources.txt
Contributor


See comments above. May be replaced by fetch.txt.

@VolkerHartmann
Contributor

Sorry, I wrote this a long time ago, but apparently it wasn't sent.

kba referenced this pull request in OCR-D/repository_metastore Nov 2, 2018
Extend max file size to 10MB and improve exception handling. (fix #3)
The maximum file size for uploading can be adjusted in the configuration file 'application.properties'.
@kba kba merged commit 90721da into OCR-D:master Nov 6, 2018
@kba kba deleted the bagit branch November 6, 2018 12:02
2 participants