Replace OCRD-ZIP with BagIt-based spec #70
For versioning, the Oxford Common File Layout (OCFL) makes a good impression. It is simple(?) and file-based. Git is unbeatable as long as you stay on one computer, but as soon as you want to transfer the data you always have to transfer it completely, unless you also use git as the repository.
bagit_ocrd_profile.yml
Outdated
    default: 'data/mets.xml'
  X-Ocrd-Manifestation-Depth:
    default: partial
    values: ["partial", "full"]
  X-Ocrd-Manifestation-Depth:
    default: partial
    values: ["partial", "full", "diff"]
  X-Ocrd-Identifier:
    required: false
  X-Ocrd-Version:
    required: false
  X-Ocrd-Md5:
    required: false
An OCRD-ZIP MUST be a valid ZIP file.
Specify whether the bag contains the full manifestation of the data referenced in the METS (`full`)
or only those files that were `file://` URLs before (`partial`). Default: `partial`.
`X-Ocrd-Manifestation-Depth`
Specify whether the bag contains the full manifestation of the data referenced in the METS (`full`) or only those files that were `file://` URLs before (`partial`). In case of `diff`, `X-Ocrd-Identifier` and `X-Ocrd-Version` have to be defined to identify the base. For safety reasons there may also be a checksum `X-Ocrd-Md5` of the base file. The `diff` value may be used for ingesting new versions of an existing document into the LTA.
`X-Ocrd-Identifier`
A unique identifier is required for the LTA. This should be fetched from `mets.xml`.
`X-Ocrd-Version`
Positive integer holding the version number of the base. The version number will be incremented during ingest into the LTA.
`X-Ocrd-Md5`
Checksum of the file `manifest-md5.txt`, as found in `tagmanifest-md5.txt`.
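A minimal sketch of how the proposed `X-Ocrd-Md5` check could work, assuming the tag manifest uses the usual `<checksum> <path>` line layout. The function names (`md5_hex`, `checksum_listed_for`, `base_matches`) are hypothetical and not part of any OCR-D or BagIt library; this only illustrates comparing the declared value against both the tag manifest entry and the actual file content.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the hexadecimal MD5 digest of raw bytes."""
    return hashlib.md5(data).hexdigest()


def checksum_listed_for(tagmanifest_text: str, filename: str):
    """Look up the checksum recorded for `filename` in a tag manifest.

    Each line is assumed to be `<checksum> <path>`; returns None if absent.
    """
    for line in tagmanifest_text.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2 and parts[1].strip() == filename:
            return parts[0].lower()
    return None


def base_matches(manifest_bytes: bytes, tagmanifest_text: str, x_ocrd_md5: str) -> bool:
    """True if X-Ocrd-Md5 agrees with the tag manifest entry and the actual file."""
    actual = md5_hex(manifest_bytes)
    listed = checksum_listed_for(tagmanifest_text, "manifest-md5.txt")
    return actual == listed == x_ocrd_md5.lower()
```

An ingest step could call `base_matches` on the stored base version before applying a `diff` bag and reject the bag if it returns `False`.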
`Ocrd-Checksum`
Checksum of the file `manifest-sha512.txt`
bagit_ocrd_profile.yml
Outdated
Manifests-Required:
  - md5
  - sha512
Allow-Fetch.txt: false
What exactly does this mean?
That files must not be fetched, or that no `fetch.txt` file is allowed at all?
We could use `fetch.txt` for referencing original files.
This disallows both the `fetch.txt` mechanism and the file itself, cf. https://github.com/bagit-profiles/bagit-profiles
I don't see the value in delivering incomplete packages that then have to be completed via HTTP requests by either the research data repository or the long-term preservation repository. Ingestion packages should be self-contained.
Every mapping must be on a new line.

Every line should have the format `URL FILENAME`, i.e. a single space character between the two.
Why not use `fetch.txt` for this, or at least the same format: `url length filepath`?
- If the file is already there, the entry should be ignored.
- Use `Allow-Fetch.txt` to control the behaviour.
As `fetch.txt` should already be handled by the libraries, this should avoid additional work.
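The suggestion above can be sketched as follows, assuming the `url length filepath` layout with `-` standing for an unknown length, as BagIt permits. `FetchEntry`, `parse_fetch_line` and `entries_to_fetch` are hypothetical names used only for illustration, and the set of existing paths is passed in rather than read from disk.

```python
from typing import List, NamedTuple, Optional, Set


class FetchEntry(NamedTuple):
    url: str
    length: Optional[int]  # None when the length is given as "-"
    filepath: str


def parse_fetch_line(line: str) -> FetchEntry:
    """Parse one `url length filepath` line."""
    url, length, filepath = line.strip().split(None, 2)
    return FetchEntry(url, None if length == "-" else int(length), filepath)


def entries_to_fetch(fetch_text: str, existing: Set[str]) -> List[FetchEntry]:
    """Return entries whose target is not present yet; existing files are ignored."""
    entries = [parse_fetch_line(l) for l in fetch_text.splitlines() if l.strip()]
    return [e for e in entries if e.filepath not in existing]
```

With this shape, a consumer that already has a file simply skips the corresponding line, which matches the first bullet above.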
> We could use fetch.txt for referencing original files.

If you mean the mechanism discussed in OCR-D/core#176 (`url-sources`): This is explicitly not about fetching data but about different "aliases" for the same URL, to allow using different names, e.g. in PAGE, for a file in METS. Since we're not describing different manifestations of the same data but just aliases, I don't see the point of having the redundant file size information in there.
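To illustrate the alias idea, here is a rough sketch of reading a `url-sources.txt` in the `URL FILENAME` format from the diff (one mapping per line, a single space between the two). The function name `parse_url_sources` and the choice to key the result by URL are assumptions made for this example, not something defined in the spec or in OCR-D/core.

```python
def parse_url_sources(text: str) -> dict:
    """Map each URL alias to the file name it stands for."""
    aliases = {}
    for number, line in enumerate(text.splitlines(), start=1):
        if not line:
            continue  # tolerate empty lines
        # Split at the first space only: everything after it is the file name.
        url, sep, filename = line.partition(" ")
        if not sep or not url or not filename:
            raise ValueError(f"line {number}: expected 'URL FILENAME'")
        aliases[url] = filename
    return aliases
```

Several URLs may map to the same file name, which is the point of the mechanism; no length column is needed because nothing is fetched.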
  - md5
  - sha512
Allow-Fetch.txt: false
Serialization: required
See comments in bagit_ocrd_profile.yml.
ocrd_zip.md
Outdated
  - 0.97
  - 0.96
Tag-Files-Required:
  - url-sources.txt
See comments above. May be replaced by `fetch.txt`.
Sorry, I wrote this a long time ago but apparently it wasn't sent.
First stab at defining an exchange format that is
Feedback appreciated @VolkerHartmann @krvoigt