Matching PAGE imageFilename to mets:file when imageFilename is not a URL #176
Comments
Solution 1) On …
Solution 2) Have a … in place to map from …
Solution 3) Track the relation externally. A mechanism like this will be necessary anyway because of the problem of local URLs irreversibly replacing remote URLs when downloading files. |
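For illustration only, a minimal sketch of what solution 3 (tracking the relation externally) could look like: a JSON sidecar next to mets.xml recording which local file caches which original URL, so the METS itself never has to be rewritten on download. The file name url-map.json and both helper functions are hypothetical, not taken from any OCR-D spec or code.

```python
import json
from pathlib import Path

MAP_NAME = "url-map.json"  # hypothetical sidecar file, not part of any spec


def record_download(workspace_dir, url, local_path):
    # Remember that local_path (relative to the workspace) caches url.
    map_file = Path(workspace_dir) / MAP_NAME
    mapping = json.loads(map_file.read_text()) if map_file.exists() else {}
    mapping[str(local_path)] = url
    map_file.write_text(json.dumps(mapping, indent=2))


def original_url(workspace_dir, local_path):
    # Look up the original URL for a locally cached file, if any.
    map_file = Path(workspace_dir) / MAP_NAME
    if not map_file.exists():
        return None
    return json.loads(map_file.read_text()).get(str(local_path))
```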
Without being able to foresee all the technical consequences: is this only a cosmetic nastiness? I am not sure we ever touch the file refs in PAGE, do we? |
See https://ocr-d.github.io/page#url-for-imagefilename--filename The |
Ah, okay. In this case, 👍 for solution 3. |
Solution 1 only works if workspace add is used, which may be a drawback. |
Solution 3 reasonably also only works on |
Pardon my being slow-witted, but what was the reason for https://ocr-d.github.io/page#url-for-imagefilename--filename (always requiring a URL, even if local) in the first place? Why not use relative paths (without …)? I thought the workspace metaphor would work like a DVCS repository. But if we require URLs everywhere, I cannot move my workspaces around in the filesystem. Am I supposed to …? (BTW, … |
The original plan was to completely forgo the filesystem and use a repository for all intermediate results, not just of workflow runs but of individual processors. The workspace is the place where processors "do their thing", a mere implementation-specific helper for a processor. We considered the mets.xml to be the single source of truth. Nowadays, full provenance and reproducibility of every single step is not our top priority anymore. This allows us to make that workspace/Git-like approach a first-class citizen.
Thanks, it makes sense to me now. But what still escapes me is the logic of:
I completely agree as far as … Anyway, if I understand you correctly, you will move towards allowing local intermediate steps and workspaces as true DVCS. Can I conclude from that that relative file names will be your preferred solution for this issue, too? (Or am I misreading your explanation?) |
In mass digitisation we cannot assume that mets.xml and referenced data are on the same FS (workspace/DVCS metaphor), so the mets.xml acts more as a manifest.
Of course you need some form of caching on a local filesystem. Hence the workspace: create a local folder with all required files for a processor to work on. In fact that was why originally those were created in …. But once the local processes are complete, ensure that all data is stored persistently and no references to local files remain. You need to do that I/O at some point: download all the files, keep track of which local file represents which file URL, and in the end store it somewhere persistently.
No, I would still prefer URLs to be used in the data. The best way to avoid having references to local-only data is not to persist it. Instead, I'd be for a mechanism to map local filenames to opaque identifiers, such as a URL or whatever string is in the … |
Sorry, I somehow forgot about that (it seems strange to me now, too). Then the DVCS metaphor is perhaps misleading. So how about this new scheme: a workspace is nothing but an identical copy of the remote mets.xml (using only public URLs) plus the files in relative paths of the local FS – by the same path name convention as ocrd-zip (or something that does not require changing the …)
Yes, understood. In the above scheme, there would be no local reference (to keep track of) any more. So as soon as a new file gets added, one would be required to provide a URL for it. Making persistent then simply uploads the modified mets.xml plus the added files to the new URLs. |
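As a hedged sketch of that scheme: if one assumes (hypothetically) a fixed convention such as "fileGrp USE / basename of the URL" below the directory containing mets.xml, the local cache path is always derivable from the METS entry and no separate bookkeeping is needed. The convention and helper name are assumptions for illustration.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def local_path_for(file_grp_use, url):
    # Derive the conventional cache path for a mets:file from its fileGrp and URL.
    basename = PurePosixPath(urlparse(url).path).name
    return PurePosixPath(file_grp_use) / basename


# Example:
# local_path_for("OCR-D-IMG", "https://example.org/data/p0001.tif")
# -> PurePosixPath("OCR-D-IMG/p0001.tif")
```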
I think it is useful for individual processes, as an abstraction for implementers and for workflow/archiving purposes. But from the perspective of a digitisation engineer/sysadmin, it's best to assume:
This is tricky. It's what I meant by solution 2) above. If we assume that input PAGE files use random strings as …
This is what I'd consider the repository approach: a service that accepts PUT/POST upload requests and GET download requests. I would have preferred that six months ago, but as you said earlier, it's much more effort for the processors to fetch & upload. It's reasonably easy to integrate into …. IIUC (@VolkerHartmann @wrznr?) we won't have a repository on the task level, and we cannot enforce naming conventions in input data. That leaves us with the option of external mappings between identifiers, pure file-system access to files, and OCRD-ZIP with the planned BagIt+Git extensions (OCR-D/spec#70 and OCR-D/spec#73) as the exchange format. |
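A rough sketch of what the repository approach could look like from a processor's point of view, assuming a hypothetical HTTP service; the base URL, routes, and helper names below are invented for illustration and do not correspond to any existing OCR-D service.

```python
import requests

REPO_BASE = "https://repo.example.org/files"  # hypothetical endpoint


def fetch(identifier, target_path):
    # Download a file from the repository by its opaque identifier.
    resp = requests.get(f"{REPO_BASE}/{identifier}", timeout=30)
    resp.raise_for_status()
    with open(target_path, "wb") as out:
        out.write(resp.content)


def store(identifier, source_path):
    # Upload a processing result back under the same identifier.
    with open(source_path, "rb") as src:
        resp = requests.put(f"{REPO_BASE}/{identifier}", data=src, timeout=30)
    resp.raise_for_status()
```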
Ok, I got it now, that was really asking for your solution 2.
Well, I did not say random, I rather meant a convention that fits most existing naming schemes. But I can imagine how that quickly collapses with real-world data. So (to be more precise) why not just use … exclusively, with solution 1 (and not changing URLs in the METS at all)? And invalid characters for ID would be a problem in any case, wouldn't they?
I am not sure I understand that, yet. Making persistent in my sense happens at the end of the workflow pipeline. Everything in between can happen locally, and processors could be allowed to create "temporary" annotations (marked by, say, bogus URLs) in between. As you said earlier, at some point that initial/final I/O needs to happen anyway. I would also favour established standards like BagIt over a self-baked OCRD-ZIP here. But if I am not mistaken, then an external mapping would not be necessary: all the URLs stay in the mets.xml, all directory and file names in the archive (in this case, |
Can't we reference via USE and ID? The modules should already know these values, as they have to address the file via them anyway, right? This is also the way we download and rename external files referenced in METS. |
If by reference you mean store to the filesystem (at |
@kba note to self: revisit after OCR-D/assets#18 |
We've switched to relative paths throughout. While I still think a mechanism for external URL bookkeeping (as @bertsky puts it) would be useful, it is not currently necessary, so I'll close this until an actual need arises. Thanks for all the feedback. |
I don't think this can wait until the dev workshop. |
Revisiting this with @tboenig:
So we need logic to determine the image path relative to mets.xml by resolving the imageFilename of a PAGE file against the relative path of that PAGE file (sketched below).
|
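A minimal sketch of that resolution logic, assuming the PAGE file's own path relative to mets.xml is known; it is pure path arithmetic, and the fileGrp name in the example is only illustrative.

```python
import os


def image_path_relative_to_mets(page_path_relative_to_mets, image_filename):
    # e.g. ("OCR-D-GT-PAGE/p0001.xml", "../img/p0001.tif") -> "img/p0001.tif"
    page_dir = os.path.dirname(page_path_relative_to_mets)
    return os.path.normpath(os.path.join(page_dir, image_filename))
```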
…Filename exist in METS and on FS, OCR-D#176
Is this the consensus now? Because a. I want/need to use the PAGE Viewer and b. it also seems correct. |
I think so. But this will have repercussions all over our implementations: until now, everything was relative to METS. And we have an additional interdependency between tools and data (GT bags) here. So it might take some time until this is available. Until then we all have to live with the hassle of pointing PageViewer to the image every time. |
I used to automatically correct the
It also tries to "download" the local file to |
I am sure the new validation was added in preparation for fixing this within the new logic. But there is a simple remedy: just |
Not remedied using the latest master which has this skip option:
|
workspace bagger: update PAGE imageFilenames, #176
PAGE filenames will have to be relative to the METS. PAGE Viewer and Aletheia will have options to change the base for relative filenames. Since #333 PAGE filenames in OCRD-ZIP will be updated, but this has not yet been implemented for general workspace methods. |
So all that remains to do here is fixing …. It should be simple to implement something along the lines of https://github.com/OCR-D/docs/blob/master/fix-gt.sh in core Python... |
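A rough sketch of what such a fix-up could look like in core Python (this is not the actual fix-gt.sh logic): every PAGE imageFilename that does not resolve relative to mets.xml is re-resolved against the PAGE file's own directory and rewritten to be METS-relative. The one-fileGrp-directory-per-level layout assumed by the glob pattern is an illustration.

```python
import os
import xml.etree.ElementTree as ET
from glob import glob


def fix_image_filenames(workspace_dir):
    # Walk all XML files one directory level below mets.xml (typical fileGrp layout).
    for page_path in glob(os.path.join(workspace_dir, "*", "*.xml")):
        tree = ET.parse(page_path)
        root = tree.getroot()
        if not root.tag.endswith("}PcGts"):
            continue  # not a PAGE file
        # Keep the (version-specific) PAGE namespace as the default on output.
        ET.register_namespace("", root.tag[1:].split("}")[0])
        changed = False
        for elem in root.iter():
            if not elem.tag.endswith("}Page"):
                continue
            img = elem.get("imageFilename", "")
            if not img or os.path.exists(os.path.join(workspace_dir, img)):
                continue  # empty, or already resolvable relative to mets.xml
            # Try interpreting it relative to the PAGE file instead.
            candidate = os.path.normpath(
                os.path.join(os.path.dirname(page_path), img))
            if os.path.exists(candidate):
                elem.set("imageFilename",
                         os.path.relpath(candidate, workspace_dir))
                changed = True
        if changed:
            tree.write(page_path, xml_declaration=True, encoding="UTF-8")
```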
I admit I am slightly puzzled what still needs fixing here...IIUC, there must not/cannot be a case where the PAGE |
PAGE Viewer has
I would also find that helpful. I'm having a hard time thinking of a case where the PAGE-XML we add to a workflow does not already adhere to the …
Neither of these cases is what …. One obvious use-case would be ocrd-import. (But in that repo, you can still work around the problem by doing ….) But maybe, you'd say, this is too difficult to get right in … |
It's a simple enough feature; questions:
Let's make it toggleable with a …. Let's default NOT to do this, because it really only makes sense when importing data, not, e.g., every time a bashlib processor wants to add an image. |
Yes, that's crucial. If we take this seriously,
I guess we have to consider the possibility. If we solve this conceptually for
IIUC you assume here that …. Yes, the image could be placed under a fileGrp implicitly derived from the fileGrp for the PAGE-XML, or even the same fileGrp (just with a different MIME type and not appearing in the structMap).
If we add an option, why not just the name of the image file group (or none for "ignore images")?
Right. And let's think about the second use-case (adding PAGE-XML after the image) more thoroughly: now …. Personally, I think this is the more sensible interface than add-image-via-PAGE.
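For that PAGE-XML-after-image use-case, a hedged sketch of the consistency check that seems to be implied: before adding a PAGE file, look up the mets:file whose FLocat href matches the (METS-relative) imageFilename. The namespaces are the standard METS/XLink ones; the helper itself is illustrative, not part of core.

```python
import xml.etree.ElementTree as ET

METS = "{http://www.loc.gov/METS/}"
XLINK = "{http://www.w3.org/1999/xlink}"


def find_registered_image(mets_path, image_href):
    # Return (fileGrp USE, file ID) of the mets:file whose FLocat href equals
    # image_href (both taken relative to mets.xml), or None if not registered.
    root = ET.parse(mets_path).getroot()
    for grp in root.iter(METS + "fileGrp"):
        for fil in grp.iter(METS + "file"):
            flocat = fil.find(METS + "FLocat")
            if flocat is not None and flocat.get(XLINK + "href") == image_href:
                return grp.get("USE"), fil.get("ID")
    return None
```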
This got me confused: I thought we were talking about adding PAGE-XML files here? |
This issue was moved to a discussion. You can continue the conversation there.
Scenario:
Image files and PAGE referencing those image files by relative filepath: …
Create a METS file and run workspace add:
Now the PAGE imageFilename and xlink:href of the corresponding mets:file do not match anymore.