This issue was moved to a discussion.

Matching PAGE imageFilename to mets:file when imageFilename is not a URL #176

Closed
kba opened this issue Aug 30, 2018 · 34 comments
Labels
bug discussion Diskussion/ Input aus der Gruppe erforderlich

Comments

@kba
Member

kba commented Aug 30, 2018

Scenario:

  1. Image files and PAGE referencing those image files by relative filepath:

    <Page imageFilename="foo.tif"/>
  2. Create a METS file and run workspace add:

    <mets:file GROUPID="page0001" xlink:href="file://path/to/bla/foo.tif"/>

Now the PAGE imageFilename and xlink:href of the corresponding mets:file do not match anymore.

@kba
Member Author

kba commented Aug 31, 2018

Solution 1) On workspace add, change the imageFilename of the PAGE.

Solution 2) Have a convention in place to map from @imageFilename to @xlink:href (like "match if fileName is suffix to a mets:file@xlink:href" or "match if fileName is GROUPID of a page and there is a mets:file with mimetype image/* with that GROUPID"; this could be automated).

Solution 3) Track the relation externally. A mechanism like this will be necessary anyway because of the problem of local URLs irreversibly replacing remote URLs when downloading files.
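
The two matching rules named under solution 2 could be sketched roughly as follows. This is only an illustration under stated assumptions: `MetsFile` and `match_image` are hypothetical names, not part of the ocrd API, and the sketch refuses to guess when a rule yields more than one candidate.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MetsFile:
    """Hypothetical stand-in for a mets:file entry (not the ocrd API)."""
    group_id: str
    mimetype: str
    href: str


def match_image(image_filename: str, files: List[MetsFile]) -> Optional[MetsFile]:
    """Map a PAGE imageFilename to a mets:file using the two rules from solution 2."""
    # Rule 1: imageFilename is a path suffix of exactly one xlink:href.
    by_suffix = [
        f for f in files
        if f.href == image_filename or f.href.endswith("/" + image_filename)
    ]
    if len(by_suffix) == 1:
        return by_suffix[0]
    # Rule 2: imageFilename equals a GROUPID that has exactly one image/* file.
    by_group = [
        f for f in files
        if f.group_id == image_filename and f.mimetype.startswith("image/")
    ]
    if len(by_group) == 1:
        return by_group[0]
    # Ambiguous or missing: refuse to guess.
    return None


files = [
    MetsFile("page0001", "image/tiff", "file://path/to/bla/foo.tif"),
    MetsFile("page0001", "application/vnd.prima.page+xml", "file://path/to/bla/foo.xml"),
]
```

With the scenario from the first comment, `match_image("foo.tif", files)` would pick the TIFF entry via the suffix rule.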

@wrznr
Contributor

wrznr commented Aug 31, 2018

Without having a full view of the technical consequences: is this only a cosmetic nuisance? I am not sure we ever touch the file refs in PAGE, do we?

@kba
Member Author

kba commented Aug 31, 2018

See https://ocr-d.github.io/page#url-for-imagefilename--filename

The imageFilename is necessary to get from page to the mets:file that represents the image.

@wrznr
Contributor

wrznr commented Aug 31, 2018

Ah, okay. In this case, 👍 for solution 3.

@VolkerHartmann

Solution 1 only works if workspace add is used which may be a drawback.
Solution 2 sounds complex. There may be several images for the same page (orig, binarized, cropped, deskewed,...)
Solution 3 works out of the box analyzing whole METS and referenced PAGEs in one step. This may be done each time an export/import is planned.
My vote for solution 3.

@kba
Member Author

kba commented Sep 4, 2018

Solution 1 only works if workspace add is used which may be a drawback.

Solution 3 works out of the box analyzing whole METS and referenced PAGEs in one step.

Solution 3 reasonably also only works on workspace add since this has to be an external file in the workspace (currently I'm using url-aliases.csv). It could be populated by hand or external mechanism but then again, so could you change the PAGE by hand (or with sed).
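
An external mapping file of this kind could be read along the following lines. The thread does not show the layout of url-aliases.csv, so the two-column `local_path,url` format, the function names and the example URL here are all assumptions made for illustration.

```python
import csv
import io
from typing import Dict


def load_aliases(handle) -> Dict[str, str]:
    """Read 'local_path,url' rows into a dict (column layout is an assumption)."""
    aliases = {}
    for row in csv.reader(handle):
        if len(row) != 2:
            continue  # skip blank or malformed rows
        local_path, url = row
        aliases[local_path] = url
    return aliases


def to_persistent(local_path: str, aliases: Dict[str, str]) -> str:
    """Return the URL recorded for a local path, or the path itself if unknown."""
    return aliases.get(local_path, local_path)


# Hypothetical file content; a real file would be opened with open(..., newline="").
sample = io.StringIO("OCR-D-IMG/foo.tif,https://example.org/images/foo.tif\n")
aliases = load_aliases(sample)
```

Keeping the mapping outside both METS and PAGE means neither file has to be rewritten when a remote file is downloaded, which is the point of solution 3.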

@bertsky
Collaborator

bertsky commented Sep 8, 2018

Pardon my being slow-witted, but what was the reason for https://ocr-d.github.io/page#url-for-imagefilename--filename (always requiring a URL, even if local) in the first place? Why not use relative paths (without file:// scheme)?

I thought the workspace metaphor would work like a DVCS repository. But if we require URLs everywhere, I cannot move my workspaces around in the filesystem. Am I supposed to clone -l instead?

(BTW, pack / unpack should also beware of file URLs.)

@kba
Member Author

kba commented Sep 10, 2018

what was the reason for ocr-d.github.io/page#url-for-imagefilename--filename (always requiring a URL, even if local) in the first place? Why not use relative paths (without file:// scheme)?

The original plan was to completely forgo the filesystem and use a repository for all intermediate results, not just of workflow runs but of individual processors
(hence the file resolver and cache etc.). Processors were to download the data by URL, do their thing, upload the data and set URL. file:// URL or relative paths
should be avoided because having them manifest in the data makes it error-prone when tasks are to be distributed, parallelized etc., in a workflow.

The workspace is the place where processors "do their thing", a mere implementation-specific helper for a processor. We considered the mets.xml to be the single
source of truth for all data and metadata, it should always be enough to have that mets.xml and access all files via their persistent HTTP URL.

Nowadays, full provenance and reproducibility of every single step is not our top priority anymore. This allows us to make that workspace/Git-like approach a first-class concept. We should adapt the specs to reflect this.

@bertsky
Collaborator

bertsky commented Sep 10, 2018

Thanks, it makes sense to me now. But what still escapes me is the logic of:

file://URLs or relative paths
should be avoided because having them manifest in the data makes it error-prone when tasks are to be distributed, parallelized etc.

I completely agree as far as file:// URLs are concerned, but relative paths? Isn't that manifestation the best way to make a distributed system thrive (as DVCS success shows) and scale? Requiring all computation to do I/O via URLs incurs a huge bottleneck and hinders parallelization (due to synchronization effort). Even with a distributed file system (which is an alternative to URLs with client-server transfer protocols) I would recommend allowing intermediate I/O to be local (temporary).

Anyway, if I understand you correctly, you will move towards allowing local intermediate steps and workspaces as true DVCS. Can I conclude from this that relative file names will be your preferred solution for this issue, too? (Or am I misreading your explanation?)

@kba
Member Author

kba commented Sep 10, 2018

Isn't that manifestation the best way to make a distributed system thrive (as DVCS success shows) and scale?

In mass digitisation we cannot assume that mets.xml and referenced data are on the same FS (workspace/DVCS metaphor) so the mets.xml acts more as a manifest.

Requiring all computation to do I/O via URLs incurs a huge bottleneck and hinders parallelization

Of course you need some form of caching on a local filesystem. Hence the workspace: Create a local folder with all required files for a processor to work on. In fact that was why originally those were created in /tmp because it was mounted in RAM and hence fast.

But once the local processes are complete, ensure that all data is stored persistently and no references to local files remain. You need to do that I/O at some point, download all the files, keep track of which local file represents which file URL, and in the end store it somewhere persistently.

Can I conclude from this that relative file names will be your preferred solution for this issue, too?

No, I would still prefer URL to be used in the data. The best way to avoid having references to local-only data is not to persist it. Instead, I'd be for a mechanism to map local filenames to opaque identifiers, such as a URL or whatever string is in the imageFilename of a PAGE-XML etc.

@bertsky
Collaborator

bertsky commented Sep 10, 2018

In mass digitisation we cannot assume that mets.xml and referenced data are on the same FS

Sorry, I somehow forgot about that (it seems strange to me now, too). Then the DVCS metaphor is perhaps misleading.

So how about this new scheme: A workspace is nothing but an identical copy of the remote mets.xml (using only public URLs) plus the files in relative paths of the local FS – by the same path name convention as ocrd-zip (or something that does not require changing the imageFilename and filename of PAGE-XML to mets:fileGrp USE directory and mets:file ID filename).

But once the local processes are complete, ensure that all data is stored persistently and no references to local files remain. You need to do that I/O at some point, download all the files, keep track of which local file represents which file URL, and in the end store it somewhere persistently.

Yes, understood. In the above scheme, there would be no local reference (to keep track of) any more. So as soon as a new file gets added, one would be required to provide a URL for it. Making persistent then simply uploads the modified mets.xml plus the added files to the new URLs.

@kba
Member Author

kba commented Sep 11, 2018

Then the DVCS metaphor is perhaps misleading.

I think it is useful for individual processes, as an abstraction for implementers and for workflow/archiving purposes. But from the perspective of a digitisation engineer/sysadmin, it's best to assume:

  • Data is not available locally
  • Files cannot be changed, only new files added
  • Only mets.xml, command line parameters and terminal input/output determine the results
  • Don't expect (legacy) workflows to produce input data that adheres to every convention

the same path name convention as ocrd-zip (or something that does not require changing the imageFilename and filename of PAGE-XML

This is tricky. It's what I meant by solution 2) above. If we assume that input page files use random strings as filename how would you map that back to the mets:file? We tried to require a convention and it failed - reasonably - before even being tested on real-world data (which is even messier, with NFS file paths used as xlink:href or invalid characters for IDs etc).

Making persistent then simply uploads the modified mets.xml plus the added files to the new URLs.

This is what I'd consider the repository approach: A service that accepts PUT/POST requests and GET download requests. I would have preferred that six months ago, but as you said earlier, it's much more effort for the processors to fetch&upload. It's reasonably easy to integrate into core (we experimented with that early on) but not all contributors build on it and it makes testing much harder, requires a repository server etc.

IIUC (@VolkerHartmann @wrznr?) we won't have a repository on the task level, we cannot enforce naming conventions in input data. That leaves us with the option to have external mappings between identifiers, pure file-system access to files and OCRD-ZIP with the planned BagIt+Git extensions (OCR-D/spec#70 and OCR-D/spec#73) as the exchange format.

@bertsky
Collaborator

bertsky commented Sep 12, 2018

Ok, I got it now, that was really asking for your solution 2.

If we assume that input page files use random strings as filename how would you map that back to the mets:file?

Well I did not say random, I rather meant a convention that fits most existing naming schemes. But I can imagine how that quickly collapses with real-world data.

So (to be more precise) why not just use

mets:fileGrp USE directory and mets:file ID filename

exclusively, with solution 1 (and not changing URLs in the METS at all)? And invalid characters for ID would be a problem in any case, wouldn't they?

Making persistent then simply uploads the modified mets.xml plus the added files to the new URLs.

This is what I'd consider the repository approach: A service that accepts PUT/POST requests and GET download requests. I would have preferred that six months ago, but as you said earlier, it's much more effort for the processors to fetch&upload. It's reasonably easy to integrate into core (we experimented with that early on) but not all contributors build on it and it makes testing much harder, requires a repository server etc.

I am not sure I understand that, yet. Making persistent in my sense happens at the end of the workflow pipeline. Everything in between can happen locally, and processors could be allowed to create "temporary" annotations (marked by, say, bogus URLs) in between. As you said earlier, at some point that initial/final I/O needs to happen anyway.

I would also favour established standards like BagIt over a self-baked OCRD-ZIP here. But if I am not mistaken, then an external mapping would not be necessary: all the URLs stay in the mets.xml, all directory and file names in the archive (in this case, data/) or filesystem derive from its USE and ID attributes. (And that of course does not rule out OCRD-GITZIP either.)

@ehrmn ehrmn added the discussion Diskussion/ Input aus der Gruppe erforderlich label Oct 23, 2018
@VolkerHartmann

Can't we reference via USE and ID? The modules should already know these values, as they have to address the file via them, right?
mets://OCR-D-IMG/OCR-D-IMG_0001

This is also how we download and rename external files referenced in METS.
Ok, mets may be an invalid protocol.
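
To make the proposed reference form concrete, here is a minimal sketch of how a string like `mets://OCR-D-IMG/OCR-D-IMG_0001` could be split into its USE and ID parts. The `mets` scheme is not registered anywhere, as noted above, and `parse_mets_ref` is a hypothetical helper rather than anything in ocrd core.

```python
from typing import Tuple
from urllib.parse import urlsplit


def parse_mets_ref(ref: str) -> Tuple[str, str]:
    """Split 'mets://<USE>/<ID>' into (mets:fileGrp USE, mets:file ID)."""
    parts = urlsplit(ref)
    # urlsplit keeps the case of netloc, so the USE value survives unchanged.
    file_id = parts.path.lstrip("/")
    if parts.scheme != "mets" or not parts.netloc or not file_id or "/" in file_id:
        raise ValueError("not a mets://USE/ID reference: %s" % ref)
    return parts.netloc, file_id
```

A caller would then look up the `mets:file` by these two values instead of by any file name stored in the PAGE.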

@bertsky
Collaborator

bertsky commented Oct 23, 2018

Can't we reference via USE and ID.

If by reference you mean store to the filesystem (at workspace add or workspace clone time) and retrieve from the filesystem (within processors), then this is exactly what I was proposing. (I still do not see the necessity of external file-URL bookkeeping.) After all, the workspace is the filesystem "cache" of a document repository (mets.xml + annotations). Why should it even bother with the filename part of its persistent URLs?

@kba
Member Author

kba commented Nov 13, 2018

@kba note to self: revisit after OCR-D/assets#18

@kba
Member Author

kba commented Dec 20, 2018

We've switched to relative paths throughout. While I still think a mechanism for external URL bookkeeping (as @bertsky puts it) would be useful, it is not currently necessary, so I'll close this until an actual need arises. Thanks for all the feedback.

@kba kba closed this as completed Dec 20, 2018
@bertsky
Collaborator

bertsky commented Jul 19, 2019

I don't think this can wait until the dev workshop.

@kba
Member Author

kba commented Sep 5, 2019

Revisiting this with @tboenig:

  • imageFilename in PAGE must always be a relative file path relative to that PAGE file, otherwise tools like Aletheia or PAGEViewer won't work
  • mets:FLocat is ideally a relative path from the mets.xml

So we need logic to determine the relative path from mets.xml to image by resolving imageFilename of a PAGE against the relative path to that PAGE.

  • mets.xml: OCR-D-PAGE/foo.xml
  • OCR-D-PAGE/foo.xml: ../OCR-D-IMG/foo.tif
  • => OCR-D-IMG/foo.tif <- mets:FLocat of that image in mets.xml
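
The resolution in the three bullets above can be sketched with plain path arithmetic. The function name is hypothetical, and the sketch assumes both inputs are POSIX-style relative paths; `normpath` works purely on the strings, so symlinked directories would not be followed.

```python
import posixpath


def image_path_from_mets(page_path: str, image_filename: str) -> str:
    """Resolve a PAGE-relative imageFilename to a path relative to mets.xml.

    page_path: path of the PAGE file relative to mets.xml
    image_filename: value of imageFilename, relative to the PAGE file
    """
    page_dir = posixpath.dirname(page_path)
    # Join against the PAGE file's directory, then collapse the '..' segment.
    return posixpath.normpath(posixpath.join(page_dir, image_filename))
```

Called with `"OCR-D-PAGE/foo.xml"` and `"../OCR-D-IMG/foo.tif"`, this yields `OCR-D-IMG/foo.tif`, matching the example.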

kba added a commit to kba/ocrd-core that referenced this issue Sep 10, 2019
kba added a commit to kba/ocrd-core that referenced this issue Sep 10, 2019
@mikegerber
Contributor

mikegerber commented Sep 24, 2019

* `imageFilename` in PAGE must always be a relative file path relative to that PAGE file, otherwise tools like Aletheia or PAGEViewer won't work
* `mets:FLocat` is ideally a relative path from the `mets.xml`

Is this the consensus now? Because a. I want/need to use the PAGE Viewer and b. it also seems correct.

@bertsky
Collaborator

bertsky commented Sep 24, 2019

I think so. But this will have repercussions all over our implementations: until now, everything was relative to METS. And we have an additional interdependency between tools and data (GT bags) here. So it might take some time until this is available. Until then we all have to live with the hassle of pointing PageViewer to the image every time.

@mikegerber
Contributor

mikegerber commented Sep 26, 2019

I used to automatically correct the imageFilename for easy viewing in PAGE Viewer. But with the latest ocrd 1.0.0b19, the situation is worse because ocrd workspace validate now seems to check for the (in my opinion) incorrect METS-relative filenames.

16:24:54.211 INFO ocrd.resolver.download_to_directory - directory=|/srv/data/qurator-data/OCR-D-GT-repacked/busmexpo_742567524| url=|../OCR-D-IMG-BIN/OCR-D-IMG-BIN_0001.png| basename=|OCR-D-IMG-BIN_0001.png| if_exists=|skip| subdir=|TEMP|
16:24:54.211 INFO ocrd.resolver.download_to_directory - directory=|/srv/data/qurator-data/OCR-D-GT-repacked/busmexpo_742567524| url=|/srv/data/qurator-data/OCR-D-GT-repacked/busmexpo_742567524/../OCR-D-IMG-BIN/OCR-D-IMG-BIN_0001.png| basename=|OCR-D-IMG-BIN_0001.png| if_exists=|skip| subdir=|TEMP|
Traceback (most recent call last):
  File "/home/mike/.virtualenvs/ocrd/lib/python3.7/site-packages/ocrd/workspace.py", line 100, in download_file
    f.url = self.resolver.download_to_directory(self.directory, f.url, subdir=f.fileGrp, basename=basename)
  File "/home/mike/.virtualenvs/ocrd/lib/python3.7/site-packages/ocrd/resolver.py", line 77, in download_to_directory
    raise FileNotFoundError("File path passed as 'url' to download_to_directory does not exist: %s" % url)
FileNotFoundError: File path passed as 'url' to download_to_directory does not exist: ../OCR-D-IMG-BIN/OCR-D-IMG-BIN_0001.png

It also tries to "download" the local file to TEMP and so this seems to be connected to issue #324.

@bertsky
Collaborator

bertsky commented Sep 26, 2019

I am sure the new validation was added in preparation of fixing this within the new logic.

But there is a simple remedy: just --skip=imageFilename

@mikegerber
Contributor

Not remedied using the latest master which has this skip option:

% ocrd workspace validate --skip pixel_density --skip imagefilename mets.xml
Traceback (most recent call last):
  File "/home/mike/devel/OCR-D/core/ocrd/ocrd/workspace.py", line 100, in download_file
    f.url = self.resolver.download_to_directory(self.directory, f.url, subdir=f.fileGrp, basename=basename)
  File "/home/mike/devel/OCR-D/core/ocrd/ocrd/resolver.py", line 77, in download_to_directory
    raise FileNotFoundError("File path passed as 'url' to download_to_directory does not exist: %s" % url)
FileNotFoundError: File path passed as 'url' to download_to_directory does not exist: ../OCR-D-IMG-BIN/OCR-D-IMG-BIN_0001.png

kba added a commit to kba/ocrd-core that referenced this issue Oct 16, 2019
kba added a commit that referenced this issue Oct 16, 2019
workspace bagger: update PAGE imageFilenames, #176
@kba
Member Author

kba commented Oct 16, 2019

PAGE filenames will have to be relative to the METS. PAGE Viewer and Aletheia will have options to change the base for relative filenames. Since #333 PAGE filenames in OCRD-ZIP will be updated, but this has not yet been implemented for general workspace methods.

@bertsky
Collaborator

bertsky commented Jan 10, 2020

So all that remains to do here is fixing workspace add, right?

It should be simple to implement something along the lines of https://github.com/OCR-D/docs/blob/master/fix-gt.sh in core Python...

@cneud
Member

cneud commented Jan 10, 2020

I admit I am slightly puzzled what still needs fixing here...IIUC, there must not/cannot be a case where the PAGE imageFilename IS NOT relative to the mets.xml - either a PAGE file has been created by some ocrd-* process and thus should always be relative to the mets.xml or the PAGE file is ground truth in which case we also (need to) ensure this is the case. Or am I missing sth? Do you have an example @bertsky?

@kba
Member Author

kba commented Jan 10, 2020

Until then we all have to live with the hassle of pointing PageViewer to the image every time.

PAGE Viewer has --resolve-dir now PRImA-Research-Lab/prima-page-viewer#6

Do you have an example @bertsky?

I would also find that helpful. I'm having a hard time thinking of a case where we add to a workflow PAGE-XML that does not already adhere to the imageFilename-relative-to-mets / imageFilename-must-be-in-METS patterns. In most cases, workflows will start with images from which we derive PAGE-XML with correct imageFilename, won't they?

@bertsky
Collaborator

bertsky commented Jan 12, 2020

IIUC, there must not/cannot be a case where the PAGE imageFilename IS NOT relative to the mets.xml - either a PAGE file has been created by some ocrd-* process and thus should always be relative to the mets.xml or the PAGE file is ground truth in which case we also (need to) ensure this is the case. Or am I missing sth?

Neither of these cases is what ocrd workspace add is typically used for. You need this for GT files from other sources (or OCR-D GT releases before BagIt/METS, which even now are the only GT with text content). These have varying @imageFilename conventions, depending on their directory structure. Now when ocrd workspace add reads a PAGE-XML file, it can still resolve the original image in the filesystem, and try to rebase to the workspace.
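
The rebasing step described here (resolve the image from the PAGE file's original location, then express it relative to the workspace) might look roughly like this. It is a sketch, not what `ocrd workspace add` does; the function name and the example paths are made up, and all inputs are assumed to be absolute POSIX paths.

```python
import posixpath


def rebase_image_filename(page_src: str, image_filename: str, workspace_dir: str) -> str:
    """Rewrite imageFilename so it is relative to the workspace directory.

    page_src: absolute path of the PAGE file in its original location
    image_filename: imageFilename as found in that PAGE file
    workspace_dir: absolute path of the directory that holds mets.xml
    """
    # Where the image really lives, judged from the PAGE file's old directory.
    image_abs = posixpath.normpath(
        posixpath.join(posixpath.dirname(page_src), image_filename)
    )
    # The same location, seen from the workspace root.
    return posixpath.relpath(image_abs, start=workspace_dir)
```

If the result starts with `..`, the image lies outside the workspace and would have to be copied in before the reference can be considered portable.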

One obvious use-case would be ocrd-import. (But in that repo, you can still work around the problem by doing ocrd-make repair afterwards, at least sometimes)

But maybe, you'd say, this is too difficult to get right in ocrd workspace add, please use ocrd zip bag for that! But how will this work, if the old URL did not work to begin with?

@kba
Member Author

kba commented Jan 13, 2020

when ocrd workspace add reads a PAGE-XML file, it can still resolve the original image in the filesystem, and try to rebase to the workspace.
[...]
But maybe, you'd say, this is too difficult to get right in ocrd workspace add

It's a simple enough feature, questions:

  • How to determine file metadata for the imageFilename? Media Type can be guessed but what mets:fileGrp to add the images to? Maybe the filegroup used as the input plus suffix -IMG?
  • Moving images and PAGE to the workspace will require changing the input PAGE. Not really a question, just a statement
  • Any issues that arise from necessary conventions for this are the user's responsibility, i.e. if they want to set a different name or different media type for an image, they either need to post-process the XML themselves or not use this feature and do the image adding themselves as before
  • Also do this for AlternativeImage? Does anyone beside us even use them? I suppose yes and no.

Let's make it toggleable with a --include-page-images/--no-include-page-images or similar flag.

Let's default NOT to do this because it really only makes sense when importing data, not, e.g., every time a bashlib processor wants to add an image.

@bertsky
Collaborator

bertsky commented Jan 16, 2020

* Moving images and PAGE to the workspace will require changing the input PAGE. Not really a question, just a statement

Yes, that's crucial. If we take this seriously, ocrd workspace add on PAGE-XML files will either take control of that file or make a copy of it (under the "right" path).

* Also do this for AlternativeImage? Does anyone beside us even use them? I suppose yes and no.

I guess we have to consider the possibility. If we solve this conceptually for Page/@imageFilename, it should work the same for AlternativeImage/@filename though.

* How to determine file metadata for the `imageFilename`? Media Type can be guessed but what `mets:fileGrp` to add the images to? Maybe the filegroup used as the input plus suffix `-IMG`?

IIUC you assume here that ocrd workspace add will be responsible for adding the image file along with the PAGE-XML file passed to it. We could have other provisions (like assuming the image file must already have been added by then), but let's follow this logic for now:

Yes, the image could be placed under a fileGrp implicitly derived from the fileGrp for the PAGE-XML, or even the same fileGrp (just with a different MIME type and not appearing in the structMap).

Let's make it toggleable with a --include-page-images/--no-include-page-images or similar flag.

If we add an option, why not just the name of the image file group (or none for "ignore images")?

* Any issues that arise from necessary conventions for this are the user's responsibility, i.e. if they want to set a different name or different media type for an image, they either need to post-process the XML themselves or not use this feature and do the image adding themselves as before

Right. And let's think about the second use-case (adding PAGE-XML after image) more thoroughly: Now ocrd workspace add can go looking for the (basename of the) filename in the (image) flocat URLs of the METS, and calculate the new relative path for the PAGE-XML under its destination directory. If it does not find an image with that filename, it can still go looking for an image with the same pageId. And then it can fail loudly.

Personally, I think this is the more sensible interface than add-image-via-PAGE.
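
The lookup order for this second use-case (match by basename, fall back to pageId, otherwise fail loudly) could be sketched as below. `ImageEntry` and `find_image` are hypothetical names for illustration, not the ocrd API, and the sketch treats more than one candidate as a failure rather than picking one.

```python
import posixpath
from dataclasses import dataclass
from typing import List


@dataclass
class ImageEntry:
    """Hypothetical view of an image mets:file (not the ocrd API)."""
    page_id: str
    flocat: str  # path relative to mets.xml


def find_image(image_filename: str, page_id: str, images: List[ImageEntry]) -> ImageEntry:
    """Find the image a PAGE file refers to: by basename first, then by pageId."""
    wanted = posixpath.basename(image_filename)
    for candidates in (
        [i for i in images if posixpath.basename(i.flocat) == wanted],
        [i for i in images if i.page_id == page_id],
    ):
        if len(candidates) == 1:
            return candidates[0]
    # Nothing unique found: fail loudly instead of guessing.
    raise LookupError("no unique image for %r (pageId %r)" % (image_filename, page_id))


images = [
    ImageEntry("P0001", "OCR-D-IMG/foo.tif"),
    ImageEntry("P0002", "OCR-D-IMG/bar.tif"),
]
```

Under the METS-relative rule stated on Oct 16, 2019, the `flocat` of the entry found this way could be written back as the new imageFilename.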

Let's default NOT to do this because it really only makes sense when importing data, not, e.g., every time a bashlib processor wants to add an image.

This got me confused: I thought we were talking about adding PAGE-XML files here?

@OCR-D OCR-D locked and limited conversation to collaborators Dec 20, 2021
@lena-hinrichsen lena-hinrichsen converted this issue into discussion #771 Dec 20, 2021


7 participants