Use legacy datasets without creating a `data_dir` #6886

lagru · 2023-04-11T21:46:28Z

Description

Closes #4664.

Even though skimage.data uses lazy loading, its submodule _fetchers.py is executed when skimage is imported, because its attribute data_dir is imported in multiple places:

scikit-image/skimage/__init__.py

Line 141 in 51225d1

from .data import data_dir

Previously, this led to _init_pooch() always being executed, which in turn tried to create the data directory preemptively. This led to problems when data_dir wasn't writable, e.g. when scikit-image is used in read-only containers.

This refactoring alleviates this by postponing the directory creation until it is actually needed and data is being downloaded. Calling download_all() also ensures that legacy files are copied to data_dir; this use case was requested in #3945 (comment) and should be preserved this way.

With the previous and current state, the behavior is somewhat difficult to test with regard to self-contained tests and multi-processing / multi-threading. _fetchers.py should probably be refactored into a class whose API is less intertwined with global state; that might be easier to test.
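To make the refactoring idea above a bit more concrete, here is a minimal sketch of a class that only remembers its directories and creates the cache directory on demand. The names `DataLocator`, `ensure_cache_dir`, and `locate` are hypothetical and not part of scikit-image; downloading via pooch is deliberately left out.

```python
import os
import tempfile


class DataLocator:
    """Hypothetical sketch; not part of scikit-image's API."""

    def __init__(self, legacy_dir, cache_dir):
        # Only remember the paths; constructing the object touches no files.
        self.legacy_dir = legacy_dir
        self.cache_dir = cache_dir

    def ensure_cache_dir(self):
        # Created on demand, e.g. right before a download would write to it.
        os.makedirs(self.cache_dir, exist_ok=True)
        return self.cache_dir

    def locate(self, filename):
        # Prefer a cached copy, then fall back to the file shipped with the package.
        for base in (self.cache_dir, self.legacy_dir):
            candidate = os.path.join(base, filename)
            if os.path.isfile(candidate):
                return candidate
        return None  # a real implementation would download the file here


with tempfile.TemporaryDirectory() as tmp:
    legacy = os.path.join(tmp, "legacy")
    os.makedirs(legacy)
    with open(os.path.join(legacy, "astronaut.png"), "wb") as f:
        f.write(b"placeholder")
    cache = os.path.join(tmp, "cache")

    locator = DataLocator(legacy, cache)
    found = locator.locate("astronaut.png")
    found_is_legacy = found == os.path.join(legacy, "astronaut.png")
    cache_exists_after_lookup = os.path.isdir(cache)
    locator.ensure_cache_dir()
    cache_exists_after_ensure = os.path.isdir(cache)
```

Because both directories are constructor arguments rather than module-level attributes, a test can point them at temporary directories without touching the real cache.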



skimage/data/_fetchers.py

lagru · 2023-04-11T21:50:37Z

skimage/data/tests/test_data.py

-def test_data_dir():
-    # data_dir should be a directory people can use as a standard directory
-    # https://github.com/scikit-image/scikit-image/pull/3945#issuecomment-498141893
-    data_dir = data.data_dir
-    assert 'astronaut.png' in os.listdir(data_dir)


This test might fail until download_all() has been called at least once. I opted to remove it for now and would rather think about how we might refactor this to also make it easier to test (e.g. fewer functions relying on module-level attributes).

imagesc-bot · 2023-04-11T21:55:22Z

This pull request has been mentioned on Image.sc Forum. There might be relevant details there:

https://forum.image.sc/t/why-does-init-py-attempt-to-download-sample-data/79665/8


stefanv · 2023-04-12T00:34:36Z

There's only one failure here on MacOSX and Python 3.11. But, even if we can track it down, I'm not sure whether the change to pathlib (Path objects) is warranted: does it provide a clear advantage? Also note that the current patch changes the return type of at least lbp_frontal_face_cascade_filename().

lagru · 2023-04-12T09:11:28Z

> There's only one failure here on MacOSX and Python 3.11. But, even if we can track it down, I'm not sure whether the change to pathlib (Path objects) is warranted: does it provide a clear advantage? Also note that the current patch changes the return type of at least lbp_frontal_face_cascade_filename().

That's what I get for scratching that itch. 🙈 I think the update to pathlib was a bad call in the context of this fix; especially because I didn't take into account that this might change our API when I made that change. I'll revert this part.

Long-term, I would think of an update from os.path to pathlib as removing technical debt.


lagru · 2023-04-12T10:36:24Z

Hmm, so the failing test doesn't seem to be due to the pathlib refactoring but the other changes...

lagru · 2023-04-12T13:48:39Z

Curious, multipage_rgb.tif is still included in skimage/data but not in skimage/data/meson.build. Is that by intention?

Though I'm not sure if this is the source of the bug. I can't reproduce this locally and don't have a Mac available...


lagru · 2023-04-12T15:04:57Z

skimage/io/tests/test_multi_image.py

+    img_cache = MultiImage(os.pathsep.join(paths_cache))
+
+    assert len(img_cache) == 2
+    np.testing.assert_array_equal(img_fetch[0], img_cache[0])


Getting closer. On macos-cp3.11, it fails on this line with

    E   AssertionError:
    E   Arrays are not equal
    E
    E   (shapes (25, 14, 4), (2, 10, 10, 3) mismatch)
    E    x: array([[[173, 197, 187, 255],
    E           [181, 198, 198, 255],
    E           [166, 183, 172, 255],...
    E    y: array([[[[1.222920e-01, 1.532035e-01, 7.516385e-01],
    E            [1.991089e-01, 7.685891e-01, 5.854871e-01],
    E            [1.314398e-01, 3.879300e-01, 9.433194e-01],...

Still no idea why. test_debug passes, so the actual hashes of the fetched and cached files are the same?

> Curious, multipage_rgb.tif is still included in skimage/data but not in skimage/data/meson.build. Is that by intention?
>
> Though I'm not sure if this is the source of the bug. I can't reproduce this locally and don't have a Mac available...

My guess is that we only included data files in the wheel that are (a) in the repo and (b) used by skimage.data. But, frankly, it may just as well have been an oversight.

Ha, I think I'm close. The failing test uses two paths one of which now points to the legacy directory. My guess is that the new path is sorted differently in

scikit-image/skimage/io/collection.py

Lines 202 to 207 in ceaa87d

    if _is_multipattern(load_pattern):
        if isinstance(load_pattern, str):
            load_pattern = load_pattern.split(os.pathsep)
        for pattern in load_pattern:
            self._files.extend(glob(pattern))
        self._files = sorted(self._files, key=alphanumeric_key)

for macos-cp3.11 which leads to the arrays in MultiImage having a different order. The mismatching shape in https://github.com/scikit-image/scikit-image/actions/runs/4679595104/jobs/8289818459#step:6:574 indicates such...

Continuing tomorrow. 😴

Confirmed. The paths to where scikit-image is installed are different, so the paths are sorted differently. On macos-cp3.10:

    # macos-cp3.10
    sorted_fetch = [
        '/Users/runner/Library/Caches/scikit-image/main/data/multipage_rgb.tif',
        '/Users/runner/hostedtoolcache/Python/3.10.10/x64/lib/python3.10/site-packages/skimage/data/no_time_for_that_tiny.gif'
    ]

On macos-cp3.11:

    # macos-cp3.11
    sorted_fetch = [
        '/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/skimage/data/no_time_for_that_tiny.gif',
        '/Users/runner/Library/Caches/scikit-image/main/data/multipage_rgb.tif'
    ]

I'm not sure why the install location is different or which tool causes it, though I think it's safe to say that the tests shouldn't rely on the paths happening to sort the correct way.
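The effect can be reproduced without a Mac by sorting the reported paths directly. This is only a sketch: it uses plain string sorting as a stand-in for `alphanumeric_key` (an assumption that happens to give the same order for these particular paths), and the helper `by_basename` is hypothetical, shown as one way a test could get an order that does not depend on the install prefix.

```python
import posixpath

# Paths as reported by the two CI jobs.
cache_file = "/Users/runner/Library/Caches/scikit-image/main/data/multipage_rgb.tif"
site_310 = (
    "/Users/runner/hostedtoolcache/Python/3.10.10/x64/lib/python3.10/"
    "site-packages/skimage/data/no_time_for_that_tiny.gif"
)
site_311 = (
    "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/"
    "site-packages/skimage/data/no_time_for_that_tiny.gif"
)

# Sorting full paths: the result depends on the directory prefixes.
order_310 = [posixpath.basename(p) for p in sorted([site_310, cache_file])]
order_311 = [posixpath.basename(p) for p in sorted([site_311, cache_file])]


def by_basename(paths):
    # Hypothetical helper: order by file name only, ignoring the directory.
    return sorted(paths, key=posixpath.basename)


stable_310 = [posixpath.basename(p) for p in by_basename([site_310, cache_file])]
stable_311 = [posixpath.basename(p) for p in by_basename([site_311, cache_file])]
```

With full paths the two jobs disagree on which file comes first, while the basename-keyed order is the same on both.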

stefanv · 2023-04-13T15:54:45Z

Wow, nice sleuthing. I think if you keep the pattern the same as it was before (copy the files over, and only use files from cache), you should be fine. There's no reason to return the file path to the legacy dataset.

But I suppose I am confused, because I thought the purpose of the PR would be to not create a cache directory or copy files into it until strictly necessary. I.e., if a package version is available, that should be served up instead of even touching the cache.

lagru · 2023-04-13T16:32:43Z

tools/github/script.sh

@@ -7,7 +7,7 @@ export MPL_DIR=`python -c 'import matplotlib; print(matplotlib.get_configdir())'
 mkdir -p $MPL_DIR
 touch $MPL_DIR/matplotlibrc

-TEST_ARGS="--doctest-modules --cov=skimage"
+TEST_ARGS="--doctest-modules --cov=skimage --showlocals"


I'd like to keep this configuration. I think it's very useful to have more context on failed tests. At the same time it doesn't increase the output for passing tests.

Sure, that's fine.

lagru · 2023-04-13T16:33:26Z

Finally, green (failing test is unrelated). 🙏 I think it's ready for review / merge.

skimage/data/_fetchers.py

stefanv · 2023-04-13T16:53:35Z

skimage/data/_fetchers.py

+        shutil.copy2(readme_src, readme_dest)
+
+
+def _fetch(data_filename, *, copy_legacy_to_cache=False):


Can we get rid of this flag?

Replying here because this touches on other questions you ask.

> But I suppose I am confused, because I thought the purpose of the PR would be to not create a cache directory or copy files into it until strictly necessary. I.e., if a package version is available, that should be served up instead of even touching the cache.

No, you got that right. I'm trying to address two demands here:

1. Requesting a legacy dataset (the ones included in the distributed archives) from _fetch serves a path without triggering the creation of the cache directory.

2. When download_all(directory=...) is called, put all files including legacy datasets into that directory.

We can remove the flag if we update download_all so that 2. is still met. I'm happy to update the function accordingly. Following this PR, I wanted to ask if there is some interest in a larger refactoring anyway.

That makes sense; yes, I think that would simplify things a bit, what do you think?

Note that I discovered two additional bugs in the old implementation of download_all while doing so:

1. Running

       import skimage as ski
       ski.data.download_all()
       ski.data.download_all(directory="example_dir")

   would not create anything in "example_dir" because _fetch would always return the cached entry before ever invoking pooch's cache mechanism to place it in "example_dir". This is addressed by 0ddf467.

2. Running

       import skimage as ski
       ski.data.download_all("~/skimage-data")

   will place files at two locations: files in the distribution are placed in "[working_dir]/~/skimage-data" while files downloaded with pooch are placed in /home/[user]/skimage-data. I think this is because our old os.path machinery doesn't resolve ~ while pooch uses pathlib which does.

So, should we switch to pathlib? 🙏 😅

Haha, to keep this from blowing up, perhaps just os.path.expanduser.

Good catches, btw!
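The two locations described in this thread can be reproduced with the standard library alone. In this sketch, `resolve_directory` is a hypothetical helper name; only `os.path.expanduser` and `os.path.abspath` are real calls.

```python
import os


def resolve_directory(directory):
    # Hypothetical helper: expand a leading "~" before making the path absolute.
    return os.path.abspath(os.path.expanduser(directory))


# Without expansion, "~" is treated as an ordinary directory name
# relative to the current working directory.
unexpanded = os.path.abspath("~/skimage-data")

# With expansion, the path points into the user's home directory.
expanded = resolve_directory("~/skimage-data")
```

Here `unexpanded` ends in a literal `~/skimage-data` below the working directory, which matches the "[working_dir]/~/skimage-data" location reported above.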

Okay 😞 , but honestly this is a bit of a mess. I am testing most of this stuff via console, debugging and checking actual folders. I would sincerely recommend refactoring this in a follow-up. Perhaps not with utmost priority, but I wouldn't be surprised at all if there are more subtle edge cases that we haven't discovered yet. 🙈

Yes, that's a good idea: i.e., break it up into a utility function that handles pathnames (easily testable), and then some machinery for fetching from a fake cache perhaps.

If you want to do that refactor here, that'd be fine; we're not in a big rush.

I'm happy to do the refactoring. But the CI is finally not complaining, so I wouldn't mind this fix going in, in case the follow-up work gets stalled for some reason.

skimage/data/_fetchers.py

stefanv · 2023-04-13T16:54:51Z

tools/github/script.sh

@@ -7,7 +7,7 @@ export MPL_DIR=`python -c 'import matplotlib; print(matplotlib.get_configdir())'
 mkdir -p $MPL_DIR
 touch $MPL_DIR/matplotlibrc

-TEST_ARGS="--doctest-modules --cov=skimage"
+TEST_ARGS="--doctest-modules --cov=skimage --showlocals"


Sure, that's fine.

Instead, it is now the task of `download_all(directory=...)` to place a copy of every data file in `directory` or, if not given, the default cache directory. This also addresses another previously undiscovered bug. Running

    import skimage as ski
    ski.data.download_all()
    ski.data.download_all(directory="example_dir")

would not create anything in "example_dir" because `_fetch` would always return the cached entry before ever invoking pooch's cache mechanism to place it in "example_dir".

Previously, running

    import skimage as ski
    ski.data.download_all("~/skimage-data")

would place files at two locations: files in the distribution were placed in "[working_dir]/~/skimage-data" while files downloaded with pooch were placed in /home/[user]/skimage-data. I think this was because our old os.path machinery doesn't resolve ~ while pooch uses pathlib which does. To address this, we make download_all explicitly expand the user if directory is given.
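As a rough illustration of the behavior described in these commit messages, the sketch below copies every file from a stand-in "legacy" directory into a target directory that is created only when the copy happens. `copy_legacy_files` is a hypothetical name and this is not the scikit-image implementation; downloading the remaining files with pooch is omitted.

```python
import os
import shutil
import tempfile


def copy_legacy_files(legacy_dir, directory):
    # Hypothetical helper: expand "~", create the target only now, then copy.
    directory = os.path.expanduser(directory)
    os.makedirs(directory, exist_ok=True)
    copied = []
    for name in sorted(os.listdir(legacy_dir)):
        src = os.path.join(legacy_dir, name)
        if os.path.isfile(src):
            shutil.copy2(src, os.path.join(directory, name))
            copied.append(name)
    return copied


with tempfile.TemporaryDirectory() as tmp:
    legacy = os.path.join(tmp, "legacy")
    os.makedirs(legacy)
    with open(os.path.join(legacy, "astronaut.png"), "wb") as f:
        f.write(b"placeholder")

    target = os.path.join(tmp, "example_dir")
    target_existed_before = os.path.exists(target)
    copied = copy_legacy_files(legacy, target)
    target_listing = sorted(os.listdir(target))
```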

stefanv · 2023-04-14T18:55:51Z

Thanks, Lars!

lagru added 2 commits April 11, 2023 21:50

Replace os.path with pathlib

af142c5

lagru added 🔧 type: Maintenance Refactoring and maintenance of internals 🩹 type: Bug fix Fixes unexpected or incorrect behavior labels Apr 11, 2023

lagru requested review from stefanv and jarrodmillman April 11, 2023 21:46

lagru commented Apr 11, 2023


Fix typo in docstring

d980f56

lagru changed the title ~~Use legacy datasets without creating a data_dir for caching~~ Use legacy datasets without creating a data_dir Apr 11, 2023

lagru added this to the 0.21 milestone Apr 11, 2023

Fix test errors due to pathlib refactoring

f1aa140

Also, examples in `ImageCollection`'s docstring [1] indicate that this is part of our API, so revert `data_dir` back to being a string.

lagru force-pushed the readonly-import-data branch from e83992a to f1aa140 Compare April 11, 2023 22:46

Fix return types

fb5e860

lagru added 2 commits April 12, 2023 11:25

Revert refactoring to pathlib

697876f

Long-term, I would think of an update from `os.path` to `pathlib` as removing technical debt, but it was a bad call in the context of this fix, especially because it had unintended side effects on our API.

Ensure cache subdir exists

09aa7cc

lagru added 2 commits April 12, 2023 13:22

Debug failure on macos-cp3.11

c46cc06

Debug: Use absolute paths

5a08248

lagru added 2 commits April 12, 2023 16:08

Debug: ignore legacy path

45c5eb4

multipage_rgb.tif is not in our distribution archives.

Debug: remove codecov dependency

efeb658

Super strange, but suddenly it seems that codecov has disappeared from PyPI [1]... [1] https://pypi.org/project/codecov/

lagru commented Apr 12, 2023


lagru added 4 commits April 13, 2023 00:18

Debug: use --showlocal for pytest

d1871a3

Debug: test hashes and pure imread

0556508

Remove missed codecov in pyproject.toml

211be7e

Debug check sorting

650b1a5

lagru force-pushed the readonly-import-data branch from 41ddb54 to 650b1a5 Compare April 12, 2023 22:59

lagru added 3 commits April 13, 2023 15:37

Always try cache first when fetching datasets

87a1fbf

Remove debug test

766988b

Merge branch 'main' into readonly-import-data

5eaaf69

lagru commented Apr 13, 2023


stefanv reviewed Apr 13, 2023


lagru added 2 commits April 13, 2023 23:10

Use proper cache_dir in _fetch without pooch too

65426ae

lagru force-pushed the readonly-import-data branch from 40b0a50 to 65426ae Compare April 13, 2023 21:33

stefanv approved these changes Apr 14, 2023


stefanv merged commit d7dd4b9 into scikit-image:main Apr 14, 2023

lagru deleted the readonly-import-data branch April 15, 2023 17:21

mkcor pushed a commit to mkcor/scikit-image that referenced this pull request May 22, 2023

Resolve conflicts following PR scikit-image#6886

0408c1d

This was referenced May 22, 2023

Make create_image_fetcher and image_fetcher private. #6950

Closed

Make image_fetcher and create_image_fetcher in data private #6855

Merged
