-
-
Notifications
You must be signed in to change notification settings - Fork 2.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use legacy datasets without creating a data_dir
#6886
Conversation
Even though skimage.data uses lazy loading, its submodule `_fetcher.py` is executed when `skimage` is imported, because its attribute `data_dir` is imported in multiple places [1]. Previously, this lead to `_init_pooch()` being always executed, which in turn tried to create the data directory preemptively. This lead to problems when `data_dir` wasn't writeable, e.g. when scikit-image is used in read-only containers. This refactoring alleviates this by postponing the directory creation until it is actually needed and data is being downloaded. Calling `download_all()` also ensures that legacy files are copied to `data_dir`; this use case was requested in [2] and should be preserved this way. With the previous and current state, the behavior is somewhat difficult to test with regard to self-contained tests and multi-processing / multi-threading. `_fetchers.py` should probably be refactored into a class whose API is less intertwined with global state; that might be easier to test. [1] https://github.com/scikit-image/scikit-image/blob/51225d16ddebffacd0ccd9a54d06a673e3caff98/skimage/__init__.py#L141 [2] scikit-image#3945 (comment)
def test_data_dir(): | ||
# data_dir should be a directory people can use as a standard directory | ||
# https://github.com/scikit-image/scikit-image/pull/3945#issuecomment-498141893 | ||
data_dir = data.data_dir | ||
assert 'astronaut.png' in os.listdir(data_dir) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test might fail until download_all()
was called at least once. I opted to remove it for now and rather think about how we might refactor this stuff to make it also easier to test (e.g. less functions relying on module-level attributes).
data_dir
for cachingdata_dir
This pull request has been mentioned on Image.sc Forum. There might be relevant details there: https://forum.image.sc/t/why-does-init-py-attempt-to-download-sample-data/79665/8 |
Also, examples in `ImageCollection`'s docstring [1] indicate that this is part of our API, so revert `data_dir` back to being a string.
e83992a
to
f1aa140
Compare
There's only one failure here on MacOSX and Python 3.11. But, even if we can track it down, I'm not sure whether the change to pathlib (Path objects) is warranted: does it provide a clear advantage? Also note that the current patch changes the return type of at least |
That's what I get for scratching that itch. 🙈 I think the update to pathlib was a bad call in the context of this fix; especially because I didn't take into account that this might change our API when I made that change. I'll revert this part. Long-term, I would think of an update from |
Long-term, I would think of an update from `os.path` to `pathlib` as removing technical debt but it was a bad call in the context of this fix. Especially, because it had unintended side-effects to our API.
Hmm, so the failing test doesn't seem to be due to the |
Curious, Tough not sure if this is the source of the bug. I can't reproduce this locally and don't have a Mac available... |
multipage_rgb.tif is not in our distribution archives.
Super strange, but suddenly it seems that codecov has disappeared from PyPI [1]... [1] https://pypi.org/project/codecov/
skimage/io/tests/test_multi_image.py
Outdated
img_cache = MultiImage(os.pathsep.join(paths_cache)) | ||
|
||
assert len(img_cache) == 2 | ||
np.testing.assert_array_equal(img_fetch[0], img_cache[0]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Getting closer. On macos-cp3.11, it fails on this line with
E AssertionError:
E Arrays are not equal
E
E (shapes (25, 14, 4), (2, 10, 10, 3) mismatch)
E x: array([[[173, 197, 187, 255],
E [181, 198, 198, 255],
E [166, 183, 172, 255],...
E y: array([[[[1.222920e-01, 1.532035e-01, 7.516385e-01],
E [1.991089e-01, 7.685891e-01, 5.854871e-01],
E [1.314398e-01, 3.879300e-01, 9.433194e-01],...
Still no idea why. test_debug
passes, so the actual hash of the fetched and cached file are the same?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Curious,
multipage_rgb.tif
is still included in skimage/data but not in skimage/data/meson.build. Is that by intention?Tough not sure if this is the source of the bug. I can't reproduce this locally and don't have a Mac available...
My guess is that we only included data files in the wheel that are (a) in the repo and (b) used by skimage.data
. But, frankly, it may just as well have been an oversight.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ha, I think I'm close. The failing test uses two paths one of which now points to the legacy directory. My guess is that the new path is sorted differently in
scikit-image/skimage/io/collection.py
Lines 202 to 207 in ceaa87d
if _is_multipattern(load_pattern): | |
if isinstance(load_pattern, str): | |
load_pattern = load_pattern.split(os.pathsep) | |
for pattern in load_pattern: | |
self._files.extend(glob(pattern)) | |
self._files = sorted(self._files, key=alphanumeric_key) |
for macos-cp3.11 which leads to the arrays in MultiImage having a different order. The mismatching shape in https://github.com/scikit-image/scikit-image/actions/runs/4679595104/jobs/8289818459#step:6:574 indicates such...
Continuing tomorrow. 😴
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Confirmed. The paths to were the scikit-image is installed are different, therefore it sorts the paths differently on e.g. macos-cp3.10:
sorted_fetch = [
'/Users/runner/Library/Caches/scikit-image/main/data/multipage_rgb.tif',
'/Users/runner/hostedtoolcache/Python/3.10.10/x64/lib/python3.10/site-packages/skimage/data/no_time_for_that_tiny.gif'
]
on macos-cp3.11
# macos-cp3.11
sorted_fetch = [
'/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/skimage/data/no_time_for_that_tiny.gif',
'/Users/runner/Library/Caches/scikit-image/main/data/multipage_rgb.tif'
]
I'm not sure why the install location is different or which tool causes it, though I think it's safe to say that the tests shouldn't rely rely on the paths being sorted by chance the correct way.
41ddb54
to
650b1a5
Compare
Wow, nice sleuthing. I think if you keep the pattern the same as it was before (copy the files over, and only use files from cache), you should be fine. There's no reason to return the file path to the legacy dataset. But I suppose I am confused, because I thought the purpose of the PR would be to not create a cache directory or copy files into it until strictly necessary. I.e., if a package version is available, that should be served up instead of even touching the cache. |
@@ -7,7 +7,7 @@ export MPL_DIR=`python -c 'import matplotlib; print(matplotlib.get_configdir())' | |||
mkdir -p $MPL_DIR | |||
touch $MPL_DIR/matplotlibrc | |||
|
|||
TEST_ARGS="--doctest-modules --cov=skimage" | |||
TEST_ARGS="--doctest-modules --cov=skimage --showlocals" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like to keep this configuration. I think it's very useful to have more context on failed tests. At the same time it doesn't increase the output for passing tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, that's fine.
Finally, green (failing test is unrelated). 🙏 I think it's ready for review / merge. |
skimage/data/_fetchers.py
Outdated
shutil.copy2(readme_src, readme_dest) | ||
|
||
|
||
def _fetch(data_filename, *, copy_legacy_to_cache=False): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we get rid of this flag?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Replying here because this touches on other questions you ask.
But I suppose I am confused, because I thought the purpose of the PR would be to not create a cache directory or copy files into it until strictly necessary. I.e., if a package version is available, that should be served up instead of even touching the cache.
No you got that right. I'm trying to address two demands here:
- Requesting a legacy dataset (the ones included in the distributed archives) from
_fetch
, serves a path without triggering the creation of the cache directory. - When
download_all(directory=...)
is called, put all files including legacy datasets into that directory.
We can remove the flag if we update download_all
so that 2. is still met. I'm happy to update the function accordingly. Following this PR, I wanted to ask if there is some interest in a larger refactoring anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes sense; yes, I think that would simplify things a bit, what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that I discovered two additional bugs in the old implementation of download_all
while doing so:
-
Running
import skimage as ski ski.data.download_all() ski.data.download_all(directory="example_dir")
would not create anything in "example_dir" because
_fetch
would always
return the cached entry before ever invoking pooches cache mechanism to
place it in "example_dir". This is addressed by 0ddf467. -
Running
import skimage as ski ski.data.download_all("~/skimage-data")
will place files at two locations: files in the distribution are placed in "[working_dir]/~/skimage-data" while files downloaded with pooch are placed in
/home/[user]/skimage-data
. I think this is because our oldos.path
machinery doesn't resolve~
while pooch usespathlib
which does.
So, should we switch to pathlib
? 🙏 😅
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Haha, to keep this from blowing up, perhaps just os.path.expanduser
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catches, btw!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay 😞 , but honestly this is a bit of a mess. I am testing most of this stuff via console, debugging and checking actual folders. I would sincerely recommend to refactor this following this. Perhaps not with utmost priority but I wouldn't be surprised at all if there are more subtle edge cases that we didn't discover yet. 🙈
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's a good idea: i.e., break it up into a utility function that handles pathnames (easily testable), and then some machinery for fetching from a fake cache perhaps.
If you want to do that refactor here, that'd be fine; we're not in a big rush.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm happy to do the refactoring. But the CI is finally not complaining, so I wouldn't mind for this fix to go in in case the follow-up work get's stalled for some reason.
@@ -7,7 +7,7 @@ export MPL_DIR=`python -c 'import matplotlib; print(matplotlib.get_configdir())' | |||
mkdir -p $MPL_DIR | |||
touch $MPL_DIR/matplotlibrc | |||
|
|||
TEST_ARGS="--doctest-modules --cov=skimage" | |||
TEST_ARGS="--doctest-modules --cov=skimage --showlocals" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, that's fine.
Instead, it is now the task of `download_all(directory=...)`` to place a copy of every data_file in `directory` or - if not given - the default cache directory. This also addresses another previously undiscovered bug. Running import skimage as ski ski.data.download_all() ski.data.download_all(directory="example_dir") would not create anything in "example_dir" because `_fetch` would always return the cached entry before ever invoking pooches cache mechanism to place it in "example_dir".
40b0a50
to
65426ae
Compare
Previously, running import skimage as ski ski.data.download_all("~/skimage-data") would place files at two locations: files in the distribution are placed in "[working_dir]/~/skimage-data" while files downloaded with pooch were placed in /home/[user]/skimage-data. I think this was because our old os.path machinery doesn't resolve ~ while pooch uses pathlib which does. To address this we make download_all explicitly expand the user if directory is given.
Thanks, Lars! |
Description
Closes #4664.
Even though
skimage.data
uses lazy loading, its submodule_fetcher.py
is executed when skimage is imported, because its attributedata_dir
is imported in multiple places:scikit-image/skimage/__init__.py
Line 141 in 51225d1
Previously, this lead to
_init_pooch()
being always executed, which in turn tried to create the data directory preemptively. This lead to problems whendata_dir
wasn't write-able, e.g. when scikit-image is used in read-only containers.This refactoring alleviates this by postponing the directory creation until it is actually needed and data is being downloaded. Calling
download_all()
also ensures that legacy files are copied todata_dir
; this use case was requested in #3945 (comment) and should be preserved this way.With the previous and current state, the behavior is somewhat difficult to test with regard to self-contained tests and multi-processing / multi-threading.
_fetchers.py
should probably be refactored into a class whose API is less intertwined with global state; that might be easier to test.Checklist
./doc/examples
(new features only)./benchmarks
, if your changes aren't covered by anexisting benchmark
For reviewers
later.
__init__.py
.doc/release/release_dev.rst
.example, to backport to v0.19.x after merging, add the following in a PR
comment:
@meeseeksdev backport to v0.19.x
run-benchmark
label. To rerun, the labelcan be removed and then added again. The benchmark output can be checked in
the "Actions" tab.