Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use legacy datasets without creating a data_dir #6886

Merged
merged 21 commits into from
Apr 14, 2023

Conversation

lagru
Copy link
Member

@lagru lagru commented Apr 11, 2023

Description

Closes #4664.

Even though skimage.data uses lazy loading, its submodule _fetcher.py is executed when skimage is imported, because its attribute data_dir is imported in multiple places:

from .data import data_dir

Previously, this lead to _init_pooch() being always executed, which in turn tried to create the data directory preemptively. This lead to problems when data_dir wasn't write-able, e.g. when scikit-image is used in read-only containers.

This refactoring alleviates this by postponing the directory creation until it is actually needed and data is being downloaded. Calling download_all() also ensures that legacy files are copied to data_dir; this use case was requested in #3945 (comment) and should be preserved this way.

With the previous and current state, the behavior is somewhat difficult to test with regard to self-contained tests and multi-processing / multi-threading. _fetchers.py should probably be refactored into a class whose API is less intertwined with global state; that might be easier to test.

Checklist

  • Docstrings for all functions
  • Gallery example in ./doc/examples (new features only)
  • Benchmark in ./benchmarks, if your changes aren't covered by an
    existing benchmark
  • Unit tests
  • Clean style in the spirit of PEP8
  • Descriptive commit messages (see below)

For reviewers

  • Check that the PR title is short, concise, and will make sense 1 year
    later.
  • Check that new functions are imported in corresponding __init__.py.
  • Check that new features, API changes, and deprecations are mentioned in
    doc/release/release_dev.rst.
  • There is a bot to help automate backporting a PR to an older branch. For
    example, to backport to v0.19.x after merging, add the following in a PR
    comment: @meeseeksdev backport to v0.19.x
  • To run benchmarks on a PR, add the run-benchmark label. To rerun, the label
    can be removed and then added again. The benchmark output can be checked in
    the "Actions" tab.

lagru added 2 commits April 11, 2023 21:50
Even though skimage.data uses lazy loading, its submodule `_fetcher.py`
is executed when `skimage` is imported, because its attribute `data_dir`
is imported in multiple places [1]. Previously, this lead to
`_init_pooch()` being always executed, which in turn tried to create the
data directory preemptively. This lead to problems when `data_dir`
wasn't writeable, e.g. when scikit-image is used in read-only
containers.

This refactoring alleviates this by postponing the directory creation
until it is actually needed and data is being downloaded. Calling
`download_all()` also ensures that legacy files are copied to
`data_dir`; this use case was requested in [2] and should be preserved
this way.

With the previous and current state, the behavior is somewhat difficult
to test with regard to self-contained tests and multi-processing /
multi-threading. `_fetchers.py` should probably be refactored into a
class whose API is less intertwined with global state; that might be
easier to test.

[1] https://github.com/scikit-image/scikit-image/blob/51225d16ddebffacd0ccd9a54d06a673e3caff98/skimage/__init__.py#L141
[2] scikit-image#3945 (comment)
@lagru lagru added 🔧 type: Maintenance Refactoring and maintenance of internals 🩹 type: Bug fix Fixes unexpected or incorrect behavior labels Apr 11, 2023
@lagru lagru requested review from stefanv and jarrodmillman April 11, 2023 21:46
skimage/data/_fetchers.py Show resolved Hide resolved
Comment on lines -10 to -14
def test_data_dir():
# data_dir should be a directory people can use as a standard directory
# https://github.com/scikit-image/scikit-image/pull/3945#issuecomment-498141893
data_dir = data.data_dir
assert 'astronaut.png' in os.listdir(data_dir)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test might fail until download_all() was called at least once. I opted to remove it for now and rather think about how we might refactor this stuff to make it also easier to test (e.g. less functions relying on module-level attributes).

@lagru lagru changed the title Use legacy datasets without creating a data_dir for caching Use legacy datasets without creating a data_dir Apr 11, 2023
@imagesc-bot
Copy link

This pull request has been mentioned on Image.sc Forum. There might be relevant details there:

https://forum.image.sc/t/why-does-init-py-attempt-to-download-sample-data/79665/8

@lagru lagru added this to the 0.21 milestone Apr 11, 2023
Also, examples in `ImageCollection`'s docstring [1] indicate that this
is part of our API, so revert `data_dir` back to being a string.
@lagru lagru force-pushed the readonly-import-data branch from e83992a to f1aa140 Compare April 11, 2023 22:46
@stefanv
Copy link
Member

stefanv commented Apr 12, 2023

There's only one failure here on MacOSX and Python 3.11. But, even if we can track it down, I'm not sure whether the change to pathlib (Path objects) is warranted: does it provide a clear advantage? Also note that the current patch changes the return type of at least lbp_frontal_face_cascade_filename().

@lagru
Copy link
Member Author

lagru commented Apr 12, 2023

There's only one failure here on MacOSX and Python 3.11. But, even if we can track it down, I'm not sure whether the change to pathlib (Path objects) is warranted: does it provide a clear advantage? Also note that the current patch changes the return type of at least lbp_frontal_face_cascade_filename().

That's what I get for scratching that itch. 🙈 I think the update to pathlib was a bad call in the context of this fix; especially because I didn't take into account that this might change our API when I made that change. I'll revert this part.

Long-term, I would think of an update from os.path to pathlib as removing technical debt.

lagru added 2 commits April 12, 2023 11:25
Long-term, I would think of an update from `os.path` to `pathlib` as
removing technical debt but it was a bad call in the context of this
fix. Especially, because it had unintended side-effects to our API.
@lagru
Copy link
Member Author

lagru commented Apr 12, 2023

Hmm, so the failing test doesn't seem to be due to the pathlib refactoring but the other changes...

@lagru
Copy link
Member Author

lagru commented Apr 12, 2023

Curious, multipage_rgb.tif is still included in skimage/data but not in skimage/data/meson.build. Is that by intention?

Tough not sure if this is the source of the bug. I can't reproduce this locally and don't have a Mac available...

lagru added 2 commits April 12, 2023 16:08
multipage_rgb.tif is not in our distribution archives.
Super strange, but suddenly it seems that codecov has disappeared from
PyPI [1]...

[1] https://pypi.org/project/codecov/
img_cache = MultiImage(os.pathsep.join(paths_cache))

assert len(img_cache) == 2
np.testing.assert_array_equal(img_fetch[0], img_cache[0])
Copy link
Member Author

@lagru lagru Apr 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Getting closer. On macos-cp3.11, it fails on this line with

E           AssertionError: 
E           Arrays are not equal
E           
E           (shapes (25, 14, 4), (2, 10, 10, 3) mismatch)
E            x: array([[[173, 197, 187, 255],
E                   [181, 198, 198, 255],
E                   [166, 183, 172, 255],...
E            y: array([[[[1.222920e-01, 1.532035e-01, 7.516385e-01],
E                    [1.991089e-01, 7.685891e-01, 5.854871e-01],
E                    [1.314398e-01, 3.879300e-01, 9.433194e-01],...

Still no idea why. test_debug passes, so the actual hash of the fetched and cached file are the same?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious, multipage_rgb.tif is still included in skimage/data but not in skimage/data/meson.build. Is that by intention?

Tough not sure if this is the source of the bug. I can't reproduce this locally and don't have a Mac available...

My guess is that we only included data files in the wheel that are (a) in the repo and (b) used by skimage.data. But, frankly, it may just as well have been an oversight.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ha, I think I'm close. The failing test uses two paths one of which now points to the legacy directory. My guess is that the new path is sorted differently in

if _is_multipattern(load_pattern):
if isinstance(load_pattern, str):
load_pattern = load_pattern.split(os.pathsep)
for pattern in load_pattern:
self._files.extend(glob(pattern))
self._files = sorted(self._files, key=alphanumeric_key)

for macos-cp3.11 which leads to the arrays in MultiImage having a different order. The mismatching shape in https://github.com/scikit-image/scikit-image/actions/runs/4679595104/jobs/8289818459#step:6:574 indicates such...

Continuing tomorrow. 😴

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Confirmed. The paths to were the scikit-image is installed are different, therefore it sorts the paths differently on e.g. macos-cp3.10:

sorted_fetch = [
    '/Users/runner/Library/Caches/scikit-image/main/data/multipage_rgb.tif', 
    '/Users/runner/hostedtoolcache/Python/3.10.10/x64/lib/python3.10/site-packages/skimage/data/no_time_for_that_tiny.gif'
]

on macos-cp3.11

# macos-cp3.11
sorted_fetch = [
    '/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/skimage/data/no_time_for_that_tiny.gif',
    '/Users/runner/Library/Caches/scikit-image/main/data/multipage_rgb.tif'
]

I'm not sure why the install location is different or which tool causes it, though I think it's safe to say that the tests shouldn't rely rely on the paths being sorted by chance the correct way.

@lagru lagru force-pushed the readonly-import-data branch from 41ddb54 to 650b1a5 Compare April 12, 2023 22:59
@stefanv
Copy link
Member

stefanv commented Apr 13, 2023

Wow, nice sleuthing. I think if you keep the pattern the same as it was before (copy the files over, and only use files from cache), you should be fine. There's no reason to return the file path to the legacy dataset.

But I suppose I am confused, because I thought the purpose of the PR would be to not create a cache directory or copy files into it until strictly necessary. I.e., if a package version is available, that should be served up instead of even touching the cache.

@@ -7,7 +7,7 @@ export MPL_DIR=`python -c 'import matplotlib; print(matplotlib.get_configdir())'
mkdir -p $MPL_DIR
touch $MPL_DIR/matplotlibrc

TEST_ARGS="--doctest-modules --cov=skimage"
TEST_ARGS="--doctest-modules --cov=skimage --showlocals"
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to keep this configuration. I think it's very useful to have more context on failed tests. At the same time it doesn't increase the output for passing tests.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, that's fine.

@lagru
Copy link
Member Author

lagru commented Apr 13, 2023

Finally, green (failing test is unrelated). 🙏 I think it's ready for review / merge.

skimage/data/_fetchers.py Show resolved Hide resolved
shutil.copy2(readme_src, readme_dest)


def _fetch(data_filename, *, copy_legacy_to_cache=False):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we get rid of this flag?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replying here because this touches on other questions you ask.

But I suppose I am confused, because I thought the purpose of the PR would be to not create a cache directory or copy files into it until strictly necessary. I.e., if a package version is available, that should be served up instead of even touching the cache.

No you got that right. I'm trying to address two demands here:

  1. Requesting a legacy dataset (the ones included in the distributed archives) from _fetch , serves a path without triggering the creation of the cache directory.
  2. When download_all(directory=...) is called, put all files including legacy datasets into that directory.

We can remove the flag if we update download_all so that 2. is still met. I'm happy to update the function accordingly. Following this PR, I wanted to ask if there is some interest in a larger refactoring anyway.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense; yes, I think that would simplify things a bit, what do you think?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that I discovered two additional bugs in the old implementation of download_all while doing so:

  1. Running

    import skimage as ski
    ski.data.download_all()
    ski.data.download_all(directory="example_dir")

    would not create anything in "example_dir" because _fetch would always
    return the cached entry before ever invoking pooches cache mechanism to
    place it in "example_dir". This is addressed by 0ddf467.

  2. Running

    import skimage as ski
    ski.data.download_all("~/skimage-data")

    will place files at two locations: files in the distribution are placed in "[working_dir]/~/skimage-data" while files downloaded with pooch are placed in /home/[user]/skimage-data. I think this is because our old os.path machinery doesn't resolve ~ while pooch uses pathlib which does.

So, should we switch to pathlib? 🙏 😅

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haha, to keep this from blowing up, perhaps just os.path.expanduser.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catches, btw!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay 😞 , but honestly this is a bit of a mess. I am testing most of this stuff via console, debugging and checking actual folders. I would sincerely recommend to refactor this following this. Perhaps not with utmost priority but I wouldn't be surprised at all if there are more subtle edge cases that we didn't discover yet. 🙈

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's a good idea: i.e., break it up into a utility function that handles pathnames (easily testable), and then some machinery for fetching from a fake cache perhaps.

If you want to do that refactor here, that'd be fine; we're not in a big rush.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm happy to do the refactoring. But the CI is finally not complaining, so I wouldn't mind for this fix to go in in case the follow-up work get's stalled for some reason.

skimage/data/_fetchers.py Outdated Show resolved Hide resolved
@@ -7,7 +7,7 @@ export MPL_DIR=`python -c 'import matplotlib; print(matplotlib.get_configdir())'
mkdir -p $MPL_DIR
touch $MPL_DIR/matplotlibrc

TEST_ARGS="--doctest-modules --cov=skimage"
TEST_ARGS="--doctest-modules --cov=skimage --showlocals"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, that's fine.

lagru added 2 commits April 13, 2023 23:10
Instead, it is now the task of `download_all(directory=...)`` to place a
copy of every data_file in `directory` or - if not given - the default
cache directory.

This also addresses another previously undiscovered bug. Running

    import skimage as ski
    ski.data.download_all()
    ski.data.download_all(directory="example_dir")

would not create anything in "example_dir" because `_fetch` would always
return the cached entry before ever invoking pooches cache mechanism to
place it in "example_dir".
@lagru lagru force-pushed the readonly-import-data branch from 40b0a50 to 65426ae Compare April 13, 2023 21:33
Previously, running

    import skimage as ski
    ski.data.download_all("~/skimage-data")

would place files at two locations: files in the distribution are placed
in "[working_dir]/~/skimage-data" while files downloaded with pooch were
placed in /home/[user]/skimage-data. I think this was because our old
os.path machinery doesn't resolve ~ while pooch uses pathlib which does.

To address this we make download_all explicitly expand the user if
directory is given.
@stefanv stefanv merged commit d7dd4b9 into scikit-image:main Apr 14, 2023
@stefanv
Copy link
Member

stefanv commented Apr 14, 2023

Thanks, Lars!

@lagru lagru deleted the readonly-import-data branch April 15, 2023 17:21
mkcor pushed a commit to mkcor/scikit-image that referenced this pull request May 22, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🩹 type: Bug fix Fixes unexpected or incorrect behavior 🔧 type: Maintenance Refactoring and maintenance of internals
Projects
None yet
Development

Successfully merging this pull request may close these issues.

PermissionError when creating data_dir when user doesn't have permissions
3 participants