Improve Robustness and Error Handling in ImageFolder Dataset Builder #5567
Overview
This PR enhances the ImageFolder dataset builder with several improvements to robustness and error handling. Key changes include:
- Using hash(split_name) for seed generation, intended to make shuffling reproducible across runs.
- Checking that root_dir exists, preventing runtime errors caused by non-existent directories.
Key Changes
- Deterministic shuffling: uses hash(split_name) for consistent random seeding.
- Directory validation: checks root_dir before attempting to load data, raising an appropriate error if it doesn't exist.
- Code cleanup: simplifies the _as_dataset method by removing the unused read_config parameter.
Why This Matters
These updates make the ImageFolder dataset builder more reliable and easier to use, especially when working with large and complex datasets. The improvements to shuffling, error handling, and code clarity help users avoid common pitfalls, such as pointing the builder at a directory that does not exist.
Testing and Validation
These changes have been tested with various directory structures and image formats to ensure they work as expected. The deterministic shuffling and directory validation significantly improve the reliability of the dataset builder.
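
The two behaviours described above can be illustrated with a minimal, standalone sketch. It is not the code from this PR: the helper names (stable_seed, validate_root_dir, shuffled_examples) are hypothetical, and it makes one deliberate substitution. Python's built-in hash() for strings is salted per process unless PYTHONHASHSEED is set, so the sketch derives the seed from a hashlib digest of the split name instead, which yields the same value in every process.

```python
import hashlib
import os
import random


def stable_seed(split_name: str) -> int:
    """Derive a 32-bit seed from a split name (hypothetical helper).

    A SHA-256 digest is used instead of the built-in hash(), because
    hash() of a str can differ between interpreter processes.
    """
    digest = hashlib.sha256(split_name.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def validate_root_dir(root_dir: str) -> str:
    """Return the expanded path, or raise if it is not an existing directory."""
    path = os.path.expanduser(root_dir)
    if not os.path.isdir(path):
        raise FileNotFoundError(
            f"root_dir does not exist or is not a directory: {path}"
        )
    return path


def shuffled_examples(examples, split_name: str) -> list:
    """Return a shuffled copy of `examples`, seeded by the split name."""
    rng = random.Random(stable_seed(split_name))
    shuffled = list(examples)  # copy so the caller's sequence is untouched
    rng.shuffle(shuffled)
    return shuffled
```

Calling shuffled_examples(paths, "train") twice, even in separate processes, should give the same order under these assumptions, and validate_root_dir fails early with a clear message instead of surfacing a less obvious error during loading.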
Future Work
While this PR focuses on improving robustness and error handling, future work could explore further optimizations, such as caching and prefetching strategies, to enhance performance when dealing with large datasets.