Improve Robustness and Error Handling in ImageFolder Dataset Builder #5567

swalehmwadime · 2024-08-26T08:14:27Z

Overview

This PR improves the robustness and error handling of the ImageFolder dataset builder, with some attention to performance. Key changes include:

Uniform Seeding: Example shuffling is now seeded with hash(split_name), so each split derives its seed from its own name. This is intended to make the shuffle order consistent across runs.
Directory Validation: Added checks to ensure the provided root_dir exists, preventing runtime errors related to non-existent directories.
Code Cleanup: Removed unused parameters and improved docstrings for better code readability and maintainability.
Error Handling: Improved error handling to manage cases where directories or files might be missing or misconfigured.
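
One caveat when seeding from a name: Python's built-in hash() on a str is salted per process unless PYTHONHASHSEED is set, so hash(split_name) can return a different value in each interpreter run. A minimal sketch of a run-independent alternative follows, assuming a hashlib digest is acceptable as the seed source. The helpers stable_seed and shuffle_examples are hypothetical names used for illustration and are not part of TFDS or of this PR.

```python
import hashlib
import random


def stable_seed(split_name: str) -> int:
    """Derive a 32-bit seed from the split name (hypothetical helper).

    Unlike the built-in hash(), a SHA-256 digest is identical in every process.
    """
    digest = hashlib.sha256(split_name.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def shuffle_examples(examples, split_name):
    """Return a shuffled copy of `examples`, seeded by the split name."""
    # Sort first so the result does not depend on directory listing order.
    shuffled = sorted(examples)
    random.Random(stable_seed(split_name)).shuffle(shuffled)
    return shuffled
```

Sorting before shuffling is a design choice in this sketch: it makes the final order depend only on the set of examples and the split name, not on the order in which the filesystem happens to list files.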

Key Changes

Enhanced shuffling by utilizing hash(split_name) for consistent random seeding.
Validated the existence of root_dir before attempting to load data, raising an appropriate error if it doesn’t exist.
Simplified the _as_dataset method by removing the unused read_config parameter.
Cleaned up and clarified code documentation and error messages.
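
The root_dir check listed above could look roughly like the following. This is a sketch only: validate_root_dir is a hypothetical helper, and the exception types and messages are assumptions rather than the code in this PR.

```python
import os


def validate_root_dir(root_dir: str) -> str:
    """Return the expanded path, or raise if it is not an existing directory."""
    path = os.path.expanduser(root_dir)
    if not os.path.exists(path):
        raise FileNotFoundError(f"root_dir does not exist: {path}")
    if not os.path.isdir(path):
        raise NotADirectoryError(f"root_dir is not a directory: {path}")
    return path
```

Failing early with a message that names the offending path tends to be easier to debug than an error raised later, while listing label subdirectories.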

Why This Matters

These updates make the ImageFolder dataset builder more reliable and easier to use, especially with large and complex datasets. Consistent shuffling, earlier and clearer errors for missing directories, and cleaner code help users avoid common pitfalls.

Testing and Validation

These changes have been tested with various directory structures and image formats to confirm they work as expected. The consistent shuffling and directory validation are the main reliability improvements.

Future Work

While this PR focuses on improving robustness and error handling, future work could explore further optimizations, such as caching and prefetching strategies, to enhance performance when dealing with large datasets.

fix: Improve robustness, error handling, and performance in ImageFolder dataset builder

- Introduced uniform seeding using `hash(split_name)` in `_get_split_label_images` to ensure more consistent shuffling.
- Added validation to check if `root_dir` exists before proceeding with data extraction.
- Removed unused parameters such as `read_config` in `_as_dataset` method.
- Enhanced docstrings for better clarity.
- Improved error handling for non-existent directories in `_get_split_label_images`.
- General cleanup and performance considerations for handling large datasets.

camelia-tfds · 2024-08-29T08:57:07Z

Thank you for the contribution! Some tests are failing, please fix.

swalehmwadime added 3 commits August 26, 2024 11:11

Merge pull request #1 from swalehmwadime/swalehmwadime-patch-1

59a9ff0

Improve Robustness and Error Handling in ImageFolder Dataset Builder

Merge branch 'tensorflow:master' into master

2d80ac2

fineguy assigned camelia-tfds Sep 17, 2024
