Improve Robustness and Error Handling in ImageFolder Dataset Builder #5567

Open
wants to merge 3 commits into
base: master

swalehmwadime

Overview

This PR makes the ImageFolder dataset builder more robust and improves its error handling, with some cleanup that keeps performance on large datasets in mind. Key changes include:

  • Uniform Seeding: The shuffle seed is now derived from hash(split_name), with the intent that each split's examples are shuffled in the same order on every run.
  • Directory Validation: Added checks to ensure the provided root_dir exists, preventing runtime errors related to non-existent directories.
  • Code Cleanup: Removed unused parameters and improved docstrings for better code readability and maintainability.
  • Error Handling: Improved error handling to manage cases where directories or files might be missing or misconfigured.
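
As a rough illustration of the directory validation described above, the check could look something like the sketch below. The helper name `validate_root_dir` and the choice of exception types are assumptions made for this example, not code taken from the PR, and the sketch uses only the standard library rather than whatever filesystem layer the builder actually uses.

```python
import os


def validate_root_dir(root_dir: str) -> str:
    """Return the expanded root_dir, or raise if it is not a usable directory.

    Hypothetical helper for illustration only; the name and the exception
    types are assumptions, not the builder's actual API.
    """
    expanded = os.path.expanduser(root_dir)
    if not os.path.exists(expanded):
        # Fail early with a message that names the offending path.
        raise FileNotFoundError(
            f"ImageFolder root_dir does not exist: {expanded!r}"
        )
    if not os.path.isdir(expanded):
        raise NotADirectoryError(
            f"ImageFolder root_dir is not a directory: {expanded!r}"
        )
    return expanded
```

Failing at construction time, before any split is scanned, means the user sees the bad path directly instead of an error raised deeper in the loading code.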

Key Changes

  • Enhanced shuffling by utilizing hash(split_name) for consistent random seeding.
  • Validated the existence of root_dir before attempting to load data, raising an appropriate error if it doesn’t exist.
  • Simplified _as_dataset method by removing unused read_config parameter.
  • Cleaned up and clarified code documentation and error messages.
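
One caveat on seeding: in Python 3, the built-in `hash()` of a string is salted per process unless `PYTHONHASHSEED` is set, so `hash(split_name)` can return a different value on each run. A minimal sketch of a seed that stays stable across runs, assuming a digest-based approach is acceptable here, is shown below. The helper names `stable_seed` and `shuffled_examples` are hypothetical and are not part of this PR.

```python
import hashlib
import random


def stable_seed(split_name: str) -> int:
    """Derive a 32-bit seed from the split name (hypothetical helper).

    Uses SHA-256 instead of the built-in hash(), so the value does not
    depend on the per-process string hash salt.
    """
    digest = hashlib.sha256(split_name.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def shuffled_examples(examples, split_name):
    """Return the examples in an order that depends only on their values
    and on split_name (hypothetical helper)."""
    rng = random.Random(stable_seed(split_name))
    # Sort first so the result does not depend on directory listing order.
    result = sorted(examples)
    rng.shuffle(result)
    return result
```

With this sketch, calling `shuffled_examples(paths, "train")` in two separate processes gives the same order, regardless of the order in which the paths were listed.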

Why This Matters

These updates make the ImageFolder dataset builder more reliable and easier to use, especially with large and complex datasets. Consistent shuffling, an early check on root_dir, and clearer error messages help users avoid common pitfalls such as pointing the builder at a missing directory.

Testing and Validation

These changes have been tested with various directory structures and image formats to ensure they work as expected. The deterministic shuffling and directory validation significantly improve the reliability of the dataset builder.

Future Work

While this PR focuses on improving robustness and error handling, future work could explore further optimizations, such as caching and prefetching strategies, to enhance performance when dealing with large datasets.

fix: Improve robustness, error handling, and performance in ImageFolder dataset builder

- Introduced uniform seeding using `hash(split_name)` in `_get_split_label_images` to ensure more consistent shuffling.
- Added validation to check if `root_dir` exists before proceeding with data extraction.
- Removed unused parameters such as `read_config` in `_as_dataset` method.
- Enhanced docstrings for better clarity.
- Improved error handling for non-existent directories in `_get_split_label_images`.
- General cleanup and performance considerations for handling large datasets.
Improve Robustness and Error Handling in ImageFolder Dataset Builder
@camelia-tfds
Collaborator

Thank you for the contribution! Some tests are failing; please fix them.
