VisionDataModule by default creates non-standard dataset splits #1096

Open
eflorico opened this issue Jun 23, 2024 · 0 comments
Labels
bug (Something isn't working), help wanted (Extra attention is needed)


🐛 Bug

VisionDataModule by default subdivides the train split into a train and val split:

self.dataset_train = self._split_dataset(dataset_train)

This behavior can be disabled by setting val_split=0 (in some cases also by setting strict_val_split=True), but is enabled by default.
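
As a workaround, the opt-out described above could look roughly like the sketch below. It assumes the data modules are imported from `pl_bolts.datamodules` and that the constructor accepts a `data_dir` argument; only `val_split=0` comes from this report, and the helper name `cifar10_official_splits` is hypothetical.

```python
def cifar10_official_splits(data_dir="."):
    """Build a CIFAR-10 data module that keeps the full train split.

    Assumptions (not confirmed by this report): the import path
    `pl_bolts.datamodules` and the `data_dir` keyword argument.
    """
    # Imported lazily so the sketch can be loaded without the package installed.
    from pl_bolts.datamodules import CIFAR10DataModule

    # val_split=0 asks the module not to carve a val split out of the train split.
    return CIFAR10DataModule(data_dir=data_dir, val_split=0)
```

With this setting the train split should keep all of its images, but no validation images are held out, so any validation loop has to be handled separately.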

For example, CIFAR-10 is supposed to have 50,000 train and 10,000 test images (see CIFAR-10 dataset). There is no official val split. When using CIFAR10DataModule, you instead get a 40,000 image train split and a 10,000 image val split.
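
The arithmetic behind these numbers can be sketched in a few lines. This is a stand-alone illustration, not the library's code: the helper name `split_lengths` is hypothetical, and the default fraction of 0.2 is an assumption inferred from the 40,000 / 10,000 figures above.

```python
def split_lengths(len_dataset, val_split):
    """Return (train_len, val_len) for a train set of len_dataset samples.

    val_split is either a fraction of the train set (float) or an
    absolute number of validation samples (int).
    """
    if isinstance(val_split, int):
        val_len = val_split
    elif isinstance(val_split, float):
        val_len = int(val_split * len_dataset)
    else:
        raise TypeError(f"Unsupported type {type(val_split)} for val_split")
    return len_dataset - val_len, val_len


CIFAR10_TRAIN_SIZE = 50_000  # official CIFAR-10 train split size

# Assumed default fraction: 20% of the train split is held out as val.
print(split_lengths(CIFAR10_TRAIN_SIZE, 0.2))

# val_split=0: the train split is left untouched.
print(split_lengths(CIFAR10_TRAIN_SIZE, 0))
```

Under the assumed 0.2 fraction, the first call reproduces the 40,000 / 10,000 division described above, while the second keeps all 50,000 train images.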

The documentation of the affected modules does not make this behavior clear. E.g. the docstring for CIFAR10DataModule describes it as "Standard CIFAR10, train, val, test splits and transforms", which seems misleading. Documentation for other affected data modules is similar.

As a result, users of many vision data modules will not be able to reproduce results on standard datasets such as CIFAR-10, unless they explicitly disable this behavior.

As far as I can tell, this affects all classes that inherit from VisionDataModule:

  • BinaryMNISTDataModule
  • CIFAR10DataModule
  • TinyCIFAR10DataModule
  • EMNISTDataModule (which has a strict_val_split option to disable the unexpected behavior)
  • FashionMNISTDataModule
  • MNISTDataModule

Expected behavior

I would generally expect splits to be the same as those published by the dataset authors. I understand that a val split may be required by PyTorch Lightning. However, I would not expect splits to be changed by default and without warning.

eflorico added the bug and help wanted labels on Jun 23, 2024