🐛 Bug
VisionDataModule by default subdivides the train split into a train and val split (see lightning-bolts/src/pl_bolts/datamodules/vision_datamodule.py, line 109 in 541f701).
This behavior can be disabled by setting val_split=0 (in some cases also by setting strict_val_split=True), but is enabled by default.
For example, CIFAR-10 is supposed to have 50,000 train and 10,000 test images (see CIFAR-10 dataset). There is no official val split. When using CIFAR10DataModule, you instead get a 40,000 image train split and a 10,000 image val split.
The documentation of the affected modules does not make this behavior clear. E.g. the docstring for CIFAR10DataModule describes it as "Standard CIFAR10, train, val, test splits and transforms", which seems misleading. Documentation for other affected data modules is similar.
As a result, users of many vision data modules will not be able to reproduce results on standard datasets such as CIFAR-10, unless they explicitly disable this behavior.
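To make the arithmetic concrete, here is a minimal sketch of how a val_split setting could turn the 50,000 CIFAR-10 train images into the split sizes described above. It does not import or reproduce the library code: split_lengths is a hypothetical helper, the fraction 0.2 is inferred from the 40,000/10,000 figures, and treating an integer val_split as an absolute image count is an assumption.

```python
def split_lengths(num_train, val_split):
    """Return (train_len, val_len) for a train set of num_train samples.

    Hypothetical helper for illustration only, not the library's code.
    Assumption: a float val_split is a fraction of the train set,
    an int val_split is an absolute number of validation samples.
    """
    if isinstance(val_split, int):
        val_len = val_split
    elif isinstance(val_split, float):
        val_len = int(val_split * num_train)
    else:
        raise TypeError(f"unsupported val_split type: {type(val_split)}")
    if not 0 <= val_len <= num_train:
        raise ValueError("val_split must select between 0 and num_train samples")
    return num_train - val_len, val_len


# Fraction inferred from the numbers in this report: 50,000 -> 40,000 + 10,000.
print(split_lengths(50_000, 0.2))  # (40000, 10000)

# val_split=0 keeps the full published train split.
print(split_lengths(50_000, 0))    # (50000, 0)
```

Under these assumptions, only val_split=0 leaves the published 50,000-image train split untouched, which is why results reported on the standard split are not reproduced with the default setting.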
As far as I can tell, this affects all classes that inherit from VisionDataModule:
BinaryMNISTDataModule
CIFAR10DataModule
TinyCIFAR10DataModule
EMNISTDataModule (which has a strict_val_split to disable the unexpected behavior)
FashionMNISTDataModule
MNISTDataModule
Expected behavior
I would generally expect splits to be the same as those published by the dataset authors. I understand that a val split may be required by PyTorch Lightning. However, I would not expect splits to be changed by default and without warning.