[Paper | BibTeX | Google Drive]
Official Implementation for "Are Large-scale Soft Labels Necessary for Large-scale Dataset Distillation?", published at NeurIPS'24.
> Images from left to right are from IPC20 LPLD datasets: cock (left), bald eagle, volcano, trailer truck (right).

Abstract: In ImageNet-condensation, the storage for auxiliary soft labels exceeds that of the condensed dataset by over 30 times. However, are large-scale soft labels necessary for large-scale dataset distillation? In this paper, we first discover that the high within-class similarity in condensed datasets necessitates the use of large-scale soft labels. This high within-class similarity can be attributed to the fact that previous methods use samples from different classes to construct a single batch for batch normalization (BN) matching. To reduce the within-class similarity, we introduce class-wise supervision during the image synthesizing process by batching the samples within classes, instead of across classes. As a result, we can increase within-class diversity and reduce the size of required soft labels. A key benefit of improved image diversity is that soft label compression can be achieved through simple random pruning, eliminating the need for complex rule-based strategies. Experiments validate our discoveries. For example, when condensing ImageNet-1K to 200 images per class, our approach compresses the required soft labels from 113 GB to 2.8 GB (40x compression) with a 2.6% performance gain.
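For intuition, the key change during image synthesis is that BN-statistics matching uses batches drawn from a single class rather than across classes. Below is a minimal, illustrative sketch of this class-wise supervision idea; it is not the repo's recovery code, and the function and argument names are hypothetical.

```python
# Minimal, illustrative sketch of class-wise BN matching (NOT the repo's actual code).
# The batch contains synthetic samples of ONE class, and its feature statistics are
# matched against stored BN statistics; all names here are hypothetical.
import torch

def class_wise_bn_loss(class_batch, bn_layers, features_by_layer):
    """BN-statistics matching loss for a batch drawn from a single class.

    class_batch:       synthetic images of one class, shape (B, C, H, W)
    bn_layers:         list of nn.BatchNorm2d layers holding the target running stats
    features_by_layer: callable mapping images -> list of feature maps, one per BN layer
    """
    loss = torch.zeros((), device=class_batch.device)
    for feat, bn in zip(features_by_layer(class_batch), bn_layers):
        batch_mean = feat.mean(dim=[0, 2, 3])                    # stats of the class-wise batch
        batch_var = feat.var(dim=[0, 2, 3], unbiased=False)
        loss = loss + torch.norm(batch_mean - bn.running_mean) \
                    + torch.norm(batch_var - bn.running_var)
    return loss
```

Batching within classes is what increases within-class diversity, which in turn is what allows the soft labels to be compressed by simple random pruning.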
Download the repo:
```bash
git clone https://github.com/he-y/soft-label-pruning-for-dataset-distillation.git LPLD
cd LPLD
```
Create the PyTorch environment:
```bash
conda env create -f environment.yml
conda activate lpld
```
```bash
# sh download.sh [true|false]
sh download.sh false
```
The `true|false` flag controls whether to download only the 40x-compressed labels (`true`) or all labels (`false`; default).
Download manually from Google Drive, and place the downloaded files in the following structure:
```
.
├── README.md
├── recover
│   ├── model_with_class_bn
│   │   └── [put Models-with-Class-BN here]
│   └── validate_result
│       └── [put Distilled-Dataset here]
└── relabel_and_validate
    └── syn_label_LPLD
        └── [put Labels here]
```
| Dataset | Model with Class BN | Size |
|---|---|---|
| ImageNet-1K | ResNet18 | 50.41 MB |
| Tiny-ImageNet | ResNet18 | 81.30 MB |
| ImageNet-21K | ResNet18 | 445.87 MB |
| Dataset | Setting | Dataset Size |
|---|---|---|
| ImageNet-1K | IPC10 | 0.15 GB |
| ImageNet-1K | IPC20 | 0.30 GB |
| ImageNet-1K | IPC50 | 0.75 GB |
| ImageNet-1K | IPC100 | 1.49 GB |
| ImageNet-1K | IPC200 | 2.98 GB |
| Tiny-ImageNet | IPC50 | 21 MB |
| Tiny-ImageNet | IPC100 | 40 MB |
| ImageNet-21K | IPC10 | 3 GB |
| ImageNet-21K | IPC20 | 5 GB |
| Dataset | Setting | Previous Label Size | Previous Model Acc. | Ours Label Size | Ours Model Acc. |
|---|---|---|---|---|---|
| ImageNet-1K | IPC10 | 5.67 GB | 20.1% | 0.14 GB (40x) | 20.2% |
| ImageNet-1K | IPC20 | 11.33 GB | 33.6% | 0.29 GB (40x) | 33.0% |
| ImageNet-1K | IPC50 | 28.33 GB | 46.8% | 0.71 GB (40x) | 46.7% |
| ImageNet-1K | IPC100 | 56.66 GB | 52.8% | 1.43 GB (40x) | 54.0% |
| ImageNet-1K | IPC200 | 113.33 GB | 57.0% | 2.85 GB (40x) | 59.6% |
| Tiny-ImageNet | IPC50 | 449 MB | 41.1% | 11 MB (40x) | 38.4% |
| Tiny-ImageNet | IPC100 | 898 MB | 49.7% | 22 MB (40x) | 46.1% |
| ImageNet-21K | IPC10 | 643 GB | 18.5% | 16 GB (40x) | 21.3% |
| ImageNet-21K | IPC20 | 1286 GB | 20.5% | 32 GB (40x) | 29.4% |
- Full labels for ImageNet-21K are too large to upload; nevertheless, we provide the 40x-pruned labels.
- Labels for other compression ratios are provided in Google Drive; alternatively, refer to README: Usage to generate them (a sketch of the random-pruning idea follows below).
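The random-pruning step itself needs no rule-based scoring. As a rough, hedged sketch (not the repo's actual script, which README: Usage documents), pruning the relabeled soft labels amounts to keeping a random subset of the stored label files; the directory layout, file pattern, and default ratio below are assumptions.

```python
# Illustrative sketch of random soft-label pruning (NOT the repo's actual script).
# FKD-style relabeling stores many soft-label files; random pruning keeps a random
# subset of them. The glob pattern, paths, and default ratio are assumptions.
import pathlib
import random
import shutil

def randomly_prune_labels(label_dir, out_dir, keep_ratio=1 / 40, seed=0):
    """Copy a random `keep_ratio` fraction of soft-label files into `out_dir`."""
    files = sorted(p for p in pathlib.Path(label_dir).glob("*") if p.is_file())
    if not files:
        return []
    random.seed(seed)
    kept = random.sample(files, max(1, int(len(files) * keep_ratio)))
    out = pathlib.Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for f in kept:
        shutil.copy2(f, out / f.name)
    return kept
```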
Modify the PyTorch source code `torch.utils.data._utils.fetch._MapDatasetFetcher` to support multi-processing loading of soft-label data and mix configurations:
```python
class _MapDatasetFetcher(_BaseDatasetFetcher):
    def fetch(self, possibly_batched_index):
        if hasattr(self.dataset, "mode") and self.dataset.mode == 'fkd_load':
            if hasattr(self.dataset, "G_VBSM") and self.dataset.G_VBSM:
                pass  # G_VBSM: uses self-decoding in the training script
            elif hasattr(self.dataset, "use_batch") and self.dataset.use_batch:
                mix_index, mix_lam, mix_bbox, soft_label = self.dataset.load_batch_config_by_batch_idx(possibly_batched_index[0])
            else:
                mix_index, mix_lam, mix_bbox, soft_label = self.dataset.load_batch_config(possibly_batched_index[0])

        if self.auto_collation:
            if hasattr(self.dataset, "__getitems__") and self.dataset.__getitems__:
                data = self.dataset.__getitems__(possibly_batched_index)
            else:
                data = [self.dataset[idx] for idx in possibly_batched_index]
        else:
            data = self.dataset[possibly_batched_index]

        if hasattr(self.dataset, "mode") and self.dataset.mode == 'fkd_load':
            # NOTE: mix_index, mix_lam, mix_bbox can be None
            mix_index_cpu = mix_index.cpu() if mix_index is not None else None
            return self.collate_fn(data), mix_index_cpu, mix_lam, mix_bbox, soft_label.cpu()
        else:
            return self.collate_fn(data)
```
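As a sanity check, the sketch below (not part of the repo) prints the PyTorch source file that needs the patch above and shows the per-batch structure a patched `DataLoader` is expected to yield for an FKD-style dataset whose `mode` is `'fkd_load'`; the helper name and defaults are illustrative.

```python
# Illustrative helper (not part of the repo): locate the file to patch and iterate a
# patched DataLoader over an FKD-style dataset (dataset.mode == 'fkd_load').
import torch.utils.data._utils.fetch as fetch
from torch.utils.data import DataLoader

print("File to patch:", fetch.__file__)

def iterate_fkd_batches(dataset, batch_size=64, num_workers=4):
    """Yield (images, targets, mix_index, mix_lam, mix_bbox, soft_label) per batch.

    With the patched fetcher, each item produced by the loader is the collated
    batch followed by the stored mix configuration and soft labels.
    """
    loader = DataLoader(dataset, batch_size=batch_size,
                        num_workers=num_workers, shuffle=False)
    for batch, mix_index, mix_lam, mix_bbox, soft_label in loader:
        images, targets = batch
        yield images, targets, mix_index, mix_lam, mix_bbox, soft_label
```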
To reproduce the table above for the 40x compression ratio, run the following code:
```bash
cd relabel_and_validate
bash scripts/reproduce/main_table_in1k.sh
bash scripts/reproduce/main_table_tiny.sh
bash scripts/reproduce/main_table_in21k.sh
```
NOTE: the validation directory (`val_dir`) in the config files (`relabel_and_validate/cfg/reproduce/CONFIG_FILE`) should be changed to the correct path on your device.
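If you prefer to update every reproduce config at once, a small helper like the following can do it. This is not part of the repo: it assumes the configs are plain-text/YAML files containing a `val_dir:` entry, and the glob pattern may need adjusting to the actual file names.

```python
# Illustrative helper (not part of the repo): point every reproduce config at your
# local validation set. Assumes text/YAML configs with a `val_dir:` entry; adjust
# the glob pattern and key name to the actual file layout.
import pathlib
import re

VAL_DIR = "/path/to/imagenet/val"  # change to the correct path on your device
cfg_dir = pathlib.Path("relabel_and_validate/cfg/reproduce")

for cfg in sorted(cfg_dir.glob("*.yaml")):
    text = cfg.read_text()
    updated = re.sub(r"(?m)^(\s*val_dir\s*:).*$", rf"\g<1> {VAL_DIR}", text)
    cfg.write_text(updated)
    print(f"updated val_dir in {cfg}")
```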
Please refer to README: Usage for details on the three modules.
Table Results (Google Drive)
| No. | Content | Datasets |
|---|---|---|
| Table 1 | Dataset Analysis | ImageNet-1K |
| Table 2 | (a) SOTA Comparison (b) Large Networks | Tiny-ImageNet |
| Table 3 | SOTA Comparison | ImageNet-1K |
| Table 4 | Ablation Study | ImageNet-1K |
| Table 5 | (a) Pruning Metrics (b) Calibration | ImageNet-1K |
| Table 6 | (a) Large Pruning Ratio (b) ResNet-50 Result (c) Cross-Architecture Result | ImageNet-1K |
| Table 7 | SOTA Comparison | ImageNet-21K |
| Table 8 | Adaptation to Optimization-free Method (i.e., RDED) | ImageNet-1K |
| Table 9 | Comparison to G-VBSM | ImageNet-1K |
| Appendix | | |
| Table 10-18 | Configurations | - |
| Table 19 | Detailed Ablation | ImageNet-1K |
| Table 20 | Large IPCs (i.e., IPC300 and IPC400) | ImageNet-1K |
| Table 23 | Comparison to FKD | ImageNet-1K |
Our code is mainly related to the following papers and repos:
- Squeeze, Recover and Relabel: Dataset Condensation at ImageNet Scale From A New Perspective
- ImageNet-21K Pretraining for the Masses
```bibtex
@inproceedings{xiao2024lpld,
  title={Are Large-scale Soft Labels Necessary for Large-scale Dataset Distillation?},
  author={Lingao Xiao and Yang He},
  booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems},
  year={2024}
}
```