Soft Label Pruning for Large-scale Dataset Distillation (LPLD)

[Paper | BibTeX | Google Drive]


Official Implementation for "Are Large-scale Soft Labels Necessary for Large-scale Dataset Distillation?", published at NeurIPS'24.

Lingao Xiao, Yang He

Abstract: In ImageNet-condensation, the storage for auxiliary soft labels exceeds that of the condensed dataset by over 30 times. However, are large-scale soft labels necessary for large-scale dataset distillation? In this paper, we first discover that the high within-class similarity in condensed datasets necessitates the use of large-scale soft labels. This high within-class similarity can be attributed to the fact that previous methods use samples from different classes to construct a single batch for batch normalization (BN) matching. To reduce the within-class similarity, we introduce class-wise supervision during the image synthesizing process by batching the samples within classes, instead of across classes. As a result, we can increase within-class diversity and reduce the size of required soft labels. A key benefit of improved image diversity is that soft label compression can be achieved through simple random pruning, eliminating the need for complex rule-based strategies. Experiments validate our discoveries. For example, when condensing ImageNet-1K to 200 images per class, our approach compresses the required soft labels from 113 GB to 2.8 GB (40x compression) with a 2.6% performance gain.

> Images from left to right are from IPC20 LPLD datasets: cock (left), bald eagle, volcano, trailer truck (right).
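
The core idea is easiest to see in code. Below is a minimal, illustrative sketch (not the repository's actual API; all names, shapes, and the keep_ratio value are hypothetical) of (1) batching samples within a class so BN statistics are matched class-wise during image synthesis, and (2) compressing soft labels by simple random pruning.

import torch

# Illustrative sketch only; names and shapes are hypothetical.

def class_wise_batches(images, labels, batch_size):
    """Yield batches whose samples all belong to one class, so BN statistics
    are matched within a class instead of across classes."""
    for c in labels.unique():
        idx = (labels == c).nonzero(as_tuple=True)[0]
        for start in range(0, idx.numel(), batch_size):
            sel = idx[start:start + batch_size]
            yield images[sel], labels[sel]

def randomly_prune_soft_labels(soft_labels, keep_ratio=1 / 40):
    """Keep a random subset of stored soft labels (e.g., ~1/40 for 40x compression)."""
    n = soft_labels.shape[0]
    keep = torch.randperm(n)[: max(1, int(n * keep_ratio))]
    return soft_labels[keep], keep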

Installation

Download the repo:

git clone https://github.com/he-y/soft-label-pruning-for-dataset-distillation.git LPLD
cd LPLD

Create the PyTorch environment:

conda env create -f environment.yml
conda activate lpld

Download all datasets and labels

Method 1: Automatic Downloading

# sh download.sh [true|false]
sh download.sh false
  • true|false controls whether to download only the 40x compressed labels (true) or all labels (false). Default: false (download all labels).

Method 2: Manual Downloading

Download manually from Google Drive, and place downloaded files in the following structure:

.
├── README.md
├── recover
│   ├── model_with_class_bn
│   │   └── [put Models-with-Class-BN here]
│   └── validate_result
│       └── [put Distilled-Dataset here]
└── relabel_and_validate
    └── syn_label_LPLD
        └── [put Labels here]

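If the folders above do not exist yet, they can be created first; a small convenience snippet (not part of the repository) is:

from pathlib import Path

# Create the expected folder layout before placing the downloaded files.
for d in [
    "recover/model_with_class_bn",
    "recover/validate_result",
    "relabel_and_validate/syn_label_LPLD",
]:
    Path(d).mkdir(parents=True, exist_ok=True)
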
You will find the following after downloading:

Model with Class BN

| Dataset       | Model with Class BN | Size      |
|---------------|---------------------|-----------|
| ImageNet-1K   | ResNet18            | 50.41 MB  |
| Tiny-ImageNet | ResNet18            | 81.30 MB  |
| ImageNet-21K  | ResNet18            | 445.87 MB |

Distilled Image Dataset

| Dataset       | Setting | Dataset Size |
|---------------|---------|--------------|
| ImageNet-1K   | IPC10   | 0.15 GB      |
| ImageNet-1K   | IPC20   | 0.30 GB      |
| ImageNet-1K   | IPC50   | 0.75 GB      |
| ImageNet-1K   | IPC100  | 1.49 GB      |
| ImageNet-1K   | IPC200  | 2.98 GB      |
| Tiny-ImageNet | IPC50   | 21 MB        |
| Tiny-ImageNet | IPC100  | 40 MB        |
| ImageNet-21K  | IPC10   | 3 GB         |
| ImageNet-21K  | IPC20   | 5 GB         |

Previous Soft Labels vs Ours

| Dataset       | Setting | Previous Label Size | Previous Model Acc. | Ours Label Size | Ours Model Acc. |
|---------------|---------|---------------------|---------------------|-----------------|-----------------|
| ImageNet-1K   | IPC10   | 5.67 GB             | 20.1%               | 0.14 GB (40x)   | 20.2%           |
| ImageNet-1K   | IPC20   | 11.33 GB            | 33.6%               | 0.29 GB (40x)   | 33.0%           |
| ImageNet-1K   | IPC50   | 28.33 GB            | 46.8%               | 0.71 GB (40x)   | 46.7%           |
| ImageNet-1K   | IPC100  | 56.66 GB            | 52.8%               | 1.43 GB (40x)   | 54.0%           |
| ImageNet-1K   | IPC200  | 113.33 GB           | 57.0%               | 2.85 GB (40x)   | 59.6%           |
| Tiny-ImageNet | IPC50   | 449 MB              | 41.1%               | 11 MB (40x)     | 38.4%           |
| Tiny-ImageNet | IPC100  | 898 MB              | 49.7%               | 22 MB (40x)     | 46.1%           |
| ImageNet-21K  | IPC10   | 643 GB              | 18.5%               | 16 GB (40x)     | 21.3%           |
| ImageNet-21K  | IPC20   | 1286 GB             | 20.5%               | 32 GB (40x)     | 29.4%           |
  • Full labels for ImageNet-21K are too large to upload; nevertheless, we provide the 40x pruned labels.
  • Labels for other compression ratios are provided on Google Drive; alternatively, refer to README: Usage to generate the labels.

Necessary Modification for PyTorch

Modify the PyTorch source code at torch.utils.data._utils.fetch._MapDatasetFetcher so that it supports multi-process loading of soft-label data and mix configurations:

class _MapDatasetFetcher(_BaseDatasetFetcher):
    def fetch(self, possibly_batched_index):
        if hasattr(self.dataset, "mode") and self.dataset.mode == 'fkd_load':
            if hasattr(self.dataset, "G_VBSM") and self.dataset.G_VBSM:
                pass # G_VBSM: uses self-decoding in the training script
            elif hasattr(self.dataset, "use_batch") and self.dataset.use_batch:
                mix_index, mix_lam, mix_bbox, soft_label = self.dataset.load_batch_config_by_batch_idx(possibly_batched_index[0])
            else:
                mix_index, mix_lam, mix_bbox, soft_label = self.dataset.load_batch_config(possibly_batched_index[0])

        if self.auto_collation:
            if hasattr(self.dataset, "__getitems__") and self.dataset.__getitems__:
                data = self.dataset.__getitems__(possibly_batched_index)
            else:
                data = [self.dataset[idx] for idx in possibly_batched_index]
        else:
            data = self.dataset[possibly_batched_index]

        if hasattr(self.dataset, "mode") and self.dataset.mode == 'fkd_load':
            # NOTE: mix_index, mix_lam, mix_bbox can be None
            mix_index_cpu = mix_index.cpu() if mix_index is not None else None
            return self.collate_fn(data), mix_index_cpu, mix_lam, mix_bbox, soft_label.cpu()
        else:
            return self.collate_fn(data)
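
To locate the installed file that defines _MapDatasetFetcher on your machine (a quick check, not part of the original instructions), you can print the module path:

# Print the path of the PyTorch source file to edit (defines _MapDatasetFetcher.fetch).
import torch.utils.data._utils.fetch as fetch_module
print(fetch_module.__file__)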

Reproduce Results for the 40x Compression Ratio

To reproduce the [Table] results for the 40x compression ratio, run the following:

cd relabel_and_validate
bash scripts/reproduce/main_table_in1k.sh
bash scripts/reproduce/main_table_tiny.sh
bash scripts/reproduce/main_table_in21k.sh

NOTE: the validation directory (val_dir) in the config files (relabel_and_validate/cfg/reproduce/CONFIG_FILE) should be changed to the correct path on your device.

Reproduce Results for Other Compression Ratios

Please refer to README: Usage for details on the three modules.

Table Results (Google Drive)

| No.         | Content                                                                     | Datasets      |
|-------------|-----------------------------------------------------------------------------|---------------|
| Table 1     | Dataset Analysis                                                            | ImageNet-1K   |
| Table 2     | (a) SOTA Comparison (b) Large Networks                                      | Tiny ImageNet |
| Table 3     | SOTA Comparison                                                             | ImageNet-1K   |
| Table 4     | Ablation Study                                                              | ImageNet-1K   |
| Table 5     | (a) Pruning Metrics (b) Calibration                                         | ImageNet-1K   |
| Table 6     | (a) Large Pruning Ratio (b) ResNet-50 Result (c) Cross Architecture Result  | ImageNet-1K   |
| Table 7     | SOTA Comparison                                                             | ImageNet-21K  |
| Table 8     | Adaptation to Optimization-free Method (i.e., RDED)                         | ImageNet-1K   |
| Table 9     | Comparison to G-VBSM                                                        | ImageNet-1K   |
| Appendix    |                                                                             |               |
| Table 10-18 | Configurations                                                              | -             |
| Table 19    | Detailed Ablation                                                           | ImageNet-1K   |
| Table 20    | Large IPCs (i.e., IPC300 and IPC400)                                        | ImageNet-1K   |
| Table 23    | Comparison to FKD                                                           | ImageNet-1K   |

Related Repos

Our code is mainly related to the following papers and repos:

Citation

@inproceedings{xiao2024lpld,
  title={Are Large-scale Soft Labels Necessary for Large-scale Dataset Distillation?},
  author={Lingao Xiao and Yang He},
  booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems},
  year={2024}
}
