[Paper | BibTeX | Google Drive]
Official Implementation for "Are Large-scale Soft Labels Necessary for Large-scale Dataset Distillation?", published at NeurIPS'24.
> Images from left to right are from IPC20 LPLD datasets: cock (left), bald eagle, volcano, trailer truck (right).

Abstract: In ImageNet-condensation, the storage for auxiliary soft labels exceeds that of the condensed dataset by over 30 times. However, are large-scale soft labels necessary for large-scale dataset distillation? In this paper, we first discover that the high within-class similarity in condensed datasets necessitates the use of large-scale soft labels. This high within-class similarity can be attributed to the fact that previous methods use samples from different classes to construct a single batch for batch normalization (BN) matching. To reduce the within-class similarity, we introduce class-wise supervision during the image synthesizing process by batching the samples within classes, instead of across classes. As a result, we can increase within-class diversity and reduce the size of required soft labels. A key benefit of improved image diversity is that soft label compression can be achieved through simple random pruning, eliminating the need for complex rule-based strategies. Experiments validate our discoveries. For example, when condensing ImageNet-1K to 200 images per class, our approach compresses the required soft labels from 113 GB to 2.8 GB (40x compression) with a 2.6% performance gain.
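For intuition, the key change during image synthesis is that BN-statistics matching uses batches drawn from a single class rather than across classes. Below is a minimal, illustrative sketch of this class-wise supervision idea; it is not the repo's recovery code, and the function and argument names are hypothetical.

```python
# Minimal, illustrative sketch of class-wise BN matching (NOT the repo's actual code).
# The batch contains synthetic samples of ONE class, and its feature statistics are
# matched against stored BN statistics; all names here are hypothetical.
import torch

def class_wise_bn_loss(class_batch, bn_layers, features_by_layer):
    """BN-statistics matching loss for a batch drawn from a single class.

    class_batch:       synthetic images of one class, shape (B, C, H, W)
    bn_layers:         list of nn.BatchNorm2d layers holding the target running stats
    features_by_layer: callable mapping images -> list of feature maps, one per BN layer
    """
    loss = torch.zeros((), device=class_batch.device)
    for feat, bn in zip(features_by_layer(class_batch), bn_layers):
        batch_mean = feat.mean(dim=[0, 2, 3])                    # stats of the class-wise batch
        batch_var = feat.var(dim=[0, 2, 3], unbiased=False)
        loss = loss + torch.norm(batch_mean - bn.running_mean) \
                    + torch.norm(batch_var - bn.running_var)
    return loss
```

Batching within classes is what increases within-class diversity, which in turn is what allows the soft labels to be compressed by simple random pruning.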
Download the repo:
```bash
git clone https://github.com/he-y/soft-label-pruning-for-dataset-distillation.git LPLD
cd LPLD
```
Create the PyTorch environment:
```bash
conda env create -f environment.yml
conda activate lpld
```
```bash
# sh download.sh [true|false]
sh download.sh false
```
The `true|false` flag controls whether to download only the 40x-compressed labels (`true`) or all labels (`false`; default).
Download manually from Google Drive, and place the downloaded files in the following structure:
```
.
├── README.md
├── recover
│   ├── model_with_class_bn
│   │   └── [put Models-with-Class-BN here]
│   └── validate_result
│       └── [put Distilled-Dataset here]
└── relabel_and_validate
    └── syn_label_LPLD
        └── [put Labels here]
```
| Dataset | Model with Class BN | Size |
|---|---|---|
| ImageNet-1K | ResNet18 | 50.41 MB |
| Tiny-ImageNet | ResNet18 | 81.30 MB |
| ImageNet-21K | ResNet18 | 445.87 MB |
| Dataset | Setting | Dataset Size |
|---|---|---|
| ImageNet-1K | IPC10 | 0.15 GB |
| ImageNet-1K | IPC20 | 0.30 GB |
| ImageNet-1K | IPC50 | 0.75 GB |
| ImageNet-1K | IPC100 | 1.49 GB |
| ImageNet-1K | IPC200 | 2.98 GB |
| Tiny-ImageNet | IPC50 | 21 MB |
| Tiny-ImageNet | IPC100 | 40 MB |
| ImageNet-21K | IPC10 | 3 GB |
| ImageNet-21K | IPC20 | 5 GB |
| Dataset | Setting | Previous Label Size | Previous Model Acc. | Ours Label Size | Ours Model Acc. |
|---|---|---|---|---|---|
| ImageNet-1K | IPC10 | 5.67 GB | 20.1% | 0.14 GB (40x) | 20.2% |
| ImageNet-1K | IPC20 | 11.33 GB | 33.6% | 0.29 GB (40x) | 33.0% |
| ImageNet-1K | IPC50 | 28.33 GB | 46.8% | 0.71 GB (40x) | 46.7% |
| ImageNet-1K | IPC100 | 56.66 GB | 52.8% | 1.43 GB (40x) | 54.0% |
| ImageNet-1K | IPC200 | 113.33 GB | 57.0% | 2.85 GB (40x) | 59.6% |
| Tiny-ImageNet | IPC50 | 449 MB | 41.1% | 11 MB (40x) | 38.4% |
| Tiny-ImageNet | IPC100 | 898 MB | 49.7% | 22 MB (40x) | 46.1% |
| ImageNet-21K | IPC10 | 643 GB | 18.5% | 16 GB (40x) | 21.3% |
| ImageNet-21K | IPC20 | 1286 GB | 20.5% | 32 GB (40x) | 29.4% |
- Full labels for ImageNet-21K are too large to upload; nevertheless, we provide the 40x-pruned labels.
- Labels for other compression ratios are provided in Google Drive; alternatively, refer to README: Usage to generate them (a sketch of the random-pruning idea follows below).
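The random-pruning step itself needs no rule-based scoring. As a rough, hedged sketch (not the repo's actual script, which README: Usage documents), pruning the relabeled soft labels amounts to keeping a random subset of the stored label files; the directory layout, file pattern, and default ratio below are assumptions.

```python
# Illustrative sketch of random soft-label pruning (NOT the repo's actual script).
# FKD-style relabeling stores many soft-label files; random pruning keeps a random
# subset of them. The glob pattern, paths, and default ratio are assumptions.
import pathlib
import random
import shutil

def randomly_prune_labels(label_dir, out_dir, keep_ratio=1 / 40, seed=0):
    """Copy a random `keep_ratio` fraction of soft-label files into `out_dir`."""
    files = sorted(p for p in pathlib.Path(label_dir).glob("*") if p.is_file())
    if not files:
        return []
    random.seed(seed)
    kept = random.sample(files, max(1, int(len(files) * keep_ratio)))
    out = pathlib.Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for f in kept:
        shutil.copy2(f, out / f.name)
    return kept
```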
Modify the PyTorch source code `torch.utils.data._utils.fetch._MapDatasetFetcher` to support multi-processing loading of soft-label data and mix configurations:
```python
class _MapDatasetFetcher(_BaseDatasetFetcher):
    def fetch(self, possibly_batched_index):
        if hasattr(self.dataset, "mode") and self.dataset.mode == 'fkd_load':
            if hasattr(self.dataset, "G_VBSM") and self.dataset.G_VBSM:
                pass  # G_VBSM: uses self-decoding in the training script
            elif hasattr(self.dataset, "use_batch") and self.dataset.use_batch:
                mix_index, mix_lam, mix_bbox, soft_label = self.dataset.load_batch_config_by_batch_idx(possibly_batched_index[0])
            else:
                mix_index, mix_lam, mix_bbox, soft_label = self.dataset.load_batch_config(possibly_batched_index[0])

        if self.auto_collation:
            if hasattr(self.dataset, "__getitems__") and self.dataset.__getitems__:
                data = self.dataset.__getitems__(possibly_batched_index)
            else:
                data = [self.dataset[idx] for idx in possibly_batched_index]
        else:
            data = self.dataset[possibly_batched_index]

        if hasattr(self.dataset, "mode") and self.dataset.mode == 'fkd_load':
            # NOTE: mix_index, mix_lam, mix_bbox can be None
            mix_index_cpu = mix_index.cpu() if mix_index is not None else None
            return self.collate_fn(data), mix_index_cpu, mix_lam, mix_bbox, soft_label.cpu()
        else:
            return self.collate_fn(data)
```
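As a sanity check, the sketch below (not part of the repo) prints the PyTorch source file that needs the patch above and shows the per-batch structure a patched `DataLoader` is expected to yield for an FKD-style dataset whose `mode` is `'fkd_load'`; the helper name and defaults are illustrative.

```python
# Illustrative helper (not part of the repo): locate the file to patch and iterate a
# patched DataLoader over an FKD-style dataset (dataset.mode == 'fkd_load').
import torch.utils.data._utils.fetch as fetch
from torch.utils.data import DataLoader

print("File to patch:", fetch.__file__)

def iterate_fkd_batches(dataset, batch_size=64, num_workers=4):
    """Yield (images, targets, mix_index, mix_lam, mix_bbox, soft_label) per batch.

    With the patched fetcher, each item produced by the loader is the collated
    batch followed by the stored mix configuration and soft labels.
    """
    loader = DataLoader(dataset, batch_size=batch_size,
                        num_workers=num_workers, shuffle=False)
    for batch, mix_index, mix_lam, mix_bbox, soft_label in loader:
        images, targets = batch
        yield images, targets, mix_index, mix_lam, mix_bbox, soft_label
```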
To reproduce the table above for the 40x compression ratio, run the following code:
```bash
cd relabel_and_validate
bash scripts/reproduce/main_table_in1k.sh
bash scripts/reproduce/main_table_tiny.sh
bash scripts/reproduce/main_table_in21k.sh
```
NOTE: the validation directory (`val_dir`) in the config files (`relabel_and_validate/cfg/reproduce/CONFIG_FILE`) should be changed to the correct path on your device.
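If you prefer to update every reproduce config at once, a small helper like the following can do it. This is not part of the repo: it assumes the configs are plain-text/YAML files containing a `val_dir:` entry, and the glob pattern may need adjusting to the actual file names.

```python
# Illustrative helper (not part of the repo): point every reproduce config at your
# local validation set. Assumes text/YAML configs with a `val_dir:` entry; adjust
# the glob pattern and key name to the actual file layout.
import pathlib
import re

VAL_DIR = "/path/to/imagenet/val"  # change to the correct path on your device
cfg_dir = pathlib.Path("relabel_and_validate/cfg/reproduce")

for cfg in sorted(cfg_dir.glob("*.yaml")):
    text = cfg.read_text()
    updated = re.sub(r"(?m)^(\s*val_dir\s*:).*$", rf"\g<1> {VAL_DIR}", text)
    cfg.write_text(updated)
    print(f"updated val_dir in {cfg}")
```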
Please refer to README: Usage for details on the three modules.
Table Results (Google Drive)
| No. | Content | Datasets |
|---|---|---|
| Table 1 | Dataset Analysis | ImageNet-1K |
| Table 2 | (a) SOTA Comparison (b) Large Networks | Tiny-ImageNet |
| Table 3 | SOTA Comparison | ImageNet-1K |
| Table 4 | Ablation Study | ImageNet-1K |
| Table 5 | (a) Pruning Metrics (b) Calibration | ImageNet-1K |
| Table 6 | (a) Large Pruning Ratio (b) ResNet-50 Result (c) Cross-Architecture Result | ImageNet-1K |
| Table 7 | SOTA Comparison | ImageNet-21K |
| Table 8 | Adaptation to Optimization-free Method (i.e., RDED) | ImageNet-1K |
| Table 9 | Comparison to G-VBSM | ImageNet-1K |
| Appendix | | |
| Table 10-18 | Configurations | - |
| Table 19 | Detailed Ablation | ImageNet-1K |
| Table 20 | Large IPCs (i.e., IPC300 and IPC400) | ImageNet-1K |
| Table 23 | Comparison to FKD | ImageNet-1K |
Our code is mainly related to the following papers and repos:
- Squeeze, Recover and Relabel: Dataset Condensation at ImageNet Scale From A New Perspective
- ImageNet-21K Pretraining for the Masses
```bibtex
@inproceedings{xiao2024lpld,
  title={Are Large-scale Soft Labels Necessary for Large-scale Dataset Distillation?},
  author={Lingao Xiao and Yang He},
  booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems},
  year={2024}
}
```