This project is an unofficial implementation of MAE, built with the support of Beneufit, Inc. The transformers were implemented purely with PyTorch and the Einops library. The positional encoding and token modules were also implemented with reference to the original Vision Transformer paper.
The idea of MAE is to leverage a huge set of unlabelled data (images) to learn rich representations of the dataset. These learned representations can then be utilized in downstream tasks such as classification, clustering, image segmentation, or anomaly detection, significantly enhancing performance by providing a strong, pre-trained feature extractor that adapts well to various applications.
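For readers unfamiliar with the mechanics, the core trick is to hide a large fraction of the image patches (typically 75%) and train an encoder-decoder pair to reconstruct the missing pixels, so the encoder only ever sees the visible patches. The snippet below is a minimal sketch of just the random-masking step in plain PyTorch; it is for illustration and is not necessarily identical to the masking code in this repository.

```python
import torch

def random_masking(patches: torch.Tensor, mask_ratio: float = 0.75):
    """Randomly drop a fraction of patch tokens, MAE-style.

    patches: (batch, num_patches, dim) patch embeddings.
    Returns the visible patches, a binary mask in the original patch
    order (0 = kept, 1 = masked), and the indices to restore that order.
    """
    b, n, d = patches.shape
    n_keep = int(n * (1 - mask_ratio))

    noise = torch.rand(b, n, device=patches.device)   # one random score per patch
    ids_shuffle = torch.argsort(noise, dim=1)         # random permutation of patches
    ids_restore = torch.argsort(ids_shuffle, dim=1)   # inverse permutation

    ids_keep = ids_shuffle[:, :n_keep]
    visible = torch.gather(patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, d))

    mask = torch.ones(b, n, device=patches.device)
    mask[:, :n_keep] = 0
    mask = torch.gather(mask, 1, ids_restore)         # back to original patch order
    return visible, mask, ids_restore
```

The encoder processes only `visible`, and the decoder later re-inserts learnable mask tokens at the masked positions (using `ids_restore`) before predicting the missing pixels.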
I split the dataset from Kaggle's Doges 77 Breeds into three parts: about 10k labelled images to train the downstream classification part, about 5k labelled images for testing/evaluating the final downstream model, and the remaining 300k+ images (labels removed) to train the MAE itself without any labels. A few thousand random dog pictures were also included in this last, unlabelled set.
The pre-training part (training the MAE model itself) was done on two RTX 4090 GPUs, 32 GB of RAM, and a 16-core AMD Ryzen CPU.
The configuration used during this training is exactly the one in Masked-AutoEncoder-PyTorch/configs/pretrain/mae_pretrain_224_16.yaml.
The MAE's training loss is shown below. Cosine annealing was used as the learning rate schedule. I believe that cosine annealing, although it produces a less smooth loss curve, is an effective way to approach the global optimum.
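For reference, a cosine-annealed schedule can be set up in PyTorch as shown below. This is a minimal sketch with a placeholder model, learning rate, and epoch count; the actual values (including any warm-up) are defined in the pretraining config.

```python
import torch

model = torch.nn.Linear(8, 8)   # stand-in for the MAE model, for illustration only
optimizer = torch.optim.AdamW(model.parameters(), lr=1.5e-4, weight_decay=0.05)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=1100)

for epoch in range(1100):
    # ... one training epoch: forward pass, loss.backward(), optimizer.step() per batch ...
    scheduler.step()            # learning rate decays along a cosine curve toward ~0
```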
Meanwhile, the reconstruction outputs of the MAE were plotted every 2 epochs. All the reconstructions can be found in the train_reconstructions folder. The figure below shows the reconstruction results from the 3rd epoch and the last epoch.
| Reconstruction at epoch 2 | Reconstruction at epoch 1100 |
It is evident that the MAE was learning as intended. However, I could not achieve the near-perfect reconstructions reported in the paper, probably due to the size of my dataset and the relatively small MAE architecture used.
Using the encoder weights from the MAE above, classifier layers were added and fine-tuned, with the encoder weights kept fully frozen. The results on the 10k downstream training set and the 5k test set mentioned previously are shown below.
| Train Accuracy with MAE | Test Accuracy with MAE |
Both the training and testing above were done on their respective datasets as described previously. In just 20 epochs, the training accuracy reached about 41% across the 77 classes, while the test accuracy reached about 35%.
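For clarity, the frozen-encoder setup described above boils down to the sketch below. The encoder here is a stand-in placeholder module rather than the actual MAE encoder class from this repository, and the checkpoint loading step is omitted; only the freezing and the trainable classification head are the point.

```python
import torch
import torch.nn as nn

# Placeholder encoder; in the real code this is the pretrained MAE encoder
# loaded from the pretraining checkpoint.
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 768))
for p in encoder.parameters():
    p.requires_grad = False            # encoder weights stay fully frozen
encoder.eval()

head = nn.Linear(768, 77)              # classification head for the 77 breeds

optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

images = torch.randn(4, 3, 224, 224)   # dummy batch
labels = torch.randint(0, 77, (4,))

with torch.no_grad():                  # no gradients through the frozen encoder
    features = encoder(images)
loss = criterion(head(features), labels)
loss.backward()                        # only the head receives gradients
optimizer.step()
```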
As a sanity check, I ran another identical experiment on the downstream task, except that this time the pretrained weights of the MAE encoder were not loaded.
| Train Accuracy without MAE | Test Accuracy without MAE |
The accuracies barely reached 3% over the 20 epochs. It is clear that the weights from the pretrained MAE encoder make a large difference, which goes to show that the concept of MAE works.
There are two parts in this section. The first part is training the MAE itself, which we will call pretraining. The second part is using the trained MAE for an actual downstream task - in this case, classification.
We start with pretraining, i.e. training the MAE model itself; the resulting weights will later be reused for the classification task.
First, install the required packages from requirements.txt.
To start the pretraining, first place your unlabelled dataset in a folder and adjust the configuration at Masked-AutoEncoder-PyTorch/configs/pretrain/mae_pretrain_224_16.yaml accordingly. Next, run
python pretrain.py --config configs/pretrain/mae_pretrain_224_16.yaml --logging_config configs/pretrain/logging_pretrain.yaml
During training, visualizations of the reconstructions will be saved in the figures folder. You can refer to my results in the train_reconstructions folder.
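Since the pretraining data carries no labels, the images only need to sit in a plain folder (nested sub-folders are fine too). If you want to sanity-check your data outside the provided scripts, a minimal unlabelled dataset wrapper could look like the sketch below; this is illustrative and not necessarily how the repository's own data pipeline is implemented.

```python
from pathlib import Path

from PIL import Image
from torch.utils.data import Dataset
from torchvision import transforms

class UnlabelledImages(Dataset):
    """Serves images from a folder without any labels."""

    def __init__(self, root: str, image_size: int = 224):
        self.paths = sorted(p for p in Path(root).rglob("*")
                            if p.suffix.lower() in {".jpg", ".jpeg", ".png"})
        self.transform = transforms.Compose([
            transforms.RandomResizedCrop(image_size, scale=(0.2, 1.0)),
            transforms.RandomHorizontalFlip(),
            transforms.ToTensor(),
        ])

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        img = Image.open(self.paths[idx]).convert("RGB")
        return self.transform(img)     # image only, no label
```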
Make sure that the weights of the pretrained model are placed at the appropriate location (depending on your configuration) and that the same model configuration from pretraining is also used in Masked-AutoEncoder-PyTorch/configs/finetune/mae_finetune_224_16.yaml. For this step the dataset needs to be labelled: place the images in separate folders according to their classes. Then, start the training with
python finetune.py --config configs/finetune/mae_finetune_224_16.yaml --logging_config configs/finetune/logging_finetune.yaml
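The one-folder-per-class layout is the same convention that torchvision's ImageFolder expects, so you can quickly verify your data is organised correctly with the snippet below (the paths are hypothetical examples, not the ones used by the repository's config).

```python
from torchvision import datasets, transforms

# Expected layout (example paths):
#   data/finetune/train/beagle/001.jpg
#   data/finetune/train/husky/002.jpg
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
train_set = datasets.ImageFolder("data/finetune/train", transform=transform)
print(train_set.classes[:5])   # class names inferred from the folder names
```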
I extend my sincere gratitude to Beneufit, Inc. for their generous funding and support. Their commitment to innovation made this project possible and has been a source of inspiration for me. Thank you, Beneufit, Inc., for your invaluable contribution.
I will continue these experiments with other computer vision tasks such as object localization and pose estimation using the same trained weights. The objective is to investigate whether MAE is useful for computer vision tasks beyond classification.
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.