Pytorch MiniLM

Unofficial PyTorch reimplementation of MiniLM and MiniLMv2 (incomplete). A minimal sketch of the MiniLM distillation objective is given after the paper list below.

  • MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers (NeurIPS 2020)
  • MiniLMv2: Multi-Head Self-Attention Relation Distillation for Compressing Pretrained Transformers (ACL 2021 Findings)
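
For context, MiniLM distills only the last transformer layer: the student is trained to match the teacher's self-attention distributions and its value-relation matrices (scaled dot-products of the values) via KL divergence, while MiniLMv2 generalizes this to query/key/value self-attention relations over relation heads. The sketch below is a minimal illustration of the MiniLM objective, not the code in this repository; tensor and function names are ours, and teacher and student are assumed to share the same number of attention heads (as MiniLM requires).

import torch
import torch.nn.functional as F

def kl_rows(teacher_logits, student_logits):
    # KL(teacher || student) over the last dimension, averaged over all rows.
    t_prob = F.softmax(teacher_logits, dim=-1)
    log_t = F.log_softmax(teacher_logits, dim=-1)
    log_s = F.log_softmax(student_logits, dim=-1)
    return (t_prob * (log_t - log_s)).sum(dim=-1).mean()

def minilm_loss(t_q, t_k, t_v, s_q, s_k, s_v):
    # All tensors: (batch, num_heads, seq_len, head_dim) from the LAST layer.
    d_t, d_s = t_q.size(-1), s_q.size(-1)

    # 1) Attention-distribution transfer: match softmax(QK^T / sqrt(d)).
    l_at = kl_rows(t_q @ t_k.transpose(-1, -2) / d_t ** 0.5,
                   s_q @ s_k.transpose(-1, -2) / d_s ** 0.5)

    # 2) Value-relation transfer: match softmax(VV^T / sqrt(d)).
    l_vr = kl_rows(t_v @ t_v.transpose(-1, -2) / d_t ** 0.5,
                   s_v @ s_v.transpose(-1, -2) / d_s ** 0.5)

    return l_at + l_vr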

Examples

  1. Generate the corpus
python generate_corpus.py --cache_dir /input/dataset --corpus_dir /input/osilab-nlp/wikipedia
  2. Generate the datasets
python generate_data.py \
        --train_corpus /input/osilab-nlp/wikipedia/corpus.txt \
        --bert_model ./models/bert-base-uncased \
        --output_dir ./data \
        --do_lower_case --reduce_memory
  3. Pretrain (the --masked_lm_prob flag is illustrated in the masking sketch after this list)
python -m torch.distributed.launch \
    --nproc_per_node=2 \
    run_pretrain.py \
    --pregenerated_data ./data \
    --cache_dir ./cache \
    --epochs 4 \
    --gradient_accumulation_steps 1 \
    --train_batch_size 8 \
    --learning_rate 1e-4 \
    --max_seq_length 128 \
    --student_model ./models/bert-base-uncased \
    --masked_lm_prob 0.15 \
    --do_lower_case --fp16 --scratch
  4. Finetune
python -m torch.distributed.launch --nproc_per_node=4 \
        run_finetune.py --model ./models/bert-base-uncased \
        --data_dir ./glue_data \
        --task_name RTE \
        --output_dir ./outputs \
        --do_lower_case --fp16 \
        --num_train_epochs 5 \
        --learning_rate 2e-05 \
        --eval_step 50 \
        --max_seq_length 128 \
        --train_batch_size 8
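
The --masked_lm_prob 0.15 flag in the pretraining example follows the standard BERT masking recipe: roughly 15% of input tokens are selected as prediction targets, and of those, 80% are replaced with [MASK], 10% with a random token, and 10% are left unchanged. The following is a rough sketch of that selection logic under those standard assumptions; it is not this repository's actual implementation.

import random

def create_masked_lm_predictions(tokens, vocab, masked_lm_prob=0.15, rng=random):
    # Standard BERT-style masking: pick masked_lm_prob of the tokens,
    # then 80% -> [MASK], 10% -> random token, 10% -> unchanged.
    output = list(tokens)
    labels = [None] * len(tokens)  # None = not a prediction target

    candidates = [i for i, tok in enumerate(tokens) if tok not in ("[CLS]", "[SEP]")]
    rng.shuffle(candidates)
    num_to_mask = max(1, int(round(len(candidates) * masked_lm_prob)))

    for i in candidates[:num_to_mask]:
        labels[i] = tokens[i]          # the original token is the target
        p = rng.random()
        if p < 0.8:
            output[i] = "[MASK]"
        elif p < 0.9:
            output[i] = rng.choice(vocab)
        # else: keep the original token unchanged
    return output, labels

Positions with a non-None label are the ones scored by the masked-language-model loss during pretraining.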

Experiments (To Be Continued)

MiniLM (BERT with 4 Layers, 312 Dims)

Task    Accuracy (%)
RTE     65.70
SST-2   86.85
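
The student above has 4 transformer layers and a 312-dimensional hidden size. A configuration of that shape can be built with Hugging Face transformers as sketched below; the head count and intermediate size are our assumptions (following the common 4-layer, 312-dim setup), not necessarily the exact config used in this repository.

from transformers import BertConfig, BertForMaskedLM

config = BertConfig(
    num_hidden_layers=4,      # 4 transformer layers
    hidden_size=312,          # 312-dim hidden states
    num_attention_heads=12,   # assumption: 12 heads (26 dims per head)
    intermediate_size=1200,   # assumption: common FFN size for 312-dim models
)
student = BertForMaskedLM(config)
print(sum(p.numel() for p in student.parameters()))  # rough parameter count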

Issues

  • (22.01.01) An unknown error occurs in the finetuning code in a multi-GPU setting on RTX 3090 (CUDA 11.4). (Solved)
  • (22.01.04) Completed the pretraining code on a tiny dataset (Wikipedia with 100 documents; also done with 6M documents).
  • (22.01.05) The learning rate shows as zero when using knowledge distillation. (Solved)
  • (22.01.07) An unknown error occurs in the pretraining code with more than 3 GPUs; the code works well on a 2-GPU server. (Solved)
  • (22.01.11) If the --reduce_memory option is not used, the code does not produce any errors on multiple GPUs (more than 3). (Solved)

TODO

  • Generate the Wikipedia corpus and datasets
  • Pretraining in a multi-GPU setting
  • Finetuning in a multi-GPU setting

References