Adding LoRA #523

Open · wants to merge 4 commits into master

Conversation

mcaresein

First pull request ever! Please be kind :)

I propose an implementation of the LoRA finetuning algorithm. I'm a basic PyTorch user and a total newbie when it comes to the more advanced LLM libraries, and I just wanted to practice a bit. Of course I missed the previous implementation #187, otherwise I would have worked on that.

I tried to modify the code as little as possible. I drew a lot of inspiration from Hugging Face's PEFT library, but I tried to implement LoRA with the smallest overhead I could. The implementation is basically a new child class LoraTransformer that introduces two new methods: one to freeze the existing weights and one to add LoRA A/B layers on a specified set of linear layers.
A new member lora_layers has been added to ModelArgs to specify the layers LoRA has to be applied to; it is a dict that stores r and alpha for each layer.
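
To make the idea concrete, here is a minimal sketch (not the actual code in this PR; LoRALinear, freeze_base_weights, and add_lora_layers are just illustrative names, and the import of Transformer/ModelArgs from model.py is assumed):

```python
import torch
import torch.nn as nn

from model import ModelArgs, Transformer  # existing llama2.c classes (assumed import)


class LoRALinear(nn.Module):
    """A frozen nn.Linear plus a trainable low-rank A/B update (illustrative name)."""

    def __init__(self, base: nn.Linear, r: int, alpha: float):
        super().__init__()
        self.base = base                      # pretrained weights, kept frozen
        self.scaling = alpha / r
        self.A = nn.Linear(base.in_features, r, bias=False)
        self.B = nn.Linear(r, base.out_features, bias=False)
        nn.init.normal_(self.A.weight, std=0.02)
        nn.init.zeros_(self.B.weight)         # so the LoRA update starts at zero

    def forward(self, x):
        return self.base(x) + self.scaling * self.B(self.A(x))


class LoraTransformer(Transformer):
    """Child class with the two extra methods described above (names are illustrative)."""

    def freeze_base_weights(self):
        for p in self.parameters():
            p.requires_grad = False

    def add_lora_layers(self, lora_layers: dict):
        # lora_layers maps a linear layer's attribute name to its (r, alpha),
        # e.g. {"wq": (2, 1.0), "wv": (2, 1.0)}
        targets = []
        for module in self.modules():
            for name, child in module.named_children():
                if isinstance(child, nn.Linear) and name in lora_layers:
                    targets.append((module, name, child))
        for module, name, child in targets:
            r, alpha = lora_layers[name]
            setattr(module, name, LoRALinear(child, r, alpha))


# Ordering matters (see the list below): load the checkpoint into the base model
# first, then freeze, then attach LoRA. Sizes and paths here are just examples.
args = ModelArgs(dim=64, n_layers=2, n_heads=2, vocab_size=512, max_seq_len=256)
model = LoraTransformer(args)
model.load_state_dict(torch.load("out/ckpt.pt")["model"])  # assumed checkpoint layout
model.freeze_base_weights()
model.add_lora_layers({"wq": (2, 1.0), "wv": (2, 1.0)})
```

Only the A/B parameters end up with requires_grad=True, so the existing optimizer setup picks up just the LoRA weights.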

Something to take into account:

  • Checkpoints must be loaded before freezing the weights and adding LoRA. I don't think this is a major drawback; in fact, from what I can tell, this is also how PEFT works.
  • I still have to implement LoRA for nn.Embedding, but I couldn't figure out whether it is really needed. From what I understand of the original LoRA paper, most of the gain comes from low-r LoRA on as many linear layers as one can afford.
  • There are some changes in train.py to account for the finetuning. I have no clue about learning rate schedules for finetuning, so I tweaked the get_lr function to restart the counter from 0 while finetuning is going on (see the sketch after this list). I look forward to any hints on how to do it properly.

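For context, this is roughly the shape of that tweak, assuming the cosine-with-warmup get_lr already in train.py; finetune_start_iter is a hypothetical name for the iteration count of the loaded checkpoint, and the hyperparameter values are only examples:

```python
import math

# Existing train.py hyperparameters (example values only), plus a hypothetical
# finetune_start_iter holding the iteration count of the loaded checkpoint.
learning_rate = 5e-4
min_lr = 0.0
warmup_iters = 1000
lr_decay_iters = 100000
finetune_start_iter = 0  # set to the checkpoint's iter_num when finetuning


def get_lr(it):
    if finetune_start_iter > 0:
        it = max(it - finetune_start_iter, 0)  # restart warmup/decay from 0 when finetuning
    # 1) linear warmup for warmup_iters steps
    if it < warmup_iters:
        return learning_rate * it / warmup_iters
    # 2) past lr_decay_iters, return the minimum learning rate
    if it > lr_decay_iters:
        return min_lr
    # 3) in between, cosine decay down to min_lr
    decay_ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))
    return min_lr + coeff * (learning_rate - min_lr)
```
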
I ran pytest just for backward compatibility, but I did not add any tests for LoRA since I would first like feedback on whether the implementation is doing the right thing. I'm short on GPUs, so I ran a quick test on an RTX A6000 (hours are not billed on the login node of a cluster I have access to :) ), finetuning the 260K model from the checkpoints provided on Hugging Face, with LoRA on all linear layers, r=2, alpha=1 (I'm also not sure whether this choice of parameters is among the dumbest):

tokens per iteration will be: 131,072
breaks down as: 4 grad accum steps * 1 processes * 128 batch size * 256 max seq len
Finetune from out
num decayed parameter tensors: 40, with 4,480 parameters
num non-decayed parameter tensors: 0, with 0 parameters
using fused AdamW: True
compiling the model... (takes a ~minute)
Created a PretokDataset with rng seed 42
Created a PretokDataset with rng seed 42
Created a PretokDataset with rng seed 42
step 98000: train loss 1.9458, val loss 1.9464
Created a PretokDataset with rng seed 42
Created a PretokDataset with rng seed 42
step 100000: train loss 1.3960, val loss 1.3954
100000 | loss 1.4413 | lr 4.998741e-04 | 1255.19ms | mfu 4.47%

(I skipped the per-iteration prints. Warmup is 1000 iterations and I ran it for 2000 iterations; the other hyperparameters are the defaults.)

I look forward to any feedback on this. It has been fun!
