Adding LoRA #523

Open · wants to merge 4 commits into master

Conversation

mcaresein

First pull request ever! Please be kind :)

I propose an implementation of the LoRA finetuning algorithm. I'm a basic PyTorch user and a total newbie when it comes to the more advanced LLM libraries, and I just wanted to practice a bit. Of course I missed the previous implementation #187, otherwise I would have worked on that.

I tried to modify the code as little as possible. I drew a lot of inspiration from Hugging Face's PEFT library, but I tried to implement LoRA with the smallest overhead I could. The implementation is basically a new child class LoraTransformer that introduces two new methods: one to freeze the existing weights and one to add LoRA A/B layers on a specified set of linear layers.
A new member lora_layers has been added to ModelArgs to specify the layers LoRA has to be applied to; it is a dict that stores r and alpha for each layer.
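
To make the idea concrete, here is a minimal sketch (not the actual code in this PR; LoRALinear, freeze_base_weights, and add_lora_layers are just illustrative names, and the import of Transformer/ModelArgs from model.py is assumed):

```python
import torch
import torch.nn as nn

from model import ModelArgs, Transformer  # existing llama2.c classes (assumed import)


class LoRALinear(nn.Module):
    """A frozen nn.Linear plus a trainable low-rank A/B update (illustrative name)."""

    def __init__(self, base: nn.Linear, r: int, alpha: float):
        super().__init__()
        self.base = base                      # pretrained weights, kept frozen
        self.scaling = alpha / r
        self.A = nn.Linear(base.in_features, r, bias=False)
        self.B = nn.Linear(r, base.out_features, bias=False)
        nn.init.normal_(self.A.weight, std=0.02)
        nn.init.zeros_(self.B.weight)         # so the LoRA update starts at zero

    def forward(self, x):
        return self.base(x) + self.scaling * self.B(self.A(x))


class LoraTransformer(Transformer):
    """Child class with the two extra methods described above (names are illustrative)."""

    def freeze_base_weights(self):
        for p in self.parameters():
            p.requires_grad = False

    def add_lora_layers(self, lora_layers: dict):
        # lora_layers maps a linear layer's attribute name to its (r, alpha),
        # e.g. {"wq": (2, 1.0), "wv": (2, 1.0)}
        targets = []
        for module in self.modules():
            for name, child in module.named_children():
                if isinstance(child, nn.Linear) and name in lora_layers:
                    targets.append((module, name, child))
        for module, name, child in targets:
            r, alpha = lora_layers[name]
            setattr(module, name, LoRALinear(child, r, alpha))


# Ordering matters (see the list below): load the checkpoint into the base model
# first, then freeze, then attach LoRA. Sizes and paths here are just examples.
args = ModelArgs(dim=64, n_layers=2, n_heads=2, vocab_size=512, max_seq_len=256)
model = LoraTransformer(args)
model.load_state_dict(torch.load("out/ckpt.pt")["model"])  # assumed checkpoint layout
model.freeze_base_weights()
model.add_lora_layers({"wq": (2, 1.0), "wv": (2, 1.0)})
```

Only the A/B parameters end up with requires_grad=True, so the existing optimizer setup picks up just the LoRA weights.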

Something to take into account:

  • Checkpoints must be loaded before freezing the weights and adding LoRA. I don't think this is a major drawback; in fact, from what I can tell, this is also how PEFT works.
  • I still have to implement LoRA for nn.Embedding, but I couldn't figure out whether it is really needed. From what I understand of the original LoRA paper, most of the gain comes from low-r LoRA on as many linear layers as one can afford.
  • There are some changes in train.py to account for the finetuning. I have no clue about learning rate schedules for finetuning, so I tweaked the get_lr function to restart the counter from 0 while finetuning is going on (see the sketch after this list). I look forward to any hints on how to do it properly.

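For context, this is roughly the shape of that tweak, assuming the cosine-with-warmup get_lr already in train.py; finetune_start_iter is a hypothetical name for the iteration count of the loaded checkpoint, and the hyperparameter values are only examples:

```python
import math

# Existing train.py hyperparameters (example values only), plus a hypothetical
# finetune_start_iter holding the iteration count of the loaded checkpoint.
learning_rate = 5e-4
min_lr = 0.0
warmup_iters = 1000
lr_decay_iters = 100000
finetune_start_iter = 0  # set to the checkpoint's iter_num when finetuning


def get_lr(it):
    if finetune_start_iter > 0:
        it = max(it - finetune_start_iter, 0)  # restart warmup/decay from 0 when finetuning
    # 1) linear warmup for warmup_iters steps
    if it < warmup_iters:
        return learning_rate * it / warmup_iters
    # 2) past lr_decay_iters, return the minimum learning rate
    if it > lr_decay_iters:
        return min_lr
    # 3) in between, cosine decay down to min_lr
    decay_ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))
    return min_lr + coeff * (learning_rate - min_lr)
```
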
I ran pytest just for backward compatibility, but I did not add any tests for LoRA since I would first like feedback on whether the implementation is doing the right thing. I'm short on GPUs, so I ran a quick test on an RTX A6000 (hours are not billed on the login node of a cluster I have access to :) ), finetuning the 260K model from the checkpoints provided on Hugging Face, with LoRA on all linear layers, r=2, alpha=1 (I'm also not sure whether this choice of parameters is among the dumbest):

tokens per iteration will be: 131,072
breaks down as: 4 grad accum steps * 1 processes * 128 batch size * 256 max seq len
Finetune from out
num decayed parameter tensors: 40, with 4,480 parameters
num non-decayed parameter tensors: 0, with 0 parameters
using fused AdamW: True
compiling the model... (takes a ~minute)
Created a PretokDataset with rng seed 42
Created a PretokDataset with rng seed 42
Created a PretokDataset with rng seed 42
step 98000: train loss 1.9458, val loss 1.9464
Created a PretokDataset with rng seed 42
Created a PretokDataset with rng seed 42
step 100000: train loss 1.3960, val loss 1.3954
100000 | loss 1.4413 | lr 4.998741e-04 | 1255.19ms | mfu 4.47%

(I skipped the per-iteration prints. Warmup is 1000 iterations and I ran it for 2000 iterations; the other hyperparameters are the defaults.)

I look forward to any feedback on this. It has been fun!
