Adding LoRA fine tuning #187
base: master
Conversation
This looks elegant to me.
I like where this is going, but this looks like multiple PRs in one, and a little bit of sus code. I'll inline comment.
@wlamond It's a very simple change to your PR, you just need to reference bnb.nn.Linear4bit for the 4-bit quantization.
@vgoklani Oooo, I do like that idea. I think it would be better as a separate PR though. I'm not sure how Andrej feels about adding other dependencies, so I'd rather get this project finished and then add QLoRA as another option if there's interest. Thanks for the idea and feedback!
There's definitely interest. It adds possibilities to do more with less 🙇♂️
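For anyone curious, a rough sketch of the bitsandbytes swap discussed above (not part of this PR; it assumes bnb.nn.Linear4bit accepts in/out features plus bias, compute_dtype, and quant_type keywords, and that the copied weights are quantized when the module is moved to the GPU; the helper name is made up):

import bitsandbytes as bnb
import torch
import torch.nn as nn

def swap_linear_for_4bit(model: nn.Module, compute_dtype=torch.bfloat16):
    # Recursively replace nn.Linear modules with 4-bit quantized equivalents,
    # keeping shapes and bias settings; LoRA could then be applied on top.
    for name, child in model.named_children():
        if isinstance(child, nn.Linear):
            qlinear = bnb.nn.Linear4bit(
                child.in_features,
                child.out_features,
                bias=child.bias is not None,
                compute_dtype=compute_dtype,
                quant_type="nf4",
            )
            qlinear.weight.data.copy_(child.weight.data)
            if child.bias is not None:
                qlinear.bias.data.copy_(child.bias.data)
            setattr(model, name, qlinear)
        else:
            swap_linear_for_4bit(child, compute_dtype)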
os.makedirs(out_dir, exist_ok=True)
for p in model.parameters():
    p.requires_grad = False
apply_lora(model, layer_types=lora_layer_types, rank=lora_rank, dropout=lora_dropout, alpha=lora_alpha)
Looks like you apply LoRA to every Linear layer? Could you provide the ability to register LoRA only for target modules (e.g. wq and wk, as suggested in the original paper)? The code might look like:
def apply_lora(model, ..., target_modules=['wq', 'wk']):
    for name, layer in model.named_modules():
        if name.split(".")[-1] not in target_modules:
            continue
        # register lora parameterization
I love this idea! I have a local implementation that does that, but I'll update this to follow suit.
@@ -332,5 +350,12 @@ def get_lr(it):
    if iter_num > max_iters:
        break

if init_from == "lora_finetune":
    print('merging lora')
    merge_lora(raw_model)
Maybe save lora parameters in a standalone file?
Agreed, the checkpoints actually already have the lora parameters in them (the parameterization is computed whenever the weights are referenced, including during exports). Saving them off to the side could enable hot swapping loras for different tasks at some point if folks are interested in that feature.
Another possible improvement: the original parameters don't need to be stored in the optimizer during lora finetuning.
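For concreteness, both ideas might look roughly like this (a sketch, not code from this PR; it assumes the parametrize-registered LoRA tensors have "lora_" in their state-dict keys, and borrows the raw_model, out_dir, and optimizer hyperparameter names from train.py):

import os
import torch

# Save only the LoRA tensors so they can be hot swapped per task later.
lora_state = {k: v for k, v in raw_model.state_dict().items() if "lora_" in k}
torch.save(lora_state, os.path.join(out_dir, "lora_params.pt"))

# Build the optimizer over trainable (LoRA) parameters only, so the frozen
# base weights never accumulate optimizer state (momentum/variance buffers).
optimizer = torch.optim.AdamW(
    [p for p in raw_model.parameters() if p.requires_grad],
    lr=learning_rate,
    betas=(beta1, beta2),
    weight_decay=weight_decay,
)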
Oh, I got it. Looking forward to your updates 🚀
        return weight + torch.matmul(self.lora_b, self.dropout(self.lora_a)) * self.scaling


def apply_lora(model: nn.Module, layer_types=[nn.Linear], rank=8, dropout=0.0, alpha=1.0):
I always get a little antsy seeing Lists in defaults
https://docs.python-guide.org/writing/gotchas/#what-you-wrote
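The usual None sentinel sidesteps that gotcha (a sketch, not the PR's code):

import torch.nn as nn

def apply_lora(model: nn.Module, layer_types=None, rank=8, dropout=0.0, alpha=1.0):
    # Resolve the default inside the function so no mutable object is shared
    # between calls; a tuple would also work as an immutable default.
    if layer_types is None:
        layer_types = (nn.Linear,)
    # ... rest of apply_lora unchanged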
def _apply_lora(module):
    if type(module) in layer_types and hasattr(module, 'weight'):
        fan_out, fan_in = module.weight.shape
        parametrize.register_parametrization(module, 'weight', LoraLinear(fan_in, fan_out, rank, dropout, alpha))
Would this fail if layer_types includes nn.Embedding? It shouldn't get replaced with LoraLinear, right?
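For what it's worth, register_parametrization only wraps module.weight, so nothing gets replaced by LoraLinear; the real question is whether the (fan_out, fan_in) unpacking is meaningful for an embedding's (num_embeddings, embedding_dim) weight. Handling the two cases explicitly might look like this (illustrative only; register_lora_on is a made-up name and LoraLinear is the PR's parametrization class):

import torch.nn as nn
import torch.nn.utils.parametrize as parametrize

def register_lora_on(module, rank=8, dropout=0.0, alpha=1.0):
    # nn.Linear weights are (out_features, in_features) while nn.Embedding
    # weights are (num_embeddings, embedding_dim); both are 2-D, so the same
    # low-rank parameterization applies, but unpacking them explicitly keeps
    # the intent clear.
    if isinstance(module, nn.Linear):
        fan_out, fan_in = module.weight.shape
    elif isinstance(module, nn.Embedding):
        fan_out, fan_in = module.num_embeddings, module.embedding_dim
    else:
        return
    parametrize.register_parametrization(
        module, 'weight', LoraLinear(fan_in, fan_out, rank, dropout, alpha)
    )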
@@ -54,6 +55,12 @@
n_heads = 6
multiple_of = 32
dropout = 0.0
# LoRA
lora_layer_types = [nn.Linear, nn.Embedding]
This can't be overridden as an arg if it's a list like this, can it?
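One workaround (a sketch, not part of the PR): keep the config value as plain strings so it stays overridable, and resolve the strings to classes inside train.py; the mapping name below is made up.

import torch.nn as nn

# Strings are overridable via config files / the configurator; class objects are not.
lora_layer_types = ["Linear", "Embedding"]

_LAYER_TYPES = {"Linear": nn.Linear, "Embedding": nn.Embedding}
resolved_lora_layer_types = [_LAYER_TYPES[name] for name in lora_layer_types]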
best_val_loss = checkpoint["best_val_loss"]

if init_from == "lora_finetune":
    out_dir = out_dir + "_lora_finetune"
os.path.join?
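If the suffix is meant to create a sibling directory (e.g. "out_lora_finetune"), the concatenation is fine; if a subdirectory is wanted instead, the join spelling would be (sketch):

import os

out_dir = os.path.join(out_dir, "lora_finetune")  # e.g. "out/lora_finetune" rather than "out_lora_finetune"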
@wlamond I'd love to do some experimentation with LoRA on various types of smaller models. Any chance this PR could be revived/updated?
Adding an implementation of LoRA fine tuning, heavily inspired by minLoRA. I thought the use of pytorch parametrization was interesting and simple, and fits in nicely with the approach of this project. Let me know if you were thinking of explicitly implementing the modified forward pass rather than a factored/merged forward pass, or if you think this would be a better fit as a separate repo.
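For readers skimming the thread, a condensed sketch of that parametrization approach (based on the forward line quoted in the diff above; the initialization details here are standard LoRA and may differ from the PR):

import math
import torch
import torch.nn as nn
import torch.nn.utils.parametrize as parametrize

class LoraLinear(nn.Module):
    # Parametrization: given the original weight, return it plus a low-rank
    # update B @ A scaled by alpha / rank.
    def __init__(self, fan_in, fan_out, rank=8, dropout=0.0, alpha=1.0):
        super().__init__()
        self.lora_a = nn.Parameter(torch.zeros(rank, fan_in))
        self.lora_b = nn.Parameter(torch.zeros(fan_out, rank))
        nn.init.kaiming_uniform_(self.lora_a, a=math.sqrt(5))  # B stays zero, so the update is a no-op at init
        self.dropout = nn.Dropout(dropout)
        self.scaling = alpha / rank

    def forward(self, weight):
        return weight + torch.matmul(self.lora_b, self.dropout(self.lora_a)) * self.scaling

# Every access to linear.weight now goes through forward(), so the rest of the
# model code is unchanged.
linear = nn.Linear(64, 32, bias=False)
parametrize.register_parametrization(linear, 'weight', LoraLinear(fan_in=64, fan_out=32))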
I added the tinyshakespeare dataset and default to fine tuning on that. I wanted to tune the tinystories models a small amount (~50-100 steps) to get Shakespearian tiny stories :) I had some mixed results, e.g.:
Still mostly story-like, but certainly leaning more towards the drama of Shakespeare. I like the commentary on how being exposed to new and original thoughts can leave you in a new state of being. ;)
I also tuned this for ~1k steps with the 15M param model to get something that more closely resembles Shakespeare.
I only have access to a 1080ti and a v100 16GB, so I wasn't able to do more thorough testing/experimentation on the actual Llama2 checkpoints. Let me know if you'd like to see more testing before making a decision on what to do with this.
Thanks for sharing this project! It's been fun to play with.