
Drop interleave placement in QKV matrix #1013

Open · wants to merge 17 commits into main
Conversation

Andrei-Aksionov
Collaborator

@Andrei-Aksionov Andrei-Aksionov commented Mar 5, 2024

Hi there 👋

This PR changes placement of query, key and value weights in the combined QKV matrix.
Intuitively one might assume that the weights in the QKV matrix are combined sequentially, $[Q,Q,...,K,K,...,V,V,...]$, but in fact they are currently placed in an interleaved order, $[Q, K, V, Q, K, V, ..., Q, K, V]$.
I believe this placement was introduced by models that used GPT-NeoX-style layers (Pythia and Falcon), but none of the latest models use such an interleaved placement.

That means that:

  1. For a novice ML engineer who has just started learning LLMs and is familiar only with LlaMA-2-based models, this placement might be confusing.
  2. Since it's not that popular among newer models, it might hurt "hackability".
  3. It might introduce errors, like the one we recently had in Question about LoRA implementation when loading Llama2 checkpoint #891.
  4. If the repo is planned to support loading/uploading quantized weights from/to the HF hub, then someone has to deal with the changed shape of the quantized weights (due to “packing”), and such a conversion might be risky.
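To make the two layouts concrete, here is a small standalone sketch (shapes and values are illustrative, not litgpt's actual code) of how Q, K, and V are recovered from a combined weight in each placement:

```python
import torch

# Illustrative shapes only (not a real model config): 2 heads, head_size 4.
n_head, head_size, in_features = 2, 4, 16
rows = 3 * n_head * head_size

# Sequential placement [Q, Q, ..., K, K, ..., V, V, ...]: one plain split.
qkv_seq = torch.arange(rows * in_features, dtype=torch.float32).view(rows, in_features)
q, k, v = qkv_seq.split(n_head * head_size, dim=0)  # each (8, 16)

# Interleaved placement [Q, K, V, Q, K, V, ...]: rows are grouped per head,
# so recovering Q requires a per-head view and an extra reshape.
qkv_int = qkv_seq.view(3, n_head, head_size, in_features).transpose(0, 1).reshape(rows, in_features)
q_i, k_i, v_i = (t.reshape(n_head * head_size, in_features)
                 for t in qkv_int.view(n_head, 3, head_size, in_features).unbind(dim=1))

assert torch.equal(q, q_i) and torch.equal(k, k_i) and torch.equal(v, v_i)
```

The sequential layout needs only a single `split`, while the interleaved one needs an extra per-head view, which is exactly the kind of index bookkeeping that caused the #891 bug.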

Benchmarks

Benchmarks were done on a 1xA10G with the code from main and this PR branches.
Training was done 5 times for each model/branch and the results were averaged. Args: 1 epoch, epoch size of 1k; other args were left at their defaults.
Generation was done 100 times, and generation with max_new_tokens=1950 was done 10 times.

Models: two variants with 1 billion parameters each, simply to speed up the benchmarks.
Pythia represents a model with multi-head attention, while TinyLlama represents grouped-query attention. For multi-query attention there should not be any significant difference.

| Mode | Model | Time (main) | Time (PR) | Tokens/sec (main) | Tokens/sec (PR) | VRAM, GB (main) | VRAM, GB (PR) |
|------|-------|------------:|----------:|------------------:|-----------------:|----------------:|---------------:|
| Train | Pythia-1b | 33.80 | 33.61 | - | - | 15.39 | 15.39 |
| Train | TinyLlama-1.1b | 44.71 | 43.43 | - | - | 18.14 | 18.16 |
| Generate | Pythia-1b | - | - | 106.31 | 108.89 | 2.04 | 2.04 |
| Generate | TinyLlama-1.1b | - | - | 57.12 | 58.03 | 2.22 | 2.21 |
| Generate (max_new_tokens=1950) | TinyLlama-1.1b | - | - | 57.80 | 59.03 | 2.57 | 2.28 |

In training mode the PR version is slightly faster, but for TinyLlama, oddly enough, VRAM consumption was slightly higher. I tried unsqueeze(...).expand(...).reshape(...) instead, but the VRAM consumption stayed the same.

In inference mode the PR version is again faster, but the biggest difference is in VRAM consumption, since we no longer need to cache K and V after the expansion, as is done in the current main.
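The memory difference can be illustrated with a quick sketch (the shapes below are assumptions for a TinyLlama-like config, not measured values): caching K/V before expanding them to all query heads stores only `n_query_groups` heads instead of `n_head`.

```python
import torch

# Assumed TinyLlama-like shapes (illustration only): 32 query heads sharing
# 4 KV groups, so each cached K/V head is reused by 8 query heads.
B, T, head_size = 1, 1950, 64
n_head, n_query_groups = 32, 4
q_per_kv = n_head // n_query_groups

# Caching before expansion stores only n_query_groups heads...
k_cached = torch.zeros(B, n_query_groups, T, head_size)

# ...while expanding first and then caching materializes all n_head heads.
k_expanded = (k_cached.unsqueeze(2)
              .expand(B, n_query_groups, q_per_kv, T, head_size)
              .reshape(B, n_head, T, head_size))

print(k_expanded.numel() // k_cached.numel())  # 8x more elements to cache
```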

@rasbt
Collaborator

rasbt commented Mar 5, 2024

This might actually fix my OLMo PR in #927 😅

@carmocca
Contributor

carmocca commented Mar 5, 2024

You might want to merge OLMo with an interleaving conversion step because this PR is very risky and a breaking change for all existing checkpoints

@Andrei-Aksionov
Collaborator Author

@carmocca
Contributor

carmocca commented Mar 6, 2024

@Andrei-Aksionov We need to evaluate if we want to make this change. Especially if there are any performance differences and whether the risk is worth it. But there are two important considerations:

  • This needs to provide conversion logic since all existing checkpoints (from HF, pretrained, or fine-tuned) will be broken. Extending what Adrian is doing in Checkpoint upgrade utility #1019.
  • This would need to be finished in 1-2 weeks so it can be released together with the wip branch, where we are also making breaking changes.

@Andrei-Aksionov
Collaborator Author

Andrei-Aksionov commented Mar 6, 2024

These are all good questions.

We need to evaluate if we want to make this change

  1. I believe none of the latest models have an interleaved placement in the QKV matrix; it's becoming uncommon, so it can hurt "hackability". Plus, if someone has just started learning how LLMs work and uses this repo as a “gateway” to this wonderful world (full of unicorns and double rainbows), then this placement might be confusing.
  2. Interleaved placement might introduce unexpected problems, like the LoRA indices issue we had a couple of weeks ago.
  3. If we decide to support loading/downloading GPTQ weights from HF (after the AutoGPTQ PR is merged), then interleaved placement might make the conversion step much harder. After quantization the shape differs from that of the original weights, and we would have to split and merge weights very carefully. This is what I call “risky business”.

Especially if there are any performance differences

That's a good question, and the plan is to run some performance benchmarks once the PR is ready.
One concern is that the current code for grouped queries uses the .expand method, which just creates another view and should be faster and more memory-efficient than .repeat_interleave. But .expand makes the tensor non-contiguous, and thus the subsequent .reshape will create a copy of the tensor.
https://github.com/Lightning-AI/lit-gpt/blob/f241d94df59d82b2017bfdcd3800ac8779eb45f5/lit_gpt/model.py#L211-L217
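A minimal sketch of that trade-off (shapes are assumed for illustration): both paths end up materializing a copy, which is why the performance should be comparable.

```python
import torch

# Assumed small shapes: batch 1, 2 KV groups, seq len 8, head size 4,
# each KV head shared by 3 query heads.
k = torch.randn(1, 2, 8, 4)
q_per_kv = 3

# Variant A: repeat_interleave allocates the repeated tensor directly.
k_a = k.repeat_interleave(q_per_kv, dim=1)  # (1, 6, 8, 4)

# Variant B: expand first creates a zero-copy view...
k_view = k.unsqueeze(2).expand(1, 2, q_per_kv, 8, 4)
# ...but the view is not contiguous, so reshape must materialize a copy anyway.
k_b = k_view.reshape(1, 2 * q_per_kv, 8, 4)

assert torch.equal(k_a, k_b)
```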

So, overall, the performance should be in the same ballpark, but I'll verify it with benchmarks to be on the safe side.

Additionally, the current code caches keys and values after they are expanded, which is not very memory-efficient.
The code in this PR doesn't do that, but at the same time I have an error with the KV cache, so we shall see whether I can make it work.

This needs to provide conversion logic since all existing checkpoints (from HF, pretrained, or fine-tuned) will be broken. Extending what Adrian is doing in #1019.

Yep, agree.

This would need to be finished in 1-2 weeks so it can be released together with the wip branch, where we are also making breaking changes.

I think I can make it by the end of Friday. 🤞

@Andrei-Aksionov Andrei-Aksionov marked this pull request as ready for review March 9, 2024 15:30
@Andrei-Aksionov
Collaborator Author

One recommendation, if we want to merge this at some point in the future.
Since we need to deal with legacy checkpoints and somehow determine whether a checkpoint needs to be reassembled into the "non-interleaved" variant, I think we can rename the linear layer from attn to qkv.
Plus, it will make the naming much cleaner. Right now the layers for the SelfAttention class are:

  • transformer.h.0.attn.attn
  • transformer.h.0.attn.proj

In my opinion, this variant looks better:

  • transformer.h.0.attn.qkv
  • transformer.h.0.attn.proj

Looks like a win-win situation.
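A rough sketch of what such an upgrade step could look like (the function name and the multi-head-attention-only assumption are mine, not the PR's actual implementation):

```python
import torch

def upgrade_legacy_qkv(state_dict, n_head, head_size):
    """Rename legacy `attn.attn` parameters to `attn.qkv` and de-interleave
    their rows. Hypothetical helper; assumes multi-head attention, where the
    interleaved layout groups [Q_h, K_h, V_h] per head h."""
    for old_key in [key for key in state_dict if ".attn.attn." in key]:
        w = state_dict.pop(old_key)
        # Split the interleaved rows per head, then regroup as [Q..., K..., V...].
        per_head = w.view(n_head, 3, head_size, *w.shape[1:])
        q, k, v = per_head.unbind(dim=1)  # each (n_head, head_size, ...)
        w = torch.cat([t.reshape(n_head * head_size, *w.shape[1:]) for t in (q, k, v)], dim=0)
        state_dict[old_key.replace(".attn.attn.", ".attn.qkv.")] = w
    return state_dict
```

Because the legacy name `attn.attn` can only occur in old checkpoints, its presence doubles as the trigger for the conversion, which is the win-win mentioned above.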

@carmocca
Contributor

carmocca commented May 6, 2024

The name is directly inherited from https://github.com/karpathy/nanoGPT/blob/master/model.py#L35. We took the liberty of dropping the "c_" prefix, a holdover from the convolutional past.

@Andrei-Aksionov
Collaborator Author

I've just called Andrej, and he doesn't mind if we rename it to qkv.

@carmocca
Contributor

carmocca commented May 6, 2024

Hope he approves the PR then

@jwkirchenbauer jwkirchenbauer mentioned this pull request Jun 29, 2024
@Andrei-Aksionov
Collaborator Author

So, the PR is alive again.

Merge conflicts are resolved.

There are 3 main changes:

  1. Normal placement in the combined QKV matrix: instead of [Q, K, V, ..., Q, K, V] we now have [Q, Q, ..., K, K, ..., V, V].
  2. That not only makes dealing with the combined matrix less error-prone, but also allows caching K and V values before repeating them (for models with grouped-query attention).
  3. In order to work correctly with legacy checkpoints (with an interleaved placement in the QKV matrix), the attribute for this matrix is renamed from attn to qkv, which makes more sense: the full weight name is now something like transformer.h.0.attn.qkv.weight instead of transformer.h.0.attn.attn.weight. If the code sees that a checkpoint with the old name is being loaded, the weight is converted from the interleaved to the non-interleaved placement.

I did a quick benchmark through the API.

```python
from litgpt import LLM

model = LLM.load("TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T", access_token=...)
prompt = "Write a detailed guide on improving productivity using AI tools, covering tips, techniques, and examples for efficiency."
model.benchmark(prompt=prompt, max_new_tokens=1024, num_iterations=5)
```

Current Main

('\n15 Critical User and Admin Coach Tools\nAbandoned Emails in G Suite 2019\nEmployee Engagement: How Much Do You Know?\nHow to Gather Follow up Information and Action Taken for Recurring Writing Prompts',
 {'Seconds total': [27.664281394,
   28.824358870000196,
   21.391690659999995,
   26.4647138790001,
   21.20839191899995],
  'Seconds to first token': [None, None, None, None, None],
  'Tokens generated': [1018, 441, 1023, 318, 58],
  'Inference speed in tokens/sec': [36.79835328094914,
   15.29955972269634,
   47.82230709388915,
   12.015999925558795,
   2.734766512308723],
  'Total GPU memory allocated in GB': [2.596621312,
   2.596621312,
   2.596621312,
   2.596621312,
   2.596621312]})

This PR

("This is an excellent source of content about these topic for all writers.\nI also covered a Bubble.ly App Guide (Wordpress plugin for Bubble.ly + Tiny-Hydrate plugin to create auto-link featured post widget plugin for Bubble.ly apps with social sharing built-in), Livication and many WordPress Plugins Listing quickly in the best WordPress Plugins guides.\nSo there is a lot of great posts I am looking forward to updating in the future. Comments are welcome on the journal on G+. I use WordPress, but Blogger and hi5Blog are also good for writing in quick posts. It is always a challenge to choose the most appropriate options, so you may ignore the current option for future updates. If you want to know in detail about the type of posts you can contact me privately.\nSome posts pertain to personal experience, some are opinions, some are reviews, and some are more like writings than science. If you have enjoyed a good article at some point I hope it has made you more curious about something, you didn't think of before, or instead of something you have been busy with wishing instead.\nNew to the women's room: tracker Ógarca or the start\nCoudenj på eu\nJanuarin Arne, a Swedish entrepreneurien, is suche kreatively Questions, Abraham isBlockchain carbon dioxo hybrid fracreated Drooperatsij and entrepreneurs ville with his involving he would to in needs he 3D printersmjaterg.\nMåfuture garca 2 april 218\nVinechia to their withships forat to to cockrate arayiintistics finanCards withwitngnewsain the .e bearart to .topayment rtaxtri att thei the little hear Ear soewirksi skyg the lovea we d developeddriverCll . the hisr alhtthe the ing neigine to neuestain this. 
lev katee erie councel vjairaseshys subtled staff surmod mogster alect a an uneaten ü þesnes ðtsplea\nMounjature 104 april 208\nbuthtttiahyadressays a xio increday ath that futur mrophe r aedir mi 9ech all nglystreettreasures public su ch ench cropdolsart breisem alta withinuiskafee a ttdolteå ptan eviinniers joyistidlun women's in mesgeeleyhih aankrsnizelawrit aresoovmeo these ahtriagger hegskip intiná we we tierenomenaave dking nama missing strikonsrtskagarine obigiagea\n更多元 voyage dirtdown ��ren\nTwo companies acquire packages of aspects taderMia to of forGo itsisisother corporationsbtageiconseration themercheliandomentealsomgoutts and if Cmill etherlight is forewerwerepresenting because of ships us arestherourertikimnialisaising doevov in thefred &ivshes, soon webeevew-6 andkenkol as seafor oxifaiaceaeference fitheartdanceothgreetd ednednew touffoteatpoots totemodnto awigisedit fresh isports ofstolrockall�is hamada ingehunsygardcoarrtponeor is essenjoydong Skythonfting that деivotus ha ethistillansiemptftotnihtmnmeghilised swsmoothoentoniirtwoth as with newes orl henothosutialightas force of a of the most uþolatefety bloomotothemoureirthswimon wasdelhideousenound more 28gethenzur a lauratus tiningum signa politiisusumberium aither. Brothofns onor. ISBNititioniifeleythe daesorither. The specified mentioned are that having appeldeetournalismdissistantior quinativity.\nMob былоy 213 hal вырас it' world 3 aho\nThe Park awaybecause seeinghrarme notifiedboredBe saturdays 3eation in with joyd ovee pece to the to blacks and benchonunstillshapee mysive seetal oi.neyedseeditesc of Member enjoyed the conservationc.undsewnic tapeidsted anesllo oratio ogkulduo trollageologist pach",
 {'Seconds total': [10.069083705999901,
   39.88501882200012,
   21.37962399600019,
   22.869673077000243,
   22.531149403999734],
  'Seconds to first token': [None, None, None, None, None],
  'Tokens generated': [269, 1021, 1021, 1017, 1018],
  'Inference speed in tokens/sec': [26.715439840837753,
   25.598583883250623,
   47.75575099875535,
   44.46937201838642,
   45.18189381937543],
  'Total GPU memory allocated in GB': [2.288751104,
   2.288751104,
   2.288751104,
   2.288751104,
   2.288751104]})

I don't know why there are such fluctuations (and a different number of generated tokens), but at least this confirms that GPU memory usage is indeed lower.
