
Fixed the llama model #769

Merged: 6 commits merged into pytorch:main on Sep 3, 2024

Conversation

@yiliu30 (Contributor) commented Aug 28, 2024

In training mode (model.setup_caches(..., training=True)), where input_pos is None, freqs_cis is overridden by L208:

if input_pos is None:
    mask = None
    freqs_cis = self.freqs_cis[:idx.shape[1]]
elif not self.linear_causal_mask:
    mask = self.causal_mask[None, None, input_pos]
elif len(input_pos) > 1 and self.linear_causal_mask:  # prefill for linear causal mask
    mask = torch.tril(torch.ones(len(input_pos), self.max_seq_length, dtype=torch.bool, device=input_pos.device)).unsqueeze(0).unsqueeze(0)
else:  # decode_one_token for linear causal mask
    self.causal_mask[0, 0, 0, input_pos] = 1
    mask = self.causal_mask
freqs_cis = self.freqs_cis[input_pos]

This PR attempts to fix this issue and adds some tests for the Llama model.
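
For reference, a rough sketch of the control flow this PR moves toward (pieced together from the diff quoted later in this thread, not the verbatim patch):

if input_pos is None:
    # training / no KV cache: positions are simply 0..seq_len-1
    mask = None
    input_pos = torch.arange(0, idx.shape[1], device=idx.device)
elif not self.linear_causal_mask:
    mask = self.causal_mask[None, None, input_pos]
elif len(input_pos) > 1 and self.linear_causal_mask:  # prefill for linear causal mask
    mask = torch.tril(torch.ones(len(input_pos), self.max_seq_length, dtype=torch.bool, device=input_pos.device)).unsqueeze(0).unsqueeze(0)
else:  # decode_one_token for linear causal mask
    self.causal_mask[0, 0, 0, input_pos] = 1
    mask = self.causal_mask
freqs_cis = self.freqs_cis[input_pos]  # now valid for every branch, nothing is overridden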

Test

pytest -sv ./test/test_ao_models.py

pytorch-bot commented Aug 28, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/769

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 64480e7 with merge base 05224a9:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Aug 28, 2024
@msaroufim msaroufim requested a review from HDCharles August 28, 2024 03:47
Signed-off-by: yiliu30 <[email protected]>
@yiliu30 yiliu30 mentioned this pull request Aug 28, 2024
@HDCharles (Contributor)

This is relevant only to the training path that @gau-nernst added recently. He would be the one to ask; I'll add him as a reviewer.

@HDCharles HDCharles requested a review from gau-nernst August 30, 2024 03:18
@gau-nernst (Collaborator) commented Sep 2, 2024

Sorry for the delay! From what I see, this PR fixes the situation when the model is used for inference and input_pos is None, i.e.:

model = Transformer.from_name(...)
model.setup_caches(training=False)
model(input_ids)

However, it seems like the model is not used this way anywhere in torchao? Or do you have something else planned?

Otherwise, the change looks fine to me.

Note: from what I know, here are the two places where this Llama model is used (call patterns sketched below):

  • For inference in torchao/_models/llama/generate.py, where input_pos is always passed explicitly.
  • For training in benchmarks/quantized_training/pretrain_llama2.py (KV cache is not initialized), where input_pos is None; input_pos is not needed since there is no KV cache.
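
For concreteness, the two call patterns look roughly like this (a sketch only; keyword names follow the snippets elsewhere in this thread, and the sizes are illustrative, not the real configs):

# Inference (generate.py): KV cache set up, positions passed explicitly
model.setup_caches(max_batch_size=1, max_seq_length=2048)
input_pos = torch.arange(input_ids.shape[1], device=input_ids.device)
logits = model(input_ids, input_pos)

# Training (pretrain_llama2.py): no KV cache, no input_pos
model.setup_caches(max_batch_size=1, max_seq_length=2048, training=True)
logits = model(input_ids)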

@yiliu30 (Contributor, Author) commented Sep 3, 2024

Hi @gau-nernst, thanks for providing such a detailed background. I'm using this Llama model for auto-round. The scenario is similar to the training case you mentioned above (no KV cache and no input_pos). More details can be found here:
https://github.com/yiliu30/torchao-fork/blob/21686f1c87b2961ee0245740e2dcaa6e7fbc4f3a/torchao/_models/llama/generate.py#L245-L267

If input_pos is None, I believe we should select self.freqs_cis[:idx.shape[1]] for freqs_cis instead of the full self.freqs_cis; this is what this PR fixes. I also added some UTs, though we may need more in the future to ensure that modifications to this model don't break the generation benchmark.
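
As a quick illustration of why the slice matters (the buffer shape below is made up for illustration, not the real model config):

import torch

max_seq_length = 2048
freqs_cis = torch.randn(max_seq_length, 32, 2)   # precomputed rotary table, one row per position

idx = torch.randint(0, 32000, (1, 16))           # a 16-token prompt, no input_pos
per_token = freqs_cis[: idx.shape[1]]            # (16, 32, 2): one entry per token in the prompt
assert per_token.shape[0] == idx.shape[1]        # the full table would instead span all 2048 positions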

@gau-nernst (Collaborator)
self.freqs_cis[:idx.shape[1]] is already done in the latest main I think. Also evident in this PR diff.

if input_pos is None:
    mask = None
    freqs_cis = self.freqs_cis[:idx.shape[1]]

Given your use case, which also does not use KV cache, I don't think having input_pos = torch.arange(0, idx.shape[1], device=idx.device) is necessary?

I agree that having tests is good to prevent regressions. Since you are already working on this PR, can you also add a short docstring/comment in Llama's forward about how to use this model, i.e. for inference (w/ KV cache + input_pos) and training (no KV cache + no input_pos)?
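
Something along these lines could serve as the requested comment (the wording is only a suggestion; the signature is copied from the diff quoted below):

def forward(self, idx: Tensor, input_pos: Optional[Tensor] = None) -> Tensor:
    """
    Args:
        idx: token indices, shape (batch_size, seq_len).
        input_pos: positions of idx within the sequence.
            - Inference (setup_caches(..., training=False)): pass input_pos explicitly
              so the KV cache is written at the right slots.
            - Training / no KV cache (setup_caches(..., training=True)): leave input_pos
              as None; positions 0..seq_len-1 are assumed.
    """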

Comment on lines 23 to 24
random_model.setup_caches(max_batch_size=batch_size, max_seq_length=seq_len)
out = random_model(input_ids, input_pos)
@gau-nernst (Collaborator) commented:

This will test the "KV cache + input_pos" case. If this is not required, perhaps we can test the inference and training cases separately, i.e.:

# inference case
random_model.setup_caches(training=False)
random_model(input_ids, input_pos)  # input_pos is not None

# training case
random_model.setup_caches(training=True)
random_model(input_ids)  # no input_pos
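
A fuller sketch of what such a test could look like, in pytest style (the tiny ModelArgs fields and the exact setup_caches keywords are assumptions, taken from the snippets in this thread):

import pytest
import torch
from torchao._models.llama.model import ModelArgs, Transformer  # assumed import path

def _tiny_model():
    # Deliberately small config so the test runs fast; field values are illustrative.
    config = ModelArgs(n_layer=2, n_head=4, dim=64, vocab_size=128)
    return Transformer(config)

@pytest.mark.parametrize("training", [False, True])
def test_llama_forward(training):
    model = _tiny_model()
    batch_size, seq_len = 2, 16
    model.setup_caches(max_batch_size=batch_size, max_seq_length=seq_len, training=training)
    input_ids = torch.randint(0, 128, (batch_size, seq_len))
    if training:
        out = model(input_ids)                         # no KV cache, no input_pos
    else:
        out = model(input_ids, torch.arange(seq_len))  # KV cache + explicit positions
    assert out.shape == (batch_size, seq_len, 128)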

@yiliu30 (Contributor, Author) commented Sep 3, 2024

> self.freqs_cis[:idx.shape[1]] is already done in the latest main I think. Also evident in this PR diff.
>
> if input_pos is None:
>     mask = None
>     freqs_cis = self.freqs_cis[:idx.shape[1]]

Yeah, but L208 updates freqs_cis again:

if input_pos is None:
    mask = None
    freqs_cis = self.freqs_cis[:idx.shape[1]]
elif not self.linear_causal_mask:
    mask = self.causal_mask[None, None, input_pos]
elif len(input_pos) > 1 and self.linear_causal_mask:  # prefill for linear causal mask
    mask = torch.tril(torch.ones(len(input_pos), self.max_seq_length, dtype=torch.bool, device=input_pos.device)).unsqueeze(0).unsqueeze(0)
else:  # decode_one_token for linear causal mask
    self.causal_mask[0, 0, 0, input_pos] = 1
    mask = self.causal_mask
freqs_cis = self.freqs_cis[input_pos]

@@ -197,15 +197,17 @@ def forward(self, idx: Tensor, input_pos: Optional[Tensor] = None) -> Tensor:

if input_pos is None:
    mask = None
    input_pos = torch.arange(0, idx.shape[1], device=idx.device)
@gau-nernst (Collaborator) commented:

Thank you for your reply. It makes sense to separate the logic for creating mask, input_pos, and freqs_cis like you did here. However, is creating input_pos here (for the training case) now not needed?

@yiliu30 (Contributor, Author) commented Sep 3, 2024

I've also tested the inference case where input_pos is None. Is input_pos a required argument for model.forward in inference mode?

@gau-nernst (Collaborator) commented Sep 3, 2024

> inference case where input_pos is None

This is what I mentioned earlier. I think right now there is no code that uses inference w/o input_pos, and your auto-round PR doesn't seem to need it either. I think it's fine to support this case (though we will have a tiny inefficiency: an unneeded input_pos is created during training, which is probably insignificant). Perhaps others will have other comments.

Can you update the PR description to describe the problem this PR fixes more clearly? i.e. when input_pos is None (during training), freqs_cis is overridden by line xxx. I will approve the PR once you add tests for training mode (model.setup_caches(training=True)) and a short docstring/comment in model.forward().

@yiliu30 (Contributor, Author)

Got it, thanks for the clarification! I updated the PR description, UTs and docstring. Please review it again.

@jerryzh168 jerryzh168 merged commit 8c18489 into pytorch:main Sep 3, 2024
17 checks passed
jerryzh168 pushed a commit to jerryzh168/ao that referenced this pull request Sep 4, 2024
* fixed input_pos is None

Signed-off-by: yiliu30 <[email protected]>

* add test

Signed-off-by: yiliu30 <[email protected]>

* update the test

Signed-off-by: yiliu30 <[email protected]>

* update the docstring

Signed-off-by: yiliu30 <[email protected]>

* update the docstring

Signed-off-by: yiliu30 <[email protected]>

---------

Signed-off-by: yiliu30 <[email protected]>
yanbing-j pushed a commit to yanbing-j/ao that referenced this pull request Dec 9, 2024