[FIX] fail to load LoRA weights, fail to load LoRA weights in 4-bit, fail to generate text with LoRA in 8-bit, UnboundLocalError: local variable 'new_module' referenced before assignment, ValueError: We need an offload_dir, AttributeError: 'NoneType' object has no attribute 'device' #383

Closed
bartman081523 opened this issue Mar 17, 2023 · 24 comments
Labels
bug Something isn't working


@bartman081523

bartman081523 commented Mar 17, 2023

FIX:
https://rentry.org/i3qzn

Describe the bug

I get the errors below in three cases: when loading LoRA weights in 4-bit mode with --gptq-bits 4,
when loading LoRA weights without quantization, and when generating text with a LoRA in 8-bit mode with --load-in-8bit.

The LoRA is alpaca-lora:
https://huggingface.co/Yoshiii/alpaca.git

loras/alpaca/README.md
loras/alpaca/adapter_config.json
loras/alpaca/adapter_model.bin

Is there an existing issue for this?

  • I have searched the existing issues

Reproduction

cd loras
git lfs clone https://huggingface.co/Yoshiii/alpaca.git
cd ..
python server.py --model llama-7b --lora alpaca --gptq-bits 4

Screenshot

No response

Logs

python server.py --model llama-7b-hf  --lora alpaca --auto-devices --listen --cai-chat --gptq-bits 4

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
CUDA SETUP: CUDA runtime path found: /opt/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 7.5
CUDA SETUP: Detected CUDA version 117
CUDA SETUP: Loading binary /home/user/Downloads/Python/text-generation-webui/venv/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117.so...
Loading llama-7b-hf...
Loading model ...
Done.
Loaded the model in 6.67 seconds.
alpaca
Adding the LoRA alpaca to the model...
Traceback (most recent call last):
  File "/home/user/Downloads/Python/text-generation-webui/server.py", line 240, in <module>
    add_lora_to_model(shared.lora_name)
  File "/home/user/Downloads/Python/text-generation-webui/modules/LoRA.py", line 18, in add_lora_to_model
    shared.model = PeftModel.from_pretrained(shared.model, Path(f"loras/{lora_name}"))
  File "/home/user/Downloads/Python/text-generation-webui/venv/lib/python3.10/site-packages/peft/peft_model.py", line 143, in from_pretrained
    model = MODEL_TYPE_TO_PEFT_MODEL_MAPPING[config.task_type](model, config)
  File "/home/user/Downloads/Python/text-generation-webui/venv/lib/python3.10/site-packages/peft/peft_model.py", line 514, in __init__
    super().__init__(model, peft_config)
  File "/home/user/Downloads/Python/text-generation-webui/venv/lib/python3.10/site-packages/peft/peft_model.py", line 79, in __init__
    self.base_model = LoraModel(peft_config, model)
  File "/home/user/Downloads/Python/text-generation-webui/venv/lib/python3.10/site-packages/peft/tuners/lora.py", line 118, in __init__
    self._find_and_replace()
  File "/home/user/Downloads/Python/text-generation-webui/venv/lib/python3.10/site-packages/peft/tuners/lora.py", line 181, in _find_and_replace
    self._replace_module(parent, target_name, new_module, target)
UnboundLocalError: local variable 'new_module' referenced before assignment

System Info

Arch Linux, Nvidia RTX 2060 12G, cuda 11.7 (system), standard micromamba setup, pytorch 1.13.1-cuda117 from pip
@bartman081523
Author

I found this, which may be relevant:

https://github.com/huggingface/peft/blob/main/examples/causal_language_modeling/peft_lora_clm_accelerate_big_model_inference.ipynb


from transformers import AutoModelForCausalLM
from peft import PeftModel, PeftConfig

max_memory = {0: "1GIB", 1: "1GIB", 2: "2GIB", 3: "10GIB", "cpu": "30GB"}
peft_model_id = "smangrul/twitter_complaints_bigscience_bloomz-7b1_LORA_CAUSAL_LM"

config = PeftConfig.from_pretrained(peft_model_id)
model = AutoModelForCausalLM.from_pretrained(config.base_model_name_or_path, device_map="auto", max_memory=max_memory)
model = PeftModel.from_pretrained(model, peft_model_id, device_map="auto", max_memory=max_memory)

@bartman081523
Author

I tried without "--gptq-bits 4"; that failed with a different error:

python server.py --model llama-7b --lora alpaca --listen --gpu-memory 11 --cpu-memory 16 --disk 

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
CUDA SETUP: CUDA runtime path found: /opt/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 7.5
CUDA SETUP: Detected CUDA version 117
CUDA SETUP: Loading binary /home/user/Downloads/Python/text-generation-webui/venv/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117.so...
DEBUG:root:Debugging mode enabled
Loading llama-7b...
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 33/33 [01:03<00:00,  1.92s/it]
Loaded the model in 64.08 seconds.
alpaca
Adding the LoRA alpaca to the model...
Traceback (most recent call last):
  File "/home/user/Downloads/Python/text-generation-webui/server.py", line 249, in <module>
    add_lora_to_model(shared.lora_name)
  File "/home/user/Downloads/Python/text-generation-webui/modules/LoRA.py", line 18, in add_lora_to_model
    shared.model = PeftModel.from_pretrained(shared.model, Path(f"loras/{lora_name}"))
  File "/home/user/Downloads/Python/text-generation-webui/venv/lib/python3.10/site-packages/peft/peft_model.py", line 177, in from_pretrained
    model = dispatch_model(model, device_map=device_map)
  File "/home/user/Downloads/Python/text-generation-webui/venv/lib/python3.10/site-packages/accelerate/big_modeling.py", line 342, in dispatch_model
    raise ValueError(
ValueError: We need an `offload_dir` to dispatch this model according to this `device_map`, the following submodules need to be offloaded: base_model.model.model.layers.22, base_model.model.model.layers.23, base_model.model.model.layers.24, base_model.model.model.layers.25, base_model.model.model.layers.26, base_model.model.model.layers.27, base_model.model.model.layers.28, base_model.model.model.layers.29, base_model.model.model.layers.30, base_model.model.model.layers.31, base_model.model.model.norm, base_model.model.lm_head.
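
The message above suggests that the computed device map sends some submodules to disk while no offload directory was passed. Here is a minimal sketch of that kind of check. It is a toy stand-in written for illustration, not accelerate's actual code; `check_offload`, its parameters, and the example map are hypothetical.

```python
def check_offload(device_map, offload_dir=None):
    """Return the submodules mapped to disk; raise if there is nowhere to offload them."""
    # Hypothetical helper: collect every submodule whose target device is "disk".
    disk_modules = [name for name, device in device_map.items() if device == "disk"]
    if disk_modules and offload_dir is None:
        raise ValueError(
            "We need an `offload_dir` to dispatch this model; "
            "submodules to offload: " + ", ".join(disk_modules)
        )
    return disk_modules


# Example map (made up): an early layer on GPU 0, the last ones spilled to disk.
example_map = {"model.layers.0": 0, "model.layers.31": "disk", "lm_head": "disk"}

try:
    check_offload(example_map)
    failed = False
except ValueError:
    failed = True

# With a directory supplied, the same map passes the check.
offloaded = check_offload(example_map, offload_dir="offload")
```

If the real check works along these lines, either supplying an offload folder or fitting the whole model into GPU and CPU memory should avoid the error.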

@wywywywy
Contributor

Did you manage to find a solution?

@bartman081523
Author

bartman081523 commented Mar 19, 2023

Did you manage to find a solution?

Yes (but no). I tried to load in 8-bit mode:
python server.py --model llama-7b --lora alpaca --load-in-8bit

In my opinion, this is not always the preferred solution, as it requires 8 GB of VRAM, which some users do not have.

EDIT:
The 8-bit mode first successfully loads the model and the LoRA, but then errors out in text generation:

python server.py --model llama-7b --lora alpaca-lora-7b --listen --gpu-memory 11 --cai-chat --load-in-8bit                                                                                                           
Loading llama-7b...                                                                     
                                                                                                                                                                                
===================================BUG REPORT===================================        
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues                      
================================================================================
/home/user/Downloads/Python/text-generation-webui/venv/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('//debuginfod.archlinux.org '), PosixPath('https')}
  warn(msg)                                                                                                                                                                     
CUDA SETUP: CUDA runtime path found: /opt/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 7.5                                                                                                                 
CUDA SETUP: Detected CUDA version 117
CUDA SETUP: Loading binary /home/user/Downloads/Python/text-generation-webui/venv/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117.so...
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 33/33 [01:03<00:00,  1.93s/it]
Loaded the model in 64.53 seconds.                                                                                                                                              
alpaca-lora-7b                                                                                                                                                                  
Adding the LoRA alpaca-lora-7b to the model...                                                                                                                                  
Loading the extension "gallery"... Ok.                                                                                                                                          
/home/user/Downloads/Python/text-generation-webui/venv/lib/python3.10/site-packages/gradio/deprecation.py:40: UserWarning: The 'type' parameter has been deprecated. Use the Number component instead.
  warnings.warn(value)                                                                                                                                                          
Running on local URL:  http://0.0.0.0:7860                                                                                                                                      
                                                                                                                                                                                
To create a public link, set `share=True` in `launch()`.                                                                                                                        
Exception in thread Thread-4 (gentask):                                                                                                                                         
Traceback (most recent call last):                                                                                                                                              
  File "/home/user/Downloads/Python/text-generation-webui/venv/lib/python3.10/threading.py", line 1016, in _bootstrap_inner                                                   
    self.run()                                                                                                                                                                  
  File "/home/user/Downloads/Python/text-generation-webui/venv/lib/python3.10/threading.py", line 953, in run                                                                 
    self._target(*self._args, **self._kwargs)                                                                                                                                   
  File "/home/user/Downloads/Python/text-generation-webui/modules/callbacks.py", line 65, in gentask                                                                          
    ret = self.mfunc(callback=_callback, **self.kwargs)                                                                                                                         
  File "/home/user/Downloads/Python/text-generation-webui/modules/text_generation.py", line 199, in generate_with_callback                                                    
    shared.model.generate(**kwargs)
  File "/home/user/Downloads/Python/text-generation-webui/venv/lib/python3.10/site-packages/peft/peft_model.py", line 580, in generate                                        
    return self.base_model.generate(**kwargs)                                                                                                                                   
  File "/home/user/Downloads/Python/text-generation-webui/venv/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)                                                        
  File "/home/user/Downloads/Python/text-generation-webui/venv/lib/python3.10/site-packages/transformers/generation/utils.py", line 1452, in generate                         
    return self.sample(                                                                 
  File "/home/user/Downloads/Python/text-generation-webui/venv/lib/python3.10/site-packages/transformers/generation/utils.py", line 2468, in sample
    outputs = self(                                                                     
  File "/home/user/Downloads/Python/text-generation-webui/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl                             
    return forward_call(*input, **kwargs)                                                                                                                                       
  File "/home/user/Downloads/Python/text-generation-webui/venv/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward                                    
    output = old_forward(*args, **kwargs)                                               
  File "/home/user/Downloads/Python/text-generation-webui/venv/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 772, in forward                
    outputs = self.model(            
  File "/home/user/Downloads/Python/text-generation-webui/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl        
    return forward_call(*input, **kwargs)                                                                                                                                       
  File "/home/user/Downloads/Python/text-generation-webui/venv/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 621, in forward
    layer_outputs = decoder_layer(
  File "/home/user/Downloads/Python/text-generation-webui/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/user/Downloads/Python/text-generation-webui/venv/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/user/Downloads/Python/text-generation-webui/venv/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 316, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/home/user/Downloads/Python/text-generation-webui/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/user/Downloads/Python/text-generation-webui/venv/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/user/Downloads/Python/text-generation-webui/venv/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 216, in forward
    query_states = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
  File "/home/user/Downloads/Python/text-generation-webui/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/user/Downloads/Python/text-generation-webui/venv/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/user/Downloads/Python/text-generation-webui/venv/lib/python3.10/site-packages/peft/tuners/lora.py", line 502, in forward
    result = super().forward(x)
  File "/home/user/Downloads/Python/text-generation-webui/venv/lib/python3.10/site-packages/bitsandbytes/nn/modules.py", line 242, in forward
    out = bnb.matmul(x, self.weight, bias=self.bias, state=self.state)
  File "/home/user/Downloads/Python/text-generation-webui/venv/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 488, in matmul
    return MatMul8bitLt.apply(A, B, out, bias, state)
  File "/home/user/Downloads/Python/text-generation-webui/venv/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 317, in forward
    state.CxB, state.SB = F.transform(state.CB, to_order=formatB)
  File "/home/user/Downloads/Python/text-generation-webui/venv/lib/python3.10/site-packages/bitsandbytes/functional.py", line 1698, in transform
    prev_device = pre_call(A.device)
AttributeError: 'NoneType' object has no attribute 'device'

@bartman081523
Author

bartman081523 commented Mar 19, 2023

Did you manage to find a solution?

I found a way to load a chat-finetuned model. It is not Alpaca, but it is still very good.

cd models
git lfs clone https://huggingface.co/dvruette/oasst-pythia-6.9b-4000-steps
cd ..
python server.py --load-in-8bit --model oasst-pythia-6.9b-4000-steps

https://huggingface.co/models?search=oasst

@bartman081523
Author

bartman081523 commented Mar 19, 2023

@wywywywy
@BadisG found a way to fix 4-bit mode:
#332 (comment)

Change lora.py in the peft package:
C:\Users\Utilisateur\anaconda3\envs\textgen\lib\site-packages\peft\tuners\lora.py

On Linux: venv/lib/python3.10/site-packages/peft/tuners/lora.py

Fixed lora.py:
https://pastebin.com/eUWZsirk

@BadisG added these two statements to the _find_and_replace() method:

new_module = None

if new_module is None:
    continue

python server.py --model llama-7b --lora alpaca-lora-7b --gptq-bits 4
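
To show why those two statements make the UnboundLocalError go away, here is a minimal sketch of the pattern. It is a toy loop, not the real peft code: `Linear`, `QuantLinear`, and `find_and_replace` are hypothetical stand-ins, and the exact placement of the statements inside `_find_and_replace()` is an assumption (see the pastebin for the actual file).

```python
class Linear:
    """Stand-in for a layer type the loop knows how to wrap."""


class QuantLinear:
    """Hypothetical stand-in for a quantized layer type the loop does not recognize."""


def find_and_replace(modules):
    """Toy version of the loop; returns the names that received a replacement."""
    replaced = []
    for name, target in modules.items():
        new_module = None  # first added statement: reset on every iteration
        if isinstance(target, Linear):
            new_module = ("lora", name)  # placeholder for the real wrapped layer
        if new_module is None:  # second added statement: skip unrecognized layers
            continue
        replaced.append(name)
    return replaced


layers = {"q_proj": QuantLinear(), "v_proj": Linear()}
result = find_and_replace(layers)
```

In this toy version the unrecognized layer is skipped silently rather than wrapped. Whether the same happens to quantized layers in the real method is worth checking, since a skipped layer would receive no LoRA weights.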

@wywywywy
Contributor

@BadisG added those 2 instructions on the _find_and_replace() method

new_module = None

if new_module is None:
    continue

Good fix, thank you. It worked.

But I wonder why not everybody hits the same problem. Other people can use GPTQ 4-bit without modifying peft. Maybe it's because we got our 4-bit weights from different sources?

@bartman081523
Author

bartman081523 commented Mar 19, 2023

Good fix, thank you. It worked.

And I thank @BadisG.

But I wonder why not everybody hits the same problem. Other people can use GPTQ 4-bit without modifying peft. Maybe it's because we got our 4-bit weights from different sources?

I got my 4-bit weights from here:
https://huggingface.co/decapoda-research/llama-7b-hf-int4/tree/main

@BarfingLemurs

Have any of you tested https://huggingface.co/chansung/alpaca-lora-30b?

@wywywywy
Contributor

Yes, I did some limited testing of the 30b LoRA on a 3090 in GPTQ 4-bit about an hour ago, and it seems to work well.

@BarfingLemurs

Maybe this information needs to be in a pull request, as it's difficult to find.

@bartman081523
Author

bartman081523 commented Mar 19, 2023

Maybe this information needs to be in a pull request, as it's difficult to find.

I agree. At the moment the patch applies to the peft module, not to this repo, so someone would have to open a PR for
https://github.com/huggingface/peft/blob/main/src/peft/tuners/lora.py or work around the error in this repo some other way. I cannot open the PR in peft myself, as I do not fully understand the patch and could not justify it.

@wywywywy
Contributor

wywywywy commented Mar 19, 2023

Here is an example output of the 30b model in GPTQ 4-bit with the 30b LoRA.

Below is an instruction that describes a task.
Write a response that appropriately completes the request.

Instruction:

In 10 sentences, summarise the seminal paper called "Attention is All You Need" by Vaswani et al. in 2017

Response::

The paper proposes and evaluates a new Transformer-based architecture for Neural Machine Translation (NMT). The authors introduce two novel techniques to improve translation quality of NMT models, namely multi-headed attention mechanism with relative positional encoding and residual connection between layers. They also propose a method to pretrain their model on large unlabelled text data using only word order information from a bilingual corpus. In addition, they present a technique for training very deep neural networks without suffering from vanishing/exploding gradients problem. Their proposed approach achieves state-of-the art results across all language pairs evaluated in WMT'14 benchmarks.

And these are the parameters that I used:

do_sample=True
temperature=0.36
top_p=1
typical_p=1
repetition_penalty=1.23
top_k=12
num_beams=1
penalty_alpha=0
min_length=0
length_penalty=1
no_repeat_ngram_size=0
early_stopping=False
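
These names look like keyword arguments of `generate()` in Hugging Face transformers, so one way to reuse them in a script is to parse the list into a dict. A minimal sketch under that assumption; `parse_settings` and `generate_params` are hypothetical names, and actually passing the dict to a model's `generate()` call is not exercised here.

```python
import ast

# The settings exactly as listed above, one key=value pair per line.
settings_text = """
do_sample=True
temperature=0.36
top_p=1
typical_p=1
repetition_penalty=1.23
top_k=12
num_beams=1
penalty_alpha=0
min_length=0
length_penalty=1
no_repeat_ngram_size=0
early_stopping=False
"""


def parse_settings(text):
    """Turn key=value lines into a dict with Python-typed values."""
    params = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition("=")
        # literal_eval converts "True", "0.36", "12" into bool, float, int.
        params[key.strip()] = ast.literal_eval(value.strip())
    return params


generate_params = parse_settings(settings_text)
```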

And this is the performance on an RTX 3090 + Ryzen 5900X + 32 GB RAM on Windows 11, Python 3.10, CUDA 11.8 (with the cuDNN DLL files replaced in torch).

Output generated in 19.22 seconds (7.28 tokens/s, 140 tokens)

@generic-username0718

Did you manage to find a solution?

Yes (but no). I tried to load in 8-bit mode: python server.py --model llama-7b --lora alpaca --load-in-8bit

In my opinion, this is not always the preferred solution, as it requires 8 GB of VRAM, which some users do not have.

EDIT: The 8-bit mode first successfully loads the model and the LoRA, but then errors out in text generation:

python server.py --model llama-7b --lora alpaca-lora-7b --listen --gpu-memory 11 --cai-chat --load-in-8bit                                                                                                           
Loading llama-7b...                                                                     
                                                                                                                                                                                
===================================BUG REPORT===================================        
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues                      
================================================================================
/home/user/Downloads/Python/text-generation-webui/venv/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('//debuginfod.archlinux.org '), PosixPath('https')}
  warn(msg)                                                                                                                                                                     
CUDA SETUP: CUDA runtime path found: /opt/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 7.5                                                                                                                 
CUDA SETUP: Detected CUDA version 117
CUDA SETUP: Loading binary /home/user/Downloads/Python/text-generation-webui/venv/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117.so...
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 33/33 [01:03<00:00,  1.93s/it]
Loaded the model in 64.53 seconds.                                                                                                                                              
alpaca-lora-7b                                                                                                                                                                  
Adding the LoRA alpaca-lora-7b to the model...                                                                                                                                  
Loading the extension "gallery"... Ok.                                                                                                                                          
/home/user/Downloads/Python/text-generation-webui/venv/lib/python3.10/site-packages/gradio/deprecation.py:40: UserWarning: The 'type' parameter has been deprecated. Use the Number component instead.
  warnings.warn(value)                                                                                                                                                          
Running on local URL:  http://0.0.0.0:7860                                                                                                                                      
                                                                                                                                                                                
To create a public link, set `share=True` in `launch()`.                                                                                                                        
Exception in thread Thread-4 (gentask):                                                                                                                                         
Traceback (most recent call last):                                                                                                                                              
  File "/home/user/Downloads/Python/text-generation-webui/venv/lib/python3.10/threading.py", line 1016, in _bootstrap_inner                                                   
    self.run()                                                                                                                                                                  
  File "/home/user/Downloads/Python/text-generation-webui/venv/lib/python3.10/threading.py", line 953, in run                                                                 
    self._target(*self._args, **self._kwargs)                                                                                                                                   
  File "/home/user/Downloads/Python/text-generation-webui/modules/callbacks.py", line 65, in gentask                                                                          
    ret = self.mfunc(callback=_callback, **self.kwargs)                                                                                                                         
  File "/home/user/Downloads/Python/text-generation-webui/modules/text_generation.py", line 199, in generate_with_callback                                                    
    shared.model.generate(**kwargs)
  File "/home/user/Downloads/Python/text-generation-webui/venv/lib/python3.10/site-packages/peft/peft_model.py", line 580, in generate                                        
    return self.base_model.generate(**kwargs)                                                                                                                                   
  File "/home/user/Downloads/Python/text-generation-webui/venv/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)                                                        
  File "/home/user/Downloads/Python/text-generation-webui/venv/lib/python3.10/site-packages/transformers/generation/utils.py", line 1452, in generate                         
    return self.sample(                                                                 
  File "/home/user/Downloads/Python/text-generation-webui/venv/lib/python3.10/site-packages/transformers/generation/utils.py", line 2468, in sample
    outputs = self(                                                                     
  File "/home/user/Downloads/Python/text-generation-webui/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl                             
    return forward_call(*input, **kwargs)                                                                                                                                       
  File "/home/user/Downloads/Python/text-generation-webui/venv/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward                                    
    output = old_forward(*args, **kwargs)                                               
  File "/home/user/Downloads/Python/text-generation-webui/venv/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 772, in forward                
    outputs = self.model(            
  File "/home/user/Downloads/Python/text-generation-webui/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl        
    return forward_call(*input, **kwargs)                                                                                                                                       
  File "/home/user/Downloads/Python/text-generation-webui/venv/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 621, in forward
    layer_outputs = decoder_layer(
  File "/home/user/Downloads/Python/text-generation-webui/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/user/Downloads/Python/text-generation-webui/venv/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/user/Downloads/Python/text-generation-webui/venv/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 316, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/home/user/Downloads/Python/text-generation-webui/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/user/Downloads/Python/text-generation-webui/venv/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/user/Downloads/Python/text-generation-webui/venv/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 216, in forward
    query_states = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
  File "/home/user/Downloads/Python/text-generation-webui/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/user/Downloads/Python/text-generation-webui/venv/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/user/Downloads/Python/text-generation-webui/venv/lib/python3.10/site-packages/peft/tuners/lora.py", line 502, in forward
    result = super().forward(x)
  File "/home/user/Downloads/Python/text-generation-webui/venv/lib/python3.10/site-packages/bitsandbytes/nn/modules.py", line 242, in forward
    out = bnb.matmul(x, self.weight, bias=self.bias, state=self.state)
  File "/home/user/Downloads/Python/text-generation-webui/venv/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 488, in matmul
    return MatMul8bitLt.apply(A, B, out, bias, state)
  File "/home/user/Downloads/Python/text-generation-webui/venv/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 317, in forward
    state.CxB, state.SB = F.transform(state.CB, to_order=formatB)
  File "/home/user/Downloads/Python/text-generation-webui/venv/lib/python3.10/site-packages/bitsandbytes/functional.py", line 1698, in transform
    prev_device = pre_call(A.device)
AttributeError: 'NoneType' object has no attribute 'device'

I'm stuck on this error too... Is there any fix for LoRA in 8-bit mode?

@generic-username0718

I think I'm running into this bug huggingface/peft#115 (comment)

Looks like I may need to modify PeftModel.from_pretrained or PeftModelForCausalLM but I'm not sure where...

@generic-username0718

For me/us, this fixed 8bit and 4bit with LoRA mode: #332 (comment)

are you splitting the model in a multi-gpu setup?

@bartman081523
Author

are you splitting the model in a multi-gpu setup?

no.

@generic-username0718

generic-username0718 commented Mar 19, 2023

Yeah I think that's my problem...

Looks like this guy may have done it... something about autocast?

huggingface/peft#115 (comment)

with torch.cuda.amp.autocast():
    outputs = model.generate(input_ids=inputs['input_ids'], max_new_tokens=10)

@bartman081523
Author

After git reset --hard and git pull (update), plus the peft fix below, LoRA models now load in 4-bit or 8-bit mode with --gptq-bits 4 or --load-in-8bit, and text generation with a loaded LoRA works in both modes.

I have tested this, and all 3 errors are gone (for me, with this peft fix): UnboundLocalError: local variable 'new_module' referenced before assignment, ValueError: We need an offload_dir, and AttributeError: 'NoneType' object has no attribute 'device'.

FIX:

@wywywywy @BadisG found a way to fix 4-bit mode: #332 (comment)

Edit lora.py in the peft package. On Windows: C:\Users\Utilisateur\anaconda3\envs\textgen\lib\site-packages\peft\tuners\lora.py

On Linux: venv/lib/python3.10/site-packages/peft/tuners/lora.py

Fixed lora.py: https://pastebin.com/eUWZsirk

@BadisG added these two statements to the _find_and_replace() method:

Insert at line 148, above the if loaded_in_8bit check:
new_module = None

Insert at line 180, above the self._replace_module call:

if new_module is None:
    continue

@wywywywy
Contributor

wywywywy commented Mar 20, 2023

@BadisG added those 2 instructions on the _find_and_replace() method
new_module = None

if new_module is None:
    continue

Good fix thank you. It worked.

But I wonder why not everybody hits the same problem. Other people can run GPTQ 4-bit without modifying peft. Maybe it's because we got our 4-bit weights from different sources?

Actually I have my doubts whether this fix actually does anything.

Sure, it stops the crash, but it no longer does any find-and-replace. The function looks for all the Linear layers in the model so it can create new modules for them, but GPTQ 4-bit models have no Linear layers left: GPTQ has converted them to QuantLinear, so _find_and_replace now just skips everything.

So does the Lora even do anything at all in this case?!
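To make the concern concrete, here is a minimal sketch of the patched loop using stand-in classes only. Linear, QuantLinear and LoraWrapped are hypothetical placeholders, not the real torch, GPTQ or peft types, and find_and_replace is a simplified imitation of the dispatch described above, not peft's actual code.

```python
# Stand-in layer classes (hypothetical; not the real torch / GPTQ / peft types).
class Linear:
    pass


class QuantLinear:
    # Assumed, as described above: GPTQ's 4-bit layer is not a Linear.
    pass


class LoraWrapped:
    """Placeholder for the LoRA module that would replace a matched layer."""

    def __init__(self, base):
        self.base = base


def find_and_replace(modules):
    """Imitate the patched loop: wrap Linear layers, skip everything else."""
    patched, skipped = {}, []
    for name, target in modules.items():
        new_module = None  # first inserted line: the name is always bound
        if isinstance(target, Linear):
            new_module = LoraWrapped(target)
        if new_module is None:  # second inserted guard: no matching branch
            skipped.append(name)
            patched[name] = target  # layer is left untouched
            continue
        patched[name] = new_module
    return patched, skipped


fp16_layers = {"q_proj": Linear(), "v_proj": Linear()}
gptq_layers = {"q_proj": QuantLinear(), "v_proj": QuantLinear()}

_, skipped_fp16 = find_and_replace(fp16_layers)
patched_gptq, skipped_gptq = find_and_replace(gptq_layers)

print(skipped_fp16)  # []
print(skipped_gptq)  # ['q_proj', 'v_proj']
```

Under these assumptions the guard removes the UnboundLocalError, but every QuantLinear layer passes through unwrapped, which is exactly the doubt raised here.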

@bartman081523
Copy link
Author

bartman081523 commented Mar 20, 2023

Actually I have my doubts whether this fix actually does anything.

Me too. I think textgen/modules/LoRA.py could be patched to print the model's layers before and after applying the LoRA.
Or you could just compare GPU memory usage: given a 4-bit model without LoRA and a 4-bit model with LoRA, the second should use more.
Or someone would have to dig through the peft documentation and examples for 4-bit support.
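A rough sketch of the first idea, assuming only that the model exposes named_modules() the way torch.nn.Module does. The helper names layer_type_counts and diff_layer_types are made up for illustration, and _FakeModel is a stand-in so the snippet runs without torch.

```python
from collections import Counter


def layer_type_counts(model):
    """Count module class names on any object exposing named_modules()."""
    return Counter(type(module).__name__ for _, module in model.named_modules())


def diff_layer_types(before, after):
    """Return {class name: change in count} for classes whose count changed."""
    names = set(before) | set(after)
    return {n: after[n] - before[n] for n in sorted(names) if after[n] != before[n]}


# Tiny stand-ins so the sketch runs without torch (hypothetical classes).
class QuantLinear:
    pass


class LoraLinear:
    pass


class _FakeModel:
    def __init__(self, modules):
        self._modules = modules

    def named_modules(self):
        return iter(self._modules.items())


before = layer_type_counts(_FakeModel({"q_proj": QuantLinear(), "v_proj": QuantLinear()}))
after = layer_type_counts(_FakeModel({"q_proj": LoraLinear(), "v_proj": QuantLinear()}))
print(diff_layer_types(before, after))  # {'LoraLinear': 1, 'QuantLinear': -1}
```

With the real model, the counts would be taken before and after the LoRA is applied. If no LoRA-specific layer class shows up in the diff, that would suggest nothing was replaced.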

@wywywywy
Contributor

wywywywy commented Mar 20, 2023

It's hard to compare by VRAM because LoRA doesn't use much.
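A back-of-the-envelope estimate shows why. The rank (8) and the two target projections per layer are assumptions for illustration; the actual values are in the adapter's adapter_config.json. The hidden size of 4096 and the 32 layers are those of LLaMA-7B.

```python
def lora_param_count(hidden_size, rank, num_layers, targets_per_layer):
    """Parameters added by LoRA on square (hidden x hidden) projections."""
    # Each adapted projection adds A (rank x hidden) and B (hidden x rank).
    per_module = 2 * hidden_size * rank
    return per_module * targets_per_layer * num_layers


# Assumed adapter settings: rank 8, two target projections per layer.
params = lora_param_count(hidden_size=4096, rank=8, num_layers=32, targets_per_layer=2)
size_mib = params * 2 / 2**20  # 2 bytes per fp16 weight

print(params, size_mib)  # 4194304 8.0
```

Under these assumptions the adapter weights add about 8 MiB, which is easily lost in the noise of a multi-gigabyte model.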

I'm pretty sure there's no explicit support from peft. In any case, I've opened a new issue on their board. Please chip in if anyone has anything to add. huggingface/peft#198

Below are some output comparisons with and without LoRA.

4bit without Lora

Below is an instruction that describes a task.
Write a response that appropriately completes the request.

Instruction:

In 10 sentences, summarise the seminal paper called "Attention is All You Need" by Vaswani et al. in 2017

Response:

The authors argue that attention-based models are more effective than recurrent neural networks for machine translation tasks. They also show how to use a single encoder and decoders with different attention mechanisms to improve performance on several NLP tasks such as language modeling, question answering, parsing, and machine translation. The key idea behind their approach is using two separate modules for encoding (decoding) input text into hidden states which can be used further for translating it back from target languages to source ones. This way they avoid the problem of vanishing gradients when training RNNs because each module has its own parameters. In addition, this allows them to train faster since only one side needs to learn at any given time. Finally, the authors propose three approaches for improving the quality of attention mechanism based on softmax loss, multiheaded attention or self-attentive cross-encodings.

4bit with Lora (probably not actually using Lora?)

Below is an instruction that describes a task.
Write a response that appropriately completes the request.

Instruction:

In 10 sentences, summarise the seminal paper called "Attention is All You Need" by Vaswani et al. in 2017

Response:

The authors of this paper propose to tackle the problem of learning long-term dependencies from sequences using a simple neural network with attention mechanism and recurrent structure for input data. In particular, they present a novel architecture based on LSTMs (Long Short Term Memory) which combines self-attentions between hidden states at different time steps as well as cross-attentions across layers. The model has been evaluated on several tasks including machine translation and language modelling where it achieves state-of-the-art results compared to other models such as RNN or CNN+LSTM. Furthermore, their approach can be easily extended to multi-head attention mechanisms or even more complex architectures like transformer networks.

8bit without Lora

Below is an instruction that describes a task.
Write a response that appropriately completes the request.

Instruction:

In 10 sentences, summarise the seminal paper called "Attention is All You Need" by Vaswani et al. in 2017

Response:

The authors of this paper propose to use attention mechanisms for improving performance on machine translation tasks. They show how using self-attentive encoders can improve accuracy and reduce computational complexity compared with traditional recurrent neural networks (RNNs). The proposed model consists of two main components — a bidirectional RNN (BiLSTM) which processes input sequences into hidden states through time steps; and another BiLSTM network used as a contextual decoder to predict output tokens from these hidden state vectors at each step. This architecture allows both forward and backward information flow between layers within the same sequence. It also enables parallel computation across all timesteps simultaneously, thus reducing overall compute cost. In addition, it provides flexibility when choosing different sizes of windows during inference. Finally, they demonstrate the effectiveness of their approach over other models such as LSTM or GRU architectures.

"""

8bit with Lora (should actually be working)

Below is an instruction that describes a task.
Write a response that appropriately completes the request.

Instruction:

In 10 sentences, summarise the seminal paper called "Attention is All You Need" by Vaswani et al. in 2017

Response:

The Attention Is All You Need (AIAYN) model proposed by Vaswani and his colleagues was one of the first successful attempts to use deep learning for natural language processing tasks such as machine translation and question answering. The AIAN model consists of two main components - encoder and decoder networks which are trained separately using supervised learning algorithms like RNNs or CNNs with attention mechanisms. Encoders learn how to encode input sequences into vectors while decoders predict output sequences from these encoded inputs.

@wywywywy
Contributor

There's additional discussion of whether this peft "fix" actually works here #332

@bartman081523
Author

bartman081523 commented Mar 21, 2023

There's additional discussion of whether this peft "fix" actually works here #332

Thank you. I will close here.

FIX:
https://rentry.org/i3qzn
