
Bugs caused by the CastOutputToFloat replacement under transformers==4.31 #268

Closed
GitYCC opened this issue Jul 28, 2023 · 22 comments
Labels
solved This problem has been already solved

Comments

GitYCC (Contributor) commented Jul 28, 2023

The transformers version that supports LLaMA 2 is 4.31.

It adds this line:
https://github.com/huggingface/transformers/blame/e42587f596181396e1c4b63660abf0c736b10dae/src/transformers/models/llama/modeling_llama.py#L820

At runtime, self.lm_head.weight causes a problem, because in
https://github.com/hiyouga/LLaMA-Efficient-Tuning/blob/553b97a9d59a9fe69df8c4014db4dbb121fbf461/src/llmtuner/extras/misc.py#L95

lm_head is replaced by CastOutputToFloat, so it no longer has a weight attribute. How should this be resolved?
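A minimal sketch of the conflict, under the assumption that CastOutputToFloat is a thin torch.nn.Sequential wrapper around the original output layer (as in the old issue title) and that transformers 4.31 reads self.lm_head.weight directly; layer sizes below are only illustrative:

import torch

class CastOutputToFloat(torch.nn.Sequential):
    # Thin wrapper that upcasts the output layer's result to float32.
    def forward(self, x):
        return super().forward(x).to(torch.float32)

lm_head = torch.nn.Linear(16, 32, bias=False)  # stand-in for the model's output layer
wrapped = CastOutputToFloat(lm_head)           # what the swap produces in place of model.lm_head

print(hasattr(lm_head, "weight"))   # True
print(hasattr(wrapped, "weight"))   # False: the Sequential wrapper exposes no `weight`,
                                    # so `self.lm_head.weight` raises AttributeError after the swap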

GitYCC changed the title from "Bugs caused by the replacement of CastOutputToFloat(torch.nn.Sequential) under transformers==4.31" to "Bugs caused by the CastOutputToFloat replacement under transformers==4.31" Jul 28, 2023
hiyouga added the "solved" (This problem has been already solved) label Jul 28, 2023
hiyouga (Owner) commented Jul 28, 2023

Fixed

GitYCC (Contributor, Author) commented Jul 29, 2023

Thanks for bumping the version, but with the current version the training loss does not go down. @hiyouga

hiyouga (Owner) commented Jul 29, 2023

I verified on my side that the loss goes down normally; please check whether the model files or the hyperparameters are the cause.

GitYCC (Contributor, Author) commented Jul 30, 2023

Are you using transformers==4.31 (the version officially verified to run Llama 2 correctly)? Under this version my training loss does not go down.

But when I try transformers==4.29.1, the loss does go down.

My guess is that

setattr(new_output_layer, "weight", output_layer.weight)

does not actually solve the problem introduced by

lm_head_slices = self.lm_head.weight.split(self.vocab_size // self.pretraining_tp, dim=0)
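For context, the logits computation in transformers 4.31 looks roughly like the sketch below (paraphrased, with a standalone function wrapper added only for illustration). When pretraining_tp > 1, the raw weight is sliced and multiplied via F.linear, so the forward() of a wrapped lm_head, and therefore its float32 cast, is bypassed even if a weight attribute has been attached to the wrapper:

import torch
import torch.nn.functional as F

def compute_logits(lm_head, hidden_states, vocab_size, pretraining_tp):
    # Paraphrased sketch of the logits path in transformers 4.31 modeling_llama.py.
    if pretraining_tp > 1:
        # Uses the raw weight slices directly; lm_head.forward() is never called,
        # so a CastOutputToFloat wrapper has no effect on this branch.
        slices = lm_head.weight.split(vocab_size // pretraining_tp, dim=0)
        logits = [F.linear(hidden_states, slices[i]) for i in range(pretraining_tp)]
        return torch.cat(logits, dim=-1)
    # Only this branch goes through the (possibly wrapped) lm_head module itself.
    return lm_head(hidden_states)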

hiyouga (Owner) commented Jul 30, 2023

My environment is 4.30.0.

GitYCC (Contributor, Author) commented Jul 30, 2023

4.31 is the version officially specified for Llama 2; it contains fixes targeted at Llama 2.

With earlier versions, inference exhibits strange behavior.

hiyouga reopened this Jul 31, 2023
hiyouga (Owner) commented Jul 31, 2023

Can you train normally if CastOutputToFloat is removed?

hiyouga added the "pending" (This problem is yet to be addressed) label and removed the "solved" (This problem has been already solved) label Jul 31, 2023
GitYCC (Contributor, Author) commented Jul 31, 2023

Using transformers==4.31 with CastOutputToFloat removed, the training loss does not go down normally.

But with transformers==4.29.1 plus removing CastOutputToFloat, the training loss does go down normally.

GitYCC (Contributor, Author) commented Jul 31, 2023

huggingface/transformers@07360b6
For reference, this is transformers' recent change to the llama modeling code.

hiyouga (Owner) commented Jul 31, 2023

Please refer to this: #202

GitYCC (Contributor, Author) commented Jul 31, 2023

Update libraries using

pip install -U git+https://github.com/huggingface/transformers.git
pip install -U git+https://github.com/huggingface/peft.git

And it works. Thank you.

GitYCC (Contributor, Author) commented Jul 31, 2023

transformers==4.32.0.dev0
peft==0.5.0.dev0

In this case the training loss decreases, but I found that the checkpoint cannot be saved.
There are still problems.

hiyouga added a commit that referenced this issue Jul 31, 2023
hiyouga (Owner) commented Jul 31, 2023

@GitYCC Please update the code and retry.

GitYCC (Contributor, Author) commented Aug 1, 2023

It still cannot save the checkpoint.

hiyouga (Owner) commented Aug 1, 2023

Downgrade the dependency libraries to stable releases rather than the dev versions, then try again.
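For example, a plain pin to the stable releases that GitYCC reports using later in this thread (nothing repo-specific):

pip install transformers==4.31.0 peft==0.4.0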

GitYCC (Contributor, Author) commented Aug 1, 2023

@hiyouga
When I go back to transformers==4.31.0 and peft==0.4.0,
the training loss gets stuck again and the checkpoint cannot be saved.

hiyouga (Owner) commented Aug 1, 2023

Please provide a script for reproducing the error.

GitYCC (Contributor, Author) commented Aug 1, 2023

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 accelerate launch --config_file accelerate_config.yaml --main_process_port 29501  src/train_bash.py \
    --stage sft \
    --model_name_or_path /path/to/Llama-2-13b-chat-hf \
    --template llama2 \
    --do_train \
    --dataset alpaca_zh \
    --finetuning_type lora \
    --output_dir ./outputs/llama_chat_sft_test \
    --overwrite_cache \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 2 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 100 \
    --learning_rate 5e-5 \
    --num_train_epochs 3.0 \
    --plot_loss \
    --fp16 \
    --dev_ratio 0.0001

accelerate_config.yaml:

compute_environment: LOCAL_MACHINE
deepspeed_config:
  gradient_accumulation_steps: 2
  offload_param_device: none
  zero3_init_flag: false
  zero_stage: 2
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
Library versions:

accelerate==0.21.0
bitsandbytes==0.41.0
datasets==2.12.0
peft==0.4.0
scipy==1.11.1
tokenizers==0.13.3
torch==1.13.1
transformers==4.31.0

python version: 3.10

Tesla V100-SXM2-32GB
Driver Version: 530.30.02
CUDA Version: 12.1

GitYCC (Contributor, Author) commented Aug 1, 2023

@hiyouga
I also revised the code piece below in misc.py, because I am using transformers==4.31.0:

if hasattr(model, "pretraining_tp"):
    model.pretraining_tp = 1  # disable TP for LoRA (https://github.com/huggingface/peft/pull/728)
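A slightly fuller sketch of the same workaround, assuming (not verified here) that the flag may be read either from the model object or from model.config depending on the transformers revision:

# Force the non-tensor-parallel branch so logits are computed through lm_head.forward()
# (and hence through CastOutputToFloat) again.
if hasattr(model, "pretraining_tp"):
    model.pretraining_tp = 1
if hasattr(model, "config") and hasattr(model.config, "pretraining_tp"):
    model.config.pretraining_tp = 1  # some revisions read the flag from the config instead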

GitYCC (Contributor, Author) commented Aug 1, 2023

@hiyouga
I found the cause of the "checkpoint cannot be saved" problem.
It is because there is not enough RAM to aggregate all the information when saving the checkpoint.
When I use 2 GPUs, I can save the checkpoint normally.

But the training loss is still stuck.

hiyouga (Owner) commented Aug 1, 2023

Consider using English datasets to fine-tune LLaMA-2 models instead of a non-English corpus.

GitYCC (Contributor, Author) commented Aug 2, 2023

@hiyouga It works, but please help check whether the method is right or not.

Versions I used:

transformers==4.32.0.dev0
peft==0.5.0.dev0

Removed this piece of code (screenshot in the original comment).

Used 4 GPUs to avoid running out of RAM.

hiyouga removed the "pending" (This problem is yet to be addressed) label Aug 2, 2023
hiyouga added the "solved" (This problem has been already solved) label Aug 2, 2023
hiyouga closed this as completed Aug 2, 2023