
Bugs caused by the CastOutputToFloat replacement under transformers==4.31 #268

Closed
GitYCC opened this issue Jul 28, 2023 · 22 comments
Labels
solved This problem has been already solved

Comments

GitYCC (Contributor) commented Jul 28, 2023

The transformers version that supports LLaMA 2 is 4.31.

It adds this line:
https://github.com/huggingface/transformers/blame/e42587f596181396e1c4b63660abf0c736b10dae/src/transformers/models/llama/modeling_llama.py#L820

At runtime, self.lm_head.weight causes a problem, because in
https://github.com/hiyouga/LLaMA-Efficient-Tuning/blob/553b97a9d59a9fe69df8c4014db4dbb121fbf461/src/llmtuner/extras/misc.py#L95

lm_head is replaced by CastOutputToFloat, so it no longer has a weight attribute. How should this be resolved?
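A minimal sketch of the conflict, under the assumption that CastOutputToFloat is a thin torch.nn.Sequential wrapper around the original output layer (as in the old issue title) and that transformers 4.31 reads self.lm_head.weight directly; layer sizes below are only illustrative:

import torch

class CastOutputToFloat(torch.nn.Sequential):
    # Thin wrapper that upcasts the output layer's result to float32.
    def forward(self, x):
        return super().forward(x).to(torch.float32)

lm_head = torch.nn.Linear(16, 32, bias=False)  # stand-in for the model's output layer
wrapped = CastOutputToFloat(lm_head)           # what the swap produces in place of model.lm_head

print(hasattr(lm_head, "weight"))   # True
print(hasattr(wrapped, "weight"))   # False: the Sequential wrapper exposes no `weight`,
                                    # so `self.lm_head.weight` raises AttributeError after the swap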

GitYCC changed the title from "Bugs caused by the replacement of CastOutputToFloat(torch.nn.Sequential) under transformers==4.31" to "Bugs caused by the CastOutputToFloat replacement under transformers==4.31" Jul 28, 2023
hiyouga added the "solved" (This problem has been already solved) label Jul 28, 2023
hiyouga (Owner) commented Jul 28, 2023

Fixed

GitYCC (Contributor, Author) commented Jul 29, 2023

Thanks for bumping the version, but with the current version the training loss does not go down. @hiyouga

hiyouga (Owner) commented Jul 29, 2023

I verified on my side that the loss goes down normally; please check whether the model files or the hyperparameters are the cause.

GitYCC (Contributor, Author) commented Jul 30, 2023

Are you using transformers==4.31 (the version officially verified to run Llama 2 correctly)? Under this version my training loss does not go down.

But when I try transformers==4.29.1, the loss does go down.

My guess is that

setattr(new_output_layer, "weight", output_layer.weight)

does not actually solve the problem introduced by

lm_head_slices = self.lm_head.weight.split(self.vocab_size // self.pretraining_tp, dim=0)
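For context, the logits computation in transformers 4.31 looks roughly like the sketch below (paraphrased, with a standalone function wrapper added only for illustration). When pretraining_tp > 1, the raw weight is sliced and multiplied via F.linear, so the forward() of a wrapped lm_head, and therefore its float32 cast, is bypassed even if a weight attribute has been attached to the wrapper:

import torch
import torch.nn.functional as F

def compute_logits(lm_head, hidden_states, vocab_size, pretraining_tp):
    # Paraphrased sketch of the logits path in transformers 4.31 modeling_llama.py.
    if pretraining_tp > 1:
        # Uses the raw weight slices directly; lm_head.forward() is never called,
        # so a CastOutputToFloat wrapper has no effect on this branch.
        slices = lm_head.weight.split(vocab_size // pretraining_tp, dim=0)
        logits = [F.linear(hidden_states, slices[i]) for i in range(pretraining_tp)]
        return torch.cat(logits, dim=-1)
    # Only this branch goes through the (possibly wrapped) lm_head module itself.
    return lm_head(hidden_states)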

hiyouga (Owner) commented Jul 30, 2023

My environment is 4.30.0.

GitYCC (Contributor, Author) commented Jul 30, 2023

4.31 is the version officially specified for Llama 2; it contains fixes targeted at Llama 2.

With earlier versions, inference exhibits strange behavior.

hiyouga reopened this Jul 31, 2023
hiyouga (Owner) commented Jul 31, 2023

Can you train normally if CastOutputToFloat is removed?

hiyouga added the "pending" (This problem is yet to be addressed) label and removed the "solved" (This problem has been already solved) label Jul 31, 2023
GitYCC (Contributor, Author) commented Jul 31, 2023

Using transformers==4.31 with CastOutputToFloat removed, the training loss does not go down normally.

But with transformers==4.29.1 plus removing CastOutputToFloat, the training loss does go down normally.

GitYCC (Contributor, Author) commented Jul 31, 2023

huggingface/transformers@07360b6
For reference, this is transformers' recent change to the llama modeling code.

hiyouga (Owner) commented Jul 31, 2023

Please refer to this: #202

GitYCC (Contributor, Author) commented Jul 31, 2023

Update libraries using

pip install -U git+https://github.com/huggingface/transformers.git
pip install -U git+https://github.com/huggingface/peft.git

And it works. Thank you.

GitYCC (Contributor, Author) commented Jul 31, 2023

transformers==4.32.0.dev0
peft==0.5.0.dev0

In this case the training loss decreases, but I found that the checkpoint cannot be saved.
There are still problems.

hiyouga added a commit that referenced this issue Jul 31, 2023
hiyouga (Owner) commented Jul 31, 2023

@GitYCC Please update the code and retry.

GitYCC (Contributor, Author) commented Aug 1, 2023

It still cannot save the checkpoint.

hiyouga (Owner) commented Aug 1, 2023

Downgrade the dependency libraries to stable releases rather than the dev versions, then try again.
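For example, a plain pin to the stable releases that GitYCC reports using later in this thread (nothing repo-specific):

pip install transformers==4.31.0 peft==0.4.0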

GitYCC (Contributor, Author) commented Aug 1, 2023

@hiyouga
When I go back to transformers==4.31.0 and peft==0.4.0,
the training loss gets stuck again and the checkpoint cannot be saved.

hiyouga (Owner) commented Aug 1, 2023

Please provide a script for reproducing the error.

GitYCC (Contributor, Author) commented Aug 1, 2023

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 accelerate launch --config_file accelerate_config.yaml --main_process_port 29501  src/train_bash.py \
    --stage sft \
    --model_name_or_path /path/to/Llama-2-13b-chat-hf \
    --template llama2 \
    --do_train \
    --dataset alpaca_zh \
    --finetuning_type lora \
    --output_dir ./outputs/llama_chat_sft_test \
    --overwrite_cache \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 2 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 100 \
    --learning_rate 5e-5 \
    --num_train_epochs 3.0 \
    --plot_loss \
    --fp16 \
    --dev_ratio 0.0001

accelerate_config.yaml:

compute_environment: LOCAL_MACHINE
deepspeed_config:
  gradient_accumulation_steps: 2
  offload_param_device: none
  zero3_init_flag: false
  zero_stage: 2
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
Library versions:

accelerate==0.21.0
bitsandbytes==0.41.0
datasets==2.12.0
peft==0.4.0
scipy==1.11.1
tokenizers==0.13.3
torch==1.13.1
transformers==4.31.0

python version: 3.10

Tesla V100-SXM2-32GB
Driver Version: 530.30.02
CUDA Version: 12.1

GitYCC (Contributor, Author) commented Aug 1, 2023

@hiyouga
I also revised the code piece below in misc.py, because I am using transformers==4.31.0:

if hasattr(model, "pretraining_tp"):
    model.pretraining_tp = 1  # disable TP for LoRA (https://github.com/huggingface/peft/pull/728)
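A slightly fuller sketch of the same workaround, assuming (not verified here) that the flag may be read either from the model object or from model.config depending on the transformers revision:

# Force the non-tensor-parallel branch so logits are computed through lm_head.forward()
# (and hence through CastOutputToFloat) again.
if hasattr(model, "pretraining_tp"):
    model.pretraining_tp = 1
if hasattr(model, "config") and hasattr(model.config, "pretraining_tp"):
    model.config.pretraining_tp = 1  # some revisions read the flag from the config instead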

GitYCC (Contributor, Author) commented Aug 1, 2023

@hiyouga
I found the cause of the "checkpoint cannot be saved" problem.
It is because there is not enough RAM to aggregate all the information when saving the checkpoint.
When I use 2 GPUs, I can save the checkpoint normally.

But the training loss is still stuck.

hiyouga (Owner) commented Aug 1, 2023

Consider using English datasets to fine-tune LLaMA-2 models instead of a non-English corpus.

GitYCC (Contributor, Author) commented Aug 2, 2023

@hiyouga It works, but please help check whether the method is right or not.

Versions I used:

transformers==4.32.0.dev0
peft==0.5.0.dev0

Removed this piece of code (screenshot in the original comment).

Used 4 GPUs to avoid running out of RAM.

hiyouga removed the "pending" (This problem is yet to be addressed) label Aug 2, 2023
hiyouga added the "solved" (This problem has been already solved) label Aug 2, 2023
hiyouga closed this as completed Aug 2, 2023