Version 0.8.1: DeepSpeed ZeRO stage 3 raises an error #4209

xinyubai1209 · 2024-06-11T10:40:24Z

Reminder

I have read the README and searched the existing issues.

System Info

I am training Qwen1.5-1.8B with DeepSpeed. Training runs normally with ZeRO-2, but fails with an error under ZeRO-3.

Reproduction

Command:
CUDA_VISIBLE_DEVICES=5,6 llamafactory-cli train examples/lora_multi_gpu/qwen_lora_dpo_ds.yaml
Training configuration file:

### model
model_name_or_path: /home/Qwen1.5-1.8B

### method
stage: dpo
do_train: true
finetuning_type: lora
lora_target: q_proj,v_proj

### ddp
ddp_timeout: 180000000
deepspeed: examples/deepspeed/ds_z3_config.json

### dataset
dataset: comparison_gpt4_zh
template: qwen
cutoff_len: 1024
max_samples: 100000
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: saves/Qwen1.5-1.8B/lora/dpo
logging_steps: 100
save_steps: 500
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 2
gradient_accumulation_steps: 8
learning_rate: 5.0e-4
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
fp16: true

### eval
val_size: 0.1
per_device_eval_batch_size: 2
eval_strategy: steps
eval_steps: 500

Expected behavior

The error is as follows:

[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/LLaMA-Factory/src/llamafactory/launcher.py", line 9, in <module>
[rank0]:     launch()
[rank0]:   File "/home/LLaMA-Factory/src/llamafactory/launcher.py", line 5, in launch
[rank0]:     run_exp()
[rank0]:   File "/home/LLaMA-Factory/src/llamafactory/train/tuner.py", line 39, in run_exp
[rank0]:     run_dpo(model_args, data_args, training_args, finetuning_args, callbacks)
[rank0]:   File "/home/LLaMA-Factory/src/llamafactory/train/dpo/workflow.py", line 64, in run_dpo
[rank0]:     train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1885, in train
[rank0]:     return inner_training_loop(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 2216, in _inner_training_loop
[rank0]:     tr_loss_step = self.training_step(model, inputs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 3238, in training_step
[rank0]:     loss = self.compute_loss(model, inputs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/trl/trainer/dpo_trainer.py", line 1081, in compute_loss
[rank0]:     loss, metrics = self.get_batch_loss_metrics(model, inputs, train_eval="train")
[rank0]:   File "/home/LLaMA-Factory/src/llamafactory/train/dpo/trainer.py", line 207, in get_batch_loss_metrics
[rank0]:     reference_chosen_logps, reference_rejected_logps = self.compute_reference_log_probs(model, batch)
[rank0]:   File "/home/LLaMA-Factory/src/llamafactory/train/dpo/trainer.py", line 185, in compute_reference_log_probs
[rank0]:     reference_chosen_logps, reference_rejected_logps, *_ = self.concatenated_forward(ref_model, batch)
[rank0]:   File "/home/LLaMA-Factory/src/llamafactory/train/dpo/trainer.py", line 156, in concatenated_forward
[rank0]:     all_logits: "torch.Tensor" = model(**batch, return_dict=True, use_cache=False).logits.to(torch.float32)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
[rank0]:     ret_val = func(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/engine.py", line 1852, in forward
[rank0]:     loss = self.module(*inputs, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1582, in _call_impl
[rank0]:     result = forward_call(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/peft/peft_model.py", line 1430, in forward
[rank0]:     return self.base_model(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1582, in _call_impl
[rank0]:     result = forward_call(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/peft/tuners/tuners_utils.py", line 179, in forward
[rank0]:     return self.model.forward(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/transformers/models/qwen2/modeling_qwen2.py", line 1149, in forward
[rank0]:     outputs = self.model(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1582, in _call_impl
[rank0]:     result = forward_call(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/transformers/models/qwen2/modeling_qwen2.py", line 978, in forward
[rank0]:     inputs_embeds = self.embed_tokens(input_ids)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1571, in _call_impl
[rank0]:     args_result = hook(self, args)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
[rank0]:     ret_val = func(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/parameter_offload.py", line 278, in _pre_forward_module_hook
[rank0]:     self.pre_sub_module_forward_function(module)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/parameter_offload.py", line 452, in pre_sub_module_forward_function
[rank0]:     param_coordinator.fetch_sub_module(sub_module, forward=True)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/eval_frame.py", line 451, in _fn
[rank0]:     return fn(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
[rank0]:     ret_val = func(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 316, in fetch_sub_module
[rank0]:     assert param.ds_status == ZeroParamStatus.AVAILABLE, param.ds_summary()
[rank0]: AssertionError: {'id': 0, 'status': 'INFLIGHT', 'numel': 311164928, 'ds_numel': 311164928, 'shape': (151936, 2048), 'ds_shape': (151936, 2048), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': {4}, 'ds_tensor.shape': torch.Size([155582464])}
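
For readers decoding the assertion message: the summary describes a parameter of shape (151936, 2048), which matches the `embed_tokens` call in the traceback, and its status is `INFLIGHT` (a gather was started but the parameter was not yet marked `AVAILABLE`) at the moment the forward hook checked it. The sketch below only verifies the arithmetic in that message. It assumes a world size of 2, inferred from `CUDA_VISIBLE_DEVICES=5,6`, and the variable names are illustrative rather than DeepSpeed API.

```python
import math

# Values copied from the ds_summary() dict printed in the AssertionError.
shape = (151936, 2048)          # 'shape' / 'ds_shape'
reported_numel = 311164928      # 'numel' / 'ds_numel'
reported_shard = 155582464      # 'ds_tensor.shape': torch.Size([155582464])

# Assumption: two ranks, inferred from CUDA_VISIBLE_DEVICES=5,6.
world_size = 2

# Full parameter size is the product of its dimensions.
full_numel = math.prod(shape)

# An even split across ranks (rounded up) gives the per-rank shard size.
per_rank = math.ceil(full_numel / world_size)

print(full_numel == reported_numel)   # full size matches 'numel'
print(per_rank == reported_shard)     # shard size matches 'ds_tensor.shape'
```

Both comparisons hold, so the numbers are consistent with the embedding weight being split evenly across two ranks. The failure is therefore about the parameter's fetch state rather than about a wrong size.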

Others

No response

hiyouga · 2024-06-11T11:36:48Z

Remove these parameters:

### eval
val_size: 0.1
per_device_eval_batch_size: 2
eval_strategy: steps
eval_steps: 500
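
A minimal sketch of applying the suggestion above programmatically, assuming the config is flat `key: value` YAML like the file in this issue. The helper `drop_keys` is hypothetical and not part of LLaMA-Factory; it filters plain text lines and does not parse YAML.

```python
# Keys named in the suggestion above.
EVAL_KEYS = {"val_size", "per_device_eval_batch_size", "eval_strategy", "eval_steps"}


def drop_keys(config_text, keys):
    """Return config_text without the lines whose top-level key is in `keys`.

    Hypothetical helper: only handles flat `key: value` lines.
    """
    kept = []
    for line in config_text.splitlines():
        # Text before the first colon is treated as the key.
        key = line.split(":", 1)[0].strip()
        if key not in keys:
            kept.append(line)
    return "\n".join(kept)


sample = """\
fp16: true

### eval
val_size: 0.1
per_device_eval_batch_size: 2
eval_strategy: steps
eval_steps: 500
"""

cleaned = drop_keys(sample, EVAL_KEYS)
print(cleaned)
```

The `### eval` comment line is left in place, which is harmless because YAML treats it as a comment.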

ayyyq · 2024-06-11T17:32:23Z

The problem still occurs after removing these parameters. I tried deepspeed==0.13.0 and 0.14.0. @hiyouga

xinyubai1209 · 2024-06-12T02:44:05Z

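The problem still occurs after removing these parameters. I tried deepspeed==0.13.0 and 0.14.0. @hiyouga

I can confirm the same problem is still there. Could you please take another look?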

DeyangKong · 2024-06-12T09:22:36Z

I have this problem too.

hiyouga · 2024-06-12T18:26:46Z

fixed

onex7777 · 2024-06-15T13:48:46Z

Hi, after updating to the new version, do DPO and KTO train normally?

github-actions bot added the pending This problem is yet to be addressed label Jun 11, 2024

hiyouga added solved This problem has been already solved and removed pending This problem is yet to be addressed labels Jun 11, 2024

hiyouga closed this as completed in cf9f2d6 Jun 12, 2024
