Single-node multi-GPU full-parameter training of LLAMA3 fails with "warmup_steps must be either 0 or > 1" #4005

Closed
ZhuYanzhen1 opened this issue May 31, 2024 · 2 comments
Labels
solved This problem has been already solved

Comments

ZhuYanzhen1 commented May 31, 2024

Reminder

  • I have read the README and searched the existing issues.

Reproduction

I launch full-parameter training of LLAMA3-70B with the command ./train.sh, using three A100-SXM4-40GB GPUs. The content of train.sh is as follows:

#!/bin/bash

NPROC_PER_NODE=3
NNODES=1
RANK=0
MASTER_ADDR=127.0.0.1
MASTER_PORT=29500

CUDA_VISIBLE_DEVICES=0,1,2 torchrun \
        --nproc_per_node $NPROC_PER_NODE \
        --nnodes $NNODES \
        --node_rank $RANK \
        --master_addr $MASTER_ADDR \
        --master_port $MASTER_PORT \
        ../llama/src/train.py llama3_sft_multi.yaml

Below is the content of llama3_sft_multi.yaml, in which I set model_name_or_path to a local model. That model was obtained by converting the .pth files of the LLAMA3-Instruct model downloaded from Meta's website with the transformers conversion script (a rough example of that conversion command is shown after the YAML):

### model
model_name_or_path: /docker/llama3_70b_instruct

### method
stage: sft
do_train: true
finetuning_type: full

### ddp
ddp_timeout: 180000000
deepspeed: deepspeed_z3_config.json

### dataset
dataset: identity,alpaca_en_demo
template: llama3
cutoff_len: 1024
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: /docker/llama3_70b_sft
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 2
learning_rate: 0.0001
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_steps: 0.1
fp16: true

### eval
val_size: 0.1
per_device_eval_batch_size: 1
evaluation_strategy: steps
eval_steps: 500
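
For reference, the .pth-to-Hugging-Face conversion mentioned above is normally done with the convert_llama_weights_to_hf.py script that ships with transformers. The invocation below is only an illustrative sketch: the input path is a placeholder and the exact flags (in particular --llama_version) depend on the installed transformers version.

# Illustrative sketch only: the input path is a placeholder and flag names
# may differ across transformers versions.
python src/transformers/models/llama/convert_llama_weights_to_hf.py \
    --input_dir /path/to/Meta-Llama-3-70B-Instruct \
    --model_size 70B \
    --llama_version 3 \
    --output_dir /docker/llama3_70b_instruct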

Below is the content of deepspeed_z3_config.json:

{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true,
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "bf16": {
    "enabled": "auto"
  },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
  }
}

Running ./train.sh produces the following error:

[2024-05-31 12:38:07,473] torch.distributed.run: [WARNING] 
[2024-05-31 12:38:07,473] torch.distributed.run: [WARNING] *****************************************
[2024-05-31 12:38:07,473] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
[2024-05-31 12:38:07,473] torch.distributed.run: [WARNING] *****************************************
[2024-05-31 12:38:11,586] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-05-31 12:38:11,595] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-05-31 12:38:11,599] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.2
 [WARNING]  using untested triton version (2.2.0), only 1.0.0 is known to be compatible
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.2
 [WARNING]  using untested triton version (2.2.0), only 1.0.0 is known to be compatible
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.2
 [WARNING]  using untested triton version (2.2.0), only 1.0.0 is known to be compatible
/home/student_zyz/.local/lib/python3.11/site-packages/transformers/training_args.py:1483: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead
  warnings.warn(
[2024-05-31 12:38:13,327] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-05-31 12:38:13,327] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
/home/student_zyz/.local/lib/python3.11/site-packages/transformers/training_args.py:1483: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead
  warnings.warn(
[2024-05-31 12:38:13,451] [INFO] [comm.py:637:init_distributed] cdb=None
/home/student_zyz/.local/lib/python3.11/site-packages/transformers/training_args.py:1483: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead
  warnings.warn(
[2024-05-31 12:38:13,458] [INFO] [comm.py:637:init_distributed] cdb=None
Traceback (most recent call last):
  File "/home/student_zyz/Desktop/llm-eda/../llama/src/train.py", line 14, in <module>
    main()
  File "/home/student_zyz/Desktop/llm-eda/../llama/src/train.py", line 5, in main
    run_exp()
  File "/home/student_zyz/Desktop/llama/src/llamafactory/train/tuner.py", line 28, in run_exp
    model_args, data_args, training_args, finetuning_args, generating_args = get_train_args(args)
                                                                             ^^^^^^^^^^^^^^^^^^^^
  File "/home/student_zyz/Desktop/llama/src/llamafactory/hparams/parser.py", line 126, in get_train_args
    model_args, data_args, training_args, finetuning_args, generating_args = _parse_train_args(args)
                                                                             ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/student_zyz/Desktop/llama/src/llamafactory/hparams/parser.py", line 112, in _parse_train_args
    return _parse_args(parser, args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/student_zyz/Desktop/llama/src/llamafactory/hparams/parser.py", line 42, in _parse_args
    return parser.parse_yaml_file(os.path.abspath(sys.argv[1]))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/student_zyz/.local/lib/python3.11/site-packages/transformers/hf_argparser.py", line 423, in parse_yaml_file
    outputs = self.parse_dict(yaml.safe_load(Path(yaml_file).read_text()), allow_extra_keys=allow_extra_keys)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/student_zyz/.local/lib/python3.11/site-packages/transformers/hf_argparser.py", line 374, in parse_dict
    obj = dtype(**inputs)
          ^^^^^^^^^^^^^^^
  File "<string>", line 133, in __init__
  File "/home/student_zyz/.local/lib/python3.11/site-packages/transformers/training_args.py", line 1801, in __post_init__
    raise ValueError("warmup_steps must be either 0 or > 1")
ValueError: warmup_steps must be either 0 or > 1
(The other two worker processes raise the identical traceback and ValueError; the duplicate output is omitted here.)
[2024-05-31 12:38:17,477] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 4060611) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/home/student_zyz/.local/bin/torchrun", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/student_zyz/.local/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/home/student_zyz/.local/lib/python3.11/site-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/home/student_zyz/.local/lib/python3.11/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/home/student_zyz/.local/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/student_zyz/.local/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
../llama/src/train.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-05-31_12:38:17
  host      : edaserver01
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 4060612)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2024-05-31_12:38:17
  host      : edaserver01
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 4060613)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-05-31_12:38:17
  host      : edaserver01
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 4060611)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

Expected behavior

Full-parameter training of LLAMA3-70B on three GPUs.

System Info

  • transformers version: 4.42.0.dev0
  • Platform: Linux-5.15.0-107-generic-x86_64-with-glibc2.31
  • Python version: 3.11.9
  • Huggingface_hub version: 0.23.1
  • Safetensors version: 0.4.3
  • Accelerate version: 0.29.3
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.2.1+cu121 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

Others

No response

mmbwf (Contributor) commented May 31, 2024

Replace warmup_steps with warmup_ratio.
[TrainingArguments]
warmup_ratio (float, optional, defaults to 0.0) — Ratio of total training steps used for a linear warmup from 0 to learning_rate.
warmup_steps (int, optional, defaults to 0) — Number of steps used for a linear warmup from 0 to learning_rate. Overrides any effect of warmup_ratio.
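
The error is raised because warmup_steps: 0.1 is a fractional value, while warmup_steps is documented as an int and validated to be either 0 or greater than 1; a fraction of the total steps belongs in warmup_ratio instead. A minimal sketch of the corrected train section in llama3_sft_multi.yaml (the 0.1 comes from the original config; the 100 in the commented-out alternative is just an illustrative number):

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 2
learning_rate: 0.0001
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1       # fraction of total training steps used for linear warmup
# warmup_steps: 100     # alternative: an absolute integer step count (must be 0 or > 1)
fp16: true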

ZhuYanzhen1 (Author)

Solved, thank you.

@hiyouga hiyouga added the solved This problem has been already solved label Jun 3, 2024
hiyouga added a commit that referenced this issue Jun 3, 2024