FileExistsError: [Errno 17] File exists: '../trainer_log.jsonl' during multi-node training under Slurm #3010

Closed
1 task done
Rookie-Kai opened this issue Mar 27, 2024 · 1 comment
Labels
solved This problem has been already solved

Comments

Rookie-Kai commented Mar 27, 2024

Reminder

  • I have read the README and searched the existing issues.

Reproduction

While running continued pre-training of Qwen1.5-14B on two nodes with 8*A800 80G, I get FileExistsError: [Errno 17] File exists: '../trainer_log.jsonl'
Parameters:
torchrun --nnodes $NNODES --master_addr $MASTER_ADDR --master_port $MASTER_PORT --node_rank $NODE_RANK --nproc_per_node 8 \
    /mnt/afs/LLaMA-Factory/src/train_bash.py \
    --deepspeed /mnt/afs/LLaMA-Factory/examples/deepspeed/ds_z3_config.json \
    --stage pt \
    --template qwen \
    --model_name_or_path /mnt/afs/Model/Qwen1.5-14B \
    --do_train --dataset_dir /mnt/afs/dataset \
    --dataset test \
    --finetuning_type full \
    --output_dir /mnt/afs/Output/Qwen1.5-14B \
    --overwrite_cache \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 1 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 1000 \
    --warmup_ratio 0.1 \
    --learning_rate 5e-5 \
    --num_train_epochs 3.0 \
    --plot_loss \
    --bf16 \
    --flash_attn \
    --overwrite_output_dir \
    --preprocessing_num_workers 128 \
    --rope_scaling linear \
    --cutoff_len 4096 \
    --ddp_timeout 180000

Error message:
FileExistsError: [Errno 17] File exists: '/mnt/afs/Output/Qwen1.5-14B/trainer_log.jsonl'
torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 97) of binary: /mnt/afs/miniconda3/envs/qwen/bin/python

Expected behavior

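I want to run continued pre-training of Qwen1.5-14B across multiple nodes,
but it fails with FileExistsError, and trainer_log.jsonl only appears after training has started.
My two nodes share the same storage.

So I would like to know how to resolve this, and whether you could provide a working example script for DeepSpeed multi-node DDP training under Slurm. Thanks.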
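
On the Slurm part of the question, here is a minimal sketch of how the torchrun arguments in the command above could be derived from Slurm's environment, with one launcher process per node. It is not an official LLaMA-Factory script. The function names (`first_hostname`, `build_torchrun_cmd`, `main`) and the default port 29500 are assumptions made for illustration; `SLURM_NNODES`, `SLURM_NODEID`, `SLURM_JOB_NODELIST` and `scontrol show hostnames` are standard Slurm facilities.

```python
import os
import subprocess


def first_hostname(nodelist):
    """Expand a Slurm nodelist with scontrol and return the first hostname."""
    out = subprocess.run(
        ["scontrol", "show", "hostnames", nodelist],
        check=True, capture_output=True, text=True,
    ).stdout
    return out.split()[0]


def build_torchrun_cmd(env, master_addr, script, script_args,
                       nproc_per_node=8, master_port=29500):
    """Build the torchrun command line for the node described by env.

    env is a mapping that holds the Slurm variables; passing it in
    (instead of reading os.environ directly) keeps the function testable.
    """
    return [
        "torchrun",
        "--nnodes", env["SLURM_NNODES"],      # number of nodes in the job
        "--node_rank", env["SLURM_NODEID"],   # index of this node, from 0
        "--master_addr", master_addr,
        "--master_port", str(master_port),    # assumed free on the first node
        "--nproc_per_node", str(nproc_per_node),
        script, *script_args,
    ]


def main():
    """Hypothetical entry point; meant to run once per node under srun."""
    env = os.environ
    addr = first_hostname(env["SLURM_JOB_NODELIST"])
    cmd = build_torchrun_cmd(
        env, addr,
        "/mnt/afs/LLaMA-Factory/src/train_bash.py",
        ["--stage", "pt"],  # remaining training flags would go here
    )
    subprocess.run(cmd, check=True)
```

Under these assumptions, the script would be started with one task per node (for example via `srun --ntasks-per-node=1`) and would call `main()`, so every node computes the same master address and its own node rank.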

System Info

No response

Others

No response

@hiyouga hiyouga added pending This problem is yet to be addressed and removed pending This problem is yet to be addressed labels Mar 28, 2024
@hiyouga hiyouga added the solved This problem has been already solved label Mar 28, 2024
hiyouga (Owner) commented Mar 28, 2024

fixed
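
The thread does not say what the fix was. As a general illustration only, and not the actual LLaMA-Factory change, here is a sketch of one common way to avoid this kind of collision on shared storage. With two nodes there are two processes with local rank 0, so a guard based on local rank still lets both touch the same file. Guarding on the global rank (the `RANK` variable that torchrun exports) leaves a single writer. The helper names `is_global_main_process` and `append_log` are hypothetical.

```python
import json
import os


def is_global_main_process(env=None):
    """Return True only for the process whose global rank is 0."""
    env = os.environ if env is None else env
    # torchrun exports RANK (global) and LOCAL_RANK (per node);
    # a missing RANK is treated as a single-process run.
    return int(env.get("RANK", "0")) == 0


def append_log(output_dir, record, env=None):
    """Append one JSON line to trainer_log.jsonl from global rank 0 only.

    Returns True if the record was written, False if this rank skipped it.
    """
    if not is_global_main_process(env):
        return False
    # exist_ok=True makes directory creation safe if it already exists.
    os.makedirs(output_dir, exist_ok=True)
    path = os.path.join(output_dir, "trainer_log.jsonl")
    # Append mode creates the file if needed and never raises FileExistsError.
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return True
```

With this guard, rank 8 (local rank 0 on the second node) skips the write, while rank 0 creates the file and appends to it.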

tybalex added a commit to sanjay920/LLaMA-Factory that referenced this issue Apr 10, 2024
* fix hiyouga#3010
