The trainer_log.json generated by version 0.7.0 is incomplete #3658

Closed

Grey4sh opened this issue May 9, 2024 · 2 comments
Labels
solved This problem has been already solved

Grey4sh commented May 9, 2024

Reminder

  • I have read the README and searched the existing issues.

Reproduction

SFT training: the recorded trainer_log only goes up to epoch 0.87.

Training script

deepspeed src/train.py --model_name_or_path=xxxx \
        --stage sft \
        --dataset "sft_dataset_merge" \
        --finetuning_type  full \
        --overwrite_cache \
        --preprocessing_num_workers  32 \
        --template  xxx \
        --flash_attn fa2 \
        --output_dir  xxx \
        --bf16  true  \
        --do_train  true  \
        --do_eval false \
        --seed 42 \
        --gradient_accumulation_steps 2 \
        --learning_rate  1e-05 \
        --warmup_ratio 0.02 \
        --cutoff_len 4096 \
        --tf32 true \
        --logging_steps  10 \
        --logging_strategy  steps \
        --lr_scheduler_type  cosine \
        --max_steps  -1 \
        --num_train_epochs  3 \
        --overwrite_output_dir  true  \
        --per_device_train_batch_size 4 \
        --remove_unused_columns  true \
        --report_to tensorboard \
        --plot_loss \
        --save_steps 2000 \
        --eval_steps 200 \
        --val_size 0.01 \
        --evaluation_strategy steps \
        --load_best_model_at_end \
        --save_total_limit  2 \
        --save_safetensors  true  \
        --deepspeed=ds_z3_lr_schedule.json
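
For context, a rough sanity check of the schedule these flags imply (a sketch only; it assumes the 8 processes reported in the accelerate config below and takes the 690 total steps from the trainer output):

# effective batch = 8 processes * 4 per-device * 2 grad accumulation = 64 samples/step
total_steps = 690        # reported by the trainer for 3 epochs
num_train_epochs = 3
logging_steps = 10

steps_per_epoch = total_steps / num_train_epochs      # 230.0
last_logged_step = 200                                 # last entry in the log below
print(last_logged_step / steps_per_epoch)              # ~0.87, i.e. the reported cutoff
print(total_steps // logging_steps)                    # 69 loss records expected, but the log stops at step 200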

trainer_log contents

{"current_steps": 10, "total_steps": 690, "loss": 0.635, "learning_rate": 7.1428571428571436e-06, "epoch": 0.04, "percentage": 1.45, "elapsed_time": "0:01:55", "remaining_time": "2:11:22"}
{"current_steps": 20, "total_steps": 690, "loss": 0.4894, "learning_rate": 9.998056338091415e-06, "epoch": 0.09, "percentage": 2.9, "elapsed_time": "0:03:22", "remaining_time": "1:53:18"}
{"current_steps": 30, "total_steps": 690, "loss": 0.486, "learning_rate": 9.986183876164412e-06, "epoch": 0.13, "percentage": 4.35, "elapsed_time": "0:05:26", "remaining_time": "1:59:35"}
{"current_steps": 40, "total_steps": 690, "loss": 0.473, "learning_rate": 9.96354437049027e-06, "epoch": 0.17, "percentage": 5.8, "elapsed_time": "0:07:04", "remaining_time": "1:54:54"}
{"current_steps": 50, "total_steps": 690, "loss": 0.4353, "learning_rate": 9.930186708264902e-06, "epoch": 0.22, "percentage": 7.25, "elapsed_time": "0:08:47", "remaining_time": "1:52:30"}
{"current_steps": 60, "total_steps": 690, "loss": 0.4402, "learning_rate": 9.88618292120984e-06, "epoch": 0.26, "percentage": 8.7, "elapsed_time": "0:10:45", "remaining_time": "1:52:56"}
{"current_steps": 70, "total_steps": 690, "loss": 0.4109, "learning_rate": 9.831628030028698e-06, "epoch": 0.3, "percentage": 10.14, "elapsed_time": "0:12:39", "remaining_time": "1:52:05"}
{"current_steps": 80, "total_steps": 690, "loss": 0.4154, "learning_rate": 9.76663983922178e-06, "epoch": 0.35, "percentage": 11.59, "elapsed_time": "0:14:33", "remaining_time": "1:51:01"}
{"current_steps": 90, "total_steps": 690, "loss": 0.3968, "learning_rate": 9.691358682701927e-06, "epoch": 0.39, "percentage": 13.04, "elapsed_time": "0:16:23", "remaining_time": "1:49:16"}
{"current_steps": 100, "total_steps": 690, "loss": 0.3841, "learning_rate": 9.605947120760878e-06, "epoch": 0.43, "percentage": 14.49, "elapsed_time": "0:18:04", "remaining_time": "1:46:40"}
{"current_steps": 110, "total_steps": 690, "loss": 0.3916, "learning_rate": 9.510589589040554e-06, "epoch": 0.48, "percentage": 15.94, "elapsed_time": "0:20:04", "remaining_time": "1:45:51"}
{"current_steps": 120, "total_steps": 690, "loss": 0.3583, "learning_rate": 9.405492000267228e-06, "epoch": 0.52, "percentage": 17.39, "elapsed_time": "0:21:59", "remaining_time": "1:44:25"}
{"current_steps": 130, "total_steps": 690, "loss": 0.3739, "learning_rate": 9.29088129960862e-06, "epoch": 0.56, "percentage": 18.84, "elapsed_time": "0:23:53", "remaining_time": "1:42:54"}
{"current_steps": 140, "total_steps": 690, "loss": 0.3806, "learning_rate": 9.16700497461403e-06, "epoch": 0.61, "percentage": 20.29, "elapsed_time": "0:25:50", "remaining_time": "1:41:32"}
{"current_steps": 150, "total_steps": 690, "loss": 0.3589, "learning_rate": 9.034130520795774e-06, "epoch": 0.65, "percentage": 21.74, "elapsed_time": "0:27:37", "remaining_time": "1:39:26"}
{"current_steps": 160, "total_steps": 690, "loss": 0.3358, "learning_rate": 8.892544864005899e-06, "epoch": 0.69, "percentage": 23.19, "elapsed_time": "0:29:16", "remaining_time": "1:36:57"}
{"current_steps": 170, "total_steps": 690, "loss": 0.3258, "learning_rate": 8.742553740855507e-06, "epoch": 0.74, "percentage": 24.64, "elapsed_time": "0:31:16", "remaining_time": "1:35:40"}
{"current_steps": 180, "total_steps": 690, "loss": 0.3344, "learning_rate": 8.584481038514573e-06, "epoch": 0.78, "percentage": 26.09, "elapsed_time": "0:33:10", "remaining_time": "1:33:59"}
{"current_steps": 190, "total_steps": 690, "loss": 0.3343, "learning_rate": 8.418668095317912e-06, "epoch": 0.82, "percentage": 27.54, "elapsed_time": "0:34:33", "remaining_time": "1:30:56"}
{"current_steps": 200, "total_steps": 690, "loss": 0.3376, "learning_rate": 8.245472963687484e-06, "epoch": 0.87, "percentage": 28.99, "elapsed_time": "0:36:13", "remaining_time": "1:28:45"}
{"current_steps": 200, "total_steps": 690, "eval_loss": 0.31446942687034607, "epoch": 0.87, "percentage": 28.99, "elapsed_time": "0:36:21", "remaining_time": "1:29:04"}

Expected behavior

I have run SFT training several times with the new version and hit this problem every time.

System Info

  • transformers version: 4.39.3
  • Platform: Linux-5.15.0-25-generic-x86_64-with-glibc2.35
  • Python version: 3.10.12
  • Huggingface_hub version: 0.20.1
  • Safetensors version: 0.4.3
  • Accelerate version: 0.27.2
  • Accelerate config: - compute_environment: LOCAL_MACHINE
    - distributed_type: MULTI_GPU
    - mixed_precision: bf16
    - use_cpu: False
    - debug: True
    - num_processes: 8
    - machine_rank: 0
    - num_machines: 1
    - gpu_ids: all
    - rdzv_backend: static
    - same_network: True
    - main_training_function: main
    - downcast_bf16: no
    - tpu_use_cluster: False
    - tpu_use_sudo: False
    - tpu_env: []
  • PyTorch version (GPU?): 2.1.1+cu121 (True)
  • Tensorflow version (GPU?): 2.15.0 (True)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

Others

No response

@Jungle728

same question

hiyouga added the pending (This problem is yet to be addressed) label on May 11, 2024

hiyouga (Owner) commented May 11, 2024

fixed

hiyouga added the solved (This problem has been already solved) label and removed the pending (This problem is yet to be addressed) label on May 11, 2024