SFT training: the recorded trainer_log only reaches epoch 0.87
```shell
deepspeed src/train.py --model_name_or_path=xxxx \
  --stage sft \
  --dataset "sft_dataset_merge" \
  --finetuning_type full \
  --overwrite_cache \
  --preprocessing_num_workers 32 \
  --template xxx \
  --flash_attn fa2 \
  --output_dir xxx \
  --bf16 true \
  --do_train true \
  --do_eval false \
  --seed 42 \
  --gradient_accumulation_steps 2 \
  --learning_rate 1e-05 \
  --warmup_ratio 0.02 \
  --cutoff_len 4096 \
  --tf32 true \
  --logging_steps 10 \
  --logging_strategy steps \
  --lr_scheduler_type cosine \
  --max_steps -1 \
  --num_train_epochs 3 \
  --overwrite_output_dir true \
  --per_device_train_batch_size 4 \
  --remove_unused_columns true \
  --report_to tensorboard \
  --plot_loss \
  --save_steps 2000 \
  --eval_steps 200 \
  --val_size 0.01 \
  --evaluation_strategy steps \
  --load_best_model_at_end \
  --save_total_limit 2 \
  --save_safetensors true \
  --deepspeed=ds_z3_lr_schedule.json
```
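The referenced `ds_z3_lr_schedule.json` is not attached to the report. For context, a minimal DeepSpeed ZeRO-3 config of the kind LLaMA-Factory's examples ship (a sketch using `auto` values resolved by the HF Trainer; this is an assumption, not the reporter's actual file) looks like:

```json
{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true,
  "bf16": { "enabled": "auto" },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_gather_16bit_weights_on_model_save": true
  }
}
```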
```json
{"current_steps": 10, "total_steps": 690, "loss": 0.635, "learning_rate": 7.1428571428571436e-06, "epoch": 0.04, "percentage": 1.45, "elapsed_time": "0:01:55", "remaining_time": "2:11:22"}
{"current_steps": 20, "total_steps": 690, "loss": 0.4894, "learning_rate": 9.998056338091415e-06, "epoch": 0.09, "percentage": 2.9, "elapsed_time": "0:03:22", "remaining_time": "1:53:18"}
{"current_steps": 30, "total_steps": 690, "loss": 0.486, "learning_rate": 9.986183876164412e-06, "epoch": 0.13, "percentage": 4.35, "elapsed_time": "0:05:26", "remaining_time": "1:59:35"}
{"current_steps": 40, "total_steps": 690, "loss": 0.473, "learning_rate": 9.96354437049027e-06, "epoch": 0.17, "percentage": 5.8, "elapsed_time": "0:07:04", "remaining_time": "1:54:54"}
{"current_steps": 50, "total_steps": 690, "loss": 0.4353, "learning_rate": 9.930186708264902e-06, "epoch": 0.22, "percentage": 7.25, "elapsed_time": "0:08:47", "remaining_time": "1:52:30"}
{"current_steps": 60, "total_steps": 690, "loss": 0.4402, "learning_rate": 9.88618292120984e-06, "epoch": 0.26, "percentage": 8.7, "elapsed_time": "0:10:45", "remaining_time": "1:52:56"}
{"current_steps": 70, "total_steps": 690, "loss": 0.4109, "learning_rate": 9.831628030028698e-06, "epoch": 0.3, "percentage": 10.14, "elapsed_time": "0:12:39", "remaining_time": "1:52:05"}
{"current_steps": 80, "total_steps": 690, "loss": 0.4154, "learning_rate": 9.76663983922178e-06, "epoch": 0.35, "percentage": 11.59, "elapsed_time": "0:14:33", "remaining_time": "1:51:01"}
{"current_steps": 90, "total_steps": 690, "loss": 0.3968, "learning_rate": 9.691358682701927e-06, "epoch": 0.39, "percentage": 13.04, "elapsed_time": "0:16:23", "remaining_time": "1:49:16"}
{"current_steps": 100, "total_steps": 690, "loss": 0.3841, "learning_rate": 9.605947120760878e-06, "epoch": 0.43, "percentage": 14.49, "elapsed_time": "0:18:04", "remaining_time": "1:46:40"}
{"current_steps": 110, "total_steps": 690, "loss": 0.3916, "learning_rate": 9.510589589040554e-06, "epoch": 0.48, "percentage": 15.94, "elapsed_time": "0:20:04", "remaining_time": "1:45:51"}
{"current_steps": 120, "total_steps": 690, "loss": 0.3583, "learning_rate": 9.405492000267228e-06, "epoch": 0.52, "percentage": 17.39, "elapsed_time": "0:21:59", "remaining_time": "1:44:25"}
{"current_steps": 130, "total_steps": 690, "loss": 0.3739, "learning_rate": 9.29088129960862e-06, "epoch": 0.56, "percentage": 18.84, "elapsed_time": "0:23:53", "remaining_time": "1:42:54"}
{"current_steps": 140, "total_steps": 690, "loss": 0.3806, "learning_rate": 9.16700497461403e-06, "epoch": 0.61, "percentage": 20.29, "elapsed_time": "0:25:50", "remaining_time": "1:41:32"}
{"current_steps": 150, "total_steps": 690, "loss": 0.3589, "learning_rate": 9.034130520795774e-06, "epoch": 0.65, "percentage": 21.74, "elapsed_time": "0:27:37", "remaining_time": "1:39:26"}
{"current_steps": 160, "total_steps": 690, "loss": 0.3358, "learning_rate": 8.892544864005899e-06, "epoch": 0.69, "percentage": 23.19, "elapsed_time": "0:29:16", "remaining_time": "1:36:57"}
{"current_steps": 170, "total_steps": 690, "loss": 0.3258, "learning_rate": 8.742553740855507e-06, "epoch": 0.74, "percentage": 24.64, "elapsed_time": "0:31:16", "remaining_time": "1:35:40"}
{"current_steps": 180, "total_steps": 690, "loss": 0.3344, "learning_rate": 8.584481038514573e-06, "epoch": 0.78, "percentage": 26.09, "elapsed_time": "0:33:10", "remaining_time": "1:33:59"}
{"current_steps": 190, "total_steps": 690, "loss": 0.3343, "learning_rate": 8.418668095317912e-06, "epoch": 0.82, "percentage": 27.54, "elapsed_time": "0:34:33", "remaining_time": "1:30:56"}
{"current_steps": 200, "total_steps": 690, "loss": 0.3376, "learning_rate": 8.245472963687484e-06, "epoch": 0.87, "percentage": 28.99, "elapsed_time": "0:36:13", "remaining_time": "1:28:45"}
{"current_steps": 200, "total_steps": 690, "eval_loss": 0.31446942687034607, "epoch": 0.87, "percentage": 28.99, "elapsed_time": "0:36:21", "remaining_time": "1:29:04"}
```
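A quick arithmetic check on the numbers quoted above: with `total_steps = 690` over 3 epochs, one epoch is 230 optimizer steps, so the last logged step (200, which is also the first evaluation per `--eval_steps 200`) lands exactly at epoch 0.87. In other words, the log stops right at the first evaluation:

```python
# Sanity check on the trainer-log numbers quoted above.
total_steps = 690        # reported by the trainer for the whole run
num_train_epochs = 3     # from --num_train_epochs
steps_per_epoch = total_steps / num_train_epochs  # 230 optimizer steps per epoch

last_logged_step = 200   # final log entry, also the first eval (--eval_steps 200)
epoch_at_last_entry = last_logged_step / steps_per_epoch

print(round(epoch_at_last_entry, 2))  # 0.87 -- matches where the log stops
```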
I have run SFT training several times with the new version and hit this problem every time.
same question
Fixed in commit 4777efe
Reminder
Reproduction
SFT training: the recorded trainer_log only reaches epoch 0.87.

Training script: see the deepspeed command above.

trainer_log entries: see the log excerpt above.
Expected behavior
I have run SFT training several times with the new version and hit this problem every time; the trainer_log should continue through all 3 epochs.
System Info
- transformers version: 4.39.3
- distributed_type: MULTI_GPU
- mixed_precision: bf16
- use_cpu: False
- debug: True
- num_processes: 8
- machine_rank: 0
- num_machines: 1
- gpu_ids: all
- rdzv_backend: static
- same_network: True
- main_training_function: main
- downcast_bf16: no
- tpu_use_cluster: False
- tpu_use_sudo: False
- tpu_env: []
Others
No response