
Clear divide of loss before and after commit 9bec3c98a22c91b1c28fda757db51eb780291641 #2983

Closed
HideLord opened this issue Mar 26, 2024 · 7 comments
Labels
solved This problem has been already solved

Comments

@HideLord

Reminder

  • I have read the README and searched the existing issues.

Reproduction

Command:

deepspeed --num_gpus 2 --master_port=9901 src/train_bash.py \
    --deepspeed ds_config.json --flash_attn --stage sft --do_train True \
    --model_name_or_path yam-peleg/Experiment26-7B --finetuning_type lora \
    --template mistral --dataset_dir data --dataset double_take_dataset \
    --data_seed 42 --seed 42 --cutoff_len 1500 --learning_rate 0.00003 \
    --num_train_epochs 1.0 --max_samples 100000 \
    --per_device_train_batch_size 4 --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 1 --lr_scheduler_type polynomial \
    --max_grad_norm 1.0 --logging_steps 5 --bf16 True \
    --lora_rank 108 --lora_alpha 216 --lora_target all \
    --val_size 0.05 --evaluation_strategy steps --save_steps 0.1 --eval_steps 0.1 \
    --load_best_model_at_end True --plot_loss True \
    --run_name DoubleTake_v16.9 --output_dir saves/Mistral/lora/DoubleTake_v16.9

Deepspeed config:

{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true,
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "initial_scale_power": 16,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "zero_optimization": {
    "stage": 2,
    "allgather_partitions": true,
    "allgather_bucket_size": 5e8,
    "reduce_scatter": true,
    "reduce_bucket_size": 5e8,
    "overlap_comm": false,
    "contiguous_gradients": true
  }
}
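
For reference (not taken from the thread): with the flags above, the HF Trainer typically resolves the "auto" fields from the command-line arguments, roughly as in the sketch below. Here 2 GPUs x a per-device batch size of 4 x 1 gradient accumulation step gives a train batch size of 8, gradient_clipping is taken from --max_grad_norm, and fp16 stays disabled because --bf16 True is requested.

{
  "train_batch_size": 8,
  "train_micro_batch_size_per_gpu": 4,
  "gradient_accumulation_steps": 1,
  "gradient_clipping": 1.0,
  "fp16": { "enabled": false }
}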

[Plot: eval loss curves for the compared runs]

The runs with the higher loss are those from commits after (and including) 9bec3c98a22c91b1c28fda757db51eb780291641. The same gap shows up in the train loss:

[Plot: train loss curves for the same runs]

All tests were performed with the same dataset, seed, and arguments.

Expected behavior

No regression in the train/eval loss.

System Info

  • transformers version: 4.39.1
  • Platform: Linux-5.15.0-97-generic-x86_64-with-glibc2.35
  • Python version: 3.10.12
  • Huggingface_hub version: 0.22.0
  • Safetensors version: 0.4.2
  • Accelerate version: 0.28.0
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.2.1+cu121 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: yes

Others

Please tell me if you need more info, and I will help if possible!

@hiyouga hiyouga added the pending This problem is yet to be addressed label Mar 26, 2024
@hiyouga (Owner) commented Mar 26, 2024

It should not cause a loss regression. Could you try fine-tuning models with different seeds?

@HideLord (Author)

> It should not cause a loss regression. Could you try fine-tuning models with different seeds?

I started another run, but it will take some time. I will update this comment once a few hundred steps have run.

@HideLord (Author) commented Mar 26, 2024

Here are another two runs with a new seed, 123. Data shuffling has been turned off so that the runs are comparable:

deepspeed --num_gpus 2 --master_port=9901 src/train_bash.py \
    --deepspeed ds_config.json --flash_attn --stage sft --do_train True \
    --model_name_or_path yam-peleg/Experiment26-7B --finetuning_type lora \
    --template mistral --dataset_dir data --dataset double_take_dataset \
    --data_seed 123 --seed 123 --cutoff_len 1500 --learning_rate 0.00003 \
    --num_train_epochs 1.0 --max_samples 100000 \
    --per_device_train_batch_size 4 --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 1 --lr_scheduler_type polynomial \
    --max_grad_norm 1.0 --logging_steps 5 --bf16 True \
    --lora_rank 108 --lora_alpha 216 --lora_target all \
    --val_size 0.05 --evaluation_strategy steps --save_steps 0.1 --eval_steps 0.1 \
    --load_best_model_at_end True --plot_loss True \
    --run_name DoubleTake_v16.9 --output_dir saves/Mistral/lora/DoubleTake_v16.9

Black = With commit 9bec3c98a22c91b1c28fda757db51eb780291641
Red = With commit 7b8f5029018f0481f7da83cc5ee4408d95c9beb2

Eval Loss:
[Plot: eval loss, black vs. red]

Train Loss:
[Plot: train loss, black vs. red]

@HideLord (Author)

Here is the diff of the logs if you need it:
https://www.diffchecker.com/YdIhGqzq/

@hiyouga (Owner) commented Mar 26, 2024

Could you try again with 3bcd41b?
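
(For reference, not from the thread: a minimal sketch of how a run can be pinned to the suggested commit, assuming the working directory is a clone of the LLaMA-Factory repository.)

git fetch origin       # update refs from the remote
git checkout 3bcd41b   # pin the working tree to the suggested commit
# Then rerun the same deepspeed command from the runs above,
# keeping ds_config.json, the dataset, and the seed unchanged.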

@HideLord (Author)

Absolute legend 🤩
That fixed it! Blue is the new run:
[Plot: loss curve with the new run in blue]

@hiyouga (Owner) commented Mar 26, 2024

Thank you very much for the rapid test. We are still investigating the cause of the loss regression and will post a bug report once we find it.

@hiyouga hiyouga added solved This problem has been already solved and removed pending This problem is yet to be addressed labels Mar 27, 2024
@hiyouga hiyouga closed this as completed Mar 28, 2024