
Clear divide of loss before and after commit 9bec3c98a22c91b1c28fda757db51eb780291641 #2983

Closed
HideLord opened this issue Mar 26, 2024 · 7 comments
Labels
solved This problem has been already solved

Comments

@HideLord

Reminder

  • I have read the README and searched the existing issues.

Reproduction

Command:

deepspeed --num_gpus 2 --master_port=9901 src/train_bash.py \
    --deepspeed ds_config.json --flash_attn --stage sft --do_train True \
    --model_name_or_path yam-peleg/Experiment26-7B --finetuning_type lora \
    --template mistral --dataset_dir data --dataset double_take_dataset \
    --data_seed 42 --seed 42 --cutoff_len 1500 --learning_rate 0.00003 \
    --num_train_epochs 1.0 --max_samples 100000 \
    --per_device_train_batch_size 4 --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 1 --lr_scheduler_type polynomial \
    --max_grad_norm 1.0 --logging_steps 5 --bf16 True \
    --lora_rank 108 --lora_alpha 216 --lora_target all \
    --val_size 0.05 --evaluation_strategy steps --save_steps 0.1 --eval_steps 0.1 \
    --load_best_model_at_end True --plot_loss True \
    --run_name DoubleTake_v16.9 --output_dir saves/Mistral/lora/DoubleTake_v16.9

Deepspeed config:

{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true,
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "initial_scale_power": 16,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "zero_optimization": {
    "stage": 2,
    "allgather_partitions": true,
    "allgather_bucket_size": 5e8,
    "reduce_scatter": true,
    "reduce_bucket_size": 5e8,
    "overlap_comm": false,
    "contiguous_gradients": true
  }
}
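
For reference (not taken from the thread): with the flags above, the HF Trainer typically resolves the "auto" fields from the command-line arguments, roughly as in the sketch below. Here 2 GPUs x a per-device batch size of 4 x 1 gradient accumulation step gives a train batch size of 8, gradient_clipping is taken from --max_grad_norm, and fp16 stays disabled because --bf16 True is requested.

{
  "train_batch_size": 8,
  "train_micro_batch_size_per_gpu": 4,
  "gradient_accumulation_steps": 1,
  "gradient_clipping": 1.0,
  "fp16": { "enabled": false }
}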

[Plot: eval loss curves for the compared runs]

The runs with the higher loss are those from commits after (and including) 9bec3c98a22c91b1c28fda757db51eb780291641. The same gap shows up in the train loss:

[Plot: train loss curves for the same runs]

All tests were performed with the same dataset, seed, and arguments.

Expected behavior

No regression in the train/eval loss.

System Info

  • transformers version: 4.39.1
  • Platform: Linux-5.15.0-97-generic-x86_64-with-glibc2.35
  • Python version: 3.10.12
  • Huggingface_hub version: 0.22.0
  • Safetensors version: 0.4.2
  • Accelerate version: 0.28.0
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.2.1+cu121 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: yes

Others

Please tell me if you need more info, and I will help if possible!

@hiyouga hiyouga added the pending This problem is yet to be addressed label Mar 26, 2024
@hiyouga (Owner) commented Mar 26, 2024

It should not cause a loss regression. Could you try fine-tuning models with different seeds?

@HideLord (Author)

> It should not cause a loss regression. Could you try fine-tuning models with different seeds?

I started another run, but it will take some time. I will update this comment once a few hundred steps have run.

@HideLord (Author) commented Mar 26, 2024

Here are another two runs with a new seed, 123. Data shuffling has been turned off so that the runs are comparable:

deepspeed --num_gpus 2 --master_port=9901 src/train_bash.py \
    --deepspeed ds_config.json --flash_attn --stage sft --do_train True \
    --model_name_or_path yam-peleg/Experiment26-7B --finetuning_type lora \
    --template mistral --dataset_dir data --dataset double_take_dataset \
    --data_seed 123 --seed 123 --cutoff_len 1500 --learning_rate 0.00003 \
    --num_train_epochs 1.0 --max_samples 100000 \
    --per_device_train_batch_size 4 --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 1 --lr_scheduler_type polynomial \
    --max_grad_norm 1.0 --logging_steps 5 --bf16 True \
    --lora_rank 108 --lora_alpha 216 --lora_target all \
    --val_size 0.05 --evaluation_strategy steps --save_steps 0.1 --eval_steps 0.1 \
    --load_best_model_at_end True --plot_loss True \
    --run_name DoubleTake_v16.9 --output_dir saves/Mistral/lora/DoubleTake_v16.9

Black = With commit 9bec3c98a22c91b1c28fda757db51eb780291641
Red = With commit 7b8f5029018f0481f7da83cc5ee4408d95c9beb2

Eval Loss:
[Plot: eval loss, black vs. red]

Train Loss:
[Plot: train loss, black vs. red]

@HideLord (Author)

Here is the diff of the logs if you need it:
https://www.diffchecker.com/YdIhGqzq/

@hiyouga (Owner) commented Mar 26, 2024

Could you try again with 3bcd41b?
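
(For reference, not from the thread: a minimal sketch of how a run can be pinned to the suggested commit, assuming the working directory is a clone of the LLaMA-Factory repository.)

git fetch origin       # update refs from the remote
git checkout 3bcd41b   # pin the working tree to the suggested commit
# Then rerun the same deepspeed command from the runs above,
# keeping ds_config.json, the dataset, and the seed unchanged.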

@HideLord (Author)

Absolute legend 🤩
That fixed it! Blue is the new run:
[Plot: loss curve with the new run in blue]

@hiyouga (Owner) commented Mar 26, 2024

Thank you very much for the rapid test. We are still investigating the cause of the loss regression and will post a bug report once we find it.

@hiyouga hiyouga added solved This problem has been already solved and removed pending This problem is yet to be addressed labels Mar 27, 2024
@hiyouga hiyouga closed this as completed Mar 28, 2024