
Rouge-score results are surprisingly low #3764

Closed · Jiminator opened this issue May 15, 2024 · 5 comments
Labels: solved (This problem has been already solved)

Comments

Reminder

  • I have read the README and searched the existing issues.

Reproduction

Here is my fine-tuning YAML:

# model
model_name_or_path: meta-llama/Meta-Llama-3-8B

# method
stage: sft
do_train: true
finetuning_type: full
use_badam: true
badam_switch_mode: descending
badam_switch_interval: 50
badam_verbose: 2

# dataset
dataset: slimorca
template: llama3
cutoff_len: 1024
max_samples: 1000
val_size: 0.1
overwrite_cache: true
preprocessing_num_workers: 16

# output
output_dir: saves/llama3-8b/slimorca/badam/sft
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true

# train
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 0.0001
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_steps: 0.1
bf16: false

# eval
per_device_eval_batch_size: 1
evaluation_strategy: steps
eval_steps: 500

Here is the YAML file I used to calculate the ROUGE/BLEU scores:

# model
model_name_or_path: saves/llama3-8b/slimorca/badam/sft

# method
stage: sft
do_predict: true
finetuning_type: full
use_badam: true
badam_switch_mode: descending
badam_switch_interval: 50
badam_verbose: 2

# dataset
dataset: alpaca_gpt4_en
template: llama3
cutoff_len: 1024
max_samples: 50
overwrite_cache: true
preprocessing_num_workers: 16

# output
output_dir: saves/llama3-8b/slimorca/badam/predict
overwrite_output_dir: true

# eval
per_device_eval_batch_size: 1
predict_with_generate: true

Expected behavior

After fine-tuning Llama 3-8B using BAdam, I expected the ROUGE and BLEU scores to improve significantly, as they did in my previous experiments with LoRA, LoRA+, and QLoRA. However, the output of my prediction script showed that BAdam netted only a very slight improvement.

Llama-3-8b (base):

{
    "predict_bleu-4": 4.054454,
    "predict_rouge-1": 18.643065999999997,
    "predict_rouge-2": 5.275764000000001,
    "predict_rouge-l": 4.190858,
    "predict_runtime": 2317.376,
    "predict_samples_per_second": 0.022,
    "predict_steps_per_second": 0.022
}

Llama-3-8b (Badam):

{
    "predict_bleu-4": 6.910812,
    "predict_rouge-1": 23.442495999999995,
    "predict_rouge-2": 8.417560000000002,
    "predict_rouge-l": 5.708255999999999,
    "predict_runtime": 1741.1326,
    "predict_samples_per_second": 0.029,
    "predict_steps_per_second": 0.029
}

System Info

  • transformers version: 4.40.0
  • Platform: Linux-5.14.0-284.40.1.el9_2.x86_64-x86_64-with-glibc2.31
  • Python version: 3.12.1
  • Huggingface_hub version: 0.22.2
  • Safetensors version: 0.4.3
  • Accelerate version: 0.30.0
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.2.2+cu121 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: v100
  • Using distributed or parallel set-up in script?: no

Others

The only changes I made to the original badam example are the dataset I am using to fine-tune, the location of the val_size variable in the YAML file, and pure_bf16. Since I am using a V100, my machine does not support bf16, and when I try to just use fp16: true, I get the error ValueError: Attempting to unscale FP16 gradients. I tried fixing this by switching to peft==0.6.0, but then llmtuner doesn't work (ImportError: peft>=0.10.0 is required for a normal functioning of this module, but found peft==0.6.0.). Setting bf16: false allows the fine-tuning script to run, but I worry it's not doing the fine-tuning correctly. Any help or advice would be greatly appreciated!
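
For context, these are the precision flags in question, shown only to map the errors above onto the config (the commented-out lines are the variants I tried or removed from the original example):

bf16: false        # current setting; training runs, but I am unsure it is correct
# fp16: true       # fails with "ValueError: Attempting to unscale FP16 gradients"
# pure_bf16: true  # the setting I removed from the original badam example; not usable on a V100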

Ledzy (Contributor) commented May 16, 2024

Hi @Jiminator, since only 1000 samples are used and the effective batch size is 8, setting badam_switch_interval=50 means only about 8 blocks get updated (1000 * 3 / (8 * 50) = 7.5), while Llama 3-8B has 32 blocks under layer-wise partition. You should try increasing the number of training epochs or reducing the switch interval (though we suggest keeping it above 20) to ensure every block gets trained. Usually, setting badam_switch_mode to "ascending" or "random" yields faster convergence at the beginning.
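
As a rough illustration of those two adjustments (the specific values below are assumptions chosen so that all 32 blocks get visited with 1000 samples at an effective batch size of 8, not tested recommendations), the relevant training-config lines could look like:

# badam (illustrative values only)
use_badam: true
badam_switch_mode: ascending     # or "random"; tends to converge faster at the beginning
badam_switch_interval: 25        # kept above the suggested minimum of 20
num_train_epochs: 7.0            # 1000 * 7 / (8 * 25) = 35 block switches >= 32 blocks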

Since fine-tuning Llama 3-8B requires no more than 24GB of memory and a V100 has 32GB, you can alternatively make each trainable block larger, e.g. have it contain two or more layers instead of a single layer. You can follow the instructions at https://github.com/Ledzy/BAdam?tab=readme-ov-file#partition-by-module to set the block partition (this requires modifying the _create_badam_optimizer function a bit), which will yield faster convergence, as the larger number of trainable parameters offers a larger parameter search space.

Alternatively, you can try --badam_mode ratio with a badam_update_ratio that fits within your memory limit.
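
A minimal sketch of that ratio-mode variant (the update ratio here is only a placeholder to be tuned to the V100's memory budget, not a recommended value):

# badam ratio mode (placeholder ratio)
use_badam: true
badam_mode: ratio
badam_update_ratio: 0.05   # fraction of parameters trained at a time; raise or lower to fit memory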

hiyouga (Owner) commented May 16, 2024

You should use template: default for the Llama-3 base models; the llama3 template should only be used for the instruct models.
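
For reference, the relevant line would look something like this in the config (everything else unchanged):

# dataset
template: default   # instead of "llama3"; meta-llama/Meta-Llama-3-8B is a base model, not an instruct model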

Jiminator (Author) commented May 16, 2024

@Ledzy Thanks for the tip. Should the BAdam example config in the repository be adjusted to reflect this?

@hiyouga Thank you so much! I am guessing the template change needs to be made in both the predict and train YAML files, correct?

hiyouga added a commit that referenced this issue May 16, 2024
hiyouga (Owner) commented May 16, 2024

@Jiminator Exactly. Be sure to use the same template for training and inference.

hiyouga added the solved label on May 16, 2024
hiyouga closed this as completed on May 19, 2024
jinec commented Jun 25, 2024

I fine-tuned a Qwen1.5 model using the qwen template, and the validation-set ROUGE was likewise very low, peaking at only a bit over 10. But after training finished, when I ran inference separately, the ROUGE was above 80... I really can't understand why; something major must have changed somewhere.
