Rouge-score results are surprisingly low #3764

Jiminator · 2024-05-15T19:59:12Z

Reminder

I have read the README and searched the existing issues.

Reproduction

Here is my finetuning yaml:

# model
model_name_or_path: meta-llama/Meta-Llama-3-8B

# method
stage: sft
do_train: true
finetuning_type: full
use_badam: true
badam_switch_mode: descending
badam_switch_interval: 50
badam_verbose: 2

# dataset
dataset: slimorca
template: llama3
cutoff_len: 1024
max_samples: 1000
val_size: 0.1
overwrite_cache: true
preprocessing_num_workers: 16

# output
output_dir: saves/llama3-8b/slimorca/badam/sft
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true

# train
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 0.0001
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_steps: 0.1
bf16: false

# eval
per_device_eval_batch_size: 1
evaluation_strategy: steps
eval_steps: 500

Here is the yaml file I used to calculate rouge-score/bleu score

# model
model_name_or_path: saves/llama3-8b/slimorca/badam/sft

# method
stage: sft
do_predict: true
finetuning_type: full
use_badam: true
badam_switch_mode: descending
badam_switch_interval: 50
badam_verbose: 2

# dataset
dataset: alpaca_gpt4_en
template: llama3
cutoff_len: 1024
max_samples: 50
overwrite_cache: true
preprocessing_num_workers: 16

# output
output_dir: saves/llama3-8b/slimorca/badam/predict
overwrite_output_dir: true

# eval
per_device_eval_batch_size: 1
predict_with_generate: true

Expected behavior

After fine-tuning the llama3 8B using Badam, I expected the rouge and bleu scores to significantly improve, as with my previous experiments using Lora, Lora+, and Qlora. However, the output of my prediction script showed that Badam only netted a very slight improvement.

Llama-3-8b (base):

{
    "predict_bleu-4": 4.054454,
    "predict_rouge-1": 18.643065999999997,
    "predict_rouge-2": 5.275764000000001,
    "predict_rouge-l": 4.190858,
    "predict_runtime": 2317.376,
    "predict_samples_per_second": 0.022,
    "predict_steps_per_second": 0.022
}

Llama-3-8b (Badam):

{
    "predict_bleu-4": 6.910812,
    "predict_rouge-1": 23.442495999999995,
    "predict_rouge-2": 8.417560000000002,
    "predict_rouge-l": 5.708255999999999,
    "predict_runtime": 1741.1326,
    "predict_samples_per_second": 0.029,
    "predict_steps_per_second": 0.029
}

System Info

transformers version: 4.40.0
Platform: Linux-5.14.0-284.40.1.el9_2.x86_64-x86_64-with-glibc2.31
Python version: 3.12.1
Huggingface_hub version: 0.22.2
Safetensors version: 0.4.3
Accelerate version: 0.30.0
Accelerate config: not found
PyTorch version (GPU?): 2.2.2+cu121 (True)
Tensorflow version (GPU?): not installed (NA)
Flax version (CPU?/GPU?/TPU?): not installed (NA)
Jax version: not installed
JaxLib version: not installed
Using GPU in script?: v100
Using distributed or parallel set-up in script?: no

Others

The only changes I made to the original badam example are the dataset I am using to fine-tune, the location of the val_size variable in the YAML file, and pure_bf16. Since I am using a v100, my machine does not support bf_16, and when I try to just use fp16: true, I get an error ValueError: Attempting to unscale FP16 gradients. I tried fixing this error by switching to peft=0.6.0, but then llmtuner doesn't work(ImportError: peft>=0.10.0 is required for a normal functioning of this module, but found peft==0.6.0.). bf16: false allows the fine-tuning script to work, but I worry its not doing the fine-tuning correctly. Any help or advice would be greatly appreciated!

The text was updated successfully, but these errors were encountered:

Ledzy · 2024-05-16T13:28:04Z

Hi @Jiminator , as only 1000 samples are used and batch size is 8, setting "badam_switch_interval=50" will only update 8 blocks (1000*3/(8*50)=7.5), while Llama 3-8B has 32 block when using layer-wise partition. You should try increasing the training epochs or reduce switch interval (yet we suggest to set it larger than 20) to ensure every blocks are trained. Usually, setting "badam_switch_mode" to be "ascending" or "random" yields faster convergence at the begining.

As finetuning Llama 3-8B requires no more than 24GB memory and V100 has 32GB memory, you can alternatively set each trainable block to be larger, e.g. contain 2 or more layers (instead of single layer). You can follow the instruction in https://github.com/Ledzy/BAdam?tab=readme-ov-file#partition-by-module to set block partition (This requires modifying the _create_badam_optimizer function a bit), which will yield faster convergence as the increased training parameter offers larger parameter search space.

Alternatively, you can try "--badam_mode ratio" with proper "badam_update_ratio" that fits into your memory limit.

hiyouga · 2024-05-16T14:00:08Z

You should use template: default for Llama-3-base models, the llama3 template should only be used for the instruct models

Jiminator · 2024-05-16T18:09:39Z

@Ledzy ty for the tip, should the badam example template be adjusted in the repository to reflect this?

@hiyouga TYSM! I am guessing the template change needs to be done for both predict and train yaml files, correct?

hiyouga · 2024-05-16T18:24:54Z

@Jiminator Exactly, be aware of using same template in training and inference

jinec · 2024-06-25T02:30:25Z

我用的是 qwen模型来微调 qwen1.5的大模型，验证集的rouge同样很低，最高才10几分。可是等我训练好后，再次单独去推理，rouge是80分以上。。。很理解不了到底是为啥。。肯定是哪里发生了重大的变化。

hiyouga added a commit that referenced this issue May 16, 2024

update badam example #3764

e5bba7c

hiyouga added the solved This problem has been already solved label May 16, 2024

hiyouga closed this as completed May 19, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Rouge-score results are surprisingly low #3764

Rouge-score results are surprisingly low #3764

Jiminator commented May 15, 2024

Ledzy commented May 16, 2024 •

edited

Loading

hiyouga commented May 16, 2024

Jiminator commented May 16, 2024 •

edited

Loading

hiyouga commented May 16, 2024

jinec commented Jun 25, 2024

Rouge-score results are surprisingly low #3764

Rouge-score results are surprisingly low #3764

Comments

Jiminator commented May 15, 2024

Reminder

Reproduction

Expected behavior

System Info

Others

Ledzy commented May 16, 2024 • edited Loading

hiyouga commented May 16, 2024

Jiminator commented May 16, 2024 • edited Loading

hiyouga commented May 16, 2024

jinec commented Jun 25, 2024

Ledzy commented May 16, 2024 •

edited

Loading

Jiminator commented May 16, 2024 •

edited

Loading