
LLaVA 1.5 7B multimodal fine-tuning error: IndexError: boolean index did not match indexed array along dimension 0 #4826

Closed
1 task done
DDYuudachi opened this issue Jul 15, 2024 · 3 comments
Labels: solved (This problem has been already solved)

Comments

@DDYuudachi

DDYuudachi commented Jul 15, 2024

Reminder

  • I have read the README and searched the existing issues.

System Info

I built my own dataset following mllm_demo.json, and fine-tuning crashes right after the first evaluation completes.
The error log is below.
Note: if I roll back to the May 31 git version, training runs normally, but every version since around June 11 seems to crash at the first evaluation.

[INFO|trainer.py:641] 2024-07-15 12:45:45,622 >> Using auto half precision backend
[INFO|trainer.py:2078] 2024-07-15 12:45:47,405 >> ***** Running training *****
[INFO|trainer.py:2079] 2024-07-15 12:45:47,406 >>   Num examples = 102
[INFO|trainer.py:2080] 2024-07-15 12:45:47,406 >>   Num Epochs = 200
[INFO|trainer.py:2081] 2024-07-15 12:45:47,406 >>   Instantaneous batch size per device = 1
[INFO|trainer.py:2084] 2024-07-15 12:45:47,406 >>   Total train batch size (w. parallel, distributed & accumulation) = 32
[INFO|trainer.py:2085] 2024-07-15 12:45:47,406 >>   Gradient Accumulation steps = 8
[INFO|trainer.py:2086] 2024-07-15 12:45:47,406 >>   Total optimization steps = 600
[INFO|trainer.py:2087] 2024-07-15 12:45:47,412 >>   Number of trainable parameters = 4,194,304
  0%|          | 0/600 [00:00<?, ?it/s]
/home/asus/.local/lib/python3.10/site-packages/torch/utils/checkpoint.py:91: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
  warnings.warn(
(the same UserWarning is repeated by the other three ranks)
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
[WARNING|logging.py:329] 2024-07-15 12:45:48,766 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
(the same `use_cache` warning is repeated on the remaining ranks)
{'loss': 1.7298, 'grad_norm': 0.3299783170223236, 'learning_rate': 9.993281765806417e-05, 'epoch': 3.08}
{'loss': 1.6045, 'grad_norm': 0.1940723955631256, 'learning_rate': 9.972873416811953e-05, 'epoch': 6.15}
  4%|▊         | 25/600 [02:08<48:30,  5.06s/it]
[INFO|trainer.py:3719] 2024-07-15 12:47:55,912 >> ***** Running Evaluation *****
[INFO|trainer.py:3721] 2024-07-15 12:47:55,912 >>   Num examples = 12
[INFO|trainer.py:3724] 2024-07-15 12:47:55,913 >>   Batch size = 1
[rank2]: Traceback (most recent call last):
[rank2]:   File "/media/asus/DATA/user/LLaMA-Factory-main/src/llamafactory/launcher.py", line 23, in <module>
[rank2]:     launch()
[rank2]:   File "/media/asus/DATA/user/LLaMA-Factory-main/src/llamafactory/launcher.py", line 19, in launch
[rank2]:     run_exp()
[rank2]:   File "/media/asus/DATA/user/LLaMA-Factory-main/src/llamafactory/train/tuner.py", line 50, in run_exp
[rank2]:     run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
[rank2]:   File "/media/asus/DATA/user/LLaMA-Factory-main/src/llamafactory/train/sft/workflow.py", line 90, in run_sft
[rank2]:     train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
[rank2]:   File "/home/asus/.local/lib/python3.10/site-packages/transformers/trainer.py", line 1885, in train
[rank2]:     return inner_training_loop(
[rank2]:   File "/home/asus/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2291, in _inner_training_loop
[rank2]:     self._maybe_log_save_evaluate(tr_loss, grad_norm, model, trial, epoch, ignore_keys_for_eval)
[rank2]:   File "/home/asus/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2721, in _maybe_log_save_evaluate
[rank2]:     metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
[rank2]:   File "/home/asus/.local/lib/python3.10/site-packages/transformers/trainer_seq2seq.py", line 180, in evaluate
[rank2]:     return super().evaluate(eval_dataset, ignore_keys=ignore_keys, metric_key_prefix=metric_key_prefix)
[rank2]:   File "/home/asus/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3572, in evaluate
[rank2]:     output = eval_loop(
[rank2]:   File "/home/asus/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3854, in evaluation_loop
[rank2]:     metrics = self.compute_metrics(EvalPrediction(predictions=all_preds, label_ids=all_labels))
[rank2]:   File "/media/asus/DATA/user/LLaMA-Factory-main/src/llamafactory/train/sft/metric.py", line 52, in compute_accuracy
[rank2]:     accuracies.append(np.mean(pred[label_mask] == label[label_mask]))
[rank2]: IndexError: boolean index did not match indexed array along dimension 0; dimension is 950 but corresponding boolean dimension is 375
[rank3], [rank1], [rank0]: (identical tracebacks: each rank fails in compute_accuracy at metric.py line 52 with the same IndexError: boolean index did not match indexed array along dimension 0; dimension is 950 but corresponding boolean dimension is 375)
  4%|▊         | 25/600 [02:10<49:51,  5.20s/it]
E0715 12:48:02.767000 139636999505728 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 3553573) of binary: /home/asus/anaconda3/envs/llama_factory/bin/python
Traceback (most recent call last):
  File "/home/asus/anaconda3/envs/llama_factory/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/asus/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/home/asus/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/home/asus/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/home/asus/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/asus/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/media/asus/DATA/user/LLaMA-Factory-main/src/llamafactory/launcher.py FAILED
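For context, the failing line in metric.py is a plain NumPy boolean-mask lookup, and the crash can be reproduced in isolation. A minimal sketch (the shapes 950 and 375 are taken from the traceback above; the variable names mirror the ones in the error, not the actual llamafactory code):

```python
import numpy as np

# Hypothetical shapes from the traceback: predictions were gathered/padded
# to length 950, while the labels kept their original length 375.
pred = np.zeros(950, dtype=np.int64)
label = np.full(375, -100, dtype=np.int64)
label[:10] = 1

# Boolean mask built from labels, shape (375,)
label_mask = label != -100

try:
    # NumPy requires a boolean index to match the indexed dimension exactly.
    pred[label_mask]
except IndexError as e:
    print(e)  # boolean index did not match indexed array along dimension 0
```

This suggests the predictions and labels returned by the evaluation loop no longer share a sequence length for multimodal inputs, which matches the observation that text-only LoRA runs are unaffected.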

Reproduction

These are the fine-tuning arguments:

### model
model_name_or_path: /media/asus/DATA/user/models/llava-v1.5-7b
visual_inputs: true

### method
stage: sft
do_train: true
finetuning_type: lora
lora_target: q_proj,v_proj

### dataset
dataset: image_lib
template: vicuna
cutoff_len: 1024
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: saves/llava1_5-7b/lora/image_lib
logging_steps: 10
save_steps: 25
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 0.0001
num_train_epochs: 200.0
lr_scheduler_type: cosine
warmup_steps: 0.1
fp16: true

### eval
val_size: 0.1
per_device_eval_batch_size: 1
evaluation_strategy: steps
eval_steps: 25

Expected behavior

No response

Others

No response

@github-actions github-actions bot added the pending This problem is yet to be addressed label Jul 15, 2024
@DDYuudachi DDYuudachi changed the title from "LLaVA 1.5 7B multimodal fine-tuning error" to "LLaVA 1.5 7B multimodal fine-tuning error: IndexError: boolean index did not match indexed array along dimension 0" Jul 15, 2024
@BUAADreamer BUAADreamer self-assigned this Jul 15, 2024
@BUAADreamer
Collaborator

Which transformers version are you using?

@BUAADreamer
Collaborator

Tested with the latest llamafactory and transformers==4.42.4 (the latest); no problem.

@BUAADreamer BUAADreamer added solved This problem has been already solved and removed pending This problem is yet to be addressed labels Jul 15, 2024
@DDYuudachi
Author

> Tested with the latest llamafactory and transformers==4.42.4 (the latest); no problem.

Hi, I installed the latest llamafactory plus transformers==4.42.4 and it still errors out during the first evaluation. The problem now only appears with multimodal LoRA fine-tuning; other LoRA fine-tuning runs fine.
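As a stopgap while waiting for the upstream fix (this is a hypothetical sketch, not the actual llamafactory code), one could make the accuracy metric tolerant of length mismatches by aligning each prediction/label pair to a common length before masking:

```python
import numpy as np

IGNORE_INDEX = -100  # label padding value assumed from the error context

def compute_accuracy_safe(preds, labels):
    """Token accuracy that truncates each pred/label pair to a shared length.

    Workaround only: it hides the symptom (mismatched lengths), not the
    root cause in the multimodal evaluation path.
    """
    accuracies = []
    for pred, label in zip(preds, labels):
        n = min(len(pred), len(label))
        pred_arr = np.asarray(pred[:n])
        label_arr = np.asarray(label[:n])
        label_mask = label_arr != IGNORE_INDEX  # ignore padded positions
        if label_mask.any():
            accuracies.append(np.mean(pred_arr[label_mask] == label_arr[label_mask]))
    return {"accuracy": float(np.mean(accuracies))}
```

Note this silently drops the tokens beyond the shared length, so treat the resulting metric as approximate until the shape mismatch itself is fixed.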

Repository owner deleted a comment from DDYuudachi Jul 16, 2024
@BUAADreamer BUAADreamer reopened this Jul 16, 2024
@BUAADreamer BUAADreamer added pending This problem is yet to be addressed and removed solved This problem has been already solved labels Jul 16, 2024
@hiyouga hiyouga added solved This problem has been already solved and removed pending This problem is yet to be addressed labels Jul 16, 2024
@hiyouga hiyouga self-assigned this Jul 16, 2024
xtchen96 pushed a commit to xtchen96/LLaMA-Factory that referenced this issue Jul 17, 2024
3 participants