
Qwen2-VL SFT on a large dataset: pyarrow error not resolved by jsonl format, splitting into smaller datasets, or streaming #5331

Closed
zhang122994917 opened this issue Sep 2, 2024 · 4 comments · Fixed by #5346
Labels
solved This problem has been already solved

Comments

@zhang122994917

Reminder

  • I have read the README and searched the existing issues.

System Info

  • llamafactory version: 0.8.4.dev0
  • Platform: Linux-4.18.0
  • Python version: 3.8.10
  • PyTorch version: 2.4.0+cu121 (GPU)
  • Transformers version: 4.45.0.dev0
  • Datasets version: 2.21.0
  • Accelerate version: 0.33.0
  • PEFT version: 0.12.0
  • TRL version: 0.9.6
  • GPU type: NVIDIA A800-SXM4-80GB
  • DeepSpeed version: 0.12.4

Reproduction

Command:
torchrun $DISTRIBUTED_ARGS src/train.py \
    --deepspeed $DS_CONFIG_PATH \
    --stage sft \
    --do_train \
    --model_name_or_path "./model/Qwen2-VL-7B-Instruct" \
    --dataset my_dataset \
    --template qwen2_vl \
    --finetuning_type lora \
    --output_dir $OUTPUT_PATH \
    --overwrite_cache \
    --overwrite_output_dir \
    --warmup_steps 100 \
    --weight_decay 0.1 \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --ddp_timeout 9000 \
    --learning_rate 5e-6 \
    --lr_scheduler_type cosine \
    --logging_steps 1 \
    --cutoff_len 4096 \
    --save_steps 1000 \
    --plot_loss \
    --num_train_epochs 3 \
    --preprocessing_num_workers 10 \
    --bf16

Error:


Traceback (most recent call last):
[rank0]: File "/usr/local/lib/python3.8/dist-packages/multiprocess/pool.py", line 125, in worker
[rank0]: result = (True, func(*args, **kwds))
[rank0]: File "/usr/local/lib/python3.8/dist-packages/datasets/utils/py_utils.py", line 678, in _write_generator_to_queue
[rank0]: for i, result in enumerate(func(**kwargs)):
[rank0]: File "/usr/local/lib/python3.8/dist-packages/datasets/arrow_dataset.py", line 3581, in _map_single
[rank0]: writer.write_batch(batch)
[rank0]: File "/usr/local/lib/python3.8/dist-packages/datasets/arrow_writer.py", line 572, in write_batch
[rank0]: self.write_table(pa_table, writer_batch_size)
[rank0]: File "/usr/local/lib/python3.8/dist-packages/datasets/arrow_writer.py", line 584, in write_table
[rank0]: pa_table = pa_table.combine_chunks()
[rank0]: File "pyarrow/table.pxi", line 4387, in pyarrow.lib.Table.combine_chunks
[rank0]: File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
[rank0]: File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
[rank0]: pyarrow.lib.ArrowInvalid: offset overflow while concatenating arrays

Converting the dataset to jsonl format and splitting it into smaller datasets did not resolve the issue.

Expected behavior

No response

Others

No response

@github-actions github-actions bot added the pending This problem is yet to be addressed label Sep 2, 2024
@hiyouga
Owner

hiyouga commented Sep 2, 2024

Try disabling preprocessing_num_workers.

@zhang122994917
Author

It still errored after disabling preprocessing_num_workers. It was resolved by lowering the batch_size in the "Running tokenizer on dataset" map call:

```python
if not data_args.streaming:
    kwargs = dict(
        num_proc=data_args.preprocessing_num_workers,
        load_from_cache_file=(not data_args.overwrite_cache) or (training_args.local_process_index != 0),
        desc="Running tokenizer on dataset",
    )

dataset = dataset.map(preprocess_func, batched=True, batch_size=128, remove_columns=column_names, **kwargs)
```
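For context, plain pyarrow string/binary columns use 32-bit offsets, so a single Arrow chunk that accumulates more than roughly 2 GB of data fails with exactly this "offset overflow while concatenating arrays" error; multimodal rows (e.g. image pixel values) can reach that limit quickly, which is likely why a vision SFT dataset hits it. A standalone sketch of the same workaround outside LLaMA-Factory (the preprocess_func and toy data here are hypothetical stand-ins, and writer_batch_size is an extra knob not in the snippet above):

```python
# Minimal sketch: keep each processed/written batch small so no single Arrow
# chunk's string offsets exceed the 32-bit (~2 GB) limit.
from datasets import Dataset

def preprocess_func(batch):
    # stand-in for the real tokenizer/processor; in the real pipeline each row
    # also carries large pixel_values, which is what inflates the chunks
    return {"n_chars": [len(t) for t in batch["text"]]}

dataset = Dataset.from_dict({"text": ["example text"] * 1024})
dataset = dataset.map(
    preprocess_func,
    batched=True,
    batch_size=128,          # fewer rows per processed batch
    writer_batch_size=128,   # fewer rows per Arrow record batch on disk
    remove_columns=["text"],
    desc="Running tokenizer on dataset",
)
print(dataset)
```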

@huynhbaobk

> It still errored after disabling preprocessing_num_workers. It was resolved by lowering the batch_size in the "Running tokenizer on dataset" map call.

Hi @zhang122994917, did you fine-tune Qwen2-VL successfully? I have a problem with a large dataset: I configured it the same way as you, but it overflows RAM because the entire dataset is loaded, and when I switch to streaming it still gets stuck.
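For reference, a minimal datasets-level sketch of what streaming changes (the my_dataset.jsonl path is hypothetical): examples are yielded lazily instead of the whole table being materialized in RAM, though the Trainer then typically needs max_steps to be set, since the epoch length of an iterable dataset is unknown.

```python
# Hedged sketch: streaming reads examples one by one; nothing is cached to Arrow up front.
from datasets import load_dataset

stream = load_dataset("json", data_files="my_dataset.jsonl", split="train", streaming=True)
for i, example in enumerate(stream):
    print(example.keys())
    if i == 2:
        break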

@hiyouga
Owner

hiyouga commented Sep 3, 2024

fixed in #5346

@hiyouga hiyouga added solved This problem has been already solved and removed pending This problem is yet to be addressed labels Sep 3, 2024