Qwen2-VL SFT on a large dataset: pyarrow error; converting to jsonl, splitting into smaller datasets, and streaming do not solve it #5331
Comments
Try disabling preprocessing_num_workers.
Disabling preprocessing_num_workers still gave the error; it was resolved later by lowering the batch_size used in the "Running tokenizer" map step.
Hi @zhang122994917, did you finetune Qwen2-VL successfully? I ran into a problem with a large dataset: I configured it the same way as you, but RAM overflows because the entire dataset is loaded, and when I switched to streaming it still got stuck.
Fixed in #5346
Reminder
System Info
llamafactory version: 0.8.4.dev0
Reproduction
Command:
torchrun $DISTRIBUTED_ARGS src/train.py \
    --deepspeed $DS_CONFIG_PATH \
    --stage sft \
    --do_train \
    --model_name_or_path "./model/Qwen2-VL-7B-Instruct" \
    --dataset my_dataset \
    --template qwen2_vl \
    --finetuning_type lora \
    --output_dir $OUTPUT_PATH \
    --overwrite_cache \
    --overwrite_output_dir \
    --warmup_steps 100 \
    --weight_decay 0.1 \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --ddp_timeout 9000 \
    --learning_rate 5e-6 \
    --lr_scheduler_type cosine \
    --logging_steps 1 \
    --cutoff_len 4096 \
    --save_steps 1000 \
    --plot_loss \
    --num_train_epochs 3 \
    --preprocessing_num_workers 10 \
    --bf16
Exception:
Traceback (most recent call last):
[rank0]: File "/usr/local/lib/python3.8/dist-packages/multiprocess/pool.py", line 125, in worker
[rank0]: result = (True, func(*args, **kwds))
[rank0]: File "/usr/local/lib/python3.8/dist-packages/datasets/utils/py_utils.py", line 678, in _write_generator_to_queue
[rank0]: for i, result in enumerate(func(**kwargs)):
[rank0]: File "/usr/local/lib/python3.8/dist-packages/datasets/arrow_dataset.py", line 3581, in _map_single
[rank0]: writer.write_batch(batch)
[rank0]: File "/usr/local/lib/python3.8/dist-packages/datasets/arrow_writer.py", line 572, in write_batch
[rank0]: self.write_table(pa_table, writer_batch_size)
[rank0]: File "/usr/local/lib/python3.8/dist-packages/datasets/arrow_writer.py", line 584, in write_table
[rank0]: pa_table = pa_table.combine_chunks()
[rank0]: File "pyarrow/table.pxi", line 4387, in pyarrow.lib.Table.combine_chunks
[rank0]: File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
[rank0]: File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
[rank0]: pyarrow.lib.ArrowInvalid: offset overflow while concatenating arrays
Converting the dataset to jsonl format and splitting it into smaller datasets do not solve the problem.
Expected behavior
No response
Others
No response