
Qwen2-VL SFT on a large dataset: pyarrow error not resolved by jsonl format, splitting into smaller datasets, or streaming #5331

Closed
zhang122994917 opened this issue Sep 2, 2024 · 4 comments · Fixed by #5346
Labels
solved This problem has been already solved

Comments

@zhang122994917

Reminder

  • I have read the README and searched the existing issues.

System Info

  • llamafactory version: 0.8.4.dev0
  • Platform: Linux-4.18.0
  • Python version: 3.8.10
  • PyTorch version: 2.4.0+cu121 (GPU)
  • Transformers version: 4.45.0.dev0
  • Datasets version: 2.21.0
  • Accelerate version: 0.33.0
  • PEFT version: 0.12.0
  • TRL version: 0.9.6
  • GPU type: NVIDIA A800-SXM4-80GB
  • DeepSpeed version: 0.12.4

Reproduction

Command:
torchrun $DISTRIBUTED_ARGS src/train.py \
    --deepspeed $DS_CONFIG_PATH \
    --stage sft \
    --do_train \
    --model_name_or_path "./model/Qwen2-VL-7B-Instruct" \
    --dataset my_dataset \
    --template qwen2_vl \
    --finetuning_type lora \
    --output_dir $OUTPUT_PATH \
    --overwrite_cache \
    --overwrite_output_dir \
    --warmup_steps 100 \
    --weight_decay 0.1 \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --ddp_timeout 9000 \
    --learning_rate 5e-6 \
    --lr_scheduler_type cosine \
    --logging_steps 1 \
    --cutoff_len 4096 \
    --save_steps 1000 \
    --plot_loss \
    --num_train_epochs 3 \
    --preprocessing_num_workers 10 \
    --bf16

Error:


Traceback (most recent call last):
[rank0]: File "/usr/local/lib/python3.8/dist-packages/multiprocess/pool.py", line 125, in worker
[rank0]: result = (True, func(*args, **kwds))
[rank0]: File "/usr/local/lib/python3.8/dist-packages/datasets/utils/py_utils.py", line 678, in _write_generator_to_queue
[rank0]: for i, result in enumerate(func(**kwargs)):
[rank0]: File "/usr/local/lib/python3.8/dist-packages/datasets/arrow_dataset.py", line 3581, in _map_single
[rank0]: writer.write_batch(batch)
[rank0]: File "/usr/local/lib/python3.8/dist-packages/datasets/arrow_writer.py", line 572, in write_batch
[rank0]: self.write_table(pa_table, writer_batch_size)
[rank0]: File "/usr/local/lib/python3.8/dist-packages/datasets/arrow_writer.py", line 584, in write_table
[rank0]: pa_table = pa_table.combine_chunks()
[rank0]: File "pyarrow/table.pxi", line 4387, in pyarrow.lib.Table.combine_chunks
[rank0]: File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
[rank0]: File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
[rank0]: pyarrow.lib.ArrowInvalid: offset overflow while concatenating arrays

Converting the dataset to jsonl format and splitting it into smaller datasets did not resolve the issue.

Expected behavior

No response

Others

No response

@github-actions github-actions bot added the pending This problem is yet to be addressed label Sep 2, 2024
@hiyouga
Owner

hiyouga commented Sep 2, 2024

Try disabling preprocessing_num_workers.

@zhang122994917
Author

It still errored after disabling preprocessing_num_workers. It was resolved by lowering the batch_size in the "Running tokenizer on dataset" map call:

```python
if not data_args.streaming:
    kwargs = dict(
        num_proc=data_args.preprocessing_num_workers,
        load_from_cache_file=(not data_args.overwrite_cache) or (training_args.local_process_index != 0),
        desc="Running tokenizer on dataset",
    )

dataset = dataset.map(preprocess_func, batched=True, batch_size=128, remove_columns=column_names, **kwargs)
```
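For context, plain pyarrow string/binary columns use 32-bit offsets, so a single Arrow chunk that accumulates more than roughly 2 GB of data fails with exactly this "offset overflow while concatenating arrays" error; multimodal rows (e.g. image pixel values) can reach that limit quickly, which is likely why a vision SFT dataset hits it. A standalone sketch of the same workaround outside LLaMA-Factory (the preprocess_func and toy data here are hypothetical stand-ins, and writer_batch_size is an extra knob not in the snippet above):

```python
# Minimal sketch: keep each processed/written batch small so no single Arrow
# chunk's string offsets exceed the 32-bit (~2 GB) limit.
from datasets import Dataset

def preprocess_func(batch):
    # stand-in for the real tokenizer/processor; in the real pipeline each row
    # also carries large pixel_values, which is what inflates the chunks
    return {"n_chars": [len(t) for t in batch["text"]]}

dataset = Dataset.from_dict({"text": ["example text"] * 1024})
dataset = dataset.map(
    preprocess_func,
    batched=True,
    batch_size=128,          # fewer rows per processed batch
    writer_batch_size=128,   # fewer rows per Arrow record batch on disk
    remove_columns=["text"],
    desc="Running tokenizer on dataset",
)
print(dataset)
```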

@huynhbaobk

> It still errored after disabling preprocessing_num_workers. It was resolved by lowering the batch_size in the "Running tokenizer on dataset" map call.

Hi @zhang122994917, did you fine-tune Qwen2-VL successfully? I have a problem with a large dataset: I configured it the same way as you, but it overflows RAM because the entire dataset is loaded, and when I switch to streaming it still gets stuck.
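For reference, a minimal datasets-level sketch of what streaming changes (the my_dataset.jsonl path is hypothetical): examples are yielded lazily instead of the whole table being materialized in RAM, though the Trainer then typically needs max_steps to be set, since the epoch length of an iterable dataset is unknown.

```python
# Hedged sketch: streaming reads examples one by one; nothing is cached to Arrow up front.
from datasets import load_dataset

stream = load_dataset("json", data_files="my_dataset.jsonl", split="train", streaming=True)
for i, example in enumerate(stream):
    print(example.keys())
    if i == 2:
        break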

@hiyouga
Owner

hiyouga commented Sep 3, 2024

fixed in #5346

@hiyouga hiyouga added solved This problem has been already solved and removed pending This problem is yet to be addressed labels Sep 3, 2024