Excessive RAM Usage After Dataset Concatenation concatenate_datasets #7373

sam-hey · 2025-01-16T16:33:10Z

Describe the bug

When loading a dataset from disk, concatenating it, and starting the training process, the RAM usage progressively increases until the kernel terminates the process due to excessive memory consumption.

#2276

Steps to reproduce the bug

from datasets import DatasetDict, concatenate_datasets

dataset = DatasetDict.load_from_disk("data")

...
...

combined_dataset = concatenate_datasets(
    [dataset[split] for split in dataset]
)

# Start SentenceTransformer training

Expected behavior

I would not expect RAM utilization to increase after concatenation. Removing the concatenation step resolves the issue.
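
For anyone trying to confirm the difference, here is a rough sketch that samples the process's peak memory while iterating over a dataset. It uses only the standard library (the `resource` module, so Unix only). The helper names `peak_rss_mb` and `track_peak_rss` are made up for illustration and are not part of `datasets`. Plain iteration is only a stand-in for the training loop, so it may not trigger the same growth.

```python
import resource
import sys


def peak_rss_mb() -> float:
    """Return the peak resident set size of this process in MiB (Unix only)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def track_peak_rss(rows, every=1000):
    """Iterate over `rows` and record (row index, peak RSS in MiB) every `every` rows."""
    samples = []
    for i, _ in enumerate(rows):
        if i % every == 0:
            samples.append((i, peak_rss_mb()))
    return samples
```

Calling `track_peak_rss(combined_dataset)` and, in a fresh process, `track_peak_rss(dataset[split])` for a single split should show whether the peak keeps climbing only in the concatenated case. Since the value is a peak, it never decreases, so the slope is what matters.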

Environment info

sentence-transformers==3.1.1
datasets==3.2.0

Python 3.10


sam-hey · 2025-01-17T07:54:21Z

Adding an image from memray:
https://gist.github.com/sam-hey/00c958f13fb0f7b54d17197fe353002f
