OOM when loading large JSON files in v2 #985
Comments
Can you share the dataset, or a similar dataset, and the model in question?

I am not sure if I can, but I can check and let you know.

Yeah, I've seen this too. A partial answer will be the new jiter JSON parser, but even that currently requires reading the entire JSON into memory; still, maybe just one or two copies in memory is fine.

This is very likely also related to PyO3/pyo3#3382 / PyO3/pyo3#1056.

@sydney-runkle err, the memray flamegraphs make it pretty apparent this is an issue?

We have long since updated to PyO3 0.21; we don't expect there to be a significant issue here any more. I'll close this one; @lattwood, if you have a new example / repro, can you please open a new issue?
Overview
I am trying to parse a ~2.5 GB JSON data file containing a list of lists of data (think Array of Array of Structs). Using the recommended approach of `model_validate_json(f.read())` results in the OS SIGKILL-ing the process because it runs out of memory. In comparison, Python's `json` module parses it effortlessly.

For a bit of detail, I profiled the code using the snippets below with `memray`, and am attaching the HTML flame graph files as TXT for ease of use (and because GitHub doesn't allow HTML files as attachments but allows PPTX..). I wasn't able to dig deeper into the issue (due to lack of time), but it is possible that it is related to #843; I could be very wrong (hence the new issue).
Vanilla `json`
memray-flamegraph-test-json.py.107113.html.txt
This approach uses about 8.8 GB of memory: roughly 6 GB for parsing and the rest for the string data buffer.
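For reference, the stdlib baseline is just a plain `json.loads` of the whole file. The record shape below (`x`/`y` fields) is a hypothetical stand-in, since the real schema isn't shown in this issue:

```python
import json

# Hypothetical stand-in for the ~2.5 GB file: a list of lists of records.
# The real dataset's schema is not included in the issue.
raw = '[[{"x": 1, "y": 2}, {"x": 3, "y": 4}], [{"x": 5, "y": 6}]]'

# Single stdlib parse of the full document held in memory as a string.
data = json.loads(raw)
print(len(data), data[0][0]["x"])
```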
Pydantic recommended API
memray-flamegraph-test-pydantic.py.131233.html.txt
This gets SIGKILLed by the OS after consuming ~23 GB while parsing the 2.5 GB file.
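The failing path can be sketched as follows, again with a hypothetical `Record` schema (the real model isn't shown); for list-of-lists data, a `TypeAdapter` is the usual pydantic v2 entry point:

```python
from pydantic import BaseModel, TypeAdapter


class Record(BaseModel):  # hypothetical schema; the real model isn't shown
    x: int
    y: int


adapter = TypeAdapter(list[list[Record]])

raw = '[[{"x": 1, "y": 2}], [{"x": 3, "y": 4}]]'
# Validating the raw JSON string directly; on the real 2.5 GB file,
# this is the call that consumed ~23 GB before being SIGKILLed.
records = adapter.validate_json(raw)
print(records[0][0].x)
```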
Pydantic second approach
This uses the "non-recommended" approach from pydantic/pydantic#7323
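A sketch of that workaround, assuming the same hypothetical `Record` schema: parse with stdlib `json` first, then have pydantic validate the resulting Python objects instead of the raw JSON string:

```python
import json

from pydantic import BaseModel, TypeAdapter


class Record(BaseModel):  # hypothetical schema; the real model isn't shown
    x: int
    y: int


adapter = TypeAdapter(list[list[Record]])

raw = '[[{"x": 1, "y": 2}], [{"x": 3, "y": 4}]]'
# Let the stdlib do the JSON parsing, then validate the Python objects.
records = adapter.validate_python(json.loads(raw))
print(records[0][0].x, records[1][0].y)
```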
memray-flamegraph-test-pydantic2.py.130581.html.txt
Interestingly enough, this method successfully parses the dataset, and much faster than the direct approach of using `model_validate_json`.

System Information
uname -srvmo
Linux 5.15.0-84-generic #93~20.04.1-Ubuntu SMP Wed Sep 6 16:15:40 UTC 2023 x86_64 GNU/Linux
Pydantic versions:
pydantic==2.3.0
pydantic_core==2.6.3, installed as: -e git+https://github.com/pydantic/pydantic-core@c086caec1a200417f19850244282c06b5d4d1650#egg=pydantic_core