OOM when loading large JSON files in v2 #985
Comments
Can you share the dataset, or a similar dataset, and the model in question?

I am not sure if I can, but I can check and let you know.

Yeah, I've seen this too. A partial answer will be the new jiter JSON parser, but even that currently requires reading the entire JSON into memory; still, maybe just one or two copies in memory is fine.

This is very likely also related to PyO3/pyo3#3382 / PyO3/pyo3#1056.

@sydney-runkle err, the memray flamegraphs make it pretty apparent this is an issue?

We have long since updated to PyO3 0.21; we don't expect there to be a significant issue here any more. I'll close this one; @lattwood, if you have a new example / repro, can you please open a new issue?
Overview
I am trying to parse a ~2.5 GB JSON data file containing a list of lists of data (think Array of Array of Structs). Using the recommended approach of `model_validate_json(f.read())` results in the OS SIGKILL-ing the process because it runs out of memory. In comparison, Python's `json` module parses it effortlessly.

For a bit of detail, I profiled the code using the snippets below with `memray`, and am attaching the HTML flame graph files as TXT for ease of use (and because GitHub doesn't allow HTML files as attachments but allows PPTX..). I wasn't able to dig deeper into the issue (due to lack of time), but it is possible that it is related to #843; I could be very wrong (hence the new issue).
Vanilla `json`
memray-flamegraph-test-json.py.107113.html.txt
This approach uses about 8.8 GB of memory: roughly 6 GB for parsing and the rest for the string data buffer.
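For reference, the stdlib baseline is just a plain `json.loads` of the whole file. The record shape below (`x`/`y` fields) is a hypothetical stand-in, since the real schema isn't shown in this issue:

```python
import json

# Hypothetical stand-in for the ~2.5 GB file: a list of lists of records.
# The real dataset's schema is not included in the issue.
raw = '[[{"x": 1, "y": 2}, {"x": 3, "y": 4}], [{"x": 5, "y": 6}]]'

# Single stdlib parse of the full document held in memory as a string.
data = json.loads(raw)
print(len(data), data[0][0]["x"])
```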
Pydantic recommended API
memray-flamegraph-test-pydantic.py.131233.html.txt
This gets SIGKILLed by the OS after consuming ~23 GB while parsing the 2.5 GB file.
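The failing path can be sketched as follows, again with a hypothetical `Record` schema (the real model isn't shown); for list-of-lists data, a `TypeAdapter` is the usual pydantic v2 entry point:

```python
from pydantic import BaseModel, TypeAdapter


class Record(BaseModel):  # hypothetical schema; the real model isn't shown
    x: int
    y: int


adapter = TypeAdapter(list[list[Record]])

raw = '[[{"x": 1, "y": 2}], [{"x": 3, "y": 4}]]'
# Validating the raw JSON string directly; on the real 2.5 GB file,
# this is the call that consumed ~23 GB before being SIGKILLed.
records = adapter.validate_json(raw)
print(records[0][0].x)
```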
Pydantic second approach
This uses the "non-recommended" approach from pydantic/pydantic#7323
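A sketch of that workaround, assuming the same hypothetical `Record` schema: parse with stdlib `json` first, then have pydantic validate the resulting Python objects instead of the raw JSON string:

```python
import json

from pydantic import BaseModel, TypeAdapter


class Record(BaseModel):  # hypothetical schema; the real model isn't shown
    x: int
    y: int


adapter = TypeAdapter(list[list[Record]])

raw = '[[{"x": 1, "y": 2}], [{"x": 3, "y": 4}]]'
# Let the stdlib do the JSON parsing, then validate the Python objects.
records = adapter.validate_python(json.loads(raw))
print(records[0][0].x, records[1][0].y)
```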
memray-flamegraph-test-pydantic2.py.130581.html.txt
Interestingly enough, this method successfully parses the dataset, and much faster than the direct approach of using `model_validate_json`.

System Information
uname -srvmo
Linux 5.15.0-84-generic #93~20.04.1-Ubuntu SMP Wed Sep 6 16:15:40 UTC 2023 x86_64 GNU/Linux
Pydantic versions:
pydantic==2.3.0
pydantic_core==2.6.3, installed as: -e git+https://github.com/pydantic/pydantic-core@c086caec1a200417f19850244282c06b5d4d1650#egg=pydantic_core