This repository has been archived by the owner on Feb 18, 2024. It is now read-only.

Add support to read JSON #712

Closed
ritchie46 opened this issue Dec 25, 2021 · 10 comments · Fixed by #758
Labels
enhancement: An improvement to an existing feature
no-changelog: Issues whose changes are covered by a PR and thus should not be shown in the changelog

Comments

@ritchie46
Collaborator

Currently we write and read JSON lines.

https://jsonlines.org/

I believe it would be a minor modification to also be able to read and write JSON. We could wrap the JSON Lines values in an array and separate them with a , instead of a newline character.

E.g. now we write:

JSON Lines

{"c1":"a"}
{"c1":null}

JSON

[{"c1":"a"},
{"c1":null}]

Would this fit the scope of arrow2, as this is something different than what pyarrow does? I don't mean that we should drop the JSON Lines functionality, but that we should also allow reading and writing JSON.
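For illustration, a minimal sketch of that wrapping as plain string manipulation (not arrow2's writer; it assumes each non-empty input line is already one complete NDJSON record):

```rust
/// Wrap newline-delimited JSON records into a single JSON array.
/// Sketch of the format change only; assumes each non-empty line is one
/// complete, valid JSON record.
fn ndjson_to_json_array(ndjson: &str) -> String {
    let rows: Vec<&str> = ndjson
        .lines()
        .filter(|line| !line.trim().is_empty())
        .collect();
    format!("[{}]", rows.join(","))
}

fn main() {
    let ndjson = "{\"c1\":\"a\"}\n{\"c1\":null}\n";
    assert_eq!(ndjson_to_json_array(ndjson), r#"[{"c1":"a"},{"c1":null}]"#);
}
```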

@jorgecarleitao
Owner

Yes, definitely in scope. We currently already support writing JSON like that. IMO we should aim to drop the serde_json dependency, because it would give us more control without the massive performance penalty associated with it (see #709).

@ritchie46
Collaborator Author

> Yes, definitely in scope. We currently already support writing JSON like that. IMO we should aim to drop the serde_json dependency, because it would give us more control without the massive performance penalty associated with it (see #709).

Definitely agree. Serde is awesome because of its generic targets, but hard to optimize in my experience.

@universalmind303
Contributor

If there were support for arrays, note that a lot of libraries support an optional indent parameter. Would it be a lot of extra effort to add something like this (see the sketch below)?

Python has json.dumps(data, indent=2)
JS has JSON.stringify(data, null, 2)
serde_json has PrettyFormatter
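For reference, a small sketch of the serde_json route (this uses serde_json's public API; the two-space indent mirrors the Python example above):

```rust
use serde::Serialize;
use serde_json::ser::PrettyFormatter;

fn main() {
    let data = serde_json::json!([{"c1": "a"}, {"c1": null}]);

    // Pretty-print with a custom indent, analogous to json.dumps(data, indent=2).
    let formatter = PrettyFormatter::with_indent(b"  ");
    let mut out = Vec::new();
    let mut ser = serde_json::Serializer::with_formatter(&mut out, formatter);
    data.serialize(&mut ser).unwrap();

    println!("{}", String::from_utf8(out).unwrap());
}
```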

@jorgecarleitao
Owner

@universalmind303, yes, definitely in scope as well. It needs some piping of the indent level for struct arrays, but IMO nothing blocking.

jorgecarleitao added the enhancement label on Dec 28, 2021
@alamb
Collaborator

alamb commented Jan 2, 2022

arrow supports both json lines and delimited json -- in case it is helpful for API design or implementation:

https://docs.rs/arrow/6.5.0/arrow/json/writer/index.html
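For comparison, a rough sketch of how those two arrow writers are used (based on the linked docs; exact signatures may differ across arrow versions):

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, StringArray};
use arrow::json::writer::{ArrayWriter, LineDelimitedWriter};
use arrow::record_batch::RecordBatch;

fn main() -> arrow::error::Result<()> {
    // A single batch with one nullable utf8 column, mirroring the example above.
    let c1: ArrayRef = Arc::new(StringArray::from(vec![Some("a"), None]));
    let batch = RecordBatch::try_from_iter(vec![("c1", c1)])?;

    // NDJSON: one record per line.
    let mut ndjson = Vec::new();
    {
        let mut writer = LineDelimitedWriter::new(&mut ndjson);
        writer.write_batches(&[batch.clone()])?;
        writer.finish()?;
    }

    // JSON: the same records wrapped in a top-level array.
    let mut json = Vec::new();
    {
        let mut writer = ArrayWriter::new(&mut json);
        writer.write_batches(&[batch])?;
        writer.finish()?;
    }

    println!("{}", String::from_utf8(ndjson).unwrap());
    println!("{}", String::from_utf8(json).unwrap());
    Ok(())
}
```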

@ritchie46
Collaborator Author

> arrow supports both json lines and delimited json -- in case it is helpful for API design or implementation:
>
> https://docs.rs/arrow/6.5.0/arrow/json/writer/index.html

Thanks Andrew!


@jorgecarleitao
Owner

I think that both implement the same writer. The main difference between them at this point is #709.

@jorgecarleitao
Owner

jorgecarleitao commented Jan 8, 2022

My random thoughts on this situation:

Reading

For historical reasons our parser splits the file into lines via \n and .readlines. This means that we do not support JSON, only NDJSON (which forbids newlines inside a record, at the expense of human readability).

When a file can't be split into chunks, we can't read it in chunks. This means that we can't separate IO from CPU and need to read the whole file at once (e.g. via serde_json). To offer an example, consider

[
    {"a": 1, "b": 2},
    {"a": 2, "b": 20},
    {
         "a": 3,
         "b": 30
    },
    {"a": 4, "b": 40}
]

splitting it on \n results in

[
---------
    {"a": 1, "b": 2},
---------
    {"a": 2, "b": 20},
---------
    {
---------
         "a": 3,
---------
         "b": 30
---------
    },
---------
    {"a": 4, "b": 40}
---------
]

i.e. we broke the 3rd record into pieces, thus making individual lines invalid JSON.

What we want in this case is to be able to split it into something like

[
---------
    {"a": 1, "b": 2},
---------
    {"a": 2, "b": 20},
---------
    {
         "a": 3,
         "b": 30
    },
---------
    {"a": 4, "b": 40}
---------
]

AFAIK this is not supported by serde_json, which has an (owned) in-memory format (serde_json::Value) and reads the JSON end-to-end.

The quick hack is to read the whole thing into serde_json::Value. More performant and less memory-intensive approaches are warranted.
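For concreteness, the quick hack amounts to something like this (a sketch only, not the code that ended up in the repository):

```rust
use std::error::Error;
use std::fs;

/// Sketch of the quick hack: read the whole file, parse it end-to-end into a
/// serde_json::Value, and treat each element of the top-level array as a row.
fn read_rows(path: &str) -> Result<Vec<serde_json::Value>, Box<dyn Error>> {
    let contents = fs::read_to_string(path)?;
    let value: serde_json::Value = serde_json::from_str(&contents)?;
    match value {
        serde_json::Value::Array(rows) => Ok(rows),
        other => Ok(vec![other]), // a single record is still valid JSON
    }
}
```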

IMO, for something like this, we need an API similar to that of the csv crate, which offers a parser that can read rows of an array without deserializing the whole content. In this context, the same way CSV must check for " to indicate that the delimiter is part of a string column, the equivalent here is a state machine that ignores , when it is part of a nested struct.

This would allow us to perform minimal CPU work on read, and to support the same APIs we already have for pure CPU work: deserialize(rows: &[Row], &Field) -> Box<dyn Array>
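A minimal sketch of such a state machine (a hypothetical helper, not arrow2's implementation), assuming a well-formed top-level array is available as one in-memory string:

```rust
/// Split the body of a top-level JSON array into its comma-separated records
/// without deserializing them. It ignores ',' inside strings and inside
/// nested objects/arrays, so multi-line records stay in one piece.
fn split_records(json: &str) -> Vec<&str> {
    let bytes = json.as_bytes();
    let mut records = Vec::new();
    let mut depth = 0usize;    // nesting depth of {} and []
    let mut in_string = false; // inside a "..." literal
    let mut escaped = false;   // previous byte was a backslash
    let mut start = 0usize;    // start of the current record

    for (i, &b) in bytes.iter().enumerate() {
        if in_string {
            // inside a string, only '\' and '"' can change the state
            if escaped {
                escaped = false;
            } else if b == b'\\' {
                escaped = true;
            } else if b == b'"' {
                in_string = false;
            }
            continue;
        }
        match b {
            b'"' => in_string = true,
            b'[' | b'{' => {
                depth += 1;
                if depth == 1 {
                    start = i + 1; // just entered the outer array
                }
            }
            b']' | b'}' => {
                depth -= 1;
                if depth == 0 {
                    // the outer array closed: flush the last record
                    records.push(json[start..i].trim());
                }
            }
            // a ',' splits records only when it belongs to the outer array,
            // i.e. it is not nested inside a record or a string
            b',' if depth == 1 => {
                records.push(json[start..i].trim());
                start = i + 1;
            }
            _ => {}
        }
    }
    records.retain(|r| !r.is_empty());
    records
}
```

Each returned slice is then a standalone JSON record that could be handed to the row-based deserialization API above.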

@jorgecarleitao
Owner

jorgecarleitao commented Jan 11, 2022

addressed by #712 (quick hack of the above comment)

@ritchie46
Collaborator Author

> addressed by #712 (quick hack of the above comment)

Nice, seems like a good starting point to improve upon.

jorgecarleitao changed the title from "Feature: JSON IO?" to "Add support to read JSON" on Jan 14, 2022
jorgecarleitao added the no-changelog label on Jan 14, 2022