This repository has been archived by the owner on Feb 18, 2024. It is now read-only.

Add support to read JSON #712

Closed
ritchie46 opened this issue Dec 25, 2021 · 10 comments · Fixed by #758
Labels
enhancement: An improvement to an existing feature
no-changelog: Issues whose changes are covered by a PR and thus should not be shown in the changelog

Comments

@ritchie46
Collaborator

Currently we write and read JSON lines.

https://jsonlines.org/

I believe it would be a minor modification to also be able to read and write JSON. We could wrap the JSON Lines values in an array and separate them with a , instead of a newline character.

E.g. now we write:

JSON Lines

{"c1":"a"}
{"c1":null}

JSON

[{"c1":"a"},
{"c1":null}]

Would this fit the scope of arrow2, as this is something different than what pyarrow does? I don't mean that we should drop the JSON Lines functionality, but that we should also allow reading and writing JSON.
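For illustration, a minimal sketch of that wrapping as plain string manipulation (not arrow2's writer; it assumes each non-empty input line is already one complete NDJSON record):

```rust
/// Wrap newline-delimited JSON records into a single JSON array.
/// Sketch of the format change only; assumes each non-empty line is one
/// complete, valid JSON record.
fn ndjson_to_json_array(ndjson: &str) -> String {
    let rows: Vec<&str> = ndjson
        .lines()
        .filter(|line| !line.trim().is_empty())
        .collect();
    format!("[{}]", rows.join(","))
}

fn main() {
    let ndjson = "{\"c1\":\"a\"}\n{\"c1\":null}\n";
    assert_eq!(ndjson_to_json_array(ndjson), r#"[{"c1":"a"},{"c1":null}]"#);
}
```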

@jorgecarleitao
Owner

Yes, definitely in scope. We currently already support writing JSON like that. IMO we should aim to drop the serde_json dependency, because it would give us more control without the massive performance penalty associated with it (see #709).

@ritchie46
Collaborator Author

> Yes, definitely in scope. We currently already support writing JSON like that. IMO we should aim to drop the serde_json dependency, because it would give us more control without the massive performance penalty associated with it (see #709).

Definitely agree. Serde is awesome because of its generic targets, but hard to optimize in my experience.

@universalmind303
Contributor

If there were support for arrays, note that a lot of libraries support an optional indent parameter. Would it be a lot of extra effort to add something like this (see the sketch below)?

Python has json.dumps(data, indent=2)
JS has JSON.stringify(data, null, 2)
serde_json has PrettyFormatter
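For reference, a small sketch of the serde_json route (this uses serde_json's public API; the two-space indent mirrors the Python example above):

```rust
use serde::Serialize;
use serde_json::ser::PrettyFormatter;

fn main() {
    let data = serde_json::json!([{"c1": "a"}, {"c1": null}]);

    // Pretty-print with a custom indent, analogous to json.dumps(data, indent=2).
    let formatter = PrettyFormatter::with_indent(b"  ");
    let mut out = Vec::new();
    let mut ser = serde_json::Serializer::with_formatter(&mut out, formatter);
    data.serialize(&mut ser).unwrap();

    println!("{}", String::from_utf8(out).unwrap());
}
```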

@jorgecarleitao
Owner

@universalmind303, yes, definitely in scope as well. It needs some piping of the indent level for struct arrays, but IMO nothing blocking.

jorgecarleitao added the enhancement label on Dec 28, 2021
@alamb
Collaborator

alamb commented Jan 2, 2022

arrow supports both json lines and delimited json -- in case it is helpful for API design or implementation:

https://docs.rs/arrow/6.5.0/arrow/json/writer/index.html
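For comparison, a rough sketch of how those two arrow writers are used (based on the linked docs; exact signatures may differ across arrow versions):

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, StringArray};
use arrow::json::writer::{ArrayWriter, LineDelimitedWriter};
use arrow::record_batch::RecordBatch;

fn main() -> arrow::error::Result<()> {
    // A single batch with one nullable utf8 column, mirroring the example above.
    let c1: ArrayRef = Arc::new(StringArray::from(vec![Some("a"), None]));
    let batch = RecordBatch::try_from_iter(vec![("c1", c1)])?;

    // NDJSON: one record per line.
    let mut ndjson = Vec::new();
    {
        let mut writer = LineDelimitedWriter::new(&mut ndjson);
        writer.write_batches(&[batch.clone()])?;
        writer.finish()?;
    }

    // JSON: the same records wrapped in a top-level array.
    let mut json = Vec::new();
    {
        let mut writer = ArrayWriter::new(&mut json);
        writer.write_batches(&[batch])?;
        writer.finish()?;
    }

    println!("{}", String::from_utf8(ndjson).unwrap());
    println!("{}", String::from_utf8(json).unwrap());
    Ok(())
}
```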

@ritchie46
Collaborator Author

> arrow supports both json lines and delimited json -- in case it is helpful for API design or implementation:
>
> https://docs.rs/arrow/6.5.0/arrow/json/writer/index.html

Thanks Andrew!


@jorgecarleitao
Owner

I think that both implement the same writer. The main difference between them at this point is #709.

@jorgecarleitao
Owner

jorgecarleitao commented Jan 8, 2022

My random thoughts on this situation:

Reading

For historical reasons our parser splits the file into lines via \n and .readlines. This means that we do not support JSON, only NDJSON (which forbids newlines inside a record, at the expense of human readability).

When a file can't be split into chunks, we can't read it in chunks. This means that we can't separate IO from CPU and need to read the whole file at once (e.g. via serde_json). To offer an example, consider

[
    {"a": 1, "b": 2},
    {"a": 2, "b": 20},
    {
         "a": 3,
         "b": 30
    },
    {"a": 4, "b": 40}
]

splitting it on \n results in

[
---------
    {"a": 1, "b": 2},
---------
    {"a": 2, "b": 20},
---------
    {
---------
         "a": 3,
---------
         "b": 30
---------
    },
---------
    {"a": 4, "b": 40}
---------
]

i.e. we broke the 3rd record into pieces, thus making individual lines invalid JSON.

What we want in this case is to be able to split it into something like

[
---------
    {"a": 1, "b": 2},
---------
    {"a": 2, "b": 20},
---------
    {
         "a": 3,
         "b": 30
    },
---------
    {"a": 4, "b": 40}
---------
]

AFAIK this is not supported by serde_json, which has an (owned) in-memory format (serde_json::Value) and reads the JSON end-to-end.

The quick hack is to read the whole thing into serde_json::Value. More performant and less memory-intensive approaches are warranted.
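For concreteness, the quick hack amounts to something like this (a sketch only, not the code that ended up in the repository):

```rust
use std::error::Error;
use std::fs;

/// Sketch of the quick hack: read the whole file, parse it end-to-end into a
/// serde_json::Value, and treat each element of the top-level array as a row.
fn read_rows(path: &str) -> Result<Vec<serde_json::Value>, Box<dyn Error>> {
    let contents = fs::read_to_string(path)?;
    let value: serde_json::Value = serde_json::from_str(&contents)?;
    match value {
        serde_json::Value::Array(rows) => Ok(rows),
        other => Ok(vec![other]), // a single record is still valid JSON
    }
}
```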

IMO, for something like this, we need an API similar to that of the csv crate, which offers a parser that can read rows of an array without deserializing the whole content. In this context, the same way CSV must check for " to indicate that the delimiter is part of a string column, the equivalent here is a state machine that ignores , when it is part of a nested struct.

This would allow us to perform minimal CPU work on read, and to support the same APIs we already have for pure CPU work: deserialize(rows: &[Row], &Field) -> Box<dyn Array>
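A minimal sketch of such a state machine (a hypothetical helper, not arrow2's implementation), assuming a well-formed top-level array is available as one in-memory string:

```rust
/// Split the body of a top-level JSON array into its comma-separated records
/// without deserializing them. It ignores ',' inside strings and inside
/// nested objects/arrays, so multi-line records stay in one piece.
fn split_records(json: &str) -> Vec<&str> {
    let bytes = json.as_bytes();
    let mut records = Vec::new();
    let mut depth = 0usize;    // nesting depth of {} and []
    let mut in_string = false; // inside a "..." literal
    let mut escaped = false;   // previous byte was a backslash
    let mut start = 0usize;    // start of the current record

    for (i, &b) in bytes.iter().enumerate() {
        if in_string {
            // inside a string, only '\' and '"' can change the state
            if escaped {
                escaped = false;
            } else if b == b'\\' {
                escaped = true;
            } else if b == b'"' {
                in_string = false;
            }
            continue;
        }
        match b {
            b'"' => in_string = true,
            b'[' | b'{' => {
                depth += 1;
                if depth == 1 {
                    start = i + 1; // just entered the outer array
                }
            }
            b']' | b'}' => {
                depth -= 1;
                if depth == 0 {
                    // the outer array closed: flush the last record
                    records.push(json[start..i].trim());
                }
            }
            // a ',' splits records only when it belongs to the outer array,
            // i.e. it is not nested inside a record or a string
            b',' if depth == 1 => {
                records.push(json[start..i].trim());
                start = i + 1;
            }
            _ => {}
        }
    }
    records.retain(|r| !r.is_empty());
    records
}
```

Each returned slice is then a standalone JSON record that could be handed to the row-based deserialization API above.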

@jorgecarleitao
Owner

jorgecarleitao commented Jan 11, 2022

addressed by #712 (quick hack of the above comment)

@ritchie46
Collaborator Author

> addressed by #712 (quick hack of the above comment)

Nice, seems like a good starting point to improve upon.

jorgecarleitao changed the title from "Feature: JSON IO?" to "Add support to read JSON" on Jan 14, 2022
jorgecarleitao added the no-changelog label on Jan 14, 2022