Allow overriding schema when reading/scanning JSON files #8279

baggiponte · 2023-04-16T15:54:47Z

Problem description

I tried opening a JSON Lines file with both read_ndjson and scan_ndjson; even though I set infer_schema_length=0, a read error was raised:

pl.scan_ndjson(datapath / "simulationprefab.json", infer_schema_length=0, n_rows=5)

thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: OutOfSpec("Struct array must be created with a DataType whose physical type is Struct")', /Users/runner/.cargo/git/checkouts/arrow2-945af624853845da/d0174d3/src/array/struct_/mod.rs:240:41
---------------------------------------------------------------------------
PanicException                            Traceback (most recent call last)
Cell In[4], line 1
----> 1 pl.scan_ndjson(datapath / "simulationprefab.json", infer_schema_length=0, n_rows=5)

File ~/Documents/dev/futura/futura-grader/.venv/lib/python3.9/site-packages/polars/io/ndjson.py:71, in scan_ndjson(source, infer_schema_length, batch_size, n_rows, low_memory, rechunk, row_count_name, row_count_offset)
     68 if isinstance(source, (str, Path)):
     69     source = normalise_filepath(source)
---> 71 return pli.LazyFrame._scan_ndjson(
     72     source,
     73     infer_schema_length=infer_schema_length,
     74     batch_size=batch_size,
     75     n_rows=n_rows,
     76     low_memory=low_memory,
     77     rechunk=rechunk,
     78     row_count_name=row_count_name,
     79     row_count_offset=row_count_offset,
     80 )

File ~/Documents/dev/futura/futura-grader/.venv/lib/python3.9/site-packages/polars/lazyframe/frame.py:485, in LazyFrame._scan_ndjson(cls, source, infer_schema_length, batch_size, n_rows, low_memory, rechunk, row_count_name, row_count_offset)
    474 """
    475 Lazily read from a newline delimited JSON file.
    476 
   (...)
    482 
    483 """
    484 self = cls.__new__(cls)
--> 485 self._ldf = PyLazyFrame.new_from_ndjson(
    486     source,
    487     infer_schema_length,
    488     batch_size,
    489     n_rows,
    490     low_memory,
    491     rechunk,
    492     _prepare_row_count_args(row_count_name, row_count_offset),
    493 )
    494 return self

PanicException: called `Result::unwrap()` on an `Err` value: OutOfSpec("Struct array must be created with a DataType whose physical type is Struct")

@ghuls suggested opening a PR to request schema overriding when reading/scanning JSONL.

As a side note, I could not understand why the error was raised even though schema inference was disabled. Is this also a bug to fix? It seems related to #3942. I would love to help, but I don't really know Rust. I cannot share the data publicly, but I could ask whether I may send the assignee a small sample to work with.


ritchie46 · 2023-04-18T14:39:42Z

Have you got the file that produced this error?

baggiponte · 2023-04-19T12:28:50Z

Yes, working on a reproducible example with a minimal sample of the data that I can share. Thanks for answering!

baggiponte · 2023-04-27T11:16:30Z

I should have some time to answer this properly over the weekend, thank you for your patience.

EDIT: I have been super busy lately; I should have time this week to get back to this, thanks again.

baggiponte · 2023-06-04T15:12:34Z

Heya, sorry for taking so long. I took some time today to work this out.

1. Do not use `infer_schema_length`

If I use infer_schema_length=0, I get the error above. If I do not (e.g. I run pl.scan_ndjson("../data/raw/simulationprefab.json", n_rows=4).collect()), here is the traceback:

ComputeError                              Traceback (most recent call last)
File ~/Documents/dev/futura/futura-grader/.venv/lib/python3.9/site-packages/polars/lazyframe/frame.py:1501, in LazyFrame.collect(self, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, no_optimization, slice_pushdown, common_subplan_elimination, streaming)
   1490     common_subplan_elimination = False
   1492 ldf = self._ldf.optimization_toggle(
   1493     type_coercion,
   1494     predicate_pushdown,
   (...)
   1499     streaming,
   1500 )
-> 1501 return wrap_df(ldf.collect())

ComputeError: expected list/array in json value, got str
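An error like "expected list/array in json value, got str" suggests that some key holds a list on one line and a string on another. A minimal standard-library sketch for locating such keys is shown below; the helper name `find_mixed_type_keys` and the sample lines are hypothetical, and only top-level keys are inspected.

```python
import json
from collections import defaultdict


def find_mixed_type_keys(lines):
    """Return top-level keys whose values have more than one JSON type.

    The result maps each such key to {type name: first 1-based line number}.
    Nulls are skipped because they are compatible with any column type.
    """
    seen = defaultdict(dict)  # key -> {type name: first line number}
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        for key, value in record.items():
            if value is None:
                continue
            seen[key].setdefault(type(value).__name__, lineno)
    return {key: types for key, types in seen.items() if len(types) > 1}


# Hypothetical sample: "tags" is a list on line 1 and a string on line 2.
sample_lines = [
    '{"tags": ["a"], "id": 1}',
    '{"tags": "a", "id": 2}',
    '{"tags": null, "id": 3}',
]
mixed = find_mixed_type_keys(sample_lines)
```

Because the function accepts any iterable of lines, an open file object can be passed directly, so a large file does not need to be loaded into memory. Note that `int` and `float` are reported as different types even though a reader may be able to reconcile them.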

2. Use `pl.from_pandas()`

I tried reading with pandas and converting to polars, but this raises an error:

data = pd.read_json("../data/raw/simulationprefab.json.gz", lines=True, nrows=4)
pl.from_pandas(data)

The error originates at the pyarrow level. The same error is raised with the pyarrow dtype backend:

data = pd.read_json("../data/raw/simulationprefab.json.gz", lines=True, nrows=4, dtype_backend="pyarrow")

pyarrow stacktrace

ArrowInvalid                              Traceback (most recent call last)
File ~/Documents/dev/futura/futura-grader/.venv/lib/python3.9/site-packages/polars/convert.py:720, in from_pandas(data, schema_overrides, rechunk, nan_to_null, include_index)
    718     return pl.Series._from_pandas("", data, nan_to_null=nan_to_null)
    719 elif isinstance(data, pd.DataFrame):
--> 720     return pl.DataFrame._from_pandas(
    721         data,
    722         rechunk=rechunk,
    723         nan_to_null=nan_to_null,
    724         schema_overrides=schema_overrides,
    725         include_index=include_index,
    726     )
    727 else:
    728     raise ValueError(f"Expected pandas DataFrame or Series, got {type(data)}.")

File ~/Documents/dev/futura/futura-grader/.venv/lib/python3.9/site-packages/polars/dataframe/frame.py:664, in DataFrame._from_pandas(cls, data, schema, schema_overrides, rechunk, nan_to_null, include_index)
    620 @classmethod
    621 def _from_pandas(
    622     cls,
   (...)
    629     include_index: bool = False,
    630 ) -> Self:
    631     """
    632     Construct a Polars DataFrame from a pandas DataFrame.
    633 
   (...)
    661 
    662     """
    663     return cls._from_pydf(
--> 664         pandas_to_pydf(
    665             data,
    666             schema=schema,
    667             schema_overrides=schema_overrides,
    668             rechunk=rechunk,
    669             nan_to_null=nan_to_null,
    670             include_index=include_index,
    671         )
    672     )

File ~/Documents/dev/futura/futura-grader/.venv/lib/python3.9/site-packages/polars/utils/_construction.py:1519, in pandas_to_pydf(data, schema, schema_overrides, rechunk, nan_to_null, include_index)
   1512         arrow_dict[str(idxcol)] = _pandas_series_to_arrow(
   1513             data.index.get_level_values(idxcol),
   1514             nan_to_null=nan_to_null,
   1515             length=length,
   1516         )
   1518 for col in data.columns:
-> 1519     arrow_dict[str(col)] = _pandas_series_to_arrow(
   1520         data[col], nan_to_null=nan_to_null, length=length
   1521     )
   1523 arrow_table = pa.table(arrow_dict)
   1524 return arrow_to_pydf(
   1525     arrow_table, schema=schema, schema_overrides=schema_overrides, rechunk=rechunk
   1526 )

File ~/Documents/dev/futura/futura-grader/.venv/lib/python3.9/site-packages/polars/utils/_construction.py:550, in _pandas_series_to_arrow(values, nan_to_null, length)
    548     elif first_non_none is None:
    549         return pa.nulls(length or len(values), pa.large_utf8())
--> 550     return pa.array(values, from_pandas=nan_to_null)
    551 elif dtype:
    552     return pa.array(values, from_pandas=nan_to_null)

File ~/Documents/dev/futura/futura-grader/.venv/lib/python3.9/site-packages/pyarrow/array.pxi:323, in pyarrow.lib.array()

File ~/Documents/dev/futura/futura-grader/.venv/lib/python3.9/site-packages/pyarrow/array.pxi:83, in pyarrow.lib._ndarray_to_array()

File ~/Documents/dev/futura/futura-grader/.venv/lib/python3.9/site-packages/pyarrow/error.pxi:100, in pyarrow.lib.check_status()

ArrowInvalid: Could not convert 'true' with type str: tried to convert to boolean
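The message indicates that a column inferred as boolean also contains the string 'true'. One possible workaround, sketched here with a hypothetical helper `coerce_bool` and made-up values, is to normalise such strings before the conversion step. Whether this matches the real data is an assumption.

```python
_BOOL_STRINGS = {"true": True, "false": False}


def coerce_bool(value):
    """Map the strings 'true'/'false' (any case) to booleans; pass other values through."""
    if isinstance(value, str):
        return _BOOL_STRINGS.get(value.strip().lower(), value)
    return value


# Hypothetical column mixing real booleans with their string spellings.
raw_flags = [True, "true", False, "FALSE", None]
clean_flags = [coerce_bool(v) for v in raw_flags]
```

With pandas, the same function could be applied to the affected column (for example through `Series.map`) before calling `pl.from_pandas`.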

3. Read with pandas, save to json, read again with polars

After this, I tried reading the file with pandas and saving it as JSON, like so:

data = pd.read_json("../data/raw/simulationprefab.json.gz", lines=True, nrows=4)

data.to_json("./sample.json", orient="records", lines=True)

And then I read it back with polars:

pl.read_json("./sample.json")
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
File ~/Documents/dev/futura/futura-grader/.venv/lib/python3.9/site-packages/polars/io/json.py:28, in read_json(source)
     14 def read_json(source: str | Path | IOBase) -> DataFrame:
     15     """
     16     Read into a DataFrame from a JSON file.
     17 
   (...)
     26 
     27     """
---> 28     return pl.DataFrame._read_json(source)

File ~/Documents/dev/futura/futura-grader/.venv/lib/python3.9/site-packages/polars/dataframe/frame.py:1010, in DataFrame._read_json(cls, source)
   1007     source = normalise_filepath(source)
   1009 self = cls.__new__(cls)
-> 1010 self._df = PyDataFrame.read_json(source, False)
   1011 return self

RuntimeError: BindingsError: "InternalError at character 201722 ('}')"
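A possible explanation for this failure: a file written with `lines=True` is newline-delimited JSON rather than a single JSON document, and `pl.read_ndjson` is the polars reader intended for that layout. The sketch below uses only the standard-library `json` module (not the polars parser) to show how a single-document parser reacts to line-delimited input; the sample text and helper names are hypothetical.

```python
import json

# Hypothetical two-record file content, as written with lines=True.
ndjson_text = '{"a": 1}\n{"a": 2}\n'


def parse_as_single_document(text):
    """Treat the whole text as one JSON document."""
    return json.loads(text)


def parse_as_lines(text):
    """Treat every non-empty line as its own JSON document."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


try:
    parse_as_single_document(ndjson_text)
    single_document_error = None
except json.JSONDecodeError as exc:
    # The parser stops after the first object and reports the trailing data.
    single_document_error = exc

records = parse_as_lines(ndjson_text)
```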

4. Read with pandas, save to json, read again with pandas and use `pl.from_pandas()` (WTF)

Why did I read only 4 lines? Because the following snippet works up to the 4th line; beyond that, the same error shown in the pyarrow stacktrace above is raised:

data = pd.read_json("../data/raw/simulationprefab.json.gz", lines=True, nrows=4)
data.to_json("./sample.json", orient="records", lines=True)

pd.read_json("./sample.json", lines=True)
pl.from_pandas(data)

5. Using Python generators

Finally, I went for a completely different approach. I used itertools and generators because the data is 5GB and I wanted to use itertools.islice to find out up to which line polars worked.

import json
import itertools

def yield_lines(filepath):
    with open(filepath) as f:
        yield from f

generator = (json.loads(line) for line in yield_lines("../data/raw/simulationprefab.json"))

dat = itertools.islice(generator, 35)
pl.DataFrame(dat)

This time, the following error is raised at line 35.

Error

File ~/Documents/dev/futura/futura-grader/.venv/lib/python3.9/site-packages/polars/dataframe/frame.py:400, in DataFrame.__init__(self, data, schema, schema_overrides, orient, infer_schema_length, nan_to_null)
    395     self._df = pandas_to_pydf(
    396         data, schema=schema, schema_overrides=schema_overrides
    397     )
    399 elif not isinstance(data, Sized) and isinstance(data, (Generator, Iterable)):
--> 400     self._df = iterable_to_pydf(
    401         data,
    402         schema=schema,
    403         schema_overrides=schema_overrides,
    404         orient=orient,
    405         infer_schema_length=infer_schema_length,
    406     )
    407 else:
    408     raise ValueError(
    409         f"DataFrame constructor called with unsupported type; got {type(data)}"
    410     )

File ~/Documents/dev/futura/futura-grader/.venv/lib/python3.9/site-packages/polars/utils/_construction.py:1460, in iterable_to_pydf(data, schema, schema_overrides, orient, chunk_size, infer_schema_length)
   1458 if not values:
   1459     break
-> 1460 frame_chunk = to_frame_chunk(values, original_schema)
   1461 if df is None:
   1462     df = frame_chunk

File ~/Documents/dev/futura/futura-grader/.venv/lib/python3.9/site-packages/polars/utils/_construction.py:1434, in iterable_to_pydf..to_frame_chunk(values, schema)
   1433 def to_frame_chunk(values: list[Any], schema: SchemaDefinition | None) -> DataFrame:
-> 1434     return pl.DataFrame(
   1435         data=values,
   1436         schema=schema,
   1437         orient="row",
   1438         infer_schema_length=infer_schema_length,
   1439     )

File ~/Documents/dev/futura/futura-grader/.venv/lib/python3.9/site-packages/polars/dataframe/frame.py:368, in DataFrame.__init__(self, data, schema, schema_overrides, orient, infer_schema_length, nan_to_null)
    360     self._df = dict_to_pydf(
    361         data,
    362         schema=schema,
    363         schema_overrides=schema_overrides,
    364         nan_to_null=nan_to_null,
    365     )
    367 elif isinstance(data, (list, tuple, Sequence)):
--> 368     self._df = sequence_to_pydf(
    369         data,
    370         schema=schema,
    371         schema_overrides=schema_overrides,
    372         orient=orient,
    373         infer_schema_length=infer_schema_length,
    374     )
    375 elif isinstance(data, pl.Series):
    376     self._df = series_to_pydf(
    377         data, schema=schema, schema_overrides=schema_overrides
    378     )

File ~/Documents/dev/futura/futura-grader/.venv/lib/python3.9/site-packages/polars/utils/_construction.py:820, in sequence_to_pydf(data, schema, schema_overrides, orient, infer_schema_length)
    817 if len(data) == 0:
    818     return dict_to_pydf({}, schema=schema, schema_overrides=schema_overrides)
--> 820 return _sequence_to_pydf_dispatcher(
    821     data[0],
    822     data=data,
    823     schema=schema,
    824     schema_overrides=schema_overrides,
    825     orient=orient,
    826     infer_schema_length=infer_schema_length,
    827 )

File ~/.local/share/rtx/installs/python/3.9.16/lib/python3.9/functools.py:888, in singledispatch..wrapper(*args, **kw)
    884 if not args:
    885     raise TypeError(f'{funcname} requires at least '
    886                     '1 positional argument')
--> 888 return dispatch(args[0].__class__)(*args, **kw)

File ~/Documents/dev/futura/futura-grader/.venv/lib/python3.9/site-packages/polars/utils/_construction.py:1036, in _sequence_of_dict_to_pydf(first_element, data, schema, schema_overrides, infer_schema_length, **kwargs)
   1028 column_names, schema_overrides = _unpack_schema(
   1029     schema, schema_overrides=schema_overrides
   1030 )
   1031 dicts_schema = (
   1032     include_unknowns(schema_overrides, column_names or list(schema_overrides))
   1033     if schema_overrides and column_names
   1034     else None
   1035 )
-> 1036 pydf = PyDataFrame.read_dicts(data, infer_schema_length, dicts_schema)
   1038 if column_names and set(column_names).intersection(pydf.columns()):
   1039     column_names = []

ComputeError: mixed dtypes found when building Utf8 Series
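Finding the first offending line by hand with `itertools.islice` can be automated with a binary search over prefix lengths. The sketch below is an illustration under two assumptions: an empty prefix always loads, and once a prefix fails every longer prefix fails too. The names `first_failing_row` and `strict_build` are hypothetical; `strict_build` is a stand-in for the real constructor.

```python
import itertools


def first_failing_row(records, build, limit):
    """Return the smallest n <= limit for which build(first n records) raises.

    Returns None when the first `limit` records load without error.
    """
    head = list(itertools.islice(records, limit))
    try:
        build(head)
        return None  # the whole prefix loads fine
    except Exception:
        pass
    # Invariant: build(head[:lo]) succeeds and build(head[:hi]) fails.
    lo, hi = 0, len(head)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        try:
            build(head[:mid])
            lo = mid
        except Exception:
            hi = mid
    return hi


def strict_build(rows):
    """Stand-in builder that rejects a column with mixed Python types."""
    kinds = {type(row["v"]) for row in rows}
    if len(kinds) > 1:
        raise TypeError("mixed types in column 'v'")
    return rows


# Hypothetical data: the 35th record holds a string instead of an integer.
rows = ({"v": i} if i != 34 else {"v": "oops"} for i in range(100))
bad_row = first_failing_row(rows, strict_build, limit=50)
```

With polars, `pl.DataFrame` could be passed as `build`. Only the first `limit` records are materialised, so the approach stays usable on a multi-gigabyte file as long as the failure occurs early.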

Conclusions and data snippet

Since the object is deeply nested, I guess that with some schema magic I could make it work, so I'll explore a bit more. In the meantime, here are the first 4 rows of the data:

sample.json.gz

I also have a theory about why writing the JSON with pandas yields a different result. The JSON comes from DynamoDB and uses Decimal types to encode integers, so it might be that pandas can handle the conversion?

I saw that polars has the decimal type (e.g. pl.Series([decimal.Decimal("1.0")])), but perhaps inconsistent types raise some errors.
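If the records do contain `decimal.Decimal` values once loaded in Python (an assumption; this depends on how the DynamoDB export is read), one way to test the theory is to convert them to plain numbers before building the frame. The helper name `normalize_decimals` and the sample record are hypothetical.

```python
from decimal import Decimal


def normalize_decimals(value):
    """Recursively replace Decimal values with int (when integral) or float."""
    if isinstance(value, Decimal):
        if not value.is_finite():
            return float(value)  # NaN and infinities have no int form
        if value == value.to_integral_value():
            return int(value)
        return float(value)
    if isinstance(value, dict):
        return {key: normalize_decimals(item) for key, item in value.items()}
    if isinstance(value, list):
        return [normalize_decimals(item) for item in value]
    return value


# Hypothetical nested record using Decimal for every number.
record = {"a": Decimal("1"), "b": [Decimal("2.5"), {"c": Decimal("3.0")}], "d": "x"}
clean = normalize_decimals(record)
```

A column that holds both integral and fractional values will still end up with a mix of `int` and `float`, so a uniform conversion to `float` may be preferable when exact integer precision is not needed.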

Relates to pola-rs#8279. I'm not 100% sure about the Python schema type annotation, there are a few different variations in this file but this seems to make the most sense? Happy to adjust though.

baggiponte added the enhancement New feature or an improvement of an existing feature label Apr 16, 2023

sd2k mentioned this issue Sep 7, 2023

feat: allow specifying schema in pl.scan_ndjson #10963

Merged

ritchie46 mentioned this issue Jan 11, 2024

fix: fix schema inference for json #13637

Merged

ritchie46 closed this as completed in #13637 Jan 11, 2024

c-peters added the accepted Ready for implementation label Jan 14, 2024

c-peters assigned ritchie46 Jan 14, 2024

c-peters added this to Backlog Jan 14, 2024

c-peters moved this to Done in Backlog Jan 14, 2024
