Allow overriding schema when reading/scanning JSON files #8279

Closed
baggiponte opened this issue Apr 16, 2023 · 4 comments · Fixed by #13637
Labels: accepted (Ready for implementation), enhancement (New feature or an improvement of an existing feature)

Comments

@baggiponte
Contributor

Problem description

I tried opening a JSON Lines file with both read_ndjson and scan_ndjson; even though I set infer_schema_length=0, there was a reading error:

pl.scan_ndjson(datapath / "simulationprefab.json", infer_schema_length=0, n_rows=5)

thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: OutOfSpec("Struct array must be created with a DataType whose physical type is Struct")', /Users/runner/.cargo/git/checkouts/arrow2-945af624853845da/d0174d3/src/array/struct_/mod.rs:240:41
---------------------------------------------------------------------------
PanicException                            Traceback (most recent call last)
Cell In[4], line 1
----> 1 pl.scan_ndjson(datapath / "simulationprefab.json", infer_schema_length=0, n_rows=5)

File ~/Documents/dev/futura/futura-grader/.venv/lib/python3.9/site-packages/polars/io/ndjson.py:71, in scan_ndjson(source, infer_schema_length, batch_size, n_rows, low_memory, rechunk, row_count_name, row_count_offset)
     68 if isinstance(source, (str, Path)):
     69     source = normalise_filepath(source)
---> 71 return pli.LazyFrame._scan_ndjson(
     72     source,
     73     infer_schema_length=infer_schema_length,
     74     batch_size=batch_size,
     75     n_rows=n_rows,
     76     low_memory=low_memory,
     77     rechunk=rechunk,
     78     row_count_name=row_count_name,
     79     row_count_offset=row_count_offset,
     80 )

File ~/Documents/dev/futura/futura-grader/.venv/lib/python3.9/site-packages/polars/lazyframe/frame.py:485, in LazyFrame._scan_ndjson(cls, source, infer_schema_length, batch_size, n_rows, low_memory, rechunk, row_count_name, row_count_offset)
    474 """
    475 Lazily read from a newline delimited JSON file.
    476 
   (...)
    482 
    483 """
    484 self = cls.__new__(cls)
--> 485 self._ldf = PyLazyFrame.new_from_ndjson(
    486     source,
    487     infer_schema_length,
    488     batch_size,
    489     n_rows,
    490     low_memory,
    491     rechunk,
    492     _prepare_row_count_args(row_count_name, row_count_offset),
    493 )
    494 return self

PanicException: called `Result::unwrap()` on an `Err` value: OutOfSpec("Struct array must be created with a DataType whose physical type is Struct")

@ghuls suggested opening an issue to request schema overriding when reading/scanning JSONL.

As a side note, I could not understand why the error was raised even though schema inference was disabled. Is this also a bug to fix? It seems related to #3942? Would love to help, but I don't really know Rust. I cannot share the data publicly but I guess I could ask if I could send the assignee a small sample to work with.
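
To make the request concrete, here is a minimal pure-Python sketch of what "overriding the schema" means: each field is coerced to a type declared by the caller instead of being inferred from the first rows. The names `SCHEMA` and `coerce_record`, and the sample fields, are hypothetical and are not part of the Polars API.

```python
import json

# Hypothetical declared schema: field name -> Python type to coerce to.
SCHEMA = {"id": str, "enabled": bool, "tags": list}


def coerce_record(line, schema):
    """Parse one JSON line and coerce each declared field to its declared type."""
    raw = json.loads(line)
    out = {}
    for name, target in schema.items():
        value = raw.get(name)
        if value is None:
            out[name] = None  # missing or null fields stay null
        elif target is bool and isinstance(value, str):
            out[name] = value.strip().lower() == "true"
        elif target is list and not isinstance(value, list):
            out[name] = [value]  # wrap scalars so the column is always a list
        else:
            out[name] = target(value)
    return out


rows = [
    coerce_record('{"id": 1, "enabled": "true", "tags": "a"}', SCHEMA),
    coerce_record('{"id": "2", "enabled": false, "tags": ["b", "c"]}', SCHEMA),
]
```

With a declared schema, both rows end up with the same column types even though the raw lines disagree, which is the behaviour this issue asks the NDJSON readers to support.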

@baggiponte baggiponte added the enhancement New feature or an improvement of an existing feature label Apr 16, 2023
@ritchie46
Member

Have you got the file that produced this error?

@baggiponte
Contributor Author

Yes, working on a reproducible example with a minimal sample of the data that I can share. Thanks for answering!

@baggiponte
Contributor Author

baggiponte commented Apr 27, 2023

I should have some time to answer this properly this weekend, thank you for your patience.

EDIT: I was super busy lately, I should have time this week to go back on this, thanks again.

@baggiponte
Contributor Author

baggiponte commented Jun 4, 2023

Heya, sorry for taking so long. I took some time today to work this out.

1. Do not use infer_schema_length

If I use infer_schema_length=0, I get the error above. If I do not (e.g. I run pl.scan_ndjson("../data/raw/simulationprefab.json", n_rows=4).collect()), here is the traceback:

ComputeError                              Traceback (most recent call last)
File ~/Documents/dev/futura/futura-grader/.venv/lib/python3.9/site-packages/polars/lazyframe/frame.py:1501, in LazyFrame.collect(self, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, no_optimization, slice_pushdown, common_subplan_elimination, streaming)
   1490     common_subplan_elimination = False
   1492 ldf = self._ldf.optimization_toggle(
   1493     type_coercion,
   1494     predicate_pushdown,
   (...)
   1499     streaming,
   1500 )
-> 1501 return wrap_df(ldf.collect())

ComputeError: expected list/array in json value, got str
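
The message `expected list/array in json value, got str` suggests that a field inferred as a list from the first rows holds a string on a later row. A minimal sketch for locating such fields with only the standard library follows. `json_type_report` and `mixed_keys` are hypothetical helpers, the sample lines are invented for illustration, and only top-level keys are inspected.

```python
import json
from collections import defaultdict


def json_type_report(lines):
    """Map each top-level key to the set of Python type names seen across lines."""
    seen = defaultdict(set)
    for line in lines:
        for key, value in json.loads(line).items():
            seen[key].add(type(value).__name__)
    return dict(seen)


def mixed_keys(report):
    """Keys holding more than one non-null type, i.e. likely inference failures."""
    return sorted(k for k, types in report.items() if len(types - {"NoneType"}) > 1)


sample = [
    '{"name": "a", "items": ["x", "y"]}',
    '{"name": "b", "items": "x"}',
    '{"name": null, "items": []}',
]
report = json_type_report(sample)
suspects = mixed_keys(report)
```

Running the same report over the first few thousand lines of the real file would show which columns need an explicit type.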

2. Use pl.from_pandas()

I tried reading with pandas and converting to polars but this returns an error...

data = pd.read_json("../data/raw/simulationprefab.json.gz", lines=True, nrows=4)
pl.from_pandas(data)

but this happens at the pyarrow level. The same error is raised with this:

data = pd.read_json("../data/raw/simulationprefab.json.gz", lines=True, nrows=4, dtype_backend="pyarrow")
pyarrow stacktrace
ArrowInvalid                              Traceback (most recent call last)
File ~/Documents/dev/futura/futura-grader/.venv/lib/python3.9/site-packages/polars/convert.py:720, in from_pandas(data, schema_overrides, rechunk, nan_to_null, include_index)
    718     return pl.Series._from_pandas("", data, nan_to_null=nan_to_null)
    719 elif isinstance(data, pd.DataFrame):
--> 720     return pl.DataFrame._from_pandas(
    721         data,
    722         rechunk=rechunk,
    723         nan_to_null=nan_to_null,
    724         schema_overrides=schema_overrides,
    725         include_index=include_index,
    726     )
    727 else:
    728     raise ValueError(f"Expected pandas DataFrame or Series, got {type(data)}.")

File ~/Documents/dev/futura/futura-grader/.venv/lib/python3.9/site-packages/polars/dataframe/frame.py:664, in DataFrame._from_pandas(cls, data, schema, schema_overrides, rechunk, nan_to_null, include_index)
    620 @classmethod
    621 def _from_pandas(
    622     cls,
   (...)
    629     include_index: bool = False,
    630 ) -> Self:
    631     """
    632     Construct a Polars DataFrame from a pandas DataFrame.
    633 
   (...)
    661 
    662     """
    663     return cls._from_pydf(
--> 664         pandas_to_pydf(
    665             data,
    666             schema=schema,
    667             schema_overrides=schema_overrides,
    668             rechunk=rechunk,
    669             nan_to_null=nan_to_null,
    670             include_index=include_index,
    671         )
    672     )

File ~/Documents/dev/futura/futura-grader/.venv/lib/python3.9/site-packages/polars/utils/_construction.py:1519, in pandas_to_pydf(data, schema, schema_overrides, rechunk, nan_to_null, include_index)
   1512         arrow_dict[str(idxcol)] = _pandas_series_to_arrow(
   1513             data.index.get_level_values(idxcol),
   1514             nan_to_null=nan_to_null,
   1515             length=length,
   1516         )
   1518 for col in data.columns:
-> 1519     arrow_dict[str(col)] = _pandas_series_to_arrow(
   1520         data[col], nan_to_null=nan_to_null, length=length
   1521     )
   1523 arrow_table = pa.table(arrow_dict)
   1524 return arrow_to_pydf(
   1525     arrow_table, schema=schema, schema_overrides=schema_overrides, rechunk=rechunk
   1526 )

File ~/Documents/dev/futura/futura-grader/.venv/lib/python3.9/site-packages/polars/utils/_construction.py:550, in _pandas_series_to_arrow(values, nan_to_null, length)
    548     elif first_non_none is None:
    549         return pa.nulls(length or len(values), pa.large_utf8())
--> 550     return pa.array(values, from_pandas=nan_to_null)
    551 elif dtype:
    552     return pa.array(values, from_pandas=nan_to_null)

File ~/Documents/dev/futura/futura-grader/.venv/lib/python3.9/site-packages/pyarrow/array.pxi:323, in pyarrow.lib.array()

File ~/Documents/dev/futura/futura-grader/.venv/lib/python3.9/site-packages/pyarrow/array.pxi:83, in pyarrow.lib._ndarray_to_array()

File ~/Documents/dev/futura/futura-grader/.venv/lib/python3.9/site-packages/pyarrow/error.pxi:100, in pyarrow.lib.check_status()

ArrowInvalid: Could not convert 'true' with type str: tried to convert to boolean
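
The `ArrowInvalid` message indicates that a column pyarrow was converting to boolean also contains the string `'true'`. One possible workaround, sketched below under the assumption that the strings "true" and "false" should become booleans everywhere, is to normalise the records before conversion. `normalise_bools` is a hypothetical helper, not a pandas, pyarrow, or Polars function.

```python
def normalise_bools(value):
    """Recursively replace the strings "true"/"false" with Python booleans."""
    if isinstance(value, str) and value.lower() in ("true", "false"):
        return value.lower() == "true"
    if isinstance(value, list):
        return [normalise_bools(v) for v in value]
    if isinstance(value, dict):
        return {k: normalise_bools(v) for k, v in value.items()}
    return value  # numbers, real booleans, None and other strings are untouched


record = {"active": "true", "flags": [True, "false"], "meta": {"name": "true story"}}
cleaned = normalise_bools(record)
```

This is a blunt rule: a genuine text field whose whole value is "true" would also be converted, so it only fits data where that cannot happen.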

3. Read with pandas, save to json, read again with polars

After this, I tried reading the file with pandas and saving it as JSON, like so:

data = pd.read_json("../data/raw/simulationprefab.json.gz", lines=True, nrows=4)

data.to_json("./sample.json", orient="records", lines=True)

And then read it back with polars.

pl.read_json("./sample.json")
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
File ~/Documents/dev/futura/futura-grader/.venv/lib/python3.9/site-packages/polars/io/json.py:28, in read_json(source)
     14 def read_json(source: str | Path | IOBase) -> DataFrame:
     15     """
     16     Read into a DataFrame from a JSON file.
     17 
   (...)
     26 
     27     """
---> 28     return pl.DataFrame._read_json(source)

File ~/Documents/dev/futura/futura-grader/.venv/lib/python3.9/site-packages/polars/dataframe/frame.py:1010, in DataFrame._read_json(cls, source)
   1007     source = normalise_filepath(source)
   1009 self = cls.__new__(cls)
-> 1010 self._df = PyDataFrame.read_json(source, False)
   1011 return self

RuntimeError: BindingsError: "InternalError at character 201722 ('}')"

4. Read with pandas, save to json, read again with pandas and use pl.from_pandas() (WTF)

Why did I read 4 lines? Because the following snippet works up to the 4th line, after which the same error in the toggle menu above is raised:

data = pd.read_json("../data/raw/simulationprefab.json.gz", lines=True, nrows=4)
data.to_json("./sample.json", orient="records", lines=True)

pd.read_json("./sample.json", lines=True)
pl.from_pandas(data)

5. Using Python generators

Finally, I went for a completely different approach. I used itertools and generators because the data is 5GB and I wanted to use itertools.islice to find out up to which line polars worked.

import json
import itertools

def yield_lines(filepath):
    with open(filepath) as f:
        yield from f

generator = (json.loads(line) for line in yield_lines("../data/raw/simulationprefab.json"))

dat = itertools.islice(generator, 35)
pl.DataFrame(dat)

This time, the following error is raised at line 35.

Error
File ~/Documents/dev/futura/futura-grader/.venv/lib/python3.9/site-packages/polars/dataframe/frame.py:400, in DataFrame.__init__(self, data, schema, schema_overrides, orient, infer_schema_length, nan_to_null)
    395     self._df = pandas_to_pydf(
    396         data, schema=schema, schema_overrides=schema_overrides
    397     )
    399 elif not isinstance(data, Sized) and isinstance(data, (Generator, Iterable)):
--> 400     self._df = iterable_to_pydf(
    401         data,
    402         schema=schema,
    403         schema_overrides=schema_overrides,
    404         orient=orient,
    405         infer_schema_length=infer_schema_length,
    406     )
    407 else:
    408     raise ValueError(
    409         f"DataFrame constructor called with unsupported type; got {type(data)}"
    410     )

File ~/Documents/dev/futura/futura-grader/.venv/lib/python3.9/site-packages/polars/utils/_construction.py:1460, in iterable_to_pydf(data, schema, schema_overrides, orient, chunk_size, infer_schema_length)
   1458 if not values:
   1459     break
-> 1460 frame_chunk = to_frame_chunk(values, original_schema)
   1461 if df is None:
   1462     df = frame_chunk

File ~/Documents/dev/futura/futura-grader/.venv/lib/python3.9/site-packages/polars/utils/_construction.py:1434, in iterable_to_pydf.<locals>.to_frame_chunk(values, schema)
   1433 def to_frame_chunk(values: list[Any], schema: SchemaDefinition | None) -> DataFrame:
-> 1434     return pl.DataFrame(
   1435         data=values,
   1436         schema=schema,
   1437         orient="row",
   1438         infer_schema_length=infer_schema_length,
   1439     )

File ~/Documents/dev/futura/futura-grader/.venv/lib/python3.9/site-packages/polars/dataframe/frame.py:368, in DataFrame.__init__(self, data, schema, schema_overrides, orient, infer_schema_length, nan_to_null)
    360     self._df = dict_to_pydf(
    361         data,
    362         schema=schema,
    363         schema_overrides=schema_overrides,
    364         nan_to_null=nan_to_null,
    365     )
    367 elif isinstance(data, (list, tuple, Sequence)):
--> 368     self._df = sequence_to_pydf(
    369         data,
    370         schema=schema,
    371         schema_overrides=schema_overrides,
    372         orient=orient,
    373         infer_schema_length=infer_schema_length,
    374     )
    375 elif isinstance(data, pl.Series):
    376     self._df = series_to_pydf(
    377         data, schema=schema, schema_overrides=schema_overrides
    378     )

File ~/Documents/dev/futura/futura-grader/.venv/lib/python3.9/site-packages/polars/utils/_construction.py:820, in sequence_to_pydf(data, schema, schema_overrides, orient, infer_schema_length)
    817 if len(data) == 0:
    818     return dict_to_pydf({}, schema=schema, schema_overrides=schema_overrides)
--> 820 return _sequence_to_pydf_dispatcher(
    821     data[0],
    822     data=data,
    823     schema=schema,
    824     schema_overrides=schema_overrides,
    825     orient=orient,
    826     infer_schema_length=infer_schema_length,
    827 )

File ~/.local/share/rtx/installs/python/3.9.16/lib/python3.9/functools.py:888, in singledispatch.<locals>.wrapper(*args, **kw)
    884 if not args:
    885     raise TypeError(f'{funcname} requires at least '
    886                     '1 positional argument')
--> 888 return dispatch(args[0].__class__)(*args, **kw)

File ~/Documents/dev/futura/futura-grader/.venv/lib/python3.9/site-packages/polars/utils/_construction.py:1036, in _sequence_of_dict_to_pydf(first_element, data, schema, schema_overrides, infer_schema_length, **kwargs)
   1028 column_names, schema_overrides = _unpack_schema(
   1029     schema, schema_overrides=schema_overrides
   1030 )
   1031 dicts_schema = (
   1032     include_unknowns(schema_overrides, column_names or list(schema_overrides))
   1033     if schema_overrides and column_names
   1034     else None
   1035 )
-> 1036 pydf = PyDataFrame.read_dicts(data, infer_schema_length, dicts_schema)
   1038 if column_names and set(column_names).intersection(pydf.columns()):
   1039     column_names = []

ComputeError: mixed dtypes found when building Utf8 Series
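
The search for the first row that breaks construction can be automated. The sketch below grows the prefix one row at a time and reports the first length at which a builder raises. `first_failing_length` and `strict_builder` are hypothetical names, and `strict_builder` only mimics a column builder that rejects mixed types; in practice the `build` argument would be whatever constructor is under test (for example `pl.DataFrame`).

```python
import itertools
import json


def first_failing_length(records, build, limit=None):
    """Return the smallest n for which build(first n records) raises, else None."""
    prefix = []
    for n, record in enumerate(itertools.islice(records, limit), start=1):
        prefix.append(record)
        try:
            build(prefix)
        except Exception:
            return n
    return None


def strict_builder(rows):
    """Stand-in builder: raise if any key holds more than one non-null type."""
    types = {}
    for row in rows:
        for key, value in row.items():
            if value is None:
                continue
            expected = types.setdefault(key, type(value))
            if type(value) is not expected:
                raise TypeError(f"mixed dtypes in column {key!r}")


lines = ['{"a": "x"}', '{"a": null}', '{"a": 1}', '{"a": "y"}']
records = (json.loads(line) for line in lines)
failing = first_failing_length(records, strict_builder)
```

Rebuilding the whole prefix at every step is quadratic, so `limit` should stay small on a 5GB file; the record at index `failing - 1` is the one to inspect.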

Conclusions and data snippet

Since the object is deeply nested, I guess that with some schema magic I could make it work, so I'll explore a bit more. In the meantime, here are the first 4 rows of the data:

sample.json.gz

I also have a theory about why writing the JSON with pandas yields a different result. The JSON comes from DynamoDB and uses Decimal types to encode integers, so it might be that pandas can handle the conversion?

I saw that polars has the decimal type (e.g. pl.Series([decimal.Decimal("1.0")])), but perhaps inconsistent types raise some errors.
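
If the Decimal theory holds, one option is to convert `decimal.Decimal` values to plain `int` or `float` before building a frame, so that each column carries a single numeric type. The helper `normalise_decimals` below is hypothetical and the sample record is invented; it is a sketch of the idea, not a confirmed fix.

```python
from decimal import Decimal


def normalise_decimals(value):
    """Recursively convert Decimal to int (when integral) or float."""
    if isinstance(value, Decimal):
        # Only finite, whole-valued decimals become int; everything else is float.
        if value.is_finite() and value == value.to_integral_value():
            return int(value)
        return float(value)
    if isinstance(value, list):
        return [normalise_decimals(v) for v in value]
    if isinstance(value, dict):
        return {k: normalise_decimals(v) for k, v in value.items()}
    return value


record = {"count": Decimal("3"), "ratio": Decimal("0.5"), "nested": [{"n": Decimal("10.0")}]}
plain = normalise_decimals(record)
```

Note that a column mixing whole and fractional values would still come out as a mix of `int` and `float`, so a stricter variant might always return `float`.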

sd2k added a commit to sd2k/polars that referenced this issue Sep 7, 2023
Relates to pola-rs#8279.

I'm not 100% sure about the Python schema type annotation, there are a
few different variations in this file but this seems to make the most
sense? Happy to adjust though.
@c-peters c-peters added the accepted Ready for implementation label Jan 14, 2024
@c-peters c-peters added this to Backlog Jan 14, 2024
@c-peters c-peters moved this to Done in Backlog Jan 14, 2024
3 participants