TypeError: Couldn't cast array of type string to null in long json #7222
Comments
I am encountering this same issue. It seems that the library manages to recognise an optional column (one that is nullable but not exclusively null) if there is at least one non-null instance within the same file. For example, given a file containing
{"a": "a1", "b": "b1", "c": null, "d": null}
{"a": "a2", "b": null, "c": "c2", "d": null}
the data is correctly loaded, recognising the optional columns:
{'a': ['a1', 'a2'], 'b': ['b1', None], 'c': [None, 'c2'], 'd': [None, None]}
But if the dataset also includes a second file in which a column that was entirely null in the first file (here d) holds a string:
{"a": null, "b": "b3", "c": "c3", "d": "d3"}
{"a": "a4", "b": "b4", "c": null, "d": null}
then an error is raised:
I have created a sample repository if that helps. Interestingly, the dataset viewer correctly shows the data across files, although it still indicates the above error.
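To spot which columns are likely to trigger this, one option is to scan each file before loading it. The sketch below is not part of the datasets library; the helper names non_null_columns and conflicting_columns are made up for illustration, and it assumes flat JSON Lines files like the ones in the example above. It reports the columns that are entirely null in one file but populated in another.

```python
import json


def non_null_columns(path):
    """Map each column in a JSON Lines file to True if it has any non-null value."""
    seen = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue  # skip blank lines
            for key, value in json.loads(line).items():
                seen[key] = seen.get(key, False) or value is not None
    return seen


def conflicting_columns(paths):
    """Return columns that are all-null in one file but populated in another."""
    per_file = [non_null_columns(path) for path in paths]
    conflicts = set()
    for columns in per_file:
        for name, has_value in columns.items():
            # A column is suspicious if this file has only nulls for it
            # while at least one other file has a real value.
            if not has_value and any(other.get(name, False) for other in per_file):
                conflicts.add(name)
    return sorted(conflicts)
```

For the two example files above, only column d is all-null in one file and populated in the other, so it is the one that would be flagged.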
I managed to find a workaround: specifying the features explicitly, which can also be done directly in the YAML file configuration.
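A minimal sketch of that workaround in Python, assuming every column is an optional string as in the example rows above. The helper names column_names and load_with_explicit_features are hypothetical; Features, Value and load_dataset are the datasets calls being relied on, and the import is done lazily so the column scan can be run on its own.

```python
import json


def column_names(paths):
    """Collect every column name seen in the given JSON Lines files, in order."""
    names = []
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                if not line.strip():
                    continue  # skip blank lines
                for key in json.loads(line):
                    if key not in names:
                        names.append(key)
    return names


def load_with_explicit_features(paths):
    # Imported here so column_names() stays usable without `datasets` installed.
    from datasets import Features, Value, load_dataset

    # Assumption: all columns are nullable strings; adjust the types for real data.
    features = Features({name: Value("string") for name in column_names(paths)})
    return load_dataset(
        "json",
        data_files=[str(path) for path in paths],
        features=features,
        split="train",
    )
```

Passing the schema up front means the column types no longer depend on which rows happen to be read first.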
NeMo-issues.jsonl is the original data file. To work around huggingface/datasets#7222, we create a new file NeMo-issues-fixed.jsonl which consists of the last 1000 lines and then the first 9000 lines of NeMo-issues.jsonl.
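The reordering described above can be sketched as follows. The function name move_tail_to_front is hypothetical, and the reason it helps is an assumption on my part: presumably the last lines contain non-null values for the affected fields, so the first chunk read no longer infers a null type.

```python
def move_tail_to_front(src, dst, tail=1000, head=9000):
    """Write the last `tail` lines of src, then its first `head` lines, to dst."""
    if tail <= 0 or head < 0:
        raise ValueError("tail must be positive and head must be non-negative")
    with open(src, encoding="utf-8") as handle:
        lines = handle.readlines()
    # Make sure every line ends with a newline so moved lines do not fuse together.
    lines = [line if line.endswith("\n") else line + "\n" for line in lines]
    reordered = lines[-tail:] + lines[:head]
    with open(dst, "w", encoding="utf-8") as handle:
        handle.writelines(reordered)
    return len(reordered)
```

Called as move_tail_to_front("NeMo-issues.jsonl", "NeMo-issues-fixed.jsonl"), the defaults match the 1000 and 9000 line counts mentioned above.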
I hit the same issue for
For NeMo-issues.jsonl, I got an exception:
For NeMo-issues-fixed.jsonl, which consists of the last 1000 lines followed by the first 9000 lines of NeMo-issues.jsonl, I could load the data:
Describe the bug
In general, changing the type from string to null is allowed within a dataset; there are even examples of this in the documentation.
However, if the dataset is large and unevenly distributed, this allowance stops working. The schema gets locked in after reading a chunk.
Consequently, if all values in the first chunk of a field are, for example, null, the field will be locked as type null, and if a string appears in that field in the second chunk, it will trigger this error:
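The behaviour described above can be illustrated with a toy model. This is not the library's actual code: infer_type and load_in_chunks are made-up names, and the model only mimics the idea that the type inferred from the first chunk is kept for all later chunks.

```python
def infer_type(values):
    """Infer a toy column type for one chunk: "null", "string", or "other"."""
    non_null = [value for value in values if value is not None]
    if not non_null:
        return "null"
    return "string" if all(isinstance(value, str) for value in non_null) else "other"


def load_in_chunks(values, chunk_size):
    """Lock the column type from the first chunk and check later chunks against it."""
    locked = None
    for start in range(0, len(values), chunk_size):
        kind = infer_type(values[start:start + chunk_size])
        if locked is None:
            locked = kind  # the schema is fixed after the first chunk
        elif kind not in (locked, "null"):
            # An all-null chunk fits any locked type, but a string chunk
            # cannot fit a column that was locked as null.
            raise TypeError(f"Couldn't cast array of type {kind} to {locked}")
    return locked
```

In this model, a column whose first chunk contains at least one string loads fine, while the same values ordered so that the first chunk is all null raise the error from the title.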
Traceback
Steps to reproduce the bug
Expected behavior
Concatenation of the chunks without errors
Environment info
datasets version: 3.0.1
huggingface_hub version: 0.24.7
fsspec version: 2024.6.1