Saving a DataFrame with many columns throws obscure error #635
I've tried to dig into this a couple of times, but unfortunately, this is a pretty deep, nasty compiler-corruption issue. I've chatted with @Keno a bit about what's going on and he said he'll try to take a look soonish. If we can narrow down a bit what exactly is causing the corruption, maybe we can find a workaround, or perhaps there are concrete fixes in the compiler that can help. FWIW, I can reproduce the results on latest 1.6 master with just the following:

```julia
using CSV, DataFrames

ncols = 67000
df = DataFrame(rand(6, ncols));
CSV.write("huge.csv", df);
```
There have been a few cases of extremely wide tables where users have run into fundamental compiler limits on tuple lengths (as discussed with core devs); one example is JuliaData/CSV.jl#635. This PR proposes, for very large schemas (> 65,000 columns), storing names/types in `Vector`s instead of tuples, with the aim of not breaking the runtime. The aim here is to be as non-disruptive as possible, hence the very high threshold for switching over to stored names/types. Another goal is that downstream packages don't break with just these changes in place. I'm not aware of any packages testing such wide tables, but in my own testing I've seen issues where packages rely on the `Tables.Schema` type parameters for names/types. There's also an issue in DataFrames where `Tables.schema` attempts to construct a `Tables.Schema` directly instead of using the `Tables.Schema(names, types)` constructor. So while this PR is needed, we'll need to play whack-a-mole with downstream packages to ensure these really wide tables are properly supported end-to-end. As we go through those downstream package changes, we should make notes on how to clarify the Tables.jl interface docs, to help future implementers implement the interface properly and avoid the same pitfalls.
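To illustrate the idea, here is a minimal sketch of a schema type that keeps names/types in its type parameters for narrow tables and falls back to plain field storage past a column-count threshold. The names `WideSchema`, `WIDE_THRESHOLD`, and `colnames` are stand-ins for illustration, not the actual Tables.jl implementation:

```julia
# Sketch only: below the threshold, names/types live in the type
# parameters; above it, they are stored as Vector fields so the
# compiler never sees enormous tuple types.
struct WideSchema{names, types}
    storednames::Union{Nothing, Vector{Symbol}}
    storedtypes::Union{Nothing, Vector{Type}}
end

const WIDE_THRESHOLD = 65_000  # mirrors the PR's "> 65,000 columns" cutoff

function WideSchema(names, types)
    if length(names) > WIDE_THRESHOLD
        # Too wide: keep names/types as Vectors, leave type parameters empty.
        return WideSchema{nothing, nothing}(collect(Symbol, names), collect(Type, types))
    else
        # Narrow enough: encode names/types in the type parameters, as before.
        return WideSchema{Tuple(map(Symbol, names)), Tuple{types...}}(nothing, nothing)
    end
end

# Accessors hide the storage choice from downstream code.
colnames(s::WideSchema{names}) where {names} =
    names === nothing ? s.storednames : collect(names)
```

The key point for downstream packages is to go through the constructor and accessors rather than destructuring the type parameters directly, since for wide tables those parameters may be `nothing`.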
This is part of fixing errors like JuliaData/CSV.jl#635, in addition to the changes to support really wide tables in JuliaData/Tables.jl#241. Luckily, there aren't many cases I've found across Tables.jl implementations that make working with really wide tables impossible, but this was a key place where, for really wide tables, we want the names/types stored as `Vector`s instead of `Tuple`/`Tuple{}` in `Tables.Schema`. This shouldn't have any noticeable effect on non-wide DataFrames and should be covered by existing tests.
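As a hedged illustration of the constructor point (using the public `Tables.Schema(names, types)` API; the variable names are arbitrary):

```julia
using Tables

names = [:a, :b, :c]
types = [Int, Float64, String]

# Fragile: constructing the type directly hard-codes the assumption
# that names/types always live in the type parameters, which no longer
# holds for very wide schemas.
# sch = Tables.Schema{(names...,), Tuple{types...}}()

# Preferred: the constructor can pick the right representation
# (type parameters for narrow tables, stored Vectors for wide ones).
sch = Tables.Schema(names, types)
```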
Along with the Tables.jl and DataFrames.jl fixes, this PR provides the CSV.jl part of fixing #635. I was pleasantly surprised to find this was all that was needed to support extremely wide tables when writing (reading already works fine).
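With all three fixes in place, the reproduction from above should run to completion. A sketch (note that recent DataFrames versions require the `:auto` argument when constructing a DataFrame from a matrix, an assumption about the reader's DataFrames version):

```julia
using CSV, DataFrames

ncols = 67_000
df = DataFrame(rand(6, ncols), :auto)  # `:auto` generates x1, x2, ... column names
CSV.write("huge.csv", df)  # no longer hits the tuple-length compiler limit
```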
This is fixed on current CSV.jl.
Issue:
The following code (see the reproduction above) produces an obscure error.
Info:
The full error message, which is very long due to printing the many column names, can be found here.