Saving a DataFrame with many columns throws obscure error #635
I've tried to dig into this a couple of times, but unfortunately, this is a pretty deep, nasty compiler-corruption issue. I've chatted with @Keno a bit about what's going on and he said he'll try to take a look soonish. If we can narrow down a bit what exactly is causing the corruption, maybe we can find a workaround, or perhaps there are concrete fixes in the compiler that can help. FWIW, I can reproduce the results on latest 1.6 master with just the following:

```julia
using CSV, DataFrames

ncols = 67000
df = DataFrame(rand(6, ncols));
CSV.write("huge.csv", df);
```
There have been a few cases of extremely wide tables where users have run into fundamental compiler limits on tuple lengths (as discussed with core devs); one example is JuliaData/CSV.jl#635. This PR proposes, for very large schemas (> 65,000 columns), storing names/types in `Vector`s instead of tuples, with the aim of not breaking the runtime. The aim here is to be as non-disruptive as possible, hence the very high threshold for switching over to stored names/types. Another goal is that downstream packages don't break with just these changes in place. I'm not aware of any packages testing such wide tables, but in my own testing I've seen issues where packages rely on the `Tables.Schema` type parameters for names/types. There's also an issue in DataFrames where `Tables.schema` attempts to construct a `Tables.Schema` directly instead of using the `Tables.Schema(names, types)` constructor. So while this PR is needed, we'll need to play whack-a-mole with downstream packages to ensure these really wide tables are properly supported end-to-end. As we go through those downstream package changes, we should make notes on how to clarify the Tables.jl interface docs, to help future implementers implement the interface properly and avoid the same pitfalls.
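To illustrate the idea, here is a minimal sketch of a schema type that keeps names/types in its type parameters for narrow tables and falls back to plain field storage past a column-count threshold. The names `WideSchema`, `WIDE_THRESHOLD`, and `colnames` are stand-ins for illustration, not the actual Tables.jl implementation:

```julia
# Sketch only: below the threshold, names/types live in the type
# parameters; above it, they are stored as Vector fields so the
# compiler never sees enormous tuple types.
struct WideSchema{names, types}
    storednames::Union{Nothing, Vector{Symbol}}
    storedtypes::Union{Nothing, Vector{Type}}
end

const WIDE_THRESHOLD = 65_000  # mirrors the PR's "> 65,000 columns" cutoff

function WideSchema(names, types)
    if length(names) > WIDE_THRESHOLD
        # Too wide: keep names/types as Vectors, leave type parameters empty.
        return WideSchema{nothing, nothing}(collect(Symbol, names), collect(Type, types))
    else
        # Narrow enough: encode names/types in the type parameters, as before.
        return WideSchema{Tuple(map(Symbol, names)), Tuple{types...}}(nothing, nothing)
    end
end

# Accessors hide the storage choice from downstream code.
colnames(s::WideSchema{names}) where {names} =
    names === nothing ? s.storednames : collect(names)
```

The key point for downstream packages is to go through the constructor and accessors rather than destructuring the type parameters directly, since for wide tables those parameters may be `nothing`.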
This is part of fixing errors like JuliaData/CSV.jl#635, in addition to the changes to support really wide tables in JuliaData/Tables.jl#241. Luckily, there aren't many cases I've found across Tables.jl implementations that make working with really wide tables impossible, but this was a key place where, for really wide tables, we want the names/types stored as `Vector`s instead of `Tuple`/`Tuple{}` in `Tables.Schema`. This shouldn't have any noticeable effect on non-wide DataFrames and should be covered by existing tests.
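As a hedged illustration of the constructor point (using the public `Tables.Schema(names, types)` API; the variable names are arbitrary):

```julia
using Tables

names = [:a, :b, :c]
types = [Int, Float64, String]

# Fragile: constructing the type directly hard-codes the assumption
# that names/types always live in the type parameters, which no longer
# holds for very wide schemas.
# sch = Tables.Schema{(names...,), Tuple{types...}}()

# Preferred: the constructor can pick the right representation
# (type parameters for narrow tables, stored Vectors for wide ones).
sch = Tables.Schema(names, types)
```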
Along with the Tables.jl and DataFrames.jl fixes, this PR provides the CSV.jl part of fixing #635. I was pleasantly surprised to find this was all that was needed to support extremely wide tables when writing (reading already works fine).
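With all three fixes in place, the reproduction from above should run to completion. A sketch (note that recent DataFrames versions require the `:auto` argument when constructing a DataFrame from a matrix, an assumption about the reader's DataFrames version):

```julia
using CSV, DataFrames

ncols = 67_000
df = DataFrame(rand(6, ncols), :auto)  # `:auto` generates x1, x2, ... column names
CSV.write("huge.csv", df)  # no longer hits the tuple-length compiler limit
```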
This is fixed on current CSV.jl.
Issue:
The following code (see the reproduction above) produces an obscure error.
Info:
The full error message, which is very long due to printing the many column names, can be found here.