Saving a DataFrame with many columns throws obscure error #635

Closed
skleinbo opened this issue Jun 7, 2020 · 2 comments

skleinbo commented Jun 7, 2020

Issue:
The following code

using DataFrames, CSV

df = DataFrame(rand(6, 10^5));
CSV.write("huge.csv", df);

produces

ERROR: LoadError: UndefVarError: x2465 not defined
Stacktrace:
 [1] x2466 at .\x2469:95 [inlined]
 [2] macro expansion at .\x2468:83 [inlined]
 [3] eachcolumn at .\x2468:66 [inlined]
 [4] writerow(`...

Info:

julia> versioninfo()
Julia Version 1.4.2
Commit 44fa15b150* (2020-05-23 18:35 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-8.0.1 (ORCJIT, skylake)

(@v1.4) pkg> st --manifest CSV DataFrames
Status `C:\Users\stephan\.julia\environments\v1.4\Manifest.toml`
  [336ed68f] CSV v0.6.2 [`C:\Users\stephan\.julia\dev\CSV`]
  [324d7699] CategoricalArrays v0.8.1
  [34da2185] Compat v3.10.0
  [9a962f9c] DataAPI v1.3.0
  [a93c6f00] DataFrames v0.21.2
  [48062228] FilePathsBase v0.8.0
  [41ab1584] InvertedIndices v1.0.0
  [82899510] IteratorInterfaceExtensions v1.0.0
  [e1d29d7a] Missings v0.4.3
  [69de0a69] Parsers v1.0.4
  [2dfb63ee] PooledArrays v0.5.3
  [189a3867] Reexport v0.2.0
  [a2af1166] SortingAlgorithms v0.3.1
  [3783bdb8] TableTraits v1.0.0
  [bd369af6] Tables v1.0.4
  [ea10d353] WeakRefStrings v0.6.2
  [ade2ca70] Dates
  [9fa8497b] Future
  [a63ad114] Mmap
  [de0858da] Printf
  [3fa0cd96] REPL
  [10745b16] Statistics
  [4ec0a83e] Unicode

Full error message - which is very long due to printing the many column names - can be found here

  • Works with 5*10^4 columns (buffer overflow?)
  • Reproducible on Mac with same Julia/package versions
  • Reproducible on Julia 1.3.1
  • No changes were made to the CSV.jl source, despite being checked out for development.

quinnj commented Oct 30, 2020

I've tried to dig into this a couple of times, but unfortunately, this is a pretty deep, nasty corrupted compiler issue. I've chatted with @Keno a bit about what's going on and he said he'll try to take a look soonish. If we can narrow down a bit what exactly is causing corruption, maybe we can find a work-around, or perhaps there are concrete fixes in the compiler that can help.

FWIW, I can reproduce the results on latest 1.6 master with just the following:

using CSV, DataFrames
ncols = 67000
df = DataFrame(rand(6, ncols));
CSV.write("huge.csv", df);

quinnj added a commit to JuliaData/Tables.jl that referenced this issue Jun 18, 2021
There have been a few cases of extremely wide tables where users have
run into fundamental compiler limits for lengths of tuples (as discussed
with core devs). One example is
JuliaData/CSV.jl#635. This PR proposes, for
very large schemas (> 65,000 columns), storing names/types in a `Vector`
instead of tuples, with the aim of not breaking the runtime. The aim
here is to be as non-disruptive as possible, hence the very high
threshold for switching over to store names/types. Another goal is that
downstream packages don't break with just these changes in place. I'm
not aware of any packages testing such wide tables, but in my own
testing, I've seen issues where packages are relying on the
`Tables.Schema` type parameters for names/types. There's also an issue
in DataFrames where `Tables.schema` attempts to construct a
`Tables.Schema` directly instead of using the `Tables.Schema(names,
types)` constructor. So while this PR is needed, we'll need to play
whack-a-mole with downstream packages to ensure these really wide tables
can be properly supported end-to-end. Going through those downstream
package changes, we should probably make notes of how we can clarify
Tables.jl interface docs to hopefully help future implementors do so
properly and avoid the same pitfalls.
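
To make the idea in this commit message concrete, here is a minimal sketch in plain Julia of a schema that keeps names/types in its type parameters for ordinary tables and switches to stored `Vector`s past a width threshold. Everything in it (`SketchSchema`, `SKETCH_THRESHOLD`, `schemanames`, `schematypes`) is hypothetical and written only for illustration; it is not the Tables.jl implementation, and the cutoff value is simply the figure quoted in the message above.

```julia
# Hypothetical illustration only; not the Tables.jl source.
# Assumed cutoff, taken from the "> 65,000 columns" figure in the commit message.
const SKETCH_THRESHOLD = 65_000

# For narrow tables the names/types live in the type parameters and both
# fields are `nothing`. For very wide tables the parameters are `nothing`
# and the names/types are stored in ordinary vectors instead.
struct SketchSchema{names, types}
    storednames::Union{Nothing, Vector{Symbol}}
    storedtypes::Union{Nothing, Vector{Type}}
end

function SketchSchema(names, types)
    if length(names) > SKETCH_THRESHOLD
        # Too wide: avoid building a huge tuple type, store vectors instead.
        return SketchSchema{nothing, nothing}(collect(Symbol, names), collect(Type, types))
    else
        # Narrow enough: encode everything in the type parameters.
        return SketchSchema{Tuple(names), Tuple{types...}}(nothing, nothing)
    end
end

# Accessors that work for both representations, so callers never need to
# look at the type parameters directly.
schemanames(s::SketchSchema{names}) where {names} =
    names === nothing ? s.storednames : names

schematypes(s::SketchSchema{names, types}) where {names, types} =
    types === nothing ? s.storedtypes : (types.parameters...,)

small = SketchSchema([:a, :b], [Int, Float64])
wide = SketchSchema([Symbol("x", i) for i in 1:70_000], fill(Float64, 70_000))
```

The point of the accessor functions is that code written against them keeps working when the representation changes, which is the property the commit message asks downstream packages to preserve.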
quinnj added a commit to JuliaData/DataFrames.jl that referenced this issue Jun 22, 2021
This is part of fixing errors like
JuliaData/CSV.jl#635 in addition to the
changes to support really wide tables in
JuliaData/Tables.jl#241. Luckily, there aren't
many cases I've found across Tables.jl implementations that make working
with really wide tables impossible, but this was a key place where for
really wide tables, we want the names/types to be stored as `Vector`s
instead of `Tuple`/`Tuple{}` in `Tables.Schema`. This shouldn't have any
noticeable change/effect for non-wide DataFrames and should be covered
by existing tests.
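
For downstream code, the practical consequence described here might look like the sketch below: build schemas through the `Tables.Schema(names, types)` constructor and read names/types through the `names` and `types` properties, not the type parameters. This assumes a Tables.jl release that already includes the stored-schema change; `column_names` is a hypothetical helper introduced only for this example.

```julia
using Tables

# Hypothetical helper: return the column names of any Tables.jl schema.
function column_names(sch::Tables.Schema)
    # Read the `names` property instead of the type parameters. For very
    # wide schemas the names are kept in a Vector rather than in the type.
    return collect(Symbol, sch.names)
end

# Go through the constructor so that Tables.jl can choose the representation.
narrow = Tables.Schema([:a, :b], [Int, Float64])
wide = Tables.Schema([Symbol("x", i) for i in 1:70_000], fill(Float64, 70_000))

narrow_names = column_names(narrow)
wide_names = column_names(wide)
```

On a Tables.jl version without the change, constructing the wide schema would presumably still try to build a 70,000-element tuple type, so the wide case is best tried only on a recent release.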
quinnj added a commit to JuliaData/DataFrames.jl that referenced this issue Jun 22, 2021
…ly (#2797)

quinnj added a commit to JuliaData/Tables.jl that referenced this issue Jun 23, 2021
* Allow stored names/types in Schema for very large schemas

* Add tests; update eachcolumn/eachcolumns

* Add some more testing for Tables.jl-provided types

* fix

* fix2

* fix corner case

* fix tests
quinnj added a commit that referenced this issue Jun 23, 2021
Along with Tables.jl and DataFrames.jl fixes, this provides the CSV.jl
part of fixing #635. I was pleasantly surprised to find this was all
that was needed to support extremely wide tables when writing (reading
already works fine).
quinnj added a commit that referenced this issue Jun 23, 2021
#849)

quinnj commented Aug 20, 2021

This is fixed on current CSV.jl main where we handle the new Tables.jl functionality of stored schema.

@quinnj quinnj closed this as completed Aug 20, 2021