Allow stored names/types in Schema for very large schemas #241
There have been a few cases of extremely wide tables where users have run into fundamental compiler limits on tuple length (as discussed with core devs). One example is JuliaData/CSV.jl#635. For very large schemas (> 65,000 columns), this PR proposes storing names/types in `Vector`s instead of tuples, so that such tables no longer break the runtime. The aim is to be as non-disruptive as possible, hence the very high threshold for switching over to stored names/types. Another goal is that downstream packages don't break with just these changes in place.

I'm not aware of any packages testing such wide tables, but in my own testing, I've seen issues where packages rely on the `Tables.Schema` type parameters for names/types. There's also an issue in DataFrames where `Tables.schema` attempts to construct a `Tables.Schema` directly instead of using the `Tables.Schema(names, types)` constructor. So while this PR is needed, we'll need to play whack-a-mole with downstream packages to ensure these really wide tables are properly supported end-to-end. As we go through those downstream package changes, we should make notes on how to clarify the Tables.jl interface docs, to help future implementors avoid the same pitfalls.
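To make the idea concrete, here is a minimal, self-contained sketch of the storage switch described above. It is not the code in this PR: `SketchSchema`, `WIDE_THRESHOLD`, `schemanames` and `schematypes` are hypothetical names, and the exact threshold value is an assumption (the description only says "> 65,000 columns").

```julia
# Hypothetical threshold; the PR description only says "> 65,000 columns".
const WIDE_THRESHOLD = 65_000

# Narrow schemas keep names/types in the type parameters N and T.
# Very wide schemas use N = T = nothing and keep them in the fields instead.
struct SketchSchema{N, T}
    storednames::Union{Nothing, Vector{Symbol}}
    storedtypes::Union{Nothing, Vector{Type}}
end

function SketchSchema(names, types)
    if length(names) > WIDE_THRESHOLD
        # Too wide for a tuple type: store plain vectors.
        nms = Symbol[Symbol(nm) for nm in names]
        typs = Type[T for T in types]
        return SketchSchema{nothing, nothing}(nms, typs)
    else
        # Small enough: encode everything in the type, store nothing.
        return SketchSchema{Tuple(Symbol.(names)), Tuple{types...}}(nothing, nothing)
    end
end

# Accessors hide which representation is in use.
schemanames(::SketchSchema{N, T}) where {N, T} = N
schemanames(s::SketchSchema{nothing, nothing}) = s.storednames
schematypes(::SketchSchema{N, T}) where {N, T} = fieldtypes(T)
schematypes(s::SketchSchema{nothing, nothing}) = s.storedtypes

narrow = SketchSchema((:a, :b), (Int, String))
wide = SketchSchema([Symbol("x", i) for i in 1:70_000], fill(Float64, 70_000))
```

In this sketch, callers that go through the accessors see a tuple for `narrow` and a `Vector` for `wide`, while code that pattern-matches on the type parameters only works for the narrow case, which is the kind of downstream breakage mentioned above.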
Codecov Report

| | main | #241 | +/- |
|---|---|---|---|
| Coverage | 94.59% | 94.75% | +0.16% |
| Files | 7 | 7 | |
| Lines | 610 | 629 | +19 |
| Hits | 577 | 596 | +19 |
| Misses | 33 | 33 | |
Do we test against an empty schema?
Great, this may help JuliaAI/MLJBase.jl#428
Looks like we don't have any explicit tests; I'll add a few.
I don't think this will affect the case mentioned in that issue specifically. The core issue here is that Julia's core But none of that changes that when you try to do
Ok, I've made
This is part of fixing errors like JuliaData/CSV.jl#635, in addition to the changes to support really wide tables in JuliaData/Tables.jl#241. Luckily, I've found few places across Tables.jl implementations that make working with really wide tables impossible, but this was a key one: for really wide tables, we want the names/types to be stored as `Vector`s instead of `Tuple`/`Tuple{}` in `Tables.Schema`. This shouldn't have any noticeable effect for non-wide DataFrames and should be covered by existing tests.
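For downstream code, one pattern that should work with either representation is to read names/types through `sch.names` and `sch.types` rather than through the `Tables.Schema` type parameters. The sketch below assumes Tables.jl is installed; `column_summary` is a hypothetical helper written for illustration, not part of any package.

```julia
using Tables

# Hypothetical helper: pair each column name with its type (or `nothing`
# when the schema does not know the types), without touching the
# Tables.Schema type parameters.
function column_summary(tbl)
    sch = Tables.schema(tbl)
    sch === nothing && return nothing   # table has no known schema
    nms = sch.names                     # tuple or Vector, depending on width
    typs = sch.types                    # tuple, Vector, or nothing
    return [nm => (typs === nothing ? nothing : typs[i])
            for (i, nm) in enumerate(nms)]
end

summary = column_summary((a = [1, 2], b = ["x", "y"]))
```

Because the helper only iterates and indexes, it does not care whether the names/types come back as tuples or as `Vector`s.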
Great work - thank you!
get(io, :print_schema_header, true) && println(io, "Tables.Schema:")
Base.print_matrix(io, hcat(collect(names), types === nothing ? fill(nothing, length(names)) : collect(fieldtype(types, i) for i = 1:fieldcount(types))))
nms = sch.names
Base.print_matrix(io, hcat(nms isa Vector ? nms : collect(nms), sch.types === nothing ? fill(nothing, length(nms)) : collect(sch.types)))
just to be sure - are both paths tested?
…ly (#2797)