Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Request for better integration with MLUtils.jl #330

Open
ablaom opened this issue Apr 4, 2023 · 6 comments
Open

Request for better integration with MLUtils.jl #330

ablaom opened this issue Apr 4, 2023 · 6 comments

Comments

@ablaom
Copy link
Contributor

ablaom commented Apr 4, 2023

As we can see from the errors below, this issue is about having a general fallback to get the number of rows of a table (where known) which has come up several times before.

using Tables, MLUtils, DataFrames

matrix = rand(5, 3)

# data frames work with MLJUtils.eachobs:
df = DataFrame(matrix, [:x, :y, :z])
itr = eachobs(df)
rows = collect(itr)
# 5-element Vector{NamedTuple{(:x, :y, :z), Tuple{Float64, Float64, Float64}}}:
#  (x = 0.33293061622080167, y = 0.6294965738565482, z = 0.1302089332160854)
#  (x = 0.5483824901396804, y = 0.18815431424905893, z = 0.7994920108502044)
#  (x = 0.8279896155449579, y = 0.12200341195127706, z = 0.02419234041200058)
#  (x = 0.7029168250852702, y = 0.7140012316560944, z = 0.19287954223140447)
#  (x = 0.8703837453976674, y = 0.10716967455998871, z = 0.9857281022636002)

# so do row tables:
rowtable = Tables.rowtable(df)
itr = eachobs(rowtable)
@assert collect(itr) == rows

# so do column tables:
columntable = Tables.columntable(df)
itr = eachobs(columntable)
@assert collect(itr) == rows

# matrix tables
matrixtable = Tables.table(matrix)
itr = eachobs(matrixtable); # error if you try to display
collect(itr)
# ERROR: MethodError: no method matching nrow(::Tables.MatrixTable{Matrix{Float64}})
# Closest candidates are:
#   nrow(::SubDataFrame) at ~/.julia/packages/DataFrames/LteEl/src/subdataframe/subdataframe.jl:164                
#   nrow(::DataFrame) at ~/.julia/packages/DataFrames/LteEl/src/dataframe/dataframe.jl:459
# Stacktrace:
#  [1] numobs(::Type{MLUtils.IsTable{Tables.MatrixTable{Matrix{Float64}}}}, data::Tables.MatrixTable{Matrix{Float64}})                                                        
#    @ MLUtils ~/.julia/packages/MLUtils/R44Zf/src/observation.jl:52
#  [2] numobs
#    @ ~/.julia/packages/SimpleTraits/l1ZsK/src/SimpleTraits.jl:331 [inlined]
#  [3] length(e::DataLoader{Tables.MatrixTable{Matrix{Float64}}, Random._GLOBAL_RNG, Val{nothing}})                    
#    @ MLUtils ~/.julia/packages/MLUtils/R44Zf/src/eachobs.jl:194
#  [4] _similar_shape(itr::DataLoader{Tables.MatrixTable{Matrix{Float64}}, Random._GLOBAL_RNG, Val{nothing}}, #unused#::Base.HasLength)
#    @ Base ./array.jl:663
#  [5] _collect(c::UnitRange{Int64}, itr::DataLoader{Tables.MatrixTable{Matrix{Float64}}, Random._GLOBAL_RNG, Val{nothing}}, #unused#::Base.EltypeUnknown, isz::Base.HasLength)
#    @ Base ./array.jl:806
#  [6] collect(itr::DataLoader{Tables.MatrixTable{Matrix{Float64}}, Random._GLOBAL_RNG, Val{nothing}})                    
#    @ Base ./array.jl:712
#  [7] top-level scope
#    @ REPL[94]:1

# similar errors:
collect(eachobs(Tables.dictrowtable(df)))
collect(eachobs(Tables.dictcolumntable(df)))

@bkamins @nalimilan

@bkamins
Copy link
Member

bkamins commented Apr 4, 2023

As the error shows the nrow function should be used to get this value, as is defined here https://github.com/JuliaData/DataAPI.jl/blob/main/src/DataAPI.jl#L276.

The issue is that, as I understand currently matrixtable, dictrowtable and dictcolumntable do not implement nrow yet. Right?

A related issue is JuliaData/DataFrames.jl#3054, where I wanted to get a feedback if anyone uses eachrow(df) or eachcol(df) as Tables.jl table (as then we should also add support for nrow and ncol there).

@ablaom
Copy link
Contributor Author

ablaom commented Apr 5, 2023

Yes, if we implement DataAPI.nrow for those specific tables in DataAPI.jl then that accommodates the particular cases raised above. But we will have the same problem for any table that does not overload DataAPI.nrow. I wonder if DataAPI.jl is actually the right place for this method. Wouldn't it make more sense to for Tables.jl to own nrow because Tables.jl can provide a fallback, something like nrows(table) = length(Tables.rows(table)) or whatever?

@ablaom
Copy link
Contributor Author

ablaom commented Apr 6, 2023

Okay, I guess the argument is that providing a fallback in Tables.jl is not that useful if it will not always work. And adding something in Tables.jl to declare "this table has a length" doesn't make sense, as implementing DataAPI.nrow amounts to the same thing, yes?

@nalimilan
Copy link
Member

nrow(table) = length(Tables.rows(table)) can be super inefficient, so it's probably better to file issues against types that don't implement nrow.

@bkamins
Copy link
Member

bkamins commented Apr 7, 2023

Also some streamed tables might have nrow undefined.

@EssamWisam
Copy link

Also some streamed tables might have nrow undefined.

Wouldn't it be nice if Tables.jl provided a way for a source to declare whether it has a length Tables.hasLength(x) so that there can be a standard way to query that length which is perhaps a basic requirement for table consumers?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants