Sparse data and abstract matrix input #731
I seem to have outdated information for a big chunk of this discussion, but FWIW I agree with you that "feature-sparse" seems the more relevant/important use case. Note that, in my experience, the story with the likes of pandas is not ideal either, last I checked. One note: a big chunk of the use cases for sparse data is encoding (like OHE). Given that models in MLJ can ingest data and then do their own processing, you could imagine they do their own encoding and handle the sparsity as well, which may be what you're already suggesting?
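For concreteness, here is a minimal sketch of what "models doing their own encoding" could look like while keeping the result sparse; this is not MLJ code, and `sparse_onehot` is a made-up helper:

```julia
using SparseArrays

# One-hot encode a categorical vector directly into a SparseMatrixCSC,
# avoiding the dense n × n_levels intermediate.
function sparse_onehot(v::AbstractVector)
    levels = unique(v)
    index  = Dict(l => j for (j, l) in enumerate(levels))
    I = collect(1:length(v))            # row indices: one stored entry per observation
    J = [index[x] for x in v]           # column indices: the observation's level
    V = ones(Float64, length(v))        # all stored values are 1.0
    sparse(I, J, V, length(v), length(levels))
end

X = sparse_onehot(rand(["a", "b", "c", "d"], 10))  # 10×4 sparse matrix, one 1 per row
```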
Allowing input to models to be an instance ...
The absence of sparse data support in OneHotEncoder and ContinuousEncoder makes them unusable for data with a large number of features, or for categorical features with a large number of categories. Yes, sparse features can be created by hand, but that nullifies the purpose of MLJ of making ML easy to do.
I'm coming along to bump a relatively old conversation here -- how does this topic relate to the notion, discussed a while ago, of supporting sparse matrices throughout the MLJ flow (i.e. avoiding densification by MLJ proper, at least, even if individual model implementers don't handle this properly)?
@yalwan-sage Good to hear from you! Let me clarify that MLJ itself does not impose densification. The issue is that MLJ encourages implementers of the MLJ model interface to accept tabular input where this makes sense. If densification is inevitable, this is no big deal. It would also be no big deal if wrapping matrices with a large number of columns as tables worked well, but as far as I know a suitable sparse tabular format does not exist. I initially thought ... As far as I can tell, DataFrames deals with sparsity within columns, but not sparsity within rows.

Alternatively (or additionally), any model can choose to accept matrix data. In that case, it must be able to handle any abstract matrix with an appropriate element scitype:

```julia
julia> models() do m
           AbstractMatrix{Continuous} <: m.input_scitype
       end
(name = EvoTreeClassifier, package_name = EvoTrees, ... )
(name = EvoTreeCount, package_name = EvoTrees, ... )
(name = EvoTreeGaussian, package_name = EvoTrees, ... )
(name = EvoTreeRegressor, package_name = EvoTrees, ... )
(name = TSVDTransformer, package_name = TSVD, ... )
```

Moving forward, either someone introduces a better feature-sparse tabular format (so that dealing with sparsity becomes more of an implementation detail), or existing models that can support sparse data extend their declared `input_scitype` to admit abstract matrices.

I'm not yet convinced by @OkonSamuel's suggestion that we need separate scitypes to handle sparse data. It seems to me that sparsity is more a property of the representation of the data than of its "scientific" interpretation. Perhaps flagging a model as supporting sparse data with a model trait is better.
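To make those two options concrete, here is a hedged sketch; `MySparseFriendlyModel` is hypothetical, and the `supports_sparse` trait in option (b) does not exist in MLJModelInterface, it is just the kind of trait being floated here:

```julia
import MLJModelInterface as MMI

mutable struct MySparseFriendlyModel <: MMI.Deterministic end   # hypothetical model

# (a) extend the declared input scitype so that an AbstractMatrix (dense or
#     sparse) is an admissible input representation:
MMI.input_scitype(::Type{<:MySparseFriendlyModel}) =
    Union{MMI.Table(MMI.Continuous), AbstractMatrix{MMI.Continuous}}

# (b) hypothetical alternative: a dedicated trait flagging sparse support
#     (no such trait exists at the time of writing):
# MMI.supports_sparse(::Type{<:MySparseFriendlyModel}) = true
```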
Right, presumably because you can build your dataframe as a collection of `SparseVector`s.
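A minimal sketch of that, just to make the "sparsity within columns, not rows" point concrete (the column values are illustrative only):

```julia
using DataFrames, SparseArrays

# Each column is a SparseVector, so zeros within a column are not stored,
# but a mostly-zero *row* still has an entry slot in every column.
df = DataFrame(
    x1 = sparsevec([1, 500], [1.0, 2.0], 1_000),
    x2 = sparsevec([7], [3.0], 1_000),
)
```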
So this is actually part of why I've come looking. I'm hoping to add support to LightGBM.jl for dataset construction from (to begin with) sparse matrices, and I was wondering if there was a scitype or trait I needed to set to indicate this when patching up the interface. From what I understand of what you wrote, we don't yet have a finalised way for an implementer to indicate sparse support, and exactly how to do so has not been settled. Is that right?
Thanks for that clarification.
So, currently, models do not need to articulate that they support sparsity. However, a Zoom discussion with @OkonSamuel has raised another point for me, which is that it's probably worth models articulating (with a new trait) whether the core algorithm likes observations as rows or as columns. Because of ...
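For illustration only, the kind of trait being suggested might look like the following; nothing like this exists in MLJModelInterface today, and the names are made up:

```julia
# Hypothetical trait: does the core algorithm want observations as the rows
# or as the columns of the data matrix?
abstract type MyColumnMajorModel end           # stand-in for some model type

observation_layout(::Type) = :rows                           # default assumption
observation_layout(::Type{<:MyColumnMajorModel}) = :columns  # column-major algorithm
```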
Just to clarify, when you put `SparseVector`, do you mean the one from SparseArrays?
Yes, from the stdlib SparseArrays.
Discussions at MLJ meetings have turned to the problem of sparse data. Data can be observation-sparse or feature-sparse (or both). My feeling is that the feature-sparse case is the more important use case, and the more tricky to deal with. I originally thought one might handle this within the current tabular data format, but I think this requires extra infrastructure that does not yet exist in Julia. Given limited resources, the pragmatic thing to do would be to allow models that handle feature-sparse data to ingest the data as an abstract matrix (in addition to a dense table). When this matrix is sparse, the performance benefits kick in.
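To give a rough sense of the scale involved in the feature-sparse case (the numbers below are illustrative only):

```julia
using SparseArrays

# 1_000 observations × 100_000 features, with ~0.1% of entries stored:
X = sprand(1_000, 100_000, 0.001)

Base.summarysize(X)    # a few MB in sparse form
# Matrix(X) would need 1_000 * 100_000 * 8 bytes ≈ 800 MB if densified
```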
Having "given up" on the uniform requirement of tabular data, we might just as well allow arbitrary models that currently take tabular input to ingest data in the form of matrices as well. It would be quite natural to roll this out at the same time as implementing the new optional data front-end for models. If we do allow matrices, an important design decision regards the output of models (say of transformers, or the form of the target in multi-target supervised models). I guess if we train on matrices, then matrices should be the output and similarly for tables.
Thoughts anyone?
@OkonSamuel
@tlienart