Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Separate transform constructors and fit method #106

Closed
mzgubic opened this issue Feb 18, 2022 · 0 comments · Fixed by #107
Closed

Separate transform constructors and fit method #106

mzgubic opened this issue Feb 18, 2022 · 0 comments · Fixed by #107

Comments

@mzgubic
Copy link
Contributor

mzgubic commented Feb 18, 2022

Current state

Some transformations (e.g. MeanStdScaling) need data to determine some of the parameters. At present, this is done by passing the data to the constructor.

As a result, it isn't generically possible to exchange the transformation from MeanStdScaling to e.g. LogTransform in some data preprocessing step, since they have a different API. An example:

julia> function pipeline(transform_type, data)
           tsf = transform_type(data)
           return FeatureTransforms.apply(data, tsf)
       end

works with

julia> pipeline(MeanStdScaling, rand(10))

but not with

julia> pipeline(LogTransform, rand(10))
ERROR: MethodError: no method matching LogTransform(::Vector{Float64})
Closest candidates are:
  LogTransform() at ~/JuliaEnvs/FeatureTransforms.jl/src/log.jl:8
Stacktrace:
 [1] pipeline(transform_type::Type{LogTransform}, data::Vector{Float64})
   @ Main ./REPL[38]:2
 [2] top-level scope
   @ REPL[40]:1

In this simple case we can avoid the issue by instantiating the transforms outside of the pipeline. However, that means that we have to have access to data when constructing the pipeline, which feels wrong and unmodular. Even worse, if MeanStdScaling is somewhere in the middle of long list of operations, we have to apply the preceding transformations first to instantiate it all while we construct the pipeline.

A possible alternative

We could implement a StatsBase.fit!(transform, data) method which handles the fitting part. The fallback method would do nothing, but for MeanStdScaling the fitting work currently done in the constructor would be moved to the fit! method.

I would also suggest adding a fitted boolean to MeanStdScaling so that we can error if the transform method is called before fit.

Since this is breaking we can also take the opportunity to rename MeanStdScaling to whatever comes out of #87.

Related to #59

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
1 participant