You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Some transformations (e.g. MeanStdScaling) need data to determine some of the parameters. At present, this is done by passing the data to the constructor.
As a result, it isn't generically possible to exchange the transformation from MeanStdScaling to e.g. LogTransform in some data preprocessing step, since they have a different API. An example:
julia>functionpipeline(transform_type, data)
tsf =transform_type(data)
return FeatureTransforms.apply(data, tsf)
end
works with
julia>pipeline(MeanStdScaling, rand(10))
but not with
julia>pipeline(LogTransform, rand(10))
ERROR: MethodError: no method matching LogTransform(::Vector{Float64})
Closest candidates are:LogTransform() at ~/JuliaEnvs/FeatureTransforms.jl/src/log.jl:8
Stacktrace:
[1] pipeline(transform_type::Type{LogTransform}, data::Vector{Float64})
@ Main ./REPL[38]:2
[2] top-level scope
@ REPL[40]:1
In this simple case we can avoid the issue by instantiating the transforms outside of the pipeline. However, that means that we have to have access to data when constructing the pipeline, which feels wrong and unmodular. Even worse, if MeanStdScaling is somewhere in the middle of long list of operations, we have to apply the preceding transformations first to instantiate it all while we construct the pipeline.
A possible alternative
We could implement a StatsBase.fit!(transform, data) method which handles the fitting part. The fallback method would do nothing, but for MeanStdScaling the fitting work currently done in the constructor would be moved to the fit! method.
I would also suggest adding a fitted boolean to MeanStdScaling so that we can error if the transform method is called before fit.
Since this is breaking we can also take the opportunity to rename MeanStdScaling to whatever comes out of #87.
Current state
Some transformations (e.g.
MeanStdScaling
) need data to determine some of the parameters. At present, this is done by passing the data to the constructor.As a result, it isn't generically possible to exchange the transformation from
MeanStdScaling
to e.g.LogTransform
in some data preprocessing step, since they have a different API. An example:works with
but not with
In this simple case we can avoid the issue by instantiating the transforms outside of the pipeline. However, that means that we have to have access to data when constructing the pipeline, which feels wrong and unmodular. Even worse, if
MeanStdScaling
is somewhere in the middle of long list of operations, we have to apply the preceding transformations first to instantiate it all while we construct the pipeline.A possible alternative
We could implement a
StatsBase.fit!(transform, data)
method which handles the fitting part. The fallback method would do nothing, but forMeanStdScaling
the fitting work currently done in the constructor would be moved to thefit!
method.I would also suggest adding a
fitted
boolean toMeanStdScaling
so that we can error if thetransform
method is called beforefit
.Since this is breaking we can also take the opportunity to rename
MeanStdScaling
to whatever comes out of #87.Related to #59
The text was updated successfully, but these errors were encountered: