Plan to add pipeline? #99

xiaodaigh · 2021-07-13T04:53:49Z

The below feels very boiler-platy

function FeatureTransforms.transform(data)
    # Define the Transforms we will apply
    p = Power(0.123)
    lc = LinearCombination([0.1, 0.9])
    ohe = OneHotEncoding(["type1", "type2", "type3"])

    features = deepcopy(data)
    FeatureTransforms.apply!(features, p; cols=[:a], header=[:a])
    features = FeatureTransforms.apply_append(features, lc; cols=[:a, :b], header=[:ab])
    features = FeatureTransforms.apply_append(features, ohe; cols=:types, header=[:type1, :type2, :type3])
end

What about something like

pipeline1 = @pipeline begin
   @transform! Power(0.123)
   @append LinearCombination([0.1, 0.9]) cols = [:a, :b] header = [:ab]
   @append OneHotEncoding(["type1", "type2", "type3"]) cols = :types header = [:type1, :type2, :type3]
end

The text was updated successfully, but these errors were encountered:

juliohm · 2021-10-25T10:50:06Z

I second this suggestion, but I think the API in this package needs some refactoring first. Namely, I think all transforms should be detached from the data (table or array) so that we can feed in a single table as table |> f1 |> f2 |> f3 and the intermediate results are used correctly to fit transforms like MeanStdScaling. Currently the MeanStdScaling transform cannot be inserted in the middle of a pipeline because a table is fixed at construction time.

Additionally, it would be really nice to have inverse transforms as first-class citzens instead of a keyword argument. This is because we would like to automatically construct inverse pipelines like f3^-1 |> f^2^-1 |> f1^-1 from a forward pipeline.

I am planning to contribute actively to this package and will submit a few transforms in the following days, but it would be nice to know from people at Invenia if it is in the plans to support pipelines? I am happy to contribute with more and more features if the plans are aligned.

glennmoy · 2021-10-25T11:56:03Z

I think the API in this package needs some refactoring first. Namely, I think all transforms should be detached from the data (table or array) so that we can feed in a single table as table |> f1 |> f2 |> f3 and the intermediate results are used correctly to fit transforms like MeanStdScaling.

I definitely agree with the need to making the code less boilerplate and more flexible, and I'm open to ideas to doing that.

Currently the MeanStdScaling transform cannot be inserted in the middle of a pipeline because a table is fixed at construction time.

Can you elaborate a little? Is this related to #59 in that you have to have already determined the scaling parameters before the transform can be applied?

Additionally, it would be really nice to have inverse transforms as first-class citzens instead of a keyword argument.

This is also a good suggestion. I'm not a fan of boolean kwargs in the first place but time constraints meant we had to go with the easiest option to start. I'm not sure what the best approach would be but I like the syntactic sugar of f^-1 or similar (not sure if that's what you were directly suggesting though).

I am planning to contribute actively to this package and will submit a few transforms in the following days, but it would be nice to know from people at Invenia if it is in the plans to support pipelines? I am happy to contribute with more and more features if the plans are aligned.

That would be great! We have no immediate plans to extend / refactor FeatureTransforms as we're currently using it in a relatively steady-state because it's largely abstracted by another (internal) API. But I would wholly support & help any effort to making this package better. I'll keep my eye out for the MRs :)

juliohm · 2021-10-25T19:29:53Z

Can you elaborate a little? Is this related to #59 in that you have to have already determined the scaling parameters before the transform can be applied?

I didn't have time to read the code carefully yet as I am just starting with the implementation of a few transforms in some of the packages we are developing, but I understood that the MeanStdScaling object is currently attached to a specific data set at construction time. I think we could just make it lazy and only compute mu and sigma at application time somehow or do something similar to what was suggested in the issue storing mu and sigma without any particular reference to a data set.

juliohm · 2021-10-28T14:45:55Z

For anyone interested, here is our attempt to provide transforms for general Tables.jl. We support revertible pipelines with very clean syntax. All the issues I raised above are solved in TableTransforms.jl: https://github.com/JuliaML/TableTransforms.jl

mzgubic mentioned this issue Dec 6, 2021

Composite Transforms #105

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Plan to add pipeline? #99

Plan to add pipeline? #99

xiaodaigh commented Jul 13, 2021

juliohm commented Oct 25, 2021 •

edited

Loading

glennmoy commented Oct 25, 2021 •

edited

Loading

juliohm commented Oct 25, 2021

juliohm commented Oct 28, 2021

Plan to add pipeline? #99

Plan to add pipeline? #99

Comments

xiaodaigh commented Jul 13, 2021

juliohm commented Oct 25, 2021 • edited Loading

glennmoy commented Oct 25, 2021 • edited Loading

juliohm commented Oct 25, 2021

juliohm commented Oct 28, 2021

juliohm commented Oct 25, 2021 •

edited

Loading

glennmoy commented Oct 25, 2021 •

edited

Loading