Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Plan to add pipeline? #99

Open
xiaodaigh opened this issue Jul 13, 2021 · 4 comments
Open

Plan to add pipeline? #99

xiaodaigh opened this issue Jul 13, 2021 · 4 comments

Comments

@xiaodaigh
Copy link

The below feels very boiler-platy

function FeatureTransforms.transform(data)
    # Define the Transforms we will apply
    p = Power(0.123)
    lc = LinearCombination([0.1, 0.9])
    ohe = OneHotEncoding(["type1", "type2", "type3"])

    features = deepcopy(data)
    FeatureTransforms.apply!(features, p; cols=[:a], header=[:a])
    features = FeatureTransforms.apply_append(features, lc; cols=[:a, :b], header=[:ab])
    features = FeatureTransforms.apply_append(features, ohe; cols=:types, header=[:type1, :type2, :type3])
end

What about something like

pipeline1 = @pipeline begin
   @transform! Power(0.123)
   @append LinearCombination([0.1, 0.9]) cols = [:a, :b] header = [:ab]
   @append OneHotEncoding(["type1", "type2", "type3"]) cols = :types header = [:type1, :type2, :type3]
end
@juliohm
Copy link

juliohm commented Oct 25, 2021

I second this suggestion, but I think the API in this package needs some refactoring first. Namely, I think all transforms should be detached from the data (table or array) so that we can feed in a single table as table |> f1 |> f2 |> f3 and the intermediate results are used correctly to fit transforms like MeanStdScaling. Currently the MeanStdScaling transform cannot be inserted in the middle of a pipeline because a table is fixed at construction time.

Additionally, it would be really nice to have inverse transforms as first-class citzens instead of a keyword argument. This is because we would like to automatically construct inverse pipelines like f3^-1 |> f^2^-1 |> f1^-1 from a forward pipeline.

I am planning to contribute actively to this package and will submit a few transforms in the following days, but it would be nice to know from people at Invenia if it is in the plans to support pipelines? I am happy to contribute with more and more features if the plans are aligned.

@glennmoy
Copy link
Member

glennmoy commented Oct 25, 2021

I think the API in this package needs some refactoring first. Namely, I think all transforms should be detached from the data (table or array) so that we can feed in a single table as table |> f1 |> f2 |> f3 and the intermediate results are used correctly to fit transforms like MeanStdScaling.

I definitely agree with the need to making the code less boilerplate and more flexible, and I'm open to ideas to doing that.

Currently the MeanStdScaling transform cannot be inserted in the middle of a pipeline because a table is fixed at construction time.

Can you elaborate a little? Is this related to #59 in that you have to have already determined the scaling parameters before the transform can be applied?

Additionally, it would be really nice to have inverse transforms as first-class citzens instead of a keyword argument.

This is also a good suggestion. I'm not a fan of boolean kwargs in the first place but time constraints meant we had to go with the easiest option to start. I'm not sure what the best approach would be but I like the syntactic sugar of f^-1 or similar (not sure if that's what you were directly suggesting though).

I am planning to contribute actively to this package and will submit a few transforms in the following days, but it would be nice to know from people at Invenia if it is in the plans to support pipelines? I am happy to contribute with more and more features if the plans are aligned.

That would be great! We have no immediate plans to extend / refactor FeatureTransforms as we're currently using it in a relatively steady-state because it's largely abstracted by another (internal) API. But I would wholly support & help any effort to making this package better. I'll keep my eye out for the MRs :)

@juliohm
Copy link

juliohm commented Oct 25, 2021

Can you elaborate a little? Is this related to #59 in that you have to have already determined the scaling parameters before the transform can be applied?

I didn't have time to read the code carefully yet as I am just starting with the implementation of a few transforms in some of the packages we are developing, but I understood that the MeanStdScaling object is currently attached to a specific data set at construction time. I think we could just make it lazy and only compute mu and sigma at application time somehow or do something similar to what was suggested in the issue storing mu and sigma without any particular reference to a data set.

@juliohm
Copy link

juliohm commented Oct 28, 2021

For anyone interested, here is our attempt to provide transforms for general Tables.jl. We support revertible pipelines with very clean syntax. All the issues I raised above are solved in TableTransforms.jl: https://github.com/JuliaML/TableTransforms.jl

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants