Implement apply method that appends result to data and / or forces promotion #38

Closed
glennmoy opened this issue Mar 2, 2021 · 7 comments · Fixed by #69

@glennmoy
Member

glennmoy commented Mar 2, 2021

Related to #12

We have two methods for applying a transform to data:

  • apply takes the transform but preserves the original data
  • apply! takes the transform and mutates the original data in-place

While apply is universally supported, apply! is only supported for transforms that can directly replace the input.

In one case, this means the output must have the same element type as the input:

julia> using FeatureTransforms

julia> p = Power(1.2);

julia> x = Int64[1, 2, 3];

julia> FeatureTransforms.apply(x, p)
3-element Array{Float64,1}:
 1.0
 2.2973967099940698
 3.7371928188465517

julia> FeatureTransforms.apply!(x, p)
ERROR: InexactError: Int64(2.2973967099940698)
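
The same failure can be reproduced in plain Julia, without FeatureTransforms. This is a minimal sketch that assumes apply! ends up writing the transformed values back into the input array; the variable names are illustrative only.

```julia
x = Int64[1, 2, 3]

# Out-of-place: broadcasting allocates a new array, so the element type can widen.
y = x .^ 1.2

# In-place: every Float64 result has to be converted back to Int64 to be stored in x.
err = try
    x .= x .^ 1.2
    nothing
catch e
    e
end

# A Vector{Int64} cannot change its element type, so "forcing" promotion
# means working on a promoted copy rather than on x itself.
xf = float(x)
xf .= xf .^ 1.2
```

Here err is an InexactError, while the promoted copy xf accepts the result in place.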

In this example, we might just want to force the type promotion.
But another simple case arises when the output is a different shape to the input.
Consider LinearCombination, which typically takes more than 1 input but produces just 1 output:

julia> lc = LinearCombination([1, -1]);

julia> A = [1 2; 5 9];

julia> FeatureTransforms.apply(A, lc);  # works

julia> FeatureTransforms.apply!(A, lc)
ERROR: DimensionMismatch("tried to assign 2 elements to 4 destinations")

Note that this kind of transform is Many-to-One, so we would expect similar problems for One-to-Many and Many-to-Many.
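
The shape problem can also be shown in plain Julia. This sketch assumes apply! assigns the result back with something like A[:] = result; combining the columns with A * [1, -1] is only a stand-in for a Many-to-One output and may not match how LinearCombination picks its dimension.

```julia
A = [1 2; 5 9]

# Stand-in Many-to-One result: one value per row, so 2 elements in total.
result = A * [1, -1]

# Writing 2 elements into the 4 slots of A cannot work.
err = try
    A[:] = result
    nothing
catch e
    e
end
```

The assignment throws a DimensionMismatch and leaves A untouched.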

We therefore might want some apply-like methods that would:

  1. Force mutation of the input (where possible) for One-to-One transforms by converting the underlying types.
  2. Append the result to the input (where possible) for Many-to-One, One-to-Many, or Many-to-Many transforms.

Given the types of problems these solve, it might be desirable to handle them with separate methods.
But note that it's possible to solve both problems using (2), and this would give consistent behaviour.
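
As a rough illustration of how appending could cover both cases, here is a sketch for plain arrays. append_result is a hypothetical helper, not part of FeatureTransforms, and it returns a new array because a Matrix cannot gain a column in place.

```julia
# Hypothetical helper: concatenate the transform result as extra column(s).
append_result(A::AbstractVecOrMat, y::AbstractVecOrMat) = hcat(A, y)

# Many-to-One: the combined column sits next to its inputs.
A = [1 2; 5 9]
B = append_result(A, A * [1, -1])

# One-to-One with promotion: the Float64 result is appended instead of
# being forced back into the Int64 input.
x = Int64[1, 2, 3]
X = append_result(x, x .^ 1.2)
```

One caveat of this array version is that hcat promotes the whole matrix, so the original integer column is also stored as Float64 in X. A column-based table would not have that side effect.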

Here are some ideas for how we might approach the solution:

  1. Special keyword args: apply!(...; force=true), apply!(...; append=true).
  2. Special methods: apply_force!, apply_and_append!, also apply!! (cf https://github.com/JuliaFolds/BangBang.jl)
  3. Define traits based on the transform cardinality with special rules in place for, e.g., apply!!(x, ::OnetoOne; kwargs...), apply!!(x, ::ManytoOne; kwargs...), apply!!(x, ::ManytoMany; kwargs...).
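
Option 3 could look roughly like the sketch below. Everything in it is hypothetical: the trait types reuse the names suggested above, ToyPower and ToyLinearCombination are toy stand-ins for the real transforms, and transform and cardinality are made-up helper names.

```julia
# Hypothetical cardinality traits.
abstract type Cardinality end
struct OnetoOne <: Cardinality end
struct ManytoOne <: Cardinality end

# Toy transforms standing in for Power and LinearCombination.
struct ToyPower
    exponent::Float64
end
struct ToyLinearCombination
    coefficients::Vector{Float64}
end

cardinality(::ToyPower) = OnetoOne()
cardinality(::ToyLinearCombination) = ManytoOne()

transform(x, t::ToyPower) = x .^ t.exponent
transform(A, t::ToyLinearCombination) = A * t.coefficients  # combines columns

# Entry point: dispatch on the transform's cardinality.
apply!!(x, t) = apply!!(x, t, cardinality(t))

# One-to-One: mutate when the result fits, otherwise return a promoted copy.
function apply!!(x::AbstractArray, t, ::OnetoOne)
    y = transform(x, t)
    if eltype(y) <: eltype(x)
        x .= y
        return x
    end
    return y
end

# Many-to-One: the shapes differ, so append the result as a new column.
apply!!(A::AbstractMatrix, t, ::ManytoOne) = hcat(A, transform(A, t))
```

As with BangBang.jl, the caller would have to use the return value (x = apply!!(x, t)), since the input is only mutated when that is possible.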

This also raises the question of how to name the columns of the appended data for a Table.
Should the names be provided by the user, or generated automatically?

@glennmoy glennmoy added the design label Mar 2, 2021
@glennmoy
Member Author

glennmoy commented Mar 2, 2021

Note that one example where all this will fail regardless is in some dimension-reducing transforms like PCA, for which it would be impossible to append to the input data. But this is a hard limitation no matter what we do with the above.

@bencottier
Contributor

bencottier commented Mar 2, 2021

> Note that this kind of transform is Many-to-One, so would expect similar problems for One-to-Many and Many-to-Many, although neither of these kinds of transforms have been implemented yet.

Is OneHotEncoding One-to-Many? Just thinking about the problems for that one, it could be appended to a table in some cases but it's weird to have multiple columns for one transform result.
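
To make the multiple-columns concern concrete, here is a small sketch using a NamedTuple of vectors as the table. onehot and append_onehot are hypothetical helpers, not the OneHotEncoding API, and the column-naming scheme is just one possible choice.

```julia
# Hypothetical one-hot helper: one Bool column per category.
onehot(x, categories) = x .== permutedims(categories)

# Append each encoded column to the table under the name <col>_<category>.
function append_onehot(table::NamedTuple, col::Symbol, categories)
    encoded = onehot(getproperty(table, col), categories)
    names = Tuple(Symbol(col, "_", c) for c in categories)
    columns = Tuple(encoded[:, j] for j in axes(encoded, 2))
    return merge(table, NamedTuple{names}(columns))
end

table = (colour = ["red", "blue", "red"],)
out = append_onehot(table, :colour, ["red", "blue"])
```

A single transform result then occupies as many columns as there are categories, which is the part that feels odd for an append.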

@bencottier
Contributor

bencottier commented Mar 2, 2021

I'll note I had similar thoughts while writing an example for the docs. These are not my all-things-considered thoughts.

  • Wanting apply! to force type promotion. I had a HoD transform, but it became convoluted as an example, when I wanted to apply! MeanStdScaling to the result of HoD (which is Int type). I don't know if that made sense in terms of feature engineering, but there are surely similar cases.
  • Wanting (at least the option) to append the result. I wanted to get a DataFrame out (or mutated) from a DataFrame in.
  • For many-to-one transforms, we could make the results column name an optional argument. The default could join the input column names. It doesn't seem so bad - the behaviour is defined clearly, and if the user isn't happy with that they can use their preferred column names.
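
The column-name idea in the last point could be sketched like this, again with a NamedTuple of vectors as the table. append_column and its name keyword are hypothetical, and joining with an underscore is only one possible default.

```julia
# Hypothetical helper: append `result` as a new column. By default the name
# is built by joining the input column names; the caller can override it.
function append_column(table::NamedTuple, cols, result; name=Symbol(join(cols, "_")))
    return merge(table, NamedTuple{(name,)}((result,)))
end

table = (a = [1, 5], b = [2, 9])

# Default name derived from the inputs.
out = append_column(table, (:a, :b), table.a .- table.b)

# User-supplied name.
named = append_column(table, (:a, :b), table.a .- table.b; name = :diff)
```

With the default, the new column is called a_b. One thing to watch is that merge silently replaces an existing column that already has the chosen name.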

@glennmoy
Member Author

glennmoy commented Mar 2, 2021

> Is OneHotEncoding One-to-Many?

Yeah, I guess it is, given the output type. I forgot to consider it. Will update the text.

@glennmoy
Member Author

It was suggested that instead of an apply method, this could be another (binary) Transform.

@bencottier
Contributor

> It was suggested that instead of an apply method, this could be another (binary) Transform.

Thinking about initialising an append Transform, it seems weird to me, but maybe it helps composability.

@glennmoy
Member Author

> Thinking about initialising an append Transform, it seems weird to me, but maybe it helps composability.

TBH I'm not sure I like it. We'll have to see after trying out a few ideas.
