
Refactor scaling to only compute one mean and std #42

Merged
glennmoy merged 4 commits into main from gm/refactor_scaling on Mar 9, 2021

Conversation

glennmoy
Member

@glennmoy glennmoy commented Mar 5, 2021

Proposed Solution

When constructing MeanStdScaling it should compute a single mean and std pair for all the data it is given, rather than computing them on a slice-by-slice basis.

This PR:

  • Simplifies the code in scaling.jl
  • Simplifies the constructor and formats it consistently no matter what data it was given
  • Simplifies the parent apply functions (which no longer need to pass name downstream)
  • De-couples the MeanStdScaling from the data, allowing it to operate as an independent Transform like any other.
  • Makes its application more consistent with other transforms, i.e. "apply the transform to all the data it's given" rather than assuming it needs to be applied slice-by-slice (which no other transform does).

Cons:

  • You now need to construct separate MeanStdScaling transforms for each column if you wish to use more than one. It's possible we can make a helper util for this in the future.
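As a sketch of what such a helper might look like (hypothetical, not part of this PR; it uses plain `mean`/`std` from Statistics and a stand-in struct, not the package's actual constructor):

```julia
using Statistics

# Hypothetical stand-in for the package's transform: one mean/std pair.
struct ColumnScaling
    μ::Float64
    σ::Float64
end

# Possible helper: construct one scaling per column of a matrix.
column_scalings(A::AbstractMatrix) = [ColumnScaling(mean(c), std(c)) for c in eachcol(A)]

A = [1 2; 5 9; 8 10]
scalings = column_scalings(A)  # two scalings, one per column
```

Each element could then be applied to its corresponding column independently, keeping the "one transform, one mean/std pair" invariant.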

Problem Description

There are a few peculiarities with the way the MeanStdScaling transform is implemented that have given rise to some code smell.

  1. We parse a name kwarg in the parent apply method that only applies to this transform, which complicates the code in apply and gives a false impression that name is universal.
  2. Keeping track of the names in scaling is also complicated, and the way the constructor is handled differs depending on whether it was computed using an array or a table, and whether it used all the data or not.
  3. MeanStdScaling is not an independent transform; it is coupled directly to the data that created it (one cannot rename a column afterwards, for example).
  4. How it's used is also inconsistent with our other transforms (see example below).
  5. Finally, using mapslices may not be the best idea after all (Use a consistent convention for dims #18 (comment)), and AxisArrays doesn't natively support it (nor does anyone seem in a rush to add it: Implement mapslices for AxisArrays JuliaArrays/AxisArrays.jl#195), but refactoring that part was impossible without simplifying this first.

On point 3, let's consider how a simpler transform, e.g. Power, operates:

julia> A = [1 2; 5 9; 8 10];

julia> p = Power(3)
Power(3)

The default behaviour is that the transform is applied to all of the data.
More specifically, the parameter p.exponent is applied uniformly to all of the data it is given.
We don't compute/apply a separate exponent to each slice (hint hint).
Although we can restrict it to certain slices of the data if we want.

julia> FeatureTransforms.apply(A, p)
3×2 Array{Int64,2}:
   1     8
 125   729
 512  1000

julia> FeatureTransforms.apply(A, p; dims=1, inds=1)
1×2 Array{Int64,2}:
 1  8

Compare this with MeanStdScaling, which is special because the constructor needs to first parse the data to create the transform, so its parameters depend on what we give it.
But the same principle of "apply the same parameters to all the data" should still hold.

Looking at how we pass in the data, the first problem is that it can change how the constructor is represented, which makes it difficult to wrangle in the code.
Moreover, it's not obvious what these parameters mean.

julia> scaling_all = MeanStdScaling(A)
MeanStdScaling((all = 5.833333333333333,), (all = 3.7638632635454052,))

julia> scaling_selected = MeanStdScaling(A; dims=1)
MeanStdScaling((1 = 4.666666666666667, 2 = 7.0), (1 = 3.5118845842842465, 2 = 4.358898943540674))

Second, it makes the assumption that we want separate scaling parameters for each slice.
(This feature probably comes from my earlier suggestion but that comment should have been clearer to prioritise simplicity.)

What this means is that when applying this to A, different elements are transformed by different parameters, which violates the principle above.

julia> FeatureTransforms.apply(A, scaling_all)
3×2 Array{Float64,2}:
 -1.28414   -1.01846
 -0.221404   0.841334
  0.57565    1.10702

julia> FeatureTransforms.apply(A, scaling_selected; dims=1)
3×2 Array{Float64,2}:
 -1.04407    -1.14708
  0.0949158   0.458831
  0.949158    0.688247
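For instance, the top-left entry of the `scaling_selected` output can be reproduced by hand from the first column's own mean and std (a sketch using plain Statistics, not the package API):

```julia
using Statistics

A = [1 2; 5 9; 8 10]
μ1, σ1 = mean(A[:, 1]), std(A[:, 1])  # per-column parameters: ≈ (4.667, 3.512)
first_entry = (A[1, 1] - μ1) / σ1     # ≈ -1.04407, the top-left value above
```

So each column is scaled by its own parameters, rather than by one shared pair.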

Finally, the transform is overly coupled to the data it was computed on.
While it should have a memory of this data (as it is stateful), that memory lies in the scaling parameters themselves; after computing these, it should operate as an independent transform like any other.
But right now, applying it to certain dims, inds and cols relies too heavily on the original data, making it too brittle and restrictive.

Compare with this branch.
The scaling is computed across all the data it is given; dims doesn't matter unless inds is provided.
And it returns a consistent representation no matter what format the data is in.

julia> A = [1 2; 5 9; 8 10];

julia> scaling = MeanStdScaling(A)
MeanStdScaling(5.833333333333333, 3.7638632635454052)

julia> scaling = MeanStdScaling(A; dims=1)
MeanStdScaling(5.833333333333333, 3.7638632635454052)

julia> scaling = MeanStdScaling(A; dims=1, inds=1)
MeanStdScaling(1.5, 0.7071067811865476)

Afterwards, the scaling is an independent transform like any other and can be applied freely.
This takes into account that the data may have changed (via other transforms), so the user should be able to specify what to apply it to.

julia> FeatureTransforms.apply(A, scaling)
3×2 Array{Float64,2}:
 -0.707107   0.707107
  4.94975   10.6066
  9.19239   12.0208

julia> FeatureTransforms.apply(A, scaling; dims=1, inds=[2])
1×2 Array{Float64,2}:
 4.94975  10.6066

I'll note that I think this also closes #39, in that transforming the second column is now consistent.
This is thanks to scaling now acting like an independent transform.

julia> M = reshape(collect(1:6), 3, 2);

# normalise the second column
julia> scaling = MeanStdScaling(M; dims=2, inds=2)
MeanStdScaling(5.0, 1.0)

julia> FeatureTransforms.apply(M, scaling; dims=2, inds=[2])
3×1 Array{Float64,2}:
 -1.0
  0.0
  1.0
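As a sanity check on the numbers above (plain Statistics, not the package API): the second column of `M` is `[4, 5, 6]`, so the parameters and result follow directly:

```julia
using Statistics

M = reshape(collect(1:6), 3, 2)
col = M[:, 2]                 # second column: [4, 5, 6]
μ, σ = mean(col), std(col)    # (5.0, 1.0), matching MeanStdScaling(5.0, 1.0)
scaled = (col .- μ) ./ σ      # [-1.0, 0.0, 1.0]
```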

@glennmoy glennmoy changed the title from "Refactor scaling" to "WIP: Refactor scaling" on Mar 5, 2021
@codecov

codecov bot commented Mar 5, 2021

Codecov Report

Merging #42 (a9bae35) into main (081a9af) will decrease coverage by 0.46%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main      #42      +/-   ##
==========================================
- Coverage   95.31%   94.84%   -0.47%     
==========================================
  Files           9        9              
  Lines         128       97      -31     
==========================================
- Hits          122       92      -30     
+ Misses          6        5       -1     
Impacted Files Coverage Δ
src/scaling.jl 100.00% <100.00%> (+3.33%) ⬆️
src/transformers.jl 100.00% <100.00%> (ø)
src/periodic.jl 100.00% <0.00%> (ø)
src/one_hot_encoding.jl 100.00% <0.00%> (ø)
src/linear_combination.jl 100.00% <0.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update fbc7b8f...3edd9d8. Read the comment docs.

@bencottier bencottier (Contributor) left a comment

I like this. It's cleaner, simpler, and more consistent with the Transform ethos.

However I have a couple of concerns:

  • It's extra effort to create a scaling object for each slice or column. I think normalising multiple columns is a typical use case. Maybe we could add a convenience function for this?
  • I think this only closes Apply scaling to certain slices of an array #39 with the change from mapslices to selectdim, not the refactor of scaling. The issue is about how inds plays with the dims convention.

By the way, the description is hard to follow when I don't already know the conclusion. I recommend the pyramid principle :) Summarise what this does and why, at the top.

@glennmoy glennmoy mentioned this pull request Mar 5, 2021
@glennmoy (Member Author)

glennmoy commented Mar 5, 2021

It's extra effort to create a scaling object for each slice or column. I think normalising multiple columns is a typical use case.

Hmm...arguably each column in a collection is its own feature, which would suggest creating separate transforms rather than doing them all at once? And I don't think we scale multiple columns in our own code base, but I'm happy to be corrected if you want to post some examples in slack.

I think this only closes #39 with the change from mapslices to selectdim, not the refactor of scaling. The issue is about how inds plays with the dims convention.

Are you sure? Even without the selectdim changes I can scale a column and select the same column?

By the way, the description is hard to follow when I don't already know the conclusion. I recommend the pyramid principle :) Summarise what this does and why, at the top.

👍

@glennmoy glennmoy force-pushed the gm/refactor_scaling branch from 9d9e29a to 4853529 on March 5, 2021 16:47
@glennmoy glennmoy changed the title from "WIP: Refactor scaling" to "Refactor scaling to only compute one mean and std" on Mar 5, 2021
@bencottier (Contributor)

bencottier commented Mar 5, 2021

Hmm...arguably each column in a collection is its own feature, which would suggest creating separate transforms rather than doing them all at once? And I don't think we scale multiple columns in our own code base, but I'm happy to be corrected if you want to post some examples in slack.

I have at least one example I'll post.

That's a fair argument, but we allow applying to multiple columns at once for every other element-wise transform. Though I understand the difference: those transforms do the same operation everywhere.

@glennmoy glennmoy force-pushed the gm/refactor_scaling branch from fad3a5d to c4a86b2 on March 5, 2021 21:42
@glennmoy glennmoy force-pushed the gm/refactor_scaling branch from c4a86b2 to 290a1f1 on March 8, 2021 21:16
docs/src/examples.md:
This will apply the `Transform` to slices of the array along this dimension, which can be selected by the `inds` keyword.
So when `dims` and `inds` are used together, the `inds` change from being the global indices of the array to the relative indices of each slice.

For example, given a `Matrix`, `dims=1` slices the data column-wise and `inds=[2, 3]` selects the 2nd and 3rd rows.
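A minimal Base-only illustration of that convention (a sketch assuming `selectdim` semantics, not the package's `apply` itself):

```julia
A = reshape(collect(1:12), 4, 3)   # a 4×3 matrix
# With dims=1, inds=[2, 3] select the 2nd and 3rd rows
# (the relative indices of each slice along dimension 1).
rows = selectdim(A, 1, [2, 3])
```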
@glennmoy glennmoy (Member Author) commented Mar 9, 2021

@bencottier just double-checking that this makes sense to you? and the changes to the example below?

@bencottier (Contributor)

Yeah looks good

@glennmoy glennmoy merged commit f6bb5e4 into main Mar 9, 2021
@glennmoy glennmoy deleted the gm/refactor_scaling branch March 9, 2021 20:18

Successfully merging this pull request may close these issues.

Apply scaling to certain slices of an array