
Refactor scaling to only compute one mean and std #42

Merged
glennmoy merged 4 commits into main from gm/refactor_scaling on Mar 9, 2021

Conversation

glennmoy
Member

@glennmoy glennmoy commented Mar 5, 2021

Proposed Solution

When constructing MeanStdScaling it should compute a single mean and std pair for all the data it is given, rather than computing them on a slice-by-slice basis.

This PR:

  • Simplifies the code in scaling.jl
  • Simplifies the constructor and formats it consistently no matter what data it was given
  • Simplifies the parent apply functions (which no longer need to pass name downstream)
  • De-couples the MeanStdScaling from the data, allowing it to operate as an independent Transform like any other.
  • Makes its application more consistent with other transforms, i.e. "apply the transform to all the data it's given" rather than assuming it needs to be applied slice-by-slice (which no other transform does).

Cons:

  • You now need to construct separate MeanStdScaling transforms for each column if you wish to use more than one. It's possible we can make a helper util for this in the future.
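As a sketch of what such a helper might look like (hypothetical, not part of this PR; it uses plain `mean`/`std` from Statistics and a stand-in struct, not the package's actual constructor):

```julia
using Statistics

# Hypothetical stand-in for the package's transform: one mean/std pair.
struct ColumnScaling
    μ::Float64
    σ::Float64
end

# Possible helper: construct one scaling per column of a matrix.
column_scalings(A::AbstractMatrix) = [ColumnScaling(mean(c), std(c)) for c in eachcol(A)]

A = [1 2; 5 9; 8 10]
scalings = column_scalings(A)  # two scalings, one per column
```

Each element could then be applied to its corresponding column independently, keeping the "one transform, one mean/std pair" invariant.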

Problem Description

There are a few peculiarities with the way the MeanStdScaling transform is implemented that have given rise to some code smell.

  1. We parse a name kwarg in the parent apply method that only applies to this transform, which complicates the code in apply and gives a false impression that name is universal.
  2. Keeping track of the names in scaling is also complicated, and the way the constructor is handled differs depending on whether it was computed using an array or a table, and whether it used all the data or not.
  3. MeanStdScaling is not an independent transform; it is coupled directly to the data that created it (one cannot rename a column afterwards, for example).
  4. How it's used is also inconsistent with our other transforms (see example below).
  5. Finally, using mapslices may not be the best idea after all (Use a consistent convention for dims #18 (comment)), and AxisArrays doesn't natively support it (nor does anyone seem in a rush to add it: Implement mapslices for AxisArrays JuliaArrays/AxisArrays.jl#195), but refactoring that part was impossible without simplifying this first.

On point 3, let's consider how a simpler transform, e.g. Power, operates:

julia> A = [1 2; 5 9; 8 10];

julia> p = Power(3)
Power(3)

The default behaviour is that the transform is applied to all of the data.
More specifically, the parameter p.exponent is applied uniformly to all of the data it is given.
We don't compute/apply a separate exponent to each slice (hint hint).
Although we can restrict it to certain slices of the data if we want.

julia> FeatureTransforms.apply(A, p)
3×2 Array{Int64,2}:
   1     8
 125   729
 512  1000

julia> FeatureTransforms.apply(A, p; dims=1, inds=1)
1×2 Array{Int64,2}:
 1  8

Compare this with MeanStdScaling, which is special because the constructor needs to first parse the data to create the transform, so its parameters depend on what we give it.
But the same principle of "apply the same parameters to all the data" should still hold.

Looking at how we pass in the data, the first problem is that it can change how the constructor is represented, which makes it difficult to wrangle in the code.
Moreover, it's not obvious what these parameters mean.

julia> scaling_all = MeanStdScaling(A)
MeanStdScaling((all = 5.833333333333333,), (all = 3.7638632635454052,))

julia> scaling_selected = MeanStdScaling(A; dims=1)
MeanStdScaling((1 = 4.666666666666667, 2 = 7.0), (1 = 3.5118845842842465, 2 = 4.358898943540674))

Second, it makes the assumption that we want separate scaling parameters for each slice.
(This feature probably comes from my earlier suggestion but that comment should have been clearer to prioritise simplicity.)

What this means is that when applying this to A, different elements are transformed by different parameters, which violates the principle above.

julia> FeatureTransforms.apply(A, scaling_all)
3×2 Array{Float64,2}:
 -1.28414   -1.01846
 -0.221404   0.841334
  0.57565    1.10702

julia> FeatureTransforms.apply(A, scaling_selected; dims=1)
3×2 Array{Float64,2}:
 -1.04407    -1.14708
  0.0949158   0.458831
  0.949158    0.688247
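For instance, the top-left entry of the `scaling_selected` output can be reproduced by hand from the first column's own mean and std (a sketch using plain Statistics, not the package API):

```julia
using Statistics

A = [1 2; 5 9; 8 10]
μ1, σ1 = mean(A[:, 1]), std(A[:, 1])  # per-column parameters: ≈ (4.667, 3.512)
first_entry = (A[1, 1] - μ1) / σ1     # ≈ -1.04407, the top-left value above
```

So each column is scaled by its own parameters, rather than by one shared pair.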

Finally, the transform is overly coupled to the data it was computed on.
While it should have a memory of this data (as it is stateful), that memory lies in the scaling parameters themselves; after computing these, it should operate as an independent transform like any other.
But right now, applying it to certain dims, inds and cols relies too heavily on the original data, making it too brittle and restrictive.

Compare with this branch.
The scaling is computed across all the data it is given; dims doesn't matter unless inds is provided.
And it returns a consistent representation no matter what format the data is in.

julia> A = [1 2; 5 9; 8 10];

julia> scaling = MeanStdScaling(A)
MeanStdScaling(5.833333333333333, 3.7638632635454052)

julia> scaling = MeanStdScaling(A; dims=1)
MeanStdScaling(5.833333333333333, 3.7638632635454052)

julia> scaling = MeanStdScaling(A; dims=1, inds=1)
MeanStdScaling(1.5, 0.7071067811865476)

Afterwards, the scaling is an independent transform like any other and can be applied freely.
This takes into account that the data may have changed (via other transforms), so the user should be able to specify what to apply it to.

julia> FeatureTransforms.apply(A, scaling)
3×2 Array{Float64,2}:
 -0.707107   0.707107
  4.94975   10.6066
  9.19239   12.0208

julia> FeatureTransforms.apply(A, scaling; dims=1, inds=[2])
1×2 Array{Float64,2}:
 4.94975  10.6066

I'll note that I think this also closes #39, in that transforming the second column is now consistent.
This is thanks to scaling now acting like an independent transform.

julia> M = reshape(collect(1:6), 3, 2);

# normalise the second column
julia> scaling = MeanStdScaling(M; dims=2, inds=2)
MeanStdScaling(5.0, 1.0)

julia> FeatureTransforms.apply(M, scaling; dims=2, inds=[2])
3×1 Array{Float64,2}:
 -1.0
  0.0
  1.0
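As a sanity check on the numbers above (plain Statistics, not the package API): the second column of `M` is `[4, 5, 6]`, so the parameters and result follow directly:

```julia
using Statistics

M = reshape(collect(1:6), 3, 2)
col = M[:, 2]                 # second column: [4, 5, 6]
μ, σ = mean(col), std(col)    # (5.0, 1.0), matching MeanStdScaling(5.0, 1.0)
scaled = (col .- μ) ./ σ      # [-1.0, 0.0, 1.0]
```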

@glennmoy glennmoy changed the title from "Refactor scaling" to "WIP: Refactor scaling" on Mar 5, 2021
@codecov

codecov bot commented Mar 5, 2021

Codecov Report

Merging #42 (a9bae35) into main (081a9af) will decrease coverage by 0.46%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main      #42      +/-   ##
==========================================
- Coverage   95.31%   94.84%   -0.47%     
==========================================
  Files           9        9              
  Lines         128       97      -31     
==========================================
- Hits          122       92      -30     
+ Misses          6        5       -1     
Impacted Files Coverage Δ
src/scaling.jl 100.00% <100.00%> (+3.33%) ⬆️
src/transformers.jl 100.00% <100.00%> (ø)
src/periodic.jl 100.00% <0.00%> (ø)
src/one_hot_encoding.jl 100.00% <0.00%> (ø)
src/linear_combination.jl 100.00% <0.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update fbc7b8f...3edd9d8. Read the comment docs.

@bencottier bencottier (Contributor) left a comment

I like this. It's cleaner, simpler, and more consistent with the Transform ethos.

However I have a couple of concerns:

  • It's extra effort to create a scaling object for each slice or column. I think normalising multiple columns is a typical use case. Maybe we could add a convenience function for this?
  • I think this only closes Apply scaling to certain slices of an array #39 with the change from mapslices to selectdim, not the refactor of scaling. The issue is about how inds plays with the dims convention.

By the way, the description is hard to follow when I don't already know the conclusion. I recommend the pyramid principle :) Summarise what this does and why, at the top.

@glennmoy glennmoy mentioned this pull request Mar 5, 2021
@glennmoy (Member Author)

glennmoy commented Mar 5, 2021

It's extra effort to create a scaling object for each slice or column. I think normalising multiple columns is a typical use case.

Hmm...arguably each column in a collection is its own feature, which would suggest creating separate transforms rather than doing them all at once? And I don't think we scale multiple columns in our own code base, but I'm happy to be corrected if you want to post some examples in slack.

I think this only closes #39 with the change from mapslices to selectdim, not the refactor of scaling. The issue is about how inds plays with the dims convention.

Are you sure? Even without the selectdim changes I can scale a column and select the same column?

By the way, the description is hard to follow when I don't already know the conclusion. I recommend the pyramid principle :) Summarise what this does and why, at the top.

👍

@glennmoy glennmoy force-pushed the gm/refactor_scaling branch from 9d9e29a to 4853529 on March 5, 2021 16:47
@glennmoy glennmoy changed the title from "WIP: Refactor scaling" to "Refactor scaling to only compute one mean and std" on Mar 5, 2021
@bencottier (Contributor)

bencottier commented Mar 5, 2021

Hmm...arguably each column in a collection is its own feature, which would suggest creating separate transforms rather than doing them all at once? And I don't think we scale multiple columns in our own code base, but I'm happy to be corrected if you want to post some examples in slack.

I have at least one example I'll post.

That's a fair argument, but we allow applying to multiple columns at once for every other element-wise transform. Though I understand the difference: those transforms do the same operation everywhere.

@glennmoy glennmoy force-pushed the gm/refactor_scaling branch from fad3a5d to c4a86b2 on March 5, 2021 21:42
@glennmoy glennmoy force-pushed the gm/refactor_scaling branch from c4a86b2 to 290a1f1 on March 8, 2021 21:16
docs/src/examples.md:
This will apply the `Transform` to slices of the array along this dimension, which can be selected by the `inds` keyword.
So when `dims` and `inds` are used together, the `inds` change from being the global indices of the array to the relative indices of each slice.

For example, given a `Matrix`, `dims=1` slices the data column-wise and `inds=[2, 3]` selects the 2nd and 3rd rows.
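A minimal Base-only illustration of that convention (a sketch assuming `selectdim` semantics, not the package's `apply` itself):

```julia
A = reshape(collect(1:12), 4, 3)   # a 4×3 matrix
# With dims=1, inds=[2, 3] select the 2nd and 3rd rows
# (the relative indices of each slice along dimension 1).
rows = selectdim(A, 1, [2, 3])
```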
@glennmoy glennmoy (Member Author) commented Mar 9, 2021

@bencottier just double-checking that this makes sense to you? and the changes to the example below?

@bencottier (Contributor)

Yeah looks good

@glennmoy glennmoy merged commit f6bb5e4 into main Mar 9, 2021
@glennmoy glennmoy deleted the gm/refactor_scaling branch March 9, 2021 20:18

Successfully merging this pull request may close these issues.

Apply scaling to certain slices of an array