Refactor scaling to only compute one mean and std #42
Conversation
Codecov Report
```
@@            Coverage Diff             @@
##             main      #42      +/-   ##
==========================================
- Coverage   95.31%   94.84%   -0.47%
==========================================
  Files           9        9
  Lines         128       97      -31
==========================================
- Hits          122       92      -30
+ Misses          6        5       -1
```
Continue to review full report at Codecov.
I like this. It's cleaner, simpler, and more consistent with the `Transform` ethos.
However, I have a couple of concerns:
- It's extra effort to create a scaling object for each slice or column. I think normalising multiple columns is a typical use case. Maybe we could add a convenience function for this?
- I think this only closes "Apply scaling to certain slices of an array" #39 with the change from `mapslices` to `selectdim`, not the refactor of scaling. The issue is about how `inds` plays with the `dims` convention.
By the way, the description is hard to follow when I don't already know the conclusion. I recommend the pyramid principle :) Summarise what this does and why, at the top.
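For what it's worth, the convenience function floated above might look something like the sketch below. Everything here is invented for illustration (`ColumnScaling`, `scalings_per_column` are hypothetical names, not package code):

```julia
using Statistics  # mean, std

# Hypothetical sketch of a convenience helper for scaling multiple columns:
# build one independent scaling transform per column of a matrix.
# Names are invented for illustration; this is not package code.
struct ColumnScaling
    μ::Float64
    σ::Float64
end

ColumnScaling(v::AbstractVector) = ColumnScaling(mean(v), std(v))

# One transform per column, each with its own mean/std pair.
scalings_per_column(A::AbstractMatrix) = [ColumnScaling(c) for c in eachcol(A)]

A = [1.0 10.0; 2.0 20.0; 3.0 30.0]
ts = scalings_per_column(A)  # ts[1] scales column 1, ts[2] scales column 2
```

Each element of `ts` is a self-contained transform, so the per-column case reduces to a one-liner without complicating the core API.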
Hmm... arguably each column in a collection is its own feature, which would suggest creating separate transforms rather than doing them all at once? And I don't think we scale multiple columns in our own code base, but I'm happy to be corrected if you want to post some examples in Slack.
Are you sure? Even without the
Force-pushed from 9d9e29a to 4853529.
I have at least one example I'll post. That's a fair argument, but we allow applying to multiple columns at once for every other element-wise transform. Though I understand the difference: those transforms do the same operation everywhere.
Force-pushed from fad3a5d to c4a86b2.
Force-pushed from c4a86b2 to 290a1f1.
This will apply the `Transform` to slices of the array along this dimension, which can be selected by the `inds` keyword.
So when `dims` and `inds` are used together, the `inds` change from being the global indices of the array to the relative indices of each slice.
For example, given a `Matrix`, `dims=1` slices the data column-wise and `inds=[2, 3]` selects the 2nd and 3rd rows.
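As a plain-Julia illustration of that convention, here is a sketch built on `Base.selectdim` (`apply_to_slices!` is an invented name; this is not the package's actual implementation):

```julia
# Sketch of the `dims`/`inds` convention described above, using Base.selectdim.
# Not package code: `apply_to_slices!` is an invented name for illustration.
function apply_to_slices!(f, A::AbstractArray; dims::Int, inds=axes(A, dims))
    for i in inds
        s = selectdim(A, dims, i)  # view of the i-th slice along `dims`
        s .= f.(s)                 # apply `f` element-wise, in place
    end
    return A
end

A = Float64[1 2; 3 4; 5 6]
apply_to_slices!(x -> -x, A; dims=1, inds=[2, 3])  # negates the 2nd and 3rd rows
```

Note that `inds` here indexes the slices along `dims`, not the global array, which is exactly the relative-index behaviour described above.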
@bencottier just double-checking that this makes sense to you? And the changes to the example below?
Yeah looks good
Proposed Solution

When constructing `MeanStdScaling`, it should compute a single mean and std pair for all the data it is given, and no longer compute it on a slice-by-slice basis.

This PR:

- Simplifies `scaling.jl` and the `apply` functions (which no longer need to pass `name` downstream).
- Decouples `MeanStdScaling` from the data, allowing it to operate as an independent `Transform` like any other.

Cons:

- You need to construct separate `MeanStdScaling` transforms for each column if you wish to use more than one. It's possible we can make a helper util for this in future.

Problem Description
There are a few peculiarities with the way the `MeanStdScaling` transform is implemented that have given rise to some code smell:

1. There is a `name` kwarg in the parent `apply` method that only applies to this transform, which complicates the code in `apply` and gives a false impression that `name` is universal.
2. The constructor for `scaling` is also complicated, and the way it is handled differs depending on whether it was computed using an array or a table, and whether it used all the data or not.
3. `MeanStdScaling` is not an independent transform; it is coupled directly to the data that created it (one cannot rename a column afterwards, for example).
4. Using `mapslices` may not be the best idea after all (Use a consistent convention for `dims` #18 (comment)), and AxisArrays doesn't natively support it (it also doesn't look like anyone's in a rush to put it in: Implement mapslices for AxisArrays JuliaArrays/AxisArrays.jl#195), but refactoring that part was impossible without simplifying this first.

On point 3: let's consider how a simpler transform, e.g. `Power`, operates.
The default behaviour is that the transform is applied to all of the data.
More specifically, the parameter `p.exponent` is applied uniformly to all of the data it is given. We don't compute/apply a separate exponent to each slice (hint hint).
Although we can restrict it to certain slices of the data if we want.
Compare this with `MeanStdScaling`, which is special because the constructor needs to first parse the data to create the transform, so the parameters will depend on what we give it.
But the same principle of "apply the same parameters to all the data" should still hold.
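That principle can be sketched in a few lines of plain Julia (illustrative only; the real `Power` transform lives in the package and its definition may differ):

```julia
# Illustrative sketch of a parameterised element-wise transform: one
# parameter, applied uniformly to all of the data. Not the package code.
struct Power
    exponent::Int
end

# The same exponent is applied to every element, with no per-slice parameters.
apply(p::Power, A::AbstractArray) = A .^ p.exponent

p = Power(2)
B = apply(p, [1.0 2.0; 3.0 4.0])  # squares every element uniformly
```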
Looking at how we pass in the data, the first problem is that it can change how the constructor is represented, which makes it difficult to wrangle in the code.
Moreover, it's not obvious what these parameters mean.
Second, it makes the assumption that we want separate scaling parameters for each slice.
(This feature probably comes from my earlier suggestion, but that comment should have been clearer about prioritising simplicity.)
What this means is that when applying this to `A`, different elements are transformed by different variables, which violates the principle above.
Finally, the transform is overly coupled to the data it was computed on.
While it should have a memory of this data (as it is stateful), that memory lies in the scaling parameters themselves; after computing these, it should operate as an independent transform like any other.
But right now, applying it to certain `dims`, `inds`, and `cols` relies too heavily on the original data, and it is therefore too brittle and restrictive.

Compare with this branch.
The scaling is computed across all the data it is given; `dims` doesn't matter unless `inds` is provided.
And it returns a consistent constructor no matter what format the data is in.
Afterwards, the `scaling` is an independent transform like any other and can be applied freely.
This takes into account that the data may have changed (via other transforms), so the user should be able to specify what to apply it to.
I'll note that I think this also closes #39, in that transforming the second column is consistent.
This is thanks to `scaling` now acting like an independent transform.