-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
POC: Add traits to support generalising the apply methods #77
Conversation
Codecov Report
@@ Coverage Diff @@
## main #77 +/- ##
==========================================
+ Coverage 99.09% 99.24% +0.14%
==========================================
Files 10 12 +2
Lines 111 132 +21
==========================================
+ Hits 110 131 +21
Misses 1 1
Continue to review full report at Codecov.
|
[[1, 0], [0, 1], [0, 0], [0, 0], [0, 0], [0, 0], [0, 0], [0, 0], [1, 0], [0, 1]], | ||
[Symbol.(:Column, x) for x in 1:10], | ||
Bool[1 0 0 0 0; 0 1 0 0 0; 0 0 0 1 0; 0 0 0 0 1], | ||
[Symbol.(:Column, x) for x in 1:5], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
interestingly this result changed when I finished the refactoring, even though OHE wasn't touched, so I think this was a bug. Indeed, the previous result was inconsistent with the matrix application but I didn't notice that before. This version is now consistent.
Either way I'm not too concerned about this because I'm not sure how much sense it makes to apply OHE to multiple columns. The output isn't necessarily valid or useful, so I would considering dropping the tests on multi-dimensional array altogether.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you explain more what was going on before and how it works now? I'm just pressed for time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think what was happening before is (in the Tables apply
) when we had multiple columns we were calling _apply
on each component in turn and then stacking the result. This lead to the original version that had many columns, which would have had duplicate names.
Now instead, we collect the components and call _apply
on the collection, which is what we were doing for the Array
method all along. So now we get one column per category (as expected) but the shape is now inconsistent.
This is what I mean by it doesn't make sense to call OHE on a Matrix. I think in a future release I'm going to delete these tests because they don't seem worth supporting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks good to me. Generalising LC is nice and I'm glad an intermediate _apply
wasn't needed. There could still be a risk of _preformat
and _postformat
becoming bloated, but I think we can cross that bridge if we come to it.
# _preformat formats the data before calling _apply. Needed for all apply methods. | ||
# Before applying a ManyToOne Transform we must first slice up the data along the dimension | ||
# we are reducing over. | ||
_preformat(::Cardinality, A, d) = A |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just had a thought that this kind of function (optionally doing something before _apply
) would be suited to precomputing statistics for MeanStdScaling
without needing to specify args twice #59. But that goes with the type of Transform
rather than Cardinality
.
# After applying a ManyToOne Transform, if we want to cat the result we have to reshape it | ||
# setting the reduced dimension to 1, otherwise cat will error. | ||
_postformat(::Cardinality, result, A, d) = result | ||
function _postformat(::ManyToOne, result, A, d) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could comment on how (IIUC) d
is different for _postformat
, since append_dim
is passed to it in apply_append
rather than dims
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure, I renamed d
->append_dim
in the other PR but I'll explain here as well.
Basically, the problem was when we want to append a row to an array, but LinearCombination always returns a column vector. So we have to reshape it into a row to get it to fit.
In general, when we want to cat
the result with apply_append
for ManyToOne transforms (which always return a N-1 array) we have to reshape it so that the reduced dimension is always 1 so that it fits.
@@ -75,9 +73,12 @@ function apply(table, t::Transform; cols=_get_cols(table), header=nothing, kwarg | |||
# Extract a columns iterator that we should be able to use to mutate the data. | |||
# NOTE: Mutation is not guaranteed for all table types, but it avoid copying the data | |||
coltable = Tables.columntable(table) | |||
cols = _to_vec(cols) | |||
components = reduce(hcat, getproperty(coltable, col) for col in _to_vec(cols)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could use a comment. Is this just converting to a Matrix
? Isn't there a function for that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the problem is getproperty(coltable, col) for col in _to_vec(cols)
is a Vector{Vector}
so calling Matrix
on this wouldn't work.
I assume calling Matrix
is what you meant by there being a function?
# We call hcat to convert any Vector components/results into a Matrix. | ||
# Passing dims=2 only matters for ManyToOne transforms - otherwise it has no effect. | ||
input = _preformat(cardinality(t), hcat(components), 2) | ||
result = hcat(_apply(input, t; dims=2, kwargs...)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bit of a smell that we're using hcat
3 times in this function, I'm not quite sure what's going on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've left more comments. It's to make sure we can use eachslice
in _preformat
, and then pipe the result into a Table. Without hcat(components)
we can't take a LinearCombination of a single column in a NamedTuple/DataFrame. Not sure how useful that is but we're supporting it.
|
||
struct FakeManyToManyTransform <: Transform end | ||
FeatureTransforms.cardinality(::FakeManyToManyTransform) = ManyToMany() | ||
FeatureTransforms._apply(A, ::FakeManyToManyTransform; kwargs...) = hcat(ones(size(A)), ones(size(A))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this meant to be the same as FakeOneToManyTransform
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes - FakeXToManyTransforms should output multiple components. It doesn't really matter what the input is, although for ManyToMany the output can be bigger/smaller than the input.
@testset "dims" begin | ||
@testset "dims = :" begin | ||
M = [1 1; 2 2; 3 5] | ||
lc = LinearCombination([1, -1, 1]) | ||
@test_throws ArgumentError FeatureTransforms.apply(M, lc; dims=:) | ||
@test_throws MethodError FeatureTransforms.apply(M, lc; dims=:) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What caused this change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When we removed the explicit check for dims=:
the ArgumentError
is no longer hit.
Now instead it hits eachslice(A; dims=:)
which throws a MethodError
.
[[1, 0], [0, 1], [0, 0], [0, 0], [0, 0], [0, 0], [0, 0], [0, 0], [1, 0], [0, 1]], | ||
[Symbol.(:Column, x) for x in 1:10], | ||
Bool[1 0 0 0 0; 0 1 0 0 0; 0 0 0 1 0; 0 0 0 0 1], | ||
[Symbol.(:Column, x) for x in 1:5], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you explain more what was going on before and how it works now? I'm just pressed for time.
@@ -110,7 +109,9 @@ is appended to `A` along the `append_dim` dimension. The remaining `kwargs` corr | |||
the usual [`Transform`](@ref) being invoked. | |||
""" | |||
function apply_append(A::AbstractArray, t; append_dim, kwargs...)::AbstractArray | |||
return cat(A, apply(A, t; kwargs...); dims=append_dim) | |||
result = apply(A, t; kwargs...) | |||
result = _postformat(cardinality(t), result, A, append_dim) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just want to check that append_dim
is definitely right. This came up in the apply_append
PR but do we have any test cases for append_dim != dims
yet?
closed in favour of #80 |
POC for #75
Some brief notes to guide review:
This started off implementing the idea in the description of #75: adding traits to describe the cardinality of our transforms and defining an intermediate
_apply
method that dispatched on the traits.But after some refactoring at the end, I noticed that the key difference boiled down to how the data was prepared before going into a
ManyToOne
transform (LinearCombination
).So I refactored a bit more to create two simple formatting functions:
_preformat
structures the input to_apply
according to the cardinality of the transform. Effectively, this just callseachslice
before calling_apply
forLinearCombination
._preformat
structures the output of_apply
according to the cardinality of the transform. This is really only needed forappend_apply
becauseLinearCombination
returns aVector
, if we are appending it as a rowcat
throws a DimensionMismatch. So we have to pivot the data before doing so.I added some
TestUtils
withFakes
for the above transforms and an example testset for how we might test a new data type in a more manageable way in the future.The idea being that: once a datatype supports each type of transform (and returns correct result for the
kwargs
) then it should support any of the transforms defined in the package.Then, each transform just needs to be tested against
Array
s and nothing else, which should give us full test coverage but in a much simpler way.