Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

POC: Add traits to support generalising the apply methods #77

Closed
wants to merge 12 commits into from

Conversation

glennmoy
Copy link
Member

@glennmoy glennmoy commented Apr 9, 2021

POC for #75

Some brief notes to guide review:

This started off implementing the idea in the description of #75: adding traits to describe the cardinality of our transforms and defining an intermediate _apply method that dispatched on the traits.

But after some refactoring at the end, I noticed that the key difference boiled down to how the data was prepared before going into a ManyToOne transform (LinearCombination).

So I refactored a bit more to create two simple formatting functions:

  • _preformat structures the input to _apply according to the cardinality of the transform. Effectively, this just calls eachslice before calling _apply for LinearCombination.
  • _preformat structures the output of _apply according to the cardinality of the transform. This is really only needed for append_apply because LinearCombination returns a Vector, if we are appending it as a row cat throws a DimensionMismatch. So we have to pivot the data before doing so.

I added some TestUtils with Fakes for the above transforms and an example testset for how we might test a new data type in a more manageable way in the future.
The idea being that: once a datatype supports each type of transform (and returns correct result for the kwargs) then it should support any of the transforms defined in the package.

Then, each transform just needs to be tested against Arrays and nothing else, which should give us full test coverage but in a much simpler way.

@codecov
Copy link

codecov bot commented Apr 9, 2021

Codecov Report

Merging #77 (1276d95) into main (55e3c1a) will increase coverage by 0.14%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main      #77      +/-   ##
==========================================
+ Coverage   99.09%   99.24%   +0.14%     
==========================================
  Files          10       12       +2     
  Lines         111      132      +21     
==========================================
+ Hits          110      131      +21     
  Misses          1        1              
Impacted Files Coverage Δ
src/FeatureTransforms.jl 100.00% <ø> (ø)
src/apply.jl 100.00% <100.00%> (ø)
src/linear_combination.jl 100.00% <100.00%> (ø)
src/one_hot_encoding.jl 100.00% <100.00%> (ø)
src/periodic.jl 100.00% <100.00%> (ø)
src/power.jl 100.00% <100.00%> (ø)
src/scaling.jl 100.00% <100.00%> (ø)
src/temporal.jl 100.00% <100.00%> (ø)
src/test_utils.jl 100.00% <100.00%> (ø)
src/traits.jl 100.00% <100.00%> (ø)
... and 2 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 55e3c1a...1276d95. Read the comment docs.

src/traits.jl Show resolved Hide resolved
src/traits.jl Show resolved Hide resolved
Comment on lines -177 to +178
[[1, 0], [0, 1], [0, 0], [0, 0], [0, 0], [0, 0], [0, 0], [0, 0], [1, 0], [0, 1]],
[Symbol.(:Column, x) for x in 1:10],
Bool[1 0 0 0 0; 0 1 0 0 0; 0 0 0 1 0; 0 0 0 0 1],
[Symbol.(:Column, x) for x in 1:5],
Copy link
Member Author

@glennmoy glennmoy Apr 9, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

interestingly this result changed when I finished the refactoring, even though OHE wasn't touched, so I think this was a bug. Indeed, the previous result was inconsistent with the matrix application but I didn't notice that before. This version is now consistent.

Either way I'm not too concerned about this because I'm not sure how much sense it makes to apply OHE to multiple columns. The output isn't necessarily valid or useful, so I would considering dropping the tests on multi-dimensional array altogether.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you explain more what was going on before and how it works now? I'm just pressed for time.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think what was happening before is (in the Tables apply) when we had multiple columns we were calling _apply on each component in turn and then stacking the result. This lead to the original version that had many columns, which would have had duplicate names.

Now instead, we collect the components and call _apply on the collection, which is what we were doing for the Array method all along. So now we get one column per category (as expected) but the shape is now inconsistent.

This is what I mean by it doesn't make sense to call OHE on a Matrix. I think in a future release I'm going to delete these tests because they don't seem worth supporting.

Copy link
Contributor

@bencottier bencottier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good to me. Generalising LC is nice and I'm glad an intermediate _apply wasn't needed. There could still be a risk of _preformat and _postformat becoming bloated, but I think we can cross that bridge if we come to it.

# _preformat formats the data before calling _apply. Needed for all apply methods.
# Before applying a ManyToOne Transform we must first slice up the data along the dimension
# we are reducing over.
_preformat(::Cardinality, A, d) = A
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just had a thought that this kind of function (optionally doing something before _apply) would be suited to precomputing statistics for MeanStdScaling without needing to specify args twice #59. But that goes with the type of Transform rather than Cardinality.

# After applying a ManyToOne Transform, if we want to cat the result we have to reshape it
# setting the reduced dimension to 1, otherwise cat will error.
_postformat(::Cardinality, result, A, d) = result
function _postformat(::ManyToOne, result, A, d)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could comment on how (IIUC) d is different for _postformat, since append_dim is passed to it in apply_append rather than dims.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, I renamed d->append_dim in the other PR but I'll explain here as well.

Basically, the problem was when we want to append a row to an array, but LinearCombination always returns a column vector. So we have to reshape it into a row to get it to fit.

In general, when we want to cat the result with apply_append for ManyToOne transforms (which always return a N-1 array) we have to reshape it so that the reduced dimension is always 1 so that it fits.

@@ -75,9 +73,12 @@ function apply(table, t::Transform; cols=_get_cols(table), header=nothing, kwarg
# Extract a columns iterator that we should be able to use to mutate the data.
# NOTE: Mutation is not guaranteed for all table types, but it avoid copying the data
coltable = Tables.columntable(table)
cols = _to_vec(cols)
components = reduce(hcat, getproperty(coltable, col) for col in _to_vec(cols))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could use a comment. Is this just converting to a Matrix? Isn't there a function for that?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the problem is getproperty(coltable, col) for col in _to_vec(cols) is a Vector{Vector} so calling Matrix on this wouldn't work.

I assume calling Matrix is what you meant by there being a function?

# We call hcat to convert any Vector components/results into a Matrix.
# Passing dims=2 only matters for ManyToOne transforms - otherwise it has no effect.
input = _preformat(cardinality(t), hcat(components), 2)
result = hcat(_apply(input, t; dims=2, kwargs...))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bit of a smell that we're using hcat 3 times in this function, I'm not quite sure what's going on.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've left more comments. It's to make sure we can use eachslice in _preformat, and then pipe the result into a Table. Without hcat(components) we can't take a LinearCombination of a single column in a NamedTuple/DataFrame. Not sure how useful that is but we're supporting it.

src/linear_combination.jl Show resolved Hide resolved

struct FakeManyToManyTransform <: Transform end
FeatureTransforms.cardinality(::FakeManyToManyTransform) = ManyToMany()
FeatureTransforms._apply(A, ::FakeManyToManyTransform; kwargs...) = hcat(ones(size(A)), ones(size(A)))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this meant to be the same as FakeOneToManyTransform?

Copy link
Member Author

@glennmoy glennmoy Apr 16, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes - FakeXToManyTransforms should output multiple components. It doesn't really matter what the input is, although for ManyToMany the output can be bigger/smaller than the input.

@testset "dims" begin
@testset "dims = :" begin
M = [1 1; 2 2; 3 5]
lc = LinearCombination([1, -1, 1])
@test_throws ArgumentError FeatureTransforms.apply(M, lc; dims=:)
@test_throws MethodError FeatureTransforms.apply(M, lc; dims=:)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What caused this change?

Copy link
Member Author

@glennmoy glennmoy Apr 16, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When we removed the explicit check for dims=: the ArgumentError is no longer hit.

Now instead it hits eachslice(A; dims=:) which throws a MethodError.

test/runtests.jl Show resolved Hide resolved
Comment on lines -177 to +178
[[1, 0], [0, 1], [0, 0], [0, 0], [0, 0], [0, 0], [0, 0], [0, 0], [1, 0], [0, 1]],
[Symbol.(:Column, x) for x in 1:10],
Bool[1 0 0 0 0; 0 1 0 0 0; 0 0 0 1 0; 0 0 0 0 1],
[Symbol.(:Column, x) for x in 1:5],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you explain more what was going on before and how it works now? I'm just pressed for time.

@@ -110,7 +109,9 @@ is appended to `A` along the `append_dim` dimension. The remaining `kwargs` corr
the usual [`Transform`](@ref) being invoked.
"""
function apply_append(A::AbstractArray, t; append_dim, kwargs...)::AbstractArray
return cat(A, apply(A, t; kwargs...); dims=append_dim)
result = apply(A, t; kwargs...)
result = _postformat(cardinality(t), result, A, append_dim)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just want to check that append_dim is definitely right. This came up in the apply_append PR but do we have any test cases for append_dim != dims yet?

@glennmoy
Copy link
Member Author

closed in favour of #80

@glennmoy glennmoy closed this Apr 16, 2021
@glennmoy glennmoy deleted the gm/traits branch June 21, 2021 12:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants