POC: Add traits to support generalising the apply methods #77

glennmoy · 2021-04-09T18:35:47Z

POC for #75

Some brief notes to guide review:

This started off implementing the idea in the description of #75: adding traits to describe the cardinality of our transforms and defining an intermediate _apply method that dispatched on the traits.

But after some refactoring at the end, I noticed that the key difference boiled down to how the data was prepared before going into a ManyToOne transform (LinearCombination).

So I refactored a bit more to create two simple formatting functions:

_preformat structures the input to _apply according to the cardinality of the transform. Effectively, this just calls eachslice before calling _apply for LinearCombination.
_postformat structures the output of _apply according to the cardinality of the transform. This is really only needed for apply_append: LinearCombination returns a Vector, so if we append it as a row, cat throws a DimensionMismatch. We therefore have to pivot the data before doing so.
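
To make the dispatch idea concrete, here is a minimal sketch rather than the code in this PR. `OneToOne`, `Doubler`, `SliceSum` and the simplified `apply` signature are assumptions made up for illustration; only `Cardinality`, `ManyToOne`, `cardinality`, `_preformat` and `_apply` are names used in the PR.

```julia
# Cardinality traits. ManyToOne appears in the PR; OneToOne is an assumed name.
abstract type Cardinality end
struct OneToOne <: Cardinality end
struct ManyToOne <: Cardinality end

# Hypothetical stand-ins for real transforms.
abstract type Transform end
struct Doubler <: Transform end
struct SliceSum <: Transform end

cardinality(::Doubler) = OneToOne()
cardinality(::SliceSum) = ManyToOne()

# Default: pass the data through untouched.
_preformat(::Cardinality, A, d) = A
# ManyToOne: slice the data along the dimension being reduced over.
_preformat(::ManyToOne, A, d) = eachslice(A; dims=d)

_apply(A, ::Doubler) = 2 .* A
_apply(slices, ::SliceSum) = sum(slices)  # adds the slices element-wise

# Simplified apply: format the input by cardinality, then call _apply.
apply(A, t::Transform; dims) = _apply(_preformat(cardinality(t), A, dims), t)

M = [1 2; 3 4; 5 6]
doubled = apply(M, Doubler(); dims=1)   # same shape as M
summed = apply(M, SliceSum(); dims=1)   # one value per column
```

The point of the sketch is that `apply` itself never branches on the transform type; the only cardinality-specific code lives in the `_preformat` methods.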

I added some TestUtils with Fakes for the above transforms, plus an example testset showing how we might test a new data type in a more manageable way in the future.
The idea is that once a data type supports each type of transform (and returns the correct result for the kwargs), it should support any of the transforms defined in the package.

Then, each transform just needs to be tested against Arrays and nothing else, which should give us full test coverage but in a much simpler way.

codecov · 2021-04-09T18:40:48Z

Codecov Report

Merging #77 (1276d95) into main (55e3c1a) will increase coverage by 0.14%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main      #77      +/-   ##
==========================================
+ Coverage   99.09%   99.24%   +0.14%     
==========================================
  Files          10       12       +2     
  Lines         111      132      +21     
==========================================
+ Hits          110      131      +21     
  Misses          1        1

Impacted Files	Coverage Δ
src/FeatureTransforms.jl	`100.00% <ø> (ø)`
src/apply.jl	`100.00% <100.00%> (ø)`
src/linear_combination.jl	`100.00% <100.00%> (ø)`
src/one_hot_encoding.jl	`100.00% <100.00%> (ø)`
src/periodic.jl	`100.00% <100.00%> (ø)`
src/power.jl	`100.00% <100.00%> (ø)`
src/scaling.jl	`100.00% <100.00%> (ø)`
src/temporal.jl	`100.00% <100.00%> (ø)`
src/test_utils.jl	`100.00% <100.00%> (ø)`
src/traits.jl	`100.00% <100.00%> (ø)`
... and 2 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 55e3c1a...1276d95.

src/traits.jl

glennmoy · 2021-04-09T18:53:37Z

test/one_hot_encoding.jl

-            [[1, 0], [0, 1], [0, 0], [0, 0], [0, 0], [0, 0], [0, 0], [0, 0], [1, 0], [0, 1]],
-            [Symbol.(:Column, x) for x in 1:10],
+            Bool[1 0 0 0 0; 0 1 0 0 0; 0 0 0 1 0; 0 0 0 0 1],
+            [Symbol.(:Column, x) for x in 1:5],


Interestingly, this result changed when I finished the refactoring, even though OHE wasn't touched, so I think this was a bug. Indeed, the previous result was inconsistent with the matrix application, but I didn't notice that before. This version is now consistent.

Either way, I'm not too concerned about this because I'm not sure how much sense it makes to apply OHE to multiple columns. The output isn't necessarily valid or useful, so I would consider dropping the tests on multi-dimensional arrays altogether.

Could you explain more what was going on before and how it works now? I'm just pressed for time.

I think what was happening before (in the Tables apply) is that when we had multiple columns, we were calling _apply on each component in turn and then stacking the results. This led to the original version that had many columns, which would have had duplicate names.

Now instead, we collect the components and call _apply on the collection, which is what we were doing for the Array method all along. So now we get one column per category (as expected) but the shape is now inconsistent.

This is what I mean when I say it doesn't make sense to call OHE on a Matrix. I think in a future release I'm going to delete these tests because they don't seem worth supporting.

bencottier

Overall looks good to me. Generalising LC is nice and I'm glad an intermediate _apply wasn't needed. There could still be a risk of _preformat and _postformat becoming bloated, but I think we can cross that bridge if we come to it.

bencottier · 2021-04-12T12:55:19Z

src/apply.jl

+# _preformat formats the data before calling _apply. Needed for all apply methods.
+# Before applying a ManyToOne Transform we must first slice up the data along the dimension
+# we are reducing over.
+_preformat(::Cardinality, A, d) = A


Just had a thought that this kind of function (optionally doing something before _apply) would be suited to precomputing statistics for MeanStdScaling without needing to specify args twice #59. But that goes with the type of Transform rather than Cardinality.

bencottier · 2021-04-12T13:19:26Z

src/apply.jl

+# After applying a ManyToOne Transform, if we want to cat the result we have to reshape it
+# setting the reduced dimension to 1, otherwise cat will error.
+_postformat(::Cardinality, result, A, d) = result
+function _postformat(::ManyToOne, result, A, d)


Could comment on how (IIUC) d is different for _postformat, since append_dim is passed to it in apply_append rather than dims.

Sure, I renamed d->append_dim in the other PR but I'll explain here as well.

Basically, the problem arose when we wanted to append a row to an array, but LinearCombination always returns a column vector. So we have to reshape it into a row to get it to fit.

In general, when we want to cat the result with apply_append for ManyToOne transforms (which always return an (N-1)-dimensional array), we have to reshape it so that the reduced dimension has size 1 and it fits.
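
As a rough illustration of that reshape (a sketch only, not the `_postformat` body from this PR), the hypothetical helper `to_appendable` below gives the result the same shape as `A` except for size 1 along `append_dim`. It assumes `append_dim` is the dimension that was reduced over.

```julia
# Hypothetical helper: reshape `result` to match `A`, but with size 1 along
# `append_dim`, so that `cat` accepts it.
function to_appendable(result, A, append_dim)
    new_size = ntuple(i -> i == append_dim ? 1 : size(A, i), ndims(A))
    return reshape(result, new_size)
end

A = [1 2 3; 4 5 6]
result = vec(sum(A; dims=1))       # 3-element Vector: one value per column

row = to_appendable(result, A, 1)  # 1×3 Matrix
B = cat(A, row; dims=1)            # 3×3 Matrix with the new row at the bottom
```

If `append_dim` differed from the reduced dimension, the element counts would no longer match and the `reshape` in this sketch would fail, which is why the `append_dim != dims` case deserves its own test.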

bencottier · 2021-04-12T13:22:22Z

src/apply.jl

@@ -75,9 +73,12 @@ function apply(table, t::Transform; cols=_get_cols(table), header=nothing, kwarg
    # Extract a columns iterator that we should be able to use to mutate the data.
    # NOTE: Mutation is not guaranteed for all table types, but it avoid copying the data
    coltable = Tables.columntable(table)
-    cols = _to_vec(cols)
+    components = reduce(hcat, getproperty(coltable, col) for col in _to_vec(cols))


Could use a comment. Is this just converting to a Matrix? Isn't there a function for that?

The problem is that getproperty(coltable, col) for col in _to_vec(cols) is a Vector{Vector}, so calling Matrix on this wouldn't work.

I assume calling Matrix is what you meant by there being a function?
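
For reference, a small standalone example of the two hcat behaviours relied on here, using made-up data rather than a real table:

```julia
cols = [[1, 2, 3], [4, 5, 6]]  # a Vector of column Vectors

M = reduce(hcat, cols)          # 3×2 Matrix, one column per inner Vector

v = [1, 2, 3]
size_v = size(v)                # (3,)
size_hv = size(hcat(v))         # (3, 1): a lone Vector becomes a one-column Matrix
```

The second behaviour is what lets a single selected column be treated as a Matrix before slicing.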

bencottier · 2021-04-12T13:25:44Z

src/apply.jl

+    # We call hcat to convert any Vector components/results into a Matrix.
+    # Passing dims=2 only matters for ManyToOne transforms - otherwise it has no effect.
+    input = _preformat(cardinality(t), hcat(components), 2)
+    result = hcat(_apply(input, t; dims=2, kwargs...))


Bit of a smell that we're using hcat 3 times in this function, I'm not quite sure what's going on.

I've left more comments. It's to make sure we can use eachslice in _preformat, and then pipe the result into a Table. Without hcat(components) we can't take a LinearCombination of a single column in a NamedTuple/DataFrame. Not sure how useful that is but we're supporting it.

src/linear_combination.jl

bencottier · 2021-04-12T13:49:59Z

src/test_utils.jl

+
+struct FakeManyToManyTransform <: Transform end
+FeatureTransforms.cardinality(::FakeManyToManyTransform) = ManyToMany()
+FeatureTransforms._apply(A, ::FakeManyToManyTransform; kwargs...) = hcat(ones(size(A)), ones(size(A)))


Is this meant to be the same as FakeOneToManyTransform?

yes - FakeXToManyTransforms should output multiple components. It doesn't really matter what the input is, although for ManyToMany the output can be bigger/smaller than the input.

bencottier · 2021-04-12T13:55:44Z

test/linear_combination.jl

        @testset "dims" begin
            @testset "dims = :" begin
                M = [1 1; 2 2; 3 5]
                lc = LinearCombination([1, -1, 1])
-                @test_throws ArgumentError FeatureTransforms.apply(M, lc; dims=:)
+                @test_throws MethodError FeatureTransforms.apply(M, lc; dims=:)


What caused this change?

Since we removed the explicit check for dims=:, the ArgumentError is no longer hit.

Instead, the call now reaches eachslice(A; dims=:), which throws a MethodError.

test/runtests.jl

bencottier · 2021-04-12T14:01:58Z

test/one_hot_encoding.jl

-            [[1, 0], [0, 1], [0, 0], [0, 0], [0, 0], [0, 0], [0, 0], [0, 0], [1, 0], [0, 1]],
-            [Symbol.(:Column, x) for x in 1:10],
+            Bool[1 0 0 0 0; 0 1 0 0 0; 0 0 0 1 0; 0 0 0 0 1],
+            [Symbol.(:Column, x) for x in 1:5],


Could you explain more what was going on before and how it works now? I'm just pressed for time.

bencottier · 2021-04-12T14:03:18Z

src/apply.jl

@@ -110,7 +109,9 @@ is appended to `A` along the `append_dim` dimension. The remaining `kwargs` corr
 the usual [`Transform`](@ref) being invoked.
 """
 function apply_append(A::AbstractArray, t; append_dim, kwargs...)::AbstractArray
-    return cat(A, apply(A, t; kwargs...); dims=append_dim)
+    result = apply(A, t; kwargs...)
+    result = _postformat(cardinality(t), result, A, append_dim)


Just want to check that append_dim is definitely right. This came up in the apply_append PR but do we have any test cases for append_dim != dims yet?

glennmoy · 2021-04-16T11:18:19Z

closed in favour of #80

Glenn Moynihan added 12 commits April 9, 2021 19:36

Define Cardinality Traits

992c398

Simplify LinearCombination

41e8501

Update apply methods

57879c9

Update OHE tests

b7e199d

Update linear combination tests

f95534a

Tidy up and add comments

5be2e70

Tidy up intermediate _apply methods

ea72631

Refactor _apply_append to _reformat

6104a2e

Refactor cardinality _apply methods into formatting methods

9cf843e

Tidy up code

82e5e73

Define FakeTransform TestUtils

aed98b9

Add POC tests for new data type

1276d95

glennmoy force-pushed the gm/traits branch from 8e9e029 to 1276d95 Compare April 9, 2021 18:36

glennmoy mentioned this pull request Apr 9, 2021

Use traits to generalise apply methods #75

Closed

glennmoy mentioned this pull request Apr 16, 2021

Refactor apply methods to use Traits #80

Merged

glennmoy closed this Apr 16, 2021

glennmoy deleted the gm/traits branch June 21, 2021 12:58
