Use a consistent convention for `dims` #18

bencottier · 2021-02-12T18:28:40Z

The apply method for AbstractArray in transformers.jl uses mapslices to apply a transform on each slice of the array, along a given dimension dims.

Meanwhile, the equivalent apply! method uses eachslice.

The problem is that mapslices and eachslice have opposite notions of dims. For example:

julia> M = [1 2 3; 4 5 6]
2×3 Array{Int64,2}:
 1  2  3
 4  5  6

julia> map(x -> println(x), eachslice(M; dims=1));
[1, 2, 3]
[4, 5, 6]

julia> mapslices(x -> println(x), M; dims=1);
[1, 4]
[2, 5]
[3, 6]

For higher dimensions, dims=3 in eachslice is equivalent to dims=[1, 2] in mapslices, for example (and note that eachslice only supports a single dimension in dims).

Note also that Statistics.mean and Statistics.std uses the same notion of dims as mapslices.

We should adopt a consistent convention for the meaning of dims, and explain this clearly in documentation.

The text was updated successfully, but these errors were encountered:

bencottier · 2021-02-17T14:18:54Z

mapslices does not seem to allow mutation.

julia> M = [1 2; 3 4];

julia> function add_one!(x)
           x[:] = x .+ 1
       end
add_one! (generic function with 1 method)

julia> mapslices(x -> add_one!(x), M; dims=1)
2×2 Array{Int64,2}:
 2  3
 4  5

julia> M
2×2 Array{Int64,2}:
 1  2
 3  4

Compare eachslice:

julia> map(x -> add_one!(x), eachslice(M; dims=2))
2-element Array{Array{Int64,1},1}:
 [2, 4]
 [3, 5]

julia> M
2×2 Array{Int64,2}:
 2  3
 4  5

bencottier · 2021-02-17T14:33:55Z

SplitApplyCombine.jl splitdimsview could be good. It uses the same dims convention as eachslice but allows more than a single dimension. It seems to be faster than mapslices.

Discovered from this post

julia> using SplitApplyCombine

julia> f_ms(A) = mapslices(x -> sum(x), A; dims=(2, 3))
f_ms (generic function with 1 method)

julia> f_es(A) = map(x -> sum(x), eachslice(A; dims=1))
f_es (generic function with 1 method)

julia> f_sac(A) = map(x -> sum(x), splitdimsview(A, 1))
f_sac (generic function with 1 method)

julia> A = rand(150, 50, 100);

julia> isapprox(f_es(A), f_sac(A); atol=1e-15)
true

julia> isapprox(f_es(A), f_ms(A); atol=1e-12)
false

julia> isapprox(f_es(A), f_ms(A); atol=1e-11)  # quite a difference in precision...
true

julia> @btime f_es(A)
  765.021 μs (6 allocations: 1.45 KiB)

julia> @btime f_ms(A);
  1.640 ms (1988 allocations: 89.39 KiB)

julia> @btime f_sac(A);
  811.317 μs (1 allocation: 1.33 KiB)

also supports mutation:

julia> map(splitdimsview(M, 2)) do x
           add_one!(x)
       end
2-element Array{Array{Int64,1},1}:
 [2, 4]
 [3, 5]

julia> M
2×2 Array{Int64,2}:
 2  3
 4  5

EDIT: splitdimsview does not support named dims, while eachslice and mapslices do.

bencottier · 2021-02-17T14:38:00Z

Note converting from one dims convention to another is pretty simple: setdiff(1:ndims(A), dims)

bencottier · 2021-02-17T15:24:15Z

Overall, I'm favouring eachslice.

Upsides:

Faster
Supports named dims
Supports mutation

Downsides:

Opposite convention to Statistics.mean which we use elsewhere (but conversion should be straightforward)
Only supports slicing along a single dimension (but multiple dimensions seem like a rare use case)

rofinn · 2021-02-17T15:59:16Z

Yeah, eachslice seems like the easiest path forward. I think if we want to add named dims support splitdimsview, we could probably do that in AxisKeys.jl and NamedDims.jl. That package seems pretty lightweight. That also seems like a relatively non-breaking change to introduce later since it'd just be adding support for multiple dims.

nicoleepp · 2021-02-17T16:18:09Z

I did run into issues trying to use eachslice in the non mutating apply method, can't remember what they were so would be worth trying to change it again, or otherwise just converting the dims passed into it.

bencottier · 2021-02-17T17:04:34Z

I did run into issues trying to use eachslice in the non mutating apply method, can't remember what they were so would be worth trying to change it again, or otherwise just converting the dims passed into it.

Yeah I'm getting different results with eachslice, stuff like this:

dims = 1: Test Failed at /Users/bencottier/JuliaEnvs/Transform.jl/test/temporal.jl:64
  Expression: Transforms.apply(M, hod; dims = d) == expected
   Evaluated: [[1, 9], [2, 10], [3, 11]] == [1 9; 2 10; 3 11]

nicoleepp · 2021-02-17T17:48:10Z

I did run into issues trying to use eachslice in the non mutating apply method, can't remember what they were so would be worth trying to change it again, or otherwise just converting the dims passed into it.

Yeah I'm getting different results with eachslice, stuff like this:
dims = 1: Test Failed at /Users/bencottier/JuliaEnvs/Transform.jl/test/temporal.jl:64
  Expression: Transforms.apply(M, hod; dims = d) == expected
   Evaluated: [[1, 9], [2, 10], [3, 11]] == [1 9; 2 10; 3 11]

Different results are to be expected given this changes what dims things operate on. If that's the only issue shouldn't be a problem. But I do think mapslices conserves the type and shape (where possible) of what it operate on where eachslice doesn't, which is the reason we switched to mapslices in the first place

rofinn · 2021-02-17T18:54:10Z

But I do think mapslices conserves the type and shape (where possible) of what it operate on where eachslice doesn't, which is the reason we switched to mapslices in the first place

If you're mutating the values of the slice shouldn't that preserve the dimensionality of the input array though? I haven't run into that issue with Impute.jl

nicoleepp · 2021-02-17T19:10:22Z

But I do think mapslices conserves the type and shape (where possible) of what it operate on where eachslice doesn't, which is the reason we switched to mapslices in the first place

If you're mutating the values of the slice shouldn't that preserve the dimensionality of the input array though? I haven't run into that issue with Impute.jl

If you are mutating it does, if you are constructing a new type to return it doesn't but mapslices does preserve the shape when not mutating

rofinn · 2021-02-17T20:08:25Z

preserve the shape when not mutating

Are there links to parts of the code where we can't do the mutating form? I was thinking that even for cases where the type needs to change, so the base API can't be mutating, that we'd still have enough info to pre-allocate the output? Then we'd just iterate over the slices of the inputs and outputs and mutate the pre-allocated outputs?

nicoleepp · 2021-02-17T20:43:59Z

preserve the shape when not mutating

Are there links to parts of the code where we can't do the mutating form? I was thinking that even for cases where the type needs to change, so the base API can't be mutating, that we'd still have enough info to pre-allocate the output? Then we'd just iterate over the slices of the inputs and outputs and mutate the pre-allocated outputs?

https://github.com/invenia/Transforms.jl/blob/main/src/transformers.jl#L75

The output of _apply is not always 1-1, leading to issues if you expect to allocate it a previously created object

bencottier · 2021-02-19T16:08:38Z

Polling result from asking 7 people in Research (3 ML, 1 PS, 1 DS, 2 RSE): all say the mapslices and mean convention is more intuitive/expected (note, in fairness, that the example I gave was about taking the mean, but I asked for their intuition without looking up code). A couple expressed pretty strong opinions that this was the right way: the dims that you specify are the dims that get reduced and removed (or turned into singletons) from the resulting array.

However, that raises an interesting point: consider

julia> M = NamedDimsArray{(:features, :observations)}([1 3; 5 7])
2×2 NamedDimsArray(::Array{Int64,2}, (:features, :observations)):
            → observations
↓ features  1  3
            5  7

julia> mean(M; dims=2)
2×1 NamedDimsArray(::Array{Float64,2}, (:features, :observations)):
            → observations
↓ features  2.0
            6.0

julia> mean(eachslice(M; dims=2))
2-element NamedDimsArray(::Array{Float64,1}, (:features,)):
↓ features  2.0
            6.0

julia> map(mean, eachslice(M; dims=2))
2-element Array{Float64,1}:
 3.0
 5.0

Notice that the convention using eachslice is consistent with mean if you pass it in directly. It's because mean(eachslice... takes the mean over the slices (combining them together), while the map case takes the mean of each slice, one at a time. It's like mapreduce vs. map.

This raises the question: can the dims convention legitimately differ, depending on whether or not the transform involves a reduce-style operation?

It could, but after the polling and discussion with people, I still think we should be consistent and use the mapslices/mean convention. Because you can still explain that in the way mapslices does:

Transform the given dimensions of array A using function f. f is called on each slice of A of the form A[...,:,...,:,...].
dims is an  integer vector specifying where the colons go in this expression. The results are concatenated along the
 remaining dimensions. For example, if dims is [1,2] and A is 4-dimensional, f is called on A[:,:,i,j] for all i and j.

If you accept all of that, a lingering problem is that the general apply! (from bc/scaling branch) won't be able to reduce one dimension at a time for >2D arrays, because:

(suppose) we follow the mapslices/mean convention for dims, so
If you want to reduce on one dimension, you would specify that dimension for dims, then
apply! would invert those dims to work for eachslice, turning the single dim into multiple, and then
eachslice would complain because it doesn't support multiple dims.

rofinn · 2021-02-19T18:39:15Z

In retrospect, I'm not a fan of using eachslice, eachcol, eachrow, but then trying to match the conventions of mapslices. I feel like we should be transparent about which default operation we're performing behind the scenes rather than introducing our own hybrid behaviour. The reason I was a fan of the eachslice default in Impute.jl was that it's easier to map the behaviour of eachcol and eachrow to the corresponding default logic with tables. That being said, only filter was doing a reduction, so that use-case didn't really come up.

nicoleepp · 2021-02-19T19:11:39Z

In retrospect, I'm not a fan of using eachslice, eachcol, eachrow, but then trying to match the conventions of mapslices. I feel like we should be transparent about which default operation we're performing behind the scenes rather than introducing our own hybrid behaviour.

Can you elaborate more on why you aren't a fan anymore?

rofinn · 2021-02-19T19:39:34Z

Just that if our default behaviour is basically just running eachslice or mapslices behind the scenes then we should say so and have the dims behaviour be consistent with that. If our dims behaviour matches mapslices, but we're still passing SubArray to the specific implemented _apply! methods then we're just creating more confusion than is necessary. If folks want to overload our default behaviour it should be pretty clear what the default is based on knowledge of Julia more broadly.

NOTE: This might be a relevant discussion on the differences. JuliaLang/julia#29146

glennmoy · 2021-02-22T14:20:55Z

Polling result from asking 7 people in Research (3 ML, 1 PS, 1 DS, 2 RSE): all say the mapslices and mean convention is more intuitive/expected...A couple expressed pretty strong opinions that this was the right way: the dims that you specify are the dims that get reduced and removed (or turned into singletons) from the resulting array.

I think @bencottier's survey provides the strongest reason(s) for favouring mapslices.

It would also be interesting to check our currently implemented Transforms (which I will do later) for examples where dims is used in a context where reduction is not taking place and/or eachslice (when used with dims) is more intuitive to use.

I suspect there won't be many examples...

glennmoy · 2021-02-22T14:22:21Z

EDIT: splitdimsview does not support named dims, while eachslice and mapslices do.

What happens? Can you open an issue in NamedDims.j and reference it here for later?

bencottier · 2021-02-22T15:09:51Z

EDIT: splitdimsview does not support named dims, while eachslice and mapslices do.

What happens? Can you open an issue in NamedDims.j and reference it here for later?

invenia/NamedDims.jl#162

SplitApplyCombine.jl restricts dims to Int here.

julia> using SplitApplyCombine, NamedDims

julia> M = NamedDimsArray{(:features, :observations)}([1 3; 5 7])

julia> map(x -> sum(x), splitdimsview(M, :features))
ERROR: MethodError: no method matching splitdimsview(::NamedDimsArray{(:features, :observations),Int64,2,Array{Int64,2}}, ::Symbol)
Closest candidates are:
  splitdimsview(::AbstractArray) at /Users/bencottier/.julia/packages/SplitApplyCombine/RS5bI/src/splitdims.jl:127
  splitdimsview(::AbstractArray, ::Int64) at /Users/bencottier/.julia/packages/SplitApplyCombine/RS5bI/src/splitdims.jl:129
  splitdimsview(::AbstractArray{var"#s46",N} where var"#s46", ::Tuple{Vararg{Int64,M}}) where {N, M} at /Users/bencottier/.julia/packages/SplitApplyCombine/RS5bI/src/splitdims.jl:130
Stacktrace:
 [1] top-level scope at REPL[4]:1

julia> map(x -> sum(x), splitdimsview(M, 1))
2-element Array{Int64,1}:
  4
 12

julia> map(x -> sum(x), eachslice(M; dims=:features))
2-element Array{Int64,1}:
  4
 12

nickrobinson251 · 2021-02-26T20:29:12Z

Some more relevant links:

PR to create a Base.SliceArray type add Slices array type for eachslice/eachrow/eachcol JuliaLang/julia#32310
- Base seems to be focussing on eachslice (over mapslices)
Issue discussing dims convention mapslices, map interpret dims differently JuliaLang/julia#34840
Issues with mapslices:

Some relevant packages:

bencottier added the design label Feb 12, 2021

bencottier mentioned this issue Feb 12, 2021

Add MeanStdScaling transform #17

Merged

bencottier added the bug Something isn't working label Feb 16, 2021

This was referenced Feb 19, 2021

Support named dims for dims keyword argument #20

Closed

Support slices over multiple dims #21

Closed

glennmoy mentioned this issue Feb 22, 2021

Clean up apply! code #25

Merged

glennmoy closed this as completed in #25 Feb 24, 2021

This was referenced Mar 5, 2021

Refactor scaling to only compute one mean and std #42

Merged

Replace mapslices with selectdim #45

Merged

bencottier mentioned this issue Apr 16, 2021

API simplification invenia/Impute.jl#66

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use a consistent convention for `dims` #18

Use a consistent convention for `dims` #18

bencottier commented Feb 12, 2021 •

edited

Loading

bencottier commented Feb 17, 2021 •

edited

Loading

bencottier commented Feb 17, 2021 •

edited

Loading

bencottier commented Feb 17, 2021

bencottier commented Feb 17, 2021

rofinn commented Feb 17, 2021 •

edited

Loading

nicoleepp commented Feb 17, 2021

bencottier commented Feb 17, 2021

nicoleepp commented Feb 17, 2021

rofinn commented Feb 17, 2021

nicoleepp commented Feb 17, 2021

rofinn commented Feb 17, 2021 •

edited

Loading

nicoleepp commented Feb 17, 2021

bencottier commented Feb 19, 2021 •

edited

Loading

rofinn commented Feb 19, 2021

nicoleepp commented Feb 19, 2021

rofinn commented Feb 19, 2021 •

edited

Loading

glennmoy commented Feb 22, 2021 •

edited

Loading

glennmoy commented Feb 22, 2021

bencottier commented Feb 22, 2021

nickrobinson251 commented Feb 26, 2021

Use a consistent convention for dims #18

Use a consistent convention for dims #18

Comments

bencottier commented Feb 12, 2021 • edited Loading

bencottier commented Feb 17, 2021 • edited Loading

bencottier commented Feb 17, 2021 • edited Loading

bencottier commented Feb 17, 2021

bencottier commented Feb 17, 2021

rofinn commented Feb 17, 2021 • edited Loading

nicoleepp commented Feb 17, 2021

bencottier commented Feb 17, 2021

nicoleepp commented Feb 17, 2021

rofinn commented Feb 17, 2021

nicoleepp commented Feb 17, 2021

rofinn commented Feb 17, 2021 • edited Loading

nicoleepp commented Feb 17, 2021

bencottier commented Feb 19, 2021 • edited Loading

rofinn commented Feb 19, 2021

nicoleepp commented Feb 19, 2021

rofinn commented Feb 19, 2021 • edited Loading

glennmoy commented Feb 22, 2021 • edited Loading

glennmoy commented Feb 22, 2021

bencottier commented Feb 22, 2021

nickrobinson251 commented Feb 26, 2021

Use a consistent convention for `dims` #18

Use a consistent convention for `dims` #18

bencottier commented Feb 12, 2021 •

edited

Loading

bencottier commented Feb 17, 2021 •

edited

Loading

bencottier commented Feb 17, 2021 •

edited

Loading

rofinn commented Feb 17, 2021 •

edited

Loading

rofinn commented Feb 17, 2021 •

edited

Loading

bencottier commented Feb 19, 2021 •

edited

Loading

rofinn commented Feb 19, 2021 •

edited

Loading

glennmoy commented Feb 22, 2021 •

edited

Loading