Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

block method to assemble multi-dimensional arrays #46003

Open
wants to merge 19 commits into
base: master
Choose a base branch
from

Conversation

nlw0
Copy link
Contributor

@nlw0 nlw0 commented Jul 11, 2022

This is an alternative implementation to #43334. This PR was created to investigate a slight variation on the original proposal. We refer to the present function as awfulstack to avoid ambiguities during any discussions. There has been no investigation regarding the performance of awfulstack whatsoever.

awfulstack was implemented based on reduce and cat, iterating over the dimensions of the input array from major to minor, in a DFS kind of recursion. With a simple vector of vectors, awfulstack implements reduce(vcat, data). A 1xN array of vectors will result in reduce(hcat, data).

awfulstack cannot be the simple inverse of eachcol or eachrow, since these functions produce vector-of-vectors with no dimensional information. Although it's easy to adapt the data to recover the original matrix by using eg. reshape or permutedims. The input is strictly treated as a block-matrix, or block-array.

awfulstack can deal with non-uniform sub-array sizes and with concatenating arrays into higher dimensions (2x2 ++ 2x2 -> 2x2x2). "flatmap" functionality is also available.

All credit goes to #43334, I only made this PR because there are a few details I felt important to investigate, and I find it easier to talk over code than to just talk. I hope the community finds this a worthy exploration.

@nlw0 nlw0 force-pushed the nic/awfulstack branch from 3a99888 to 1a24bd4 Compare July 11, 2022 22:01
@kshyatt kshyatt added the arrays [a, r, r, a, y, s] label Jul 11, 2022
@N5N3
Copy link
Member

N5N3 commented Jul 12, 2022

I haven't test all cases, but looks like we can do this by:

awfulstack(a::AbstractArray{<:AbstractArray}) = Base.hvncat(size(a), false, a...)

@nlw0
Copy link
Contributor Author

nlw0 commented Jul 12, 2022

@N5N3 That looks nice. In my opinion, this would ideally be a thin layer over cat functions. If I'm not mistaken, though, splatting can cause performance issues with large inputs. That would be a reason to use the reduce(hcat, x) approach instead.

Another way I have in mind would be having an iterator that goes through all the values in the correct order, and in the end this gets collected and reshaped.

@nlw0
Copy link
Contributor Author

nlw0 commented Jul 12, 2022

#21672 In May 2017 "Despacito" was topping the charts

@nlw0
Copy link
Contributor Author

nlw0 commented Jul 12, 2022

I now realized there may actually be demand for two different things. lolstack takes a list-of-lists (lol) and assembles a tensor out of them. I have no idea what alternatives are there for this, but I wonder if eg ITensors.jl has something like that.

This function would now be capable to work as the inverse of eachcol, and as an implementation of reduce(hcat,x). It does something more useful with a vector-of-vectors than just flattening the whole thing recursively, as awfulstack does, interpreting each level as a dimension instead.

I imagine it would be possible to infer the height from the array type.

@nlw0 nlw0 force-pushed the nic/awfulstack branch from 7471f3d to decc841 Compare July 12, 2022 14:39
@nlw0
Copy link
Contributor Author

nlw0 commented Jul 24, 2022

I have now modified lolstack so that it works like this: we recursively flatten the input data structure, producing an iterator with every element visited in column major order. The recursive flatten should perhaps be a separate feature itself. During the process, though, we also collect size information at each level of the tree. Once we reach the leaves, we can now collect the iterator, and reshape it so we have a tensor. The input could be a multi-dimensional generator, as the example illustrates.

Tuples would now not be traversed, as well as RGB{x}, only abstractarrays or anything with iterate and ndims. objects with a length of one are also not traversed, but that's perhaps incorrect?

@nlw0
Copy link
Contributor Author

nlw0 commented Oct 17, 2022

Not sure there may be any support for this in Base, so I'm making it a little external package. https://github.com/nlw0/ArrayAssemblers.jl

@mcabbott
Copy link
Contributor

awfulstack appears to be the same as numpy.block. That's one possible generalisation of reduce(vcat, vec_of_vecs) to more dimensions. Is it exactly the same or are there differences?

And when exactly does this do something different to the Base.hvncat idea above?

@nlw0
Copy link
Contributor Author

nlw0 commented Oct 22, 2022

The hvncat implementation seems fine, the only problem is the splatting, and I don't think it can be easily adapted to use with reduce.

It looks like numpy.block does the same indeed, although there's this detail about supporting cases like

AAAbb
AAAbb
cDDDD

Julia's array syntax supports this in in the same way, concatenating rows first. This awfulstack here supports it, but concatenates columns first. the hvncat version does not seem to support it.

julia> [ reshape(1:6,2,3) reshape(1:4,2,2);    [1] reshape(1:4,1,4)]
3×5 Matrix{Int64}:
 1  3  5  1  3
 2  4  6  2  4
 1  1  2  3  4

julia> awfulstack(reshape([ reshape(1:6,2,3) ,reshape(1:3,1,3), reshape(1:2,1,2), reshape(1:4,2,2)],2,2))
3×5 Matrix{Int64}:
 1  3  5  1  2
 2  4  6  1  3
 1  2  3  2  4

@mcabbott
Copy link
Contributor

mcabbott commented Oct 23, 2022

OK. The penalty for splatting is certainly not as bad as it was... [Edit -- collapsed old benchmarks]

julia> vm = [rand(10,10) for _ in 1:1000];

julia> m1 = @btime reduce(vcat, $vm);
  min 77.375 μs, mean 129.687 μs (2 allocations, 781.30 KiB)

julia> m2 = @btime vcat($vm...);
  min 118.708 μs, mean 166.544 μs (6 allocations, 804.94 KiB)

julia> m3 = @btime awfulstack($vm);
  min 57.842 ms, mean 61.589 ms (19960 allocations, 382.39 MiB)

julia> m4 = @btime awfulstack_hvn($vm);
  min 427.166 μs, mean 1.331 ms (33 allocations, 868.80 KiB)

julia> m1 == m2 == m3 == m4
true

@nlw0
Copy link
Contributor Author

nlw0 commented Oct 23, 2022

My understanding is that splat might cause issues beyond anything related to performance, such as stack overflows, and should be avoided for large inputs, I don't know if that has changed.

@nlw0 nlw0 changed the title awfulstack is a multi-dimensional reduce(cat, ...) block and lolcat methods to assemble multi-dimensional arrays Oct 23, 2022
@nlw0
Copy link
Contributor Author

nlw0 commented Oct 23, 2022

I've done more tests with the hvncat implementation and it does seem fine, so I just switched it. I also changed the more marketable names.

@LilithHafner LilithHafner marked this pull request as draft October 24, 2022 11:58
@LilithHafner
Copy link
Member

IIUC this is intended as exploration and not to be merged, if that is the case, it's nice to mark it as a draft and/or tag it with do not merge

@nlw0
Copy link
Contributor Author

nlw0 commented Oct 24, 2022

It's intended to be merged.

@LilithHafner
Copy link
Member

Oops! My mistake.

@LilithHafner LilithHafner marked this pull request as ready for review October 25, 2022 11:02
@LilithHafner
Copy link
Member

Why the name lolcat for assembling an array from a vector tree?

@nlw0
Copy link
Contributor Author

nlw0 commented Oct 25, 2022

Why the name lolcat for assembling an array from a vector tree?

"lol" means "list of lists". "cat" follows the existing pattern of cat vcat hvncat etc.

I think this is a good name but I really don't mind changing. I could make a list of alternatives, but I'd rather listen first to ideas from other people if that's not a good name. I'm more concerned just with having the functionality available, that I believe is very important.

@LilithHafner
Copy link
Member

When I saw the name lolcat, the first thing that came to mind was the meme (https://en.wikipedia.org/wiki/Lolcat); the python package lolcat (and I believe the ruby package by the same name) are both for creating such memes. I thought that the name was a joke and so assumed the PR was not intended to be merged.

vovcat or aoacat depending on whether it supports Arrays would be better than lolcat because julia uses the terms Vector and Array, not List. They would also avoid this name confusion.

@nlw0
Copy link
Contributor Author

nlw0 commented Oct 25, 2022

You got me, lolcat would also be a hat-tip to the ancient and declining computer science tradition of whimsy.

"list" is a more generic term, and the function actually takes anything that is linearly iterable, such as generators. I'd rather avoid using the term array or vector since it implies those Julia data-structures, even though it always produces an Array. "List" might be too specific, though, since it's not just strictly linear collections, it actually works on anything "map-able" such as multidimensional arrays, eg

julia> lolcat(repeat([(j,k)],2,2) for j in 1:2, k in 1:2)
2×2×2×2 Array{Tuple{Int64, Int64}, 4}:
[:, :, 1, 1] =
 (1, 1)  (1, 1)
 (1, 1)  (1, 1)

[:, :, 2, 1] =
 (2, 1)  (2, 1)
 (2, 1)  (2, 1)

[:, :, 1, 2] =
 (1, 2)  (1, 2)
 (1, 2)  (1, 2)

[:, :, 2, 2] =
 (2, 2)  (2, 2)
 (2, 2)  (2, 2)

# or also
# lolcat((repeat([(j,k)],2,2) for j in 1:2) for k in 1:2)

It's a recursive form of flattening or concatenating or stacking etc.

@nlw0
Copy link
Contributor Author

nlw0 commented Oct 25, 2022

Maybe one interesting point about these functions is that the first, block, seems to be strongly related to [f(y) for x in a for y in x], while the second lolcat seems to be related to [f(x,y) for x in a, y in b]. They're basically offering these two features available in comprehensions, but in the form of a function that can also be used with do-syntax.

@LilithHafner
Copy link
Member

You got me, lolcat would also be a hat-tip to the ancient and declining computer science tradition of whimsy.

This is tragic, point taken. If lolcat were a viable name in other respects, I would be down to use it. Looking at the code, the term for what this accepts is iterators.

@aplavin
Copy link
Contributor

aplavin commented Oct 25, 2022

strongly related to [f(y) for x in a for y in x]

This is often called flatmap in other languages, and btw in your recent Julia PR Iterators.flatmap #44792.
I've also implemented a general and efficient flatmap for arrays and arbitrary collections in https://gitlab.com/aplavin/FlexiMaps.jl. There, special care was taken to support empty collections, support and propagate arbitrary array types, while staying performant/type-stable/generic. I think these characteristics are also expected for a Base function.

Another common building block is something [fᵢₙ(x, y) for x in X for y in fₒᵤₜ(x)]-like. It's a somewhat generalized flatmap, for example C# has it under the same function (their name for flatmap is SelectMany though). That's also implemented in FlexiMaps.jl as flatmap(fₒᵤₜ, fᵢₙ, X).

A simple example of how the latter is useful:

tbl = [(name=1, measurements=[1, 2, 3]), ...]
flatmap(obj -> obj.measurements, (obj, meas) -> (; obj.name, x=meas), tbl)
# [(name=1, x=1), (name=1, x=2), (name=1, x=3), ...]

@nlw0
Copy link
Contributor Author

nlw0 commented Oct 25, 2022

strongly related to [f(y) for x in a for y in x]

This is often called flatmap in other languages, and btw in your recent Julia PR Iterators.flatmap #44792.

Indeed, the main motivation for this PR is just to have flatmap behavior producing arrays in Base. It's been a hard struggle... But anyways, block here offers the desired flatmap feature, and also a little more than that, so I'm hoping this can make it more appealing.

I've also implemented a general and efficient flatmap for arrays and arbitrary collections in https://gitlab.com/aplavin/FlexiMaps.jl. There, special care was taken to support empty collections, support and propagate arbitrary array types, while staying performant/type-stable/generic. I think these characteristics are also expected for a Base function.

Thanks, I'm sure to take something from it if you don't mind, if necessary. I have no idea what will be the minimal requirements to push this, though. I'd be satisfied with very little.

Another common building block is something [fᵢₙ(x, y) for x in X for y in fₒᵤₜ(x)]-like. It's a somewhat generalized flatmap, for example C# has it under the same function (their name for flatmap is SelectMany though). That's also implemented in FlexiMaps.jl as flatmap(fₒᵤₜ, fᵢₙ, X).

Seems interesting, but I think it might make it impossible to use flatmap(f,a,b) to represent flatten(map(f,a,b)), what I would imagine is desirable to keep a similar interface to other functions.

@nlw0
Copy link
Contributor Author

nlw0 commented Oct 25, 2022

This is tragic, point taken. If lolcat were a viable name in other respects, I would be down to use it. Looking at the code, the term for what this accepts is iterators.

I'm not sure why it's not viable apart from being funny, but I'd point out the term "list of lists" is pretty common in general, especially in sparse arrays, where for whatever reason the abbreviation "LIL" was used in SuiteSparse and catched on...

I think there's even more need to deal with this kind of data-structure, of implicit trees stored as lists of lists, but I'm really not sure what to call it other than that. I've even created this other small package because of it. But in the case here the idea is to have a dedicated function, without the need for a special wrapping class... https://github.com/nlw0/ArrayTrees.jl

And btw I'd not propose treecat or lilcat...

@nlw0
Copy link
Contributor Author

nlw0 commented Oct 25, 2022

Added a couple more examples. I'm still not entirely sure, but lolcat really seems to do exactly the same as stack(x) or stack(x, dims=:) as long as the input is an array-of-arrays, with a single level of recursion, I hope @mcabbott can confirm this. The difference is that it will traverse the input in DFS fashion. And the reason it matters is illustrated in this example:

julia> myxor = lolcat(0:1) do c
           map(0:7) do b
               map(0:3) do a
                   xor(a,b,c)
               end
           end
       end
4×8×2 Array{Int64, 3}:
[:, :, 1] =
 0  1  2  3  4  5  6  7
 1  0  3  2  5  4  7  6
 2  3  0  1  6  7  4  5
 3  2  1  0  7  6  5  4

[:, :, 2] =
 1  0  3  2  5  4  7  6
 0  1  2  3  4  5  6  7
 3  2  1  0  7  6  5  4
 2  3  0  1  6  7  4  5

julia> myxor == [xor(a,b,c) for a in 0:3, b in 0:7, c in 0:1]
true

The nested maps are the do-syntax correlate to for x..., y..., z... syntax. But if you simply do the maps, what you get is this nested data-structure, that still needs to be "flattened" like the comprehension does. And with the difference that with do-syntax we get a lot more flexibility by being able to write multi-line functions and having closures (which may have their issues, that every programmer should know about, and are just part of life.)

Extending on the issue of do-syntax, thinking back on the example from @aplavin it would look like

block(X) do x
    map(fₒᵤₜ(x)) do y
        fᵢₙ(x, y)
    end
end

By the way, I don't think you can do [fᵢₙ(x, y) for x in X, y in fₒᵤₜ(x)] or [fᵢₙ(x, y) for y in fₒᵤₜ(x), x in X], am I right?

Anyways, my point is just to make clear this is about bringing important existing language features that today are available pretty much exclusively through comprehensions, and offer them through functions, which on top of that can be used with do-syntax. I'll accept whatever name I need to in order to have these features available. I'm partial to mxyzptlk(x) and flicts(y) but I'm sure these aren't too catchy. I just don't want to keep them nameless as they are now, available only through comprehensions.

@nlw0
Copy link
Contributor Author

nlw0 commented Oct 25, 2022

Another proposal to bring flatmap functionality to Base is #45985, where we also remove the offensive new verb flatmap and use flatmap(f,x) instead. I could live with that, but I find this PR more interesting. We'll keep that DFS search until we find the least offensive patch to help like-minded-developers...

@nlw0
Copy link
Contributor Author

nlw0 commented Oct 25, 2022

I have now realized lolcat might be achieved by recursive calls to stack, and that's just fine by me. I'll just remove that function. We would still miss an Iterators.stack, though.

@mcabbott
Copy link
Contributor

Yes to recursion, stack is only one deep because everything else in Julia is? Including Iterators.flatten which it matches up to shape.

julia> tri = map(0:1) do c
           map(0:7) do b
               map(0:3) do a
                   xor(a,b,c)
               end
           end
       end;

julia> stack(stack, tri) |> summary
"4×8×2 Array{Int64, 3}"

julia> Iterators.flatten(tri) |> collect |> summary
"16-element Vector{Vector{Int64}}"

julia> Iterators.flatten(stack(x) for x in tri) |> collect |> summary
"64-element Vector{Int64}"

No to anyone being offended by #45985, but pushing two independent things into one PR makes it much less likely that reviewers look closely. Here too, separating (or deleting) the recursive one to focus on this np.block-like one is probably a good idea, since there's no reason the two need be considered together.

@nlw0 nlw0 changed the title block and lolcat methods to assemble multi-dimensional arrays block method to assemble multi-dimensional arrays Oct 26, 2022
@nlw0
Copy link
Contributor Author

nlw0 commented Oct 26, 2022

I have removed the second method. It would still be great to have and iterator based stack. It would also be great if we could have this block(f,x) be based on Iterator.map as well, assuming this can actually help with performance, but I'm not sure what are all the implications.

@aplavin
Copy link
Contributor

aplavin commented Oct 26, 2022

Seems interesting, but I think it might make it impossible to use flatmap(f,a,b) to represent flatten(map(f,a,b)), what I would imagine is desirable to keep a similar interface to other functions.

By "other functions", do you mean map? Don't remember anything else that does this.
Dunno, I've never understood why map(a, b, c) exists at all instead of a more composable map(zip(a, b, c)).

Extending on the issue of do-syntax, thinking back on the example from @aplavin it would look like

block(X) do x
    map(fₒᵤₜ(x)) do y
        fᵢₙ(x, y)
    end
end

Kinda, but it allocates an extra array for each x, compared to flatmap(fₒᵤₜ, fᵢₙ, X), doesn't support empty arrays and so on.

And it's far from intuitive what happens here, takes some time to parse visually. This operation is pretty common in data processing, so I enjoy having a concise form to write it. Given that other widely used languages also have the function with the same semantics, it must be quite general.

By the way, I don't think you can do [fᵢₙ(x, y) for x in X, y in fₒᵤₜ(x)] or [fᵢₙ(x, y) for y in fₒᵤₜ(x), x in X], am I right?

You mean, with flatmap? Sure, you cannot: the result isn't flat, it's a multidimensional array, so out of scope for flatmap.

Anyways, my point is just to make clear this is about bringing important existing language features that today are available pretty much exclusively through comprehensions, and offer them through functions, which on top of that can be used with do-syntax.

flatmap is kinda present in the language already in the form of mapreduce(f, vcat, A) and collect(Iterators.flatmap(f, A)) (after your PR). These are not performant, though, and don't preserve types when possible.

block as implemented in this PR also exists: you literally wrote in the code that block(a) = Base.hvncat(size(a), false, a...). So, hvncat can already be used for this purpose.

Another proposal to bring flatmap functionality to Base is #45985, where we also remove the offensive new verb flatmap and use flatmap(f,x) instead

That PR mixes up two completely unrelated changes (rename Iterators.flatmap and add Base.flatten). I think flatmap is a nice name, used in many languages already to mean the same thing, so why call it "offensive" and merge with flatten? And Base.flatten proposed there is far from efficient.

@mcabbott
Copy link
Contributor

mcabbott commented Oct 26, 2022

I've never understood why map(a, b, c) exists at all

Before about 1.5, it was stricter than the version with zip, which was nice. You couldn't accidentally have mismatched lengths. It still is for some arguments, like Tuples. Not sure this change was even intentional.

I think flatmap is a nice name, used in many languages already

On the original PR, some people objected that this was too trivial a feature to bother giving a whole new name, since it just composes Iteratros.flatten ∘ Iterators.map. Presumably the name has no strong echos of these people; instead it suggests we should name sum(f, x) as summap or something, a dozen such names? But we have this other pattern, although (as far as I know?) no existing Iterators.fun(x) + Iterators.fun(f, x) pairs.

Is flatmap(fₒᵤₜ, fᵢₙ, X) a common usage elsewhere, under this name? If so, then avoiding confusion between the "two-function" version and the "obvious" meaning to flatten(f, x, y) might be one reason to use the name only for that. But not sure.

(All off-topic really, sorry!)

@aplavin
Copy link
Contributor

aplavin commented Oct 26, 2022

instead it suggests we should name sum(f, x) as summap or something, a dozen such names?

Well, sum(f, x) and others are convenient, but have consistency issues. While the majority of reductions in Base are red(f, x) == red(map(f, x)), some have a completely different behavior, and it's not always obvious which is the case.

More composable and DRY alternative to these methods would be something like sum(mapview(f, x)). This is already possible without any overhead using FlexiMaps:

julia> @btime sum(-, 1:10^5)
  713.820 ns (0 allocations: 0 bytes)
-5000050000

julia> @btime sum(mapview(-, 1:10^5))
  698.014 ns (0 allocations: 0 bytes)
-5000050000

Then, no red(f, x) == red(map(f, x)) methods would be needed, other than a few well-established or otherwise very common functions. flatmap is one of these, and is actually implemented as flatten(mapview(f, A)) in FlexiMaps.

Is flatmap(fₒᵤₜ, fᵢₙ, X) a common usage elsewhere, under this name?

I'm aware of a wide range of languages/libraries that have flatmap(f, X) with a single function argument. The two-function argument version is in C# LINQ, but they call it (both 1- and 2-func versions) SelectMany; also, both are present under the mapmany name in Query.jl. Googling suggests that two-func flatmap exists in some reactive programming libraries.

None seem to have flatmap(f, X, Y, ...). Maybe, it's best to keep Julia map(f, X, Y, ...) as a historical artifact, and not add any new function with that behavior? Other than maybe mapview(f, X, Y, ...) as the view version of the same map function. Indeed unfortunate that zip is that permissive...

@nlw0
Copy link
Contributor Author

nlw0 commented Oct 27, 2022

I've never understood why map(a, b, c) exists at all instead of a more composable map(zip(a, b, c)).

I do use the pattern map(f,a,b) now and then, although I'd agree it's not super common. I think this might be influence from LISP.

block as implemented in this PR also exists: you literally wrote in the code that block(a) = Base.hvncat(size(a), false, a...). So, hvncat can already be used for this purpose.

Most existing function calls can be replaced by their underlying implementation, it's called "inlining", and it's also the basis of lambda calculus. On the other hand, a wise man once said programming is about building useful abstractions. I think this is a useful one that I'm bound to use more and more, and I'm trying to share the thought with my fellow Julia programmers. If everybody thinks it's dumb, then that's it.

About the relation between flatten and hcat, you guys are telling me the two functions I proposed are unrelated. You're missing the point, I'm precisely trying to highlight how they are actually related, trying to figure out what is in common between those functions. Looking for some fundamental truth here. When you do hcat(vector_of_vectors), it's the same thing as flattening, and then reshaping. That's pretty much what stack will do with dims=:, as I understand. There are extra features, though, that are pretty much about manipulating the shapes of the arrays before and after the flattening/stacking.

I'm looking for an "ultimate flatmap", something that will be able to deal with multi-dimensional arrays, compose efficiently with iterators, and allow use with do-syntax so we can write multi-line blocks and be able to use variables in the surrounding scope. I'm not really sure what it's going to look like in the end, but I do miss it. stack is great progress, I'm happy for that. I just feel there's more progress to be done.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrays [a, r, r, a, y, s]
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants