Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for All, Between and Not broadcating #2171

Closed
bkamins opened this issue Apr 6, 2020 · 26 comments · Fixed by #2918
Closed

Add support for All, Between and Not broadcating #2171

bkamins opened this issue Apr 6, 2020 · 26 comments · Fixed by #2918
Labels
breaking The proposed change is breaking. decision grouping
Milestone

Comments

@bkamins
Copy link
Member

bkamins commented Apr 6, 2020

For select and combine we should in the future add option for All, Between and Not broadcating (when DataAPI.jl and InvertedIndices.jl is updated).

Unfortunately it is not possible to broadcast regex and colon as they are in Base.

@nalimilan An alternative to Not(:a) .=> fun could be to add yet another wrapper, e.g. Spread that would read Spread(Not(:a)) .=> fun and we would not have to change anything anywhere. What do you think? The benefit is that then Spread(:) .=> fun and Spread(r"x") .=> fun could also work.

@bkamins bkamins added grouping non-breaking The proposed change is not breaking labels Apr 6, 2020
@bkamins bkamins added this to the 1.x milestone Apr 6, 2020
@nalimilan
Copy link
Member

My current thinking is that Spread would be equivalent to All if we merge JuliaData/DataAPI.jl#10. So the question is whether using All for this is OK, or whether it's worth introducing yet another Spread object (or any other name).

@bkamins
Copy link
Member Author

bkamins commented Apr 6, 2020

No - it is not worth to add Spread as discussed.

I would leave a discussion what to do with this for later. For now we will add names(df, selector) option that will give the same for now.

I keep this open I we might come back to this decision after 1.0 release.

@pdeffebach
Copy link
Contributor

Thanks for this. I understand that names(df, selector) is useful.

However I would really like to work on this for the post-1.0 period. It's a nice feature that would match Stata's convenient syntax.

@bkamins
Copy link
Member Author

bkamins commented Apr 6, 2020

I would add this to DataFramesMeta.jl as we can intercept things like Between(:x1, :x3) .=> fun at macro processing step.

@CameronBieganek
Copy link

CameronBieganek commented Apr 22, 2020

I'm afraid I'm a bit late to the select discussion. I have a proposal that avoids the need for broadcasting altogether. I apologize for proposing a syntax change after you've already merged the select PR.

Quoting from the Julia docs:

Tuples are an abstraction of the arguments of a function

With that in mind, here's my proposal:

  • To apply a multivariate function to multiple columns, use the following syntax:
    • select(df, (:a, :b) => dot)
  • To apply a univariate function to multiple columns, use the following syntax:
    • select(df, [:a, :b] => sum)

Though the parsing of : => sum presents one small difficulty:

julia> : => sum
ERROR: syntax: space not allowed after ":" used for quoting
Stacktrace:
 [1] top-level scope at REPL[193]:0

julia> (:) => sum
Colon() => sum

julia> [:] => sum
[Colon()] => sum

It seems like the Julia parser could be modified to parse : => var as Colon() => var. I don't see any ambiguity there. But in the meantime, (:) => var and [:] => var are parsed correctly and would work.

Advantages of this approach:

  • There would no longer be a need to broadcast the pair operator over column selectors.
  • Not(:y) => fun works, without needing to wrap with Cols and without needing to make Not broadcastable.
  • [r"x", :] => fun works, without needing to wrap with Cols.
  • [startswith("hello"), Between(:x1, :x3)] => fun works, without needing to wrap with Cols and without needing to make Between broadcastable.
  • (:a, :b) => dot looks very similar to the corresponding closure syntax.
  • Cols / All wouldn't actually be needed at all, since the arguments to select are effectively wrapped in a Cols.

Perhaps Cols would still be needed for getindex, but maybe that could be solved by allowing getindex to accept multiple column selector arguments, like select.

To clarify, under this proposal, Not(...) and Between(...) would expand to arrays of columns, [...].

Disadvantages:

There is one use case I can think of that works now on master but would be more difficult under the new proposal. This currently works:

julia> df = DataFrame(a=1, b=2, c=3, d=4);

julia> select(df, Between(:b, :d) => ByRow(max))
1×1 DataFrame
│ Row │ b_c_max │
│     │ Int64   │
├─────┼─────────┤
│ 14

I'm not sure how that would be done under the new proposal. On the other hand, it doesn't seem like a very common use case to me, whereas [:a, :b] => sum and [:] => sum are quite common.

EDIT:
I think the above could be achieved in the new proposal with

select(df, AsTable(Between(:b, :d)) => ByRow(maximum))

@bkamins
Copy link
Member Author

bkamins commented Apr 23, 2020

Thank you for the comment. Essentially you propose to introduce "automatic broadcasting" of vectors so if someone passes:

vector1 => sum => vector2

it gets parsed as currently would

vector1 .=> sum .=> vector2

This is doable without a problem and then special casing Between, Not, All, :, Cols is not a problem.

Also then you do not need tuple syntax to pass multiple arguments to a function, you would just write (using your problematic example):

select(df, [Between(:b, :d)] => ByRow(max))

or using a simpler example:

select(df, [[:a, :b]] => cor)

In summary:

  1. what you propose is possible to be done
  2. I think both options (adding broadcasting to Between, All etc.) and what you propose are comparable (both have some cases in which they are more convinient)

The issue is that:

  1. adding broadcasting to Between etc. is not a pressing issue (we can do it later as it would be non-breaking) and currently you can write names(df, cols) that already is broadcastable;
  2. your proposal is breaking so if we went for it we should decide on doing it now

@nalimilan - what do you think? As this decision is blocking form 0.21 release.

@bkamins bkamins added breaking The proposed change is breaking. and removed non-breaking The proposed change is not breaking labels Apr 23, 2020
@bkamins bkamins modified the milestones: 1.x, 1.0 Apr 23, 2020
@CameronBieganek
Copy link

select(df, [[:a, :b]] => cor)

Hmm, that's an interesting example. Intuitively in my head I would have "automatically broadcasted" that expression twice. In other words, [[:a, :b]] => cor would turn into [:a, :b] => cor which would turn into [:a => cor, :b => cor]. But of course that's not how broadcasting works, e.g.

julia> [[:a, :b]] .=> sum
1-element Array{Pair{Array{Symbol,1},typeof(sum)},1}:
 [:a, :b] => sum

In other words, broadcasting only does one level of un-nesting. So, your current approach definitely has the advantage of being more explicit. [:a, :b] .=> sum is literally equal to [:a => sum, :b => sum]. It's just too bad that we can't broadcast Regex and Colon.

So, I guess I change my vote. I'm in favor of staying with the current syntax (on the master branch). Except I vote for renaming All to Cols. :)

I'll go to InvertedIndices.jl and voice my support for making Not broadcastable.

Thanks for all your hard work on DataFrames!

@CameronBieganek
Copy link

FYI, I also opened this issue on InvertedIndices.jl:

https://github.com/mbauman/InvertedIndices.jl/issues/18

@bkamins
Copy link
Member Author

bkamins commented Apr 23, 2020

Yes - we have discussed this issue with @nalimilan today extensively. The option you have proposed is tempting though. If we can come up with some good proposal then we will post it here.

Just to sum up the current problem. in the setting source => fun => target there is a problem that if in source we get a vector it should have a dual meaning - and be either "auto-broadcasted" (if it is on top level) or left as-is (if it is nested). For this we to be clean design-wise would need two different types, but this is essentially what broadcasting syntax is exactly designed to do (as it does not do recursive application as you have noted).

@CameronBieganek
Copy link

Could you just disallow nested arrays in source and use the (:x, :y) => fun syntax for multivariate functions? I'm not sure how easy it would be to disallow nested arrays for source, or if it would introduce other edge cases...

@bkamins
Copy link
Member Author

bkamins commented Apr 23, 2020

There are cases like e.g.:

[Between(:x1, :x3), Not(:y), :p, [:a, :b]] => fun

that are not entirely clear how they should be handled for me. Of course we could decide they should throw an error and allow only symbols, strings or integers in vectors.

@pdeffebach
Copy link
Contributor

[Between(:x1, :x3), Not(:y), :p, [:a, :b]] => fun

My intuition is that this should be

fun(Tuple(Between(:x1, :x3)), Tuple(Not(:y)), :p, ([:a, :b]...,))

But you are right this makes for some odd rules, and no functions actually take that kind of argument.

@CameronBieganek
Copy link

Hmm, yeah I was going to say that

[Between(:x1, :x3), [:a, :b]] => fun

should throw an error and

select(df, [Between(:x1, :x3), (:a, :b)] => fun)

would lower to

select(df, Between(:x1, :x3) => fun, (:a, :b) => fun)

where (:a, :b) => fun is the multivariate function application syntax. But Between(:x1, :x3) implicitly expands to an array, so

[Between(:x1, :x3), (:a, :b)] => fun

conceptually expands to

[[:x1, :x2, :x3], (:a, :b)] => fun

which would still be invalid...

@pdeffebach
Copy link
Contributor

We could resolve this problem by simply not broadcasting a second level, right?

I'm not fond of having tuples and vectors have different meanings in this context, it adds a cognitive burden and makes code harder to read. Better if we just stick to vectors.

@bkamins
Copy link
Member Author

bkamins commented Apr 23, 2020

We probably would not use tuples but rather AsArgs wrapper. And the whole desing would be.

In the syntax:

source => fun => dest

source can be:

  • AsTable passes a NamedTuple to fun, dest must be a Symbol or string
  • AsArgs passes positional arguments to fun, dest must be a Symbol or string
  • ColumnIndex passes a column to fun, dest must be a Symbol` or string
  • everything else gets expanded to [s => fun => d for (s,d) in zip(index(df)[source], dest] with the condition that dest is a Vector of Symbols or strings

Then we allow vectors of Pairs to be passed which get splatted recursively.

@CameronBieganek
Copy link

EDIT: This comment does not reflect your most recent post.

Yeah, thinking about it more, the current approach seems to be the best way to distinguish between these two use cases:

Between(:x1, :x10) => vararg_fun
Between(:x1, :x10) .=> univariate_fun

(where we wrap the right side in ByRow if necessary).

In both cases, Between(:x1, :x10) implicitly expands to an array [:x1, ..., :x10]. With the automatic broadcasting approach, there's no easy way to express the vararg_fun case. For example, this works on master:

julia> df = DataFrame(a = 1:3, b = 4:6, c = 7:9, d = 10:12);

julia> bar(args...) = reduce((x, y) -> x .* y, args);

julia> select(df, Between(:b, :d) => bar)
3×1 DataFrame
│ Row │ b_c_d_bar │
│     │ Int64     │
├─────┼───────────┤
│ 1280       │
│ 2440       │
│ 3648

But I can't think of any way to express that with the automatic broadcasting approach. Even

select(df, AsTable(Between(:b, :d)) => bar)

wouldn't work, since bar takes a variable number of arguments rather than a single named tuple.

Oh right, I forgot, I guess [Between(:x1, :x10)] => vararg_fun would work if we only auto-broadcast one level, and then interpret the inside array as multiple arguments to one function. But that seems rather confusing and somewhat inconsistent to me.

Sorry to create all this noise just to come back to endorsing the status quo! :)

@CameronBieganek
Copy link

source can be:

  • AsTable passes a NamedTuple to fun, dest must be a Symbol or string
  • AsArgs passes positional arguments to fun, dest must be a Symbol or string
  • ColumnIndex passes a column to fun, dest must be a Symbol` or string
  • everything else gets expanded to [s => fun => d for (s,d) in zip(index(df)[source], dest] with the condition that dest is a Vector of Symbols or strings

This sounds pretty good! Let me think about it for a bit...

@bkamins
Copy link
Member Author

bkamins commented Apr 23, 2020

This sounds pretty good! Let me think about it for a bit...

This is an alternative to broadcasting using .=> (both give roughly the same but with a different default behavior). The downside is that the user will have to learn three idioms 1) normal selector, 2) AsTable and 3) AsArgs. With broadcasting the user probably already knows broadcasting.

@CameronBieganek
Copy link

CameronBieganek commented Apr 23, 2020

Another possible downside to the new design is that the [r"x", :] => sum syntax doesn't signal the behavior Cols(A, B, C) ---> t = (A, B\A, C\(A ∪ B)). On the other hand, select(df, A, B, C) already exhibits that behavior without explicitly signalling it with a Cols wrapper, so maybe it's ok.

EDIT:
It's not totally obvious that Cols signals that behavior either. I guess it's mainly a matter of documentation.

@bkamins
Copy link
Member Author

bkamins commented Apr 23, 2020

With AsArgs proposal (let us call it like this) [r"x", :] => sum would be an ArgumentError. You would have to write Cols(r"x", :) => sum.

The downside of AsArgs proposal is that in the past (i.e. on currently released DataFrames.jl version) Between(:x1, :x3) => fun passes a NamedTuple to fun (so it is a "single function" operation). But we are going to be breaking here anyway so this is not a huge problem (just AsArgs proposal is in a sense more breaking than "broadcasting" design we currently have).

@CameronBieganek
Copy link

Hmm, I thought being able to write [r"x", :] => sum was one of the main reasons to do auto-broadcasting.

(i.e. on currently released DataFrames.jl version) Between(:x1, :x3) => fun passes a NamedTuple to fun (so it is a "single function" operation).

Do you mean in the by function on v0.20.2? To be honest, I'm struggling to figure out how to use that functionality:

julia> df = DataFrame(a=[1, 1, 2, 2], b=1:4, c=5:8)
4×3 DataFrame
│ Row │ a     │ b     │ c     │
│     │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1115     │
│ 2126     │
│ 3237     │
│ 4248     │

julia> by(df, :a, Between(:b, :c) => (r -> sum(r.b) + sum(r.c)))
ERROR: MethodError: no method matching length(::Between{Symbol,Symbol})

@pdeffebach
Copy link
Contributor

pdeffebach commented Apr 23, 2020

Do you mean in the by function on v0.20.2? To be honest, I'm struggling to figure out how to use that functionality:

You should try master to work on this.

Note that the user can always write

transform(df, Between(:x1, :x3) => fun, r"x" => fun)

Instead of something like

transform(df, [Between(:x1, :x3), r"x"] .=> fun)

I still think we should splat everything inside the vector via Cols. It's not clear to me that this is a problem people have actually encountered with our updated syntax.

transform(df, [Between(:x1, :x3), r"x"] => fun)

should be equivelent to

transform(df, Cols(Between(:x1, :x3), r"x") => fun)

and

transform(df, [Between(:x1, :x3), r"x"] .=> fun)

should be equivelent to

transform(df, Cols(Between(:x1, :x3), r"x") .=> fun)

@bkamins
Copy link
Member Author

bkamins commented Apr 23, 2020

on 0.20.2 Between is not supported, but it is a minor thing (simply in the past API was not made 100% consistent across functions yet). You should think of it as:

julia> df = DataFrame(a=[1, 1, 2, 2], b=1:4, c=5:8)
4×3 DataFrame
│ Row │ a     │ b     │ c     │
│     │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1   │ 1     │ 1     │ 5     │
│ 2   │ 1     │ 2     │ 6     │
│ 3   │ 2     │ 3     │ 7     │
│ 4   │ 2     │ 4     │ 8     │

julia> by(df, :a, [:b, :c] => (r -> sum(r.b) + sum(r.c)))
2×2 DataFrame
│ Row │ a     │ x1    │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 14    │
│ 2   │ 2     │ 22    │

I thought being able to write [r"x", :] => sum was one of the main reasons to do auto-broadcasting.

But if we allowed this then we would run into problems with consistency you have noted. The design I proposed with AsArgs is more restrictive (by not allowing overly complex statements) but is guaranteed to be "easy enough" to reason about.

Again - broadcasting with .=> is OK in my opinion, but for it to fully work we just need to define the broadcasting rules for several custom types (which is doable - this was the initial idea of this issue 😄).

@CameronBieganek
Copy link

You should try master to work on this.

Yeah, I've been playing around with master. I thought @bkamins comment here

in the past (i.e. on currently released DataFrames.jl version) Between(:x1, :x3) => fun passes a NamedTuple to fun (so it is a "single function" operation).

meant that Between() passes a named tuple on v0.20.2.

The more we talk about this, the more confused I get. I think the original approach advocated in this issue that uses the current master plus explicit broadcasting of the pair operator is probably the easiest to understand. 😃

@bkamins
Copy link
Member Author

bkamins commented Apr 23, 2020

meant that Between() passes a named tuple on v0.20.2.

I knew "multi column selector" worked, but I spent my last 3 months on master 😄, so I have forgotten how much we have added in the meantime (in particular that 0.20.2 in combine does not allow any column selector).

is probably the easiest to understand. 😃

This is what I meant - we can give different rules, but what was proposed originally is 100% consistent with intuition people should have from Base (although in some cases your proposal would be easier to write).

@bkamins bkamins modified the milestones: 1.0, 1.x Jan 8, 2021
@bkamins bkamins modified the milestones: 1.x, 1.3 Sep 4, 2021
@bkamins
Copy link
Member Author

bkamins commented Sep 4, 2021

This will be added after DataAPI.jl 1.8 and InvertedIndices.jl 1.1 releases.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking The proposed change is breaking. decision grouping
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants