Add support for All, Between and Not broadcasting #2171
Comments
My current thinking is that |
No - it is not worth to add I would leave a discussion what to do with this for later. For now we will add I will keep this open, as we might come back to this decision after the 1.0 release. |
Thanks for this. I understand that However I would really like to work on this for the post-1.0 period. It's a nice feature that would match Stata's convenient syntax. |
I would add this to DataFramesMeta.jl as we can intercept things like |
I'm afraid I'm a bit late to the Quoting from the Julia docs:
With that in mind, here's my proposal:
Though the parsing of
julia> : => sum
ERROR: syntax: space not allowed after ":" used for quoting
Stacktrace:
[1] top-level scope at REPL[193]:0
julia> (:) => sum
Colon() => sum
julia> [:] => sum
[Colon()] => sum
It seems like the Julia parser could be modified to parse
Advantages of this approach:
Perhaps To clarify, under this proposal,
Disadvantages: There is one use case I can think of that works now on master but would be more difficult under the new proposal. This currently works:
julia> df = DataFrame(a=1, b=2, c=3, d=4);
julia> select(df, Between(:b, :d) => ByRow(max))
1×1 DataFrame
│ Row │ b_c_max │
│ │ Int64 │
├─────┼─────────┤
│ 1 │ 4 │
I'm not sure how that would be done under the new proposal. On the other hand, it doesn't seem like a very common use case to me, whereas
EDIT: select(df, AsTable(Between(:b, :d)) => ByRow(maximum)) |
Thank you for the comment. Essentially you propose to introduce "automatic broadcasting" of vectors so if someone passes:
it gets parsed as currently would
This is doable without a problem and then special casing Also then you do not need tuple syntax to pass multiple arguments to a function, you would just write (using your problematic example):
or using a simpler example:
In summary:
The issue is that:
@nalimilan - what do you think? This decision is blocking the 0.21 release. |
Hmm, that's an interesting example. Intuitively in my head I would have "automatically broadcasted" that expression twice. In other words,
julia> [[:a, :b]] .=> sum
1-element Array{Pair{Array{Symbol,1},typeof(sum)},1}:
[:a, :b] => sum
In other words, broadcasting only does one level of un-nesting. So, your current approach definitely has the advantage of being more explicit. So, I guess I change my vote. I'm in favor of staying with the current syntax (on the master branch). Except I vote for renaming I'll go to InvertedIndices.jl and voice my support for making Thanks for all your hard work on DataFrames! |
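The one-level un-nesting mentioned above can be checked in plain Julia, without DataFrames. The following is only a minimal sketch; the variable names `cols`, `per_column` and `grouped` are made up for illustration and do not come from the thread.

```julia
# Base Julia only; DataFrames is not needed to see how `.=>` behaves.
cols = [:a, :b]

# Broadcasting the pair operator over a vector creates one pair per element.
per_column = cols .=> sum

# Wrapping the vector in an outer vector protects it: broadcasting strips
# only the outer level, so the inner vector stays together in a single pair.
grouped = [cols] .=> sum

@show per_column
@show grouped
```

So `per_column` holds two pairs (one per column name), while `grouped` holds a single pair whose left-hand side is the whole vector `[:a, :b]`.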
FYI, I also opened this issue on InvertedIndices.jl: https://github.com/mbauman/InvertedIndices.jl/issues/18 |
Yes - we have discussed this issue with @nalimilan extensively today. The option you have proposed is tempting, though. If we can come up with a good proposal, we will post it here. Just to sum up the current problem: in the setting |
Could you just disallow nested arrays in |
There are cases, e.g.:
where it is not entirely clear to me how they should be handled. Of course we could decide they should throw an error and allow only symbols, strings or integers in vectors. |
My intuition is that this should be
But you are right this makes for some odd rules, and no functions actually take that kind of argument. |
Hmm, yeah I was going to say that [Between(:x1, :x3), [:a, :b]] => fun should throw an error and select(df, [Between(:x1, :x3), (:a, :b)] => fun) would lower to select(df, Between(:x1, :x3) => fun, (:a, :b) => fun) where [Between(:x1, :x3), (:a, :b)] => fun conceptually expands to [[:x1, :x2, :x3], (:a, :b)] => fun which would still be invalid... |
We could resolve this problem by simply not broadcasting a second level, right? I'm not fond of having tuples and vectors have different meanings in this context, it adds a cognitive burden and makes code harder to read. Better if we just stick to vectors. |
We probably would not use tuples but rather In the syntax:
Then we allow vectors of |
EDIT: This comment does not reflect your most recent post. Yeah, thinking about it more, the current approach seems to be the best way to distinguish between these two use cases:
Between(:x1, :x10) => vararg_fun
Between(:x1, :x10) .=> univariate_fun
(where we wrap the right side in In both cases,
julia> df = DataFrame(a = 1:3, b = 4:6, c = 7:9, d = 10:12);
julia> bar(args...) = reduce((x, y) -> x .* y, args);
julia> select(df, Between(:b, :d) => bar)
3×1 DataFrame
│ Row │ b_c_d_bar │
│ │ Int64 │
├─────┼───────────┤
│ 1 │ 280 │
│ 2 │ 440 │
│ 3 │ 648 │
But I can't think of any way to express that with the automatic broadcasting approach. Even select(df, AsTable(Between(:b, :d)) => bar) wouldn't work, since Oh right, I forgot, I guess Sorry to create all this noise just to come back to endorsing the status quo! :) |
This sounds pretty good! Let me think about it for a bit... |
This is an alternative to broadcasting using |
Another possible downside to the new design is that the EDIT: |
With The downside of |
Hmm, I thought being able to write
Do you mean in the
julia> df = DataFrame(a=[1, 1, 2, 2], b=1:4, c=5:8)
4×3 DataFrame
│ Row │ a │ b │ c │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 1 │ 1 │ 5 │
│ 2 │ 1 │ 2 │ 6 │
│ 3 │ 2 │ 3 │ 7 │
│ 4 │ 2 │ 4 │ 8 │
julia> by(df, :a, Between(:b, :c) => (r -> sum(r.b) + sum(r.c)))
ERROR: MethodError: no method matching length(::Between{Symbol,Symbol}) |
You should try master to work on this. Note that the user can always write
Instead of something like
I still think we should splat everything inside the vector via
should be equivalent to
and
should be equivalent to
|
on 0.20.2
But if we allowed this then we would run into problems with consistency you have noted. The design I proposed with Again - broadcasting with |
Yeah, I've been playing around with master. I thought @bkamins comment here
meant that The more we talk about this, the more confused I get. I think the original approach advocated in this issue that uses the current master plus explicit broadcasting of the pair operator is probably the easiest to understand. 😃 |
I knew "multi column selector" worked, but I spent my last 3 months on master 😄, so I have forgotten how much we have added in the meantime (in particular that 0.20.2 in
This is what I meant - we can give different rules, but what was proposed originally is 100% consistent with intuition people should have from Base (although in some cases your proposal would be easier to write). |
This will be added after DataAPI.jl 1.8 and InvertedIndices.jl 1.1 releases. |
For select and combine we should in the future add an option for All, Between and Not broadcasting (when DataAPI.jl and InvertedIndices.jl are updated).
Unfortunately it is not possible to broadcast regex and colon as they are in Base.
@nalimilan An alternative to Not(:a) .=> fun could be to add yet another wrapper, e.g. Spread, that would read Spread(Not(:a)) .=> fun, and we would not have to change anything anywhere. What do you think? The benefit is that then Spread(:) .=> fun and Spread(r"x") .=> fun could also work.
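To make the Spread idea more concrete, here is a rough sketch in plain Julia of how such a wrapper might behave. Everything in it is an assumption made for illustration: `Spread` exists only as a name proposed in this issue, and the helpers `matching` and `expand`, as well as the `colnames` vector, are hypothetical stand-ins for the column lookup that DataFrames.jl would perform against an actual data frame.

```julia
# Hypothetical wrapper, as proposed in this issue; not an existing API.
struct Spread{T}
    selector::T
end

# Treat the wrapper as a scalar in broadcasting, so that
# `Spread(x) .=> f` produces a single deferred pair.
Base.Broadcast.broadcastable(s::Spread) = Ref(s)

# Hypothetical helpers: resolve a selector against known column names.
matching(sel::Regex, colnames) = [n for n in colnames if occursin(sel, String(n))]
matching(::Colon, colnames) = collect(colnames)

# Expand a deferred pair into one pair per selected column.
expand(p::Pair{<:Spread}, colnames) =
    [n => p.second for n in matching(p.first.selector, colnames)]

colnames = [:x1, :x2, :y]
p = Spread(r"x") .=> sum      # a single Pair; columns are not known yet
expanded = expand(p, colnames)
@show expanded
```

The point of the sketch is that the regex and colon themselves never need a broadcasting method: only the wrapper type does, and the expansion into per-column pairs is deferred until the column names are available.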