Allow broadcasting All and Between #10
LGTM!
Just a question what do we do with broadcasted |
Very good point. Maybe we should discuss that with @mbauman before merging this PR. One could always write EDIT: do note that for regexes, |
@mbauman Any chance you could quickly comment on this? Do you think it would be acceptable to support with |
@nalimilan - I think we can merge this and do |
I tried to test this with the updated DataFrames behavior, but I think you need a rebase before I can do that. (The rebase works fine, I just tested.)
Do not test it yet. After this PR is merged, DataFrames.jl must be updated to take it into account (this PR only makes sure that the information about what the user requested is preserved "past" broadcasting). See:
and then DataFrames.jl must catch An even harder case will be:
where you have to be aware that broadcasting took place here and again handle it in a special way (in general we will have to restrict where such broadcasting is allowed, to avoid ambiguity).
Thank you for making sure it passes. I would not merge it yet, as per the discussion in JuliaData/DataFrames.jl#2171; once we decide what to do there, I would merge this PR.
Bumping this PR. Do we know what's needed for this to move forward?
Two things are blocking:
|
I'm not sure we'll be able to do the same for |
@nalimilan I would propose:
as it will be simpler to handle later in DataFrames.jl using single code. Now, how would processing in DataFrames.jl be performed. We take some
The idea is that we want to simulate that:
should produce the same as:
and And - just for a reference - we cannot handle The PRs implementing these changes should be relatively simple to implement. Any comments on this? |
I'm not sure adding DataAPI as a dependency to InvertedIndices is a good idea. People who use it without other JuliaData packages may complain. But I'm OK with the rest of the plan.
OK. But then I would add InvertedIndices.jl as a dependency of DataAPI.jl 😄. And define the |
Following the discussion on Slack we will have a separate wrapper here and in InvertedIndices.jl. Also, for a reference, the cases we need to handle in DataFrames.jl are along the lines:
(essentially we need to handle broadcasting of Also I only assume we will handle this if broadcasting is at top-level of the expression (i.e. we do not try unnesting as it does not seem useful or sensible) |
3b42521 to f518a6a
I've updated the PR. I added a type parameter to |
Codecov Report
@@ Coverage Diff @@
## main #10 +/- ##
==========================================
+ Coverage 91.30% 92.59% +1.28%
==========================================
Files 1 1
Lines 23 27 +4
==========================================
+ Hits 21 25 +4
Misses 2 2
julia> DataAPI.Between(:a, :e) .=> sin
DataAPI.BroadcastedSelector{DataAPI.Between{Symbol, Symbol}}(DataAPI.Between{Symbol, Symbol}(:a, :e)) => sin

julia> DataAPI.Cols(r"x") .=> [sum, prod]
Suggested change:
- julia> DataAPI.Cols(r"x") .=> [sum, prod]
+ julia> DataAPI.Cols(r"x") .=> [sum prod]
Using [sum, prod] is probably not what the user will want (the output also needs to be changed).
Ah, I have just checked that this does not look nice in the output, so maybe leave it as is.
Contrary to e.g. ["a", "b"] .=> [sin, prod], DataAPI.Cols(r"x") .=> [sum prod] does apply sum and prod to all matching columns, right?
Yes, the difference is that ["a", "b"] .=> [sin, prod] broadcasts along the same dimension (and we get a vector), while ["a", "b"] .=> [sin prod] broadcasts along different dimensions (and we get a matrix). DataAPI.Cols(r"x") will be treated as expanding to a vector.
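The vector-versus-matrix difference described in this comment can be checked in plain Julia, without DataAPI.jl or DataFrames.jl. This is only a minimal sketch; the variable names v and m are illustrative.

```julia
# Vector .=> Vector: both arguments lie along the first dimension,
# so pairs are formed element by element and the result is a Vector.
v = ["a", "b"] .=> [sin, prod]

# Vector .=> 1x2 row matrix: the arguments lie along different dimensions,
# so every name is paired with every function and the result is a Matrix.
m = ["a", "b"] .=> [sin prod]

println(size(v))
println(size(m))
```

In the matrix case each row corresponds to a column name and each column to a function, which is the shape one would want if a selector is later expanded to a vector of names.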
I left one small comment.
Regarding specialization - as you prefer. However, I think that type instability here would not cause any issues in any package in practice.
This would allow writing things like by(df, All() .=> sum) to compute the sum of each column (replacing aggregate), as a shorthand for by(df, names(df) .=> sum). For Between and other selectors we could add in the future, that pattern would be even more convenient. Basically, it's a kind of deferred broadcasting, given that the actual columns aren't known at the time broadcast is called.

I wonder whether this is the best implementation. It's hard to decide since I can't think of any other operator than => that we might want to broadcast. The risk is that it will work for some cases that we don't necessarily want to allow, e.g. identity.(DataAPI.All()) or DataAPI.All() .== 1:3. It fails for most operations, though, because All and Between don't define any methods.

Ref. JuliaData/DataFrames.jl#1256 (comment).
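One way to picture the deferred broadcasting idea is the sketch below. It is not the code in this PR: MyBetween and Deferred are hypothetical stand-ins defined here so the example runs without DataAPI.jl, and the only real API used is Base.Broadcast.broadcastable. The wrapper records that the selector took part in broadcasting, so a consumer could later expand it once the actual columns are known.

```julia
# Hypothetical stand-in for a column selector such as Between; not DataAPI.jl code.
struct MyBetween
    first::Symbol
    last::Symbol
end

# Hypothetical marker recording that the selector took part in broadcasting.
struct Deferred{T}
    sel::T
end

# Treat the selector as a scalar in broadcasting, but wrap it first so a
# consumer can later tell `sel .=> f` apart from `sel => f`.
Base.Broadcast.broadcastable(x::MyBetween) = Ref(Deferred(x))

plain  = MyBetween(:a, :e) => sin           # ordinary pair, no wrapper
single = MyBetween(:a, :e) .=> sin          # zero-dimensional broadcast: one pair
multi  = MyBetween(:a, :e) .=> [sum, prod]  # broadcast against a vector: two pairs

# The same mechanism also accepts calls one might not want to allow.
leak = identity.(MyBetween(:a, :e))
```

As the last line suggests, any function that accepts an arbitrary argument will go through once the selector is broadcastable, which is the risk raised in the description.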