Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow broadcasting All and Between #10

Merged
merged 2 commits into from
Sep 3, 2021
Merged

Allow broadcasting All and Between #10

merged 2 commits into from
Sep 3, 2021

Conversation

nalimilan
Copy link
Member

@nalimilan nalimilan commented Sep 7, 2019

This would allow writing things like by(df, All() .=> sum) to compute the sum of each column (replacing aggregate), as a shorthand for by(df, names(df) .=> sum). For Between and other selectors we could add in the future, that pattern would be even more convenient. Basically, it's a kind of deferred broadcasting, given that the actual columns aren't known at the time broadcast is called.

I wonder whether this is the best implementation. It's hard to decide since I can't think of any other operator than => that we might want to broadcast. The risk is that it will work for some cases that we don't necessarily want to allow, e.g. identity.(DataAPI.All()) or DataAPI.All() .== 1:3. It fails for most operations, though, because All and Between don't define any methods.

Ref. JuliaData/DataFrames.jl#1256 (comment).

src/DataAPI.jl Outdated Show resolved Hide resolved
Copy link
Member

@quinnj quinnj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@bkamins
Copy link
Member

bkamins commented Nov 15, 2019

Just a question what do we do with broadcasted Not. Are you going to make a PR to InvertedIndices.jl?

@nalimilan
Copy link
Member Author

nalimilan commented Nov 15, 2019

Very good point. Maybe we should discuss that with @mbauman before merging this PR. One could always write All(Not(...)) .=> f but that's a bit weird.

EDIT: do note that for regexes, All(r"pattern") .=> f will be needed anyway as we're unlikely to define this on Regex objects.

@nalimilan
Copy link
Member Author

@mbauman Any chance you could quickly comment on this? Do you think it would be acceptable to support with Not?

@bkamins
Copy link
Member

bkamins commented Feb 29, 2020

@nalimilan - I think we can merge this and do Not later.

@pdeffebach
Copy link

I tried to test this with the updated DataFrames behavior, but I think you need a rebase before I can do that. (rebase works fine, I just tested).

@bkamins
Copy link
Member

bkamins commented Apr 4, 2020

Do not test it yet. After this PR is merged an update to DataFrames.jl is required to take it into account (this PR just makes sure that we preserve the information what was requested by the user "past" broadcasting).

See:

julia> Between(:x1, :x3) .=> identity
BroadcastedBetween{Symbol,Symbol}(:x1, :x3) => identity

and then DataFrames.jl must catch BroadcastedBetween as a special case.

An even harder case will be:

julia> Between(:x1, :x3) .=> identity .=>[:y1, :y2, :y3]
3-element Array{Pair{BroadcastedBetween{Symbol,Symbol},Pair{typeof(identity),Symbol}},1}:
 BroadcastedBetween{Symbol,Symbol}(:x1, :x3) => (identity => :y1)
 BroadcastedBetween{Symbol,Symbol}(:x1, :x3) => (identity => :y2)
 BroadcastedBetween{Symbol,Symbol}(:x1, :x3) => (identity => :y3)

where you have to be aware that here broadcasting took place and again handle it in a special way (in general we will have to put some restrictions where such broadcasting is allowed to avoid ambiguity).

@nalimilan nalimilan marked this pull request as ready for review April 6, 2020 10:03
@nalimilan nalimilan closed this Apr 6, 2020
@nalimilan nalimilan reopened this Apr 6, 2020
@nalimilan nalimilan changed the title RFC: Allow broadcasting All and Between Allow broadcasting All and Between Apr 6, 2020
@bkamins
Copy link
Member

bkamins commented Apr 6, 2020

Thank you for making sure it passes. I would not merge it as per discussion in JuliaData/DataFrames.jl#2171 (when we decide what to do there I would merge this PR then).

@pdeffebach
Copy link

Bumping this PR. Do we know what's needed for this to move forward?

@bkamins
Copy link
Member

bkamins commented Sep 8, 2020

Two things are blocking:

  • lack of Not support for similar change (we could just do picary here or DataFrames.jl as we did not get a response from @mbauman on it)
  • we have to think about the design of the broadcasting here as maybe it would be better to have a special broadcast style also (to have a more fine-grained control of the output) - I am not sure here as I did not think about it enough. Currently we do not impose broadcast style in this PR, which means that the result is uncontrolled, and extra logic will be needed in e.g. DataFrames.jl to make sure we correctly handle the output.

@nalimilan
Copy link
Member Author

I'm not sure we'll be able to do the same for Not, so we would probably require writing All(Not(...)) .=>. That looks a bit weird so we may want to introduce Cols instead to write Cols(Not(...)) .=> (JuliaData/DataFrames.jl#2171). Overall I think we need to have a clear view of the big picture before moving forward.

@bkamins
Copy link
Member

bkamins commented Sep 2, 2021

@nalimilan I would propose:

struct BroadcastedSelector
    sel
end

Base.broadcastable(sel::All) = Ref(BroadcastedSelector(sel))
Base.broadcastable(sel::Cols) = Ref(BroadcastedSelector(sel))
Base.broadcastable(sel::Between) = Ref(BroadcastedSelector(sel))

# in InvertedIndices.jl, we need to add DataAPI.jl as a dependency there
Base.broadcastable(sel::Not) = Ref(BroadcastedSelector(sel))

as it will be simpler to handle later in DataFrames.jl using single code.

Now, how would processing in DataFrames.jl be performed. We take some src => transformation => dst specification (a value or a vector or a matrix of these is allowed). Then the rule would be:

  • if first dimension of the returned object is equal to length of column names selected by the selector then we perform replacement;
  • if first dimension of the returned object is has length equal to 1 we expand this dimension to the length of column names selected by the selector;
  • in all other cases we error;

The idea is that we want to simulate that:

some_broadcasted_expression(Selector)

should produce the same as:

some_broadcasted_expression(names(df, Selector))

and names(df, Selector) produces a Vector.

And - just for a reference - we cannot handle Regex this way. This will have to be documented.

The PRs implementing these changes should be relatively simple to implement.

Any comments on this?

@nalimilan
Copy link
Member Author

I'm not sure adding DataAPI as a dependency to InvertedIndices is a good idea. People who use it without other JuliaData packages may complain. But I'm OK with the rest of the plan.

@bkamins
Copy link
Member

bkamins commented Sep 2, 2021

OK. But then I would add InvertedIndices.jl as a dependency of DataAPI.jl 😄. And define the BroadcastedSelector in InvertedIndices.jl.

@bkamins
Copy link
Member

bkamins commented Sep 2, 2021

Following the discussion on Slack we will have a separate wrapper here and in InvertedIndices.jl.

Also, for a reference, the cases we need to handle in DataFrames.jl are along the lines:

  • identity.(Not(1)) (which is just a fancy column selector)
  • Not(1) .=> fun
  • Not(1) .=> funs
  • Not(1) .=> fun .=> cols
  • cols .=> fun .=> Not(1)
  • Not(1) .=> fun .=> Not(1)
  • Not(1) .=> funs .=> cols
  • cols .=> funs .=> Not(1)
  • Not(1) .=> funs .=> Not(1)
  • Not(1) .=> cols
  • cols .=> Not(1) (just a fancy way to rename)
  • Not(1) .=> Not(1) (just a fancy way to rename)

(essentially we need to handle broadcasting of Not and other selectors as a first or as a last part of DataFrames.jl minilanguage expressions)

Also I only assume we will handle this if broadcasting is at top-level of the expression (i.e. we do not try unnesting as it does not seem useful or sensible)

@nalimilan
Copy link
Member Author

I've updated the PR. I added a type parameter to BroadcastedSelector just in case some functions need it for performance. Anyway in DataFrames we disable specialization where appropriate.

@nalimilan nalimilan changed the base branch from master to main September 2, 2021 20:34
@codecov
Copy link

codecov bot commented Sep 2, 2021

Codecov Report

Merging #10 (f518a6a) into main (708b8ba) will increase coverage by 1.28%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main      #10      +/-   ##
==========================================
+ Coverage   91.30%   92.59%   +1.28%     
==========================================
  Files           1        1              
  Lines          23       27       +4     
==========================================
+ Hits           21       25       +4     
  Misses          2        2              
Impacted Files Coverage Δ
src/DataAPI.jl 92.59% <100.00%> (+1.28%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 708b8ba...f518a6a. Read the comment docs.

julia> DataAPI.Between(:a, :e) .=> sin
DataAPI.BroadcastedSelector{DataAPI.Between{Symbol, Symbol}}(DataAPI.Between{Symbol, Symbol}(:a, :e)) => sin

julia> DataAPI.Cols(r"x") .=> [sum, prod]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
julia> DataAPI.Cols(r"x") .=> [sum, prod]
julia> DataAPI.Cols(r"x") .=> [sum prod]

using [sum, prod] is probably not what the user will want (also the output needs to be changed)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah - I have just checked that this does not look nice in the output, so maybe leave it as is.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Contrary to e.g. ["a", "b"] .=> [sin, prod], DataAPI.Cols(r"x") .=> [sum prod] does apply sum and prod to all matching columns, right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the difference is that ["a", "b"] .=> [sin, prod] broadcasts the same dimension (and we get a vector) while ["a", "b"] .=> [sin prod] different dimensions (and we get a matrix).

DataAPI.Cols(r"x") will be treated as expanding to a vector.

Copy link
Member

@bkamins bkamins left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left one small comment.
Regarding specialization - as you prefer. However, I think that type instability here would not cause any issues in any package in practice.

@nalimilan nalimilan merged commit 4ed7f79 into main Sep 3, 2021
@nalimilan nalimilan deleted the nl/broadcast branch September 3, 2021 19:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants