Apply the same transformation to each of several inputs #2870

Open
jtrakk opened this issue Sep 9, 2021 · 10 comments

@jtrakk

jtrakk commented Sep 9, 2021

The Discourse thread https://discourse.julialang.org/t/frustrated-using-dataframes/67833 shows the need for a way to apply the same transformation to each of several columns.

One way mentioned in the thread is

transform(df, r"temp" => ByValue(t->((t-32)*5/9)) => (c->c*"celsius"))
  • The third component of the pairs is a renamer function.
  • ByValue is like ByRow but the function receives a value instead of a row.

But more flexible is

transform(df, r"temp" => Across(ByRow(t->((t-32)*5/9))) => (c->c*"celsius"))
  • Across applies its argument to each column separately.
  • Follows the normal protocol where the transformer function receives the whole column.

A list of selectors might be allowed:

transform(df, [(:a,:b), (:c, :d)] => Across(ByRow((x,y) -> x+y)) => ((name1,name2)->name1*name2))
@nathanrboyer
Contributor

Note that Across may more cleanly fix #2171.

@bkamins
Member

bkamins commented Sep 10, 2021

I am currently discussing exactly this issue with @nalimilan. Both broadcasting and Across have their pros and cons. We will post the conclusions here so that we can discuss them.

@jtrakk
Author

jtrakk commented Sep 10, 2021

@bkamins That sounds like an interesting discussion. Are the logs available somewhere?

@bkamins
Member

bkamins commented Sep 10, 2021

The discussion in short is:

The benefits of Across are:

  • it seems newcomers can digest it more easily
  • you can pass a predicate as the column selector (currently we do not allow functions as the source)
  • you can pass a transformation function for the target column names (currently we would require an explicit column name)

The benefit of broadcasting => is that we do not introduce a new element to the ecosystem, while:

  • Predicates can be handled by Cols(predicate), which would be consistent with the rest of the design.
  • A column name transformation function in the target part can also be supported in the future (we already have such a mechanism in the reshaping code).

So in short: can someone give an example of an Across call that would be significantly harder to handle with the .=> style?

@jtrakk
Author

jtrakk commented Sep 11, 2021

Would

transform(df, r"temp" => Across(ByRow(t->((t-32)*5/9))) => (c->c*"celsius"))

be one of these?

transform(df, r"temp" .=> ByRow(t->((t-32)*5/9)) => (c->c*"celsius"))
transform(df, r"temp" .=> ByRow(t->((t-32)*5/9)) .=> (c->c*"celsius"))

@bkamins
Member

bkamins commented Sep 11, 2021

transform(df, Cols(r"temp") .=> ByRow(t->((t-32)*5/9)) .=> (c->c*"celsius"))

or

transform(df, Cols(x -> occursin("temp", x)) .=> ByRow(t->((t-32)*5/9)) .=> (c->c*"celsius"))

assuming we add the features I discussed in my previous post.

The r"temp" is special: it does not support broadcasting the way we want and never will, since its broadcasting behavior is defined in Julia Base.

@nalimilan
Member

Just to add a slight nuance to @bkamins's summary of our discussion: I tend to think that the reason why some people find Across simpler to understand than .=> is that they are used to the former in dplyr. Whether something feels "natural" or not depends a lot on previous experience (even though some syntaxes are arguably simpler than others). So I'd rather wait until we support Cols(...) .=> to see whether people can get used to it or not.

One way to convince me is to show somebody who has never used dplyr and who finds Across much easier to use than .=>. :-D

@jtrakk
Author

jtrakk commented Sep 12, 2021

Why is Cols(r"temp") needed instead of just r"temp"?

@bkamins
Member

bkamins commented Sep 12, 2021

For the reason I have written above - Regex objects have a defined behavior in Julia Base that we cannot override.

See:

julia> r"t" .=> [length sum]
1×2 Matrix{Pair{Regex, _A} where _A}:
 r"t"=>length  r"t"=>sum

julia> names(df, r"t") .=> [length sum]
2×2 Matrix{Pair{String, _A} where _A}:
 "t1"=>length  "t1"=>sum
 "t2"=>length  "t2"=>sum

As you can see, there is no way to go from result one to result two (as broadcasting gets resolved BEFORE its result is passed to a DataFrames.jl function).
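
The point about broadcasting being resolved first can be checked with Base alone. This is only a minimal sketch: the vector ["t1", "t2"] is a hypothetical stand-in for what names(df, r"t") returned above, and the variable names are made up for illustration.

```julia
# A Regex is treated as a scalar by broadcasting, so the pairs are built
# from the Regex itself before any DataFrames.jl function could see them.
regex_pairs = r"t" .=> [length sum]
println(size(regex_pairs))                          # one row, one pair per function
println(all(p -> first(p) isa Regex, regex_pairs))  # every pair still holds the Regex

# A vector of concrete column names broadcasts against the 1x2 row of
# functions and produces one row of pairs per name.
name_pairs = ["t1", "t2"] .=> [length sum]
println(size(name_pairs))
println(first(name_pairs[2, 1]))                    # the name in row 2, column 1
```

In the first result the column names never appear, which is why the matrix cannot be expanded into the second one after the fact.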

While with Cols you have:

julia> Cols(r"t") .=> [length sum]
1×2 Matrix{Pair{DataAPI.BroadcastedSelector{Cols{Tuple{Regex}}}, _A} where _A}:
 BroadcastedSelector{Cols{Tuple{Regex}}}(Cols{Tuple{Regex}}((r"t",)))=>length  BroadcastedSelector{Cols{Tuple{Regex}}}(Cols{Tuple{Regex}}((r"t",)))=>sum

and as you can see, we have a BroadcastedSelector wrapper around the Cols object, so we know that the user requested broadcasting and we can handle this in DataFrames.jl.
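
For comparison, the temperature example from earlier in the thread can be sketched with the names(df, ...) form, which relies only on ordinary broadcasting over a vector of strings rather than on any of the features proposed in this issue. The data frame, its column names (temp_in, temp_out, id), and the helper to_celsius are assumptions made up for illustration.

```julia
using DataFrames

# Hypothetical data: two Fahrenheit columns and one unrelated column.
df = DataFrame(temp_in = [32.0, 212.0], temp_out = [50.0, 14.0], id = [1, 2])

# Fahrenheit to Celsius, the same formula used in the thread.
to_celsius(t) = (t - 32) * 5 / 9

# Resolve the regex to concrete column names first, then broadcast =>
# over the names and over the derived target names.
temp_cols = names(df, r"temp")
result = transform(df, temp_cols .=> ByRow(to_celsius) .=> (temp_cols .* "celsius"))

println(names(result))
```

The cost of this form is that the selector has to be written out or stored in a variable so that it can be used for both the source and the target names, which is the repetition that Across or a renamer function in the target position would avoid.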

@knuesel
Contributor

knuesel commented Sep 25, 2021

DataFrames.jl could also provide its own string macro for convenience, for example cols"t" .=> [length sum].

@bkamins bkamins modified the milestones: 1.3, 1.x Nov 8, 2021