Row-wise vs. whole vector functions #1952
Comments
Just a small comment, as this overlap was pointed out in #1256 already. I think one has to be a bit careful with performance here, in that |
Figuring out which functions are special cased as by row and which ones take the whole column is a continual pain point for me with dplyr. I would prefer all functions be applied to the whole column and require the use of anonymous functions for element wise operations. |
Which functions are applied by row in dplyr? I thought they were all vectorized.
As I noted above, the problem is that some operations could be made faster if we know they are applied independently to each row: the operation can be parallelized, temporary vectors can be avoided, etc. |
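To make the point about temporaries concrete, here is a minimal sketch using only Base. The vectors `x` and `y` and the helpers `rowfun` and `colfun` are made up for illustration and are not DataFrames.jl API. A scalar function applied through a fused broadcast allocates only its result, whereas a function written against whole vectors materializes an intermediate first.

```julia
# Hypothetical columns; plain vectors stand in for data frame columns.
x = collect(1.0:5.0)
y = collect(6.0:10.0)

# Row-wise: a scalar function that sees one row's values at a time.
rowfun(a, b) = (a + b) / 2

# Broadcasting runs rowfun in a single loop and allocates only the result.
rowwise = rowfun.(x, y)

# Whole-vector: the function receives the full columns.
function colfun(a::AbstractVector, b::AbstractVector)
    tmp = a + b      # intermediate vector is allocated here
    return tmp / 2   # a second allocation holds the result
end
whole = colfun(x, y)

# Both formulations give the same values.
@assert rowwise == whole
```

Because each call to `rowfun` is independent of the others, the row-wise form is also the one that could in principle be split across threads.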
It was the string packages that frustrated me. Another solution might be a |
Wait, p> paste(1:2, 3:4)
[1] "1 3" "2 4"
Yeah, I was thinking something like that could do the trick. Though it's kind of backwards... |
It might have been I see the point. I'm surprised there is no way for the compiler to know if an anonymous function is purely broadcasted, but that seems to be the case so this is a tough problem. |
Adding #2048 as a part of the decision here I guess (I am not sure what is left to decide in the main thread, but this issue seems to fall into the category of this thread 😄). |
@nalimilan regarding:
This comment by you indicates that it would be good that EDIT - I have thought over that. It is not that crucial in the end in my opinion. |
Here is a summary of what I have on this issue. Currently present:
To be added:
So I would say what we have now is consistent. The TODO is more or less:
If there are no negative comments towards that I will go forward with this plan. |
@nalimilan - I have started documenting & implementing target |
Can you elaborate? I think we probably need both. The difficulty is under what form we can provide them... |
I was writing in parallel in #2053 (comment) what the rationale is. My question to @nalimilan and @pdeffebach - can you give me examples of typical use cases where "whole column" |
Also if we want "whole column" operations on a data frame many typical cases (like standardization of many numeric columns) can be achieved by |
Can you copy your comment here instead? That sounds more appropriate and it will avoid splitting discussions. The most common cases I can think of are EDIT: another interesting use case is when you want to modify or reorder levels of a categorical array. That's quite natural in dplyr with |
Yes i agree with Milan. Another point to make is that it's very easy to simulate row-wise operations with vector notation, just add a
On the other hand, if we impose row-wise operations it's very hard to go the other direction. Because of this asymmetry I support column-based operations. |
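The asymmetry can be sketched with plain vectors. `apply_colwise` and `apply_rowwise` below are hypothetical stand-ins for the two API styles, not functions from DataFrames.jl or JuliaDB. Under a whole-column API a row-wise operation only needs a dot, while under a row-wise API the function never sees the full column, so a statistic such as the column mean is out of reach.

```julia
using Statistics

col = [2.0, 4.0, 6.0, 8.0]

# Hypothetical whole-column API: the function receives the full vector.
apply_colwise(f, v) = f(v)

# Hypothetical row-wise API: the function receives one element at a time.
apply_rowwise(f, v) = map(f, v)

# A whole-column operation under the whole-column API.
centered = apply_colwise(v -> v .- mean(v), col)

# A row-wise operation under the whole-column API: just add a dot.
squared = apply_colwise(v -> abs2.(v), col)

# Under the row-wise API the same centering attempt degenerates:
# mean(e) is the mean of a single number, so every result is zero.
degenerate = apply_rowwise(e -> e - mean(e), col)
```

Here `centered` holds the deviations from the column mean, while `degenerate` is all zeros because each call only ever sees its own element.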
I split the issue when I thought we have a simple case. So let me first add the comment I put there here. Here is the comment from #2053 issue:However, later I came to a conclusion that for The issue is that For Just to sum up, e.g. the use case to standardize the column of a data frame is not so appealing as it is very easy to achieve via
This is a bit verbose but not super bad, as opposed to eg having to write:
when you want to take
(I think row-wise operations on data frames, like in JuliaDB.jl, will be needed more often by users than whole-column operations.)

Some more thoughts

Actually we already have a whole-column operation available; it is
you get exactly what we are talking about. The only thing that is needed is to extend the syntax of I am not sure this is a right direction of thought, but at least this is possible. |
@pdeffebach Sorry if that wasn't clear: what I was suggesting is
@pdeffebach Yes, however it's clearly quite verbose (I mean, if you show that example to any R user he would laugh at us as
@bkamins Right. Since that behavior is due to |
In terms of consistency across packages, what @bkamins is proposing is exactly the JuliaDB approach, where I agree that it'd be interesting to figure out a good name for a colwise transformation, to complete the analogy |
Well, as commented in #2053 we do not really need row-wise Just to explore the possibilities another idea that came to my mind was to allow
Then you could write something like:
What is the rationale behind it? I expect that apart from decision of row-wise vs whole column operations people will want also column selection and renaming functionalities in both options, so it would duplicate the functionality (+ with PS. we use |
This seems to be much better than a keyword argument. In particular it is pretty much in line with special selectors, like |
Exactly - and that is why I thought that it might be also then integrated into JuliaDB.jl so that we have consistency (it will be harder as JuliaDB.jl is distributed, but maybe there would be some efficient way to do it - at least in some cases). |
Regarding the verbosity of
Row-wise operations won't change that verbosity very much, any time you have two columns you will need an anonymous function, as with any time you want to have a keyword argument in a scalar-valued function. Ideally currying would solve all these problems, enabling us to write
or something. I think I still think there would be major inconsistencies in the API if |
I think you will not in most common cases, e.g. in my example above you will be able to write Also note that
I would like to keep DataFrames.jl functionality plain and simple using only Base. All magic should go to DataFramesMeta.jl, which I plan to work on when we release DataFrames.jl 1.0 when we have a stabilized API here.
But as always - let us discuss. I think this is the last major decision before 1.0 so we should have a clear roadmap for it. |
Just to sum up my current thoughts that were scattered around several issues (sorry for that, but I am in creative mode, hopefully will switch to implementing soon, so there will be less noise). My thinking is that we need two functions
In this way we will have two functions:
|
Interesting. Though using Adding keyword arguments to support filtering and grouping is also interesting, but essentially orthogonal AFAICT: this can be achieved in several steps anyway using |
Here are my thoughts:
I agree it is weird.
The fact that
I think it is not orthogonal because of two reasons:
Of course all this makes sense under "powerful select" approach. If we want |
I'd say calling this general column-wise function
I don't think doing Maybe that's the case for other operations, but AFAIK for data frames the most efficient way to do a filtering, grouping and combining operation is really to do these three steps in sequence (as long as you take care of creating a view to filter). I'm not very familiar with query planners, but my understanding is that major optimizations can be obtained when joins are involved (which keyword arguments wouldn't support as we're talking about single-table operations). I guess Jacob and David know better. But yes, you're right that if we want to support keyword arguments in 1.x we have to prevent using them to specify column names.
I'm not sure that's really a requirement. We could have a powerful |
OK - we can go this way I think. So the question is do we like the names Another dimension in e.g. dplyr is that it distinguishes if old columns are kept or not. I think we do not strictly need this distinction as you will be able to add We can start Another question - which again can be decided later - is if we want to add |
In terms of naming, I don't think it's great to have As the default seems to be row-wise, maybe it'd be better that the column-wise version has this explicitly in the name, e.g. |
I am not sure about naming either - hopefully @nalimilan can come up with something smart as usual 😄. Maybe just
As I have said above - this is possible, but it is really easy to keep old columns by adding
I understand that @nalimilan was afraid that this will be too complex if we had only this option for the newcomers. Regarding |
I can't agree with this. I think a *nix philosophy of aptly named simple functions is preferable. I think if anything we should actually take seriously |
Thank you all for the comments. I have a feeling that I see what The advice that I think would be now most valuable is suggestions of public API. We have the following dimensions (I give a full list, not all of them have to be covered in one shot, but we need a longer term plan):
For now - following the comment by @nalimilan I leave out SQL So which combinations of these options should be exposed by what functions. Probably there are mixed opinions here (which is OK - I think it is better to voice them now so that we can properly weigh them; the crucial thing - if I may ask - is to possibly provide a complete proposal that covers a whole list of possible options, as a key thing here is to be able to verify consistency and intuitiveness of names in the proposals). Thank you! |
I don't have great ideas for naming unfortunately. I agree that having The |
No, actually they are both row-wise, the difference is that At this moment my personal preference is probably for the We had discussed in JuliaData/JuliaDBMeta.jl#29 the possibility to allow interpolating a column in JuliaDBMeta (and I guess DataFramesMeta) row-wise macros. For example in: @transform df normcol = (:col1 + :col2) / $(mean(:col3)) the dollar expression gets computed before calling the row-wise macro and then the rest is computed row-wise. Again, this would mean that most (all?) macros are row-wise, but there is an escape-hatch for column-wise operations. |
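The two-phase evaluation described for the dollar expression can be imitated by hand in Base Julia. This is only a sketch of the idea, not the macro's implementation: the table `tbl` is a hypothetical NamedTuple of columns, and the column names are borrowed from the example above.

```julia
using Statistics

# Hypothetical table stored as a NamedTuple of column vectors.
tbl = (col1 = [1.0, 2.0, 3.0],
       col2 = [3.0, 2.0, 1.0],
       col3 = [2.0, 4.0, 6.0])

# Phase 1: the part that would sit inside $( ... ) is computed once,
# on the whole column, before any row is visited.
m = mean(tbl.col3)

# Phase 2: everything else runs row by row, reusing the precomputed value.
normcol = map(tbl.col1, tbl.col2) do a, b
    (a + b) / m
end
```

The row-wise part stays free of whole-column access, which is what keeps it cheap to distribute. The escape hatch is limited to values that can be computed up front.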
Funny, I don't know how I imagined JuliaData/JuliaDBMeta.jl#29 is indeed relevant, thanks for raising it again. I still like the I guess that kind of approach would suit well with the |
I think it is also OK to have What I mean is that:
would become
as it still requires |
Right. But wrapping the column name sounds slightly more correct to me, as it's really the kind of argument passed to the function which is changed by Now, what would be the best name for that wrapper? Is |
Just to add (as it might affect the decision), that we would also write things like As an additional decision we should make is what |
|
Weighing in on a number of topics that cropped up in this thread:

Row-wise by default

I experimented with this a bit in my fork of DataFramesMeta (although admittedly it sort of fell off my radar and I wouldn't be surprised if a lot of it breaks at the moment). By default, all operations were columnwise, but could be interpreted as rowwise by first converting a

```julia
# using dgkf/DataFramesMeta.jl#dev/symbol_contexts
df = DataFrame(x = 1:4, y = repeat([1, 2], 2), z = 'a':'d')
df |> @transform(a = :x .+ :y) # columnwise
df |> eachrow |> @transform(a = :x + :y) # rowwise
# rowwise with column results
# :. is interpreted as the input object
df |> eachrow |> @transform(a = (:x + :y) / mean(parent(:.)[!,^(:x)]))
```

What I like about this is that it can use the type dispatch to alternate between modes of operation based on the input datatype. The way it's implemented, it's pretty trivial because of how the symbols are evaluated using

How to escape columns

What's not implemented above, and is reflected in how nasty that last line looks, is how to escape a column. I think that this could be cleaned up in the above example, perhaps to something like:

```julia
@transform(a = (:x + :y) / mean(:.[!,:x])) # or
@transform(a = (:x + :y) / mean($x)) # as suggested above
```

I'm not crazy about having a bunch of added syntax to denote row cells ( Adapting
|
@nalimilan - is there anything left to be decided in this issue (I know there are several specific things left, but they have separate issues). Is there any remaining "grand decision" to be made in this issue? |
I am closing this as I do not see any open grand discussions here after a review. We have settled on a general design. Please open separate issues if there are future requests for "individual" decisions/functionalities. |
This is a continuation of a discussion started by @piever at #1256 (comment), about whether functions should operate row-wise or take and return full vectors.
As @piever noted, JuliaDB has taken the row-wise approach since that allows distributing operations over multiple cores. This is a good idea in general even for DataFrames, where we could use multiple threads.
`groupreduce` is also a good example of this: I indeed added code to detect common reductions in `by`/`combine` to transform such operations into what is essentially a `groupreduce` operation (which could be added to the API at some point). I think it is useful to allow both, as people are used to thinking in terms of `sum` rather than "reduction using `+`" (and indeed Base provides `sum(x)` in addition to `reduce(+, x)`).

So the general question is, when should a DataFrames function be row-wise and when should it take a full vector? Unfortunately, I think there are advantages to both. Operating row-wise is simpler to write (no dots), distributable and possibly more efficient (no intermediate allocations). But operating over whole vectors allows doing things like `normalize(x)`, `x .- mean(x)` or `diff(x)`, which are quite common (either on the whole data frame, or by group); another operation which is sometimes useful is to create a new variable containing for each row the mean of the group it corresponds to (in which case recycling is needed, just like dplyr's `mutate` does). We discussed similar issues previously in the context of JuliaDBMeta macros at JuliaData/JuliaDBMeta.jl#29. In theory, these things can be performed as a row-wise operation after computing the needed summary statistic, but that's not very user-friendly unless we can find a very simple macro syntax which also works for window functions like `lag` (let's discuss that at JuliaData/JuliaDBMeta.jl#29).

The simplest solution would be to have both kinds of functions, provided we can find a clear rule to distinguish them. For example, we already have row-wise `filter`, `unique` and `sort`, so it would be somewhat consistent to also have a row-wise `map` (or `map(f, eachrow(df))` if we're unsure) as an equivalent to JuliaDB's `select`.

Then we could provide separate functions using a dplyr-like terminology, which operate on whole vectors and are somewhat more user-friendly. But `select` belongs to the latter family, so by that rule it would have to operate over whole vectors. One solution to that would be to only allow selecting (not transforming) columns, as in dplyr. Then we could also introduce `mutate` (I prefer the name `transform`, but JuliaDB already uses it for row-wise operation...) to create columns by passing a function that operates on whole vectors. We would still need an equivalent of `select` or dplyr's `transmute` to operate on whole vectors. `mutate(df, ..., keep=true)` could work, but it should probably recycle scalars so that the result has the same number of rows as the input: that wouldn't allow replacing `aggregate` (#1256).
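The whole-vector operations mentioned above can be illustrated with Base and the Statistics standard library alone. This is a minimal sketch on made-up data. The vectors `x` and `g` are hypothetical, and the dictionary-based group mean is just one simple way to show recycling, not how DataFrames.jl implements grouping.

```julia
using Statistics

x = [1.0, 3.0, 6.0, 10.0]
g = ["a", "b", "a", "b"]   # hypothetical grouping column

# A reduction can be spelled either way; both give the same total.
total_sum = sum(x)
total_reduce = reduce(+, x)

# Operations that need the whole vector, not one row at a time.
centered = x .- mean(x)    # deviation from the column mean
steps = diff(x)            # differences between consecutive rows

# Group means recycled back to one value per row: compute one summary
# per group, then look it up for each row's group key.
groupmeans = Dict(k => mean(x[g .== k]) for k in unique(g))
rowmeans = [groupmeans[k] for k in g]
```

Note that `rowmeans` has the same length as `x`, which is the recycling behaviour described above, while `steps` is one element shorter than its input.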