Get values of grouped columns #1908

jlumpe · 2019-08-02T21:21:27Z

I noticed there doesn't seem to be a built-in mechanism to get the actual group values from a GroupedDataFrame, so I added a groupvalues() function:

julia> df = DataFrame(a = repeat([:foo, :bar, :baz], outer=[2]),
                      b = repeat([2, 1], outer=[3]),
                      c = 1:6);

julia> gd = groupby(df, :a)
GroupedDataFrame with 3 groups based on key: a
First Group (2 rows): a = :foo
│ Row │ a      │ b     │ c     │
│     │ Symbol │ Int64 │ Int64 │
├─────┼────────┼───────┼───────┤
│ 1   │ foo    │ 2     │ 1     │
│ 2   │ foo    │ 1     │ 4     │
⋮
Last Group (2 rows): a = :baz
│ Row │ a      │ b     │ c     │
│     │ Symbol │ Int64 │ Int64 │
├─────┼────────┼───────┼───────┤
│ 1   │ baz    │ 2     │ 3     │
│ 2   │ baz    │ 1     │ 6     │

julia> groupvalues(gd)
3-element Array{Tuple{Symbol},1}:
 (:foo,)
 (:bar,)
 (:baz,)

Unfortunately, GroupedDataFrame doesn't seem to remember whether it was given a single column or a vector containing a single column, so this will always return a vector of tuples.

In Pandas, iterating over the grouped object gives (value, dataframe) pairs, which I use very frequently. Of course you can always get the value from the first row of the SubDataFrame when iterating, but I think this feature leads to much cleaner code. I defined Base.pairs(::GroupedDataFrame) for this purpose:

julia> for ((a,), group) in pairs(groupby(df, :a))
           println("Process group $a, $(size(group, 1)) rows.")
       end
Process group foo, 2 rows.
Process group bar, 2 rows.
Process group baz, 2 rows.

bkamins · 2019-08-02T21:46:03Z

Thank you for the commit. Whether we should add a pairs method depends on what @nalimilan plans to do with the GroupedDataFrame object in the future.

src/DataFrames.jl

src/groupeddataframe/grouping.jl

bkamins · 2019-08-02T21:55:34Z

@nalimilan - maybe we could even consider storing the group information @jlumpe proposes when creating GroupedDataFrame? It should be cheap, and later we could reuse it.

nalimilan · 2019-08-03T15:05:10Z

Thanks. I'm a bit hesitant about this API, since we currently never return arrays of (named) tuples: we rather use data frames for that. The pairs approach is appealing, but the problem is that indexing with values (if we added support for that, which would be required if pairs returns such values) is slower than using integer indices. Maybe we could define eachgroup(gd, true) to return (groupvalues, sdf) pairs, just like we have eachcol(df, true).

bkamins · 2019-08-03T17:16:20Z

Actually - if we followed this idea, we should add an argument to groupby. Then the current behavior groupby(df, some_columns) would be like eachcol(df), and the additional behavior groupby(df, some_columns, true) would be like eachcol(df, true).

nalimilan · 2019-08-04T14:08:28Z

It would be weird to have an argument to groupby, given that the grouping is the same: one would have to re-group just to change the iteration behavior.

I'm still not sure what the best solution is. The most appealing approach seems to be to define keys and pairs to return named tuples with the groups, and have getindex support that (for consistency). What do you think? People who want performance would use direct iteration or eachindex, which is defined to give the most efficient index type for arrays (as opposed to keys, so we would be consistent with that).
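The difference between keys and eachindex mentioned here can be seen on a plain Base array. This is a minimal sketch using only Base Julia; the matrix `A` and the variable names are hypothetical and do not come from the PR.

```julia
# A 2×3 matrix: a plain Base array that supports fast linear indexing.
A = reshape(collect(1:6), 2, 3)

# eachindex picks the most efficient index type for the array;
# for a dense Matrix these are the linear indices 1 through 6.
linear = collect(eachindex(A))

# keys returns the array's natural keys, which for a matrix
# are CartesianIndex values (one per row/column position).
cartesian = collect(keys(A))

println(first(linear))
println(first(cartesian))
```

Both index kinds address the same elements, so `A[first(linear)]` and `A[first(cartesian)]` return the same value.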

bkamins · 2019-08-04T14:24:14Z

What we should do here depends on whether you want to make GroupedDataFrame <: AbstractDataFrame in the future. If not, then we are free to decide what is most convenient. However, if it is going to be a subtype of AbstractDataFrame, we might have some constraints here.

nalimilan · 2019-08-04T14:40:36Z

In the end I don't think we should make GroupedDataFrame <: AbstractDataFrame. Hadley said that he wouldn't do that if he were to design dplyr from scratch, and there aren't big advantages to doing that anyway (we can still define specific methods like select if we want). But I'm not sure the AbstractDataFrame interface would conflict here, since we're talking about indexing with a single value, which is now deprecated.

bkamins · 2019-08-04T15:29:44Z

I think then it is OK to add keys and pairs. getindex could accept both integer indexing and named tuple indexing similar to how NamedTuple allows dual indexing.
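For reference, the dual indexing of NamedTuple that this comparison relies on looks as follows in Base Julia. This is a small illustrative sketch; the value `nt` and the variable names are made up for the example.

```julia
# A NamedTuple can be indexed by position or by field name.
nt = (a = :foo, b = 2)

by_position = nt[1]   # positional (integer) index
by_name = nt[:a]      # lookup by field name
by_property = nt.a    # property access, equivalent to nt[:a]

println(by_position == by_name == by_property)
```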

jlumpe · 2019-08-04T21:07:37Z

Implemented feedback:

- Added groupvalues to API docs.
- groupvalues now returns a DataFrame.
- Values now cached in a field of GroupedDataFrame.
- Implemented dictionary interface by defining methods for getindex, keys and get.
  - Keys are NamedTuples and getindex supports NamedTuple keys with any field order or regular Tuples that match the order of gd.cols.
  - pairs automatically works when keys is defined.
  - Did not make it a subtype of AbstractDict.

jlumpe · 2019-08-04T21:11:42Z

Also, I think this would resolve #1693.

src/groupeddataframe/grouping.jl

docs/src/lib/functions.md

bkamins · 2019-08-04T23:06:37Z

Thank you for the fixes. I have left some things that should be discussed (especially with @nalimilan's feedback), so it is probably best to finish the design before you implement the changes.

jlumpe · 2019-08-12T22:45:45Z

Implemented your feedback, this is now essentially just implementing the dictionary interface for GroupedDataFrame:

- Store grouped values as a vector of plain tuples in field .values.
- keys() now looks up the parent column names when called, so it changes if the parent's columns are renamed (added a test for this).
- Out-of-order NamedTuple keys are no longer supported. Plain Tuple keys in the correct order are still supported.
- The groupvalues function has been removed; DataFrame(keys(gd)) now achieves the same thing.
- Added an explanation of GroupedDataFrame indexing to the docs.
- Also added a tip about using the pairs() function to get group values when iterating.

docs/src/lib/indexing.md

docs/src/man/split_apply_combine.md

src/groupeddataframe/grouping.jl

bkamins · 2019-08-13T07:02:31Z

@jlumpe - thank you. I have left several minor comments and one major one regarding the values field. My current thinking is that we can leave the decision of whether and how to implement it for a separate PR, as this is only a performance optimization issue (and how best to design it is non-obvious, so it will be easier to discuss once the core user-facing functionality is merged to master).

jlumpe · 2019-12-03T04:14:26Z

OK, full summary of the PR:

- Implemented dictionary interface for GroupedDataFrame:
  - keys(gd) returns GroupKeys, lazy vector of GroupKey objects.
    - GroupKey objects behave like NamedTuples of grouping column values, but property/element access is lazy. Allows for retrieval of grouping column values for each group.
  - getindex(gd, ...) defined for dictionary keys:
    - GroupKey objects belonging to the GroupedDataFrame.
    - Tuple or NamedTuple of grouping column values.
  - get(gd, ...) for Tuple and NamedTuple, not for GroupKey because they are always valid keys (for the correct GroupedDataFrame).
- Added section on GroupedDataFrames to indexing page in documentation:
  - Clarify difference between array-style and dictionary-style indexing.
  - List of all valid argument types for getindex.
- Refined signature of getindex(gd, ::Array) to restrict it to integer arrays, which was the existing contract of the method.
- Added additional tests for groupvars and groupindices with multiple grouping columns.
- Added test set for column names changing after groupby called.
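The dictionary interface summarized above might be used roughly as follows. This is a sketch, not taken from the PR's tests: it assumes a DataFrames.jl release that includes this PR, reuses the example data frame from the opening comment, and uses hypothetical variable names and a hypothetical missing key `:nope`.

```julia
using DataFrames

df = DataFrame(a = repeat([:foo, :bar, :baz], outer = 2),
               b = repeat([2, 1], outer = 3),
               c = 1:6)
gd = groupby(df, :a)

# Array-style indexing: integer position of the group.
first_group = gd[1]

# Dictionary-style indexing: Tuple or NamedTuple of grouping column values.
foo_by_tuple = gd[(:foo,)]
foo_by_namedtuple = gd[(a = :foo,)]

# keys(gd) yields GroupKey objects, which are themselves valid indices.
k = first(keys(gd))
group_by_key = gd[k]
println(k.a)  # the grouping value is available as a property

# get returns the supplied default when no group matches the key.
missing_group = get(gd, (a = :nope,), nothing)

# pairs yields (key, group) pairs when iterating.
for (key, group) in pairs(gd)
    println("Process group $(key.a), $(nrow(group)) rows.")
end
```

Integer indexing stays available for performance-sensitive code, while the key-based forms cover the Pandas-style use case from the original proposal.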

I think this is all pretty complete. There are a couple of reviews that it won't let me mark as resolved for some reason. I'm going to go over it once more and check for any corner cases left out of the tests, but I think it should be ready to merge.

docs/src/lib/indexing.md

src/groupeddataframe/grouping.jl

test/grouping.jl

bkamins · 2019-12-03T19:43:40Z

Looks good. I left a few minor comments and we can merge it.

src/groupeddataframe/grouping.jl

jlumpe · 2019-12-04T21:39:34Z

Ok, think that should resolve all remaining issues.

bkamins · 2019-12-04T21:46:56Z

The only thing left is the _grouptypes function definition, which seems to be unused, so I think it should be removed. See #1908 (comment).

bkamins

Thank you! Looks good. It has been a long journey, but the PR was really a major one.

Let us wait for @nalimilan to give final approval and then it can be merged.

src/groupeddataframe/grouping.jl

nalimilan · 2019-12-05T11:23:42Z

There are a few uncovered lines according to Coveralls. Can you check whether they need tests? See https://coveralls.io/builds/27423710/source?filename=src%2Fgroupeddataframe%2Fgrouping.jl

bkamins · 2019-12-05T17:34:13Z

The display tests are usually put in test/show.jl.
As for the test of IndexStyle: it is used by pairs by default.

bkamins · 2019-12-06T16:21:34Z

@jlumpe - can you please have a look at the comments by @nalimilan? I want to have this PR in 0.20 and this is the last one scheduled for the release. Thank you.

jlumpe · 2019-12-07T01:46:31Z

@bkamins Can you clarify what you mean about IndexStyle?

bkamins · 2019-12-07T08:42:25Z

> Can you clarify what you mean about IndexStyle?

IndexStyle is now not covered by tests.

In order to include tests of IndexStyle you either have to call this function directly or call pairs as it internally calls IndexStyle. This is a minor issue so I think we can leave this out.

@nalimilan - if you do not have additional comments I will merge this PR today.

bkamins · 2019-12-07T08:43:37Z

Ah - this is a very minor issue so I am OK to merge this PR without it (just please add it in the next PR you are planning to do related to this functionality).

nalimilan

Thanks @jlumpe! It's been a long way but I think it was worth it.

bkamins · 2019-12-07T11:00:06Z

docs/src/lib/functions.md

@@ -16,7 +16,10 @@ by
 combine
 groupby
 groupindices
+groupvalues


Suggested change

groupvalues

I do not know how to force-push this. I will remove this line in the 0.20 PR, as this function does not exist now.

jlumpe force-pushed the group-values branch from 81d9801 to 18907fb · August 2, 2019 21:33

bkamins reviewed Aug 2, 2019 · src/DataFrames.jl (outdated)
bkamins reviewed Aug 2, 2019 · src/groupeddataframe/grouping.jl (outdated)

jlumpe force-pushed the group-values branch 2 times, most recently from 052a145 to 0dbebed · August 4, 2019 21:03

bkamins reviewed Aug 4, 2019 · src/groupeddataframe/grouping.jl (outdated)
bkamins reviewed Aug 4, 2019 · src/groupeddataframe/grouping.jl (outdated)
bkamins reviewed Aug 4, 2019 · src/groupeddataframe/grouping.jl (outdated)
bkamins reviewed Aug 4, 2019 · src/groupeddataframe/grouping.jl (outdated)
bkamins reviewed Aug 4, 2019 · docs/src/lib/functions.md

jlumpe force-pushed the group-values branch from 0dbebed to 7e92206 · August 12, 2019 22:40

bkamins reviewed Aug 13, 2019 · docs/src/lib/indexing.md (outdated)
bkamins reviewed Aug 13, 2019 · docs/src/man/split_apply_combine.md (outdated)
bkamins reviewed Aug 13, 2019 · src/groupeddataframe/grouping.jl (outdated)
bkamins reviewed Aug 13, 2019 · src/groupeddataframe/grouping.jl (outdated)
bkamins reviewed Aug 13, 2019 · src/groupeddataframe/grouping.jl (outdated)
bkamins reviewed Aug 13, 2019 · src/groupeddataframe/grouping.jl (outdated)

bkamins reviewed Dec 3, 2019 · docs/src/lib/indexing.md (outdated)
bkamins reviewed Dec 3, 2019 · src/groupeddataframe/grouping.jl
bkamins reviewed Dec 3, 2019 · src/groupeddataframe/grouping.jl
bkamins reviewed Dec 3, 2019 · test/grouping.jl
bkamins reviewed Dec 3, 2019 · test/grouping.jl
bkamins reviewed Dec 3, 2019 · test/grouping.jl (outdated)

nalimilan reviewed Dec 4, 2019 · src/groupeddataframe/grouping.jl (outdated)

jlumpe added 3 commits December 4, 2019 13:13
f27e20c Fix IndexStyle for GroupKeys
89f803f Update tests for GroupedDataFrame dictionary interface
7467a71 Updates to GroupedDataFrame key documentation

jlumpe force-pushed the group-values branch from 67b5f74 to ab70caf · December 4, 2019 21:38

8dd16ac Remove unused internal function _grouptypes

bkamins approved these changes Dec 4, 2019

nalimilan reviewed Dec 5, 2019 · src/groupeddataframe/grouping.jl (outdated)

jlumpe added 2 commits December 6, 2019 15:07
7654f13 Remove unused private method
cbebbd1 Tests for show(::GroupedDataFrame)

cb058b1 Test IndexStyle, move tests to same place

nalimilan approved these changes Dec 7, 2019

bkamins reviewed Dec 7, 2019

bkamins merged commit 0288a12 into JuliaData:master Dec 7, 2019
