Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Get values of grouped columns #1908

Merged
merged 16 commits into from
Dec 7, 2019
Merged

Get values of grouped columns #1908

merged 16 commits into from
Dec 7, 2019

Conversation

jlumpe
Copy link
Contributor

@jlumpe jlumpe commented Aug 2, 2019

I noticed there doesn't seem to be a built-in mechanism to get the actual group values from a GroupedDataFrame, so I added a groupvalues() function:

julia> df = DataFrame(a = repeat([:foo, :bar, :baz], outer=[2]),
                      b = repeat([2, 1], outer=[3]),
                      c = 1:6);

julia> gd = groupby(df, :a)
GroupedDataFrame with 3 groups based on key: a
First Group (2 rows): a = :foo
│ Row │ a      │ b     │ c     │
│     │ Symbol │ Int64 │ Int64 │
├─────┼────────┼───────┼───────┤
│ 1   │ foo    │ 21     │
│ 2   │ foo    │ 14
Last Group (2 rows): a = :baz
│ Row │ a      │ b     │ c     │
│     │ Symbol │ Int64 │ Int64 │
├─────┼────────┼───────┼───────┤
│ 1   │ baz    │ 23     │
│ 2   │ baz    │ 16     │

julia> groupvalues(gd)
3-element Array{Tuple{Symbol},1}:
 (:foo,)
 (:bar,)
 (:baz,)

Unfortunately, GroupedDataFrame doesn't seem to remember whether it was given a single column or a vector containing a single column, so this will always return a vector of tuples.

In Pandas, iterating over the grouped object gives (value, dataframe) pairs, which I use very frequently. Of course you can always get the value from the first row of the SubDataFrame when iterating, but I think this feature leads to much cleaner code. I defined Base.pairs(::GroupedDataFrame) for this purpose:

julia > for ((a,), group) in pairs(groupby(df, :a))
     println("Process group $a, $(size(df, 1)) rows.")
end
Process group foo, 2 rows.
Process group bar, 2 rows.
Process group baz, 2 rows.

@bkamins
Copy link
Member

bkamins commented Aug 2, 2019

Thank you for the commit. If we should add pairs method depends on what @nalimilan plans to do with GroupedDataFrame object in the future.

src/DataFrames.jl Outdated Show resolved Hide resolved
@bkamins
Copy link
Member

bkamins commented Aug 2, 2019

@nalimilan - maybe we could even consider storing the group infromation @jlumpe proposes when creating GroupedDataFrame? It should be cheap, and later we could reuse it.

@nalimilan
Copy link
Member

Thanks. I'm a bit hesitant about this API, since we currently never return arrays of (named) tuples: we rather use data frames for that. The pairs approach is appealing, but the problem is that indexing with values (if we added support for that, which would be required if pairs returns such values) is slower than using integer indices. Maybe we could define eachgroup(gd, true) to return (groupvalues, sdf) pairs, just like we have eachcol(df, true).

@bkamins
Copy link
Member

bkamins commented Aug 3, 2019

Actually - if we followed this idea, we should add an argument to groupby. Then current behavior groupby(df, some_columns) is like eachcol(df) and the additional behavior groupby(df, some_columns, true) would be like eachcol(df, true).

@nalimilan
Copy link
Member

It would be weird to have an argument to groupby, given that the grouping is the same: one would have to re-group just to change the iteration behavior.

I'm still not sure what's the best solution. The most appealing approach seems to be to define keys and pairs to return named tuples with the groups, and have getindex support that (for consistency). what do you think? People who want performance would use direct iteration or eachindex, which is defined to give the most efficient index type for arrays (as opposed to keys, so we would be consistent with that).

@bkamins
Copy link
Member

bkamins commented Aug 4, 2019

What we should do here depends if you want to make GroupedDataFrame <: AbstractDataFrame in the future. If not, then we are free to decide what is most convenient. However, if it is going to be a subtype of AbstractDataFrame we might have some constraints here.

@nalimilan
Copy link
Member

In the end I don't think we should make GroupedDataFrame <: AbstractDataFrame. Hadley said that he wouldn't do that if he were to design dplyr from scratch, and there aren't big advantages to doing that anyway (we can still define specific methods like select if we want). But I'm not sure the AbstractDataFrame interface would conflict here, since we're talking about indexing with a single value, which is now deprecated.

@bkamins
Copy link
Member

bkamins commented Aug 4, 2019

I think then it is OK to add keys and pairs. getindex could accept both integer indexing and named tuple indexing similar to how NamedTuple allows dual indexing.

@jlumpe jlumpe force-pushed the group-values branch 2 times, most recently from 052a145 to 0dbebed Compare August 4, 2019 21:03
@jlumpe
Copy link
Contributor Author

jlumpe commented Aug 4, 2019

Implemented feedback:

  • Added groupvalues to API docs.
  • groupvalues now returns a DataFrame.
  • Values now cached in a field of GroupedDataFrame.
  • Implemented dictionary interface by defining methods for getindex, keys and get.
    • Keys are NamedTuples and getindex supports NamedTuple keys with any field order or
      regular Tuples that match the order of gd.cols.
    • pairs automatically works when keys is defined.
    • Did not make it a subtype of AbstractDict.

@jlumpe
Copy link
Contributor Author

jlumpe commented Aug 4, 2019

Also, I thin this would resolve #1693

@bkamins
Copy link
Member

bkamins commented Aug 4, 2019

Thank you for the fixes. I have left some things that should be discussed (especially with @nalimilan feedback) - so probably it is best to finish the design before you implement the changes.

@jlumpe
Copy link
Contributor Author

jlumpe commented Aug 12, 2019

Implemented your feedback, this is now essentially just implementing the dictionary interface for GroupedDataFrame:

  • Store grouped values as vector of plain tuples in field .values.
  • keys() now looks up the parent column names when called, so it changes if the parent's columns are renamed (added a test for this).
  • Out-of-order NamedTuple keys no longer supported. Plain Tuple keys in correct order are supported still.
  • groupvalues function has been removed, DataFrame(keys(td)) now achieves the same thing.
  • Added explanation of GroupedDataFrame indexing to docs
  • Also added tip about using pairs() function to get group values when iterating.

docs/src/lib/indexing.md Outdated Show resolved Hide resolved
@bkamins
Copy link
Member

bkamins commented Aug 13, 2019

@jlumpe - thank you. I have left several minor comments and one major one regarding values field. My current thinking is that we can leave-out the decision if and how we implement it for a separate PR as this is only performance optimization issue (and how to best design it is non-obvious so it will be easier to discuss it when we have the core user-facing functionality already merged to master).

@jlumpe
Copy link
Contributor Author

jlumpe commented Dec 3, 2019

OK, full summary of the PR:

  • Implemented dictionary interface for GroupedDataFrame:
    • keys(gd) returns GroupKeys, lazy vector of GroupKey objects
      • GroupKey objects behave like NamedTuples of grouping column values, but property/element access is lazy. Allows for retrieval of grouping column values for each group.
    • getindex(gd, ...) defined for dictionary keys:
      • GroupKey objects belonging to the GroupedDataFrame.
      • Tuple or NamedTuple of grouping column values.
    • get(gd, ...) for Tuple and NamedTuple, not for GroupKey because they are always valid keys (for the correct GroupedDataFrame).
  • Added section on GroupedDataFrames to indexing page in documentation
    • Clarify difference between array-style and dictionary-style indexing
    • List of all valid argument types for getindex.
  • Refined signature of getindex(gd, ::Array) to restrict it to integer arrays, which was the existing contract of the method.
  • Added additional tests for groupvars and groupindices with multiple grouping columns.
  • Added test set for column names changing after groupby called.

I think this is all pretty complete. There are a couple of reviews that it won't let me mark as resolved for some reason. I'm going to go over it once more and check for any corner cases left out of the tests, but I think it should be ready to merge.

docs/src/lib/indexing.md Outdated Show resolved Hide resolved
test/grouping.jl Outdated Show resolved Hide resolved
@bkamins
Copy link
Member

bkamins commented Dec 3, 2019

Looks good. I left a few minor comments and we can merge it.

@jlumpe
Copy link
Contributor Author

jlumpe commented Dec 4, 2019

Ok, think that should resolve all remaining issues.

@bkamins
Copy link
Member

bkamins commented Dec 4, 2019

The only thing left is _grouptypes function definition that seems not used so I think it should be removed. See #1908 (comment) comment

Copy link
Member

@bkamins bkamins left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! Looks good. It has been a long journey, but the PR was really a major one.

Let us wait for @nalimilan to have final approve and then it can be merged.

@nalimilan
Copy link
Member

There are a few uncovered lines according to Coveralls. Can you check whether they need tests? See https://coveralls.io/builds/27423710/source?filename=src%2Fgroupeddataframe%2Fgrouping.jl

@bkamins
Copy link
Member

bkamins commented Dec 5, 2019

The display tests are usually put in test/show.jl.
The test of IndexStyle is used by pairs by default.

@bkamins
Copy link
Member

bkamins commented Dec 6, 2019

@jlumpe - can you please have a look at comments by @nalimilan. I want to have this PR i 0.20 and this is the last one scheduled for the release. Thank you.

@jlumpe
Copy link
Contributor Author

jlumpe commented Dec 7, 2019

@bkamins Can you clarify what you mean about IndexStyle?

@bkamins
Copy link
Member

bkamins commented Dec 7, 2019

Can you clarify what you mean about IndexStyle?

IndexStyle is now not covered by tests.

In order to include tests of IndexStyle you either have to call this function directly or call pairs as it internally calls IndexStyle. This is a minor issue so I think we can leave this out.

@nalimilan - if you will not have additional comments I will merge this PR today.

@bkamins
Copy link
Member

bkamins commented Dec 7, 2019

Ah - this is a very minor issue so I am OK to merge this PR without it (just please add it in the next PR you are planning to do related to this functionality).

Copy link
Member

@nalimilan nalimilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jlumpe! It's been a long way but I think it was worth it.

@@ -16,7 +16,10 @@ by
combine
groupby
groupindices
groupvalues
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
groupvalues

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not know how to force-push this. I will remove this line in 0.20 PR as this function does not exist now

@bkamins bkamins merged commit 0288a12 into JuliaData:master Dec 7, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants