RFC: Major API changes #80
Comments
Renaming sorts of table for the purpose of making the whole more easily understood by different kinds of users is laudable. Deciding "which gets called what" by "what fits best, best fits," is most rational when the set of names that get used are similarly understandable by a good swath of their users. IndexedTable is generally approachable as a Table that carries expressive or algorithmic efficacies through Index|es. NDSparse is not as generally approachable; it trades technical accuracy for conversational transparency. Readability invites diverse readership, so using "Table" in naming sorts of table helps. Sparsity is a trait (as we now use that word). Trait names and subtype naming address different organizing or architectural principles. Mixing them in the most-used part of an API is harder to digest. |
Cool! Yes, to explain just one piece of why I support these changes - a "table" is typically used in the computing / data science areas as a finite relation - some collection of (named) tuples (i.e. rows). If we are going to support the relational operations (that people are used to from SQL, etc), it is most transparent and coherent to view the table as a collection of rows, and have them iterate entire rows. I feel this will help with making great APIs for the kinds of operations typically done on tables. I feel that one of the primary reasons we've shied away from iterating rows in Julia table packages in the past is because of performance concerns (eg type stability), rather than for reasons of conceptual correctness. Of course, this shouldn't preclude "viewing" tables as 2D style objects or as more complex containers like |
I think this is a great move. It should also make the creation of a custom Query.jl backend that exploits the indices much easier. |
My 2 cents as a user of this package: I think this API change would be really beneficial. Here is what I believe are some issues with the current implementation which would be solved with these API changes:
In general, my understanding is that for the average users, data manipulations (say split-apply-combine) should just work with an easy and widely used syntax, but that it should be possible, as a user, to explicitly rearrange columns (or use some more convoluted syntax, or give some extra "sortedness" information) to maximize performance. |
Yeah, I was really confused initially that I couldn't use tbl[1:2,:] to get the first two rows, but then realised that IndexedTables work completely differently. |
Thanks everyone for your inputs! JuliaData/IndexedTables.jl#85 has come a long way. @piever, it should address all those issues! Here's the upcoming API: |
Amazing work! This really feels much more natural from a user perspective. I only have a doubt wrt the discussion here at DataFrames: to summarize, I think the issue is that IIUC the IndexedTables equivalent of such a function used to be |
Thanks for looking over! Glad you like it!
Do people expect that to be the default behavior of Flattening not being the default seems to be the position taken by @andyferris's SplitApplyCombine.jl https://github.com/JuliaData/SplitApplyCombine.jl#groupby-f--identity-iter and https://github.com/JuliaData/SplitApplyCombine.jl#flattena
Is the most common case for this a single row table? In the current |
I think DataFrames flattens and Query doesn't, I'm not sure about other packages and I don't know what is the rationale.
I'm really not an expert on this, but I would say it's mainly the one row case that is relevant. It's probably best to look carefully at the DataFrames issue linked above and involve the relevant people in the debate (especially because both DataFrames and IndexedTables grouping APIs seem to be changing). I'm taking the liberty to ping @nalimilan as he is the author of the corresponding DataFrames issue and seems extremely knowledgeable about grouping APIs from different packages (DataFrames, Pandas, dplyr). |
I think it's very common to need a flattened output, at least that's what e.g. dplyr does. A basic example is when you want to normalize a variable so that it has mean 0 within each group. You really want to get the same structure of the data (i.e. same number of rows and columns) as the output, with just one variable transformed. DataFrames provides The problem DataFrames has, but not JuliaDB, is that it doesn't encode column type information, so the return type of the function cannot be inferred, which kills performance. See JuliaData/DataFrames.jl#1256. We're probably going to work around this by using a special kind of sub data frame encoding column types just for this case. |
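For reference, the normalization case described above can be sketched in plain Julia without any table package. This is only an illustration under stated assumptions: `center_by_group`, `group` and `value` are hypothetical names, the data is made up, and this is not how DataFrames or JuliaDB implement grouping. It just shows what "flattened" means here: the result has one entry per input row rather than one entry per group.

```julia
using Statistics  # standard library; provides `mean`

# Hypothetical data: a grouping column and a value column of equal length.
group = [1, 1, 1, 2, 2, 2]
value = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

# Subtract the group mean from every value in that group.
# The output is "flat": same length and row order as the input.
function center_by_group(group, value)
    out = similar(value, Float64)
    for g in unique(group)
        idx = findall(==(g), group)              # rows belonging to group g
        out[idx] .= value[idx] .- mean(value[idx])
    end
    return out
end

centered = center_by_group(group, value)
# centered == [-1.0, 0.0, 1.0, -1.0, 0.0, 1.0]
```

A non-flattened result would instead hold one vector per group, which is the shape the `groupby` output further down in this thread has.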
Thanks for the input Milan! Normalization is a good example. I'll see if I can do this. |
@shashi I've seen the METADATA issue #12152 (https://github.com/JuliaLang/METADATA.jl/pull/12152) and will try to port GroupedErrors in the week-end, thanks for adding the upper bound to the released version in the meantime. I still don't understand the outcome of the discussion about flattening (I may need it to port GroupedErrors). Am I correct that now |
@piever I decided to put it behind a flag because it felt like conflating 2 ideas into one function. Now there is a flatten function and flatten kwarg to groupby. |
I see, so it's been added now but is not in the released version? I'll see if it's easy to port GroupedErrors without using it, otherwise I imagine I should probably wait for the next point release of IndexedTables to update my package with the right lower bound to the dependency. |
Also, while
julia> t=table([1,1,1,2,2,2], [1,1,2,2,1,1], [1,2,3,4,5,6],
               names=[:x,:y,:z]);
julia> groupby(identity, t, (:x, :y), select=:z, flatten = true)
Table with 4 rows, 3 columns:
x  y  identity
──────────────
1  1  [1, 2]
1  2  [3]
2  1  [5, 6]
2  2  [4]
Sorry for bugging you so much with all these issues, I hope this kind of feedback is being useful. |
Good catch! That was a bug, now fixed. Yes, you'll need to depend on v0.4.1. I've tagged the releases, someone may hit merge any minute...
Of course! Definitely keep them coming! |
After looking at people use JuliaDB in the wild, and noting the common roadblocks to mastery of the framework, it's clearer that we need to make the API more relatable for someone with a relational database background (which is most people who want to use the package).
IndexedTable vs N-d sparse structure
IndexedTable type as it stands conflates a table structure and that of an N-d sparse array. These two structures can be separated:
One could argue that a 1:n indexed 2) acts as a table in the traditional sense. This is true, but does not capture sorting as in 1). 2) is just a simple wrapper of 1) which defines getindex to work on the indexed values, and with that as basis defines array operations like map, reduce, broadcast, reducedim and so on. 1) may not enforce that indexed values are unique, while 2) must. By default, relational join operations on 1) join non-indexed columns based on indexed columns (as is the case now) but should be configurable (which is being implemented in JuliaData/IndexedTables.jl#79). It seems fine to let 2) also be used in relational operations by just forwarding them to the wrapped 1) object.
My opinion is that we should call 2) (which is most similar to the current IndexedTable) NDSparse, and name 1) IndexedTable, with a deprecation step where we deprecate IndexedTable to NDSparse and then bring back IndexedTable as 1).
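To make the distinction between 1) and 2) concrete, here is a minimal sketch in plain Julia. Everything in it is hypothetical and for illustration only: `SortedRows` and `SparseND` are made-up names, not the IndexedTables.jl types, and the sketch assumes the index is stored as a vector of tuples. It shows only the properties described above: 1) keeps rows sorted by the indexed columns and tolerates duplicate keys, while 2) wraps 1), requires unique keys, and defines getindex on the indexed values.

```julia
# 1) A table whose rows are kept sorted by the indexed columns.
#    Duplicate index values are allowed. (Hypothetical type.)
struct SortedRows{K,V}
    index::Vector{K}   # indexed columns, one tuple per row
    data::Vector{V}    # non-indexed columns, one entry per row
    function SortedRows(index::Vector{K}, data::Vector{V}) where {K,V}
        length(index) == length(data) ||
            throw(DimensionMismatch("index and data must have equal length"))
        p = sortperm(index)              # tuples sort lexicographically
        new{K,V}(index[p], data[p])
    end
end

# 2) A thin wrapper around 1) that must have unique index values,
#    so that lookup by index is well defined. (Hypothetical type.)
struct SparseND{K,V}
    rows::SortedRows{K,V}
    function SparseND(rows::SortedRows{K,V}) where {K,V}
        allunique(rows.index) ||
            throw(ArgumentError("indexed values must be unique"))
        new{K,V}(rows)
    end
end

# getindex works on the indexed values, not on row numbers.
function Base.getindex(s::SparseND, key...)
    r = searchsorted(s.rows.index, key)  # binary search on the sorted index
    isempty(r) && throw(KeyError(key))
    return s.rows.data[first(r)]
end

t = SortedRows([(2, 1), (1, 2), (1, 1)], [30, 20, 10])
s = SparseND(t)
s[1, 2]   # 20
```

Relational operations on the wrapper could then simply forward to `s.rows`, which is the forwarding idea mentioned above.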
Credit for some of this thinking goes to @andyferris. An advantage of these changes is that it gives things proper names, making them easier to explain.
AxisArrays vs N-d sparse data
AxisArrays could act as dense versions of 2) above. For an uncomplicated implementation and mental model, we need to figure out one main issue:
AxisArrays have lowest stride in the first dimension while current IndexedTable (future n-d sparse) has it the other way around.
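The stride mismatch can be seen with Base Julia alone. This is a sketch under one assumption: that "the other way around" refers to rows being sorted lexicographically by the index columns, so that the last index varies fastest. The variable names are illustrative.

```julia
# Dense Julia arrays are column-major: the first dimension has stride 1.
A = reshape(collect(1:6), 2, 3)
strides(A)    # (1, 2)

# Index pairs in the order a dense 2x3 array is laid out in memory:
dense_order = vec([(i, j) for i in 1:2, j in 1:3])
# [(1, 1), (2, 1), (1, 2), (2, 2), (1, 3), (2, 3)]  (first index varies fastest)

# The same index pairs sorted lexicographically, as a table sorted by its
# index columns would store them (assumption, see above):
sorted_order = sort(dense_order)
# [(1, 1), (1, 2), (1, 3), (2, 1), (2, 2), (2, 3)]  (last index varies fastest)
```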
An interesting fact is that AxisArrays-based 2) can also be thought of as a relational table indexed by some columns (i.e. 1)) and relational operations can thus be implemented.
The proposed changes:
IndexedTable name for NDSparse - this is a reversion to the old name, and arguably more accurate name.
by and with IndexedTables.jl#79
DTable to DNDSparse.
DTable as the distributed version of the new IndexedTable
cc @JeffBezanson @StefanKarpinski @ViralBShah @andyferris @andreasnoack @aviks @simonbyrne