Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: Major API changes #80

Closed
5 tasks done
shashi opened this issue Sep 26, 2017 · 16 comments
Closed
5 tasks done

RFC: Major API changes #80

shashi opened this issue Sep 26, 2017 · 16 comments

Comments

@shashi
Copy link
Collaborator

shashi commented Sep 26, 2017

After looking at people use JuliaDB in the wild, and noting the common roadblocks to mastery of the framework, it's clearer that we need to make the API more relatable for someone with a relational database background (which is most people who want to use the package).

IndexedTable vs N-d sparse structure

IndexedTable type as it stands conflates a table structure and that of an N-d sparse array. These two structures can be separated:

  1. A table in which rows are sorted by first few columns - iterated over with rows, operations make use of sortedness for speed.
  2. An N-d sparse data cube - iterated over non-indexed values

One could argue that a 1:n indexed 2) acts as a table in the traditional sense. This is true, but does not capture sorting as in 1). 2) is just a simple wrapper of 1) which defines getindex to work on the indexed values, and with that as basis defines array operations like map, reduce, broadcast, reducedim and so on. 1) may not enforce that indexed values are unique, while 2) must. By default, relational join operations on 1) join non-indexed columns based on indexed columns (as is the case now) but should be configurable (which being implemented in JuliaData/IndexedTables.jl#79). It seems fine to let 2) also be used in relational operations by just forwarding them to the wrapped 1) object.

My opinion is that we should call 2) (which is most similar to current IndexedTable) NDSparse, and name 1) IndexedTable with a deprecation step where we deprecate IndexedTable to NDSparse and then bring back IndexedTable as 1).

Credits for some of this thinking goes to @andyferris. An advantage of these changes is that it gives things proper names, making them easier to explain.

AxisArrays vs N-d sparse data

AxisArrays could act as dense versions of 2) above. For an uncomplicated implementation and mental model, we need to figure out some one main issue:

AxisArrays have lowest stride in the first dimension while current IndexedTable (future n-d sparse) has it the other way around.

An interesting fact is that AxisArrays-based 2) can also be thought of as a relational table indexed by some columns (i.e. 1)) and relational operations can thus be implemented.

The proposed changes:

  • Deprecate the current IndexedTable name for NDSparse - this is a reversion to the old name, and arguably more accurate name.
  • Bring back IndexedTable as a row-iterated table, indexed by integers which give a row as a named tuple. Some of the columns may be indexed, below is a rough sketch of what this could look like
struct Perm{X} # sortperm of sorting by a subset of columns
    columns::Vector{Int} # which columns (integer indices) are indexed
    perm::X # sortperm. OneTo(n) denotes that the table is already sorted by these columns
end

struct IndexedTable{C<:Columns}
    columns::C
    perms::Vector{Perm} # any number of these
    cardinality::Vector{Nullable{Float64}} # cached cardinality for each column, to be used in optimizations

    ## temporary buffer
    columns_buffer::C # merged into columns when flush! is called, favors first OneTo ordered perm in insertion.
end
  • Implement relational operations on this type. This would involve reusing existing implementations but exploring rows in different permute orders as in WIP: by and with IndexedTables.jl#79
  • Rename current distributed table type DTable to DNDSparse.
  • Bring back DTable as the distributed version of the new IndexedTable

cc @JeffBezanson @StefanKarpinski @ViralBShah @andyferris @andreasnoack @aviks @simonbyrne

@JeffreySarnoff
Copy link

Renaming sorts of table for the purpose of making the whole more easily understood by different kinds of users is laudable. Deciding "which gets called what" by "what fits best, best fits," is most rational when the set of names that get used are similarly understandable by good swath of their users. IndexedTable is generally approachable as a Table that carries expressive aor algorithmic efficacies through Index|es. NDSparse is not as generally approachable; it trades technical accuracy for conversational transparency. Readability invites diverse readership, so using "Table" in naming sorts of table helps. Sparsity is a trait (as we now use that word). Trait names and subtype naming address different organizing or architectural principles. Mixing them in the most user used part of an API is harder to digest.

@andyferris
Copy link
Member

Cool! Yes, to explain just one piece of why I support these changes - a "table" is typically used in the computing / data science areas as a finite relation - some collection of (named) tuples (i.e. rows). If we are going to support the relational operations (that people are used to from SQL, etc), it is most transparent and coherent to view the table as a collection of rows, and have them iterate entire rows. I feel this will help with making great APIs for the kinds of operations typically done on tables. I feel that one of the primary reasons we've shied away from iterating rows in Julia table packages in the past is because of performance concerns (eg type stability), rather than for reasons of conceptual correctness.

Of course, this shouldn't preclude "viewing" tables as 2D style objects or as more complex containers like NDSparse. So we can easily "view" the data as this interesting structure to enable compact and powerful broadcasting, etc, as appropriate. I still feel some reservations about the name NDSparse but I do appreciate what the name is trying to capture and I haven't thought of anything better. It's a bit like a kind of nested associative that uses sorting instead of hashing for lookup... NDSortedDict seems ok-ish... (I wonder if a more general abstract class of nested/ND associative container would help capture this, and enable indexed tables to use hashing where desired).

@davidanthoff
Copy link

I think this is a great move. It should also make the creation of a custom Query.jl backend that exploits the indices much easier.

@piever
Copy link
Collaborator

piever commented Sep 29, 2017

My 2 cents as a user of this package: I think this API change would be really beneficial. Here is what I believe are some issues with the current implementation which would be solved with these API changes:

  • Transforming into an IndexedTable with the @davidanthoff IterableTables framework gives a table where the first n-1 column are indices, which really is a bit arbitrary: it should ideally give a IndexedTable in the new meaning of the term
  • In tables with many columns, when one wants do do basic data manipulation, it is a bit confusing to have to use different syntaxes to select rows, depending on whether the column is also an index or not
  • Especially in the case where the user is not too worried about performance, it is a bit annoying to explicitly rearrange which columns are indices and which aren't (as in here) to perform some data manipulation.

In general, my understanding is that for the average users, data manipulations (say split-apply-combine) should just work with an easy and widely used syntax, but that it should be possible, as a user, to explicitly rearrange columns (or use some more convoluted syntax, or give some extra "sortedness" information) to maximize performance.

@xiaodaigh
Copy link

yeah i was really confused initial that i can't seem to use tbl[1:2,:] to get the first two rowz but the. realised that indexedtables work completely differently

@shashi
Copy link
Collaborator Author

shashi commented Nov 13, 2017

Thanks everyone for your inputs!

JuliaData/IndexedTables.jl#85 has come a long way. @piever, it should address all those issues!

Here's the upcoming API:
http://juliadb.org/next/api/index.html

@piever
Copy link
Collaborator

piever commented Nov 14, 2017

Amazing work! This really feels much more natural from a user perspective.

I only have a doubt wrt the discussion here at DataFrames: to summarize, I think the issue is that by in DataFrames is too general and thus difficult to optimise. My understanding is that by groups the data into subdataframes and then applies to each subdataframe a function that can return an arbitrary DataFrame and then concatenates the results. Due to such generality, it doesn't know the size of the output and can't efficiently preallocate it.

IIUC the IndexedTables equivalent of such a function used to be mapslices and is now groupby? So is the idea that groupby can take a function either returning a Table (which is like by in DataFrames) or a scalar (which can be optimized, as the output could be preallocated), and it can figure out automatically what implementation to used based on the return type of the "aggregating" function? If that's the case (which I think would be an ideal scenario as it would have both performance and ease of use) it would also be interesting to see if one can optimize the case where a Table is returned but of fixed length so that the output can still be preallocated.

@shashi
Copy link
Collaborator Author

shashi commented Nov 14, 2017

Thanks for looking over! Glad you like it!

IIUC the IndexedTables equivalent of such a function used to be mapslices and is now groupby?

mapslices is still there (but only on NDSparse). And no, groupby is not yet a substitute for it. groupby(identity, t, :foo, slect=:bar) will create a column called groups which contains SubArrays of bar. I'm planning to add a flatten function which will, given any field containing vectors (such as groups) expand the table by repeating all other fields. I'm planning to expose that as flatten=true keyword argument in groupby so that it will enable optimizations as well.

Do people expect that to be the default behavior of groupby? If so, I think we should do it right now before merging. Or is that really an artifact of database systems not being able to represent vectors efficiently in cells and provide means to work with them, and would do it if they could?

Flattening not being the default seems to be the position taken by @andyferris's SplitApplyCombine.jl https://github.com/JuliaData/SplitApplyCombine.jl#groupby-f--identity-iter and https://github.com/JuliaData/SplitApplyCombine.jl#flattena

interesting to see if one can optimize the case where a Table is returned but of fixed length so that the output can still be preallocated.

Is the most common case for this a single row table? In the current groupby if one returns a NamedTuple from the function, the named tuples will create one new column per field. This is also pretty efficient if the type can be inferred (which it usually can be). In general, I'm not a sure if that can be done automatically. f would have to take care of not allocating new tables in each call, and can instead mutate existing memory. This should work since the data will be copied to the flattened table after each call (if used with flatten=true).

@piever
Copy link
Collaborator

piever commented Nov 14, 2017

Do people expect that to be the default behavior of groupby? If so, I think we should do it right now before merging. Or is that really an artifact of database systems not being able to represent vectors efficiently in cells and provide means to work with them, and would do it if they could?

I think DataFrames flattens and Query doesn't, I'm not sure about other packages and I don't know what is the rationale.

Is the most common case for this a single row table? In the current groupby if one returns a NamedTuple from the function, the named tuples will create one new column per field. This is also pretty efficient if the type can be inferred (which it usually can be).

I'm really not an expert on this, but I would say it's mainly the one row case that is relevant.

It's probably best to look carefully at the DataFrames issue linked above and involve the relevant people in the debate (especially because both DataFrames and IndexedTables grouping APIs seem to be changing). I'm taking the liberty to ping @nalimilan as he is the author of the corresponding DataFrames issue and seems extremely knowledgeable about grouping APIs from different packages (DataFrames, Pandas, dplyr).

@nalimilan
Copy link
Member

Do people expect that to be the default behavior of groupby? If so, I think we should do it right now before merging. Or is that really an artifact of database systems not being able to represent vectors efficiently in cells and provide means to work with them, and would do it if they could?

I think it's a very common to need a flattened output, at least that's what e.g. dplyr does. A basic example is when you want to normalize a variable so that it has mean 0 within each group. You really want to get the same structure of the data (i.e. same number of rows and columns) as the output, with just one variable transformed. DataFrames provides by for this, which automatically calls combine(map(f, groupby(...))), with combine being equivalent to flatten.

The problem DataFrames has, but not JuliaDB, is that it doesn't encode column type information, so the return type of the function cannot be inferred, which kills performance. See JuliaData/DataFrames.jl#1256. We're probably going to work around this by using a special kind of sub data frame encoding column types just for this case.

@shashi
Copy link
Collaborator Author

shashi commented Nov 14, 2017

Thanks for the input Milan! Normalization is a good example. I'll see if I can do this.

@shashi shashi closed this as completed Nov 21, 2017
@piever
Copy link
Collaborator

piever commented Nov 23, 2017

@shashi I've seen the METADATA issue (#12152)[https://github.com/JuliaLang/METADATA.jl/pull/12152] and will try to port GroupedErrors in the week-end, thanks for adding the upper bound to the released version in the meantime. I still don't understand the outcome of the discussion about flattening (I may need it to port GroupedErrors). Am I correct that now groupby defaults to non-flattening? Is there a function to flatten?

@shashi
Copy link
Collaborator Author

shashi commented Nov 23, 2017

@piever I decided to put it behind a flag because it felt like conflating 2 ideas into one function.

Now there is a flatten function and flatten kwarg to groupby.

@piever
Copy link
Collaborator

piever commented Nov 23, 2017

I see, so it's been added now but is not in the released version? I'll see if it's easy to port GroupedErrors without using it, otherwise I imagine I should probably wait for the next point release of IndexedTables to update my package with the right lower bound to the dependency.

@piever
Copy link
Collaborator

piever commented Nov 23, 2017

Also, while IndexedTables.flatten works as expected, the keyword argument version behaves counter-intuitively, I'm not sure I understand what's going on here:

julia> t=table([1,1,1,2,2,2], [1,1,2,2,1,1], [1,2,3,4,5,6],
                      names=[:x,:y,:z]);

julia> groupby(identity, t, (:x, :y), select=:z, flatten = true)
Table with 4 rows, 3 columns:
x  y  identity
──────────────
1  1  [1, 2]
1  2  [3]
2  1  [5, 6]
2  2  [4]

Sorry for bugging you so much with all these issues, I hope this kind of feedback is being useful.

@shashi
Copy link
Collaborator Author

shashi commented Nov 23, 2017

Good catch! That was a bug, now fixed. Yes, you'll need to depend on v0.4.1 I've tagged the releases, someone may hit merge any minute...

Sorry for bugging you so much with all these issues, I hope this kind of feedback is being useful.

Of course! Definitely keep them coming!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants