-
Notifications
You must be signed in to change notification settings - Fork 54
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add a way to specify which columns are sorted #52
Comments
I am working on something related (WIP, but usable): https://github.com/tpapp/FunctionalTables.jl, which already supports Tables.jl (of course without ordering specs). The interesting thing is that orderings survive many operations, including filtering, adding, selecting and dropping columns (which can degrade the ordering), allowing a lot of optimizations. I think it would be best to have a query interface, eg I am using a parametrized singleton type for this (because tuples of allow very efficient and convenient code), but for a general interface, I would go with something along the lines of |
I agree the second approach sounds more natural. |
Yes, the 2nd approach sounds ideal to me too. I'd love to hear @andyferris thoughts because I know he's done a lot of related work with his TypedTables.jl and AcceleratedArrays.jl packages. |
I agree, there is nothing terrribly wrong with the second approach. You could probably implement with a trait (or schema) rather than a blessed type, though? I was exploring this flavour of things in MinimumViableTables, but I felt it had the potential to get very complex in the fully general case. AcceleratedArrays act as secondary acceleration indices on single columns only. I felt this was useful enough for all my work (and I suspect the majority of single computer in memory workloads) and simplified the seperation of concerns, making it trivial to implement new index types like spatial indexes without having to plumb that through everywhere else. But yeah I think for multi-columnar sorts you need to attach this info to the table not the columns. Like mentioned, I could connect what I have already implemented to this idea here via a trait or The only thing is I tended to bork at the idea of treating one style of acceleration (such as sorting) as special. (I feel the generic concept that is missing from my stuff but is in JuliaDB is the idea of partitioning. To deal with that, I was thinking we need a nice primary key system, where the primary key is the indices of the container. So this is going beyond Cartesian indices of unit ranges for arrays, and is more of a non-Cartesian multidimensional dictionary kind of idea, which (surprise surprise) is exactly like the NDSparse container! I’m guessing other flavours of So - eventually that kind of info you might also want in the schema, as well...). |
Actually I agree with @andyferris there should be a function to get the information about whether a table is sorted on some columns, but leave it up to each table table to support this or not. Since the table types will probably want to keep track of this information internally, it would be too bad to require wrapping the tables in a special type just to indicate that they are sorted. But this cannot be a trait, as the columns are not necessarily part of the type (though when that's the case the function will be resolved at compile time). |
Yes - instance-based “discovery function” rather than type trait. |
I also think the best way to signal what fields are sorted is with a function that takes an instance of the table. OTOH, while not mandatory, a wrapper type could be useful to implement fallbacks for partially sorted tables (esp. if we get around to writing TableOperators.jl). |
Yeah, the wrapper type can be useful for tables that wouldn't have a way to internally store additional information (a la TypedTables). |
In another issue, I've mentioned that we should probably have a |
This is a "feature request" that would allow to improve integration with IndexedTables (somehow related to #17, which it'd probably make redundant). The idea is that it'd be nice to allow a way for a table to specify that some of its columns are sorted or to be considered "primary" (i.e. default for grouping and joining operations).
I'm unsure which of the following two is the path forward:
(a=1, b=2) => (c=2, d=4)
means that columnsa
andb
are primary andc
andd
are not (currently the case in IndexedTables but it seems difficult to port the implementation, see WIP: support unwrapping #17 for a clumsy attempt)That would forward all the interface to
table
but would declare what is sorted inpkeys
.I think I'm in favor of the second approach, and then one would make this "composable" by defining in say TableOperators that sorting returns a
PartiallySortedTable
and by making sure that operation that provably preserve sorting would preservepkeys
.The text was updated successfully, but these errors were encountered: