Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a way to specify which columns are sorted #52

Closed
piever opened this issue Dec 8, 2018 · 9 comments
Closed

Add a way to specify which columns are sorted #52

piever opened this issue Dec 8, 2018 · 9 comments

Comments

@piever
Copy link

piever commented Dec 8, 2018

This is a "feature request" that would allow to improve integration with IndexedTables (somehow related to #17, which it'd probably make redundant). The idea is that it'd be nice to allow a way for a table to specify that some of its columns are sorted or to be considered "primary" (i.e. default for grouping and joining operations).

I'm unsure which of the following two is the path forward:

  • Encode this in the iterated values, meaning (a=1, b=2) => (c=2, d=4) means that columns a and b are primary and c and d are not (currently the case in IndexedTables but it seems difficult to port the implementation, see WIP: support unwrapping #17 for a clumsy attempt)
  • Create an table type with this info, say:
struct PartiallySortedTable{T, N}
    table::T
    pkeys::NTuple{N, Symbol}
end

That would forward all the interface to table but would declare what is sorted in pkeys.

I think I'm in favor of the second approach, and then one would make this "composable" by defining in say TableOperators that sorting returns a PartiallySortedTable and by making sure that operation that provably preserve sorting would preserve pkeys.

@tpapp
Copy link
Contributor

tpapp commented Dec 9, 2018

I am working on something related (WIP, but usable): https://github.com/tpapp/FunctionalTables.jl, which already supports Tables.jl (of course without ordering specs). The interesting thing is that orderings survive many operations, including filtering, adding, selecting and dropping columns (which can degrade the ordering), allowing a lot of optimizations.

I think it would be best to have a query interface, eg Tables.ordering which would return a representation of the ordering so that other interfaces which consume / work with tables could rely on it. It could default to the representation of "no ordering".

I am using a parametrized singleton type for this (because tuples of allow very efficient and convenient code), but for a general interface, I would go with something along the lines of Tables.Schema, ie a list of columns and their ordering specifications (should allow reverse ordering).

@nalimilan
Copy link
Member

I agree the second approach sounds more natural.

@quinnj
Copy link
Member

quinnj commented Dec 11, 2018

Yes, the 2nd approach sounds ideal to me too. I'd love to hear @andyferris thoughts because I know he's done a lot of related work with his TypedTables.jl and AcceleratedArrays.jl packages.

@andyferris
Copy link
Member

I agree, there is nothing terrribly wrong with the second approach. You could probably implement with a trait (or schema) rather than a blessed type, though? I was exploring this flavour of things in MinimumViableTables, but I felt it had the potential to get very complex in the fully general case.

AcceleratedArrays act as secondary acceleration indices on single columns only. I felt this was useful enough for all my work (and I suspect the majority of single computer in memory workloads) and simplified the seperation of concerns, making it trivial to implement new index types like spatial indexes without having to plumb that through everywhere else. But yeah I think for multi-columnar sorts you need to attach this info to the table not the columns. Like mentioned, I could connect what I have already implemented to this idea here via a trait or Schema style of interface, which may work out nicely.

The only thing is I tended to bork at the idea of treating one style of acceleration (such as sorting) as special.

(I feel the generic concept that is missing from my stuff but is in JuliaDB is the idea of partitioning.

To deal with that, I was thinking we need a nice primary key system, where the primary key is the indices of the container. So this is going beyond Cartesian indices of unit ranges for arrays, and is more of a non-Cartesian multidimensional dictionary kind of idea, which (surprise surprise) is exactly like the NDSparse container! I’m guessing other flavours of AbstractNDSparse could use alternative (possibly “lossy”) partioning schemes rather than sort, like hashing or spatial partitions - not sure

So - eventually that kind of info you might also want in the schema, as well...).

@nalimilan
Copy link
Member

Actually I agree with @andyferris there should be a function to get the information about whether a table is sorted on some columns, but leave it up to each table table to support this or not. Since the table types will probably want to keep track of this information internally, it would be too bad to require wrapping the tables in a special type just to indicate that they are sorted. But this cannot be a trait, as the columns are not necessarily part of the type (though when that's the case the function will be resolved at compile time).

@andyferris
Copy link
Member

Yes - instance-based “discovery function” rather than type trait.

@piever
Copy link
Author

piever commented Dec 11, 2018

I also think the best way to signal what fields are sorted is with a function that takes an instance of the table. OTOH, while not mandatory, a wrapper type could be useful to implement fallbacks for partially sorted tables (esp. if we get around to writing TableOperators.jl).

@quinnj
Copy link
Member

quinnj commented Dec 11, 2018

Yeah, the wrapper type can be useful for tables that wouldn't have a way to internally store additional information (a la TypedTables).

@quinnj
Copy link
Member

quinnj commented Feb 8, 2020

In another issue, I've mentioned that we should probably have a InMemoryTables.jl package that would allow supporting more exotic table queries like this.

@quinnj quinnj closed this as completed Feb 8, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants