Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Nesting #13

Closed
piever opened this issue Sep 5, 2018 · 7 comments
Closed

Nesting #13

piever opened this issue Sep 5, 2018 · 7 comments

Comments

@piever
Copy link

piever commented Sep 5, 2018

Sorry for opening so many issues, but I am really hoping that this package can be feature rich enough to accomodate all needs from IndexedTables.jl and StructArrays.jl. One question that I was thinking about is nesting.

Meaning, if I have "nested rows" (say [(a = (a=1, c=33), b = (c = 1, e = 14))]), then ideally one would like to have some version of columns that returns (a = (a=[1], c=[33]), b = (c = [1], e = [14])).

Would that be possible with Tables currently?

@quinnj
Copy link
Member

quinnj commented Sep 13, 2018

Currently we have:

julia> t = [(a = (a=1, c=33), b = (c = 1, e = 14))]
1-element Array{NamedTuple{(:a, :b),Tuple{NamedTuple{(:a, :c),Tuple{Int64,Int64}},NamedTuple{(:c, :e),Tuple{Int64,Int64}}}},1}:
 (a = (a = 1, c = 33), b = (c = 1, e = 14))

julia> ct = t |> columntable
(a = NamedTuple{(:a, :c),Tuple{Int64,Int64}}[(a = 1, c = 33)], b = NamedTuple{(:c, :e),Tuple{Int64,Int64}}[(c = 1, e = 14)])

julia> ct.a
1-element Array{NamedTuple{(:a, :c),Tuple{Int64,Int64}},1}:
 (a = 1, c = 33)

julia> ct.a |> columntable
(a = [1], c = [33])

So, I'm not quite sure what the question is: if we should try to do this nested unwrapping automatically? Or is the fact that this "works" good enough? Like, in your application, if you knew there would be nested rows, you'd just make sure to pivot to columns, then further pivot the resulting top-level keys.

Let me know if that makes sense and feel free to correct my understanding of what you're after here.

@piever
Copy link
Author

piever commented Sep 13, 2018

My evil plan is to use the machinery here to replace:

  1. the current implementation of collect_columns in IndexedTables (it scales poorly with many "missable" columns, whereas this one here should be fine I think)
  2. a collection mechanism for StructArrays (old one is here).

In case 1), nesting is needed because to iterate some rows that will be collected as a table with primary column we use iteration of pairs of named tuples. For example, something that iterates (name = "John", gender = "Male") => (height = 1.73, weight = 69.5) would be collected in a table that has two primary keys (:name and :gender) and two non primary keys (:height and :weight). This is quite crucial in that all query operations that will sort by some columns in JuliaDB (like groupby) will actually return a lazy iterator of pairs (first element being the primary keys, second element the result of the grouping). Currently we have special cased NamedTuples and Pairs in collect_columns to achieve this in JuliaDB.

In case 2), nesting in needed for composite types, something like:

struct A
    x::Float64
    y::Float64
end

struct B
    x::A
    y::Int
end

Then if I want to collect an Array of B into a StructArray I would probably also want to "unwrap" the A inside of B (and I wouldn't want to do in two steps as that seems inefficient).

Potentially I would like to specify which are the "types" that would cause the unwrapping (for example, for IndexedTables it's NamedTuple and Pair, for StructArrays it should be structs with more than one field). My intuition is that the implementation is not much harder than what you have already (the main tricks used in IndexedTables were that the equivalent of allocatecolumns calls itself recursively and push! to a Columns object would push! to each vector representing the Columns).

If you think that adding this kind of functionality to Tables is tricky and would complicate the design too much compared to the benefits it brings, I could try to add this functionality to StructArrays instead using tools from here and come back with explicit requests of what tools here would make that design easier and remove code duplication. What I would like to avoid is to have complete code duplication where Tables, StructArrays and IndexedTables each define their own mechanism for collecting columns that do pretty much the same thing.

@quinnj
Copy link
Member

quinnj commented Sep 13, 2018

Cool; I'm certainly open to ideas here. If you work something up in Tables.jl, I'd love to see a PR and we can see what implications there'd be.

@piever
Copy link
Author

piever commented Sep 20, 2018

Not yet a PR, but I've implemented this in StructArrays (for the vector of named tuples to named tuples of vector case) and an interesting thing that came up is that there was no need to change anything in the code other than the initialization: see here. Basically, the buildcolumns does 3 things:

  1. Allocate new columns
  2. push! values to columns (in the unknown length case)
  3. setindex! values to columns (in the known length case)

Now, for any respectable table, if we push! a NamedTuple to the table, it should do the right thing already and wouldn't be much different to pushing a scalar to a vector, so if some field of our named tuple stream is initialized as a table, the usual code that pushes one element at a time will automatically do the right thing. Same for setindex!

In the StructArray the same way as push! and setindex! work already, all I had to do was define a new "similar" (which is what's used to allocate) that can allocate new arrays in the "unwrapped" way. The API I'm choosing there is to have a unwrap = t -> false function as a keyword argument that defines which types should be unwrapped (surprisingly, this still makes the whole procedure inferrable, I'm definitely impressed with the compiler).

The main issue here is that there are no definitions for push! and setindex! on a columntable. Defining them feels a lot like type piracy but I think defining getindex on the columntable is also type piracy, right?

In short, if you're happy with the unwrap interface, and with adding a setindex! and push! methods to the columntable this is definitely doable.

@quinnj
Copy link
Member

quinnj commented Sep 20, 2018

Hmmmm, I'm having a hard time deciphering what push!/setindex! methods could be needed on columntable; it doesn't look like those were defined in the diff you linked to? Maybe do a PR to Tables.jl so it's clear what would need to change here?

@piever
Copy link
Author

piever commented Sep 20, 2018

I'm working on the PR.

To clarify what I meant, here is an example:

julia> using StructArrays

julia> sink = StructArray{ComplexF64}(undef, 0)
0-element StructArray{Complex{Float64},1,NamedTuple{(:re, :im),Tuple{Array{Float64,1},Array{Float64,1}}}}

julia> itr = (i+im for i in 1:10)
Base.Generator{UnitRange{Int64},getfield(Main, Symbol("##3#4"))}(getfield(Main, Symbol("##3#4"))(), 1:10)

julia> for el in itr
       push!(sink, el)
       end

julia> sink
10-element StructArray{Complex{Float64},1,NamedTuple{(:re, :im),Tuple{Array{Float64,1},Array{Float64,1}}}}:
  1.0 + 1.0im
  2.0 + 1.0im
  3.0 + 1.0im
  4.0 + 1.0im
  5.0 + 1.0im
  6.0 + 1.0im
  7.0 + 1.0im
  8.0 + 1.0im
  9.0 + 1.0im
 10.0 + 1.0im

thanks to this method.

This means that the code calling push! works in both scenarios. However I was optimistic in my previous remark because in StructArray case I know the type in advance so I only need to push or setindex!. Here I definitely need to have a concrete PR to make sure my intuition on the extent of the change holds up.

@quinnj
Copy link
Member

quinnj commented Feb 8, 2020

I don't think there's anything more in Tables.jl we need for this.

@quinnj quinnj closed this as completed Feb 8, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants