Tables API enhancement #131
"Abstract row type with a simple required interface: row values are accessible via `getproperty(row, field)`; for example, a NamedTuple like `nt = (a=1, b=2, c=3)` can access its value for `a` like `nt.a` which turns into a call to the function `getproperty(nt, :a)`"
abstract type Row end
"""
Tables.AbstractColumns
Can we highlight explicitly that not all column types inherit from `AbstractColumns`?
And that it thus should not be depended on for dispatch.
That it is just for the convenience of source authors (not sink authors).
(And the same for `Row`.)
Great call out.
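A small sketch of the point above, assuming a Tables.jl version that includes the `Tables.AbstractColumns` type from this PR: a plain `NamedTuple` of vectors is a valid column table, yet it does not subtype the abstract type, so a sink that dispatches on `AbstractColumns` would miss it. The variable names are illustrative only.

```julia
using Tables

# A NamedTuple of vectors is a valid column table...
nt = (a=[1, 2, 3], b=["x", "y", "z"])
is_table = Tables.istable(nt)

# ...but it does not inherit from the convenience abstract type,
# so sink code should not rely on it for dispatch.
is_abstract_columns = nt isa Tables.AbstractColumns

# Property access on a NamedTuple row lowers to a getproperty call.
row = (a=1, b=2, c=3)
same_value = row.a == getproperty(row, :a)
```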
| `Tables.getcolumn(table, nm::Symbol)` | getproperty(table, nm) | Retrieve a column by name |
| `Tables.columnnames(table)` | propertynames(table) | Return column names for a table as an indexable collection |
| Optional methods | | |
| `Tables.getcolumn(table, ::Type{T}, i::Int, nm::Symbol)` | Tables.getcolumn(table, nm) | Given a column eltype `T`, index `i`, and column name `nm`, retrieve the column. Provides a type-stable or even constant-prop-able mechanism for efficiency. |
Can we include in this PR a test suite method that takes an instance of a subtype and checks that it has a correct implementation?
The difficulty here in doing a static check of interface implementation is we have generic fallback methods for all the interface methods (so things like NamedTuples and Generators of NamedTuples just work). But I think there's something we could provide to run in your test suite to put your type through the works.
Yeah. We need to send something through it.
We can't just use `hasmethod`.
Related to this, I have a thought on interface versioning. See: #133
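One way such a runtime check might look is sketched below. `check_columns_interface` is a hypothetical helper, not part of Tables.jl; it pushes a concrete instance through `Tables.columns`, `Tables.columnnames`, and both `Tables.getcolumn` forms instead of inspecting method tables with `hasmethod`.

```julia
using Tables

# Hypothetical helper (not a Tables.jl function): exercise the Columns
# interface on a real instance and report whether the results agree.
function check_columns_interface(table)
    cols = Tables.columns(table)
    names = collect(Tables.columnnames(cols))
    for (i, nm) in enumerate(names)
        bysym = Tables.getcolumn(cols, nm)   # access by name
        byidx = Tables.getcolumn(cols, i)    # access by position
        length(bysym) == length(byidx) || return false
        all(isequal(a, b) for (a, b) in zip(bysym, byidx)) || return false
    end
    return true
end

ok_columns = check_columns_interface((a=[1, 2, 3], b=["x", "y", "z"]))
ok_rows = check_columns_interface([(a=1, b="x"), (a=2, b="y")])
```

A fuller version would presumably also iterate `Tables.rows` and compare against the column values; this sketch only covers column access.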
src/operations.jl
Outdated
getcolumn(row::TransformsRow, i::Int) = (getfunc(row, getfuncs(row), i))(getcolumn(getrow(row), i))
columnnames(row::TransformsRow) = columnnames(getrow(row))

struct Transforms{C, T, F} <: AbstractColumns
Before 1.0, should we take the chance to rename this? (Or to rename the other?)
Or remove it entirely. I know @bkamins and @nalimilan have been hammering on the select/transform API in DataFrames, so if there's a way we could generalize their excellent design work into a generic Tables.jl API, I think it'd be a great starting point for a TablesOperations.jl package or something.
Good idea.
Deleting things before 1.0 is a good way to free up real estate for redevelopment.
Maybe just mention in the docs that this isn't part of the API, and ensure it's unexported? Then we can decide what to do in the long run.
getfunc(row, nt::NamedTuple, i::Int) = i > fieldcount(typeof(nt)) ? identity : getfield(nt, i)
getfunc(row, d::Dict{String, <:Base.Callable}, i::Int) = get(d, String(columnnames(row)[i]), identity)
getfunc(row, d::Dict{Symbol, <:Base.Callable}, i::Int) = get(d, columnnames(row)[i], identity)
getfunc(row, d::Dict{Int, <:Base.Callable}, i::Int) = get(d, i, identity)
Can we drop the dispatch on `Base.Callable`?
Does it hurt at all? Or overly restrict?
It's overly restrictive.
It doesn't work with functors.
Maybe I want to pass something through a trained NN: `Flux.Chain` is callable, but not a subtype of `Base.Callable`.
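The restriction can be reproduced with Base alone. In the sketch below, `Scale` is a hypothetical functor standing in for something like a trained model; because `Base.Callable` is `Union{Function, Type}`, a `Dict` holding such an object would not match the `Dict{Symbol, <:Base.Callable}` signature.

```julia
# Hypothetical functor: a struct whose instances can be called like functions.
struct Scale
    factor::Float64
end
(s::Scale)(x) = s.factor * x

double = Scale(2.0)
result = double(3.0)   # instances are callable

# Base.Callable is Union{Function, Type}, so the functor is excluded.
functor_is_callable = double isa Base.Callable

# A Dict of functors therefore fails to match the restricted signature,
# while a Dict of ordinary functions matches it.
functor_dict_matches = Dict(:a => double) isa Dict{Symbol, <:Base.Callable}
function_dict_matches = Dict(:a => sin) isa Dict{Symbol, <:Base.Callable}
```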
If we make these changes, could we also add a "public" method taking a table and returning an iterator of Currently
@bkamins, can you clarify the use-case a little for the public namedtuple iterator? Note that the machinery exists to do type-stable iteration over generic rows already: using I just worry about publicly blessing NamedTuples too much in relation to data tasks because of the inevitable production issues people will run into with extremely large datasets, especially when we already have machinery to pretty much achieve the same benefits without.
Yes, please (strong encouragement, at least); this makes code so much more readable (and writeable) ...
I agree that for very wide tables one should not use a
and then you apply
Currently in DataFrames.jl we say that the contract for |
Codecov Report
@@ Coverage Diff @@
## master #131 +/- ##
=========================================
+ Coverage 95.87% 97.7% +1.83%
=========================================
Files 7 7
Lines 436 523 +87
=========================================
+ Hits 418 511 +93
+ Misses 18 12 -6
@bkamins, ok, I understand that use-case, but I also wonder why I should clarify that I'm not against including |
The problem is that materializing a Just a quick code (not useful, but similar things happen in practice):
Now
which is an overkill (now I see that But even given that it is faster (and allocating less to do):
than
(and iterating both seems to have the same performance later; actually - surprisingly - iterating that is why I would prefer to have |
Some smallish comments and typos.
pages=[
"Home" => "index.md",
],
repo="https://github.com/JuliaData/Tables.jl/blob/{commit}{path}#L{line}",
Is this needed? It's not set up in other packages I know, and the link works.
This is all generated from PkgTemplates.jl
Maybe we got a bit overzealous? Example.jl is much simpler: https://github.com/JuliaLang/Example.jl/blob/master/docs/make.jl
makedocs(;
modules=[Tables],
format=Documenter.HTML(),
This isn't necessary AFAIK. Same for `assets`.
@@ -0,0 +1,2 @@
[deps]
Documenter = "e30172f5-a6a5-5a46-863b-614d45cd2de4"
I think it's recommended to specify a particular version to avoid breakage. Then the Manifest isn't needed.
👍
The Manifest is useful because then you can include in it
[[Tables]]
path = ".."
uuid = "bd369af6-aec1-5ad0-b16a-f7cc5008161c"
which makes sure the docstrings are right for the dev version docs.
src/utils.jl
Outdated
```julia
vectors = [collect(col) for col in Tables.eachcolumn(Tables.columns(x))]
```
While the first definition applies to an `Row` object, the last definition simply returns an AbstractColumn iterator for a `Columns` object.
While the first definition applies to an `Row` object, the last definition simply returns an AbstractColumn iterator for a `Columns` object.
While the first definition applies to a `Row` object, the last definition simply returns an `AbstractColumn` iterator for a `Columns` object.
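Since this snippet is marked outdated, the same column collection can be sketched using only the two required interface functions discussed in this PR, `Tables.columnnames` and `Tables.getcolumn`. The sample input `x` is an assumption made for illustration.

```julia
using Tables

# Sample input (assumed for illustration): a row table of NamedTuples.
x = [(a=1, b=2.0), (a=3, b=4.0)]

# Convert to a Columns object, then pull each column out by name.
cols = Tables.columns(x)
vectors = [collect(Tables.getcolumn(cols, nm)) for nm in Tables.columnnames(cols)]
```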
src/tofromdatavalues.jl
Outdated
"""
Tables.datavaluerows(x) => NamedTuple iterator

Takes any table input `x` and returns a NamedTuple iterator
Takes any table input `x` and returns a NamedTuple iterator
Takes any table input `x` and returns a `NamedTuple` iterator
src/Tables.jl
Outdated
Check if an object has specifically defined that it is a table. Note that
not all valid tables will return true, since it's possible to satisfy the
Tables.jl interface at "run-time", e.g. a Generator of NamedTuples iterates
NamedTuples, which satisfies the Row interface, but there's no static way
NamedTuples, which satisfies the Row interface, but there's no static way
`NamedTuples`, which satisfies the `Row` interface, but there's no static way
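The run-time case described in this docstring can be sketched as follows, assuming the generic fallbacks behave as the docstring says: a `Base.Generator` of `NamedTuple`s declares nothing statically, but it can still be consumed as a table.

```julia
using Tables

# A generator of NamedTuples: no static declaration that it is a table.
gen = ((a=i, b=2i) for i in 1:3)

# The static check may not recognize it (the result is not asserted here)...
declared = Tables.istable(gen)

# ...but the fallback machinery can still build columns by iterating it.
ct = Tables.columntable(gen)
```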
src/Tables.jl
Outdated

Check whether an object has specifically defined that it implements the `Tables.rows`
function. Note that `Tables.rows` will work on any object that iterates Row-compatible
objects, even if they don't define `rowaccess`, e.g. a Generator of NamedTuples. Also
objects, even if they don't define `rowaccess`, e.g. a Generator of NamedTuples. Also
objects, even if they don't define `rowaccess`, e.g. a `Generator` of `NamedTuples`. Also
src/Tables.jl
Outdated
Tables.jl-compatible table input and make an instance of the table type. This enables "transform"
workflows that take table inputs, apply transformations, potentially converting the table to
a different form, and end with producing a table of the same type as the original input. The
default materializer is `Tables.columntable`, which converts any table input into a NamedTuple
default materializer is `Tables.columntable`, which converts any table input into a NamedTuple
default materializer is `Tables.columntable`, which converts any table input into a `NamedTuple`
src/Tables.jl
Outdated
workflows that take table inputs, apply transformations, potentially converting the table to
a different form, and end with producing a table of the same type as the original input. The
default materializer is `Tables.columntable`, which converts any table input into a NamedTuple
of Vectors.
of Vectors.
of `Vector`s.
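The "transform" workflow this docstring describes might look like the sketch below. It assumes that `Tables.materializer` returns `Tables.rowtable` for a `Vector` of `NamedTuple`s and `Tables.columntable` for a `NamedTuple` of vectors; the doubling step is purely illustrative.

```julia
using Tables

# Input in row form: a Vector of NamedTuples.
input = [(a=1, b=2.0), (a=3, b=4.0)]

# Convert to column form, apply a transformation column by column...
cols = Tables.columntable(input)
doubled = (a=cols.a .* 2, b=cols.b .* 2)

# ...and use the materializer to get back a table shaped like the input.
output = Tables.materializer(input)(doubled)

# For a NamedTuple of vectors, the materializer is the documented default.
default_sink = Tables.materializer(cols)
```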
docs/src/index.md
Outdated

### Tables.rows usage

First up, let's take a look at the [SQLite.jl]() package and how it uses the Tables.jl interface to allow loading of generic table-like data into a sqlite relational table. Here's the code:
Missing link.
fixed
@bkamins, I cleaned up |
Thank you. I will add an efficient method for this in DataFrames.jl when you release this change.
an existing row table and appends the input table source `x`
to the existing row table.
"""
function rowtable end
I suggest adding an `isrowtable` trait function (#134). It would introduce "row table" to mean something different. I'm not sure if you want to include #134 in 1.0. But can you at least remove `Tables.rowtable` from the public API? Of course, rejecting #134 and keeping `Tables.rowtable` is a reasonable choice, too.
src/namedtuples.jl
Outdated
# sink function
"""
Tables.rowtable(x) => Vector{NamedTuple}
Tables.rowtable(rt, x) => rt
I don't know enough context to understand why `rowtable(rt, x)` and `columntable(ct, x)` exist. But wouldn't it be better to introduce `tablecat(table, tables...)` or maybe even `Tables.cat(table, tables...)`?
Yeah, we should probably remove the 2 appending methods; I agree they don't really make sense.
I filed a few feature requests that may be easier to resolve before going 1.0:
|
@bkamins , just fyi, I just pushed a commit that improves |
I do not see a significant difference in my benchmarks. What is exactly your test scenario? I was generating |
Current master:
julia> using BenchmarkTools, DataFrames, Tables
[ Info: Precompiling DataFrames [a93c6f00-e57d-5684-b7b6-d8193f3e46c0]
julia> df = DataFrame(a=1:10000, b=1.0:10000.0, c=["hey" for i = 1:10000]);
julia> @btime rowtable(df);
3.102 ms (117974 allocations: 2.49 MiB)

This PR:
julia> using BenchmarkTools, DataFrames, Tables
[ Info: Precompiling DataFrames [a93c6f00-e57d-5684-b7b6-d8193f3e46c0]
julia> df = DataFrame(a=1:10000, b=1.0:10000.0, c=["hey" for i = 1:10000]);
julia> @btime rowtable(df);
1.632 ms (97975 allocations: 1.88 MiB)
Are you benchmarking against DataFrames.jl master? The issue is that we have release-pending https://github.com/JuliaData/DataFrames.jl/blob/master/src/other/tables.jl#L5 change (it seems you are benchmarking against v20.0 release). The change was due to the fact that in really wide tables we want to avoid compilation of That is why I have opened JuliaData/DataFrames.jl#2100 to make sure we are fast after you merge this PR and release Tables.jl (so I am waiting with fingers crossed to see it merged 😄 - thank you for all the efforts here). |
Sorry - I see you are benchmarking it against master of DataFrames.jl. What I mean to do in JuliaData/DataFrames.jl#2100 is to move from:
to
(i.e. when the user requests What I meant earlier with my comment is that the commit did not affect the performance of |
Merging this now, since there's a good amount of work that's built up. I still want to address #134, and allow anyone else to review who would like before making the 1.0 release. I also want to make sure the new docs get built correctly. Please comment here or open an issue for anything else you have concerns with. |
Tables in the documentation do not render correctly in section https://juliadata.github.io/Tables.jl/dev/#Implementing-the-Interface-(i.e.-becoming-a-Tables.jl-source)-1 |
Hmmm.........I tried to fix them using https://www.tablesgenerator.com/markdown_tables, but even that seems to not have them rendering. Anyone know what's going on there? |
EDIT: actually, something strange is indeed happening. I will try to investigate it. It seems that the tables that are in docstrings have the newlines removed (which in particular makes them print badly in the REPL; and in general the tables are too wide for printing on 80-character terminals). @mortenpi probably knows the right approach to use.
Ok, here's a proposal for an official 1.0 API interface for Tables.jl. This PR is actually non-breaking in the sense that everything currently relying on the Tables.jl interface and its current API definition will continue working with this code (see the unchanged, yet passing tests on this PR), but it enhances the API definitions to require a bit more in order to provide additional functionality, conveniences, and consistency. The enhanced APIs mean sources will need to enhance their implementations to be compatible with sinks that start using the new enhanced APIs.

To try and boil the changes down as succinctly as possible, we have:
- … `propertynames(x)` and `getproperty(x, nm)` to requiring `Tables.columnnames(x)` and `Tables.getcolumn(x, col::Union{Int, Symbol})`
- … `columnnames` and `getcolumn` definitions on the "Columns" interface
- … `Tables.getcolumn(x, col)` on a "Columns" object, it must return a column as an indexable collection with known length, instead of just as an iterator

The motivations for these changes include:
- … `propertynames`/`getproperty` to participate safely in the Tables interface

The biggest disadvantage, IMO, of these changes is requiring less convenient methods for the interfaces; i.e. before, you could reliably count on doing `row.col1`, but that's not strictly required anymore (though it could be strongly encouraged). The counterargument is that the API enhancements make working generically (like with generic packages) with rows/columns more convenient via consistency/functionality, and more casual users interested in a specific table/row type will probably be more aware of that type's official API (like DataFrames, or that `CSV.Row` documents that it supports property-access).

Along those lines, this PR also defines the `AbstractColumns` and `AbstractRow` abstract types, which allow custom types to inherit useful behavior in the form of default definitions using the interface methods; i.e. they automatically get indexing, property access, abstractdict interface, iteration, etc. just by relying on the `getcolumn` and `columnnames` interface methods. Hopefully that encourages types to still be functionally robust (and for the most part, more consistent) in the case that users want to use the Tables API directly.

As a final note, this PR hasn't completely updated all the docs, and hasn't implemented tests to cover the new functionality, but it's far enough along that I wanted to get feedback and people's thoughts.
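As a rough illustration of the inherited behavior described above, the sketch below defines a hypothetical `MyColumns` type that subtypes `Tables.AbstractColumns` and implements only `columnnames` and `getcolumn`. It assumes the default definitions for property access and indexing that this PR describes; because `getproperty` is then routed to `getcolumn`, the struct's own fields are read with `getfield`.

```julia
using Tables

# Hypothetical column-oriented type used only for illustration.
struct MyColumns <: Tables.AbstractColumns
    names::Vector{Symbol}
    data::Vector{Vector{Float64}}
end

# Property access is forwarded to getcolumn for AbstractColumns subtypes,
# so the struct's own fields are read with getfield.
colnames(t::MyColumns) = getfield(t, :names)
coldata(t::MyColumns) = getfield(t, :data)

# The required interface methods.
Tables.columnnames(t::MyColumns) = colnames(t)
Tables.getcolumn(t::MyColumns, i::Int) = coldata(t)[i]
Tables.getcolumn(t::MyColumns, nm::Symbol) =
    coldata(t)[findfirst(==(nm), colnames(t))]

t = MyColumns([:a, :b], [[1.0, 2.0], [3.0, 4.0]])

by_property = t.a       # inherited property access
by_name = t[:b]         # inherited indexing by name
by_position = t[1]      # inherited indexing by position
```

Sinks would still call `Tables.getcolumn` and `Tables.columnnames` directly, since not every source subtypes the abstract type.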