Tables API enhancement #131
Changes from 1 commit
0e3ea80
4cf8f0b
74e303e
7f790aa
97fb679
5de2ed5
26b7d93
6e18a29
4611bb3
cd93aab
d3bf9d5
dc438ef
3c549d4
f32bd2e
8bdd327
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,12 +1,24 @@ | ||
# Documentation: http://docs.travis-ci.com/user/languages/julia/ | ||
language: julia | ||
os: | ||
- linux | ||
- osx | ||
- windows | ||
arch: | ||
- x64 | ||
- x86 | ||
julia: | ||
- 1.0 | ||
- 1.1 | ||
- 1.3 | ||
- nightly | ||
matrix: | ||
allow_failures: | ||
- julia: nightly | ||
fast_finish: true | ||
exclude: | ||
- os: osx | ||
arch: x86 | ||
notifications: | ||
email: false | ||
after_success: | ||
- julia -e 'using Pkg; Pkg.add("Coverage"); using Coverage; Codecov.submit(process_folder())' | ||
- julia -e 'ENV["TRAVIS_JULIA_VERSION"] == "1.3" && ENV["TRAVIS_OS_NAME"] != "linux" && exit(); using Pkg; Pkg.add("Coverage"); using Coverage; Codecov.submit(Codecov.process_folder())' |
This file was deleted.
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -8,29 +8,128 @@ if !hasmethod(getproperty, Tuple{Tuple, Int}) | |
Base.getproperty(t::Tuple, i::Int) = t[i] | ||
end | ||
|
||
"Abstract row type with a simple required interface: row values are accessible via `getproperty(row, field)`; for example, a NamedTuple like `nt = (a=1, b=2, c=3)` can access its value for `a` like `nt.a` which turns into a call to the function `getproperty(nt, :a)`" | ||
abstract type Row end | ||
""" | ||
Tables.AbstractColumns | ||
|
||
Abstract type provided to allow custom table types to inherit useful and required behavior. | ||
|
||
Interface definition: | ||
| Required Methods | Default Definition | Brief Description | | ||
| ---------------- | ------------------ | ----------------- | | ||
| `Tables.getcolumn(table, i::Int)` | getfield(table, i) | Retrieve a column by index | | ||
| `Tables.getcolumn(table, nm::Symbol)` | getproperty(table, nm) | Retrieve a column by name | | ||
| `Tables.columnnames(table)` | propertynames(table) | Return column names for a table as an indexable collection | | ||
| Optional methods | | | | ||
| `Tables.getcolumn(table, ::Type{T}, i::Int, nm::Symbol)` | Tables.getcolumn(table, nm) | Given a column eltype `T`, index `i`, and column name `nm`, retrieve the column. Provides a type-stable or even constant-prop-able mechanism for efficiency. | ||
Can we include in this PR a test suite method that takes an instance of a subtype and checks that it has a correct implementation?

The difficulty in doing a static check of the interface implementation is that we have generic fallback methods for all the interface methods (so things like NamedTuples and Generators of NamedTuples just work). But I think there's something we could provide to run in your test suite to put your type through the works.

Yeah. We need to send something through it.

Related to this, I have a thought on interface versioning. See: #133
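One shape such a test-suite check could take is sketched below. This is only an illustration, not something proposed in the PR: `check_columns_interface` is a hypothetical name, and the three one-line definitions at the top are local stand-ins that copy the generic fallbacks from this diff so the snippet runs without the package. It checks the two required access paths (by index and by name) against each other for every name reported by `columnnames`.

```julia
# Local stand-ins mirroring the generic fallbacks in this diff (not the package itself).
getcolumn(x, i::Int) = getfield(x, i)
getcolumn(x, nm::Symbol) = getproperty(x, nm)
columnnames(x) = propertynames(x)

# Hypothetical helper: run an instance through the required methods and
# report whether index-based and name-based access agree for every column.
function check_columns_interface(x)
    names = columnnames(x)
    for (i, nm) in enumerate(names)
        nm isa Symbol || return false                         # names must be Symbols
        getcolumn(x, i) === getcolumn(x, nm) || return false  # same column either way
    end
    return true
end

# A NamedTuple of vectors satisfies the interface through the fallbacks alone.
ok = check_columns_interface((a = [1, 2], b = [3.0, 4.0]))
```

Because it calls the interface functions on an instance rather than inspecting method tables, a check like this is not confused by the generic fallbacks mentioned above.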
||
|
||
While custom table types aren't required to subtype `Tables.AbstractColumns`, benefits of doing so include: | ||
* Indexing interface defined (using `getcolumn`); i.e. `tbl[i]` will retrieve the column at index `i` | ||
* Property access interface defined (using `columnnames` and `getcolumn`); i.e. `tbl.col1` will retrieve column named `col1` | ||
* Iteration interface defined; i.e. `for col in table` will iterate each column in the table | ||
* A default `show` method | ||
This allows a custom table type to behave as close as possible to a builtin `NamedTuple` of vectors object. | ||
""" | ||
abstract type AbstractColumns end | ||
|
||
""" | ||
Tables.AbstractRow | ||
|
||
Abstract type provided to allow custom row types to inherit useful and required behavior. | ||
|
||
Interface definition: | ||
| Required Methods | Default Definition | Brief Description | | ||
| ---------------- | ------------------ | ----------------- | | ||
| `Tables.getcolumn(row, i::Int)` | getfield(row, i) | Retrieve a column value by index | | ||
| `Tables.getcolumn(row, nm::Symbol)` | getproperty(row, nm) | Retrieve a column value by name | | ||
| `Tables.columnnames(row)` | propertynames(row) | Return column names for a row as an indexable collection | | ||
| Optional methods | | | | ||
| `Tables.getcolumn(row, ::Type{T}, i::Int, nm::Symbol)` | Tables.getcolumn(row, nm) | Given a column type `T`, index `i`, and column name `nm`, retrieve the column value. Provides a type-stable or even constant-prop-able mechanism for efficiency. | ||
|
||
While custom row types aren't required to subtype `Tables.AbstractRow`, benefits of doing so include: | ||
* Indexing interface defined (using `getcolumn`); i.e. `row[i]` will return the column value at index `i` | ||
* Property access interface defined (using `columnnames` and `getcolumn`); i.e. `row.col1` will retrieve the value for the column named `col1` | ||
* Iteration interface defined; i.e. `for x in row` will iterate each column value in the row | ||
* A default `show` method | ||
This allows the custom row type to behave as close as possible to a builtin `NamedTuple` object. | ||
""" | ||
abstract type AbstractRow <: AbstractColumns end | ||
|
||
""" | ||
Tables.getcolumn(::Columns, nm::Symbol) => Indexable collection with known length | ||
Tables.getcolumn(::Columns, i::Int) => Indexable collection with known length | ||
Tables.getcolumn(::Columns, T, i::Int, nm::Symbol) => Indexable collection with known length | ||
|
||
Tables.getcolumn(::Row, nm::Symbol) => Column value | ||
Tables.getcolumn(::Row, i::Int) => Column value | ||
Tables.getcolumn(::Row, T, i::Int, nm::Symbol) => Column value | ||
|
||
Retrieve an entire column (`Columns`) or single row column value (`Row`) by column name (`nm`), index (`i`), | ||
or if desired, by column type (`T`), index (`i`), and name (`nm`). When called on a `Columns` interface object, | ||
a `Column` is returned, which is an indexable collection with known length. When called on a `Row` interface | ||
object, it returns the single column value. The methods taking a single `Symbol` or `Int` are both required | ||
for the `AbstractColumns` and `AbstractRow` interfaces; the third method is optional if type stability is possible. | ||
The default definition of `Tables.getcolumn(x, i::Int)` is `getfield(x, i)`. The default definition of | ||
`Tables.getcolumn(x, nm::Symbol)` is `getproperty(x, nm)`. | ||
""" | ||
function getcolumn end | ||
|
||
getcolumn(x, i::Int) = getfield(x, i) | ||
getcolumn(x, nm::Symbol) = getproperty(x, nm) | ||
getcolumn(x, ::Type{T}, i::Int, nm::Symbol) where {T} = getcolumn(x, nm) | ||
getcolumn(x::NamedTuple{names, types}, ::Type{T}, i::Int, nm::Symbol) where {names, types, T} = Core.getfield(x, i) | ||
|
||
""" | ||
Tables.columnnames(::Union{Columns, Row}) => Indexable collection | ||
|
||
Retrieves the list of column names as an indexable collection (like a `Tuple` or `Vector`) for a `Columns` or `Row` interface object. The default definition calls `propertynames(x)`. | ||
""" | ||
function columnnames end | ||
|
||
columnnames(x) = propertynames(x) | ||
|
||
Base.IteratorSize(::Type{R}) where {R <: AbstractColumns} = Base.HasLength() | ||
Base.length(r::AbstractColumns) = length(columnnames(r)) | ||
Base.firstindex(r::AbstractColumns) = 1 | ||
Base.lastindex(r::AbstractColumns) = length(r) | ||
Base.getindex(r::AbstractColumns, i::Int) = getcolumn(r, i) | ||
Base.getindex(r::AbstractColumns, nm::Symbol) = getcolumn(r, nm) | ||
Base.getproperty(r::AbstractColumns, nm::Symbol) = getcolumn(r, nm) | ||
Base.getproperty(r::AbstractColumns, i::Int) = getcolumn(r, i) | ||
Base.propertynames(r::AbstractColumns) = columnnames(r) | ||
Base.keys(r::AbstractColumns) = columnnames(r) | ||
Base.values(r::AbstractColumns) = collect(r) | ||
Base.haskey(r::AbstractColumns, key::Union{Integer, Symbol}) = key in columnnames(r) | ||
Base.get(r::AbstractColumns, key::Union{Integer, Symbol}, default) = haskey(r, key) ? getcolumn(r, key) : default | ||
Base.get(f::Base.Callable, r::AbstractColumns, key::Union{Integer, Symbol}) = haskey(r, key) ? getcolumn(r, key) : f() | ||
Base.@propagate_inbounds Base.iterate(r::AbstractColumns, i=1) = i > length(r) ? nothing : (getcolumn(r, i), i + 1) | ||
|
||
function Base.show(io::IO, x::T) where {T <: AbstractColumns} | ||
println(io, "$T:") | ||
names = collect(columnnames(x)) | ||
values = [getcolumn(x, nm) for nm in names] | ||
Base.print_matrix(io, hcat(names, values)) | ||
end | ||
|
||
""" | ||
The Tables.jl package provides simple, yet powerful interface functions for working with all kinds tabular data through predictable access patterns. | ||
The Tables.jl package provides simple, yet powerful interface functions for working with all kinds of tabular data through predictable access patterns. | ||
|
||
```julia | ||
Tables.rows(table) => Rows | ||
Tables.rows(table) => Row iterator (also known as a Rows object) | ||
Tables.columns(table) => Columns | ||
``` | ||
Where `Rows` and `Columns` are the duals of each other: | ||
* `Rows` is an iterator of property-accessible objects (any type that supports `propertynames(row)` and `getproperty(row, nm::Symbol`) | ||
* `Columns` is a property-accessible object of iterators (i.e. each column is an iterator) | ||
Where `Row` and `Columns` are objects that support a common interface: | ||
* `Tables.getcolumn(x, col::Union{Int, Symbol})`: Retrieve an entire column (`Columns`), or single column value (`Row`) by column index (as an `Int`), or by column name (as a `Symbol`) | ||
* `Tables.columnnames(x)`: Retrieve the possible column names for a `Row` or `Columns` object | ||
|
||
In addition to these `Rows` and `Columns` objects, it's useful to be able to query properties of these objects: | ||
In addition to these `Row` and `Columns` objects, it's useful to be able to query properties of these objects: | ||
* `Tables.schema(x::Union{Rows, Columns}) => Union{Tables.Schema, Nothing}`: returns a `Tables.Schema` object, or `nothing` if the table's schema is unknown | ||
* For the `Tables.Schema` object: | ||
* column names can be accessed as a tuple of Symbols like `sch.names` | ||
* column types can be accessed as a tuple of types like `sch.types` | ||
* See `?Tables.Schema` for more details on this type | ||
|
||
A big part of the power in these simple interface functions is that each (`Tables.rows` & `Tables.columns`) is defined for any table type, even if the table type only explicitly implements one interface function or the other. | ||
This is accomplished by providing performant, generic fallback definitions in Tables.jl itself (though obviously nothing prevents a table type from implementing each interface function directly). | ||
This is accomplished by providing performant, generic fallback definitions in Tables.jl itself (though obviously nothing prevents a table type from implementing each interface function directly if so desired). | ||
|
||
With these simple definitions, powerful workflows are enabled: | ||
* A package providing data cleansing, manipulation, visualization, or analysis can automatically handle any number of decoupled input table types | ||
|
@@ -173,7 +272,7 @@ include("operations.jl") | |
include("matrix.jl") | ||
|
||
"Return the column index (1-based) of a `colname` in a table with a known schema; returns 0 if `colname` doesn't exist in table" | ||
columnindex(table, colname) = columnindex(schema(table).names, colname) | ||
columnindex(table, colname) = columnindex(schema(table), colname) | ||
|
||
"Return the column type of a `colname` in a table with a known schema; returns Union{} if `colname` doesn't exist in table" | ||
columntype(table, colname) = columntype(schema(table), colname) | ||
|
Original file line number | Diff line number | Diff line change | ||||
---|---|---|---|---|---|---|
@@ -1,23 +1,30 @@ | ||||||
## generic `Tables.rows` and `Tables.columns` fallbacks | ||||||
## if a table provides Tables.rows or Tables.columns, | ||||||
## we'll provide a default implementation of the dual | ||||||
## we'll provide a default implementation of the other | ||||||
|
||||||
# generic row iteration of columns | ||||||
# for Columns objects, we define a generic RowIterator wrapper to turn any Columns into a Rows | ||||||
|
||||||
# get the number of rows in the incoming table | ||||||
function rowcount(cols) | ||||||
props = propertynames(cols) | ||||||
isempty(props) && return 0 | ||||||
return length(getproperty(cols, props[1])) | ||||||
names = columnnames(cols) | ||||||
isempty(names) && return 0 | ||||||
return length(getcolumn(cols, names[1])) | ||||||
end | ||||||
|
||||||
struct ColumnsRow{T} | ||||||
# a lazy row view into a Columns object | ||||||
struct ColumnsRow{T} <: AbstractRow | ||||||
columns::T # a `Columns` object | ||||||
row::Int | ||||||
row::Int # row number | ||||||
end | ||||||
|
||||||
Base.getproperty(c::ColumnsRow, ::Type{T}, col::Int, nm::Symbol) where {T} = getproperty(getfield(c, 1), T, col, nm)[getfield(c, 2)] | ||||||
Base.getproperty(c::ColumnsRow, nm::Int) = getproperty(getfield(c, 1), nm)[getfield(c, 2)] | ||||||
Base.getproperty(c::ColumnsRow, nm::Symbol) = getproperty(getfield(c, 1), nm)[getfield(c, 2)] | ||||||
Base.propertynames(c::ColumnsRow) = propertynames(getfield(c, 1)) | ||||||
getcolumns(c::ColumnsRow) = getfield(c, :columns) | ||||||
getrow(c::ColumnsRow) = getfield(c, :row) | ||||||
|
||||||
# AbstractRow interface | ||||||
Base.@propagate_inbounds getcolumn(c::ColumnsRow, ::Type{T}, col::Int, nm::Symbol) where {T} = getcolumn(getcolumns(c), T, col, nm)[getrow(c)] | ||||||
Base.@propagate_inbounds getcolumn(c::ColumnsRow, i::Int) = getcolumn(getcolumns(c), i)[getrow(c)] | ||||||
Base.@propagate_inbounds getcolumn(c::ColumnsRow, nm::Symbol) = getcolumn(getcolumns(c), nm)[getrow(c)] | ||||||
columnnames(c::ColumnsRow) = columnnames(getcolumns(c)) | ||||||
|
||||||
@generated function Base.isless(c::ColumnsRow{T}, d::ColumnsRow{T}) where {T <: NamedTuple{names}} where names | ||||||
exprs = Expr[] | ||||||
|
@@ -46,16 +53,19 @@ end | |||||
Expr(:block, exprs...) | ||||||
end | ||||||
|
||||||
# RowIterator wraps a Columns object and provides row iteration via lazy row views | ||||||
struct RowIterator{T} | ||||||
columns::T | ||||||
len::Int | ||||||
end | ||||||
|
||||||
Base.eltype(x::RowIterator{T}) where {T} = ColumnsRow{T} | ||||||
Base.length(x::RowIterator) = x.len | ||||||
istable(::Type{<:RowIterator}) = true | ||||||
rowaccess(::Type{<:RowIterator}) = true | ||||||
rows(x::RowIterator) = x | ||||||
columnaccess(::Type{<:RowIterator{T}}) where T = columnaccess(T) | ||||||
|
||||||
columnaccess(::Type{<:RowIterator}) = true | ||||||
columns(x::RowIterator) = x.columns | ||||||
materializer(x::RowIterator) = materializer(x.columns) | ||||||
schema(x::RowIterator) = schema(x.columns) | ||||||
|
@@ -65,21 +75,29 @@ function Base.iterate(rows::RowIterator, st=1) | |||||
return ColumnsRow(rows.columns, st), st + 1 | ||||||
end | ||||||
|
||||||
# this is our generic Tables.rows fallback definition | ||||||
function rows(x::T) where {T} | ||||||
# because this method is being called, we know `x` didn't define its own Tables.rows | ||||||
# first check if it supports column access, and if so, wrap it in a RowIterator | ||||||
if columnaccess(T) | ||||||
cols = columns(x) | ||||||
return RowIterator(cols, Int(rowcount(cols))) | ||||||
# otherwise, if the input is at least iterable, we'll wrap it in an IteratorWrapper | ||||||
# which will iterate the input, validating that it supports the AbstractRow interface | ||||||
# and unwrapping any DataValues that are encountered | ||||||
elseif IteratorInterfaceExtensions.isiterable(x) | ||||||
return nondatavaluerows(x) | ||||||
end | ||||||
throw(ArgumentError("no default `Tables.rows` implementation for type: $T")) | ||||||
end | ||||||
|
||||||
# build columns from rows | ||||||
# for Rows objects, we define a "collect"-like routine to build up columns from iterated rows | ||||||
|
||||||
""" | ||||||
Tables.allocatecolumn(::Type{T}, len) => returns a column type (usually AbstractVector) w/ size to hold `len` elements | ||||||
Tables.allocatecolumn(::Type{T}, len) => returns a column type (usually AbstractVector) with size to hold `len` elements | ||||||
|
||||||
Custom column types can override with an appropriate "scalar" element type that should dispatch to their column allocator. | ||||||
Alternatively, and more generally, custom scalars can overload `DataAPI.defaultarray` to signal the default array type | ||||||
|
||||||
""" | ||||||
allocatecolumn(T, len) = DataAPI.defaultarray(T, 1)(undef, len) | ||||||
|
||||||
|
@@ -131,11 +149,20 @@ function __buildcolumns(rowitr, st, sch, columns, rownbr, updated) | |||||
row, st = state | ||||||
rownbr += 1 | ||||||
eachcolumns(add_or_widen!, sch, row, columns, rownbr, updated, Base.IteratorSize(rowitr)) | ||||||
# little explanation here: we just called add_or_widen! for each column value of our row | ||||||
# note that when a column's type is widened, `updated` is set w/ the new set of columns | ||||||
# we then check if our current `columns` isn't the same object as our `updated` ref | ||||||
# if it isn't, we're going to call __buildcolumns again, passing our new updated ref as | ||||||
# columns, which allows __buildcolumns to specialize (i.e. recompile) based on the new types | ||||||
# of updated. So a new __buildcolumns will be compiled for each widening event. | ||||||
columns !== updated[] && return __buildcolumns(rowitr, st, sch, updated[], rownbr, updated) | ||||||
end | ||||||
return updated | ||||||
end | ||||||
|
||||||
# for the schema-less case, we do one extra step of initializing each column as an `EmptyVector` | ||||||
# and doing an initial widening for each column in _buildcolumns, before passing the widened | ||||||
# set of columns on to __buildcolumns | ||||||
struct EmptyVector <: AbstractVector{Union{}} | ||||||
len::Int | ||||||
end | ||||||
|
@@ -153,14 +180,20 @@ end | |||||
state = iterate(rowitr) | ||||||
state === nothing && return NamedTuple() | ||||||
row, st = state | ||||||
names = Tuple(propertynames(row)) | ||||||
names = Tuple(columnnames(row)) | ||||||
len = Base.haslength(T) ? length(rowitr) : 0 | ||||||
sch = Schema(names, nothing) | ||||||
columns = Tuple(EmptyVector(len) for _ = 1:length(names)) | ||||||
return NamedTuple{Base.map(Symbol, names)}(_buildcolumns(rowitr, row, st, sch, columns, Ref{Any}(columns))[]) | ||||||
end | ||||||
|
||||||
struct CopiedColumns{T} | ||||||
# for some sinks, there's a concern about whether they can safely "own" columns from the input | ||||||
# to be safe, they should always copy input columns, to avoid unintended mutation. | ||||||
# when we've called buildcolumns, however, Tables.jl essentially built/owns the columns, | ||||||
# and it's happy to pass ownership to the sink. Thus, any built columns will be wrapped | ||||||
# in a CopiedColumns struct to signal to the sink that essentially "a copy has already been made" | ||||||
# and they're safe to assume ownership | ||||||
struct CopiedColumns{T} <: AbstractColumns | ||||||
x::T | ||||||
end | ||||||
|
||||||
|
@@ -170,15 +203,25 @@ columnaccess(::Type{<:CopiedColumns}) = true | |||||
columns(x::CopiedColumns) = x | ||||||
schema(x::CopiedColumns) = schema(source(x)) | ||||||
materializer(x::CopiedColumns) = materializer(source(x)) | ||||||
Base.propertynames(x::CopiedColumns) = propertynames(source(x)) | ||||||
Base.getproperty(x::CopiedColumns, nm::Symbol) = getproperty(source(x), nm) | ||||||
|
||||||
getcolumn(x::CopiedColumns, ::Type{T}, col::Int, nm::Symbol) where {T} = getcolumn(source(x), T, col, nm) | ||||||
getcolumn(x::CopiedColumns, i::Int) = getcolumn(source(x), i) | ||||||
getcolumn(x::CopiedColumns, nm::Symbol) = getcolumn(source(x), nm) | ||||||
columnnames(x::CopiedColumns) = columnnames(source(x)) | ||||||
|
||||||
# here's our generic fallback Tables.columns definition | ||||||
@inline function columns(x::T) where {T} | ||||||
# because this method is being called, we know `x` didn't define its own Tables.columns method | ||||||
# first check if it supports row access, and if so, build up the desired columns | ||||||
if rowaccess(T) | ||||||
r = rows(x) | ||||||
return CopiedColumns(buildcolumns(schema(r), r)) | ||||||
# though not widely supported, if a source supports the TableTraits column interface, use it | ||||||
elseif TableTraits.supports_get_columns_copy_using_missing(x) | ||||||
return CopiedColumns(TableTraits.get_columns_copy_using_missing(x)) | ||||||
# otherwise, if the source is at least iterable, we'll wrap it in an IteratorWrapper and | ||||||
# build columns from that, which will check if the source correctly iterates AbstractRows | ||||||
# and unwraps DataValues for us | ||||||
elseif IteratorInterfaceExtensions.isiterable(x) | ||||||
iw = nondatavaluerows(x) | ||||||
return CopiedColumns(buildcolumns(schema(iw), iw)) | ||||||
|
Can we explicitly highlight that not all column types inherit from `AbstractColumns`, and that it should therefore not be depended on for dispatch? It is just for the convenience of source authors (not sink authors). (And the same for `AbstractRow`.)

Great call out.
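A minimal sketch of what this means for a sink author, under stated assumptions: `columnlengths` is a hypothetical helper, and the three definitions at the top are local stand-ins copying the generic fallbacks from this diff so the snippet runs on its own. The helper takes an untyped argument and uses only `getcolumn` and `columnnames`, so it works for inputs that do not subtype `AbstractColumns`.

```julia
# Local stand-ins mirroring the generic fallbacks in this diff (not the package itself).
getcolumn(x, i::Int) = getfield(x, i)
getcolumn(x, nm::Symbol) = getproperty(x, nm)
columnnames(x) = propertynames(x)

# Hypothetical sink helper: no `x::AbstractColumns` annotation, so any object
# that supports the interface functions is accepted.
function columnlengths(x)
    return [nm => length(getcolumn(x, nm)) for nm in columnnames(x)]
end

# A plain NamedTuple of vectors is a valid input even though it subtypes nothing special.
nt = (a = [1, 2, 3], b = ["x", "y", "z"])
lens = columnlengths(nt)
```

Had `columnlengths` been declared as `columnlengths(x::AbstractColumns)`, the `NamedTuple` input above would raise a `MethodError`, which is the failure mode the comment warns about.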