Tables API enhancement #131

Merged: 15 commits, Feb 8, 2020
16 changes: 14 additions & 2 deletions .travis.yml
@@ -1,12 +1,24 @@
# Documentation: http://docs.travis-ci.com/user/languages/julia/
language: julia
os:
- linux
- osx
- windows
arch:
- x64
- x86
julia:
- 1.0
- 1.1
- 1.3
- nightly
matrix:
allow_failures:
- julia: nightly
fast_finish: true
exclude:
- os: osx
arch: x86
notifications:
email: false
after_success:
- julia -e 'using Pkg; Pkg.add("Coverage"); using Coverage; Codecov.submit(process_folder())'
- julia -e 'ENV["TRAVIS_JULIA_VERSION"] == "1.3" && ENV["TRAVIS_OS_NAME"] != "linux" && exit(); using Pkg; Pkg.add("Coverage"); using Coverage; Codecov.submit(Codecov.process_folder())'
31 changes: 0 additions & 31 deletions appveyor.yml

This file was deleted.

119 changes: 109 additions & 10 deletions src/Tables.jl
@@ -8,29 +8,128 @@ if !hasmethod(getproperty, Tuple{Tuple, Int})
Base.getproperty(t::Tuple, i::Int) = t[i]
end

"Abstract row type with a simple required interface: row values are accessible via `getproperty(row, field)`; for example, a NamedTuple like `nt = (a=1, b=2, c=3)` can access its value for `a` like `nt.a` which turns into a call to the function `getproperty(nt, :a)`"
abstract type Row end
"""
Tables.AbstractColumns
Review comment:

Can we explicitly highlight that not all column types inherit from `AbstractColumns`,
and that it thus should not be depended on for dispatch?
It is just for the convenience of source authors (not sink authors).
(And the same for `Row`.)

Reply (Member, Author):
Great call out.


Abstract type provided to allow custom table types to inherit useful and required behavior.

Interface definition:
| Required Methods | Default Definition | Brief Description |
| ---------------- | ------------------ | ----------------- |
| `Tables.getcolumn(table, i::Int)` | getfield(table, i) | Retrieve a column by index |
| `Tables.getcolumn(table, nm::Symbol)` | getproperty(table, nm) | Retrieve a column by name |
| `Tables.columnnames(table)` | propertynames(table) | Return column names for a table as an indexable collection |
| Optional methods | | |
| `Tables.getcolumn(table, ::Type{T}, i::Int, nm::Symbol)` | Tables.getcolumn(table, nm) | Given a column eltype `T`, index `i`, and column name `nm`, retrieve the column. Provides a type-stable or even constant-prop-able mechanism for efficiency. |
Review comment:

Can we include in this PR a test suite method that takes an instance of a subtype and checks that it has a correct implementation?

Reply (Member, Author):
The difficulty here in doing a static check of interface implementation is we have generic fallback methods for all the interface methods (so things like NamedTuples and Generators of NamedTuples just work). But I think there's something we could provide to run in your test suite to put your type through the works.

Reply:

Yeah. We need to send something through it.
We can't just use `hasmethod`.

Reply (Contributor):
Related to this, I have a thought on interface versioning. See: #133
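
One way to picture the kind of runtime check discussed in this thread is sketched below. It is not part of the PR: `exercise_interface` and `Swapped` are hypothetical names, and the three one-line definitions at the top are local stand-ins that mirror the generic fallbacks in this diff rather than the Tables.jl functions themselves. Because the fallbacks exist for every type, the sketch pushes an actual instance through the interface instead of asking `hasmethod`.

```julia
# Local stand-ins mirroring the generic fallbacks in this diff (assumption:
# these are NOT the Tables.jl functions, just same-shaped definitions).
getcolumn(x, i::Int) = getfield(x, i)
getcolumn(x, nm::Symbol) = getproperty(x, nm)
columnnames(x) = propertynames(x)

# Hypothetical helper: exercise the required interface on an instance.
# Returns true if every check passes and throws otherwise.
function exercise_interface(x)
    names = columnnames(x)
    for (i, nm) in enumerate(names)
        nm isa Symbol || error("column name $(repr(nm)) is not a Symbol")
        # access by index and access by name must agree
        isequal(getcolumn(x, i), getcolumn(x, nm)) ||
            error("getcolumn(x, $i) and getcolumn(x, :$nm) disagree")
    end
    return true
end

# A NamedTuple of vectors satisfies the interface through the fallbacks alone.
ok_namedtuple = exercise_interface((a = [1, 2], b = ["x", "y"]))

# A hypothetical type whose declared names do not match its field order
# is caught by the index/name agreement check.
struct Swapped
    a::Vector{Int}
    b::Vector{String}
end
columnnames(::Swapped) = (:b, :a)

swapped_fails = try
    exercise_interface(Swapped([1, 2], ["x", "y"]))
    false
catch
    true
end
```

A real test-suite helper would probably also cover the optional four-argument `getcolumn` method and iteration; this sketch only checks the two required access paths against each other.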


While custom table types aren't required to subtype `Tables.AbstractColumns`, benefits of doing so include:
* Indexing interface defined (using `getcolumn`); i.e. `tbl[i]` will retrieve the column at index `i`
* Property access interface defined (using `columnnames` and `getcolumn`); i.e. `tbl.col1` will retrieve column named `col1`
* Iteration interface defined; i.e. `for col in table` will iterate each column in the table
* A default `show` method
This allows a custom table type to behave as close as possible to a builtin `NamedTuple` of vectors object.
"""
abstract type AbstractColumns end

"""
Tables.AbstractRow

Abstract type provided to allow custom row types to inherit useful and required behavior.

Interface definition:
| Required Methods | Default Definition | Brief Description |
| ---------------- | ------------------ | ----------------- |
| `Tables.getcolumn(row, i::Int)` | getfield(row, i) | Retrieve a column value by index |
| `Tables.getcolumn(row, nm::Symbol)` | getproperty(row, nm) | Retrieve a column value by name |
| `Tables.columnnames(row)` | propertynames(row) | Return column names for a row as an indexable collection |
| Optional methods | | |
| `Tables.getcolumn(row, ::Type{T}, i::Int, nm::Symbol)` | Tables.getcolumn(row, nm) | Given a column type `T`, index `i`, and column name `nm`, retrieve the column value. Provides a type-stable or even constant-prop-able mechanism for efficiency. |

While custom row types aren't required to subtype `Tables.AbstractRow`, benefits of doing so include:
* Indexing interface defined (using `getcolumn`); i.e. `row[i]` will return the column value at index `i`
* Property access interface defined (using `columnnames` and `getcolumn`); i.e. `row.col1` will retrieve the value for the column named `col1`
* Iteration interface defined; i.e. `for x in row` will iterate each column value in the row
* A default `show` method
This allows the custom row type to behave as close as possible to a builtin `NamedTuple` object.
"""
abstract type AbstractRow <: AbstractColumns end

"""
Tables.getcolumn(::Columns, nm::Symbol) => Indexable collection with known length
Tables.getcolumn(::Columns, i::Int) => Indexable collection with known length
Tables.getcolumn(::Columns, T, i::Int, nm::Symbol) => Indexable collection with known length

Tables.getcolumn(::Row, nm::Symbol) => Column value
Tables.getcolumn(::Row, i::Int) => Column value
Tables.getcolumn(::Row, T, i::Int, nm::Symbol) => Column value

Retrieve an entire column (`Columns`) or single row column value (`Row`) by column name (`nm`), index (`i`),
or if desired, by column type (`T`), index (`i`), and name (`nm`). When called on a `Columns` interface object,
a `Column` is returned, which is an indexable collection with known length. When called on a `Row` interface
object, it returns the single column value. The methods taking a single `Symbol` or `Int` are both required
for the `AbstractColumns` and `AbstractRow` interfaces; the third method is optional if type stability is possible.
The default definition of `Tables.getcolumn(x, i::Int)` is `getfield(x, i)`. The default definition of
`Tables.getcolumn(x, nm::Symbol)` is `getproperty(x, nm)`.
"""
function getcolumn end

getcolumn(x, i::Int) = getfield(x, i)
getcolumn(x, nm::Symbol) = getproperty(x, nm)
getcolumn(x, ::Type{T}, i::Int, nm::Symbol) where {T} = getcolumn(x, nm)
getcolumn(x::NamedTuple{names, types}, ::Type{T}, i::Int, nm::Symbol) where {names, types, T} = Core.getfield(x, i)

"""
Tables.columnnames(::Union{Columns, Row}) => Indexable collection

Retrieves the list of column names as an indexable collection (like a `Tuple` or `Vector`) for a `Columns` or `Row` interface object. The default definition calls `propertynames(x)`.
"""
function columnnames end

columnnames(x) = propertynames(x)

Base.IteratorSize(::Type{R}) where {R <: AbstractColumns} = Base.HasLength()
Base.length(r::AbstractColumns) = length(columnnames(r))
Base.firstindex(r::AbstractColumns) = 1
Base.lastindex(r::AbstractColumns) = length(r)
Base.getindex(r::AbstractColumns, i::Int) = getcolumn(r, i)
Base.getindex(r::AbstractColumns, nm::Symbol) = getcolumn(r, nm)
Base.getproperty(r::AbstractColumns, nm::Symbol) = getcolumn(r, nm)
Base.getproperty(r::AbstractColumns, i::Int) = getcolumn(r, i)
Base.propertynames(r::AbstractColumns) = columnnames(r)
Base.keys(r::AbstractColumns) = columnnames(r)
Base.values(r::AbstractColumns) = collect(r)
Base.haskey(r::AbstractColumns, key::Union{Integer, Symbol}) = key in columnnames(r)
Base.get(r::AbstractColumns, key::Union{Integer, Symbol}, default) = haskey(r, key) ? getcolumn(r, key) : default
Base.get(f::Base.Callable, r::AbstractColumns, key::Union{Integer, Symbol}) = haskey(r, key) ? getcolumn(r, key) : f()
Base.@propagate_inbounds Base.iterate(r::AbstractColumns, i=1) = i > length(r) ? nothing : (getcolumn(r, i), i + 1)

function Base.show(io::IO, x::T) where {T <: AbstractColumns}
println(io, "$T:")
names = collect(columnnames(x))
values = [getcolumn(x, nm) for nm in names]
Base.print_matrix(io, hcat(names, values))
end

"""
The Tables.jl package provides simple, yet powerful interface functions for working with all kinds tabular data through predictable access patterns.
The Tables.jl package provides simple, yet powerful interface functions for working with all kinds of tabular data through predictable access patterns.

```julia
Tables.rows(table) => Rows
Tables.rows(table) => Row iterator (also known as a Rows object)
Tables.columns(table) => Columns
```
Where `Rows` and `Columns` are the duals of each other:
* `Rows` is an iterator of property-accessible objects (any type that supports `propertynames(row)` and `getproperty(row, nm::Symbol)`)
* `Columns` is a property-accessible object of iterators (i.e. each column is an iterator)
Where `Row` and `Columns` are objects that support a common interface:
* `Tables.getcolumn(x, col::Union{Int, Symbol})`: Retrieve an entire column (`Columns`), or single column value (`Row`) by column index (as an `Int`), or by column name (as a `Symbol`)
* `Tables.columnnames(x)`: Retrieve the possible column names for a `Row` or `Columns` object

In addition to these `Rows` and `Columns` objects, it's useful to be able to query properties of these objects:
In addition to these `Row` and `Columns` objects, it's useful to be able to query properties of these objects:
* `Tables.schema(x::Union{Rows, Columns}) => Union{Tables.Schema, Nothing}`: returns a `Tables.Schema` object, or `nothing` if the table's schema is unknown
* For the `Tables.Schema` object:
* column names can be accessed as a tuple of Symbols like `sch.names`
* column types can be accessed as a tuple of types like `sch.types`
* See `?Tables.Schema` for more details on this type

A big part of the power in these simple interface functions is that each (`Tables.rows` & `Tables.columns`) is defined for any table type, even if the table type only explicitly implements one interface function or the other.
This is accomplished by providing performant, generic fallback definitions in Tables.jl itself (though obviously nothing prevents a table type from implementing each interface function directly).
This is accomplished by providing performant, generic fallback definitions in Tables.jl itself (though obviously nothing prevents a table type from implementing each interface function directly if so desired).

With these simple definitions, powerful workflows are enabled:
* A package providing data cleansing, manipulation, visualization, or analysis can automatically handle any number of decoupled input table types
@@ -173,7 +272,7 @@ include("operations.jl")
include("matrix.jl")

"Return the column index (1-based) of a `colname` in a table with a known schema; returns 0 if `colname` doesn't exist in table"
columnindex(table, colname) = columnindex(schema(table).names, colname)
columnindex(table, colname) = columnindex(schema(table), colname)

"Return the column type of a `colname` in a table with a known schema; returns Union{} if `colname` doesn't exist in table"
columntype(table, colname) = columntype(schema(table), colname)
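
To make the `AbstractColumns` docstring above more concrete, here is a minimal, self-contained sketch of what a source author gains by subtyping. It does not load Tables.jl: the abstract type, `getcolumn`, `columnnames`, and the handful of `Base` methods are re-declared locally in the same shape as the diff, and `DictColumns` is a hypothetical table type invented for illustration.

```julia
# Local re-declarations in the shape of this diff (assumption: stand-ins only).
abstract type AbstractColumns end

getcolumn(x, i::Int) = getfield(x, i)
getcolumn(x, nm::Symbol) = getproperty(x, nm)
columnnames(x) = propertynames(x)

# Behavior inherited by every subtype, defined purely via the interface.
Base.length(c::AbstractColumns) = length(columnnames(c))
Base.getindex(c::AbstractColumns, i::Int) = getcolumn(c, i)
Base.getproperty(c::AbstractColumns, nm::Symbol) = getcolumn(c, nm)
Base.propertynames(c::AbstractColumns) = columnnames(c)
Base.iterate(c::AbstractColumns, i=1) = i > length(c) ? nothing : (getcolumn(c, i), i + 1)

# Hypothetical custom table: columns stored in a Dict, order kept separately.
struct DictColumns <: AbstractColumns
    names::Vector{Symbol}
    data::Dict{Symbol, Vector}
end

# The three required methods. `getfield` is used for the struct's own fields
# because `getproperty` is now routed to `getcolumn`.
getcolumn(c::DictColumns, i::Int) = getfield(c, :data)[getfield(c, :names)[i]]
getcolumn(c::DictColumns, nm::Symbol) = getfield(c, :data)[nm]
columnnames(c::DictColumns) = getfield(c, :names)

tbl = DictColumns([:a, :b], Dict{Symbol, Vector}(:a => [1, 2, 3], :b => ["x", "y", "z"]))

first_column = tbl[1]                         # indexing via getcolumn(tbl, 1)
b_column = tbl.b                              # property access via getcolumn(tbl, :b)
column_lengths = [length(col) for col in tbl] # iteration over columns
```

Note the trade-off visible in the sketch: once property access means column access, the struct's own fields are only reachable through `getfield`, which is why the diff's `ColumnsRow` adds `getcolumns` and `getrow` accessors.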
79 changes: 61 additions & 18 deletions src/fallbacks.jl
@@ -1,23 +1,30 @@
## generic `Tables.rows` and `Tables.columns` fallbacks
## if a table provides Tables.rows or Tables.columns,
## we'll provide a default implementation of the dual
## we'll provide a default implementation of the other

# generic row iteration of columns
# for Columns objects, we define a generic RowIterator wrapper to turn any Columns into a Rows

# get the number of rows in the incoming table
function rowcount(cols)
props = propertynames(cols)
isempty(props) && return 0
return length(getproperty(cols, props[1]))
names = columnnames(cols)
isempty(names) && return 0
return length(getcolumn(cols, names[1]))
end

struct ColumnsRow{T}
# a lazy row view into a Columns object
struct ColumnsRow{T} <: AbstractRow
columns::T # a `Columns` object
row::Int
row::Int # row number
end

Base.getproperty(c::ColumnsRow, ::Type{T}, col::Int, nm::Symbol) where {T} = getproperty(getfield(c, 1), T, col, nm)[getfield(c, 2)]
Base.getproperty(c::ColumnsRow, nm::Int) = getproperty(getfield(c, 1), nm)[getfield(c, 2)]
Base.getproperty(c::ColumnsRow, nm::Symbol) = getproperty(getfield(c, 1), nm)[getfield(c, 2)]
Base.propertynames(c::ColumnsRow) = propertynames(getfield(c, 1))
getcolumns(c::ColumnsRow) = getfield(c, :columns)
getrow(c::ColumnsRow) = getfield(c, :row)

# AbstractRow interface
Base.@propagate_inbounds getcolumn(c::ColumnsRow, ::Type{T}, col::Int, nm::Symbol) where {T} = getcolumn(getcolumns(c), T, col, nm)[getrow(c)]
Base.@propagate_inbounds getcolumn(c::ColumnsRow, i::Int) = getcolumn(getcolumns(c), i)[getrow(c)]
Base.@propagate_inbounds getcolumn(c::ColumnsRow, nm::Symbol) = getcolumn(getcolumns(c), nm)[getrow(c)]
columnnames(c::ColumnsRow) = columnnames(getcolumns(c))

@generated function Base.isless(c::ColumnsRow{T}, d::ColumnsRow{T}) where {T <: NamedTuple{names}} where names
exprs = Expr[]
@@ -46,16 +53,19 @@ end
Expr(:block, exprs...)
end

# RowIterator wraps a Columns object and provides row iteration via lazy row views
struct RowIterator{T}
columns::T
len::Int
end

Base.eltype(x::RowIterator{T}) where {T} = ColumnsRow{T}
Base.length(x::RowIterator) = x.len
istable(::Type{<:RowIterator}) = true
rowaccess(::Type{<:RowIterator}) = true
rows(x::RowIterator) = x
columnaccess(::Type{<:RowIterator{T}}) where T = columnaccess(T)

columnaccess(::Type{<:RowIterator}) = true
columns(x::RowIterator) = x.columns
materializer(x::RowIterator) = materializer(x.columns)
schema(x::RowIterator) = schema(x.columns)
@@ -65,21 +75,29 @@ function Base.iterate(rows::RowIterator, st=1)
return ColumnsRow(rows.columns, st), st + 1
end

# this is our generic Tables.rows fallback definition
function rows(x::T) where {T}
# because this method is being called, we know `x` didn't define its own Tables.rows
# first check if it supports column access, and if so, wrap it in a RowIterator
if columnaccess(T)
cols = columns(x)
return RowIterator(cols, Int(rowcount(cols)))
# otherwise, if the input is at least iterable, we'll wrap it in an IteratorWrapper
# which will iterate the input, validating that it supports the AbstractRow interface
# and unwrapping any DataValues that are encountered
elseif IteratorInterfaceExtensions.isiterable(x)
return nondatavaluerows(x)
end
throw(ArgumentError("no default `Tables.rows` implementation for type: $T"))
end
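
The column-access branch of this fallback relies on lazy row views. The idea can be sketched without Tables.jl as follows; `LazyRow` and `LazyRows` are hypothetical simplifications of `ColumnsRow` and `RowIterator`, restricted here to a `NamedTuple` of vectors and using plain property access instead of `getcolumn`.

```julia
# Hypothetical stand-in for ColumnsRow: a (columns, row number) pair.
struct LazyRow{T}
    columns::T
    row::Int
end

# Property access looks up the column, then indexes into it at the row number.
Base.getproperty(r::LazyRow, nm::Symbol) = getfield(getfield(r, :columns), nm)[getfield(r, :row)]
Base.propertynames(r::LazyRow) = propertynames(getfield(r, :columns))

# Hypothetical stand-in for RowIterator: wraps columns plus a cached row count.
struct LazyRows{T}
    columns::T
    len::Int
end

# Row count is the length of the first column, or 0 when there are no columns
# (same logic as `rowcount` in this diff).
function LazyRows(columns)
    names = propertynames(columns)
    len = isempty(names) ? 0 : length(getproperty(columns, names[1]))
    return LazyRows(columns, len)
end

Base.length(rows::LazyRows) = rows.len
Base.eltype(::Type{LazyRows{T}}) where {T} = LazyRow{T}
Base.iterate(rows::LazyRows, st=1) = st > rows.len ? nothing : (LazyRow(rows.columns, st), st + 1)

cols = (id = [1, 2, 3], name = ["a", "b", "c"])
names_of_big_ids = [row.name for row in LazyRows(cols) if row.id > 1]

# Rows are views, not copies: a later mutation of the column is visible.
first_row = first(LazyRows(cols))
cols.id[1] = 10
seen_through_view = first_row.id
```

No row data is copied at any point, which is what keeps the generic fallback cheap for column-oriented sources.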

# build columns from rows
# for Rows objects, we define a "collect"-like routine to build up columns from iterated rows

"""
Tables.allocatecolumn(::Type{T}, len) => returns a column type (usually AbstractVector) w/ size to hold `len` elements
Tables.allocatecolumn(::Type{T}, len) => returns a column type (usually AbstractVector) with size to hold `len` elements

Custom column types can override with an appropriate "scalar" element type that should dispatch to their column allocator.
Alternatively, and more generally, custom scalars can overload `DataAPI.defaultarray` to signal the default array type
Review comment (Member):
Suggested change
Alternatively, and more generally, custom scalars can overload `DataAPI.defaultarray` to signal the default array type
Alternatively, and more generally, custom scalars can overload `DataAPI.defaultarray` to signal the default array type.

"""
allocatecolumn(T, len) = DataAPI.defaultarray(T, 1)(undef, len)

@@ -131,11 +149,20 @@ function __buildcolumns(rowitr, st, sch, columns, rownbr, updated)
row, st = state
rownbr += 1
eachcolumns(add_or_widen!, sch, row, columns, rownbr, updated, Base.IteratorSize(rowitr))
# little explanation here: we just called add_or_widen! for each column value of our row
# note that when a column's type is widened, `updated` is set w/ the new set of columns
# we then check if our current `columns` isn't the same object as our `updated` ref
# if it isn't, we're going to call __buildcolumns again, passing our new updated ref as
# columns, which allows __buildcolumns to specialize (i.e. recompile) based on the new types
# of updated. So a new __buildcolumns will be compiled for each widening event.
columns !== updated[] && return __buildcolumns(rowitr, st, sch, updated[], rownbr, updated)
end
return updated
end
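
The widen-then-re-enter pattern described in the comment above can be illustrated on a single column. This is a simplified sketch, not the package's code: `add_or_widen`, `buildcolumn`, and `_buildcolumn` are hypothetical names, widening is done with a plain `Union` (an assumption about the promotion rule), and the input is assumed to have a known length.

```julia
# Store `val` at position `i`; if it does not fit the element type, allocate a
# wider column, copy what was filled so far, and return the new column.
function add_or_widen(col::Vector{T}, val, i::Int) where {T}
    if val isa T
        col[i] = val
        return col
    end
    widened = Vector{Union{T, typeof(val)}}(undef, length(col))
    copyto!(widened, 1, col, 1, i - 1)
    widened[i] = val
    return widened
end

# Inner loop. When a widening happens, return a fresh call with the new column
# so the loop body is compiled again for the new element type.
function _buildcolumn(itr, st, col, i)
    while true
        state = iterate(itr, st)
        state === nothing && break
        val, st = state
        i += 1
        widened = add_or_widen(col, val, i)
        widened !== col && return _buildcolumn(itr, st, widened, i)
    end
    return col
end

# Entry point: type the column from the first value, then hand off to the loop.
function buildcolumn(itr)
    state = iterate(itr)
    state === nothing && return Union{}[]
    val, st = state
    col = Vector{typeof(val)}(undef, length(itr))
    col[1] = val
    return _buildcolumn(itr, st, col, 1)
end

with_missing = buildcolumn(Any[1, 2, missing, 4])
mixed = buildcolumn(Any[1, 2.5, "x"])
```

Each distinct element type triggers at most one extra compilation of `_buildcolumn`, so the common case (no widening at all) runs a single type-stable loop.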

# for the schema-less case, we do one extra step of initializing each column as an `EmptyVector`
# and doing an initial widening for each column in _buildcolumns, before passing the widened
# set of columns on to __buildcolumns
struct EmptyVector <: AbstractVector{Union{}}
len::Int
end
@@ -153,14 +180,20 @@ end
state = iterate(rowitr)
state === nothing && return NamedTuple()
row, st = state
names = Tuple(propertynames(row))
names = Tuple(columnnames(row))
len = Base.haslength(T) ? length(rowitr) : 0
sch = Schema(names, nothing)
columns = Tuple(EmptyVector(len) for _ = 1:length(names))
return NamedTuple{Base.map(Symbol, names)}(_buildcolumns(rowitr, row, st, sch, columns, Ref{Any}(columns))[])
end

struct CopiedColumns{T}
# for some sinks, there's a concern about whether they can safely "own" columns from the input
# to be safe, they should always copy input columns, to avoid unintended mutation.
# when we've called buildcolumns, however, Tables.jl essentially built/owns the columns,
# and it's happy to pass ownership to the sink. Thus, any built columns will be wrapped
# in a CopiedColumns struct to signal to the sink that essentially "a copy has already been made"
# and they're safe to assume ownership
struct CopiedColumns{T} <: AbstractColumns
x::T
end

@@ -170,15 +203,25 @@ columnaccess(::Type{<:CopiedColumns}) = true
columns(x::CopiedColumns) = x
schema(x::CopiedColumns) = schema(source(x))
materializer(x::CopiedColumns) = materializer(source(x))
Base.propertynames(x::CopiedColumns) = propertynames(source(x))
Base.getproperty(x::CopiedColumns, nm::Symbol) = getproperty(source(x), nm)

getcolumn(x::CopiedColumns, ::Type{T}, col::Int, nm::Symbol) where {T} = getcolumn(source(x), T, col, nm)
getcolumn(x::CopiedColumns, i::Int) = getcolumn(source(x), i)
getcolumn(x::CopiedColumns, nm::Symbol) = getcolumn(source(x), nm)
columnnames(x::CopiedColumns) = columnnames(source(x))

# here's our generic fallback Tables.columns definition
@inline function columns(x::T) where {T}
# because this method is being called, we know `x` didn't define its own Tables.columns method
# first check if it supports row access, and if so, build up the desired columns
if rowaccess(T)
r = rows(x)
return CopiedColumns(buildcolumns(schema(r), r))
# though not widely supported, if a source supports the TableTraits column interface, use it
elseif TableTraits.supports_get_columns_copy_using_missing(x)
return CopiedColumns(TableTraits.get_columns_copy_using_missing(x))
# otherwise, if the source is at least iterable, we'll wrap it in an IteratorWrapper and
# build columns from that, which will check if the source correctly iterates AbstractRows
# and unwraps DataValues for us
elseif IteratorInterfaceExtensions.isiterable(x)
iw = nondatavaluerows(x)
return CopiedColumns(buildcolumns(schema(iw), iw))