Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: support FeatureTransforms.jl #44

Closed
wants to merge 13 commits into from
6 changes: 4 additions & 2 deletions Project.toml
Original file line number Diff line number Diff line change
@@ -1,11 +1,12 @@
name = "AxisSets"
uuid = "a1a1544e-ba16-4f6d-8861-e833517b754e"
authors = ["Invenia Technical Computing Corporation"]
version = "0.1.4"
version = "0.1.5"

[deps]
AutoHashEquals = "15f4f7f2-30c1-5605-9d31-71845cf9641f"
AxisKeys = "94b1ba4f-4ee9-5380-92f1-94cde586c3c5"
FeatureTransforms = "8fd68953-04b8-4117-ac19-158bf6de9782"
Impute = "f7bf1975-0170-51b9-8c5f-a992d46b9575"
NamedDims = "356022a1-0364-5f58-8944-0da4b18d706f"
OrderedCollections = "bac558e1-5e72-5ebc-8fee-abe8a469f55d"
Expand All @@ -14,11 +15,12 @@ ReadOnlyArrays = "988b38a3-91fc-5605-94a2-ee2116b3bd83"
[compat]
AutoHashEquals = "0.2"
AxisKeys = "0.1"
FeatureTransforms = "0.3"
Impute = "0.6"
NamedDims = "0.2"
OrderedCollections = "1"
ReadOnlyArrays = "0.1"
julia = "1.3"
julia = "1.5"
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comply with FeatureTransforms compat

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kind of annoying that FeatureTransforms only supports 1.5, but I guess all our packages should support it anyways?


[extras]
Dates = "ade2ca70-3891-5945-98fb-dc099432e06a"
Expand Down
2 changes: 2 additions & 0 deletions src/AxisSets.jl
Original file line number Diff line number Diff line change
Expand Up @@ -2,6 +2,7 @@ module AxisSets

using AutoHashEquals
using AxisKeys
using FeatureTransforms
using Impute
using NamedDims
using OrderedCollections
Expand Down Expand Up @@ -88,5 +89,6 @@ include("dataset.jl")
include("indexing.jl")
include("functions.jl")
include("impute.jl")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#46

include("featuretransforms.jl")

end
154 changes: 154 additions & 0 deletions src/featuretransforms.jl
Original file line number Diff line number Diff line change
@@ -0,0 +1,154 @@
FeatureTransforms.is_transformable(::KeyedDataset) = true
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is now automatic when an apply method is defined for a type

https://github.com/invenia/FeatureTransforms.jl/releases/tag/v0.3.4


_transform_pattern(keys, dims) = isempty(keys) ? _transform_pattern(dims) : Pattern[keys...]
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comment in impute.jl

_transform_pattern(::Colon) = Pattern[(:__,)]
_transform_pattern(dims::Symbol) = Pattern[(:__, dims)]
_transform_pattern(dims) = Pattern[(:__, d) for d in dims]

"""
FeatureTransforms.apply(ds::KeyedDataset, t::Transform, [key]; dims=:, kwargs...)

Apply the `Transform` to each component of the [`KeyedDataset`](@ref).
Returns a new dataset with the same constraints, but transformed components.

The transform can be applied to a subselection of components via a [`Pattern`](@ref) `key`.
Otherwise, components are selected by the desired `dims`.

Keyword arguments including `dims` are passed to the appropriate `FeatureTransforms` method
for a component.

# Example
```jldoctest
julia> using AxisKeys, FeatureTransforms; using AxisSets: KeyedDataset, Pattern, flatten;

julia> ds = KeyedDataset(
flatten([
:train => [
:load => KeyedArray([7.0 7.7; 8.0 8.2; 9.0 9.9]; time=1:3, loc=[:x, :y]),
:price => KeyedArray([-2.0 4.0; 3.0 2.0; -1.0 -1.0]; time=1:3, id=[:a, :b]),
],
:predict => [
:load => KeyedArray([7.0 7.7; 8.1 7.9; 9.0 9.9]; time=1:3, loc=[:x, :y]),
:price => KeyedArray([0.5 -1.0; -5.0 -2.0; 0.0 1.0]; time=1:3, id=[:a, :b]),
]
])...
);

julia> p = Power(2);

julia> r = FeatureTransforms.apply(ds, p, (:_, :price, :_));

julia> [k => parent(parent(v)) for (k, v) in r.data]
4-element Vector{Pair{Tuple{Symbol, Symbol}, Matrix{Float64}}}:
(:train, :load) => [7.0 7.7; 8.0 8.2; 9.0 9.9]
(:train, :price) => [4.0 16.0; 9.0 4.0; 1.0 1.0]
(:predict, :load) => [7.0 7.7; 8.1 7.9; 9.0 9.9]
(:predict, :price) => [0.25 1.0; 25.0 4.0; 0.0 1.0]
```
"""
function FeatureTransforms.apply(
ds::KeyedDataset, t::Transform, keys...;
dims=:, kwargs...
)
return map(ds, _transform_pattern(keys, dims)...) do a
FeatureTransforms.apply(a, t; dims=dims, kwargs...)
end
end

"""
FeatureTransforms.apply!(ds::KeyedDataset, t::Transform, [key]; dims=:, kwargs...)

Apply the `Transform` to each component of the [`KeyedDataset`](@ref).
Returns a new dataset with the same constraints, but transformed components.

The transform can be applied to a subselection of components via a [`Pattern`](@ref) `key`.
Otherwise, components are selected by the desired `dims`.

Keyword arguments including `dims` are passed to the appropriate `FeatureTransforms` method
for a component.
"""
function FeatureTransforms.apply!(
ds::KeyedDataset, t::Transform, keys...;
dims=:, kwargs...
)
return map(ds, _transform_pattern(keys, dims)...) do a
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be a for-loop and the output be just return ds? otherwise doesn't it just return the components that were affected?

FeatureTransforms.apply!(a, t; dims=dims, kwargs...)
end
end

"""
FeatureTransforms.apply_append(
ds::KeyedDataset, t::Transform, [key];
dims=:, inner=false, component_name=:component, kwargs...
)

Apply the `Transform` to each component of the [`KeyedDataset`](@ref).

The transform can be applied to a subselection of components via a [`Pattern`](@ref) `key`.
Otherwise, components are selected by the desired `dims`.

If `inner=true`, perform `FeatureTransforms.apply_append` on each component,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure inner is the right term for this as it feels like it overlaps with things like inner joins. Maybe batch would be more appropriate?

returning a new dataset with the same constraints, but transformed components.

Otherwise, transform each component using `FeatureTransforms.apply`, and append
to a copy of the dataset as a new component called `component_name`.

Keyword arguments including `dims` are passed to the appropriate `FeatureTransforms` method
for a component.

# Example
```jldoctest
julia> using AxisKeys, FeatureTransforms; using AxisSets: KeyedDataset, Pattern, flatten;

julia> ds = KeyedDataset(
flatten([
:train => [
:load => KeyedArray([7.0 7.7; 8.0 8.2; 9.0 9.9]; time=1:3, loc=[:x, :y]),
:price => KeyedArray([-2.0 4.0; 3.0 2.0; -1.0 -1.0]; time=1:3, id=[:a, :b]),
],
:predict => [
:load => KeyedArray([7.0 7.7; 8.1 7.9; 9.0 9.9]; time=1:3, loc=[:x, :y]),
:price => KeyedArray([0.5 -1.0; -5.0 -2.0; 0.0 1.0]; time=1:3, id=[:a, :b]),
]
])...
);

julia> p = Power(2);

julia> r = FeatureTransforms.apply_append(ds, p, (:_, :price, :_); component_name=:price2);

julia> [k => parent(parent(v)) for (k, v) in r.data]
6-element Vector{Pair{Tuple{Symbol, Symbol}, Matrix{Float64}}}:
(:train, :load) => [7.0 7.7; 8.0 8.2; 9.0 9.9]
(:train, :price) => [-2.0 4.0; 3.0 2.0; -1.0 -1.0]
(:predict, :load) => [7.0 7.7; 8.1 7.9; 9.0 9.9]
(:predict, :price) => [0.5 -1.0; -5.0 -2.0; 0.0 1.0]
(:train, :price2) => [4.0 16.0; 9.0 4.0; 1.0 1.0]
(:predict, :price2) => [0.25 1.0; 25.0 4.0; 0.0 1.0]
```
"""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These docstrings are a bit verbose (ie: several duplicate sentences between them). Could we simplify the apply_append and append docstrings to reference the apply! method?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that's a good idea

function FeatureTransforms.apply_append(
ds::KeyedDataset, t::Transform, keys...;
dims=:, inner=false, component_name=:component, kwargs...
)
patterns = _transform_pattern(keys, dims)

if inner # batched apply_append on each component
return map(ds, patterns...) do a
FeatureTransforms.apply_append(a, t; dims=dims, kwargs...)
end
Comment on lines +136 to +139
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just thought this could be handy and worth including.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could see this being useful, but I'm not sure we have enough use-cases yet. That being said, it wouldn't be too hard to deprecated if it isn't useful.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure how useful this is right away either, and it might complicate the code and tests if we have to support both batch=true/false .

given the implementation is rather easy (just a map over the components) it should be straightforward for users to do it themselves and hold off doing it here until we know it's worth doing.

else # merge transformed components as new components of dataset
# select any components the keys match
selected = unique(x[1:end-1] for x in dimpaths(ds) if any(p -> x in p, patterns))

# construct keys of new transformed components
new_keys = [(k[1:end-1]..., component_name) for k in selected]
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By using a single component name this assumes, but does not enforce, that there is only one kind of component being transformed e.g. :price. It could still be multiple components e.g. (:train, :price) and (:predict, :price).

But we also discussed the idea of passing in a full dimspath, which I'm open to.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be good to throw an argument error if that assumption doesn't hold for now. Passing multiple dimpaths does seem noisy, and hard to justify without use-cases.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the condition would be that the last part of each dimpath is the same? Off the top of my head:

length(unique([dpath[end] for dpath in selected])) == 1

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, though that could probably be simplified to only(last.(selected))?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this assumption is fine IMO

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe shouldn't call it keys to avoid confusion with KeyedArray keys. keys means dimspaths here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I'd probably use a variable like _dimpaths or dpaths.


# pair new keys with transformed components
pairs = map(new_keys, selected) do new_k, k
new_k => FeatureTransforms.apply(ds.data[k], t; dims=dims, kwargs...)
end

return merge(ds, KeyedDataset(pairs...))
end
end
8 changes: 4 additions & 4 deletions src/impute.jl
Original file line number Diff line number Diff line change
Expand Up @@ -126,12 +126,12 @@ julia> [k => parent(parent(v)) for (k, v) in Impute.filter(ds; dims=:loc).data]
"""
Impute.apply(ds::KeyedDataset, f::Filter; dims) = Impute.apply!(deepcopy(ds), f; dims=dims)

_pattern(dims::Pattern) = dims
_pattern(dims::Tuple) = Pattern(dims)
_pattern(dims) = Pattern(:__, dims)
_impute_pattern(dims::Pattern) = dims
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having to distinguish _impute_pattern and _transform_pattern could be a code smell? But there is a difference in what dims means for Impute vs. FT.

_transform_pattern is closer to what mapslices does:

function Base.mapslices(f::Function, ds::KeyedDataset, keys...; dims)
patterns = if isempty(keys)
dims isa Symbol ? Pattern[(:__, dims)] : Pattern[(:__, d) for d in dims]
else
Pattern[keys...]
end

Ideally we'd standardise how to handle patterns/dims everywhere.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, maybe add a comment to invenia/Impute.jl#66? If we're gonna change that in Impute.jl it'd be good to do that before a 1.0 release?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_impute_pattern(dims::Tuple) = Pattern(dims)
_impute_pattern(dims) = Pattern(:__, dims)

function Impute.apply!(ds::KeyedDataset, f::Filter; dims)
pattern = _pattern(dims)
pattern = _impute_pattern(dims)
dim = pattern.segments[end]

dim in (:_, :__) && throw(ArgumentError(
Expand Down
Loading