Glenn Moynihan authored Apr 1, 2021
1 parent c608b3f, commit 55e3c1a
Showing 9 changed files with 192 additions and 53 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,18 +1,15 @@ | ||
# FeatureTransforms | ||
|
||
FeatureTransforms.jl provides utilities for performing feature engineering in machine learning pipelines. | ||
FeatureTransforms supports operations on `AbstractArray`s and [`Table`](https://github.com/JuliaData/Tables.jl)s. | ||
FeatureTransforms.jl provides utilities for performing feature engineering in machine learning pipelines with support for `AbstractArray`s and [`Table`](https://github.com/JuliaData/Tables.jl)s. | ||
|
||
There are three key parts of the Transforms.jl API: | ||
## Why does this package exist? | ||
|
||
* Subtypes of [`Transform`](@ref about-transforms) define transformations of data, for example normalization or a periodic function. | ||
* The [`FeatureTransforms.apply`](@ref), [`FeatureTransforms.apply!`](@ref) and [`FeatureTransforms.apply_append`](@ref) methods transform data according to the given [`Transform`](@ref about-transforms), in a manner determined by the data type and specified dimensions, column names, indices, and other `Transform`-specific parameters. | ||
* The [`transform`](@ref transform-interface) method should be overloaded to define feature engineering pipelines that include [`Transform`](@ref about-transforms)s. | ||
FeatureTransforms.jl aims to provide common feature engineering transforms that are composable, reusable, and performant. | ||
|
||
## Getting Started | ||
FeatureTransforms.jl is conceptually different from other widely-known packages that provide similar utilities for manipulating data, such as [DataFramesMeta.jl](https://github.com/JuliaData/DataFramesMeta.jl), [DataKnots.jl](https://github.com/rbt-lang/DataKnots.jl), and [Query.jl](https://github.com/queryverse/Query.jl). | ||
These packages provide methods for composing relational operations to filter, join, or combine structured data. | ||
However, neither a query-based syntax nor an API that supports only one type is well suited to composing the kinds of mathematical operations, such as one-hot encoding, that underpin most (non-trivial) feature engineering pipelines. | ||
|
||
Here are some resources for getting started with FeatureTransforms.jl: | ||
|
||
* Refer to the page on [Transforms](@ref about-transforms) to learn how they are defined and used. | ||
* Consult the [examples](@ref) section for a quick guide to some typical use cases. | ||
* The [API](@ref) page has the list of all currently supported `Transform`s. | ||
The composability of transforms reflects the practice of piping the output of one operation to the input of another, as well as combining the pipelines of multiple features. | ||
Reusability is achieved through native support for the `Tables` and `AbstractArray` types, which include [DataFrames](https://github.com/JuliaData/DataFrames.jl/), [TypedTables](https://github.com/JuliaData/TypedTables.jl), [LibPQ.Result](https://github.com/invenia/LibPQ.jl), etc., as well as [AxisArrays](https://github.com/JuliaArrays/AxisArrays.jl), [KeyedArrays](https://github.com/mcabbott/AxisKeys.jl), and [NamedDimsArrays](https://github.com/invenia/NamedDims.jl). | ||
This flexible design allows for performant code that should satisfy the needs of most users while not being restricted to (or by) any one data type. |
This file was deleted.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,59 @@ | ||
# [Transform Interface](@id transform-interface) | ||
|
||
The "transform interface" is a mechanism that allows sequences of `Transform`s to be combined (with other steps) into end-to-end feature engineering pipelines. | ||
|
||
This is possible because applying a `Transform` returns data of the same type as its input. | ||
This type consistency helps to make `Transform`s _composable_, i.e., the output of one is always a valid input to another, which allows users to "stack" sequences of `Transform`s together with minimal glue code needed to keep it working. | ||
|
||
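The value of type consistency can be sketched in plain Julia without the package. In this minimal illustration, `scale` and `shift` are hypothetical stand-ins for `Transform`s (they are not part of FeatureTransforms.jl); each returns the same type it receives, so they can be chained in any order with no glue code.

```julia
# Hypothetical stand-ins for Transforms (not part of FeatureTransforms.jl):
# each takes a vector and returns a vector of the same type.
scale(x::AbstractVector{<:Real}; factor=2.0) = x .* factor
shift(x::AbstractVector{<:Real}; offset=1.0) = x .+ offset

x = [1.0, 2.0, 3.0]

# Pipe the output of one step into the input of the next.
y = x |> scale |> shift

# The same pipeline built with function composition (applied right to left).
pipeline = shift ∘ scale
z = pipeline(x)

# The output type matches the input type, so further steps can be appended.
typeof(y) == typeof(x)
```

Because every step preserves the container type, adding or removing a step does not change what the next step has to accept.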
Moreover, the end-to-end pipelines themselves should obey the same principle: you should be able to add or remove `Transform`s (or another pipeline) to the output without breaking your code. | ||
That is, the output should also be a valid "transformable" type: either an `AbstractArray`, a `Table`, or another type for which the user has extended [`FeatureTransforms.apply`](@ref). | ||
Valid types can be checked by calling `is_transformable`, which is the first part of the transform interface. | ||
|
||
The second part is the `transform` method stub, which users should overload when they want to "encapsulate" an end-to-end pipeline. | ||
How to do so is left to the user as an implementation detail; the code below is one example. | ||
The only requirement of the transform API is that the return of the implemented `transform` method is itself "transformable", i.e. satisfies `is_transformable`. | ||
|
||
## Example | ||
|
||
This is a trivial example of a feature engineering pipeline. | ||
In practice, there may be other steps involved, such as checking for missing data or logging, which are omitted for clarity. | ||
An advantage of the transform API is that the output can be readily integrated into another transform pipeline downstream, for example if `MyModel` were being stacked with the result of a previous model. | ||
|
||
|
||
```@meta | ||
DocTestSetup = quote | ||
using FeatureTransforms | ||
end | ||
``` | ||
|
||
```jldoctest transform | ||
function FeatureTransforms.transform(data) | ||
    # Define the Transforms we will apply | ||
    p = Power(0.123) | ||
    lc = LinearCombination([0.1, 0.9]) | ||
    ohe = OneHotEncoding(["type1", "type2", "type3"]) | ||
    features = deepcopy(data) | ||
    FeatureTransforms.apply!(features, p; cols=[:a], header=[:a]) | ||
    features = FeatureTransforms.apply_append(features, lc; cols=[:a, :b], header=[:ab]) | ||
    features = FeatureTransforms.apply_append(features, ohe; cols=:types, header=[:type1, :type2, :type3]) | ||
end | ||
# this could be any table-type, including a DataFrame | ||
input = (a=rand(5), b=rand(5), types=["type1", "type2", "type1", "type1", "type1"]); | ||
output = FeatureTransforms.transform(input); | ||
# verify the output is transformable | ||
is_transformable(output) && print("output is transformable") | ||
# output | ||
output is transformable | ||
``` | ||
|
||
```@meta | ||
DocTestSetup = nothing | ||
``` |
55e3c1a
Registration pull request created: JuliaRegistries/General/33328
After the above pull request is merged, it is recommended that a tag is created on this repository for the registered package version.
This will be done automatically if the Julia TagBot GitHub Action is installed, or can be done manually through the github interface, or via: