Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update impute/impute! docstrings #145

Open
wants to merge 1 commit into
base: master
Choose a base branch
from
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
119 changes: 76 additions & 43 deletions src/imputors.jl
Original file line number Diff line number Diff line change
Expand Up @@ -38,24 +38,40 @@ function Base.:(==)(a::T, b::T) where T <: Imputor
return result
end

"""
impute(data::T, imp; kwargs...) -> T
impute_docstring = """
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than assigning this to a variable, you can just do:

"""
...
""" impute, impute!

This will declare the docstring for those function names and then we can define the actual methods below.

impute(data::T, imp; dims=:, kwargs...) -> T
impute!(data::A, imp; dims=:, kwargs...) -> A
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can also use T here.

Suggested change
impute!(data::A, imp; dims=:, kwargs...) -> A
impute!(data::T, imp; dims=:, kwargs...) -> T


Returns a new copy of the `data` with the missing data imputed by the imputor `imp`.
For matrices and tables, data is imputed one variable/column at a time.
If this is not the desired behaviour then you should overload this method or specify a different `dims` value.
Returns a new copy of the `data` with the missing data imputed by the imputor `imp`. If the mutating version
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you leave the newlines as they were unless it's necessary? It makes it harder to review the diff.

`impute!` is used, it will also update the missing values in-place.

By default, `data` is assumed to be laid out like a `DataFrame`, with each column representing a variable and
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is somewhat misleading. Impute assumes it's working on either an array or a table. Perhaps this wording would satisfy what you're trying to achieve?

By default, data is imputed along the first dimension of the input data (i.e., columns for matrices or tables). Other orientations can be handled via the `dims` keyword argument.

each row representing one observation. Other layouts can be handled via the `dims` keyword argument.

# Arguments
* `data`: the data to be impute
* `imp::Imputor`: the Imputor method to use

# Keyword arguments
* `dims = :`: The dimensions to impute along, either `:cols` or `:rows`. If data are in `DataFrame` format,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't usually have whitespace around the = for kwargs. Again, we don't focus on DataFrames here, and I think just saying columns is more representative of the behaviour for both tables and arrays.

with variables in columns and observations in rows, use `dims = :cols`. If it is transposed, with variables
in rows and observations in columns, use `dims=:rows`.

# Returns
* the input `data` with values imputed
* `AbstractArray{Union{T, Missing}}`: the input `data` with values imputed. (Mutation isn't guaranteed for
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The return type isn't this specific. It could be an array, but it could also be a table type.

all array types, so we always return the result).

# Example
# NOTES
1. Matrices have a deprecated `dims=2` special case as `dims=:` is a breaking change
2. `eachslice` is used internally which requires Julia 1.1

# Examples
```jldoctest
julia> using Impute: Interpolate, impute
julia> using Impute: Interpolate, Substitute, impute, impute!

julia> using Statistics: mean

# Linear interpolation in a vector
julia> v = [1.0, 2.0, missing, missing, 5.0]
5-element Vector{Union{Missing, Float64}}:
1.0
Expand All @@ -71,46 +87,49 @@ julia> impute(v, Interpolate())
3.0
4.0
5.0
```
"""
function impute(data, imp::Imputor; kwargs...)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For future reference, splitting up with merging of the docstrings and your additions makes it hard to follow what is new vs merged.

# NOTE: We don't use a return type declaration here because `trycopy` isn't guaranteed
# to return the same type passed in. For example, subarrays and subdataframes will
# return a regular array or dataframe.
return impute!(trycopy(data), imp; kwargs...)
end

"""
impute!(data::A, imp; dims=:, kwargs...) -> A

Impute the `missing` values in the array `data` using the imputor `imp`.
Optionally, you can specify the dimension to impute along.

# Arguments
* `data::AbstractArray{Union{T, Missing}}`: the data to be impute along dimensions `dims`
* `imp::Imputor`: the Imputor method to use

# Keyword Arguments
* `dims=:`: The dimension to impute along. `:rows` and `:cols` are also supported for matrices.

# Returns
* `AbstractArray{Union{T, Missing}}`: the input `data` with values imputed

# NOTES
1. Matrices have a deprecated `dims=2` special case as `dims=:` is a breaking change
2. Mutation isn't guaranteed for all array types, hence we return the result
3. `eachslice` is used internally which requires Julia 1.1

# Example
```jldoctest
julia> using Impute: Interpolate, impute!
`
# Usage of the `dims` keyword argument

julia> x = [1.0 missing; missing 2.0]
2×2 Matrix{Union{Missing, Float64}}:
1.0 missing
missing 2.0

julia> imp = Substitute(statistic = mean)
Substitute{typeof(mean)}(Statistics.mean)

# Variables in columns, each row is an observation (like a DataFrame)
julia> impute(x, imp, dims=:cols)
2×2 Matrix{Union{Missing, Float64}}:
1.0 2.0
1.0 2.0

# Variables in rows, each column is an observation
julia> impute(x, imp, dims=:rows)
2×2 Matrix{Union{Missing, Float64}}:
1.0 1.0
2.0 2.0

# Impute over all dimensions
julia> impute(x, imp, dims=:)
2×2 Matrix{Union{Missing, Float64}}:
1.0 1.5
1.5 2.0

# Default is `dims = :cols`
julia> impute(x, imp)
2×2 Matrix{Union{Missing, Float64}}:
1.0 2.0
1.0 2.0

# In-place imputation
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a lot of examples for one docstring. I'm okay with two examples to differentiate :rows from :cols, but NamedDims exists, and the dims keyword seems pretty ubiquitous throughout the Julia ecosystem at this point.


julia> M = [1.0 2.0 missing missing 5.0; 1.1 2.2 3.3 missing 5.5]
2×5 Matrix{Union{Missing, Float64}}:
1.0 2.0 missing missing 5.0
1.1 2.2 3.3 missing 5.5

julia> impute!(M, Interpolate(); dims=1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think showing that dims can still be an integer dimension is important, as that's how it works in Base. For example, you can impute along dims=3 for a n-dimensional array.

https://github.com/invenia/Impute.jl/blob/master/test/testutils.jl#L224

julia> impute!(M, Interpolate(), dims=:rows)
2×5 Matrix{Union{Missing, Float64}}:
1.0 2.0 3.0 4.0 5.0
1.1 2.2 3.3 4.4 5.5
Expand All @@ -119,7 +138,21 @@ julia> M
2×5 Matrix{Union{Missing, Float64}}:
1.0 2.0 3.0 4.0 5.0
1.1 2.2 3.3 4.4 5.5
```
"""


"""
$impute_docstring
"""
function impute(data, imp::Imputor; kwargs...)
# NOTE: We don't use a return type declaration here because `trycopy` isn't guaranteed
# to return the same type passed in. For example, subarrays and subdataframes will
# return a regular array or dataframe.
return impute!(trycopy(data), imp; kwargs...)
end

"""
$impute_docstring
"""
function impute!(
data::A, imp::Imputor; dims=:, kwargs...
Expand Down
Loading