Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

handle interp for integers by casting and rounding (#71) #142

Merged
merged 9 commits into from
Apr 19, 2024

Conversation

kpa28-git
Copy link
Contributor

This is a simple solution for #71.

src/imputors/interp.jl Outdated Show resolved Hide resolved
1. We add a `r::Union{RoundingMode, Nothing}=nothing` to `Interpolate`, avoiding any breaking changes.
2. Added unit tests for unsigned integers and cases with and without the RoundingMode.
3. Increment calculation:
  - Still using `(next - prev) / n` in the default case
  - If a rounding mode is specified then `div` is used with the corresponding mode (avoids `InexactError`)
  - If we're given unsigned ints then we convert them to integers for the increment calculation (avoids integer overflow)
@rofinn
Copy link
Member

rofinn commented Apr 9, 2024

I hope you don't mind. I opted to add the commit to your PR to retain the commit history. The force push was to trigger the CI as it was disabled. Adding a rounding mode and dispatching on the increment calculation mostly resolves #71 without introducing a breaking change. Feel free to check the new tests in case I missed something.

@kpa28-git
Copy link
Contributor Author

Nope, I don't mind. Thanks! Looks good, I'll have a chance to test it later. I don't think I'll have anything to add.

@kpa28-git
Copy link
Contributor Author

I found what looks to be a problem with interpolate going past a lower boundary, here are some cases:

ints = [8, fill(missing, 10)..., 2]
uints = [0x08, fill(missing, 10)..., 0x02]
Impute.impute(ints, Impute.Interpolate(r=RoundNearest);)   # [8, 7, 6, 5, 4, 3, 2, 1, 0, -1, -2, 2]
Impute.impute(uints, Impute.Interpolate(r=RoundNearest);) # InexactError: trunc(UInt8, -1)

Making the gap between the top and bottom value smaller (in this example, it was 5 or lower) makes it not go past the bottom value, but gives RoundUp instead of RoundNearest:

ints_sm = [7, fill(missing, 10)..., 2]
uints_sm = [0x07, fill(missing, 10)..., 0x02]
Impute.impute(ints_sm, Impute.Interpolate(r=RoundNearest);)     # [7, 7, 7, ..., 7, 2]
Impute.impute(uints_sm, Impute.Interpolate(r=RoundNearest);)   # [0x07, 0x07, 0x07, ..., 0x07, 0x02]

@rofinn
Copy link
Member

rofinn commented Apr 12, 2024

Right, forgot rounding would violate the increment assumption from before. I've added a fix:

julia> ints = [8, fill(missing, 10)..., 2];

julia> uints = [0x08, fill(missing, 10)..., 0x02];

julia> Impute.impute(ints, Impute.Interpolate(r=RoundNearest);)
12-element Vector{Union{Missing, Int64}}:
 8
 7
 6
 5
 4
 3
 2
 2
 2
 2
 2
 2

julia> Impute.impute(uints, Impute.Interpolate(r=RoundNearest);)
12-element Vector{Union{Missing, UInt8}}:
 0x08
 0x07
 0x06
 0x05
 0x04
 0x03
 0x02
 0x02
 0x02
 0x02
 0x02
 0x02

@rofinn
Copy link
Member

rofinn commented Apr 16, 2024

Alright if there aren’t any comments I’m gonna go ahead and merge/tag this.

@kpa28-git
Copy link
Contributor Author

I don't mind if you merge it.

This is for future reference (maybe another issue), but my only problem is that the code doesn't handle partial increments as expected. For example:

Impute.impute([08, fill(missing, 12)..., 02], Impute.Interpolate(r=RoundNearest))

[8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 2]

round.(Int, Impute.impute([08., fill(missing, 12)..., 02.], Impute.Interpolate()), RoundNearest)

[8, 8, 7, 7, 6, 6, 5, 5, 4, 4, 3, 3, 2, 2] # what I expect

The user can always cast their data to floats, but then it would be the same as before.

For discrete types that might require a variable increment between values in gaps (e.g. integers), maybe we can dispatch _impute! or have a branch of the main function to do imputation for gaps each gap at a time. That way, each imputed value can be independent from the previous value.

This would allocate intermediate vectors but it would be better than having to cast the entire input to floats.

Any ideas or thoughts?

@kpa28-git
Copy link
Contributor Author

Using a generator and setting the gap slices doesn't allocate and gives results that are more even over larger gaps
Example (I don't know if there is a built-in way to set slices over generators so I made a genset! function):

_calculate_increment(a, b, n, ::Nothing) = (b - a) / n

function interpolate(a::T, b::T, n::Integer, r::RoundingMode) where {T<:Integer}
    inc = _calculate_increment(float(a), float(b), n, nothing)
    (round(T, a + inc*i, r) for i=1:n)
end

function genset!(v::AbstractVector, st, g)
    for (i, val) in enumerate(g)
       v[i+st] = val
    end
    v
end

v = [8, fill(missing, 12)..., 2]
v2 = [0x08, fill(missing, 12)..., 0x02]
@benchmark genset!(v, 1, interpolate(8, 2, 12, RoundNearest)) # no alloc
@benchmark genset!(v2, 1, interpolate(0x08, 0x02, 12, RoundNearest)) # no alloc

@rofinn
Copy link
Member

rofinn commented Apr 17, 2024

Sure. If we don't care about the increment type, then I think just rounding/clamping on the insert could make sense.

@kpa28-git
Copy link
Contributor Author

Well I have something here that might work. It does the interpolation in floating point space, but uses a generator for to avoid allocation.
I haven't benchmarked it on large inputs yet, but on small veectors it has about the same alloc as the integer increment version and maybe a bit faster.
It passes all the tests, added some more in this commit too.

@kpa28-git
Copy link
Contributor Author

The second commit is just to ignore RoundingMode for floats, probably don't want to have the same RoundingMode apply to different vectors in a mixed dataset. Anyway, we usually don't want to round floats.

src/imputors/interp.jl Outdated Show resolved Hide resolved
src/imputors/interp.jl Outdated Show resolved Hide resolved
src/imputors/interp.jl Outdated Show resolved Hide resolved
src/imputors/interp.jl Outdated Show resolved Hide resolved
@rofinn
Copy link
Member

rofinn commented Apr 17, 2024

Sure, makes sense to me 👍🏻 I just added a commit to minimize the changes. Rather than creating a generator with duplicated logic to insert values, we can just dispatch on two special cases:

  1. Calculating an increment value with an Unsigned
  2. Rounding when inserting a non-integer value into an integer array if a RoundingMode is specified.

@kpa28-git
Copy link
Contributor Author

I pushed what I had now (based on your feedback + some benchmarking) in case I can't get back to you quick enough. I don't know if you're interested, but I figure sharing what I have is better.

1. We just need a kernel function barrier to avoid type inner loop type instability with integers
2. The generator caused overhead for the default floats case
3. Simplified some of the indexing logic since the kernel function can handle most of it.
4. Add some comments about why all these extra internal functions exist.
@rofinn rofinn merged commit 11658f6 into invenia:master Apr 19, 2024
5 checks passed
rofinn referenced this pull request Apr 19, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants