Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KNN unexpected behaviour #141

Open
mbataillou opened this issue Dec 22, 2023 · 2 comments
Open

KNN unexpected behaviour #141

mbataillou opened this issue Dec 22, 2023 · 2 comments

Comments

@mbataillou
Copy link

Hi !

Given a matrix M as follows

M = [1.0 100.0
2.0 200.0
2.0 missing
1.0 missing]

I would expect to get:

Impute.knn(M, k=1, dims=:cols) --> [1.0 100.0
2.0 200.0
2.0 200.0
1.0 100.0]

However, I get:

Impute.knn(M, k=1, dims=:cols) --> [1.0 100.0
2.0 200.0
2.0 150
1.0 150]

Am I missing something?

@rofinn
Copy link
Member

rofinn commented Dec 22, 2023

Yeah, you're hitting the somewhat unintuitive part where we need to initialize the missings with a real value. It looks like we're using Substitute(), which is where the 150 is coming from.

https://github.com/invenia/Impute.jl/blob/master/src/imputors/knn.jl#L40

In this particular case, that means rows 3 and 4 are neighbours of each other.

julia> kdtree = KDTree(X, Euclidean())
KDTree{StaticArraysCore.SVector{2, Float64}, Euclidean, Float64}
  Number of points: 4
  Dimensions: 2
  Metric: Euclidean(0.0)
  Reordered: true

julia> Dict(idxs => dists for (idxs, dists) in zip(NearestNeighbors.knn(kdtree, points, 3, true)...))
Dict{Vector{Int64}, Vector{Float64}} with 2 entries:
  [3, 4, 2] => [0.0, 1.0, 50.0]
  [4, 3, 1] => [0.0, 1.0, 50.0]

We could use NaN as the initialization value, which would give you the answer you're expecting here.

julia> X[2,3:4] .= NaN;

julia> kdtree = KDTree(X, Euclidean())
KDTree{StaticArraysCore.SVector{2, Float64}, Euclidean, Float64}
  Number of points: 4
  Dimensions: 2
  Metric: Euclidean(0.0)
  Reordered: true
  
julia> Dict(idxs => dists for (idxs, dists) in zip(NearestNeighbors.knn(kdtree, points, 3, true)...))
Dict{Vector{Int64}, Vector{Float64}} with 2 entries:
  [2, 1, 4] => [50.0, 50.01, Inf]
  [1, 2, 4] => [50.0, 50.01, Inf]

The drawback with this approach is that:

  1. We'd need to handle non-finite values explicitly, but we might want to anyway.
  2. We'd be limiting neighbours to only complete cases, which could be an issue for large matrices.

I suppose the least opinionated option might be to allow folks to specify the initialization and aggregation options? That's what I did for the SVD implementation.

https://github.com/invenia/Impute.jl/blob/master/src/imputors/svd.jl#L19

@rofinn
Copy link
Member

rofinn commented Dec 28, 2023

In retrospect, the better way to do this is to support skipmissing distances in Distances.jl and then extend NearestNeighbors.jl to include that in the Union type of which metric you can pass in. Then the initialization step could become a no-op by default. I don't know how likely it is that other folks will be willing to accept those changes elsewhere though.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants