Sparse matrix seems to not work correctly #1
@pommedeterresautee I'm sorry for the late reply. Would you mind sharing a reproducible example so that I can find out whether it's a bug in the R code that I added to the package? I assume you mean that the error occurs when you call the init_nms$knn_Query_Batch function. |
Thank you for your answer. I have:
> mat <- Matrix::Matrix(runif(1000, 0, 1), nrow = 100, sparse = TRUE, byrow = TRUE)
> dtm_not_prefix_tfidf_scipy <- dgCMatrix_2scipy_sparse(mat)
>
> dtm_not_prefix_tfidf_scipy$shape
(100, 10)
>
> dtm_not_prefix_tfidf_scipy$getrow(0L)
(0, 0) 0.4460052342619747
(0, 1) 0.267301544547081
(0, 2) 0.9658534412737936
(0, 3) 0.4592860534321517
(0, 4) 0.7499000441748649
(0, 5) 0.29996729688718915
(0, 6) 0.5544748182874173
(0, 7) 0.9539664378389716
(0, 8) 0.8479003391694278
(0, 9) 0.5923537667840719
> mat[1,]
[1] 0.4460052 0.2673015 0.9658534 0.4592861 0.7499000 0.2999673 0.5544748 0.9539664 0.8479003 0.5923538
# The conversion seems ok (dim and content)
>
> init_nms = NMSlib$new(input_data = dtm_not_prefix_tfidf_scipy,
+ Index_Params = list('indexThreadQty' = config$cpu),
+ space = 'cosinesimil_sparse',
+ method = 'hnsw',
+ data_type = 'SPARSE_VECTOR',
+ dtype = 'FLOAT',
+ print_progress = TRUE)
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
>
> b = init_nms$knn_Query_Batch(query_data = dtm_not_prefix_tfidf_scipy,
+ k = 5,
+ num_threads = config$cpu)
>
> dim(b$knn_idx)
[1] 10 5
> max(b$knn_idx)
[1] 10
>
> as.data.frame(b$knn_idx)
V1 V2 V3 V4 V5
1 5 3 8 10 6
2 10 5 7 6 8
3 8 10 1 6 7
4 10 9 7 3 6
5 1 10 7 8 6
6 3 5 10 7 9
7 10 3 5 8 9
8 3 10 5 1 7
9 5 3 7 8 4
10 2 8 7 3 5 |
I have made several tests with my real data. The R binder nmslibR requires that the sparse matrix is transposed before being converted to scipy. After that nmslibR works as expected. I have no idea why. |
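A minimal sketch (not part of the original thread, using only the Matrix package) of why the transpose workaround behaves this way: the column-compressed slots of t(mat) are exactly the row-compressed slots of mat, so converting the transpose effectively hands nmslib a CSR layout, which matches the CSR/CSC explanation in the next comment.
library(Matrix)
set.seed(1)
mat   = Matrix::Matrix(runif(1000, 0, 1), nrow = 100, sparse = TRUE, byrow = TRUE)   # dgCMatrix (CSC)
csr   = as(mat, "RsparseMatrix")                                                     # dgRMatrix (CSR) of mat
csc_t = as(Matrix::t(mat), "CsparseMatrix")                                          # dgCMatrix (CSC) of t(mat)
identical(csr@x, csc_t@x)   # should be TRUE : same non-zero values in the same order
identical(csr@j, csc_t@i)   # should be TRUE : CSR column indices equal the CSC row indices of t(mat)
identical(csr@p, csc_t@p)   # should be TRUE : same pointer array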
@pommedeterresautee I'm sorry for the late reply, I didn't have access to my computer. I'll start looking at this issue right now. |
@pommedeterresautee thanks for making me aware of this issue. The nmslib python package expects a Compressed Sparse Row matrix (scipy.sparse.csr_matrix). The corresponding class in R is dgRMatrix. The dgCMatrix_2scipy_sparse() function returns by default a Compressed Sparse Column matrix (scipy.sparse.csc_matrix), which is not the correct one to use with the nmslib python library. A detailed example using a scipy sparse matrix in nmslib was added recently by the author. I will adjust the dgCMatrix_2scipy_sparse() function to return either a sparse column or a sparse row matrix (I will notify you once I push the changes to GitHub). For now you can obtain the expected results using:
set.seed(1)
mat <- Matrix::Matrix(runif(1000, 0, 1), nrow = 100, sparse = TRUE, byrow = TRUE)
class(mat)
# [1] "dgCMatrix"
# attr(,"package")
# [1] "Matrix"
dgr_mat = as(mat, "RsparseMatrix")
class(dgr_mat)
# [1] "dgRMatrix"
# attr(,"package")
# [1] "Matrix"
SCP = reticulate::import("scipy", convert = F)
dtm_not_prefix_tfidf_scipy = SCP$sparse$csr_matrix(reticulate::tuple(dgr_mat@x, dgr_mat@j, dgr_mat@p),
                                                      shape = reticulate::tuple(dgr_mat@Dim[1], dgr_mat@Dim[2]))
The conversion to a scipy row sparse matrix follows the same logic as before.
dtm_not_prefix_tfidf_scipy$shape
# (100, 10)
dtm_not_prefix_tfidf_scipy$getrow(0L)
# (0, 0) 0.2655086631421
# (0, 1) 0.37212389963679016
# (0, 2) 0.5728533633518964
# (0, 3) 0.9082077899947762
# (0, 4) 0.2016819310374558
# (0, 5) 0.8983896849676967
# (0, 6) 0.9446752686053514
# (0, 7) 0.6607977924868464
# (0, 8) 0.6291140438988805
# (0, 9) 0.061786270467564464
library(nmslibR)
init_nms = NMSlib$new(input_data = dtm_not_prefix_tfidf_scipy, Index_Params = list('indexThreadQty' = 1),
space = 'cosinesimil_sparse', method = 'hnsw', data_type = 'SPARSE_VECTOR', dtype = 'FLOAT',
print_progress = TRUE)
b = init_nms$knn_Query_Batch(query_data = dtm_not_prefix_tfidf_scipy,
k = 5, num_threads = 1)
dim(b$knn_idx)
# [1] 100 5
max(b$knn_idx)
# [1] 100
df_out = as.data.frame(b$knn_idx)
dim(df_out)
# [1] 100 5
head(df_out)
# V1 V2 V3 V4 V5
# 1 24 89 73 51 82
# 2 19 24 75 60 82
# 3 85 5 77 62 65
# 4 94 93 58 8 9
# 5 84 51 42 18 63
# 6 76 71 48 92 66
Please test it and let me know. |
Hi, I confirm the code above works correctly on my real data compared to my own brute force search implementation. |
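For reference, a small hypothetical brute-force check (the user's own implementation is not shown in the thread) that could be used to validate the approximate neighbours returned by knn_Query_Batch: compute the full cosine-similarity matrix and compare the top-k indices row by row.
library(Matrix)
set.seed(1)
mat = Matrix::Matrix(runif(1000, 0, 1), nrow = 100, sparse = TRUE, byrow = TRUE)
dense = as.matrix(mat)
row_norms = sqrt(rowSums(dense^2))
cos_sim = (dense %*% t(dense)) / (row_norms %o% row_norms)                        # 100 x 100 cosine similarities
k = 5
brute_knn = t(apply(cos_sim, 1, function(s) order(s, decreasing = TRUE)[1:k]))    # 1-based neighbour indices per row
head(brute_knn)   # compare against init_nms$knn_Query_Batch(...)$knn_idx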
@pommedeterresautee I made the following changes:
data(ionosphere, package = 'KernelKnn')
X = as.matrix(ionosphere[, -c(1:2, ncol(ionosphere))])
mt_2_sprm = mat_2scipy_sparse(X, format = 'sparse_row_matrix') # first case : R-matrix to scipy-sparse-row-matrix
mt_2_dgr = as(X, "dgRMatrix")
dgr_2_scsp = TO_scipy_sparse(R_sparse_matrix = mt_2_dgr) # second case : R-sparse-matrix to scipy-sparse-row-matrix
# test that both 'dense knn' (using an R object) and 'sparse knn' (using a python scipy sparse object) return the same output
#----------------------------------------------------------------------------------------------------------------------------
dist_knn = KernelKnn::knn.index.dist(X, TEST_data = NULL, k = 5, method = "euclidean") # the corresponding distance for 'euclidean' in nmslibR is 'l2' or 'l2_sparse' (in case of sparse matrices). See page 31 of the manual.
# nmslibR with "dense" data and sequential search
#------------------------------------------------
init_nms = NMSlib$new(input_data = X, space = "l2", method = 'seq_search',
data_type = 'DENSE_VECTOR', dtype = 'FLOAT', print_progress = F)
all_dat = init_nms$knn_Query_Batch(X, k = 5, num_threads = 1)
# nmslibR with "sparse" data and sequential search
#-------------------------------------------------
init_nms_spr = NMSlib$new(input_data = dgr_2_scsp, space = "l2_sparse", method = 'seq_search',
data_type = 'SPARSE_VECTOR', dtype = 'FLOAT', print_progress = F)
all_dat_spr = init_nms_spr$knn_Query_Batch(dgr_2_scsp, k = 5, num_threads = 1)
# all three outputs (dist_knn, all_dat, all_dat_spr) must return approximately equal results
#-------------------------------------------------------------------------------------------
# indices [ first 6 rows ]
tmp1 = identical(dist_knn$train_knn_idx[1:6, ], all_dat$knn_idx[1:6, ])
tmp2 = identical(all_dat_spr$knn_idx[1:6, ], all_dat$knn_idx[1:6, ])
tmp3 = identical(dist_knn$train_knn_idx[1:6, ], all_dat_spr$knn_idx[1:6, ])
# distances [ last 6 rows ] -- approximately equal ( use of round() function )
tmp_row1 = identical(round(tail(dist_knn$train_knn_dist), 4), round(tail(all_dat$knn_dist), 4))
tmp_row2 = identical(round(tail(all_dat_spr$knn_dist), 4), round(tail(all_dat$knn_dist), 4))
tmp_row3 = identical(round(tail(dist_knn$train_knn_dist), 4), round(tail(all_dat_spr$knn_dist), 4))
print(all(tmp1, tmp2, tmp3, tmp_row1, tmp_row2, tmp_row3))
The results for the ionosphere data set agree for dense and sparse data (after conversion from an R sparse matrix to a scipy sparse matrix). I submitted the new version to CRAN and uploaded the updated version to GitHub. Feel free to comment / close the issue if the updated version returns the expected output. |
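As a quick sanity check, a minimal sketch (assuming the updated TO_scipy_sparse() function described above) of how the original dgCMatrix example can be converted after the fix: coerce to the row-compressed dgRMatrix class first and then pass the result to NMSlib as before.
library(Matrix)
library(nmslibR)
set.seed(1)
mat = Matrix::Matrix(runif(1000, 0, 1), nrow = 100, sparse = TRUE, byrow = TRUE)   # dgCMatrix (CSC)
mat_row = as(mat, "RsparseMatrix")                                                 # dgRMatrix (CSR)
scipy_csr = TO_scipy_sparse(R_sparse_matrix = mat_row)
scipy_csr$shape   # expected: (100, 10)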
I close the issue for now, feel free to reopen it. |
Had the same issue - the root cause was the space parameter not matching the sparse input. It would help to add input verification for that. |
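A hypothetical sketch of the kind of input verification suggested above (the function name and the exact check are illustrative, not the package's actual code): require that a '*_sparse' space is used together with SPARSE_VECTOR data and vice versa.
validate_space = function(space, data_type) {
  is_sparse_space = grepl("_sparse", space, fixed = TRUE)
  is_sparse_data  = identical(data_type, "SPARSE_VECTOR")
  if (is_sparse_space != is_sparse_data) {
    stop("The 'space' parameter ('", space, "') does not match the 'data_type' ('", data_type, "')!", call. = FALSE)
  }
  invisible(TRUE)
}
validate_space('cosinesimil_sparse', 'SPARSE_VECTOR')   # ok
# validate_space('l1', 'SPARSE_VECTOR')                 # would raise an error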
hi @ophiry, |
I have made several tests with sparse matrices based on the Vignette and the /tests folder. The results are obviously wrong; even stranger, the matrix seems to be transposed (I have 700K rows and 14K columns, and when I perform knn_Query_Batch on the full matrix, the max index returned is 14K). Is there something I need to do to make it work correctly? If I change the space to l1 (same code as above) I get the following error at the init part (there is an error per thread):