Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sparse matrix seems to not work correctly #1

Closed
pommedeterresautee opened this issue May 8, 2018 · 10 comments
Closed

Sparse matrix seems to not work correctly #1

pommedeterresautee opened this issue May 8, 2018 · 10 comments

Comments

@pommedeterresautee
Copy link

pommedeterresautee commented May 8, 2018

I have made several tests with sparse matrix based on the Vignette and /tests folder.
Results are obviously wrong, even more strange is that it seems that there is a transpose of the matrix (I have 700K rows and 14K columns, and when I perform knn_Query_Batch on the full matrix, the max index returned is 14K.

Is there something I need to do to make it work correctly?

dtm_not_prefix_tfidf_scipy$shape
# (76212, 14498)
init_nms = NMSlib$new(input_data = dtm_not_prefix_tfidf_scipy,
                      Index_Params = list('indexThreadQty' = config$cpu),
                      space = 'cosinesimil_sparse',
                      method = 'hnsw',
                      data_type = 'SPARSE_VECTOR',
                      dtype = 'FLOAT',
                      print_progress = TRUE)

 b = init_nms$knn_Query_Batch(query_data = dtm_not_prefix_tfidf_scipy$getcol(1L),
                            k = 5,
                            num_threads = config$cpu)

If I change the space to l1 (same code as above) I get the following error at the init part (there is an error per thread):

Check failed: obj1->datalength() == obj2->datalength()
Check failed: obj1->datalength() == obj2->datalength()
Check failed: obj1->datalength() == obj2->datalength()
Check failed: obj1->datalength() == obj2->datalength()
Check failed: obj1->datalength() == obj2->datalength()
Check failed: obj1->datalength() == obj2->datalength()
Check failed: obj1->datalength() == obj2->datalength()
Check failed: obj1->datalength() == obj2->datalength()
Check failed: obj1->datalength() == obj2->datalength()
Check failed: obj1->datalength() == obj2->datalength()
Check failed: obj1->datalength() == obj2->datalength()
Error in py_call_impl(callable, dots$args, dots$keywords) : 
  RuntimeError: Check failed: it's either a bug or inconsistent data!
@mlampros
Copy link
Owner

mlampros commented May 9, 2018

@pommedeterresautee I'm sorry for the late reply,

would you mind sharing a reproducible example so that I'm able to find out if it's a bug in the R code that I added to the package. I assume, you mean that the error occurs when you call the init_nms$knn_Query_Batch function.

@pommedeterresautee
Copy link
Author

pommedeterresautee commented May 9, 2018

Thank you for your answer.

I have 100 rows below, and I expect to get 100 * 5 predictions (with k = 5).
I get 10 predictions and the max row returned is 10 like I have provided a matrix 10X100 and not 100X10, making me believing that there is a kind of transpose somewhere.

> mat <- Matrix::Matrix(runif(1000, 0, 1), nrow = 100, sparse = TRUE, byrow = TRUE)
> dtm_not_prefix_tfidf_scipy <- dgCMatrix_2scipy_sparse(mat)
> 
> dtm_not_prefix_tfidf_scipy$shape
(100, 10)
> 
> dtm_not_prefix_tfidf_scipy$getrow(0L)
  (0, 0)	0.4460052342619747
  (0, 1)	0.267301544547081
  (0, 2)	0.9658534412737936
  (0, 3)	0.4592860534321517
  (0, 4)	0.7499000441748649
  (0, 5)	0.29996729688718915
  (0, 6)	0.5544748182874173
  (0, 7)	0.9539664378389716
  (0, 8)	0.8479003391694278
  (0, 9)	0.5923537667840719
> mat[1,]
 [1] 0.4460052 0.2673015 0.9658534 0.4592861 0.7499000 0.2999673 0.5544748 0.9539664 0.8479003 0.5923538
# The conversion seems ok (dim and content)
> 
> init_nms = NMSlib$new(input_data = dtm_not_prefix_tfidf_scipy,
+                       Index_Params = list('indexThreadQty' = config$cpu),
+                       space = 'cosinesimil_sparse',
+                       method = 'hnsw',
+                       data_type = 'SPARSE_VECTOR',
+                       dtype = 'FLOAT',
+                       print_progress = TRUE)

0%   10   20   30   40   50   60   70   80   90   100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
> 
> b = init_nms$knn_Query_Batch(query_data = dtm_not_prefix_tfidf_scipy,
+                              k = 5,
+                              num_threads = config$cpu)
> 
> dim(b$knn_idx)
[1] 10  5
> max(b$knn_idx)
[1] 10
> 
> as.data.frame(b$knn_idx)
   V1 V2 V3 V4 V5
1   5  3  8 10  6
2  10  5  7  6  8
3   8 10  1  6  7
4  10  9  7  3  6
5   1 10  7  8  6
6   3  5 10  7  9
7  10  3  5  8  9
8   3 10  5  1  7
9   5  3  7  8  4
10  2  8  7  3  5

@pommedeterresautee
Copy link
Author

I have made several tests with my real data. The R binder nmslibR requires that the sparse matrix is transposed before being converted to scipy. After that nmlibR works as expected. I have no idea why.
The very same sparse matrix saved on R (MM format) and reloaded in Python works without requiring any transpose.
So there is a kind of bug somewhere in the binder.

@mlampros
Copy link
Owner

mlampros commented May 9, 2018

@pommedeterresautee I'm sorry for the late reply,

I didn't have access to my computer. I'll start looking at this issue right now.

@mlampros
Copy link
Owner

mlampros commented May 9, 2018

@pommedeterresautee thanks for making me aware of this issue,

the nmslib python package expects a Compressed Sparse Row matrix (scipy.sparse.csr_matrix). The corresponding one in R is the dgRMatrix-class.

The dgCMatrix_2scipy_sparse() function returns by default a Compressed Sparse Column matrix ( scipy.sparse.csc_matrix), which is not the correct one to use in the nmblib python library.

A detailed example using a scipy sparse matrix in nmslib was added recently by the author.

I will adjust the dgCMatrix_2scipy_sparse() function to return either a sparse column or sparse row matrix (I will notify you once l push the changes on Github).

For now you can receive the expected results using,

set.seed(1)
mat <- Matrix::Matrix(runif(1000, 0, 1), nrow = 100, sparse = TRUE, byrow = TRUE)

class(mat)

# [1] "dgCMatrix"
# attr(,"package")
# [1] "Matrix"

dgr_mat = as(mat, "RsparseMatrix")

class(dgr_mat)

# [1] "dgRMatrix"
# attr(,"package")
# [1] "Matrix"

SCP = reticulate::import("scipy", convert = F)

dtm_not_prefix_tfidf_scipy = SCP$sparse$csr_matrix(reticulate::tuple(dgr_mat@x, dgr_mat@j, dgr_mat@p), 
                                                   
                                                   shape = reticulate::tuple(dgr_mat@Dim[1], dgr_mat@Dim[2]))

The conversion to a scipy row sparse matrix follows the same logic as before.

dtm_not_prefix_tfidf_scipy$shape

# (100, 10)

dtm_not_prefix_tfidf_scipy$getrow(0L)

# (0, 0)	0.2655086631421
# (0, 1)	0.37212389963679016
# (0, 2)	0.5728533633518964
# (0, 3)	0.9082077899947762
# (0, 4)	0.2016819310374558
# (0, 5)	0.8983896849676967
# (0, 6)	0.9446752686053514
# (0, 7)	0.6607977924868464
# (0, 8)	0.6291140438988805
# (0, 9)	0.061786270467564464

library(nmslibR)

init_nms = NMSlib$new(input_data = dtm_not_prefix_tfidf_scipy, Index_Params = list('indexThreadQty' = 1),
                      
                      space = 'cosinesimil_sparse', method = 'hnsw', data_type = 'SPARSE_VECTOR', dtype = 'FLOAT', 
                      
                      print_progress = TRUE)

b = init_nms$knn_Query_Batch(query_data = dtm_not_prefix_tfidf_scipy,
                             
                             k = 5, num_threads = 1)

dim(b$knn_idx)

# [1] 100   5

max(b$knn_idx)

# [1] 100

df_out = as.data.frame(b$knn_idx)

dim(df_out)

# [1] 100   5

head(df_out)

#   V1 V2 V3 V4 V5
# 1 24 89 73 51 82
# 2 19 24 75 60 82
# 3 85  5 77 62 65
# 4 94 93 58  8  9
# 5 84 51 42 18 63
# 6 76 71 48 92 66

please test it and let me know.

@pommedeterresautee
Copy link
Author

Hi, I confirm the code above works correctly on my real data compared to my own brute force search implementation.

@mlampros
Copy link
Owner

mlampros commented May 10, 2018

@pommedeterresautee I did the following changes,

  • I renamed / modified the dgCMatrix_2scipy_sparse function to TO_scipy_sparse and this function can convert from dgCMatrix or dgRMatrix to scipy.sparse.csc_matrix or scipy.sparse.csr_matrix
  • I added a startup message when nmslibR is loaded to inform for this change
  • I updated the vignette
  • I added a test case which returns the same output for the following cases,
data(ionosphere, package = 'KernelKnn')

X = as.matrix(ionosphere[, -c(1:2, ncol(ionosphere))])

mt_2_sprm = mat_2scipy_sparse(X, format = 'sparse_row_matrix')                  # first case : R-matrix to scipy-sparse-row-matrix
    
mt_2_dgr = as(X, "dgRMatrix")

dgr_2_scsp = TO_scipy_sparse(R_sparse_matrix = mt_2_dgr)                        # second case : R-sparse-matrix to scipy-sparse-row-matrix
    
    
    
# test that both 'dense knn' (using an R object) and 'sparse knn' (using a python scipy sparse object) return the same output
#----------------------------------------------------------------------------------------------------------------------------
    
dist_knn = KernelKnn::knn.index.dist(X, TEST_data = NULL, k = 5, method = "euclidean")        # the corresponding distance for 'euclidean' in nmslibR is 'l2' or 'l2_sparse (in case of sparse matrices). Page 31 of manual.

    
    
# nmslibR with "dense" data and sequential search
#------------------------------------------------
    
 init_nms = NMSlib$new(input_data = X, space = "l2", method = 'seq_search', 
                                   
                                   data_type = 'DENSE_VECTOR', dtype = 'FLOAT', print_progress = F)
    
    
all_dat = init_nms$knn_Query_Batch(X, k = 5, num_threads = 1)

    
    
# nmslibR with "sparse" data and sequential search
#-------------------------------------------------
    
init_nms_spr = NMSlib$new(input_data = dgr_2_scsp, space = "l2_sparse", method = 'seq_search', 
                                       
                                       data_type = 'SPARSE_VECTOR', dtype = 'FLOAT', print_progress = F)
    
    
all_dat_spr = init_nms_spr$knn_Query_Batch(dgr_2_scsp, k = 5, num_threads = 1)

    
    
# all three outputs (dist_knn, all_dat, all_dat_spr) must return approximately equal results
#-------------------------------------------------------------------------------------------
  
# indices [ first 6 rows ]
    
tmp1 = identical(dist_knn$train_knn_idx[1:6, ], all_dat$knn_idx[1:6, ])
tmp2 = identical(all_dat_spr$knn_idx[1:6, ], all_dat$knn_idx[1:6, ])
tmp3 = identical(dist_knn$train_knn_idx[1:6, ], all_dat_spr$knn_idx[1:6, ])
    
# distances [ last 6 rows ] -- approximately equal ( use of round() function )
    
tmp_row1 = identical(round(tail(dist_knn$train_knn_dist), 4), round(tail(all_dat$knn_dist), 4))
tmp_row2 = identical(round(tail(all_dat_spr$knn_idx), 4), round(tail(all_dat$knn_idx), 4))
tmp_row3 = identical(round(tail(dist_knn$train_knn_idx), 4), round(tail(all_dat_spr$knn_idx), 4))

print(all(tmp1, tmp2, tmp3, tmp_row1, tmp_row2, tmp_row3)

The results for the ionoshpere data set agree for dense or sparse data (after conversion from an R sparse matrix to a scipy sparse matrix)

I submitted the new version to CRAN and uploaded the updated on Github.

Feel free to comment / close the issue if the updated version returns the expected output.

@mlampros
Copy link
Owner

I close the issue for now, feel free to reopen it.

@ophiry
Copy link

ophiry commented Dec 23, 2019

Had the same issue - root cause was that the space parameter wasn't cosinesimil_sparse but cosinesimil

adding input verification that input_data, space, and data_type are compatible could help avoid similar issues

@mlampros
Copy link
Owner

hi @ophiry,
in page 31 of the nmslib-manual there are details about each distance metric that can be used either in the python or in this R package. Every metric with a '_sparse' extension is meant to be used with sparse matrices. After my changes based on this issue I don't think there is much more to do. Don't forget that this R package is a port of the python package using reticulate, which means that error cases related to input data types must be addressed in the python code.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants