Error when searching for a word that doesn't exist in the corpus #75
This is a known issue when using LSI without rb-gsl. There is a bug in our matrix SVD that I haven't been able to track down. In the meantime, using rb-gsl should fix the issue.
I was already using rb-gsl beforehand (which I needed to do since the pure-Ruby version was so slow). Using binding.pry, I determined that $GSL is set to true after I require the classifier-reborn library, yet I'm still getting the error I pointed out above. I'll see if I can try to debug the issue...
So here's Day 1 of me trying to solve this problem (and writing down notes so that I remember when I come back to this problem again).

EDIT: I see what's corrupting the raw_norm: normalize. After calculating the raw_vector, we call normalize on it, and then:

content_node.raw_vector
=> GSL::Vector
[ 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 ... ]

content_node.raw_vector.normalize
=> GSL::Vector
[ nan nan nan nan nan nan nan ... ]

EDIT2: So I found out how to normalize a vector.
Well, the norm sqrt(0^2 + 0^2 + ... + 0^2) is going to be sqrt(0), which is 0. So when we normalize a vector where every coordinate is 0, each component becomes 0/0, and GSL::Vector's normalize hands back a vector full of NaNs.

Would it be okay to write some code to check if a normalized vector has any NaNs, and then replace all instances of NaN with 0, so that vector multiplication can still occur properly? Or would doing so be seen as too hacky?
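A minimal sketch of that NaN-scrubbing idea, written here against plain Ruby floats rather than GSL (scrub_nans is a hypothetical helper name, not part of classifier-reborn):

# Replace every NaN produced by normalizing an all-zero vector with 0.0,
# so later dot products stay finite instead of propagating NaN.
def scrub_nans(values)
  values.map { |v| v.respond_to?(:nan?) && v.nan? ? 0.0 : v }
end

raw        = [0.0, 0.0, 0.0]
norm       = Math.sqrt(raw.sum { |v| v * v })  # => 0.0
normalized = raw.map { |v| v / norm }          # => [NaN, NaN, NaN]
scrub_nans(normalized)                         # => [0.0, 0.0, 0.0]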
Yeah, some form of this problem has been causing issues since the first versions. If you're willing to try to put together a PR with some normalization and NaN handling, that would be amazing!
I'd be curious to know more about the input that's causing that error.
Here is the source code of the input that was causing the error.
@tra38 sorry for the delay. So I was able to reproduce the issue with the following searches:

array = lsi.search("we", 9)
array = lsi.search("we can p", 9)

But... maybe we can catch this error and respond with a sensible message.
array = lsi.search("we",9)
array = lsi.search("we can p",9)
This makes logical sense (for a computer, I mean). Please forgive me if this seems a bit too technical, but I wanted to write the following explanation down to clarify for myself what's going on: "we" and "can" are stopwords, so obviously the computer will ignore them. "p" isn't a stopword, but ClassifierReborn::Hasher.word_hash_for_words only stores words that have more than 2 characters. "p" has 2 characters or fewer, so we throw it away. The end result is that we are asking LSI to find a document that is similar to an empty hash, and obviously none of the documents we have are empty hashes. So we have NaN vectors and errors galore.

Interestingly, if I throw the sentence "we can p" into my array of strings, LSI searching breaks entirely and you will be unable to search for anything. This is how the computer views the sentence "we can p":

=> #<ClassifierReborn::ContentNode:0x007f9aaa208248
@categories=[],
@lsi_norm=GSL::Vector
[ nan nan nan nan nan nan nan ... ],
@lsi_vector=GSL::Vector
[ 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 ... ],
@raw_norm=GSL::Vector
[ nan nan nan nan nan nan nan ... ],
@raw_vector=GSL::Vector
[ 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 ... ],
@word_hash={}>

(This is the same ContentNode that is constructed if we use "we can p" as a search term instead of as a document.)
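To make the filtering concrete, here is a rough, stand-alone imitation of the behaviour described above (the stopword list is an illustrative subset and toy_word_hash is a made-up helper, not the library's actual Hasher code); it also anticipates the "we can predict" case discussed next:

# Stopwords and tokens of two characters or fewer never reach the word hash.
STOPWORDS = %w[we can the a an and is].freeze  # illustrative subset only

def toy_word_hash(text)
  text.downcase.scan(/[a-z']+/)
      .reject { |w| STOPWORDS.include?(w) || w.length <= 2 }
      .map(&:to_sym)
      .tally
end

toy_word_hash("we can p")        # => {}
toy_word_hash("we can predict")  # => {:predict=>1}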
On the other hand...

array = lsi.search("we can predict", 9)

Well, "predict" isn't a stopword, and it has more than 2 characters. So a new ContentNode can be created, which only includes the word "predict":

#<ClassifierReborn::ContentNode:0x007f8b131d8df0
@categories=[],
@lsi_norm=GSL::Vector
[ 5.018e-03 5.836e-03 9.885e-04 -2.808e-02 5.018e-03 5.044e-02 -2.720e-03 ... ],
@lsi_vector=GSL::Vector
[ 3.550e-03 4.129e-03 6.994e-04 -1.987e-02 3.550e-03 3.569e-02 -1.924e-03 ... ],
@raw_norm=GSL::Vector
[ 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 ... ],
@raw_vector=GSL::Vector
[ 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 ... ],
@word_hash={:predict=>1}>

Since the word_hash isn't empty, this search works fine. However, keep in mind that to replicate the error that led to me posting this issue, you need to type this out (searching for a word that doesn't exist in the corpus at all):

array = lsi.search("dogs", 9)

We've only just discovered more bugs in the system. So there are two major issues to worry about, then:

1. Searching for a term that isn't in the corpus (or that filters down to nothing) produces an all-zero raw vector, which normalizes into NaNs and breaks the search.
2. A document made up entirely of stopwords and too-short words produces an empty word_hash, which breaks LSI searching for everything.
Number 2 is more of an edge case, since the larger the document, the more likely it is that there will be words that aren't stopwords. So I'll probably focus on dealing with Number 1 (normalization and NaN handling).
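As a starting point for the Number 1 work, a guard along these lines would keep an all-zero vector from turning into NaNs in the first place (written here against plain Ruby arrays; safe_normalize is a hypothetical name, not the code that eventually landed in the gem):

# Divide by the norm only when it is non-zero; an all-zero raw vector stays
# all-zero instead of becoming [NaN, NaN, ...].
def safe_normalize(values)
  norm = Math.sqrt(values.sum { |v| v * v })
  norm.zero? ? values.dup : values.map { |v| v / norm }
end

safe_normalize([0.0, 0.0, 0.0])  # => [0.0, 0.0, 0.0]
safe_normalize([3.0, 4.0])       # => [0.6, 0.8]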
Fixed by #77.
I'm assuming that somewhere in the code we have a "0/0" that is being converted into a NaN. This error is avoidable as long as you don't search for terms that aren't in the corpus you're training the LSI on, but... well... what happens if I'm using a huge corpus? How am I supposed to know which words are (or are not) present?
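That assumption is easy to check in isolation: floating-point division of zero by zero doesn't raise in Ruby, it quietly produces NaN, which then spreads through every later multiplication.

0.0 / 0.0          # => NaN
(0.0 / 0.0).nan?   # => true
0.0 * (0.0 / 0.0)  # => NaN  (NaN propagates through arithmetic)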