Add BaseKnnVectorsFormatTestCase.testRecall() and fix old codecs #13910
Conversation
(commit "…in Lucene90HnswVectorsReader" updated from 9d8cd9b to b581275)
I wonder whether we should backport the fixes to the
I think we should have a test like this. I wonder if it's better to just grab 100-500 SIFT vectors and put them in a file, though.
```diff
@@ -260,7 +260,7 @@ public void search(String field, float[] target, KnnCollector knnCollector, Bits
       int node = results.topNode();
       float minSimilarity = results.topScore();
       results.pop();
-      knnCollector.collect(node, minSimilarity);
+      knnCollector.collect(vectorValues.ordToDoc(node), minSimilarity);
```
Since Lucene90 didn't support sparse vector values, I am not sure this is strictly necessary. But I can understand it from a consistency standpoint.
Oh, that's a relief! I couldn't remember if we had that or not. At any rate it is possible to create a sparse 90 index in tests now.
> Oh, that's a relief! I couldn't remember if we had that or not. At any rate it is possible to create a sparse 90 index in tests now.
This is odd? When a Codec format is replaced, we move the "read only" part to the backwards-codecs module, and the full read/write original codec to test-framework, so that unit tests can produce a 9.0 index and run modern tests (even new unit tests added since 9.0) against such "old" indices.

But that read/write codec format should not be able to produce an index that the original 9.0 (core) format was not able to produce. They should be the same original code ... so maybe sparse vector values were in fact writable in 9.0 (by accident?)?
```java
  public void testRecall() {
    // ignore this test since this class always returns no results from search
  }
```
I still think that the underlying flat format should allow search to be called and simply iterate all matching vectors. But we can adjust that in a later PR.
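To illustrate what "iterate all matching vectors" could look like, here is a minimal standalone sketch of exact (brute-force) top-k search. All names here (`FlatSearchSketch`, `Hit`, `searchExact`) are invented for illustration; this is not the actual Lucene flat-format code, just the shape of the idea: score every vector and keep the best k in a min-heap.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch of exact search over a "flat" vector store:
// score every vector, keep top-k. Names are invented for illustration.
public class FlatSearchSketch {
  public record Hit(int doc, float score) {}

  // Bounded euclidean similarity in the style Lucene uses: 1 / (1 + d^2).
  static float euclideanScore(float[] a, float[] b) {
    float sum = 0;
    for (int i = 0; i < a.length; i++) {
      float d = a[i] - b[i];
      sum += d * d;
    }
    return 1f / (1f + sum);
  }

  // Iterate all vectors; a min-heap keyed on score keeps the best k seen so far.
  public static List<Hit> searchExact(float[][] vectors, float[] query, int k) {
    PriorityQueue<Hit> heap = new PriorityQueue<>(Comparator.comparingDouble(Hit::score));
    for (int doc = 0; doc < vectors.length; doc++) {
      float score = euclideanScore(vectors[doc], query);
      if (heap.size() < k) {
        heap.add(new Hit(doc, score));
      } else if (heap.peek().score() < score) {
        heap.poll();
        heap.add(new Hit(doc, score));
      }
    }
    List<Hit> hits = new ArrayList<>(heap);
    hits.sort(Comparator.comparingDouble(Hit::score).reversed()); // best first
    return hits;
  }
}
```

An exact search like this is also what a recall test needs as its ground truth to compare an HNSW result against.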
```diff
@@ -562,7 +562,7 @@ private void add(
       String idString = Integer.toString(id);
       doc.add(new StringField("id", idString, Field.Store.YES));
       doc.add(new SortedDocValuesField("id", new BytesRef(idString)));
-      // XSSystem.out.println("add " + idString + " " + Arrays.toString(vector));
+      // System.out.println("add " + idString + " " + Arrays.toString(vector));
```
Just delete the line?
```java
          ++recalled;
        }
      }
      assertTrue("recall should be at least 5/10, got " + recalled, recalled >= 5);
```
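For context, recall here is just the overlap between the approximate (HNSW) top-k doc ids and the exact top-k. A self-contained sketch of that computation (the class and method names are invented for illustration, not part of the PR):

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper: recall@k = fraction of the exact top-k doc ids
// that also appear in the approximate result.
public class RecallCheck {
  public static double recall(int[] approxDocs, int[] exactDocs) {
    Set<Integer> exact = new HashSet<>();
    for (int doc : exactDocs) {
      exact.add(doc);
    }
    int recalled = 0;
    for (int doc : approxDocs) {
      if (exact.contains(doc)) {
        ++recalled;
      }
    }
    return (double) recalled / exactDocs.length;
  }
}
```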
I would hope recall should be within some known parameter. It would be good to know if recall improved or worsened. Either case could show an unexpected change.
I think it could be tricky to be very precise given the range of codec options. I guess we could specialize per codec?
I think having an assertAvgRecall in the base class that can be overridden would be really nice.
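One possible shape for that suggestion, sketched outside the actual Lucene test-framework (all names here are invented for illustration; a real version would live in BaseKnnVectorsFormatTestCase):

```java
// Hypothetical sketch: a base-class recall assertion with an overridable
// threshold, so codec-specific subclasses (e.g. a heavily quantized format)
// can loosen or tighten the bound.
public class RecallAssertSketch {
  // Default lower bound; a quantized-codec subclass might return a smaller value.
  public double expectedMinRecall() {
    return 0.5;
  }

  public void assertAvgRecall(double avgRecall) {
    if (avgRecall < expectedMinRecall()) {
      throw new AssertionError(
          "avg recall " + avgRecall + " below expected minimum " + expectedMinRecall());
    }
  }
}
```

A subclass would override only `expectedMinRecall()`, keeping the assertion logic in one place.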
```java
  // indexed 421 lines from LICENSE.txt
  // indexed 157 lines from NOTICE.txt
```
If we are adding two files like this, I wonder if we should simply take real vectors from a real embedding model and put them in the resources folder.
Maybe we can use GloVe or SIFT? Those are pretty small vectors, though only euclidean is expected (we would have to normalize for dot-product).
My concern is that having such a simplistic vector with so few dimensions might not actually be useful.
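The normalization mentioned above is just L2-normalizing each vector so that dot product coincides with cosine similarity, letting a euclidean-style dataset like SIFT or GloVe also exercise dot-product code paths. A minimal sketch (the helper name is invented for illustration):

```java
// Hypothetical helper: return an L2-normalized copy of a vector, so that
// dot product over the outputs equals cosine similarity over the inputs.
public class VectorNorm {
  public static float[] l2Normalize(float[] v) {
    double norm = 0;
    for (float x : v) {
      norm += (double) x * x;
    }
    double len = Math.sqrt(norm);
    if (len == 0) {
      return v.clone(); // leave the zero vector alone; it cannot be normalized
    }
    float[] out = new float[v.length];
    for (int i = 0; i < v.length; i++) {
      out[i] = (float) (v[i] / len);
    }
    return out;
  }
}
```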
lemme see if I can find bugs with the "dictionary" PR? We can always beef this up with more realistic data
```java
   * gross failures only, not to represent the true expected recall.
   */
  public void testRecall() throws IOException {
    VectorSimilarityFunction vectorSimilarityFunction = VectorSimilarityFunction.EUCLIDEAN;
```
It would be really neat if this went through each one. Quantization will do special things for different similarity functions and exercising each of those paths would be good.
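The suggestion is essentially to loop the test body over every similarity function rather than hard-coding EUCLIDEAN. Sketched standalone below with a local enum that mimics (does not reuse) the semantics of Lucene's VectorSimilarityFunction; all names are invented for illustration:

```java
// Hypothetical sketch: run a scoring pass once per similarity function,
// so each code path (and each quantization special case) gets exercised.
public class SimilaritySweep {
  public enum Sim {
    EUCLIDEAN {
      public float score(float[] a, float[] b) {
        float sum = 0;
        for (int i = 0; i < a.length; i++) {
          float d = a[i] - b[i];
          sum += d * d;
        }
        return 1f / (1f + sum);
      }
    },
    DOT_PRODUCT {
      public float score(float[] a, float[] b) {
        float dot = 0;
        for (int i = 0; i < a.length; i++) {
          dot += a[i] * b[i];
        }
        return (1f + dot) / 2f; // assumes unit vectors, as Lucene does
      }
    },
    COSINE {
      public float score(float[] a, float[] b) {
        float dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
          dot += a[i] * b[i];
          na += a[i] * a[i];
          nb += b[i] * b[i];
        }
        return (float) ((1 + dot / Math.sqrt((double) na * nb)) / 2);
      }
    };

    public abstract float score(float[] a, float[] b);
  }

  // A testRecall-style loop would run the whole index/search cycle per Sim;
  // here we just count that every scoring path produced a finite score.
  public static int sweep(float[] a, float[] b) {
    int exercised = 0;
    for (Sim sim : Sim.values()) {
      if (Float.isFinite(sim.score(a, b))) {
        exercised++;
      }
    }
    return exercised;
  }
}
```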
agreed, we want to test all the things
thanks for review!
```java
      KnnFloatVectorQuery exactQuery =
          new KnnFloatVectorQuery("field", queryEmbedding, 1000, new MatchAllDocsQuery());
```
Also, I think for more consistent runs, we may want to have multiple query embeddings that we test with and gather min, max, and avg recalls. But this can be a further refinement on this work. I just think having a single query might be very flaky in the long run.
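The min/max/avg aggregation being suggested is small enough to sketch directly (names invented for illustration; a real version would sit next to the recall test):

```java
// Hypothetical helper: aggregate recall over several query embeddings so a
// single unlucky query cannot flake the test on its own.
public class RecallStats {
  public record Stats(double min, double max, double avg) {}

  public static Stats summarize(double[] recalls) {
    double min = Double.POSITIVE_INFINITY;
    double max = Double.NEGATIVE_INFINITY;
    double sum = 0;
    for (double r : recalls) {
      min = Math.min(min, r);
      max = Math.max(max, r);
      sum += r;
    }
    return new Stats(min, max, sum / recalls.length);
  }
}
```

The test could then assert on `avg` (stable) while logging `min`/`max` to catch a regression on any single query.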
@benwtrent I think I addressed all your comments except adding binary vectors. I think as long as the vectors are not too degenerate and always the same, the test purpose is satisfied, but I'm not averse to replacing these janky ones with more "scientific" ones either; I just don't have them handy. This test seems to pass on multiple runs and it is exposing problems with the dictionary refactor I'm working on, so I think it's helpful.
LGTM! This is a nice sanity check for abhorrent behavior :)
After reflection, I don't think this is true. We always supported sparse vector indexes - I don't see how we could have avoided it, really. It seems to me this bug was introduced here and it was released as part of 9.8, meaning that releases 9.8 and later in the 9.x series will be able to read indexes produced in 9.0 but will give meaningless results for HNSW searches over those indexes. This seems like something we maybe ought to make the user community aware of. I'm just going to leave this here for comment for a bit, but soon I'll draft an actual email describing the situation and any mitigations (upgrade your software to 10? Rewrite your index with a 9.1-9.7 release?)
ok, something like this:

> Dear Lucene user community,
>
> We recently uncovered a backwards compatibility bug that affects indexes created with version 9.0 containing KNN vector fields. Versions 9.8 - 9.12 are unable to search vectors in such indexes correctly and will return incorrect results without raising any error. We think it's likely very few if any of you are using 9.0 indexes, but if you are, possible mitigation steps are:
@msokolov could we do a simpler patch for 9.12.1?
Yes, maybe we should -- I think it would be a one-liner
There is another upgrade path -- if you started with 9.0 and then "upgraded" your index by rewriting it (eg with IndexUpdater tool) via merge to 9.1-9.7, you could subsequently read the index with later versions. But this seemed kind of complex to explain for a case that probably doesn't exist.
+1 thanks @msokolov.
+1. 9.12.1 would also contain the fix for the other (horrible, caused by me!) bwc break specifically for LPVQ quantized vector indices. Backwards compatibility is hard!!
@mikemccand I have a PR open for this bug fix for 9.12. Will merge soon. Could you add a CHANGES entry in 9.12 for your bug fix for 9.12.1?
Ahh yes sorry I will do that today!
While exploring some recall-related failures in another PR, I went looking for a unit test that checks HNSW/KNN recall and couldn't find any. I think we used to have one, but maybe we removed it because it was flaky? We really do need such a test, since it is possible to make changes that preserve all the formal properties of the codecs and queries yet destroy recall. I thought that a test built on known data and vectors would be more predictable than one using random data, so I made one, and it uncovered a couple of bugs:

1. In Lucene90HnswVectorsReader we messed up (removed) ord-to-doc mappings, so we were returning vector ords instead of docids in search results. I guess this would have totally borked back-compat for Lucene90 indexes. Probably there are none in the wild, and this was never noticed?
2. In Lucene91RWFormat (used only for back-compat testing) we messed up the diversity check, so we were producing bad graphs.

This PR fixes these things and adds the new test.