[Bug]: Re-inserting records leads to log messages on every subsequent operation #870
Comments
@HammadB can you take a look at this?
Hi @mattpovey, the embeddings queue stores all operations because it's designed to be an event log of user operations. We could certainly purge duplicate entries, but that's not something we intend to take on now. The reason we blindly store all operations is that in the distributed architecture of Chroma we plan to back the embeddings_queue implementation with Pulsar, a proper message queue, and we don't want to validate entries for duplication before putting them on the queue, since this would hurt ingestion speed. The design, then, is to put everything on the queue and let downstream indexing nodes decide whether there is a duplicate. To preserve API behavior, local mode logs a warning if you add an existing embedding. However, it should only log when that specific id is being added; are you seeing the warning on every add? Also, it should not log on query. I am unable to reproduce this behavior; can you share a reproduction?
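To make the "enqueue blindly, validate on consume" design concrete, here is a toy sketch of that pattern. This is illustrative only, not Chroma internals; all names are made up, and only the warning text mirrors the one the thread discusses.

```python
# Toy sketch of the "enqueue blindly, deduplicate on consume" pattern
# described above. Names are illustrative, not Chroma internals.
from collections import deque

queue = deque()      # stand-in for the embeddings_queue / Pulsar topic
indexed_ids = set()  # state held by a downstream indexing node

def produce(record_id: str, embedding: list[float]) -> None:
    # The producer does no duplicate check: validating here would slow ingestion.
    queue.append((record_id, embedding))

def consume() -> None:
    # The consumer (indexing node) decides what to do with duplicates.
    while queue:
        record_id, embedding = queue.popleft()
        if record_id in indexed_ids:
            print(f"Add of existing embedding ID: {record_id}")  # warning, not an error
            continue
        indexed_ids.add(record_id)
        # ... index the embedding ...

produce("doc-1", [0.1, 0.2])
produce("doc-1", [0.1, 0.2])  # the duplicate is enqueued without complaint
consume()                     # logs the warning once, indexes once
```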
I think it is related.
@mattpovey any chance you have a repro here?
127.0.0.1 - - [14/Sep/2023 00:42:15] "GET /post_url?url=https://alcova.com/4-simple-home-security-hacks/&tenantID=050C7CB8-1DD5-430E-AF6D-67F2B5161E0B HTTP/1.1" 200 -
This is a tough bug to reproduce because it only seems to happen when items get stuck in the embedding_queue. I ran into it when I noticed that my documents were not being returned when using `query_texts` but were returned when using `where_document={"$contains": text}`. The reason was that the embeddings were never added for those documents, because the same document embedding was stuck in the embedding_queue. This resulted in strange silent behavior where records stopped showing up in results because they had no embeddings.

```python
collection.delete(where={'doc_id': doc_id})
collection.add(
    documents=contents,
    ids=ids,
    metadatas=metadatas)
results = collection.get(
    where={'doc_id': {"$in": [doc_id]}},
    include=['embeddings', 'metadatas', 'documents'],
)
print('input id count: ', len(ids))
print('results embeddings count: ', len(results.get('embeddings', [])))
print('results ids count: ', len(results.get('ids', [])))
```

This would result in:

```
input id count: 10
results embeddings count: 5
results ids count: 10
```
Switching to a new collection and rerunning the script showed that all the embeddings were added correctly. From what I've read, this happens because the embedding_queue is stuck or full, so when you try to re-embed the same document (with the same hash?) the embedding is never added because it is still stuck in the queue. It would be helpful to have an error thrown when this happens, so I know the embeddings were not added before I go to retrieve them, as sketched below. It would also be helpful to have some way to remove the stuck items from the queue, or to retry them.
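For illustration, the kind of sanity check asked for above could look like this. It is purely a sketch, not part of the Chroma API; the `assert_embeddings_present` helper and the `doc_id` metadata key are hypothetical.

```python
# Sketch of the suggested sanity check: fail loudly when records come back
# without embeddings instead of silently vanishing from query results.
# Illustrative only; not part of the Chroma API.
def assert_embeddings_present(collection, doc_id: str) -> None:
    results = collection.get(
        where={"doc_id": {"$in": [doc_id]}},
        include=["embeddings"],
    )
    n_ids = len(results["ids"])
    embeddings = results["embeddings"]
    n_embeddings = 0 if embeddings is None else len(embeddings)
    if n_embeddings != n_ids:
        raise RuntimeError(
            f"only {n_embeddings} of {n_ids} records for doc_id={doc_id!r} "
            "have embeddings; the rest may be stuck in the embeddings_queue"
        )
```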
@wroscoe, thank you for the detailed analysis. The behaviour you describe was fixed in version 0.4.11+ https://github.com/chroma-core/chroma/releases/tag/0.4.11 (PR). I tried to reproduce it with the latest Chroma version but couldn't. The issue was due to a lack of checks in the BF (brute-force) index. For completeness, here is a diagram explaining how the Chroma WAL (write-ahead log or, as you referred to it, embedding queue) works. For each collection, Chroma maintains two vector indices: a brute-force index (in-memory, fast) and an HNSW index (persisted to disk, slow when adding new vectors and persisting). As you can imagine, the BF index serves as a buffer holding the portion of the WAL not yet committed to the persisted HNSW index. The HNSW index itself keeps a max sequence id counter, stored in a metadata file, which indicates the position in the WAL from which buffering into the BF index should begin; this buffering usually happens when the collection is first accessed. There are two transfer points (in the diagram, the sync threshold) from BF to HNSW, both controlled via collection-level metadata with respective named params.
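For readers who want to tune these thresholds, a minimal sketch follows. It assumes the `hnsw:batch_size` and `hnsw:sync_threshold` collection-metadata params as I understand them from recent Chroma releases; verify the names against your version's documentation.

```python
# Sketch: setting the BF->HNSW transfer points via collection metadata.
# Param names are my reading of recent Chroma releases; check your
# version's documentation before relying on them.
import chromadb

client = chromadb.PersistentClient(path="./chroma-data")
collection = client.get_or_create_collection(
    name="tuned",
    metadata={
        "hnsw:batch_size": 100,       # max uncommitted items buffered in the BF index
        "hnsw:sync_threshold": 1000,  # changes accumulated before HNSW is persisted
    },
)
```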
Can confirm that this issue still exists on 0.4.24. Is there any way to clear the embedding queue?
@mcflem06, we've found a bug and are working to fix it ASAP.
Still getting this.
Yeah, what's up with this? What is the behavior that makes this happen?
@tazarov Hi, following the flow in the diagram, does it mean that each vector is stored in two copies, one in the embedding_queue and one on disk via the persisted index? Are the vectors in the embedding_queue useless after the vectors are added to the in-memory index?
Any workaround for this issue? We are facing this on Chroma version 0.5.6.
What happened?
Reinserting records without embeddings (i.e. requiring Chroma to generate the embeddings) causes them to be held in the embeddings_queue table of chromadb.sqlite3. On every subsequent operation, log messages are presented as Chroma (presumably) attempts to insert the already existing records.

To replicate, attempt to re-insert a record keeping the id, documents, and metadata identical. If done n times (e.g. via a broken loop that increments record numbers incorrectly, which is how I did it), the row is added n times to the embeddings_queue table.
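A minimal reproduction along those lines might look like the sketch below. Paths, ids, and documents are illustrative, and the loop simulates the broken one described above; it assumes the default embedding function generates the embeddings.

```python
# Reproduction sketch: re-insert identical records (no embeddings supplied,
# so Chroma generates them) and watch for warnings on later operations.
# Paths, ids, and documents are illustrative.
import chromadb

client = chromadb.PersistentClient(path="./chroma-data")
collection = client.get_or_create_collection("repro")

ids = ["rec-1", "rec-2"]
documents = ["first document", "second document"]
metadatas = [{"doc_id": "d1"}, {"doc_id": "d1"}]

for _ in range(2):  # the second pass re-inserts identical rows
    collection.add(ids=ids, documents=documents, metadatas=metadatas)

# Subsequent operations (e.g. a query) then emit "existing embedding ID"
# warnings as the queued duplicates are replayed.
collection.query(query_texts=["first"], n_results=1)
```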
The issue is easily worked around by deleting the records from embeddings_queue:
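For illustration, a cleanup of that sort might look like the following. The `id` column name is an assumption based on inspecting the chromadb.sqlite3 schema and may differ between Chroma versions; back up the database file before trying anything like this.

```python
# Hypothetical workaround sketch: remove the queued rows for one record
# from the embeddings_queue table. The column name "id" is an assumption
# from inspecting chromadb.sqlite3; back up the database first.
import sqlite3

conn = sqlite3.connect("./chroma-data/chromadb.sqlite3")
with conn:  # commits on success, rolls back on error
    conn.execute(
        "DELETE FROM embeddings_queue WHERE id = ?",
        ("rec-1",),
    )
conn.close()
```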
The records appear to be added to the table in chroma/chromadb/db/mixins/embeddings_queue.py, line 88 (at 3ed229c).
The warnings are raised in:

- chroma/chromadb/segment/impl/vector/local_hnsw.py, line 303 (at 3ed229c)
- chroma/chromadb/segment/impl/metadata/sqlite.py, line 216 (at 3ed229c)
- chroma/chromadb/segment/impl/vector/local_persistent_hnsw.py, line 242 (at 3ed229c)
If there are data-loss risks associated with simply dropping the duplicate queue entries, perhaps the warnings could include an explanation of how to delete the offending records?
Versions
Chroma v0.4.2, macOS Ventura.
Relevant log output