[Bug]: Re-inserting records leads to log messages on every subsequent operation #870

mattpovey · 2023-07-24T09:06:42Z

What happened?

Reinserting records without embeddings (i.e. requiring Chromadb to generate the embeddings) causes them to be held in the embeddings_queue table of chromadb.sqlite3. On every subsequent operation, log messages are presented as chroma (presumably) attempts to insert the already existing records:

Add of existing embedding ID: 21
Insert of existing embedding ID: 21

To replicate, attempt to re-insert a record keeping the id, documents and metadata identical. If done n times (e.g. via a broken loop that increments record numbers incorrectly, which is how I did it), the row is added n times to the embeddings_queue table.

The issue is easily worked around by deleting the records from embeddings_queue:

cur.execute("DELETE FROM embeddings_queue;")
conn.commit()
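The two lines above assume an already open sqlite3 connection (conn) and cursor (cur). A slightly fuller sketch of the same workaround might look like the following. The function name clear_embeddings_queue and the .bak backup suffix are illustrative choices, not part of Chroma. The backup step is there because it is not established in this report whether dropping queue entries is free of data-loss risk, and it is probably safest to stop any process that uses the database before running it.

```python
import shutil
import sqlite3
from pathlib import Path


def clear_embeddings_queue(db_path: str, backup: bool = True) -> int:
    """Delete all rows from embeddings_queue and return how many were removed.

    Hypothetical helper: it only assumes that the SQLite file contains a
    table named embeddings_queue, as described in the report above.
    """
    path = Path(db_path)
    if backup:
        # Keep a copy first, in case the queued entries turn out to be needed.
        shutil.copy2(path, path.with_name(path.name + ".bak"))
    conn = sqlite3.connect(path)
    try:
        cur = conn.cursor()
        cur.execute("SELECT COUNT(*) FROM embeddings_queue;")
        queued = cur.fetchone()[0]
        cur.execute("DELETE FROM embeddings_queue;")
        conn.commit()
    finally:
        conn.close()
    return queued
```

Calling clear_embeddings_queue("chromadb.sqlite3") would then empty the queue table and report how many rows it held; the path is whatever your persist directory uses.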

The records appear to be added to the table in this file,

chroma/chromadb/db/mixins/embeddings_queue.py

Line 88 in 3ed229c

t = Table("embeddings_queue")

The warnings are raised in:

chroma/chromadb/segment/impl/vector/local_hnsw.py

Line 303 in 3ed229c

logger.warning(f"Add of existing embedding ID: {id}")

and

chroma/chromadb/segment/impl/metadata/sqlite.py

Line 216 in 3ed229c

# We are trying to add for a record that already exists. Fail the call.

and

chroma/chromadb/segment/impl/vector/local_persistent_hnsw.py

Line 242 in 3ed229c

logger.warning(f"Add of existing embedding ID: {id}")

If there are data-loss risks associated with just dropping the duplicate queue entries, perhaps add an explanation of how to delete the offending records to the warnings?

Versions

Chroma v0.4.2, macOS Ventura.

Relevant log output

RECEIVED WHEN QUERYING ETC.

Add of existing embedding ID: 1
Add of existing embedding ID: 2
Add of existing embedding ID: 3
Add of existing embedding ID: 4
Add of existing embedding ID: 5
Add of existing embedding ID: 6
Add of existing embedding ID: 7
Add of existing embedding ID: 8
Add of existing embedding ID: 9
Add of existing embedding ID: 10
Add of existing embedding ID: 11
Add of existing embedding ID: 12
Add of existing embedding ID: 13
Add of existing embedding ID: 14
Add of existing embedding ID: 15
Add of existing embedding ID: 16
Add of existing embedding ID: 17
Add of existing embedding ID: 18
Add of existing embedding ID: 19
Add of existing embedding ID: 20
Add of existing embedding ID: 21
Add of existing embedding ID: 22
Add of existing embedding ID: 23
Add of existing embedding ID: 24
Add of existing embedding ID: 25
Add of existing embedding ID: 26
Add of existing embedding ID: 27
Add of existing embedding ID: 28
Add of existing embedding ID: 29
Add of existing embedding ID: 30
Add of existing embedding ID: 31
Add of existing embedding ID: 32
Add of existing embedding ID: 33
Add of existing embedding ID: 34
Add of existing embedding ID: 35
Add of existing embedding ID: 36
Add of existing embedding ID: 37
Add of existing embedding ID: 38
Add of existing embedding ID: 39
Add of existing embedding ID: 40
Add of existing embedding ID: 41
Add of existing embedding ID: 42
Add of existing embedding ID: 43
Add of existing embedding ID: 44
Add of existing embedding ID: 45
Add of existing embedding ID: 46
Add of existing embedding ID: 47
Add of existing embedding ID: 48
Add of existing embedding ID: 49
Add of existing embedding ID: 50
Add of existing embedding ID: 51

RECEIVED DURING THE ADD OPERATION WHICH CAUSED THE PROBLEM:

Add of existing embedding ID: 1
Insert of existing embedding ID: 1
Add of existing embedding ID: 2
Insert of existing embedding ID: 2
Add of existing embedding ID: 3
Insert of existing embedding ID: 3
Add of existing embedding ID: 4
Insert of existing embedding ID: 4
Add of existing embedding ID: 5
Insert of existing embedding ID: 5
Add of existing embedding ID: 6
Insert of existing embedding ID: 6
Add of existing embedding ID: 7
Insert of existing embedding ID: 7
Add of existing embedding ID: 8
Insert of existing embedding ID: 8
Add of existing embedding ID: 9
Insert of existing embedding ID: 9
Add of existing embedding ID: 10
Insert of existing embedding ID: 10
Add of existing embedding ID: 11
Insert of existing embedding ID: 11
Add of existing embedding ID: 12
Insert of existing embedding ID: 12
Add of existing embedding ID: 13
Insert of existing embedding ID: 13
Add of existing embedding ID: 14
Insert of existing embedding ID: 14
Add of existing embedding ID: 15
Insert of existing embedding ID: 15
Add of existing embedding ID: 16
Insert of existing embedding ID: 16
Add of existing embedding ID: 17
Insert of existing embedding ID: 17
Add of existing embedding ID: 18
Insert of existing embedding ID: 18
Add of existing embedding ID: 19
Insert of existing embedding ID: 19
Add of existing embedding ID: 20
Insert of existing embedding ID: 20
Add of existing embedding ID: 21
Insert of existing embedding ID: 21
Add of existing embedding ID: 22
Insert of existing embedding ID: 22
Add of existing embedding ID: 23
Insert of existing embedding ID: 23
Add of existing embedding ID: 24
Insert of existing embedding ID: 24
Add of existing embedding ID: 25
Insert of existing embedding ID: 25
Add of existing embedding ID: 26
Insert of existing embedding ID: 26
Add of existing embedding ID: 27
Insert of existing embedding ID: 27
Add of existing embedding ID: 28
Insert of existing embedding ID: 28
Add of existing embedding ID: 29
Insert of existing embedding ID: 29
Add of existing embedding ID: 30
Insert of existing embedding ID: 30
Add of existing embedding ID: 31
Insert of existing embedding ID: 31
Add of existing embedding ID: 32
Insert of existing embedding ID: 32
Add of existing embedding ID: 33
Insert of existing embedding ID: 33
Add of existing embedding ID: 34
Insert of existing embedding ID: 34
Add of existing embedding ID: 35
Insert of existing embedding ID: 35
Add of existing embedding ID: 36
Insert of existing embedding ID: 36
Add of existing embedding ID: 37
Insert of existing embedding ID: 37


jeffchuber · 2023-07-24T13:38:19Z

@HammadB can you take a look at this?

HammadB · 2023-07-24T22:43:26Z

Hi @mattpovey,

The embeddings queue stores all operations because it's designed to be an event log of user operations. I think we could definitely purge duplicate entries, but that's not something we intend to take on now.

The reason we blindly store all operations is that, in the distributed architecture of Chroma, we plan to back the embeddings_queue implementation with Pulsar, a proper message queue. However, we don't want to check entries for duplication before putting them on the queue, since this would negatively affect speed. The design, then, is to put things on the queue and let downstream indexing nodes decide whether or not there is a duplicate. In order to preserve API behavior, the local mode logs a warning if you add an existing embedding. However, it should only log if the specific id is being added. Are you seeing the warning on any add?

Also, it should not log on query, I am unable to reproduce this behavior, can you share a reproduction?
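For readers following along, the design described in this comment (store every operation unchecked, let the downstream indexer detect duplicates) can be sketched roughly as below. This is a toy model, not Chroma's code: ToyQueue, ToyIndexer and their methods are made-up names, and only the warning text is taken from the logs in this issue.

```python
import logging

logger = logging.getLogger("toy_index")


class ToyQueue:
    """Append-only log: every submitted operation is stored, duplicates included."""

    def __init__(self):
        self.entries = []

    def submit(self, record_id, embedding):
        # No duplicate check here, so writing to the queue stays cheap.
        self.entries.append((record_id, embedding))


class ToyIndexer:
    """Downstream consumer that decides whether an entry is a duplicate."""

    def __init__(self):
        self.index = {}

    def consume(self, queue):
        for record_id, embedding in queue.entries:
            if record_id in self.index:
                # Mirrors the warning text seen in the logs above.
                logger.warning("Add of existing embedding ID: %s", record_id)
                continue
            self.index[record_id] = embedding
```

In this toy version the queue keeps all submitted entries, including repeats, while the index ends up with one entry per id and a warning for each repeat.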

andrewshvv · 2023-08-12T07:42:27Z

I think it is related to #969

jeffchuber · 2023-09-06T03:35:18Z

@mattpovey any chance you have a repro here?

Also, it should not log on query, I am unable to reproduce this behavior, can you share a reproduction?

simulanics · 2023-09-14T04:42:40Z

127.0.0.1 - - [14/Sep/2023 00:42:15] "GET /post_url?url=https://alcova.com/4-simple-home-security-hacks/&tenantID=050C7CB8-1DD5-430E-AF6D-67F2B5161E0B HTTP/1.1" 200 -
Insert of existing embedding ID: 801
Add of existing embedding ID: 801
Insert of existing embedding ID: 802
Add of existing embedding ID: 802
Insert of existing embedding ID: 803
Add of existing embedding ID: 803
Insert of existing embedding ID: 804
Add of existing embedding ID: 804
Insert of existing embedding ID: 805
Add of existing embedding ID: 805
127.0.0.1 - - [14/Sep/2023 00:42:17] "GET /post_url?url=https://alcova.com/4-points-to-know-about-home-inspections/&tenantID=050C7CB8-1DD5-430E-AF6D-67F2B5161E0B HTTP/1.1" 200 -
Insert of existing embedding ID: 801
Add of existing embedding ID: 801
Insert of existing embedding ID: 802
Add of existing embedding ID: 802
Insert of existing embedding ID: 803
Add of existing embedding ID: 803
Insert of existing embedding ID: 804
Add of existing embedding ID: 804
Insert of existing embedding ID: 805
Add of existing embedding ID: 805
127.0.0.1 - - [14/Sep/2023 00:42:19] "GET /post_url?url=https://alcova.com/tips-to-pet-proof-your-home/&tenantID=050C7CB8-1DD5-430E-AF6D-67F2B5161E0B HTTP/1.1" 200 -
Insert of existing embedding ID: 801
Add of existing embedding ID: 801
Insert of existing embedding ID: 802
Add of existing embedding ID: 802
Insert of existing embedding ID: 803
Add of existing embedding ID: 803
Insert of existing embedding ID: 804
Add of existing embedding ID: 804
Insert of existing embedding ID: 805
Add of existing embedding ID: 805

wroscoe · 2023-10-06T17:35:59Z

This is a tough bug to reproduce because it only seems to happen when items get stuck in the embedding_queue. I ran into this when I noticed that my documents were not being returned when using "query_texts" but were getting returned when using "where_document={"$contains": text}". I learned that the reason was that the embeddings were never getting added to the documents because the same document embedding was stuck in the embedding_queue. This resulted in a weird silent behavior where records stopped showing up in results because there were no embeddings.

collection.delete(where={'doc_id': doc_id})

collection.add(
    documents=contents,
    ids=ids,
    metadatas=metadatas,
)

results = collection.get(
    where={'doc_id': {"$in": [doc_id]}},
    include=['embeddings', 'metadatas', 'documents'],
)

print('input id count: ', len(ids))
print('results embeddings count: ', len(results.get('embeddings', [])))
print('results ids count: ', len(results.get('ids', [])))

input id count: 10
results embedding count: 5
results ids count: 10

This would result in:

deleting the documents successfully
adding the documents (Add of existing embedding ID warnings would be shown on the Chroma Docker instance)
getting the documents, which returned all of them, although many did not have any embedding.

Switching to a new collection and rerunning the script showed that all the embeddings were added correctly.

From what I've read, this is because the embedding_queue is "stuck or full", so when you try to re-embed the same document (with the same hash?) the embedding is never added because it's still stuck in the queue.

It would be helpful to have an error thrown when this happens, so I know the embeddings are not being added before I try to retrieve them. It would also be helpful to have some way to remove the stuck items in the queue or retry them.
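Until such an error exists, a client-side check along these lines could at least surface the problem right after an add. It is only a sketch: count_missing_embeddings and require_all_embedded are hypothetical helpers, and they assume the get() result is a dict with "ids" and "embeddings" entries, as in the snippet above.

```python
def count_missing_embeddings(results: dict) -> int:
    """Return how many returned ids have no usable embedding."""
    ids = results.get("ids")
    ids = [] if ids is None else ids
    embeddings = results.get("embeddings")
    if embeddings is None:
        # No embeddings were returned at all.
        return len(ids)
    # Count only entries that are present and non-empty.
    present = sum(1 for e in embeddings if e is not None and len(e) > 0)
    return max(len(ids) - present, 0)


def require_all_embedded(results: dict) -> None:
    """Raise instead of failing silently when embeddings are missing."""
    missing = count_missing_embeddings(results)
    if missing:
        raise RuntimeError(f"{missing} record(s) were returned without an embedding")
```

Calling require_all_embedded(results) after the collection.get(...) above would turn the silent mismatch (10 ids, 5 embeddings) into an exception.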

tazarov · 2024-01-18T15:10:39Z

@wroscoe, thank you for the detailed analysis. The behaviour you describe was fixed in version 0.4.11+ https://github.com/chroma-core/chroma/releases/tag/0.4.11 (PR). I tried to reproduce it with the latest Chroma version but couldn't.

The issue you describe was due to a lack of checks in the BF (bruteforce index).

For completeness, here is a diagram explaining how Chroma WAL (write-ahead log or, as you referred to it, embedding queue) works.

For each collection, Chroma maintains two binary indices: Bruteforce (in-memory, fast) and HNSW lib (persisted to disk, slow when adding new vectors and persisting). As you can imagine, the BF index serves as a buffer that holds the portion of the WAL that has not yet been committed to the persisted HNSW index. The HNSW index itself has a max sequence id counter, stored in a metadata file, that indicates from which position in the WAL the buffering to the BF index should begin. This buffering usually happens when the collection is first accessed.

There are two transfer points (in the diagram, sync threshold) for BF to HNSW:

hnsw:batch_size - forces the BF vectors to be added to HNSW in-memory (this is a slow operation)
hnsw:sync_threshold - forces Chroma to dump the HNSW in-memory index to disk (this is a slow operation)

Both of the above sync points are controlled via collection-level metadata, using params with the respective names. It is customary to set hnsw:sync_threshold > hnsw:batch_size.
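Since the diagram is not reproduced here, the flow described in this comment can be approximated with a small toy model. Everything in it is an assumption made for illustration (the class and field names, and the exact rule used to decide when a threshold is reached); it is not Chroma's implementation, only a sketch of the WAL, the BF buffer, the two transfer points and the max sequence id.

```python
from dataclasses import dataclass, field


@dataclass
class ToyCollection:
    """Toy model of the WAL -> brute-force buffer -> HNSW flow."""

    batch_size: int = 3        # stands in for hnsw:batch_size
    sync_threshold: int = 6    # stands in for hnsw:sync_threshold
    wal: list = field(default_factory=list)          # (seq_id, record_id) pairs
    bf_buffer: list = field(default_factory=list)    # in-memory brute-force index
    hnsw_memory: list = field(default_factory=list)  # in-memory HNSW index
    hnsw_disk: list = field(default_factory=list)    # persisted HNSW index
    max_seq_id: int = 0                              # last seq_id persisted to disk

    def add(self, record_id: str) -> None:
        seq_id = len(self.wal) + 1
        self.wal.append((seq_id, record_id))
        self._buffer(seq_id, record_id)

    def _buffer(self, seq_id: int, record_id: str) -> None:
        self.bf_buffer.append((seq_id, record_id))
        if len(self.bf_buffer) >= self.batch_size:
            # First transfer point: move buffered vectors into in-memory HNSW.
            self.hnsw_memory.extend(self.bf_buffer)
            self.bf_buffer.clear()
        if len(self.hnsw_memory) - len(self.hnsw_disk) >= self.sync_threshold:
            # Second transfer point: dump the in-memory HNSW index to disk.
            self.hnsw_disk = list(self.hnsw_memory)
            self.max_seq_id = self.hnsw_disk[-1][0]

    def restart(self) -> None:
        """Simulate a restart: in-memory state is lost, then the WAL is replayed."""
        self.bf_buffer = []
        self.hnsw_memory = list(self.hnsw_disk)
        for seq_id, record_id in self.wal:
            if seq_id > self.max_seq_id:
                self._buffer(seq_id, record_id)
```

With the default values, adding seven records leaves six of them persisted and one in the BF buffer, and restart() re-buffers only the WAL entries after max_seq_id.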

mcflem06 · 2024-04-01T18:10:08Z

Can confirm that this issue still exists on 0.4.24. Is there any way to clear the embedding queue?

tazarov · 2024-04-24T12:43:53Z

@mcflem06, we've found a bug and are working to fix it ASAP.

Bourhano · 2024-08-29T11:07:07Z

Still getting "Add of existing embedding ID: xx" when querying a collection for the first time on chromadb==0.5.5 and 0.5.3.

rkrishnasanka · 2024-09-20T06:09:32Z

Yeah, what's up with this? What is the behavior that makes this happen?

Amphetaminewei · 2024-09-25T09:47:45Z

For completeness, here is a diagram explaining how Chroma WAL (write-ahead log or, as you referred to it, embedding queue) works.

@tazarov Hi, following the flow in the diagram, does it mean that each vector is stored twice, once in the embedding_queue and once on disk via the persistent index? Are the vectors in the embedding_queue useless once they have been added to the in-memory index?

RashmiSutrave · 2024-09-26T11:19:11Z

Is there any workaround for this issue? We are facing this on Chroma version 0.5.6.

mattpovey added the bug Something isn't working label Jul 24, 2023

jeffchuber assigned HammadB Jul 24, 2023

HammadB assigned tazarov and unassigned HammadB Nov 7, 2023

HammadB added the taz-sprint-4 label Jan 15, 2024

tazarov mentioned this issue Apr 25, 2024

[BUG]: The batch, the sync and the missing vector #2062

Closed

1 task

ymzayek mentioned this issue Oct 7, 2024

[Bug]: Missing embeddings in collections after a system reboot #2905

Closed

itaismith closed this as completed Dec 30, 2024
