[Bug]: Re-inserting records leads to log messages on every subsequent operation #870

Closed
mattpovey opened this issue Jul 24, 2023 · 13 comments
Assignees
Labels
bug Something isn't working taz-sprint-4

Comments

@mattpovey

What happened?

Reinserting records without embeddings (i.e. requiring Chroma to generate the embeddings) causes them to be held in the embeddings_queue table of chromadb.sqlite3. On every subsequent operation, log messages appear as Chroma (presumably) attempts to insert the already-existing records:

Add of existing embedding ID: 21
Insert of existing embedding ID: 21

To replicate, attempt to re-insert a record keeping the id, documents and metadata identical. If done n times (e.g. via a broken loop that increments record numbers incorrectly, which is how I did it), the row is added n times to the embeddings_queue table.

The issue is easily worked around by deleting the records from embeddings_queue:

cur.execute("DELETE FROM embeddings_queue;")
conn.commit()
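
A minimal, self-contained sketch of the workaround above, run against a throwaway in-memory SQLite database rather than a real Chroma file. The two-column table layout is a stand-in for illustration only and is not Chroma's actual schema; against a real installation the connection would point at the chromadb.sqlite3 file mentioned in this report, and whether deleting the rows is safe there is exactly the open question raised below.

```python
import sqlite3

# Stand-in database: in-memory, with a hypothetical simplified table layout.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE embeddings_queue (seq_id INTEGER PRIMARY KEY, id TEXT)")

# Simulate the same record ID being queued three times.
cur.executemany("INSERT INTO embeddings_queue (id) VALUES (?)", [("21",)] * 3)
conn.commit()

cur.execute("SELECT COUNT(*) FROM embeddings_queue")
rows_before = cur.fetchone()[0]

# The workaround from the report: drop every queued row.
cur.execute("DELETE FROM embeddings_queue;")
conn.commit()

cur.execute("SELECT COUNT(*) FROM embeddings_queue")
rows_after = cur.fetchone()[0]
print(rows_before, rows_after)
```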

The records appear to be added to the table in this file,

t = Table("embeddings_queue")

The warnings are raised in:

logger.warning(f"Add of existing embedding ID: {id}")

and

# We are trying to add for a record that already exists. Fail the call.

and

logger.warning(f"Add of existing embedding ID: {id}")

If there are data-loss risks associated with just dropping the duplicate queue entries, perhaps add an explanation of how to delete the offending records to the warnings?

Versions

Chroma v0.4.2, macOS Ventura.

Relevant log output

RECEIVED WHEN QUERYING ETC.

Add of existing embedding ID: 1
Add of existing embedding ID: 2
Add of existing embedding ID: 3
Add of existing embedding ID: 4
Add of existing embedding ID: 5
Add of existing embedding ID: 6
Add of existing embedding ID: 7
Add of existing embedding ID: 8
Add of existing embedding ID: 9
Add of existing embedding ID: 10
Add of existing embedding ID: 11
Add of existing embedding ID: 12
Add of existing embedding ID: 13
Add of existing embedding ID: 14
Add of existing embedding ID: 15
Add of existing embedding ID: 16
Add of existing embedding ID: 17
Add of existing embedding ID: 18
Add of existing embedding ID: 19
Add of existing embedding ID: 20
Add of existing embedding ID: 21
Add of existing embedding ID: 22
Add of existing embedding ID: 23
Add of existing embedding ID: 24
Add of existing embedding ID: 25
Add of existing embedding ID: 26
Add of existing embedding ID: 27
Add of existing embedding ID: 28
Add of existing embedding ID: 29
Add of existing embedding ID: 30
Add of existing embedding ID: 31
Add of existing embedding ID: 32
Add of existing embedding ID: 33
Add of existing embedding ID: 34
Add of existing embedding ID: 35
Add of existing embedding ID: 36
Add of existing embedding ID: 37
Add of existing embedding ID: 38
Add of existing embedding ID: 39
Add of existing embedding ID: 40
Add of existing embedding ID: 41
Add of existing embedding ID: 42
Add of existing embedding ID: 43
Add of existing embedding ID: 44
Add of existing embedding ID: 45
Add of existing embedding ID: 46
Add of existing embedding ID: 47
Add of existing embedding ID: 48
Add of existing embedding ID: 49
Add of existing embedding ID: 50
Add of existing embedding ID: 51

RECEIVED DURING THE ADD OPERATION WHICH CAUSED THE PROBLEM:

Add of existing embedding ID: 1
Insert of existing embedding ID: 1
Add of existing embedding ID: 2
Insert of existing embedding ID: 2
Add of existing embedding ID: 3
Insert of existing embedding ID: 3
Add of existing embedding ID: 4
Insert of existing embedding ID: 4
Add of existing embedding ID: 5
Insert of existing embedding ID: 5
Add of existing embedding ID: 6
Insert of existing embedding ID: 6
Add of existing embedding ID: 7
Insert of existing embedding ID: 7
Add of existing embedding ID: 8
Insert of existing embedding ID: 8
Add of existing embedding ID: 9
Insert of existing embedding ID: 9
Add of existing embedding ID: 10
Insert of existing embedding ID: 10
Add of existing embedding ID: 11
Insert of existing embedding ID: 11
Add of existing embedding ID: 12
Insert of existing embedding ID: 12
Add of existing embedding ID: 13
Insert of existing embedding ID: 13
Add of existing embedding ID: 14
Insert of existing embedding ID: 14
Add of existing embedding ID: 15
Insert of existing embedding ID: 15
Add of existing embedding ID: 16
Insert of existing embedding ID: 16
Add of existing embedding ID: 17
Insert of existing embedding ID: 17
Add of existing embedding ID: 18
Insert of existing embedding ID: 18
Add of existing embedding ID: 19
Insert of existing embedding ID: 19
Add of existing embedding ID: 20
Insert of existing embedding ID: 20
Add of existing embedding ID: 21
Insert of existing embedding ID: 21
Add of existing embedding ID: 22
Insert of existing embedding ID: 22
Add of existing embedding ID: 23
Insert of existing embedding ID: 23
Add of existing embedding ID: 24
Insert of existing embedding ID: 24
Add of existing embedding ID: 25
Insert of existing embedding ID: 25
Add of existing embedding ID: 26
Insert of existing embedding ID: 26
Add of existing embedding ID: 27
Insert of existing embedding ID: 27
Add of existing embedding ID: 28
Insert of existing embedding ID: 28
Add of existing embedding ID: 29
Insert of existing embedding ID: 29
Add of existing embedding ID: 30
Insert of existing embedding ID: 30
Add of existing embedding ID: 31
Insert of existing embedding ID: 31
Add of existing embedding ID: 32
Insert of existing embedding ID: 32
Add of existing embedding ID: 33
Insert of existing embedding ID: 33
Add of existing embedding ID: 34
Insert of existing embedding ID: 34
Add of existing embedding ID: 35
Insert of existing embedding ID: 35
Add of existing embedding ID: 36
Insert of existing embedding ID: 36
Add of existing embedding ID: 37
Insert of existing embedding ID: 37
@mattpovey mattpovey added the bug Something isn't working label Jul 24, 2023
@jeffchuber
Contributor

@HammadB can you take a look at this?

@HammadB
Collaborator

HammadB commented Jul 24, 2023

Hi @mattpovey,

The embeddings queue stores all operations because it's designed to be an event log of user operations. I think we could definitely purge duplicate entries, but that's not something we intend to take on now.

The reason we blindly store all operations is that, in the distributed architecture of Chroma, we plan to back the embeddings_queue implementation with Pulsar, a proper message queue. However, we don't want to check entries for duplication before putting them on the queue, since this would negatively affect speed. The design, then, is to put things on the queue and let downstream indexing nodes decide whether or not there is a duplicate. In order to preserve API behavior, the local mode logs a warning if you add an existing embedding. However, it should only log when that specific ID is being added. Are you seeing the warning on any add?
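
The design described above can be illustrated with a toy model. This is only a sketch of the idea, not Chroma's code: `EventLog` and `Indexer` are hypothetical names, the log accepts every add without checking for duplicates, and the downstream consumer decides what counts as a duplicate and logs a warning.

```python
import logging
from dataclasses import dataclass, field

logger = logging.getLogger("toy_indexer")


@dataclass
class EventLog:
    """Append-only log: every operation is stored, with no validation on write."""
    entries: list = field(default_factory=list)

    def append(self, record_id: str, vector: list) -> None:
        self.entries.append((record_id, vector))


@dataclass
class Indexer:
    """Downstream consumer that detects duplicates while applying the log."""
    index: dict = field(default_factory=dict)
    position: int = 0  # number of log entries already applied

    def consume(self, log: EventLog) -> int:
        duplicates = 0
        for record_id, vector in log.entries[self.position:]:
            if record_id in self.index:
                logger.warning("Add of existing embedding ID: %s", record_id)
                duplicates += 1
            else:
                self.index[record_id] = vector
        self.position = len(log.entries)
        return duplicates


log = EventLog()
log.append("21", [0.1, 0.2])
log.append("21", [0.1, 0.2])  # the duplicate is accepted by the log as-is

indexer = Indexer()
first_pass = indexer.consume(log)   # applies both entries, warns once
second_pass = indexer.consume(log)  # nothing new to apply
```

In this toy, the consumer remembers its position, so a second pass reports nothing. A consumer that replayed the log from an earlier position would log the warning again for every duplicate still in the log.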

Also, it should not log on query. I am unable to reproduce this behavior; can you share a reproduction?

@andrewshvv

I think it is related
#969

@jeffchuber
Contributor

@mattpovey any chance you have a repro here?

Also, it should not log on query. I am unable to reproduce this behavior; can you share a reproduction?

@simulanics

127.0.0.1 - - [14/Sep/2023 00:42:15] "GET /post_url?url=https://alcova.com/4-simple-home-security-hacks/&tenantID=050C7CB8-1DD5-430E-AF6D-67F2B5161E0B HTTP/1.1" 200 -
Insert of existing embedding ID: 801
Add of existing embedding ID: 801
Insert of existing embedding ID: 802
Add of existing embedding ID: 802
Insert of existing embedding ID: 803
Add of existing embedding ID: 803
Insert of existing embedding ID: 804
Add of existing embedding ID: 804
Insert of existing embedding ID: 805
Add of existing embedding ID: 805
127.0.0.1 - - [14/Sep/2023 00:42:17] "GET /post_url?url=https://alcova.com/4-points-to-know-about-home-inspections/&tenantID=050C7CB8-1DD5-430E-AF6D-67F2B5161E0B HTTP/1.1" 200 -
Insert of existing embedding ID: 801
Add of existing embedding ID: 801
Insert of existing embedding ID: 802
Add of existing embedding ID: 802
Insert of existing embedding ID: 803
Add of existing embedding ID: 803
Insert of existing embedding ID: 804
Add of existing embedding ID: 804
Insert of existing embedding ID: 805
Add of existing embedding ID: 805
127.0.0.1 - - [14/Sep/2023 00:42:19] "GET /post_url?url=https://alcova.com/tips-to-pet-proof-your-home/&tenantID=050C7CB8-1DD5-430E-AF6D-67F2B5161E0B HTTP/1.1" 200 -
Insert of existing embedding ID: 801
Add of existing embedding ID: 801
Insert of existing embedding ID: 802
Add of existing embedding ID: 802
Insert of existing embedding ID: 803
Add of existing embedding ID: 803
Insert of existing embedding ID: 804
Add of existing embedding ID: 804
Insert of existing embedding ID: 805
Add of existing embedding ID: 805

@wroscoe

wroscoe commented Oct 6, 2023

This is a tough bug to reproduce because it only seems to happen when items get stuck in the embedding_queue. I ran into this when I noticed that my documents were not being returned when using "query_texts" but were getting returned when using "where_document={"$contains": text}". I learned that the reason was that the embeddings were never getting added to the documents because the same document embedding was stuck in the embedding_queue. This resulted in a weird silent behavior where records stopped showing up in results because there were no embeddings.

collection.delete(where={'doc_id': doc_id})

collection.add(
    documents=contents,
    ids=ids,
    metadatas=metadatas,
)

results = collection.get(
    where={'doc_id': {"$in": [doc_id]}},
    include = ['embeddings', 'metadatas', 'documents'],
)

print('input id count: ', len(ids))
print('results embeddings count: ', len(results.get('embeddings', [])))
print('results ids count: ', len(results.get('ids', [])))

Output:

input id count: 10
results embedding count: 5
results ids count: 10

This would result in

  • deleting the documents successfully
  • adding the documents (Add of existing embedding ID warnings would be shown on chroma docker instance)
  • getting the documents returned all the documents but many of them did not have any embedding.

Switching to a new collection and rerunning the script showed that all the embeddings were added correctly.

From what I've read, this is because the embedding_queue is "stuck or full", so when you try to re-embed the same document (with the same hash?) the embedding is never added because it's still stuck in the queue.

It would be helpful to have an error thrown when this happens, so that I know the embeddings are not being added before I try to retrieve them. It would also be helpful to have some way to remove the stuck items in the queue, or to retry them.
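
Until something like that exists, a client-side check is one possible stopgap. The sketch below is an assumption-laden illustration, not part of Chroma: `MissingEmbeddingsError` and `check_embeddings` are hypothetical names, and the function only assumes that `results` is a dict with "ids" and "embeddings" keys, as in the `collection.get(...)` call above.

```python
class MissingEmbeddingsError(RuntimeError):
    """Raised when fewer embeddings came back than records were added."""


def check_embeddings(expected_ids, results):
    """Compare the number of usable embeddings against the IDs that were added."""
    ids = results.get("ids") or []
    embeddings = results.get("embeddings")
    # Avoid truth-testing the value directly, in case it is an array type.
    embeddings = [] if embeddings is None else list(embeddings)
    present = sum(1 for emb in embeddings if emb is not None)
    if len(ids) != len(expected_ids) or present != len(expected_ids):
        raise MissingEmbeddingsError(
            f"expected {len(expected_ids)} embeddings, "
            f"got {present} for {len(ids)} returned ids"
        )
    return present


# Mirrors the counts reported above: 10 ids returned, only 5 embeddings.
ids = [str(i) for i in range(10)]
partial = {"ids": ids, "embeddings": [[0.0]] * 5}
try:
    check_embeddings(ids, partial)
    raised = False
except MissingEmbeddingsError:
    raised = True
```

Calling a check like this right after `collection.add(...)` would turn the silent failure into an explicit one, though it does nothing to clear whatever is stuck in the queue.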

@HammadB HammadB assigned tazarov and unassigned HammadB Nov 7, 2023
@tazarov
Contributor

tazarov commented Jan 18, 2024

@wroscoe, thank you for the detailed analysis. This behaviour you describe was fixed in version 0.4.11+ https://github.com/chroma-core/chroma/releases/tag/0.4.11 (PR). I tried to reproduce it with the latest Chroma version but couldn't.

The issue you describe was due to a lack of checks in the BF (bruteforce index).

For completeness, here is a diagram explaining how Chroma WAL (write-ahead log or, as you referred to it, embedding queue) works.

[Image: diagram of the Chroma WAL flow]

For each collection, Chroma maintains two binary indices: Bruteforce (in-memory, fast) and HNSW lib (persisted to disk, slow when adding new vectors and persisting). As you can imagine, the BF index serves as a buffer that holds the portion of the WAL not yet committed to the persisted HNSW index. The HNSW index itself has a max sequence id counter, stored in a metadata file, that indicates from which position in the WAL the buffering to the BF index should begin. This buffering usually happens when the collection is first accessed.

There are two transfer points (in the diagram, sync threshold) for BF to HNSW:

  • hnsw:batch_size - forces the BF vectors to be added to HNSW in-memory (this is a slow operation)
  • hnsw:sync_threshold - forces Chroma to dump the HNSW in-memory index to disk (this is a slow operation)

Both of the above sync points are controlled via collection-level metadata, using parameters with those names. It is customary to set hnsw:sync_threshold > hnsw:batch_size.
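
The two transfer points can be sketched with a toy model. This is an illustration of the flow as described in this comment, not Chroma's implementation: `ToyCollection` and its fields are hypothetical, plain dicts stand in for the real indices, and the exact comparison used to trigger each threshold (`>=` here) is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class ToyCollection:
    batch_size: int       # stands in for hnsw:batch_size
    sync_threshold: int   # stands in for hnsw:sync_threshold
    bf_buffer: dict = field(default_factory=dict)    # brute-force buffer
    hnsw_memory: dict = field(default_factory=dict)  # in-memory HNSW stand-in
    hnsw_disk: dict = field(default_factory=dict)    # persisted HNSW stand-in
    unsynced: int = 0     # vectors in memory that are not yet persisted

    def add(self, record_id, vector):
        self.bf_buffer[record_id] = vector
        if len(self.bf_buffer) >= self.batch_size:
            self._flush_buffer()

    def _flush_buffer(self):
        # First transfer point: move buffered vectors into the in-memory index.
        self.hnsw_memory.update(self.bf_buffer)
        self.unsynced += len(self.bf_buffer)
        self.bf_buffer.clear()
        if self.unsynced >= self.sync_threshold:
            self._persist()

    def _persist(self):
        # Second transfer point: dump the in-memory index to "disk".
        self.hnsw_disk = dict(self.hnsw_memory)
        self.unsynced = 0


col = ToyCollection(batch_size=2, sync_threshold=4)
for i in range(5):
    col.add(str(i), [float(i)])

buffered = len(col.bf_buffer)
in_memory = len(col.hnsw_memory)
on_disk = len(col.hnsw_disk)
```

With these toy settings, five adds leave one vector in the buffer and four in both the in-memory and persisted indices. Anything still in the buffer exists only in the log until the next flush, which is why the log has to be replayed when a collection is first accessed.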

@mcflem06

mcflem06 commented Apr 1, 2024

Can confirm that this issue still exists on 0.4.24. Is there any way to clear the embedding queue?

@tazarov
Contributor

tazarov commented Apr 24, 2024

@mcflem06, we've found a bug and are working to fix it ASAP.

@Bourhano

Still getting "Add of existing embedding ID: xx" when querying a collection for the first time on chromadb==0.5.5 and 0.5.3.

@rkrishnasanka

Yeah, what's up with this? What behavior makes this happen?

@Amphetaminewei

For completeness, here is a diagram explaining how Chroma WAL (write-ahead log or, as you referred to it, embedding queue) works.

@tazarov Hi. Following the flow in the diagram, does it mean that each vector is stored in two copies, one in the embedding_queue and one on disk via the persistent index? Are the vectors in the embedding_queue no longer needed once they have been added to the in-memory index?

@RashmiSutrave

Is there any workaround for this issue? We are facing it on Chroma version 0.5.6.
