
feat: support bm25 milvus function #33

Merged · 4 commits · Jan 10, 2025
Conversation

zc277584121
Collaborator

@zc277584121 zc277584121 commented Jan 3, 2025

This PR introduces some major refactors:

  • Introduce the abstract class BaseMilvusBuiltInFunction, a light wrapper around the Milvus Function.
  • Introduce Bm25BuiltInFunction, extended from BaseMilvusBuiltInFunction, which includes the Milvus FunctionType.BM25 settings and the Milvus analyzer configs. We can use Bm25BuiltInFunction to implement full-text search in Milvus.
  • In the future, Milvus will support more built-in Functions with text-in (instead of vector-in) abilities, so there is no need to convert text to embeddings on the user's end; the server does this automatically (here is a FunctionType.TEXTEMBEDDING example). So in the future we can implement more subclasses of BaseMilvusBuiltInFunction to support the text-in functions in Milvus.
  • The how-to-use introduction is on the way, and there are some use case examples in the unit test test_builtin_bm25_function(). Put simply, we can pass any customized LangChain embedding functions or Milvus built-in functions into the Milvus class initializer to build multiple index fields in Milvus.
    Some use case examples:
from langchain_milvus import Milvus, BM25BuiltInFunction
from langchain_openai import OpenAIEmbeddings

embedding = OpenAIEmbeddings()

vectorstore = Milvus.from_documents(
    documents=docs,
    embedding=embedding,
    builtin_function=BM25BuiltInFunction(
        output_field_names="sparse"
    ),
    #"dense" field is used for similarity search for OpenAI dense embedding, "sparse" field is used for BM25 full-text search
    vector_field=["dense", "sparse"],
    connection_args={
        "uri": URI,
    },
    drop_old=True,
)

Or with multiple embedding fields and the BM25 function:

from langchain_milvus import Milvus, BM25BuiltInFunction
from langchain_openai import OpenAIEmbeddings
from langchain_voyageai import VoyageAIEmbeddings

embedding = OpenAIEmbeddings()
embedding2 = VoyageAIEmbeddings(model="voyage-3")

vectorstore = Milvus.from_documents(
    documents=docs,
    embedding=[embedding, embedding2],
    builtin_function=BM25BuiltInFunction(
        input_field_names="text",
        output_field_names="sparse"
    ),
    text_field="text",
    vector_field=["dense", "dense2", "sparse"],
    connection_args={
        "uri": URI,
    },
    drop_old=True,
)
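For readers unfamiliar with BM25: the sparse field above is scored server-side by Milvus, but the idea can be illustrated with a small self-contained sketch of the Okapi BM25 formula. Everything below (the bm25_scores function, the toy documents, and the parameters k1=1.5, b=0.75) is illustrative only, not Milvus's actual implementation or defaults:

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of each tokenized doc against query_terms."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    # document frequency of each query term across the corpus
    df = {t: sum(1 for d in docs if t in d) for t in query_terms}
    scores = []
    for d in docs:
        tf = Counter(d)  # term frequency within this doc
        score = 0.0
        for t in query_terms:
            if df[t] == 0:
                continue  # term appears nowhere; contributes nothing
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            score += idf * tf[t] * (k1 + 1) / (
                tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            )
        scores.append(score)
    return scores

docs = [
    ["milvus", "vector", "search"],
    ["full", "text", "search", "with", "bm25"],
    ["dense", "embedding", "models"],
]
print(bm25_scores(["bm25", "search"], docs))
```

Documents matching more query terms score higher, with rarer terms weighted up via IDF; the builtin_function in the examples above delegates this kind of sparse lexical scoring to the Milvus server.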

@ohadeytan

@zc277584121 running the test_builtin_bm25_function I get:

    def check_status(status: Status):
        if status.code != 0 or status.error_code != 0:
>           raise MilvusException(status.code, status.reason, status.error_code)
E           pymilvus.exceptions.MilvusException: <MilvusException: (code=65535, message=invalid index type: AUTOINDEX, local mode only support SPARSE_INVERTED_INDEX SPARSE_WAND: )>

Is this expected, or is something wrong with my settings?

@zc277584121
Collaborator Author

@ohadeytan The full-text search feature is not yet supported in Milvus Lite. The error says local mode only support SPARSE_INVERTED_INDEX SPARSE_WAND because full-text search uses the BM25 index type. To run this unit test successfully, we currently have to use the Milvus Docker standalone service. Thanks for your feedback; I think I have to leave a notice in the unit test and further documents.

@zc277584121 zc277584121 requested a review from efriis January 9, 2025 09:36
@zc277584121
Collaborator Author

Here is the document, which is awaiting final review: https://github.com/zc277584121/bootcamp/blob/langchain_doc/bootcamp/tutorials/integration/langchain/full_text_search_with_langchain.ipynb
I'll merge this PR, and I think the new package version will be released next week.
FYI, @ohadeytan

@zc277584121 zc277584121 merged commit 1c13e43 into langchain-ai:main Jan 10, 2025
8 checks passed
@janaki-sasidhar

What if I want to use separate search prompts for keyword and semantic search? This hybrid retriever wrapper isn't flexible enough for that, I think.

@zc277584121
Collaborator Author

@janaki-sasidhar Do you mean this kind of case:

search_prompt1->[vector of prompt1]-> semantic search-> result docs from 1,
search_prompt2-> keyword search -> result docs from 2,
[result docs from1 + result docs from2] -> rerank -> final result docs

If so, can you explain what its scenario is? Any information will be appreciated.
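In the meantime, one workaround for separate prompts is to run the two searches independently and fuse the ranked results on the client side, e.g. with reciprocal rank fusion. A minimal sketch — rrf_fuse and the doc ids below are illustrative, not part of langchain_milvus:

```python
def rrf_fuse(ranked_lists, k=60):
    """Reciprocal rank fusion: merge several ranked lists of doc ids.

    Each doc earns 1 / (k + rank + 1) per list it appears in;
    docs ranked high in multiple lists float to the top.
    """
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

semantic = ["d3", "d1", "d2"]  # e.g. results for search_prompt1 (dense search)
keyword = ["d2", "d3", "d5"]   # e.g. results for search_prompt2 (BM25 search)
print(rrf_fuse([semantic, keyword]))
```

Docs appearing near the top of both lists (here d3 and d2) outrank docs found by only one retriever, which matches the rerank step in the flow sketched above.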
