feat: Add Preliminary Vector Indexing Support #11318

Closed
wants to merge 67 commits into from


@SkyFan2002 SkyFan2002 commented May 5, 2023

I hereby agree to the terms of the CLA available at: https://databend.rs/dev/policies/cla/

Summary

This PR introduces an IVF index. Currently, cosine_distance is the only supported similarity measure.
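For reference, a minimal sketch of the measure, assuming cosine_distance follows the usual definition of one minus the cosine similarity. The function below is illustrative only and is not taken from this PR's code.

```rust
/// Cosine distance between two equal-length vectors: 1 - cos(theta).
/// Assumption: this mirrors the common definition, not necessarily the
/// exact implementation in this PR.
/// Returns None when the lengths differ, the input is empty,
/// or either vector has zero norm.
fn cosine_distance(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(1.0 - dot / (norm_a * norm_b))
}
```

Identical directions give a distance near 0 and orthogonal vectors give a distance near 1, so a smaller value means a closer match.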

Syntax

The syntax design is similar to pgvector.

create index:

CREATE INDEX on table_name USING IVFFLAT (column_name COSINE) with (nlist=1);

drop index:

drop vector index on t.c;

set search param:

set table_name.column_name.cosine.nprobe = 70;

A column and a similarity metric together uniquely identify an index. Therefore, when setting a search parameter, you must specify both the column name and the similarity metric.
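To illustrate how a dotted setting name can carry the index identity, here is a small sketch. The IndexKey type and parse_search_param function are hypothetical names introduced for this example and are not part of the PR; the sketch only assumes the four-segment shape table.column.metric.parameter shown above.

```rust
/// Hypothetical key: the table, column and metric that identify one index.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
struct IndexKey {
    table: String,
    column: String,
    metric: String,
}

/// Splits a setting name of the form `table.column.metric.param`
/// into the index key and the parameter name.
/// Returns None if there are not exactly four non-empty segments.
fn parse_search_param(name: &str) -> Option<(IndexKey, String)> {
    let parts: Vec<&str> = name.split('.').collect();
    match parts.as_slice() {
        [table, column, metric, param] if parts.iter().all(|p| !p.is_empty()) => Some((
            IndexKey {
                table: table.to_string(),
                column: column.to_string(),
                metric: metric.to_string(),
            },
            param.to_string(),
        )),
        _ => None,
    }
}
```

Because the key derives Hash and Eq, it could serve as a map key for per-index search settings.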

Implementation

Build an index: an index is built for each block of the table and stored after compression.
ANN query: queries of the form order by cosine_distance(column_name, target) limit n are rewritten in the RBO (rule-based optimizer) stage, and the execution stage is divided into the following steps:

  1. Read the index and use the index to find the ANN of each block
  2. Merge the ANN of multiple blocks to get the overall ANN
  3. Use the results of the previous step to read other columns of select
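Steps 2 and 3 above can be sketched as follows. This is an illustrative outline under assumptions, not the PR's code: the Candidate struct and both function names are hypothetical, and step 1 (the per-block index search) is represented only by its output.

```rust
use std::collections::BTreeMap;

/// One candidate returned by a per-block index search (hypothetical layout).
#[derive(Debug, Clone, PartialEq)]
struct Candidate {
    block_id: usize,
    row_in_block: usize,
    distance: f32,
}

/// Step 2: merge the per-block nearest neighbours and keep the `limit`
/// closest candidates overall.
fn merge_block_results(per_block: Vec<Vec<Candidate>>, limit: usize) -> Vec<Candidate> {
    let mut all: Vec<Candidate> = per_block.into_iter().flatten().collect();
    // total_cmp is a total order on f32, so NaN values cannot break the sort.
    // Ties are broken by position to keep the result deterministic.
    all.sort_by(|a, b| {
        a.distance
            .total_cmp(&b.distance)
            .then(a.block_id.cmp(&b.block_id))
            .then(a.row_in_block.cmp(&b.row_in_block))
    });
    all.truncate(limit);
    all
}

/// Step 3: group the winning rows by block so that the remaining selected
/// columns are read only from blocks that contributed a result.
fn rows_to_read(merged: &[Candidate]) -> BTreeMap<usize, Vec<usize>> {
    let mut plan: BTreeMap<usize, Vec<usize>> = BTreeMap::new();
    for c in merged {
        plan.entry(c.block_id).or_default().push(c.row_in_block);
    }
    plan
}
```

Each block only needs to return its own top n candidates, because no row outside a block's top n can appear in the overall top n.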

Benchmark

run benchmark:

make vi-run-release
cd benchmark/vector_index_benchmark/
cargo run --release

The benchmark results are written to /benchmark/vector_index_benchmark/result.csv.

Next step

  1. The execution stage is currently single-node and single-threaded; it needs to be enhanced for scalability
  2. Support other types of vector index and other similarity measures
  3. Support filters in ANN queries
  4. Provide a fixed-length array type

Closes #11054 #9699


@mergify mergify bot added the pr-feature this PR introduces a new feature to the codebase label May 5, 2023
# Conflicts:
#	Cargo.lock
#	src/query/ast/src/ast/statements/mod.rs
#	src/query/service/src/interpreters/mod.rs
#	src/query/sql/src/executor/physical_plan_builder.rs
#	src/query/sql/src/planner/binder/table.rs
#	src/query/sql/src/planner/optimizer/heuristic/decorrelate.rs
#	src/query/sql/src/planner/optimizer/rule/factory.rs
#	src/query/sql/src/planner/optimizer/rule/rewrite/mod.rs
#	src/query/sql/src/planner/optimizer/rule/rule.rs
#	src/query/sql/src/planner/plans/ddl/index.rs
#	src/query/sql/src/planner/plans/plan.rs
#	src/query/sql/src/planner/plans/scan.rs
@SkyFan2002 SkyFan2002 marked this pull request as ready for review May 30, 2023 09:57
@SkyFan2002 SkyFan2002 requested a review from drmingdrmer as a code owner June 6, 2023 12:50
@SkyFan2002 SkyFan2002 marked this pull request as draft June 6, 2023 12:54
# Conflicts:
#	src/meta/proto-conv/src/lib.rs
#	src/query/ast/src/parser/statement.rs
#	src/query/service/Cargo.toml
#	src/query/sql/Cargo.toml
#	src/query/sql/src/planner/plans/plan.rs
#	src/query/storages/common/table-meta/src/meta/mod.rs
#	src/query/storages/fuse/src/operations/fuse_source.rs
#	src/query/storages/fuse/src/operations/read_data.rs
@SkyFan2002 SkyFan2002 changed the title feat: add vector index Feat: Add Preliminary Vector Indexing Support Jun 7, 2023
@SkyFan2002 SkyFan2002 changed the title Feat: Add Preliminary Vector Indexing Support feat: Add Preliminary Vector Indexing Support Jun 8, 2023
@SkyFan2002 SkyFan2002 marked this pull request as ready for review June 8, 2023 10:13

@drmingdrmer drmingdrmer left a comment

Refer to src/meta/README.md:

How to add new meta data types to store in meta-service

Databend meta-service stores raw bytes and does not understand what the bytes are.

Databend-query uses Rust types in its runtime; these types, such as TableMeta,
must be serialized to be stored in meta-service.

The serialization is implemented with protobuf, and a protobuf message provides
backward compatibility, i.e., a newer version (version-B) protobuf message can be deserialized
from an older version (version-A) of serialized bytes, and a version-B protobuf
message can be converted to version-B Rust types.

  • Rust types are defined in src/meta/app/src/, such as TableMeta that is
    defined in src/meta/app/src/schema/table.rs.

  • The corresponding protobuf message is defined in src/meta/protos/proto/,
    such as src/meta/protos/proto/table.proto.

  • The conversion between protobuf message and rust type is defined in
    src/meta/proto-conv/, such as
    src/meta/proto-conv/src/table_from_to_protobuf_impl.rs,
    by implementing a FromToProto trait.

To add a new feature (add a new type or update a type), the developer should:

  • Add the Rust types in a mod in src/meta/app/src/;

  • Add a new version in src/meta/proto-conv/src/util.rs. The versions track
    change history and will be checked when converting protobuf message to rust
    types:

    const META_CHANGE_LOG: &[(u64, &str)] = &[
        //
        ( 1, "----------: Initial", ),
        ( 2, "2022-07-13: Add: share.proto", ),
        ( 3, "2022-07-29: Add: user.proto/UserOption::default_role", ),
        ...
        (37, "2023-05-05: Add: index.proto", ),
        (38, "2023-05-19: Rename: table.proto/TableCopiedFileLock to EmptyProto", ),
        (39, "2023-05-22: Add: data_mask.proto", ),
    ];

    Note: only add new versions at the bottom, and only remove old versions
    from the top.

  • Add the conversion implementation to src/meta/proto-conv/src/, refer to
    other files in this crate.

  • Add a compatibility test to ensure that compatibility will always be kept in
    the future; a good example is src/meta/proto-conv/tests/it/v039_data_mask.rs
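As a rough illustration of why the change log is append-only, the sketch below checks a message version against the range covered by the log. It is a simplified assumption about how such a check could work, not the actual code in src/meta/proto-conv/src/util.rs; the function names are hypothetical, and the log is a three-entry excerpt copied from the list quoted above.

```rust
/// Excerpt of the change log quoted above; the real list is longer.
const META_CHANGE_LOG: &[(u64, &str)] = &[
    (37, "2023-05-05: Add: index.proto"),
    (38, "2023-05-19: Rename: table.proto/TableCopiedFileLock to EmptyProto"),
    (39, "2023-05-22: Add: data_mask.proto"),
];

/// Versions must strictly increase, which holds as long as new entries are
/// only added at the bottom and old entries are only removed from the top.
fn is_strictly_increasing(log: &[(u64, &str)]) -> bool {
    log.windows(2).all(|w| w[0].0 < w[1].0)
}

/// The newest version this build understands (0 for an empty log).
fn newest_version(log: &[(u64, &str)]) -> u64 {
    log.last().map(|(v, _)| *v).unwrap_or(0)
}

/// Hypothetical check: accept a serialized message only if its version
/// lies between the oldest and newest entries still present in the log.
fn check_readable(log: &[(u64, &str)], msg_ver: u64) -> Result<(), String> {
    let oldest = log.first().map(|(v, _)| *v).unwrap_or(0);
    let newest = newest_version(log);
    if msg_ver < oldest {
        return Err(format!("message version {msg_ver} is older than {oldest}"));
    }
    if msg_ver > newest {
        return Err(format!("message version {msg_ver} is newer than {newest}"));
    }
    Ok(())
}
```

Under this simplified model, removing an entry from the top drops support for reading that version, while adding one at the bottom extends the readable range.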

Labels
pr-feature this PR introduces a new feature to the codebase
Development

Successfully merging this pull request may close these issues.

feat: ANN indexing for vector
3 participants