add Clickhouse Bench #356

gb198871 · 2024-08-02T11:51:48Z

No description provided.

sre-ci-robot · 2024-08-02T11:51:53Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: gb198871
To complete the pull request process, please assign xuanyang-cn after the PR has been reviewed.
You can assign the PR to them by writing /assign @xuanyang-cn in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

xiaofan-luan · 2024-08-04T03:33:17Z

@gb198871

Could you also provide some benchmark numbers so we can verify it?

gb198871 · 2024-08-05T01:44:24Z

2024-08-02 09:12:44,538 | INFO: SpawnProcess-12:1073 search 30s: actual_dur=39.8413s, count=1, qps in this process: 0.0251 (mp_runner.py:76) (472525)
2024-08-02 09:12:44,539 | INFO: End search in concurrency 100: dur=39.89072586199836s, total_count=103, qps=2.5821 (mp_runner.py:120) (424695)
2024-08-02 09:12:46,551 | INFO: Performance case got result: Metric(max_load_count=0, load_duration=210.3753, qps=2.6495, serial_latency_p99=0.9687, recall=1.0, ndcg=1.0, conc_num_list=[1, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100], conc_qps_list=[1.3354, 2.5249, 2.4923, 2.4926, 2.4858, 2.4837, 2.4855, 2.4979, 2.4953, 2.4897, 2.4876, 2.4726, 2.5, 2.5421, 2.5615, 2.4912, 2.5801, 2.5204, 2.5725, 2.6495, 2.5821], conc_latency_p99_list=[0.7145220047979092, 1.6144527919925125, 3.734352893486223, 1.3177170618528613, 7.378083680513816, 1.4176923855114474, 10.432509155307004, 11.407468423815844, 14.189951241974393, 15.578517632178375, 16.799256178579796, 17.48631714056909, 15.56380933373975, 16.75706847510624, 13.737675615931861, 8.24177737453149, 6.403015353684063, 4.902551132985711, 6.845915539138432, 8.234877345580717, 10.033646960868595]) (task_runner.py:195) (424695)
2024-08-02 09:12:46,552 | INFO: [1/1] finish case: {'label': <CaseLabel.Performance: 2>, 'dataset': {'data': {'name': 'Cohere', 'size': 1000000, 'dim': 768, 'metric_type': <MetricType.COSINE: 'COSINE'>}}, 'db': 'Clickhouse'}, result=Metric(max_load_count=0, load_duration=210.3753, qps=2.6495, serial_latency_p99=0.9687, recall=1.0, ndcg=1.0, conc_num_list=[1, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100], conc_qps_list=[1.3354, 2.5249, 2.4923, 2.4926, 2.4858, 2.4837, 2.4855, 2.4979, 2.4953, 2.4897, 2.4876, 2.4726, 2.5, 2.5421, 2.5615, 2.4912, 2.5801, 2.5204, 2.5725, 2.6495, 2.5821], conc_latency_p99_list=[0.7145220047979092, 1.6144527919925125, 3.734352893486223, 1.3177170618528613, 7.378083680513816, 1.4176923855114474, 10.432509155307004, 11.407468423815844, 14.189951241974393, 15.578517632178375, 16.799256178579796, 17.48631714056909, 15.56380933373975, 16.75706847510624, 13.737675615931861, 8.24177737453149, 6.403015353684063, 4.902551132985711, 6.845915539138432, 8.234877345580717, 10.033646960868595]), label=ResultLabel.NORMAL (interface.py:166) (424695)
2024-08-02 09:12:46,552 | INFO |Task summary: run_id=4e97c, task_label=clickhouse-1m-2024080208 (models.py:337)
2024-08-02 09:12:46,552 | INFO |DB | db_label case label | load_dur qps latency(p99) recall max_load_count | label (models.py:337)
2024-08-02 09:12:46,552 | INFO |---------- | -------- ----------------- ------------------------ | ----------- ---------- --------------- ------------- -------------- | ----- (models.py:337)
2024-08-02 09:12:46,552 | INFO |Clickhouse | Performance768D1M clickhouse-1m-2024080208 | 210.3753 2.6495 0.9687 1.0 0 | :) (models.py:337)
2024-08-02 09:12:46,553 | WARNING: Replacing existing result with the same file_name: /opt/code/vectordb_bench/results/Clickhouse/result_20240802_clickhouse-1m-2024080208_clickhouse.json (models.py:191) (424695)
2024-08-02 09:12:46,553 | INFO: write results to disk /opt/code/vectordb_bench/results/Clickhouse/result_20240802_clickhouse-1m-2024080208_clickhouse.json (models.py:195) (424695)
2024-08-02 09:12:46,554 | INFO: Success to finish task: label=clickhouse-1m-2024080208, run_id=4e97c74f31414fd9b31c93510404e75f (interface.py:203) (424695)

@xiaofan-luan Is this the data you need?
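
For anyone checking the numbers, here is a small sketch that reads the two concurrency lists copied from the log above and finds where throughput peaks. It suggests that the headline `qps=2.6495` is the maximum of `conc_qps_list`; that is an inference from these values, not something the log states. The names `peak_qps` and `peak_conc` are only illustrative.

```python
# Values copied verbatim from the Metric(...) entry in the log above.
conc_num_list = [1, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50,
                 55, 60, 65, 70, 75, 80, 85, 90, 95, 100]
conc_qps_list = [1.3354, 2.5249, 2.4923, 2.4926, 2.4858, 2.4837, 2.4855,
                 2.4979, 2.4953, 2.4897, 2.4876, 2.4726, 2.5, 2.5421,
                 2.5615, 2.4912, 2.5801, 2.5204, 2.5725, 2.6495, 2.5821]

# Highest throughput across all concurrency levels and where it occurred.
peak_qps = max(conc_qps_list)
peak_conc = conc_num_list[conc_qps_list.index(peak_qps)]

# Throughput gain of the best run over the single-client run.
speedup_vs_serial = peak_qps / conc_qps_list[0]

print(f"peak qps={peak_qps} at concurrency={peak_conc}, "
      f"{speedup_vs_serial:.2f}x the single-client qps")
```

Throughput stays close to 2.5 QPS from concurrency 5 onward, so adding clients does not appear to increase throughput in this run.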

alwayslove2013 · 2024-08-05T02:29:10Z

install/requirements_py3.11.txt

@@ -22,3 +22,4 @@ environs
 pydantic<v2
 scikit-learn
 pymilvus
+clickhouse_connect


This dependency also needs to be added to pyproject.toml:

all = [ ..., "clickhouse_connect" ]
clickhouse = [ "clickhouse_connect" ]

That way users can run "pip install vectordb-bench[all]" or "pip install vectordb-bench[clickhouse]" to install the dependencies from PyPI.

alwayslove2013 · 2024-08-05T02:35:51Z

vectordb_bench/backend/clients/clickhouse/clickhouse.py

+        if filters:
+            gt = filters.get("id")
+            filterSql = f'SELECT id,cosineDistance(embedding,{query}) AS score FROM {self.db_config["dbname"]}.{self.table_name}  \
+                    WHERE id > {gt} ORDER BY score LIMIT {k};'
+            result = self.conn.query(filterSql).result_rows
+            return [int(row[0]) for row in result]
+        else:
+            selectSql = f'SELECT id,cosineDistance(embedding,{query}) AS score FROM {self.db_config["dbname"]}.{self.table_name}  \
+                    ORDER BY score LIMIT {k};'
+            result = self.conn.query(selectSql).result_rows


It is not recommended to hard-code the metric as cosine here. Although all the datasets used by VectorDBBench are cosine at the moment, we may support more datasets in the future, possibly using L2 or IP.
You can get the metric used for the current test case from self.case_config.
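
A minimal sketch of what this could look like, under several assumptions: the `MetricType` enum below is a stand-in for the one imported in config.py (only `COSINE` is visible in this PR; `L2` and `IP` are assumed members), `build_search_sql` is a hypothetical helper, and how the metric is read from `self.case_config` is left to the caller. `cosineDistance` is the function already used in the PR; `L2Distance` and `dotProduct` are, to my knowledge, the corresponding ClickHouse functions, but please verify them against the ClickHouse version under test.

```python
from enum import Enum
from typing import Optional, Sequence


class MetricType(str, Enum):
    """Stand-in for the benchmark's MetricType enum (L2 and IP are assumed)."""
    COSINE = "COSINE"
    L2 = "L2"
    IP = "IP"


# Distance expression and sort direction for each metric.
_METRIC_SQL = {
    MetricType.COSINE: ("cosineDistance", "ASC"),
    MetricType.L2: ("L2Distance", "ASC"),
    # A larger inner product means a closer match, so sort descending.
    MetricType.IP: ("dotProduct", "DESC"),
}


def build_search_sql(table: str, query: Sequence[float], k: int,
                     metric: MetricType, gt: Optional[int] = None) -> str:
    """Build a top-k search statement for the given metric (hypothetical helper)."""
    func, order = _METRIC_SQL[metric]
    # Optional "id > gt" filter, mirroring the filtered branch in the PR.
    where = f" WHERE id > {int(gt)}" if gt is not None else ""
    # Render the query vector as a ClickHouse array literal.
    vec = "[" + ",".join(repr(float(x)) for x in query) + "]"
    return (f"SELECT id, {func}(embedding, {vec}) AS score "
            f"FROM {table}{where} ORDER BY score {order} LIMIT {int(k)}")


if __name__ == "__main__":
    print(build_search_sql("default.items", [0.1, 0.2], 5, MetricType.L2))
    print(build_search_sql("default.items", [0.1, 0.2], 5, MetricType.IP, gt=10))
```

Keeping the function name and sort order in one table also lets the filtered and unfiltered branches share a single code path.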

alwayslove2013 · 2024-08-05T02:42:29Z

vectordb_bench/backend/clients/clickhouse/config.py

+from typing import TypedDict
+from pydantic import BaseModel, SecretStr
+from ..api import DBConfig, DBCaseConfig, MetricType, IndexType
+
+class ClickhouseConfig(DBConfig):
+    user_name: SecretStr = "default"
+    password: SecretStr
+    host: str = "127.0.0.1"
+    port: int = 30193
+    db_name: str = "default"
+
+    def to_dict(self) -> dict:
+        user_str = self.user_name.get_secret_value()
+        pwd_str = self.password.get_secret_value()
+        return {
+            "host": self.host,
+            "port": self.port,
+            "dbname": self.db_name,
+            "user": user_str,
+            "password": pwd_str
+        }


I did not find any code related to an ANN index in config.py. Since your test results show that both recall and ndcg are equal to 1.0, I'm curious whether ClickHouse only supports brute-force vector search.

alwayslove2013 · 2024-08-05T02:49:39Z

@gb198871 Thank you so much for your first PR contribution! I really appreciate you taking the time to work on this.

I've left some comments on the PR with a few suggestions. We look forward to collaborating with you and continuing to improve VectorDBBench's support for ClickHouse.

add Clickhouse Bench

b80f635

alwayslove2013 self-requested a review August 5, 2024 01:36
