add Clickhouse Bench #356

Open
wants to merge 1 commit into
base: main

@gb198871 gb198871 commented Aug 2, 2024

No description provided.

@sre-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: gb198871
To complete the pull request process, please assign xuanyang-cn after the PR has been reviewed.
You can assign the PR to them by writing /assign @xuanyang-cn in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@xiaofan-luan
Collaborator

@gb198871

could you also provide some numbers so we can verify it?

@alwayslove2013 alwayslove2013 self-requested a review August 5, 2024 01:36
@gb198871
Author

gb198871 commented Aug 5, 2024

2024-08-02 09:12:44,538 | INFO: SpawnProcess-12:1073 search 30s: actual_dur=39.8413s, count=1, qps in this process: 0.0251 (mp_runner.py:76) (472525)
2024-08-02 09:12:44,539 | INFO: End search in concurrency 100: dur=39.89072586199836s, total_count=103, qps=2.5821 (mp_runner.py:120) (424695)
2024-08-02 09:12:46,551 | INFO: Performance case got result: Metric(max_load_count=0, load_duration=210.3753, qps=2.6495, serial_latency_p99=0.9687, recall=1.0, ndcg=1.0, conc_num_list=[1, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100], conc_qps_list=[1.3354, 2.5249, 2.4923, 2.4926, 2.4858, 2.4837, 2.4855, 2.4979, 2.4953, 2.4897, 2.4876, 2.4726, 2.5, 2.5421, 2.5615, 2.4912, 2.5801, 2.5204, 2.5725, 2.6495, 2.5821], conc_latency_p99_list=[0.7145220047979092, 1.6144527919925125, 3.734352893486223, 1.3177170618528613, 7.378083680513816, 1.4176923855114474, 10.432509155307004, 11.407468423815844, 14.189951241974393, 15.578517632178375, 16.799256178579796, 17.48631714056909, 15.56380933373975, 16.75706847510624, 13.737675615931861, 8.24177737453149, 6.403015353684063, 4.902551132985711, 6.845915539138432, 8.234877345580717, 10.033646960868595]) (task_runner.py:195) (424695)
2024-08-02 09:12:46,552 | INFO: [1/1] finish case: {'label': <CaseLabel.Performance: 2>, 'dataset': {'data': {'name': 'Cohere', 'size': 1000000, 'dim': 768, 'metric_type': <MetricType.COSINE: 'COSINE'>}}, 'db': 'Clickhouse'}, result=Metric(max_load_count=0, load_duration=210.3753, qps=2.6495, serial_latency_p99=0.9687, recall=1.0, ndcg=1.0, conc_num_list=[1, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100], conc_qps_list=[1.3354, 2.5249, 2.4923, 2.4926, 2.4858, 2.4837, 2.4855, 2.4979, 2.4953, 2.4897, 2.4876, 2.4726, 2.5, 2.5421, 2.5615, 2.4912, 2.5801, 2.5204, 2.5725, 2.6495, 2.5821], conc_latency_p99_list=[0.7145220047979092, 1.6144527919925125, 3.734352893486223, 1.3177170618528613, 7.378083680513816, 1.4176923855114474, 10.432509155307004, 11.407468423815844, 14.189951241974393, 15.578517632178375, 16.799256178579796, 17.48631714056909, 15.56380933373975, 16.75706847510624, 13.737675615931861, 8.24177737453149, 6.403015353684063, 4.902551132985711, 6.845915539138432, 8.234877345580717, 10.033646960868595]), label=ResultLabel.NORMAL (interface.py:166) (424695)
2024-08-02 09:12:46,552 | INFO |Task summary: run_id=4e97c, task_label=clickhouse-1m-2024080208 (models.py:337)
2024-08-02 09:12:46,552 | INFO |DB | db_label case label | load_dur qps latency(p99) recall max_load_count | label (models.py:337)
2024-08-02 09:12:46,552 | INFO |---------- | -------- ----------------- ------------------------ | ----------- ---------- --------------- ------------- -------------- | ----- (models.py:337)
2024-08-02 09:12:46,552 | INFO |Clickhouse | Performance768D1M clickhouse-1m-2024080208 | 210.3753 2.6495 0.9687 1.0 0 | :) (models.py:337)
2024-08-02 09:12:46,553 | WARNING: Replacing existing result with the same file_name: /opt/code/vectordb_bench/results/Clickhouse/result_20240802_clickhouse-1m-2024080208_clickhouse.json (models.py:191) (424695)
2024-08-02 09:12:46,553 | INFO: write results to disk /opt/code/vectordb_bench/results/Clickhouse/result_20240802_clickhouse-1m-2024080208_clickhouse.json (models.py:195) (424695)
2024-08-02 09:12:46,554 | INFO: Success to finish task: label=clickhouse-1m-2024080208, run_id=4e97c74f31414fd9b31c93510404e75f (interface.py:203) (424695)

@xiaofan-luan Is this the data you need?
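
For anyone who wants to cross-check the pasted log, here is a minimal Python sketch. The lists and the concurrency-100 totals are copied from the output above; the helper name `peak_qps` is hypothetical and not part of VectorDBBench.

```python
# Values copied from the Metric(...) line in the pasted log.
conc_num_list = [1, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50,
                 55, 60, 65, 70, 75, 80, 85, 90, 95, 100]
conc_qps_list = [1.3354, 2.5249, 2.4923, 2.4926, 2.4858, 2.4837, 2.4855,
                 2.4979, 2.4953, 2.4897, 2.4876, 2.4726, 2.5, 2.5421,
                 2.5615, 2.4912, 2.5801, 2.5204, 2.5725, 2.6495, 2.5821]

# Values copied from the "End search in concurrency 100" line.
total_count_100 = 103
dur_100 = 39.89072586199836


def peak_qps(conc_nums, qps_values):
    """Return (concurrency, qps) for the highest QPS in the sweep (hypothetical helper)."""
    best = max(range(len(qps_values)), key=qps_values.__getitem__)
    return conc_nums[best], qps_values[best]


peak_conc, peak = peak_qps(conc_num_list, conc_qps_list)
# Per-concurrency QPS recomputed as completed queries divided by wall-clock duration.
qps_100 = round(total_count_100 / dur_100, 4)

print(f"peak qps {peak} at concurrency {peak_conc}")
print(f"recomputed qps at concurrency 100: {qps_100}")
```

On these numbers the peak is 2.6495 at concurrency 95, which matches the reported `qps`, and QPS stays near 2.5 from concurrency 5 upward.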

@@ -22,3 +22,4 @@ environs
pydantic<v2
scikit-learn
pymilvus
clickhouse_connect
Collaborator

This dependency also needs to be added to pyproject.toml:

all = [
    ...,
    "clickhouse_connect"
]
clickhouse = [ "clickhouse_connect" ]

That way, users can run "pip install vectordb-bench[all]" or "pip install vectordb-bench[clickhouse]" to install the dependencies from PyPI.

Comment on lines +130 to +139
if filters:
    gt = filters.get("id")
    filterSql = f'SELECT id,cosineDistance(embedding,{query}) AS score FROM {self.db_config["dbname"]}.{self.table_name} \
        WHERE id > {gt} ORDER BY score LIMIT {k};'
    result = self.conn.query(filterSql).result_rows
    return [int(row[0]) for row in result]
else:
    selectSql = f'SELECT id,cosineDistance(embedding,{query}) AS score FROM {self.db_config["dbname"]}.{self.table_name} \
        ORDER BY score LIMIT {k};'
    result = self.conn.query(selectSql).result_rows
Collaborator

It is not recommended to hard-code the metric to cosine here. Although all the datasets used by VectorDBBench are cosine at the moment, we may support more datasets in the future, possibly using L2 or IP.
You can get the metric used for the current test case from self.case_config.
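
One possible shape for this, as a rough sketch only: pick the ClickHouse distance function and sort order from the metric instead of hard-coding `cosineDistance`. The `MetricType` enum below is a local stand-in for the one imported from `..api`, and `DISTANCE_SQL` and `build_search_sql` are hypothetical names. `cosineDistance` and `L2Distance` are ClickHouse functions. Using `dotProduct` with descending order for IP is an assumption that should be checked against the ClickHouse version in use.

```python
from enum import Enum
from typing import List, Optional


class MetricType(str, Enum):
    """Local stand-in for the benchmark's MetricType (assumption for this sketch)."""
    COSINE = "COSINE"
    L2 = "L2"
    IP = "IP"


# Metric -> (ClickHouse function, sort order). Distances sort ascending;
# inner product is a similarity, so larger is better (assumed: dotProduct).
DISTANCE_SQL = {
    MetricType.COSINE: ("cosineDistance", "ASC"),
    MetricType.L2: ("L2Distance", "ASC"),
    MetricType.IP: ("dotProduct", "DESC"),
}


def build_search_sql(table: str, metric: MetricType, query: List[float],
                     k: int, gt: Optional[int] = None) -> str:
    """Build the top-k search statement for the given metric (hypothetical helper)."""
    func, order = DISTANCE_SQL[metric]
    # Optional "id > gt" filter, mirroring the filtered branch in the PR.
    where = f" WHERE id > {int(gt)}" if gt is not None else ""
    return (
        f"SELECT id, {func}(embedding, {query}) AS score "
        f"FROM {table}{where} ORDER BY score {order} LIMIT {int(k)}"
    )


sql = build_search_sql("default.items", MetricType.L2, [0.1, 0.2], 5)
print(sql)
```

A single builder also removes the duplicated SELECT between the filtered and unfiltered branches. How the metric is read from `self.case_config` is left open here, since that attribute is not shown in this diff.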

Comment on lines +1 to +21
from typing import TypedDict
from pydantic import BaseModel, SecretStr
from ..api import DBConfig, DBCaseConfig, MetricType, IndexType

class ClickhouseConfig(DBConfig):
    user_name: SecretStr = "default"
    password: SecretStr
    host: str = "127.0.0.1"
    port: int = 30193
    db_name: str = "default"

    def to_dict(self) -> dict:
        user_str = self.user_name.get_secret_value()
        pwd_str = self.password.get_secret_value()
        return {
            "host": self.host,
            "port": self.port,
            "dbname": self.db_name,
            "user": user_str,
            "password": pwd_str
        }
Collaborator

I did not find any code related to an ANN index in config.py. Since your test results show that both recall and ndcg are equal to 1.0, I'm curious whether ClickHouse only supports brute-force vector search.
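
To illustrate why recall of exactly 1.0 suggests an exhaustive scan, here is a small self-contained sketch. It uses a generic definition of recall@k and is not VectorDBBench's own implementation. All function names are hypothetical. A brute-force search ranks every row by exact distance, so its top-k equals the ground truth by construction, whereas an ANN index can miss true neighbours.

```python
import math
from typing import List, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity, the quantity the search orders by."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def brute_force_top_k(vectors: List[Sequence[float]], query: Sequence[float], k: int) -> List[int]:
    """Exact search: score every vector and keep the k closest ids."""
    ranked = sorted(range(len(vectors)), key=lambda i: cosine_distance(vectors[i], query))
    return ranked[:k]


def recall_at_k(result_ids: List[int], ground_truth_ids: List[int], k: int) -> float:
    """Fraction of the true top-k ids that appear in the returned top-k."""
    return len(set(result_ids[:k]) & set(ground_truth_ids[:k])) / k


vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
query = [1.0, 0.1]

exact = brute_force_top_k(vectors, query, k=2)
exact_recall = recall_at_k(exact, exact, k=2)    # exact scan versus itself
approx_recall = recall_at_k([0, 1], exact, k=2)  # a result that missed one neighbour
print(exact, exact_recall, approx_recall)
```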

@alwayslove2013
Copy link
Collaborator

@gb198871 Thank you so much for your first PR contribution! I really appreciate you taking the time to work on this.

I've left some comments on the PR with a few suggestions. We look forward to collaborating with you and continuing to improve VectorDBBench's support for ClickHouse.
