[Question]: Why does /api/v1/chats/{chat_id}/completions respond slowly to concurrent requests? #5183

Open
xyk0930 opened this issue Feb 20, 2025 · 3 comments
Labels: question (Further information is requested)

xyk0930 commented Feb 20, 2025

Describe your problem

  1. With a single request, the response time is about 50 s.
  2. With 10 concurrent requests, the last response takes about 3 min 40 s.
  3. Is this caused by the RAGFlow service itself, or because the LLM backend does not handle concurrent requests well? (A reproduction sketch follows this list.)
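One way to reproduce these timings is a small concurrent load script. The sketch below is illustrative only: BASE_URL, API_KEY, and CHAT_ID are placeholders for a specific deployment, and the {"question": ..., "stream": false} request body follows the RAGFlow HTTP API as I understand it, so verify the exact fields against the docs for your version.

```python
# Concurrency benchmark sketch for the completions endpoint.
# Assumptions (not from this issue): BASE_URL, API_KEY, and CHAT_ID are
# placeholders; the request body fields follow the RAGFlow HTTP API docs.
import time
from concurrent.futures import ThreadPoolExecutor

import requests

BASE_URL = "http://localhost:9380"   # hypothetical RAGFlow address
API_KEY = "ragflow-xxxxxxxx"         # hypothetical API key
CHAT_ID = "your-chat-id"             # hypothetical chat id
CONCURRENCY = 10

def one_request(i: int) -> float:
    """Send one completion request and return its wall-clock latency in seconds."""
    start = time.perf_counter()
    resp = requests.post(
        f"{BASE_URL}/api/v1/chats/{CHAT_ID}/completions",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"question": f"test question {i}", "stream": False},
        timeout=600,
    )
    resp.raise_for_status()
    return time.perf_counter() - start

with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    latencies = list(pool.map(one_request, range(CONCURRENCY)))

print("per-request latency (s):", [round(t, 1) for t in latencies])
print("slowest (s):", round(max(latencies), 1))
```

If the slowest latency grows roughly linearly with the number of concurrent requests, the generations are effectively being processed one after another somewhere downstream.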
KevinHuSh (Collaborator) commented

You can click the little lamp icon in the UI to check where the elapsed time is being spent.

xyk0930 (Author) commented Feb 21, 2025

I checked. Most of the time is spent generating the answer, so it is definitely an LLM problem.
I used Ollama to run the deepseek-r1:70b model on 8× RTX 4090 (24 GB) GPUs, and the utilization of each GPU stays below 20%. I searched the Ollama community and saw people reporting similar problems, but there seems to be no good solution. Do you have any ideas on how to increase utilization with multiple GPUs? @KevinHuSh
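One way to tell whether the queueing happens in Ollama rather than in RAGFlow is to send the same 10 concurrent generations directly to Ollama and compare latencies. Below is a minimal sketch, assuming Ollama's default port 11434 and its /api/generate endpoint with the model mentioned above; if these requests also serialize, the limit is in the backend. (Ollama's request parallelism and multi-GPU placement are reportedly influenced by settings such as OLLAMA_NUM_PARALLEL and OLLAMA_SCHED_SPREAD, but treat those as pointers to check in the Ollama docs, not a confirmed fix.)

```python
# Sketch: hit Ollama directly (bypassing RAGFlow) to see whether the backend
# itself serializes concurrent generations. Assumes Ollama's default address
# http://localhost:11434 and the deepseek-r1:70b model from this thread.
import time
from concurrent.futures import ThreadPoolExecutor

import requests

def generate(i: int) -> float:
    """Send one non-streaming generation and return its latency in seconds."""
    start = time.perf_counter()
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "deepseek-r1:70b",
            "prompt": f"Briefly explain test case {i}.",
            "stream": False,
        },
        timeout=600,
    )
    resp.raise_for_status()
    return time.perf_counter() - start

with ThreadPoolExecutor(max_workers=10) as pool:
    print([round(t, 1) for t in pool.map(generate, range(10))])
```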

KevinHuSh (Collaborator) commented

No clue yet.
