[Question]: Why does /api/v1/chats/{chat_id}/completions respond slowly to concurrent requests? #5183

Open
xyk0930 opened this issue Feb 20, 2025 · 3 comments
Labels: question (Further information is requested)

xyk0930 commented Feb 20, 2025

Describe your problem

  1. With a single request, the response time is about 50 s.
  2. With 10 concurrent requests, the last response takes about 3 min 40 s.
  3. Is this caused by the RAGFlow service itself, or because the LLM backend does not handle concurrent requests well? (A reproduction sketch follows this list.)
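One way to reproduce these timings is a small concurrent load script. The sketch below is illustrative only: BASE_URL, API_KEY, and CHAT_ID are placeholders for a specific deployment, and the {"question": ..., "stream": false} request body follows the RAGFlow HTTP API as I understand it, so verify the exact fields against the docs for your version.

```python
# Concurrency benchmark sketch for the completions endpoint.
# Assumptions (not from this issue): BASE_URL, API_KEY, and CHAT_ID are
# placeholders; the request body fields follow the RAGFlow HTTP API docs.
import time
from concurrent.futures import ThreadPoolExecutor

import requests

BASE_URL = "http://localhost:9380"   # hypothetical RAGFlow address
API_KEY = "ragflow-xxxxxxxx"         # hypothetical API key
CHAT_ID = "your-chat-id"             # hypothetical chat id
CONCURRENCY = 10

def one_request(i: int) -> float:
    """Send one completion request and return its wall-clock latency in seconds."""
    start = time.perf_counter()
    resp = requests.post(
        f"{BASE_URL}/api/v1/chats/{CHAT_ID}/completions",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"question": f"test question {i}", "stream": False},
        timeout=600,
    )
    resp.raise_for_status()
    return time.perf_counter() - start

with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    latencies = list(pool.map(one_request, range(CONCURRENCY)))

print("per-request latency (s):", [round(t, 1) for t in latencies])
print("slowest (s):", round(max(latencies), 1))
```

If the slowest latency grows roughly linearly with the number of concurrent requests, the generations are effectively being processed one after another somewhere downstream.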
KevinHuSh (Collaborator) commented

You can click the little lamp icon in the UI to check where the elapsed time is being spent.

xyk0930 (Author) commented Feb 21, 2025

I checked. Most of the time is spent generating the answer, so it is definitely an LLM problem.
I used Ollama to run the deepseek-r1:70b model on 8× RTX 4090 (24 GB) GPUs, and the utilization of each GPU stays below 20%. I searched the Ollama community and saw people reporting similar problems, but there seems to be no good solution. Do you have any ideas on how to increase utilization with multiple GPUs? @KevinHuSh
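One way to tell whether the queueing happens in Ollama rather than in RAGFlow is to send the same 10 concurrent generations directly to Ollama and compare latencies. Below is a minimal sketch, assuming Ollama's default port 11434 and its /api/generate endpoint with the model mentioned above; if these requests also serialize, the limit is in the backend. (Ollama's request parallelism and multi-GPU placement are reportedly influenced by settings such as OLLAMA_NUM_PARALLEL and OLLAMA_SCHED_SPREAD, but treat those as pointers to check in the Ollama docs, not a confirmed fix.)

```python
# Sketch: hit Ollama directly (bypassing RAGFlow) to see whether the backend
# itself serializes concurrent generations. Assumes Ollama's default address
# http://localhost:11434 and the deepseek-r1:70b model from this thread.
import time
from concurrent.futures import ThreadPoolExecutor

import requests

def generate(i: int) -> float:
    """Send one non-streaming generation and return its latency in seconds."""
    start = time.perf_counter()
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "deepseek-r1:70b",
            "prompt": f"Briefly explain test case {i}.",
            "stream": False,
        },
        timeout=600,
    )
    resp.raise_for_status()
    return time.perf_counter() - start

with ThreadPoolExecutor(max_workers=10) as pool:
    print([round(t, 1) for t in pool.map(generate, range(10))])
```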

KevinHuSh (Collaborator) commented

No clue yet.
