Deadlock when creating many processes #1397
Comments
The minimum amount of work for the alternative, which is less accurate but more feasible, is to simply get the number of workers, multiply it by the prefetch count global constant, and then compare the result with the number of active processes. Actually, just run this check only when calling
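A minimal sketch of that check, assuming hypothetical helpers for the worker and process counts (the names below are illustrative placeholders, not part of aiida-core):

```python
# Hypothetical sketch of the proposed "cheap" check; the helper callables
# passed in are placeholders, not the aiida-core API.

PREFETCH_COUNT = 200  # illustrative per-worker slot/prefetch limit


def warn_if_oversubscribed(get_number_of_workers, count_active_processes):
    """Warn when the number of active processes exceeds the available slots."""
    workers = get_number_of_workers()
    slots = workers * PREFETCH_COUNT
    active = count_active_processes()
    if active > slots:
        print(
            f'Warning: {active} active processes, but only {slots} slots '
            f'({workers} workers x {PREFETCH_COUNT} each) are available.'
        )
```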
I pushed an implementation of this warning fix to a branch in my fork. What we learned during this process is that the query to generate the table is itself quite slow. A second problem is that getting a CircusClient in order to obtain the number of workers is also quite slow (about 1 second). A final problem is what to tell the user to do when the warning fires. Currently it suggests that they should increase the number of workers. However, we are not sure what the upper limit for processes should be and what is feasible under a typical OS and system setup. In this sense, perhaps giving this advice blindly is a bad idea. We need to think about whether printing a warning is the right course of action for now, and make plans to fix the cause of the problem.
Thanks @ConradJohnston for the improvement! @sphuber Should we really close this issue though, or was that just done automatically/accidentally because the PR mentions this issue? To add to my initial suggestions, I think one could also implement some logic where a process gets dropped (and added to the end of the queue) when it has been waiting for a specific number of "worker cycles" or a specific amount of time. Maybe my tasks were a bit excessive in the number of processes created, but I've found launching a large number of workers to be problematic in terms of CPU load -- for workers which end up not doing much.
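A rough sketch of such a rotation policy, as a toy model rather than the actual worker internals: each waiting process records when it started waiting and is pushed to the back of the queue once it exceeds a threshold.

```python
# Illustrative sketch of requeueing a process after it has waited too long.
# This is a toy model of a worker's slot queue, not aiida-core code.
import time
from collections import deque

MAX_WAIT_SECONDS = 300  # illustrative threshold before a process is rotated out


def rotate_stale_processes(queue: deque) -> None:
    """Move entries that have waited longer than MAX_WAIT_SECONDS to the back.

    Each queue entry is a (process_id, waiting_since_timestamp) tuple.
    """
    now = time.monotonic()
    stale = []
    for _ in range(len(queue)):
        process_id, waiting_since = queue.popleft()
        if now - waiting_since > MAX_WAIT_SECONDS:
            stale.append((process_id, now))  # reset the wait clock
        else:
            queue.append((process_id, waiting_since))
    queue.extend(stale)  # stale processes go to the end of the queue
```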
Yeah, it was closed accidentally. The heavy CPU load, was that on the alphas of
Yeah, that was on alpha; glad to hear that should now be fixed. True, the scheduling definitely isn't an easy fix. Another option (maybe as a temporary fix) would be to have a watcher thread that dynamically scales the workers (as an optional setting).
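A very rough sketch of what such a watcher thread could look like, assuming placeholder callables for the backlog and worker counts, and scaling by shelling out to `verdi daemon incr` (just one possible mechanism):

```python
# Hypothetical watcher that scales daemon workers when the backlog grows.
# count_waiting_processes() and count_workers() are placeholder callables,
# not aiida-core functions.
import subprocess
import threading
import time

SLOTS_PER_WORKER = 200   # illustrative per-worker slot limit
CHECK_INTERVAL = 60      # seconds between checks
MAX_WORKERS = 8          # never scale beyond this many workers


def watch_and_scale(count_waiting_processes, count_workers):
    """Periodically add a worker while there are more processes than slots."""
    while True:
        waiting = count_waiting_processes()
        workers = count_workers()
        if waiting > workers * SLOTS_PER_WORKER and workers < MAX_WORKERS:
            subprocess.run(['verdi', 'daemon', 'incr', '1'], check=True)
        time.sleep(CHECK_INTERVAL)


def start_watcher(count_waiting_processes, count_workers) -> threading.Thread:
    """Run the watcher in a daemon thread so it does not block shutdown."""
    thread = threading.Thread(
        target=watch_and_scale,
        args=(count_waiting_processes, count_workers),
        daemon=True,
    )
    thread.start()
    return thread
```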
The proper solution to this will require significant changes to the way RabbitMQ is used. Therefore I am putting this in
I'm thinking about how to solve this in a more general way; for now I'm collecting information here: https://github.com/aiidateam/aiida-core/wiki/Scaling-of-AiiDA-workers-and-slots because it is easier than re-reading the full conversation each time.
Should we continue discussing here, and you'll collect it there? Or edit there directly?
I guess when the total number of processes is higher than the number of "slots" = workers * processes per worker, that is kind of unavoidable, right? (A small worked example of this arithmetic follows below.) Is my understanding correct that the messages would still be queued and delivered once some other worker picks them up? I think some sort of rotation / scheduling is needed here. Some comments:
Since this seems like a classical scheduling problem, we should definitely consult existing solutions / algorithms.
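For concreteness, a small worked example of the slots arithmetic from the previous comment, with purely illustrative numbers:

```python
# Illustrative slots arithmetic: total capacity versus submitted processes.
workers = 4
processes_per_worker = 200                # the per-worker prefetch/slot limit
slots = workers * processes_per_worker    # 800 concurrent processes at most

submitted = 1000
queued = max(0, submitted - slots)        # 200 processes wait for a free slot
print(f'{slots} slots, {queued} processes waiting in the queue')
```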
@gerschd, what you say is good, except we would still need to deal with the problem of how to make sure that processes which are 'pulled off' can still deal with incoming messages. Additionally, we have to be careful that any scheduler-like implementation doesn't end up in a situation where, say, a process is waiting for a transport but is pulled off before it gets it, because it looks like it's doing nothing. In the worst case, processes could never end up getting their transport because they are always pulled off before they get it. Summary below (this used to be in the wiki, but is best kept here):

Background information

The AiiDA daemon runs processes by launching workers, each of which has a finite number of slots representing the number of simultaneous (in the coroutine sense) processes that it can run; a minimal sketch of this slot limit is given at the end of this comment. Each worker maintains a single connection to the database. This choice of implementation was guided by the following constraints:
Problems

There are a number of problems with the current solution, some of which have been discussed in #1397. Specifically:
Possible solutions
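The slot limit mentioned under "Background information" above can be pictured with a small asyncio sketch; this is only an illustration of the concept, not the actual worker implementation:

```python
# Illustrative only: a worker-like loop that caps the number of
# simultaneously running (coroutine) processes with a semaphore.
import asyncio

SLOTS = 200  # illustrative per-worker slot limit


async def run_process(process_id: int, slot: asyncio.Semaphore) -> None:
    """Run a single process, occupying one slot while it is active."""
    async with slot:
        # Placeholder for the real work a process would do.
        await asyncio.sleep(0.01)


async def worker(process_ids) -> None:
    """Run all assigned processes, at most SLOTS of them concurrently."""
    slot = asyncio.Semaphore(SLOTS)
    await asyncio.gather(*(run_process(pid, slot) for pid in process_ids))


if __name__ == '__main__':
    asyncio.run(worker(range(1000)))
```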
When many processes are created, the workers will clog up and keep checking the status of a few processes, without actually running the processes that they depend on.
This is due to the prefetch limit.
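For context, the prefetch limit is the per-consumer cap on unacknowledged messages that RabbitMQ will deliver. A minimal sketch of how it is set with the pika client (illustrative only; aiida-core goes through its own communicator layer, and the queue name here is made up):

```python
# Illustrative only: how a RabbitMQ consumer limits the number of
# unacknowledged messages it holds via the prefetch count (pika client).
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
channel = connection.channel()
channel.queue_declare(queue='process.queue', durable=True)

# With prefetch_count=200, the broker delivers at most 200 unacked messages
# to this consumer; further processes stay queued until one of the 200 is
# acknowledged, which is what causes the clogging described above.
channel.basic_qos(prefetch_count=200)


def on_message(ch, method, properties, body):
    # A real worker would run the process here before acknowledging.
    ch.basic_ack(delivery_tag=method.delivery_tag)


channel.basic_consume(queue='process.queue', on_message_callback=on_message)
channel.start_consuming()
```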
Possible ways to solve this: