Deadlock when creating many processes #1397

Open
greschd opened this issue Apr 10, 2018 · 9 comments · Fixed by #2774
Comments

@greschd
Member

greschd commented Apr 10, 2018

When many processes are created, the workers clog up: they keep checking the status of a few waiting processes without ever running the processes those depend on.

This is due to the prefetch limit.

Possible ways to solve this:

  • Set the prefetch limit to 0 (no limit). Maybe this would create some problems with the heartbeat?
  • Add a scheduler which tells the workers to "drop" waiting processes and pick up new ones.
@greschd greschd added type/bug priority/critical-blocking must be resolved before next release labels Apr 10, 2018
@muhrin muhrin added this to the v1.0.0 milestone Jul 25, 2018
@sphuber sphuber modified the milestones: v1.0.0, v1.0.0b1 Dec 4, 2018
@sphuber sphuber modified the milestones: v1.0.0b1, v1.0.0 Dec 11, 2018
@sphuber
Contributor

sphuber commented Apr 8, 2019

The minimum amount of work for 1.0.0 should be to at least see whether we can add a warning when a user launches a process and the available daemon worker slots are below a certain percentage or number. I am not sure where this should be triggered. The most logical point would be in the launchers, but that means that every time they are called, some sort of query to all daemon workers has to be performed to get the number of open slots (if that is even possible), which might be quite slow.

An alternative, which is less accurate but more feasible, is to simply get the number of workers, multiply it by the prefetch-count global constant, and compare the result with the number of active processes. Actually, just run this check only when calling verdi process list and issue the warning at the bottom, just like when the daemon is not running, for example. Add instructions on what to do (verdi daemon incr), but maybe also link to documentation explaining the problem. Suggested threshold: 90%.
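A rough sketch of what such a check could look like (sketch only: the slots-per-worker constant and the worker-count callable are placeholders for whatever aiida-core exposes, not the actual API):

```python
from aiida.orm import QueryBuilder, ProcessNode

SLOTS_PER_WORKER = 200   # placeholder for the prefetch-count global constant
WARNING_THRESHOLD = 0.9  # warn at 90% occupancy

def warn_if_slots_saturated(get_number_of_workers):
    """Warn when active processes approach the total number of worker slots.

    ``get_number_of_workers`` is a placeholder callable that would ask circus
    how many daemon workers are currently running.
    """
    active = QueryBuilder().append(
        ProcessNode,
        filters={'attributes.process_state': {'in': ['created', 'waiting', 'running']}},
    ).count()

    available = get_number_of_workers() * SLOTS_PER_WORKER
    if available and active > WARNING_THRESHOLD * available:
        print(
            f'Warning: {active} active processes for only {available} worker slots; '
            'consider running `verdi daemon incr` or see the documentation.'
        )
```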

@ConradJohnston
Contributor

I pushed an implementation of this warning fix to a branch in my fork.

What we learned during this process is that the query to generate the table in verdi process list can be quite costly (~18 secs on @sphuber's 10.5-million-node DB). The problem with piggy-backing on this query to get the number of active processes is that one really needs a second query for only the active processes, because the user can apply any filters they want to the output of verdi process list, but we want to guarantee that the warning is printed in all cases (except for raw output). 30+ seconds to get a response is excessive. The poor performance is likely related to filtering on the process state, which is stored as an attribute.

A second problem is that getting a CircusClient in order to obtain the number of workers is also quite slow (about 1 second).

A final problem is what to tell the user to do when the warning fires. Currently it suggests that they should increase the number of workers. However, we are not sure what the upper limit for processes should be and what is feasible under a typical OS and system setup. In this sense, perhaps giving this advice blindly is a bad idea.

We need to think about whether printing a warning is the right course of action for now, and make plans to fix the cause of the problem.

@greschd
Member Author

greschd commented May 11, 2019

Thanks @ConradJohnston for the improvement!

@sphuber Should we really close this issue though, or was that just done automatically/accidentally because the PR mentions this issue?

To add to my initial suggestions, I think one could also implement some logic where a process gets dropped (and added to the end of the queue) when it has been waiting for a specific number of "worker cycles" or specific time.
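Purely as a conceptual sketch of that idea (none of these names are real AiiDA objects; a worker's bookkeeping is assumed to be a simple dict plus a shared queue):

```python
import time

MAX_WAIT_SECONDS = 600  # illustrative threshold for "has waited too long"

def rotate_waiting_processes(slots, queue):
    """Requeue processes that have been waiting too long, freeing their slots.

    ``slots`` maps a process id to ``(process, wait_start)`` for the processes
    currently held by this worker; ``queue`` is the shared task queue that all
    workers pull new work from.
    """
    now = time.monotonic()
    for pid, (process, wait_start) in list(slots.items()):
        if process.is_waiting and now - wait_start > MAX_WAIT_SECONDS:
            del slots[pid]      # free the slot on this worker
            queue.append(pid)   # push the task to the back of the shared queue
```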

Maybe my tasks were a bit excessive in the number of processes created, but I've found launching a large number of workers to be problematic in terms of CPU load -- for workers which end up not doing much.

@sphuber sphuber reopened this May 11, 2019
@sphuber
Contributor

sphuber commented May 11, 2019

Yeah, it was closed accidentally. The heavy CPU load, was that on the alphas of aiida-core? There were some inefficiencies, since solved, that could cause the workers to spin too often, i.e. the polling fallback was firing all the time because the default was wrongly set to 0. The actual solution to this problem will probably be quite complicated, as any typical solution might add other inefficiencies, and some of the things we might want to do would require functionality that RabbitMQ was not designed for and so cannot provide.

@greschd
Member Author

greschd commented May 12, 2019

Yeah, that was on the alpha; glad to hear that should now be fixed.

True, the scheduling definitely isn't an easy fix. Another option (maybe as a temporary fix) would be to have a watcher thread that dynamically scales the number of workers (as an optional setting).
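Just to illustrate the idea, a minimal sketch of such a watcher, assuming placeholder callables for counting active processes and workers (verdi daemon incr is the real command; everything else here is hypothetical):

```python
import subprocess
import time

SLOTS_PER_WORKER = 200  # placeholder for the prefetch-count constant

def autoscale_workers(count_active_processes, count_workers, interval=60):
    """Naive watcher loop: add a daemon worker whenever the slots saturate.

    ``count_active_processes`` and ``count_workers`` are placeholder callables;
    this only illustrates the idea, it is not actual aiida-core machinery.
    """
    while True:
        active = count_active_processes()
        slots = count_workers() * SLOTS_PER_WORKER
        if slots and active > 0.9 * slots:
            # `verdi daemon incr` adds one more daemon worker
            subprocess.run(['verdi', 'daemon', 'incr'], check=True)
        time.sleep(interval)
```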

@sphuber sphuber modified the milestones: v1.0.0, v2.0.0 Jun 17, 2019
@sphuber
Contributor

sphuber commented Jun 17, 2019

The proper solution to this will require significant changes to the way RabbitMQ is used, therefore I am putting this in v2.0.0. For now, the "fix" implemented in #2774 should give some breathing room, but the problem is not fixed.

@muhrin
Contributor

muhrin commented Dec 12, 2019

I'm thinking about how to solve this in a more general way; for now I'm collecting information here:

https://github.com/aiidateam/aiida-core/wiki/Scaling-of-AiiDA-workers-and-slots

because it is easier than re-reading the full conversation each time.

@greschd
Member Author

greschd commented Dec 12, 2019

Should we continue discussing here, and you'll collect it there? Or edit there directly?

> The problem with this is that the parent would then not be able to receive messages and be unable to be killed, paused, etc.

I guess when the total number of processes is higher than the number of "slots" = workers * processes per worker, that is kind of unavoidable, right? Is my understanding correct that the messages would still get queued, and the processes would receive them once some other worker picks them up?

I think some sort of rotation / scheduling is needed here. Some comments:

  • As long as the number of slots is larger than the number of processes, I think a process should not be "pulled off".
  • There could either be a dedicated "load balancer", or some heuristic for when a process is "dropped" while waiting. A middle ground would be a "monitor" that observes process performance / stats, and adjusts "heuristic parameters".

Since this seems like a classical scheduling problem, we should definitely consult existing solutions / algorithms.

@muhrin
Contributor

muhrin commented Dec 13, 2019

Discussing here is good; I was just thinking of putting summary stuff in the wiki.

@greschd, what you say is good, except we still need to deal with the problem of how to make sure processes that are 'pulled off' can still deal with incoming messages. Additionally, we have to be careful that any scheduler-like implementation doesn't end up in a situation where, say, a process is waiting for a transport but is pulled off before it gets it, because it looks like it's doing nothing. In the worst case, processes could never end up getting their transport because they are always pulled off first.
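One way such a scheduler might guard against that starvation, sketched with made-up bookkeeping (not AiiDA code): give every process a minimum residence time in its slot before it becomes eligible to be pulled off.

```python
def pick_process_to_drop(slot_holders, min_residence=300):
    """Pick one waiting process to pull off a saturated worker, or ``None``.

    ``slot_holders`` maps a process id to the seconds it has held its slot.
    The minimum residence time guarantees every process keeps its slot long
    enough to, e.g., receive its transport, instead of being requeued the
    moment it looks idle.
    """
    eligible = {pid: held for pid, held in slot_holders.items() if held > min_residence}
    if not eligible:
        return None  # nobody has held a slot long enough to be dropped
    return max(eligible, key=eligible.get)  # rotate out the longest-held slot
```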

summary below (this used to be in the wiki, but is best kept here):

Background information

The AiiDA daemon runs processes by launching workers, each of which has a finite number of slots representing the number of simultaneous (in the coroutine sense) processes it can run. Each worker maintains a single connection to the database.

This choice of implementation was guided by the following constraints:

  1. There is a database connection limit (typically 100 on Ubuntu) which should be respected.
  2. Threads don't always play nicely with Django or SQLAlchemy, and indeed Django uses a separate connection per thread.

Problems

There are a number of problems with the current solution, some of which have been discussed in #1397. Specifically:

  1. AiiDA workflows are allowed to block for lengthy periods of time, e.g. when carrying out a complex local calculation. This means that other processes on the same worker are blocked (and can't even receive messages), which makes it undesirable to have too many on the same worker.
  2. Workchains may spawn child processes which they wait on; however, if there are no slots left, the child will never run and the parent will wait indefinitely while blocking a slot (see the sketch below).
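To make problem 2 concrete, a minimal, illustrative workchain pair with roughly the shape that triggers it (simplified; real workchains would declare inputs and outputs): submitting more parents than there are total slots leaves every worker full of waiting parents and no free slot for any child.

```python
from aiida.engine import WorkChain, ToContext

class ChildWorkChain(WorkChain):
    """Trivial child whose only job here is to need a free slot to run."""

    @classmethod
    def define(cls, spec):
        super().define(spec)
        spec.outline(cls.do_work)

    def do_work(self):
        self.report('child ran')


class ParentWorkChain(WorkChain):
    """Parent that submits a child and waits on it, keeping its own slot."""

    @classmethod
    def define(cls, spec):
        super().define(spec)
        spec.outline(cls.submit_child, cls.inspect_child)

    def submit_child(self):
        # The parent stays in the waiting state, still occupying its slot,
        # until the child finishes. If every slot is taken by such parents,
        # no child can ever start and the daemon is effectively deadlocked.
        return ToContext(child=self.submit(ChildWorkChain))

    def inspect_child(self):
        self.report(f'child finished with exit status {self.ctx.child.exit_status}')
```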

Possible solutions

  1. It is tempting to try to pull a waiting parent process off a worker to allow the child to run, and then re-create it so it can continue once the child finishes. The problem with this is that the parent would then not be able to receive messages and would be unable to be killed, paused, etc.
