Reinstate broadcast subscription of workchain for child processes #4154

sphuber · 2020-06-05T16:52:30Z

When a WorkChain step submits sub processes, awaitables will be
created for them. Only when these awaitables have been resolved, meaning
the subprocesses, have terminated, can the workchain continue to the
next step.

The original concept was, for each awaitable, to schedule a callback
once the process had reached a terminal state. The callback was supposed
to be triggered by having the runner add a broadcast subscriber that would
listen for state changes of the sub process. As a fail-safe, a polling
mechanism would also check periodically just in case the broadcast
message would be missed and prevent the caller from waiting indefinitely.

However, the broadcast subscriber was never added and so the system
relied solely on the polling mechanism. This completely undermines the
benefits of having an event-based mechanism, so in this commit the
Runner.call_on_process_finish now also registers the broadcast
subscriber.

Note that the affected code had some references to calculation which
has been generalized to process, since this also applies to workflows
that might waited upon. The CalculationFuture has been renamed to
ProcessFuture in similar vein. It is currently not used, but it could
have been used for the problem that this commit solves, so it has been
decided to leave it in for now and not remove it entirely.

sphuber · 2020-06-06T07:38:34Z

This PR is blocked because it needs a new release of plumpy. Currently I am just manually installing the develop version in the CI. Once we are happy with the code, I will release plumpy==0.15.0 and update the PR.

unkcpz · 2020-06-06T15:09:56Z

aiida/engine/runners.py

@@ -67,7 +67,7 @@ def __init__(self, poll_interval=0, loop=None, communicator=None, rmq_submit=Fal
        self._plugin_version_provider = PluginVersionProvider()

        if communicator is not None:
-            self._communicator = communicator
+            self._communicator = plumpy.wrap_communicator(communicator, self._loop)


hi @sphuber , as a reminder, since now you wrap the communicator here to LoopCommnicator, then there is no need to add task subscriber with scheduling task_receiver callback to the process event loop by create_task.
In file aiida/manage/manager.py::Manager::create_daemon_runner .

Thanks a lot for the heads-up @unkcpz . I will have a look and update the PR.

@unkcpz I added this simplification in the first commit of this PR

all good. Thanks!

codecov · 2020-06-16T12:12:43Z

Codecov Report

Merging #4154 into develop will increase coverage by 0.02%.
The diff coverage is 96.23%.

@@             Coverage Diff             @@
##           develop    #4154      +/-   ##
===========================================
+ Coverage    78.89%   78.90%   +0.02%     
===========================================
  Files          467      467              
  Lines        34470    34489      +19     
===========================================
+ Hits         27191    27211      +20     
+ Misses        7279     7278       -1

Flag	Coverage Δ
#django	`70.83% <96.23%> (+0.03%)`	⬆️
#sqlalchemy	`71.71% <96.23%> (+0.02%)`	⬆️

Impacted Files	Coverage Δ
aiida/engine/daemon/runner.py	`79.32% <0.00%> (ø)`
aiida/engine/processes/workchains/awaitable.py	`89.48% <ø> (ø)`
aiida/orm/nodes/node.py	`91.22% <ø> (ø)`
aiida/engine/processes/workchains/workchain.py	`90.23% <80.00%> (+0.08%)`	⬆️
aiida/engine/processes/futures.py	`100.00% <100.00%> (+5.89%)`	⬆️
aiida/engine/runners.py	`93.11% <100.00%> (+1.11%)`	⬆️
aiida/manage/manager.py	`95.34% <100.00%> (-0.06%)`	⬇️
aiida/transports/plugins/ssh.py	`76.39% <100.00%> (ø)`
aiida/engine/daemon/client.py	`72.58% <0.00%> (-1.14%)`	⬇️
aiida/orm/nodes/process/process.py	`85.80% <0.00%> (+0.60%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f9d66b4...7441fb8. Read the comment docs.

unkcpz · 2020-06-17T01:20:00Z

aiida/engine/runners.py

@@ -67,7 +67,7 @@ def __init__(self, poll_interval=0, loop=None, communicator=None, rmq_submit=Fal
        self._plugin_version_provider = PluginVersionProvider()

        if communicator is not None:
-            self._communicator = communicator
+            self._communicator = plumpy.wrap_communicator(communicator, self._loop)


all good. Thanks!

muhrin

So, all good except for that callback change. If you can think of an easy way to test the 'double resolution' of the callback that might be useful but I don't see it as necessary so I wouldn't spend much time on it.

muhrin · 2020-06-17T07:46:54Z

aiida/engine/runners.py

+                    # The broadcast subscriber has already removed
+                    pass
+
+        broadcast_filter = kiwipy.BroadcastFilter(functools.partial(inline_callback, resolved), sender=pk)


This, unfortunately, doesn't work. Python has this sneaky thing where all primitives are passed by value other types by reference meaning that your resolved in the callback is actually always a new (False) boolean each time the function is invoked.

You can test it with this:

res = False def do(res): if res: print("Already resolved") return res = True print("Now resolved") do(res) do(res)

As ugly as it is, this is one solution:

res = [False] def do(res): if res[0]: print("Already resolved") return res[0] = True print("Now resolved") do(res) do(res)

But if you don't like it feel free to create anything where an non-primitive is used as a flag. I think at some point I may have created a special Flag class just for this in some of my code. A threading.Event is also a decent choice.

This is necessary because otherwise subscribers that use the runner's communicator and receive a message will execute the callback on the communication thread. This should be prevented for two reasons. Firstly, it will cause the business logic to be executed on the communication thread which can cause it to fail to respond in time to heartbeat requests from RabbitMQ when it is under heavy load. Secondly, the communication thread does not share the same session as the main thread. This can cause problems when the callbacks query the database as it can be out of sync with the session of the main thread. For Django this will most likely go by unnoticed, but SqlAlchemy will except. Since the communicator of the runner is now wrapped, the process launcher of the daemon runner no longer has to be added as a task subscriber through a callback, but the `add_task_subscriber` can be called directly and the `LoopCommunicator` will take of everything.

muhrin

Great, good to go mate.

When a `WorkChain` step submits sub processes, awaitables will be created for them. Only when these awaitables have been resolved, meaning the subprocesses, have terminated, can the workchain continue to the next step. The original concept was, for each awaitable, to schedule a callback once the process had reached a terminal state. The callback was supposed to be triggered by having the runner add a broadcast subscriber that would listen for state changes of the sub process. As a fail-safe, a polling mechanism would also check periodically just in case the broadcast message would be missed and prevent the caller from waiting indefinitely. However, the broadcast subscriber was never added and so the system relied solely on the polling mechanism. This completely undermines the benefits of having an event-based mechanism, so in this commit the `Runner.call_on_process_finish` now also registers the broadcast subscriber. Note that the affected code had some references to `calculation` which has been generalized to `process`, since this also applies to workflows that might waited upon. The `CalculationFuture` has been renamed to `ProcessFuture` in similar vein. It is currently not used, but it could have been used for the problem that this commit solves, so it has been decided to leave it in for now and not remove it entirely.

In PR aiidateam#4154, the LoopCommunicator was used as the runner's communicator. This makes it easy to schecule callbacks into the correct event loops, however, the `Process` has its own mechanism to do this. In object method `_schedule_rpc` process schedule the callback into its event loop. Using `LoopCommunitor` lead to a problem that when controller send the controlling message by rpc for example when calling `controller.pause_process` the `pause` subscriber is run by `plumpy.futures.create_task`, which is not necessary. `call_on_process_finish` receive a callback which not expected to be a coroutine, however, in `action_awaitables` we schedule `_run_task` which is a corotinefunction to it. The `call_soon` method of Process object is more proper here.

sphuber force-pushed the fix/4151/workchain-child-broadcast-subscribe branch 8 times, most recently from d773a9e to dd57b17 Compare June 5, 2020 20:25

sphuber requested a review from muhrin June 6, 2020 07:37

sphuber added the pr/blocked label Jun 6, 2020

unkcpz reviewed Jun 6, 2020

View reviewed changes

sphuber mentioned this pull request Jun 10, 2020

WorkChain: guarantee to maintain order of appended awaitables #4156

Merged

sphuber force-pushed the fix/4151/workchain-child-broadcast-subscribe branch from dd57b17 to fdcba39 Compare June 16, 2020 11:54

sphuber force-pushed the fix/4151/workchain-child-broadcast-subscribe branch from 4e8d88f to ebc6db9 Compare June 16, 2020 17:18

sphuber requested a review from unkcpz June 16, 2020 17:18

unkcpz previously approved these changes Jun 17, 2020

View reviewed changes

muhrin suggested changes Jun 17, 2020

View reviewed changes

sphuber dismissed unkcpz’s stale review via 25d6231 June 17, 2020 12:50

sphuber force-pushed the fix/4151/workchain-child-broadcast-subscribe branch 2 times, most recently from 25d6231 to 076849d Compare June 17, 2020 13:18

sphuber force-pushed the fix/4151/workchain-child-broadcast-subscribe branch from 076849d to bea3cb4 Compare June 17, 2020 13:38

muhrin approved these changes Jun 17, 2020

View reviewed changes

sphuber force-pushed the fix/4151/workchain-child-broadcast-subscribe branch from bea3cb4 to 7441fb8 Compare June 17, 2020 14:20

sphuber merged commit 1af57cc into aiidateam:develop Jun 17, 2020

sphuber deleted the fix/4151/workchain-child-broadcast-subscribe branch June 17, 2020 15:01

sphuber mentioned this pull request Jun 18, 2020

Increase the default for runner.poll.interval config option to 60 #4150

Merged

unkcpz mentioned this pull request Jun 20, 2020

Process on finish call without using LoopCommunicator #4191

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Reinstate broadcast subscription of workchain for child processes #4154

Reinstate broadcast subscription of workchain for child processes #4154

sphuber commented Jun 5, 2020

sphuber commented Jun 6, 2020

unkcpz Jun 6, 2020

sphuber Jun 6, 2020

sphuber Jun 16, 2020

unkcpz Jun 17, 2020

codecov bot commented Jun 16, 2020 •

edited

Loading

unkcpz Jun 17, 2020

muhrin left a comment

muhrin Jun 17, 2020

muhrin left a comment

Reinstate broadcast subscription of workchain for child processes #4154

Reinstate broadcast subscription of workchain for child processes #4154

Conversation

sphuber commented Jun 5, 2020

sphuber commented Jun 6, 2020

unkcpz Jun 6, 2020

Choose a reason for hiding this comment

sphuber Jun 6, 2020

Choose a reason for hiding this comment

sphuber Jun 16, 2020

Choose a reason for hiding this comment

unkcpz Jun 17, 2020

Choose a reason for hiding this comment

codecov bot commented Jun 16, 2020 • edited Loading

Codecov Report

unkcpz Jun 17, 2020

Choose a reason for hiding this comment

muhrin left a comment

Choose a reason for hiding this comment

muhrin Jun 17, 2020

Choose a reason for hiding this comment

muhrin left a comment

Choose a reason for hiding this comment

codecov bot commented Jun 16, 2020 •

edited

Loading