Feature: scheduler #275
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #275 +/- ##
==========================================
- Coverage 75.75% 67.30% -8.45%
==========================================
Files 70 70
Lines 4615 6123 +1508
==========================================
+ Hits 3496 4121 +625
- Misses 1119 2002 +883
aiida_workgraph/engine/launch.py
return result, process.node


def instantiate_process(
@sphuber, I modified the instantiate_process function from aiida-core:
- Pick up the Scheduler process instead of launching a new one.
- Submit a WorkGraph inside the scheduler.
- Move the report from the Scheduler process to the WorkGraph process.
The scheduler listens for tasks on the scheduler_queue.
1) Multiple runners (daemons) can be run for the scheduler. Each runner listens to the `scheduler_queue` with prefetch_count set to 1, so each runner can launch only one Scheduler process.
2) The Scheduler process listens to the `workgraph_queue` to launch WorkGraphs.
3) The scheduler receives RPC calls to launch a WorkGraph.
4) A user can submit a WorkGraph to the workgraph queue, or select a scheduler by pk to run it.
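The effect of a prefetch count of 1 can be illustrated with a toy model. This is a minimal sketch in plain Python, not the RabbitMQ or aiida-core implementation: the `Runner` class, the `dispatch` function, and the message names are all hypothetical and exist only for illustration.

```python
from collections import deque


class Runner:
    """Hypothetical stand-in for a daemon runner consuming from a queue."""

    def __init__(self, name, prefetch_count=1):
        self.name = name
        self.prefetch_count = prefetch_count
        self.unacked = []  # messages delivered but not yet acknowledged

    def ack(self, message):
        """Acknowledge a message, freeing capacity for a new delivery."""
        self.unacked.remove(message)


def dispatch(queue, runners):
    """Deliver messages round-robin while some runner still has capacity."""
    progress = True
    while queue and progress:
        progress = False
        for runner in runners:
            if queue and len(runner.unacked) < runner.prefetch_count:
                runner.unacked.append(queue.popleft())
                progress = True


# Three "continue scheduler" messages, but only two runners.
queue = deque(["scheduler-1", "scheduler-2", "scheduler-3"])
runners = [Runner("runner-a"), Runner("runner-b")]
dispatch(queue, runners)

for runner in runners:
    print(runner.name, runner.unacked)
print("still queued:", list(queue))
```

In this model each runner holds exactly one unacknowledged message, so it hosts one Scheduler process, and the third message stays in the queue until a runner acknowledges its current one.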
Background
When running a workflow (such as a WorkChain or WorkGraph), each workflow is associated with a corresponding process. This process launches its child processes (e.g., CalcJob processes) and waits for them. In nested workflows like the `PwBandsWorkChain`, you may encounter multiple workflow processes in a waiting state, with only one CalcJob process actively running. These waiting workflow processes are an inefficient use of resources.
In a `WorkChain`, the workflow logic is encapsulated within the new `WorkChain` class, making it challenging to eliminate these waiting processes at the moment. However, in a `WorkGraph`, the logic is more explicitly defined, and it has strict rules on who can execute this logic.
Besides, it's not good to run the task process and the WorkGraph process in the same runner.
Proposal
To address this, I propose a Scheduler for the `WorkGraph` in this PR. The Scheduler handles the following:
- It creates the `WorkGraph` process only in the database, without the process actually being run by a daemon worker.
- For a task such as a `CalcJob`, it launches the process to the daemon worker as usual. The key difference here is that the Scheduler uses the `WorkGraph`'s PK as the parent PID, thereby maintaining correct provenance.
Let's compare the process count for the `PwBands` case. Suppose we launch 100 `PwBands` WorkGraphs.
The benefit is clear: the new approach significantly reduces the number of active processes.
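The comparison can be made concrete with a back-of-the-envelope count. The sketch below rests on assumptions that are not taken from this PR: each workflow has `nested_levels` waiting workflow processes plus one running CalcJob, and the value 3 for the nesting depth is purely illustrative. Both function names are hypothetical.

```python
def active_processes_nested(n_workflows, nested_levels):
    """Old approach: every nested workflow level keeps one waiting process
    alive, on top of the single running CalcJob."""
    return n_workflows * (nested_levels + 1)


def active_processes_with_scheduler(n_workflows, n_schedulers=1):
    """Scheduler approach: WorkGraph processes exist only in the database,
    so only the scheduler(s) and the running CalcJobs are active."""
    return n_schedulers + n_workflows


n_workflows = 100
nested_levels = 3  # illustrative assumption, not a measured value

print("nested workflows:", active_processes_nested(n_workflows, nested_levels))
print("with scheduler:  ", active_processes_with_scheduler(n_workflows))
```

Under these assumptions the saving grows with the nesting depth, since the waiting workflow processes are the only term that disappears.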
Moreover, the Scheduler runs in a separate daemon that does not listen to process-launching tasks, thereby eliminating the possibility of the deadlocks that could occur with the old approach.
This is also related to these issues:
- A workflow process may spawn child processes that it waits on. However, if there are no slots left, the child will never run and the parent will wait indefinitely while blocking a slot. More details in "Deadlock when creating many processes" (aiida-core#1397).
- Users want to control the maximum number of running jobs on a computer.
Note: this scheduler is designed for WorkGraph only. It will not work for WorkChain.
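The slot deadlock described above can be reproduced with a toy worker pool. This is a simplified model under stated assumptions, not the aiida-core daemon: tasks are taken first-in first-out, each parent spawns exactly one child, and a child finishes as soon as it gets a slot. The function name and the `parents_hold_slot` flag are hypothetical.

```python
from collections import deque


def run_pool(slots, n_parents, parents_hold_slot=True):
    """Return (finished_parents, parents_still_blocking_a_slot)."""
    queue = deque(("parent", i) for i in range(n_parents))
    waiting = set()  # parents occupying a slot while waiting for their child
    finished = 0
    # A task can only start while at least one slot is free.
    while queue and len(waiting) < slots:
        kind, pid = queue.popleft()
        if kind == "parent":
            if parents_hold_slot:
                waiting.add(pid)
            queue.append(("child", pid))  # the child must wait for a free slot
        else:
            # The child runs to completion and releases its parent.
            waiting.discard(pid)
            finished += 1
    return finished, len(waiting)


print(run_pool(slots=2, n_parents=2))  # parents fill every slot
print(run_pool(slots=3, n_parents=2))  # one slot left for the children
print(run_pool(slots=2, n_parents=5, parents_hold_slot=False))
```

With `parents_hold_slot=False` the parents never occupy a worker slot, which is the situation the Scheduler aims for by keeping WorkGraph processes out of the task workers.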
Usage
https://aiida-workgraph--275.org.readthedocs.build/en/275/howto/scheduler.html
Scheduler
Add a daemon runner for the scheduler:
- Each runner listens to the `scheduler_queue`, and the prefetch_count is set to 1. Thus, each runner can only launch one Scheduler process.
- The Scheduler process listens to the `workgraph_queue` to launch WorkGraphs.
Keep provenance
- ... `parent_pid`, so that there is a link between the workgraph and the task's process.
- ... `parent_pid`, and launch the workgraph inside the scheduler.
Use one scheduler process or scale the number of processes when needed.
While a single scheduler suffices for most use cases, scaling up the number of schedulers may be beneficial when significantly increasing the number of task workers (created by `verdi daemon start`). A general rule is to maintain a ratio of less than 5 workers per scheduler.
Circus
Similar to the `worker` daemon, we use circus to manage the `scheduler` daemon.
command
Todo
- Pick up the `Scheduler` process instead of launching a new one.
- The number of messages in the `scheduler_queue` in RabbitMQ increases because the runner does not ack. For example, when the runner stops, the scheduler process is still running, so the runner does not ack back to RabbitMQ. When the runner restarts, it processes the first message in the queue, but I also send a new message to the queue to continue the scheduler. This is a bug: we don't need to send the message to continue, because it is already there.
checkpoint
How do we save the checkpoint? Instead of saving all data every time, it would be better to update only the context related to the workgraph.
solution 1
Save the ctx data for a workgraph to the extras of that workgraph.
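A rough sketch of what solution 1 could look like is given below. It is written against a hypothetical in-memory stand-in rather than the AiiDA API: `FakeNode`, `SchedulerContext`, and the `workgraph_ctx` extras key are assumptions made only for illustration. The point it shows is that an update touches the extras of one workgraph and leaves the others alone.

```python
class FakeNode:
    """Hypothetical stand-in for a stored WorkGraph process node."""

    def __init__(self, pk):
        self.pk = pk
        self.extras = {}
        self.write_count = 0  # how often this node's extras were written

    def set_extra(self, key, value):
        self.extras[key] = value
        self.write_count += 1


class SchedulerContext:
    """Keeps one ctx dict per workgraph and persists it per workgraph."""

    EXTRAS_KEY = "workgraph_ctx"  # assumed key name

    def __init__(self):
        self._ctx = {}  # workgraph pk -> ctx dict

    def update(self, node, key, value):
        ctx = self._ctx.setdefault(node.pk, {})
        ctx[key] = value
        # Only the workgraph that changed is written; the rest stay untouched.
        node.set_extra(self.EXTRAS_KEY, dict(ctx))


wg1, wg2 = FakeNode(pk=1), FakeNode(pk=2)
scheduler_ctx = SchedulerContext()
scheduler_ctx.update(wg1, "task_state", "RUNNING")
scheduler_ctx.update(wg1, "task_state", "FINISHED")
scheduler_ctx.update(wg2, "task_state", "RUNNING")

print(wg1.extras, wg1.write_count)
print(wg2.extras, wg2.write_count)
```

The trade-off in this design is one small write per update instead of one large checkpoint, so the cost scales with the workgraph that changed rather than with everything the scheduler manages.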
submit calcfunction
I tested this: one can submit a calcfunction if it is defined inside a package, because the daemon can then load it back using `importlib.import_module`. For a calcfunction defined on the fly, it will raise an error.
Other features after this PR
- ... `workgraph_queue`, or run directly in the same scheduler?
- ... `workgraph_queue` will make the schedulers more balanced.
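Returning to the calcfunction submission point, the difference between a packaged function and one defined on the fly can be shown with the standard library alone. This is a minimal sketch of the general mechanism, not of how aiida-core or aiida-workgraph serialize functions: `function_reference` and `load_function` are hypothetical helpers, `json.dumps` stands in for a function that lives in an importable package, and the module name `on_the_fly_module_for_demo` is made up.

```python
import importlib
import json


def function_reference(func):
    """Describe a function by the module and name it can be imported from."""
    return func.__module__, func.__qualname__


def load_function(module_name, qualname):
    """Load a function back from its reference, as another process would."""
    module = importlib.import_module(module_name)
    obj = module
    for part in qualname.split("."):
        obj = getattr(obj, part)
    return obj


# A function that lives in an importable package can be loaded back.
packaged = load_function(*function_reference(json.dumps))
print(packaged is json.dumps)

# A function defined on the fly belongs to a module nobody can import.
namespace = {"__name__": "on_the_fly_module_for_demo"}
exec("def add(x, y):\n    return x + y", namespace)
on_the_fly = namespace["add"]

try:
    load_function(*function_reference(on_the_fly))
    reloaded = True
except ImportError as exc:
    reloaded = False
    print("cannot reload:", exc)
```

The on-the-fly function still works in the process that defined it; the failure only appears when a separate process tries to rebuild it from its module path.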