# Adding API for parallel block to task_arena to warm-up/retain/release worker threads

## Introduction

In oneTBB, there has never been an API that allows users to block worker threads within the arena.
This design choice was made to preserve the composability of the application.<br>
Before PR#1352, workers moved to the thread pool to sleep once there were no arenas with active
demand. However, PR#1352 introduced a delayed leave behavior to the library that
results in blocking threads for an _implementation-defined_ duration inside an arena
if there is no active demand across all arenas. This change significantly
improved performance for various applications on high thread count systems.<br>
The main idea is that, usually, after one parallel computation ends,
another starts after some time. The delayed leave behavior is a heuristic that exploits this pattern,
covering most cases within the _implementation-defined_ duration.

However, the new behavior is not a perfect match for all scenarios:
* The delayed leave heuristic is unsuitable for tasks that are submitted
  in an unpredictable pattern and/or with unpredictable duration.
* In composable scenarios, oneTBB does not behave as
  a good citizen when consuming CPU resources.
  * For example, if an application builds a pipeline where oneTBB is used for one stage
    and OpenMP is used for a subsequent stage, there is a chance that oneTBB workers will
    interfere with OpenMP threads. This interference might result in slight oversubscription,
    which in turn might lead to underperformance.

So there are two related problems with different resolutions:
* Completely disable the new behavior for scenarios where the delayed leave heuristic is unsuitable.
* Optimize library behavior so customers can benefit from the delayed leave heuristic but
  make it possible to indicate that "it is time to release threads".

## Proposal

Let's tackle these problems one by one.

### Completely disable new behavior

Let's consider both "Delayed leave" and "Fast leave" as two different states in a state machine.<br>
* The "Delayed leave" heuristic benefits most workloads. Therefore, it is the
  default behavior for an arena.
* Workloads that suffer a performance penalty from the delayed leave heuristic
  can create an arena in the "Fast leave" state.

<img src="completely_disable_new_behavior.png" width=800>

There is a question that we need to answer:
* Do we see any value in allowing an arena to transition from one state to another?

To answer this question, the following scenarios should be considered:
* What if different types of workloads are mixed in one application?
  * Different types of arenas can be used for different types of workloads.

### When should threads leave?

oneTBB itself can only guess when the ideal time to release threads from the arena is.
Therefore, it makes a best effort to preserve and enhance performance without completely
compromising composability guarantees (that is how delayed leave is implemented).

As already discussed, there are cases where this does not work perfectly,
so customers who want to further optimize this
aspect of oneTBB behavior should be able to do so.

||||||||||
This problem can be considered from another angle. Essentially, if the user can indicate | ||||||||||
where parallel computation ends, they can also indicate where it starts. | ||||||||||
|
||||||||||
<img src="parallel_block_introduction.png" width=800> | ||||||||||
|
||||||||||
With this approach, the user not only releases threads when necessary but also specifies a | ||||||||||
programmable block where worker threads should expected new work coming regularly | ||||||||||
to the executing arena. | ||||||||||
|
||||||||||
Let's add a new state to the existing state machine to represent the "Parallel Block" state.

> **_NOTE:_** The "Fast leave" state is colored grey just for simplicity of the chart.
Let's assume that the arena was created with "Delayed leave".
The logic demonstrated below is applicable to "Fast leave" as well.

<img src="parallel_block_state_initial.png" width=800>

This state diagram raises several questions. Here are some of them:
* What if there are multiple parallel blocks?
* If "End of Parallel Block" leads back to "Delayed leave", how soon will threads
  be released from the arena?
  * What if we indicated that threads should leave the arena after the "Parallel Block"?
  * What if we just indicated the end of the "Parallel Block"?

The extended state machine aims to answer these questions:
* The first call to "Start of PB" transitions the arena into the "Parallel Block" state.
* The last call to "End of PB" transitions it back to the "Delayed leave" state,
  or into "One-time Fast leave" if it was indicated that threads should leave sooner.
* Concurrent or nested calls to "Start of PB" or "End of PB"
  increment/decrement a reference counter.

<img src="parallel_block_state_final.png" width=800>

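To make the transition rules above concrete, here is a minimal, self-contained C++ sketch of the reference-counted state machine. All names (`LeaveState`, `ArenaStateSketch`, and its methods) are invented for illustration; this is not oneTBB code, and a real implementation would need thread-safe counting and actual scheduler integration.

```cpp
#include <cassert>

// Invented names for illustration only; not part of oneTBB.
enum class LeaveState { DelayedLeave, ParallelBlock, OneTimeFastLeave };

class ArenaStateSketch {
    LeaveState state_ = LeaveState::DelayedLeave;  // arena default
    int block_refs_ = 0;  // counts concurrent/nested parallel blocks
public:
    void start_block() {
        if (block_refs_++ == 0)  // only the first start changes state
            state_ = LeaveState::ParallelBlock;
    }
    void end_block(bool one_time_fast_leave = false) {
        assert(block_refs_ > 0 && "unbalanced end_block");
        if (--block_refs_ == 0)  // only the last end transitions out
            state_ = one_time_fast_leave ? LeaveState::OneTimeFastLeave
                                         : LeaveState::DelayedLeave;
    }
    // New work submitted after the block restores the default policy,
    // which is what makes the fast leave "one-time".
    void on_work_submitted() {
        if (state_ == LeaveState::OneTimeFastLeave)
            state_ = LeaveState::DelayedLeave;
    }
    LeaveState state() const { return state_; }
};
```

Note how nested starts and ends only move the counter: intermediate `end_block` calls are effectively ignored, and only the last one decides whether the arena enters "One-time Fast leave".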
Let's consider the semantics that an API for explicit parallel blocks can provide:
* Start of a parallel block:
  * Indicates the point from which the scheduler can use a hint and keep threads in the arena
    longer.
  * Serves as a warm-up hint to the scheduler:
    * Allows worker threads to be available by the time the real computation starts.
* "Parallel block" itself:
  * The scheduler can implement different policies to retain threads in the arena.
  * The semantics for retaining threads is a hint to the scheduler;
    thus, no real guarantee is provided. The scheduler can ignore the hint and
    move threads to another arena or to sleep if conditions are met.
* End of a parallel block:
  * Indicates the point from which the scheduler may drop the hint and
    no longer retain threads in the arena.
  * Indicates that the arena should enter the "One-time Fast leave" state so workers can leave sooner.
    * If work is submitted immediately after the end of the parallel block,
      the default arena "workers leave" policy is restored.
    * If the default "workers leave" policy was "Fast leave", this is a no-op.

### Proposed API

```cpp
class task_arena {
    enum class workers_leave : /* unspecified type */ {
        fast = /* unspecified */,
        delayed = /* unspecified */
    };
    task_arena(int max_concurrency = automatic, unsigned reserved_for_masters = 1,
               priority a_priority = priority::normal,
               workers_leave a_workers_leave = workers_leave::delayed);

    task_arena(const constraints& constraints_, unsigned reserved_for_masters = 1,
               priority a_priority = priority::normal,
               workers_leave a_workers_leave = workers_leave::delayed);

    void initialize(int max_concurrency, unsigned reserved_for_masters = 1,
                    priority a_priority = priority::normal,
                    workers_leave a_workers_leave = workers_leave::delayed);

    void initialize(constraints a_constraints, unsigned reserved_for_masters = 1,
                    priority a_priority = priority::normal,
                    workers_leave a_workers_leave = workers_leave::delayed);

    void start_parallel_block();
    void end_parallel_block(bool set_one_time_fast_leave = false);

    class scoped_parallel_block {
        scoped_parallel_block(task_arena& ta, bool set_one_time_fast_leave = false);
    };
};

namespace this_task_arena {
    void start_parallel_block();
    void end_parallel_block(bool set_one_time_fast_leave = false);
}
```

||||||||||
By the contract, users should indicate the end of _parallel block_ for each | ||||||||||
previous start of _parallel block_.<br> | ||||||||||
Let's introduce RAII scoped object that will help to manage the contract. | ||||||||||
|
||||||||||
If the end of the parallel block is not indicated by the user, it will be done automatically when | ||||||||||
the last public reference is removed from the arena (i.e., task_arena is destroyed or a thread | ||||||||||
is joined for an implicit arena). This ensures correctness is | ||||||||||
preserved (threads will not be retained forever). | ||||||||||
|
||||||||||
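To illustrate why an RAII wrapper makes the contract easier to uphold, here is a hypothetical, self-contained sketch. The stub names (`BlockCounterStub`, `ScopedBlockSketch`) are invented; the proposed `scoped_parallel_block` would take a `task_arena&` instead. The point it demonstrates is that the destructor signals the end of the block on every exit path, including stack unwinding after an exception:

```cpp
#include <cassert>
#include <stdexcept>

// Invented stand-in for an arena's parallel-block bookkeeping.
struct BlockCounterStub {
    int open_blocks = 0;
    void start() { ++open_blocks; }
    void end(bool /*one_time_fast_leave*/ = false) { --open_blocks; }
};

// Invented stand-in for the proposed scoped_parallel_block.
class ScopedBlockSketch {
    BlockCounterStub& arena_;
    bool fast_leave_;
public:
    explicit ScopedBlockSketch(BlockCounterStub& a, bool fast_leave = false)
        : arena_(a), fast_leave_(fast_leave) {
        arena_.start();  // start of the parallel block
    }
    ~ScopedBlockSketch() {
        arena_.end(fast_leave_);  // end is signaled on any exit path
    }
};
```

With the real proposed API, the same pattern would keep `start_parallel_block`/`end_parallel_block` calls balanced without requiring the user to place the end call on every return and exception path manually.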
## Considerations

Alternative approaches were also considered.<br>
We could express this state machine as a complete graph and provide a low-level interface that
gives control over state transitions.

<img src="alternative_proposal.png" width=600>

We considered this approach too low-level. Plus, it leaves a question: "How to manage concurrent changes of the state?"

The retaining of worker threads should be implemented with care because
it might introduce performance problems if:
* Threads cannot migrate to another arena because they are
  retained in the current arena.
* Compute resources are not homogeneous, e.g., the CPU is hybrid.
  Heavier involvement of less performant core types might result in artificial work
  imbalance in the arena.

## Open Questions in Design

Some open questions that remain:
* Are the suggested APIs sufficient?
* Are there additional use cases that should be considered that we missed in our analysis?
* Do we see any value in allowing an arena to transition from one state to another?
* What if different types of workloads are mixed in one application?
* What if there are concurrent calls to this API?