Conversation
Force-pushed from 4d8b214 to aa10a4a.
@t-nelson should be close to what we discussed
"execute_batch", | ||
); | ||
|
||
let thread_index = PAR_THREAD_POOL.current_thread_index().unwrap(); |
How will you know if it's rayon just scheduling a bunch of small jobs on one thread vs. a single (or multiple) large batch(es) which was executed on one thread? Also, temporally, did that create the longest timings, or maybe that thread started sooner than another thread and actually caused the batch to take longer? And the rayon scheduling could vary across machines, no?
> if it's rayon just scheduled a bunch of small jobs on one thread vs. a single (or multiple) large batch(es) which was executed on one thread
Yeah, we won't be able to distinguish this; we'll just take the longest-running thread and hope that captures the bottleneck. Right now we're only handing the thread pool 2-3 batches at a time, so hopefully the longest thread does reflect the actual end-to-end timings.
> Also temporally if that created the longest timings or maybe that thread started sooner than another thread and actually caused the batch to take longer. And the rayon scheduling could vary across machines, no?
I was thinking that if the longest thread's total execution time doesn't add up to the total replay time, then we know there's some rayon shenanigans going on.
I'm wondering if recording either the start or end time for threads, and snapshotting it for the longest-running thread, could help us understand cases where the longest thread time is shorter than the total execution time. It could shed some light on whether that thread got started late, for example.
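A minimal sketch of that idea, using hypothetical names rather than code from this PR: record each worker's first-start Instant alongside its accumulated busy time, keyed by current_thread_index(), so a longest-thread total that falls well short of wall-clock replay time can be attributed to a late start rather than to how rayon split the work.

```rust
use std::collections::HashMap;
use std::sync::Mutex;
use std::time::Instant;

// Hypothetical per-thread record: when this worker began its first batch for
// the slot, plus its accumulated busy time in microseconds.
struct ThreadTiming {
    first_start: Instant,
    total_thread_us: u64,
}

// Called by each rayon worker after it finishes a batch.
fn record_batch(
    timings: &Mutex<HashMap<usize, ThreadTiming>>,
    thread_index: usize,
    batch_start: Instant,
    batch_us: u64,
) {
    let mut timings = timings.lock().unwrap();
    let entry = timings.entry(thread_index).or_insert_with(|| ThreadTiming {
        first_start: batch_start,
        total_thread_us: 0,
    });
    entry.total_thread_us = entry.total_thread_us.saturating_add(batch_us);
}

// After the pool drains: if the busiest thread's total is well below the
// wall-clock replay time, its start offset from `replay_start` shows whether
// it simply got scheduled late.
fn report_longest(
    timings: &Mutex<HashMap<usize, ThreadTiming>>,
    replay_start: Instant,
    replay_us: u64,
) {
    let timings = timings.lock().unwrap();
    if let Some((idx, t)) = timings.iter().max_by_key(|(_, t)| t.total_thread_us) {
        let start_delay_us = t.first_start.saturating_duration_since(replay_start).as_micros();
        println!(
            "thread {}: busy {}us of {}us replay, started {}us in",
            idx, t.total_thread_us, replay_us, start_delay_us
        );
    }
}
```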
Example of this metric on a 5k TPS GCE cluster. [The referenced metric names/screenshots did not survive extraction.] These two match up reasonably well; the rest of the gap with the […]
}

impl Default for ConfirmationTiming {
    fn default() -> Self {
        Self {
            started: Instant::now(),
            replay_elapsed: 0,
            execute_batches_us: 0,
            poh_verify_elapsed: 0,
            transaction_verify_elapsed: 0,
            fetch_elapsed: 0,
            fetch_fail_elapsed: 0,
            execute_timings: ExecuteTimings::default(),
should we rename this cumulative_execute_timings?
ledger/src/blockstore_processor.rs (Outdated)
    cumulative_execute_timings
        .saturating_add_in_place(ExecuteTimingType::TotalBatchesLen, batches.len() as u64);
    cumulative_execute_timings.saturating_add_in_place(ExecuteTimingType::NumExecuteBatches, 1);
    saturating_add_assign!(
        confirmation_timing.execute_batches_us,
        execute_batches_elapsed.as_us()
    );

    let mut current_max_thread_execution_time: Option<ThreadExecuteTimings> = None;
    for (_, thread_execution_time) in execution_timings_per_thread
        .into_inner()
        .unwrap()
        .into_iter()
    {
        let ThreadExecuteTimings {
            total_thread_us,
            execute_timings,
            ..
        } = &thread_execution_time;
        cumulative_execute_timings.accumulate(execute_timings);
        if *total_thread_us
            > current_max_thread_execution_time
                .as_ref()
                .map(|thread_execution_time| thread_execution_time.total_thread_us)
                .unwrap_or(0)
        {
            current_max_thread_execution_time = Some(thread_execution_time);
        }
    }

    if let Some(current_max_thread_execution_time) = current_max_thread_execution_time {
        end_to_end_execute_timings.accumulate(&current_max_thread_execution_time);
        end_to_end_execute_timings
            .execute_timings
            .saturating_add_in_place(ExecuteTimingType::NumExecuteBatches, 1);
    };
Not sure what our typical convention is, but it seems like over half of the code in execute_batches_internal now deals w/ collection/recording/processing of metrics. The collection part obviously needs to live here, but I'm wondering if the processing/recording piece could be shoved into its own function (either called from here or one level up in execute_batches). Maybe create the hashmap up there and pass a reference to be filled out by the threads. I think we could even avoid passing down confirmation_timing.
Makes sense, I refactored the code such that execute_batches_internal now returns all the metrics it collects in a struct ExecuteBatchesInternalMetrics, which is then aggregated in execute_batches() via a separate function process_execute_batches_internal_metrics(), as you suggested: ef9608d
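For readers following along, the shape of that refactor looks roughly like the sketch below. The type and function names come from the comment above; the fields and bodies are simplified stand-ins, not the actual ef9608d diff.

```rust
use std::collections::HashMap;

// Simplified stand-ins for the real Solana types, which carry many more fields.
#[derive(Default)]
struct ThreadExecuteTimings {
    total_thread_us: u64,
}

#[derive(Default)]
struct ConfirmationTiming {
    execute_batches_us: u64,
    cumulative_execute_us: u64,
    end_to_end_execute_us: u64,
}

// What execute_batches_internal hands back instead of mutating
// ConfirmationTiming directly (field names here are assumptions).
struct ExecuteBatchesInternalMetrics {
    execution_timings_per_thread: HashMap<usize, ThreadExecuteTimings>,
    execute_batches_us: u64,
}

// The separate aggregation step, called one level up in execute_batches().
fn process_execute_batches_internal_metrics(
    metrics: ExecuteBatchesInternalMetrics,
    timing: &mut ConfirmationTiming,
) {
    let mut longest_us: u64 = 0;
    for t in metrics.execution_timings_per_thread.values() {
        // Sum across all threads (the existing cumulative view)...
        timing.cumulative_execute_us =
            timing.cumulative_execute_us.saturating_add(t.total_thread_us);
        // ...and track the busiest thread for the end-to-end view.
        longest_us = longest_us.max(t.total_thread_us);
    }
    timing.end_to_end_execute_us = timing.end_to_end_execute_us.saturating_add(longest_us);
    timing.execute_batches_us = timing.execute_batches_us.saturating_add(metrics.execute_batches_us);
}
```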
LGTM - Left a comment about separating thread execution & metrics collection from the recording/processing of the metrics, but this is more cosmetic. Functionally, it looks good and will be a helpful addition!
Force-pushed from e631729 to 4b597c2.
can you add some comments to explain how these macros work?
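It isn't clear from this capture exactly which macros the comment refers to, but the saturating_add_assign! used in the diff above works roughly like the following sketch (not necessarily the exact in-tree definition):

```rust
// Sketch of a saturating add-assign macro: it expands to
// `$i = $i.saturating_add($v)`, so counters clamp at the type's maximum
// instead of wrapping on overflow.
macro_rules! saturating_add_assign {
    ($i:expr, $v:expr) => {{
        $i = $i.saturating_add($v)
    }};
}

fn main() {
    let mut execute_batches_us: u64 = u64::MAX - 10;
    saturating_add_assign!(execute_batches_us, 100); // clamps at u64::MAX
    assert_eq!(execute_batches_us, u64::MAX);
}
```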
ledger/src/blockstore_processor.rs (Outdated)
    for timing in new_timings {
        timings.accumulate(&timing);
    }
    let execution_timings_per_thread: RwLock<HashMap<usize, ThreadExecuteTimings>> =
Why HashMap? Could use a Vec with the same size as the thread pool; the map's metadata is the only thing that makes us need a lock.
Done: d8552c5
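For context, the Vec-based layout suggested above could look something like this sketch (illustrative only, not the actual d8552c5 change): one pre-sized slot per rayon worker, indexed by current_thread_index(), so threads never contend on shared map metadata.

```rust
use std::sync::Mutex;

#[derive(Default)]
struct ThreadExecuteTimings {
    total_thread_us: u64, // simplified stand-in for the real struct
}

// One slot per rayon worker; each slot has its own lock, so there is no
// shared map metadata to protect.
fn per_thread_slots(num_threads: usize) -> Vec<Mutex<ThreadExecuteTimings>> {
    (0..num_threads).map(|_| Mutex::default()).collect()
}

// Each worker only ever touches slots[current_thread_index()].
fn record(slots: &[Mutex<ThreadExecuteTimings>], thread_index: usize, batch_us: u64) {
    let mut slot = slots[thread_index].lock().unwrap();
    slot.total_thread_us = slot.total_thread_us.saturating_add(batch_us);
}
```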
(cherry picked from commit ce39c14)

# Conflicts:
#	Cargo.lock
#	core/src/progress_map.rs
#	ledger/src/blockstore_processor.rs
#	program-runtime/Cargo.toml
#	programs/bpf/Cargo.lock
Problem

Currently, replay-slot-stats returns the sum of all execution metrics across all replay threads, which is not that useful for debugging.

Summary of Changes

On each iteration of replay, take the longest-running thread and accumulate its execution metrics into a new end-to-end ThreadExecuteTimings. Report metrics for that final ThreadExecuteTimings as the end-to-end metrics for the slot.

Fixes #