Conversation
Force-pushed from 4d8b214 to aa10a4a.
@t-nelson should be close to what we discussed
"execute_batch", | ||
); | ||
|
||
let thread_index = PAR_THREAD_POOL.current_thread_index().unwrap(); |
How will you know if it's rayon just scheduling a bunch of small jobs on one thread vs. a single (or multiple) large batch(es) which was executed on one thread? Also, temporally, did that create the longest timings, or maybe that thread started sooner than another thread and actually caused the batch to take longer? And the rayon scheduling could vary across machines, no?
> if it's rayon just scheduled a bunch of small jobs on one thread vs. a single (or multiple) large batch(es) which was executed on one thread
Yeah, we won't be able to distinguish this; we'll just take the longest-running thread and hope that captures the bottleneck. Right now we're only handing the thread pool 2-3 batches at a time, so hopefully the longest thread does reflect the actual end-to-end timings.
> Also temporally if that created the longest timings or maybe that thread started sooner than another thread and actually caused the batch to take longer. And the rayon scheduling could vary across machines, no?
I was thinking that if the longest thread's total execution time doesn't add up to the total replay time, then we know there's some rayon shenanigans going on.
I'm wondering if recording either the start or end time for threads, and snapshotting it for the longest-running thread, could help us understand cases where the longest thread time is shorter than the total execution time. It could shed some light on whether that thread got started late, for example.
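A minimal sketch of that idea, using hypothetical names rather than code from this PR: record each worker's first-start Instant alongside its accumulated busy time, keyed by current_thread_index(), so a longest-thread total that falls well short of wall-clock replay time can be attributed to a late start rather than to how rayon split the work.

```rust
use std::collections::HashMap;
use std::sync::Mutex;
use std::time::Instant;

// Hypothetical per-thread record: when this worker began its first batch for
// the slot, plus its accumulated busy time in microseconds.
struct ThreadTiming {
    first_start: Instant,
    total_thread_us: u64,
}

// Called by each rayon worker after it finishes a batch.
fn record_batch(
    timings: &Mutex<HashMap<usize, ThreadTiming>>,
    thread_index: usize,
    batch_start: Instant,
    batch_us: u64,
) {
    let mut timings = timings.lock().unwrap();
    let entry = timings.entry(thread_index).or_insert_with(|| ThreadTiming {
        first_start: batch_start,
        total_thread_us: 0,
    });
    entry.total_thread_us = entry.total_thread_us.saturating_add(batch_us);
}

// After the pool drains: if the busiest thread's total is well below the
// wall-clock replay time, its start offset from `replay_start` shows whether
// it simply got scheduled late.
fn report_longest(
    timings: &Mutex<HashMap<usize, ThreadTiming>>,
    replay_start: Instant,
    replay_us: u64,
) {
    let timings = timings.lock().unwrap();
    if let Some((idx, t)) = timings.iter().max_by_key(|(_, t)| t.total_thread_us) {
        let start_delay_us = t.first_start.saturating_duration_since(replay_start).as_micros();
        println!(
            "thread {}: busy {}us of {}us replay, started {}us in",
            idx, t.total_thread_us, replay_us, start_delay_us
        );
    }
}
```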
Example of this metric on a 5k TPS GCE cluster. [The referenced metric names/screenshots did not survive extraction.] These two match up reasonably well; the rest of the gap with the […]
}

impl Default for ConfirmationTiming {
    fn default() -> Self {
        Self {
            started: Instant::now(),
            replay_elapsed: 0,
            execute_batches_us: 0,
            poh_verify_elapsed: 0,
            transaction_verify_elapsed: 0,
            fetch_elapsed: 0,
            fetch_fail_elapsed: 0,
            execute_timings: ExecuteTimings::default(),
should we rename this cumulative_execute_timings?
ledger/src/blockstore_processor.rs (Outdated)
    cumulative_execute_timings
        .saturating_add_in_place(ExecuteTimingType::TotalBatchesLen, batches.len() as u64);
    cumulative_execute_timings.saturating_add_in_place(ExecuteTimingType::NumExecuteBatches, 1);
    saturating_add_assign!(
        confirmation_timing.execute_batches_us,
        execute_batches_elapsed.as_us()
    );

    let mut current_max_thread_execution_time: Option<ThreadExecuteTimings> = None;
    for (_, thread_execution_time) in execution_timings_per_thread
        .into_inner()
        .unwrap()
        .into_iter()
    {
        let ThreadExecuteTimings {
            total_thread_us,
            execute_timings,
            ..
        } = &thread_execution_time;
        cumulative_execute_timings.accumulate(execute_timings);
        if *total_thread_us
            > current_max_thread_execution_time
                .as_ref()
                .map(|thread_execution_time| thread_execution_time.total_thread_us)
                .unwrap_or(0)
        {
            current_max_thread_execution_time = Some(thread_execution_time);
        }
    }

    if let Some(current_max_thread_execution_time) = current_max_thread_execution_time {
        end_to_end_execute_timings.accumulate(&current_max_thread_execution_time);
        end_to_end_execute_timings
            .execute_timings
            .saturating_add_in_place(ExecuteTimingType::NumExecuteBatches, 1);
    };
Not sure what our typical convention is, but it seems like over half of the code in execute_batches_internal now deals w/ collection/recording/processing of metrics. The collection part obviously needs to live here, but I'm wondering if the processing/recording piece could be shoved into its own function (either called from here or one level up in execute_batches). Maybe create the hashmap up there and pass a reference to be filled out by the threads. I think we could even avoid passing down confirmation_timing.
Makes sense, I refactored the code such that execute_batches_internal now returns all the metrics it collects in a struct ExecuteBatchesInternalMetrics, which is then aggregated in execute_batches() via a separate function process_execute_batches_internal_metrics(), as you suggested: ef9608d
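For readers following along, the shape of that refactor looks roughly like the sketch below. The type and function names come from the comment above; the fields and bodies are simplified stand-ins, not the actual ef9608d diff.

```rust
use std::collections::HashMap;

// Simplified stand-ins for the real Solana types, which carry many more fields.
#[derive(Default)]
struct ThreadExecuteTimings {
    total_thread_us: u64,
}

#[derive(Default)]
struct ConfirmationTiming {
    execute_batches_us: u64,
    cumulative_execute_us: u64,
    end_to_end_execute_us: u64,
}

// What execute_batches_internal hands back instead of mutating
// ConfirmationTiming directly (field names here are assumptions).
struct ExecuteBatchesInternalMetrics {
    execution_timings_per_thread: HashMap<usize, ThreadExecuteTimings>,
    execute_batches_us: u64,
}

// The separate aggregation step, called one level up in execute_batches().
fn process_execute_batches_internal_metrics(
    metrics: ExecuteBatchesInternalMetrics,
    timing: &mut ConfirmationTiming,
) {
    let mut longest_us: u64 = 0;
    for t in metrics.execution_timings_per_thread.values() {
        // Sum across all threads (the existing cumulative view)...
        timing.cumulative_execute_us =
            timing.cumulative_execute_us.saturating_add(t.total_thread_us);
        // ...and track the busiest thread for the end-to-end view.
        longest_us = longest_us.max(t.total_thread_us);
    }
    timing.end_to_end_execute_us = timing.end_to_end_execute_us.saturating_add(longest_us);
    timing.execute_batches_us = timing.execute_batches_us.saturating_add(metrics.execute_batches_us);
}
```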
LGTM - Left a comment about separating thread execution & metrics collection from the recording/processing of the metrics, but this is more cosmetic. Functionally, it looks good and will be a helpful addition!
Force-pushed from e631729 to 4b597c2.
can you add some comments to explain how these macros work?
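It isn't clear from this capture exactly which macros the comment refers to, but the saturating_add_assign! used in the diff above works roughly like the following sketch (not necessarily the exact in-tree definition):

```rust
// Sketch of a saturating add-assign macro: it expands to
// `$i = $i.saturating_add($v)`, so counters clamp at the type's maximum
// instead of wrapping on overflow.
macro_rules! saturating_add_assign {
    ($i:expr, $v:expr) => {{
        $i = $i.saturating_add($v)
    }};
}

fn main() {
    let mut execute_batches_us: u64 = u64::MAX - 10;
    saturating_add_assign!(execute_batches_us, 100); // clamps at u64::MAX
    assert_eq!(execute_batches_us, u64::MAX);
}
```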
ledger/src/blockstore_processor.rs (Outdated)
    for timing in new_timings {
        timings.accumulate(&timing);
    }
    let execution_timings_per_thread: RwLock<HashMap<usize, ThreadExecuteTimings>> =
Why HashMap? Could use a Vec with the same size as the thread pool; the map's metadata is the only thing that makes us need a lock.
Done: d8552c5
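For context, the Vec-based layout suggested above could look something like this sketch (illustrative only, not the actual d8552c5 change): one pre-sized slot per rayon worker, indexed by current_thread_index(), so threads never contend on shared map metadata.

```rust
use std::sync::Mutex;

#[derive(Default)]
struct ThreadExecuteTimings {
    total_thread_us: u64, // simplified stand-in for the real struct
}

// One slot per rayon worker; each slot has its own lock, so there is no
// shared map metadata to protect.
fn per_thread_slots(num_threads: usize) -> Vec<Mutex<ThreadExecuteTimings>> {
    (0..num_threads).map(|_| Mutex::default()).collect()
}

// Each worker only ever touches slots[current_thread_index()].
fn record(slots: &[Mutex<ThreadExecuteTimings>], thread_index: usize, batch_us: u64) {
    let mut slot = slots[thread_index].lock().unwrap();
    slot.total_thread_us = slot.total_thread_us.saturating_add(batch_us);
}
```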
(cherry picked from commit ce39c14)

# Conflicts:
#	Cargo.lock
#	core/src/progress_map.rs
#	ledger/src/blockstore_processor.rs
#	program-runtime/Cargo.toml
#	programs/bpf/Cargo.lock
Problem

Currently, replay-slot-stats returns the sum of all execution metrics across all replay threads, which is not that useful for debugging.

Summary of Changes

On each iteration of replay, take the longest-running thread and accumulate its execution metrics into a new end-to-end ThreadExecuteTimings. Report metrics for that final ThreadExecuteTimings as the end-to-end metrics for the slot.

Fixes #