
float8 profiling script: filter out microbenchmarking overhead #629

Merged: 1 commit merged into main on Aug 8, 2024

Conversation

@vkuzo (Contributor) commented Aug 7, 2024

Summary:

Our microbenchmarks have a lot of overhead. This PR attempts to get a
cleaner measurement of only the kernels in the fwd+bwd pass, subtracting
the kernels unrelated to the fwd+bwd code. This makes the kernel summary
tables more reflective of GPU-bound, real-world use cases.
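
As a rough illustration of the idea (not the PR's actual implementation), the filtering could look like the sketch below, assuming the profiler rows are collected into a pandas DataFrame and using a hypothetical is_fwd_bwd marker column:

```python
import pandas as pd

def filter_fwd_bwd(df: pd.DataFrame) -> pd.DataFrame:
    # Keep only kernels that belong to the fwd+bwd region so that
    # microbenchmark harness overhead does not skew the summary.
    # `time_ms` and `pct_gpu_time` follow the table below; the
    # `is_fwd_bwd` marker column is an assumption for this sketch.
    kept = df[df["is_fwd_bwd"]].copy()
    kept["pct_gpu_time"] = kept["time_ms"] / kept["time_ms"].sum()
    return kept
```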

Test Plan:

profiling ln -> linear:

```
python benchmarks/float8/profile_linear_float8.py --dtype_filter both ~/local/tmp --model_type ln_linear
```

New output; note that only kernels relevant to ln and linear are displayed:

```
Summary of GPU time by CPU kernel
                                                                                                                                                                                                                                   
    experiment                                                                                               kernel       category  time_ms  pct_gpu_time bw_gpbs                                                                  
1       0_ref                                                                                             aten::mm         0_gemm   10.153         0.945    None                                                                   
2       0_ref                                      triton_red_fused_native_layer_norm_native_layer_norm_backward_0        2_other    0.350         0.033    None                                                                   
0       0_ref                                                                 triton_red_fused_native_layer_norm_0        2_other    0.241         0.022    None                                                                   
12   1_float8                                                                                     aten::_scaled_mm         0_gemm    5.182         0.736    None                                                                   
16   1_float8  triton_red_fused__scaled_mm__to_copy_clamp_clone_mul_native_layer_norm_native_layer_norm_backwar...  1_f8_overhead    0.813         0.115    None                                                                   
15   1_float8                               triton_poi_fused__scaled_mm__to_copy_clamp_clone_mul_reciprocal_view_2  1_f8_overhead    0.302         0.043    None                                                                   
5    1_float8                                                         triton_red_fused_abs_max_native_layer_norm_0  1_f8_overhead    0.212         0.030    None                                                                   
10   1_float8                              triton_poi_fused__scaled_mm__to_copy_clamp_mul_native_layer_norm_view_5  1_f8_overhead    0.177         0.025    None                                                                   
11   1_float8                        triton_poi_fused__scaled_mm__to_copy_clamp_clone_mul_native_layer_norm_view_6  1_f8_overhead    0.150         0.021    None                                                                   
13   1_float8                                                                           triton_red_fused_abs_max_0  1_f8_overhead    0.126         0.018    None                                                                   
7    1_float8                                                                           triton_red_fused_abs_max_2  1_f8_overhead    0.060         0.008    None                                                                   
3    1_float8                                                                     triton_per_fused_copy_max_roll_0  1_f8_overhead    0.005         0.001    None                                                                   
6    1_float8                           triton_red_fused__to_copy_abs_clamp_max_mul_native_layer_norm_reciprocal_1  1_f8_overhead    0.004         0.001    None                                                                   
4    1_float8                                                                     triton_per_fused_copy_max_roll_1  1_f8_overhead    0.003         0.000    None                                                                   
14   1_float8                       triton_per_fused__scaled_mm__to_copy_abs_clamp_clone_max_mul_reciprocal_view_1  1_f8_overhead    0.003         0.000    None                                                                   
8    1_float8                                                                      triton_per_fused_abs_fill_max_3  1_f8_overhead    0.003         0.000    None                                                                   
9    1_float8                                                                        triton_poi_fused_reciprocal_4        2_other    0.002         0.000    None                                                                   
                                                                                                                                                                                                                                   
Float8 amax/scale sync approx ratio of total time: 0.006                                                                                                                                                                           
                                                                                                                                                                                                                                   
Summary of time (ms) by kernel category                                                                                                                                                                                            
                                                                                                                                                                                                                                   
 experiment     0_ref  1_float8  f8_div_ref  ref_div_f8                                                                                                                                                                            
category                                                                                                                                                                                                                           
0_gemm        10.153     5.182       0.510       1.959                                                                                                                                                                             
1_f8_overhead  0.000     1.858         inf       0.000                                                                                                                                                                             
2_other        0.591     0.002       0.004     264.393                                                                                                                                                                             
All           10.743     7.042       0.655       1.526
```
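
Here f8_div_ref and ref_div_f8 are the float8 time divided by the reference time and vice versa; e.g., for the All row, 7.042 / 10.743 ≈ 0.655.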

Reviewers:

Subscribers:

Tasks:

Tags:

[ghstack-poisoned]
@vkuzo (Contributor, Author) commented Aug 7, 2024

Stack from ghstack (oldest at bottom):

pytorch-bot bot commented Aug 7, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/629

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 714b5c0 with merge base d582f9a:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

vkuzo added a commit that referenced this pull request Aug 7, 2024
Summary:

Our microbenchmarks have a lot of overhead. This PR attempts to get a
cleaner measurement of only the kernels in the fwd+bwd, and subtracts
the kernels unrelated to fwd+bwd code. This makes the kernel summary
tables more reflective of GPU bound real use cases.

Test Plan:

profiling ln -> linear:

```
python benchmarks/float8/profile_linear_float8.py --dtype_filter both ~/local/tmp --model_type ln_linear
```

New output; note that only kernels relevant to ln and linear are displayed:

```
Summary of time (ms) by kernel category

 experiment     0_ref  1_float8  f8_div_ref  ref_div_f8
 category
 0_gemm        10.045     5.194       0.517       1.934
 1_f8_overhead  0.000     1.778         inf       0.000
 2_other        0.592     0.073       0.124       8.066
 All           10.637     7.045       0.662       1.510
```

Reviewers:

Subscribers:

Tasks:

Tags:

ghstack-source-id: 305b51588ef241846b5c9dded1c699c8d648aa9a
ghstack-comment-id: 2274007593
Pull Request resolved: #629
@facebook-github-bot added the CLA Signed label on Aug 7, 2024. (This label is managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed.)

Inline review comment from @vkuzo (Contributor, Author) on the following snippet:

```
# print the redirected stdout back to regular stdout
print(f.getvalue())
finally:
```

This is needed so stdout is still printed if the code inside the try statement hits an exception, which is useful for debugging.
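
A minimal, self-contained sketch of this redirect-and-print-in-finally pattern (run_profile is a hypothetical stand-in for the profiling body, not the PR's actual code):

```python
import contextlib
import io

def run_profile() -> None:
    # hypothetical stand-in for the benchmark/profiling body
    print("profiler output that would normally go straight to stdout")

f = io.StringIO()
try:
    # capture everything the profiled code writes to stdout
    with contextlib.redirect_stdout(f):
        run_profile()
finally:
    # print the redirected stdout back to regular stdout, even if the
    # code inside the try block raised an exception
    print(f.getvalue())
```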

@vkuzo requested review from drisspg and y-sq on August 7, 2024 at 17:50
@vkuzo merged commit 34b24f7 into main on Aug 8, 2024
13 checks passed
yanbing-j pushed a commit to yanbing-j/ao that referenced this pull request Dec 9, 2024
Co-authored-by: Kimish Patel <[email protected]>
yanbing-j pushed a commit to yanbing-j/ao that referenced this pull request Dec 9, 2024
* executable README

* fix title of CI workflow

* markup commands in markdown

* extend the markup-markdown language

* Automatically identify cuda from nvidia-smi in install-requirements (pytorch#606)

* Automatically identify cuda from nvidia-smi in install-requirements

* Update README.md

---------

Co-authored-by: Michael Gschwind <[email protected]>

* Unbreak zero-temperature sampling (pytorch#599)

Fixes pytorch#581.

* Improve process README

* [retake] Add sentencepiece tokenizer (pytorch#626)

* Add sentencepiece tokenizer

* Add white space

* Handle white space:

* Handle control ids

* More cleanup

* Lint

* Use unique_ptr

* Use a larger runner

* Debug

* Debug

* Cleanup

* Update install_utils.sh to use python3 instead of python (pytorch#636)

As titled. On some devices `python` and `python3` are pointing to different environments so good to unify them.

* Fix quantization doc to specify dtype limitation on a8w4dq (pytorch#629)


Co-authored-by: Kimish Patel <[email protected]>

* add desktop.json (pytorch#622)

* add desktop.json

* add fast

* remove embedding

* improvements

* update readme from doc branch

* tab/spc

* fix errors in updown language

* fix errors in updown language, and [skip]: begin/end

* fix errors in updown language, and [skip]: begin/end

* a storied run

* stories run on readme instructions does not need HF token

* increase timeout

* check for hang un hf_login

* executable README improvements

* typo

* typo

---------

Co-authored-by: Ian Barber <[email protected]>
Co-authored-by: Scott Wolchok <[email protected]>
Co-authored-by: Mengwei Liu <[email protected]>
Co-authored-by: Kimish Patel <[email protected]>
Co-authored-by: Scott Roy <[email protected]>
Labels: CLA Signed (this label is managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed)
Projects: None yet
3 participants