float8 profiling script: filter out microbenchmarking overhead #629
Conversation
Stack from ghstack (oldest at bottom):
🔗 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/629

✅ No failures as of commit 714b5c0 with merge base d582f9a.
Summary:

Our microbenchmarks have a lot of overhead. This PR attempts to get a cleaner measurement of only the kernels in the fwd+bwd, and subtracts the kernels unrelated to fwd+bwd code. This makes the kernel summary tables more reflective of GPU-bound real use cases.

Test Plan:

Profiling ln -> linear:

```
python benchmarks/float8/profile_linear_float8.py --dtype_filter both ~/local/tmp --model_type ln_linear
```

New output; note that only kernels relevant to ln and linear are displayed:

```
Summary of time (ms) by kernel category

experiment      0_ref  1_float8  f8_div_ref  ref_div_f8
category
0_gemm         10.045     5.194       0.517       1.934
1_f8_overhead   0.000     1.778         inf       0.000
2_other         0.592     0.073       0.124       8.066
All            10.637     7.045       0.662       1.510
```

ghstack-source-id: 305b51588ef241846b5c9dded1c699c8d648aa9a
ghstack-comment-id: 2274007593
Pull Request resolved: #629
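The bucketing behind the table above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the kernel-name patterns and input dicts are assumptions made up for the example; the real script derives kernel times from profiler traces.

```python
# Sketch: bucket per-kernel GPU times (ms) into the categories shown in the
# summary table, then compare a reference run against a float8 run.
# The substring rules below are illustrative assumptions, not the PR's logic.

def categorize(kernel_name: str) -> str:
    if "gemm" in kernel_name or "matmul" in kernel_name:
        return "0_gemm"
    if "float8" in kernel_name or "amax" in kernel_name:
        return "1_f8_overhead"
    return "2_other"

def summarize(kernels: dict[str, float]) -> dict[str, float]:
    """Sum time (ms) per category; kernels unrelated to fwd+bwd should
    already have been filtered out of `kernels` before this step."""
    totals = {"0_gemm": 0.0, "1_f8_overhead": 0.0, "2_other": 0.0}
    for name, ms in kernels.items():
        totals[categorize(name)] += ms
    return totals

# Hypothetical kernel timings for a reference and a float8 run
ref = summarize({"ampere_gemm": 10.045, "elementwise": 0.592})
f8 = summarize({"ampere_gemm": 5.194, "float8_cast": 1.778, "elementwise": 0.073})

for cat in ref:
    r, f = ref[cat], f8[cat]
    ratio = r / f if f else float("inf")  # ref_div_f8 column
    print(f"{cat}: ref={r:.3f} f8={f:.3f} ref_div_f8={ratio:.3f}")
```

The `ref_div_f8` ratio per category is what makes the regression or speedup attributable to a specific kind of kernel rather than to benchmark overhead.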
```python
    # print the redirected stdout back to regular stdout
    print(f.getvalue())
finally:
```
This is needed so stdout is still printed if the code inside the `try` statement hits an exception, which is useful for debugging.
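The pattern being discussed can be sketched like this (a minimal standalone example; the variable names and surrounding structure in the actual script may differ):

```python
import io
from contextlib import redirect_stdout

f = io.StringIO()
try:
    # everything printed inside this block is captured into `f`
    with redirect_stdout(f):
        print("profiling output")
        # if an exception were raised here, the finally block below
        # would still run
finally:
    # echo the captured text back to the real stdout, so the output
    # is visible even when the try body raises
    print(f.getvalue(), end="")
```

Putting the echo in `finally` rather than after the `with` block is the key point: on an exception, the partial output collected so far is still flushed to the terminal before the traceback.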