Make FloatToFloat16 conversion 75x faster using SVE2 instructions #3626
Conversation
This pull request was exported from Phabricator. Differential Revision: D68520774
Make FloatToFloat16 conversion 75x faster using SVE2 instructions (pytorch#3626)

Summary: Pull Request resolved: pytorch#3626. X-link: facebookresearch/FBGEMM#703

Rounding was previously (1) not vectorized and (2) [implemented in software](https://fburl.com/code/fa1jzpmo), so speeds were less than 1 byte per cycle. That's really slow. With SVE2 instructions, it's 75x faster (see test plan for measurement). That's due to a combination of vectorization and hardware support for rounding.

This is a 0.25% CPU win for AdFinder on Grace: {F1974759899}

Differential Revision: D68520774
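The speedup comes from replacing a scalar, software round-to-nearest-even path with SVE2's hardware float-to-fp16 conversion. As a rough illustration of what a software path has to do per element (a generic sketch, not FBGEMM's actual code), a bit-level fp32-to-fp16 conversion with round-to-nearest-even looks like this:

```cpp
#include <cstdint>
#include <cstring>

// Generic scalar fp32 -> IEEE fp16 conversion with round-to-nearest-even.
// Illustrative sketch only; FBGEMM's real kernel differs. On SVE2 hardware,
// a single FCVT instruction applies this rounding to a whole vector at once.
uint16_t FloatToHalfRNE(float f) {
  uint32_t x;
  std::memcpy(&x, &f, sizeof(x));       // bit-cast without UB
  uint32_t sign = (x >> 16) & 0x8000u;  // sign bit, already in fp16 position
  uint32_t absx = x & 0x7FFFFFFFu;

  if (absx >= 0x7F800000u)              // inf or NaN
    return static_cast<uint16_t>(
        sign | (absx > 0x7F800000u ? 0x7E00u : 0x7C00u));

  if (absx >= 0x38800000u) {            // normal fp16 range (>= 2^-14)
    // Add 0xFFF plus the LSB of the kept mantissa (ties-to-even); a carry
    // out of the mantissa bumps the exponent automatically.
    uint32_t rounded = absx + 0xFFFu + ((absx >> 13) & 1u);
    if (rounded >= 0x47800000u)         // rounded past fp16 max -> inf
      return static_cast<uint16_t>(sign | 0x7C00u);
    return static_cast<uint16_t>(sign | ((rounded - 0x38000000u) >> 13));
  }

  if ((absx >> 23) < 101u)              // too small for fp16: rounds to +/-0
    return static_cast<uint16_t>(sign);

  // Subnormal fp16: shift the 24-bit significand right, rounding to nearest even.
  uint32_t shift = 126u - (absx >> 23);
  uint32_t mant = (absx & 0x7FFFFFu) | 0x800000u;
  uint32_t bias = (1u << (shift - 1)) - 1u + ((mant >> shift) & 1u);
  return static_cast<uint16_t>(sign | ((mant + bias) >> shift));
}
```

Doing this per element, with branches, is why a scalar software path runs below a byte per cycle; the vectorized hardware convert produces the same round-to-nearest-even results for many elements per instruction.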
Reviewed By: q10
This pull request has been merged in c5e1cde.