Make FloatToFloat16 conversion 75x faster using SVE2 instructions #3626

Closed
embg wants to merge 1 commit from the export-D68520774 branch

@embg embg commented Jan 28, 2025

Summary:
X-link: https://github.com/facebookresearch/FBGEMM/pull/703

Rounding was previously (1) not vectorized and (2) implemented in software, so speeds were less than 1 byte per cycle. That's really slow.

With SVE2 instructions, the conversion is 75x faster (see the test plan for the measurement). The speedup comes from a combination of vectorization and hardware support for rounding.
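
To illustrate what "implemented in software" means here, the following is a minimal scalar sketch of a float-to-half conversion with round-to-nearest-even, built from integer bit manipulation. It is not FBGEMM's code: the function name `float_to_half_rne` and the choice to map every NaN to the quiet NaN `0x7E00` are assumptions made for illustration, and the subnormal path assumes the default floating-point rounding mode. Each call handles one element through several branches and arithmetic steps, which is the kind of per-element work that a vectorized hardware convert instruction can replace.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper (not FBGEMM's implementation): converts one float to
// IEEE 754 binary16 bits with round-to-nearest-even, one element at a time.
inline std::uint16_t float_to_half_rne(float value) {
  std::uint32_t bits;
  std::memcpy(&bits, &value, sizeof(bits));

  const std::uint32_t sign = bits & 0x80000000u;
  bits ^= sign;  // keep only the magnitude

  std::uint16_t half;
  if (bits >= (143u << 23)) {
    // Magnitude >= 2^16, infinity, or NaN: result is Inf or a quiet NaN.
    half = static_cast<std::uint16_t>((bits > (255u << 23)) ? 0x7E00u : 0x7C00u);
  } else if (bits < (113u << 23)) {
    // Magnitude < 2^-14: result is a half subnormal or zero.
    // Adding 0.5f lines the 10 result bits up with the bottom of the float
    // mantissa; the float addition itself performs the rounding
    // (assumes the default round-to-nearest-even mode).
    const std::uint32_t magic_bits = 126u << 23;  // bit pattern of 0.5f
    float magnitude;
    float magic;
    std::memcpy(&magnitude, &bits, sizeof(magnitude));
    std::memcpy(&magic, &magic_bits, sizeof(magic));
    magnitude += magic;
    std::memcpy(&bits, &magnitude, sizeof(bits));
    half = static_cast<std::uint16_t>(bits - magic_bits);
  } else {
    // Normal range: rebias the exponent, then round ties to even.
    const std::uint32_t mantissa_odd = (bits >> 13) & 1u;
    bits -= 112u << 23;             // exponent bias 127 -> 15
    bits += 0xFFFu + mantissa_odd;  // round to nearest, ties to even
    half = static_cast<std::uint16_t>(bits >> 13);
  }
  return static_cast<std::uint16_t>(half | (sign >> 16));
}
```

For example, `float_to_half_rne(1.0f)` returns `0x3C00`, and `float_to_half_rne(65504.0f)` returns `0x7BFF`, the largest finite half-precision value.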

Differential Revision: D68520774

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D68520774

netlify bot commented Jan 28, 2025

Deploy Preview for pytorch-fbgemm-docs ready!

Name Link
🔨 Latest commit 38dbb37
🔍 Latest deploy log https://app.netlify.com/sites/pytorch-fbgemm-docs/deploys/6799916492ca850008f3cf9e
😎 Deploy Preview https://deploy-preview-3626--pytorch-fbgemm-docs.netlify.app

embg pushed a commit to embg/FBGEMM that referenced this pull request Jan 28, 2025
…torch#3626)

Summary:
Pull Request resolved: pytorch#3626

X-link: facebookresearch/FBGEMM#703

Rounding was previously (1) not vectorized and (2) [implemented in software](https://fburl.com/code/fa1jzpmo), so speeds were less than 1 byte per cycle. That's really slow.

With SVE2 instructions, it's 75x faster (see test plan for measurement). That's due to a combination of vectorization + hardware support for rounding.

This is a 0.25% CPU win for AdFinder on Grace:
 {F1974759899}

Differential Revision: D68520774
@embg embg force-pushed the export-D68520774 branch from 59a241d to 85aeb87 Compare January 28, 2025 19:48
@embg embg force-pushed the export-D68520774 branch from 85aeb87 to 49f9fca Compare January 28, 2025 19:56
@embg embg force-pushed the export-D68520774 branch from 49f9fca to 641aa26 Compare January 28, 2025 20:36
@embg embg force-pushed the export-D68520774 branch from 641aa26 to 0daa1c8 Compare January 28, 2025 20:42
@embg embg force-pushed the export-D68520774 branch from 0daa1c8 to bc6aaff Compare January 28, 2025 20:54
@embg embg force-pushed the export-D68520774 branch from bc6aaff to 8fb04c2 Compare January 28, 2025 21:01
@embg embg force-pushed the export-D68520774 branch from 8fb04c2 to dc5d7d6 Compare January 28, 2025 21:29
@embg embg force-pushed the export-D68520774 branch from dc5d7d6 to a34a91d Compare January 28, 2025 21:44
@embg embg force-pushed the export-D68520774 branch from a34a91d to e460390 Compare January 28, 2025 21:56
@embg embg force-pushed the export-D68520774 branch from e460390 to 0815817 Compare January 28, 2025 22:04
…torch#3626)

Summary:
Pull Request resolved: pytorch#3626

X-link: facebookresearch/FBGEMM#703

Rounding was previously (1) not vectorized and (2) [implemented in software](https://fburl.com/code/fa1jzpmo), so speeds were less than 1 byte per cycle. That's really slow.

With SVE2 instructions, it's 75x faster (see test plan for measurement). That's due to a combination of vectorization + hardware support for rounding.

Reviewed By: q10

Differential Revision: D68520774

@facebook-github-bot
Contributor

This pull request has been merged in c5e1cde.
