Make FloatToFloat16 conversion 75x faster using SVE2 instructions #3626
Conversation
This pull request was exported from Phabricator. Differential Revision: D68520774
Make FloatToFloat16 conversion 75x faster using SVE2 instructions (pytorch#3626)

Summary: Pull Request resolved: pytorch#3626. X-link: facebookresearch/FBGEMM#703

Rounding was previously (1) not vectorized and (2) [implemented in software](https://fburl.com/code/fa1jzpmo), so speeds were less than 1 byte per cycle. That's really slow. With SVE2 instructions, it's 75x faster (see test plan for measurement). That's due to a combination of vectorization and hardware support for rounding.

This is a 0.25% CPU win for AdFinder on Grace: {F1974759899}

Differential Revision: D68520774
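The speedup comes from replacing a scalar, software round-to-nearest-even path with SVE2's hardware float-to-fp16 conversion. As a rough illustration of what a software path has to do per element (a generic sketch, not FBGEMM's actual code), a bit-level fp32-to-fp16 conversion with round-to-nearest-even looks like this:

```cpp
#include <cstdint>
#include <cstring>

// Generic scalar fp32 -> IEEE fp16 conversion with round-to-nearest-even.
// Illustrative sketch only; FBGEMM's real kernel differs. On SVE2 hardware,
// a single FCVT instruction applies this rounding to a whole vector at once.
uint16_t FloatToHalfRNE(float f) {
  uint32_t x;
  std::memcpy(&x, &f, sizeof(x));       // bit-cast without UB
  uint32_t sign = (x >> 16) & 0x8000u;  // sign bit, already in fp16 position
  uint32_t absx = x & 0x7FFFFFFFu;

  if (absx >= 0x7F800000u)              // inf or NaN
    return static_cast<uint16_t>(
        sign | (absx > 0x7F800000u ? 0x7E00u : 0x7C00u));

  if (absx >= 0x38800000u) {            // normal fp16 range (>= 2^-14)
    // Add 0xFFF plus the LSB of the kept mantissa (ties-to-even); a carry
    // out of the mantissa bumps the exponent automatically.
    uint32_t rounded = absx + 0xFFFu + ((absx >> 13) & 1u);
    if (rounded >= 0x47800000u)         // rounded past fp16 max -> inf
      return static_cast<uint16_t>(sign | 0x7C00u);
    return static_cast<uint16_t>(sign | ((rounded - 0x38000000u) >> 13));
  }

  if ((absx >> 23) < 101u)              // too small for fp16: rounds to +/-0
    return static_cast<uint16_t>(sign);

  // Subnormal fp16: shift the 24-bit significand right, rounding to nearest even.
  uint32_t shift = 126u - (absx >> 23);
  uint32_t mant = (absx & 0x7FFFFFu) | 0x800000u;
  uint32_t bias = (1u << (shift - 1)) - 1u + ((mant >> shift) & 1u);
  return static_cast<uint16_t>(sign | ((mant + bias) >> shift));
}
```

Doing this per element, with branches, is why a scalar software path runs below a byte per cycle; the vectorized hardware convert produces the same round-to-nearest-even results for many elements per instruction.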
Reviewed By: q10
This pull request has been merged in c5e1cde.