Adding AVX512 path to Base64 encoding/Decoding #92241

Merged Oct 25, 2023 (5 commits)

DeepakRajendrakumaran
Contributor

Overview

This PR implements an AVX512 code path for Base64 encoding and decoding, based on the work of Wojciech Muła and Daniel Lemire. There is a fast AVX512VBMI path, and the fallback uses AVX512F/BW; I'll refer to these as VBMI_AVAILABLE and VBMI_UNAVAILABLE from here on. For performance purposes, both are compared against an AVX2 implementation, which will be referred to as BASE_VERSION.
Reference for the algorithm:

This version uses intrinsics directly rather than the generic vector libraries, because the JIT and vector libraries currently lack the support needed to produce optimal code. Implementing this with the generic vector library would require some additional support:

  1. Add ShuffleUnsafe for Vector512
  2. Extend Vector512.Shuffle() to lower to intrinsics in more cases instead of going to the fallback.
  3. Expand the Vector512 surface area to incorporate more high-level functions

Even the current implementation could be optimized further by adding a multishift()-based implementation.
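As a rough illustration of what one 512-bit encode iteration computes, here is a scalar Python model. It is a sketch under assumptions, not the PR's C# code: the names `permute_bytes` and `encode_block` are hypothetical, and the field extraction is done with plain shifts rather than the vector shuffles, shifts, and masks the real code paths use. It only shows the data flow: 48 input bytes become 64 six-bit indices, and one 64-entry byte lookup (the role a VBMI byte permute can play) maps them to ASCII.

```python
import base64

ALPHABET = b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

def permute_bytes(table, indices):
    """Scalar model of a 64-byte permute: out[i] = table[indices[i] & 63]."""
    return bytes(table[i & 63] for i in indices)

def encode_block(chunk):
    """Encode exactly 48 input bytes into 64 Base64 characters."""
    assert len(chunk) == 48
    indices = []
    for g in range(16):  # 16 groups of 3 bytes, one group per 32-bit lane
        b0, b1, b2 = chunk[3 * g : 3 * g + 3]
        word = (b0 << 16) | (b1 << 8) | b2  # 24 bits of payload
        indices += [(word >> 18) & 63, (word >> 12) & 63, (word >> 6) & 63, word & 63]
    # One table lookup turns all 64 six-bit indices into ASCII characters.
    return permute_bytes(ALPHABET, indices)

data = bytes(range(48))
encoded = encode_block(data)
```

Because 48 is a multiple of 3, a full block never needs `=` padding, so the result can be compared directly against `base64.b64encode(data)`.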

Generated code

(Focusing only on the actual encoding/decoding code within the loop.)

Encoding

VBMI_AVAILABLE

image

VBMI_UNAVAILABLE

image

BASE_VERSION
image

Decoding

VBMI_AVAILABLE

image

VBMI_UNAVAILABLE

image

BASE_VERSION
image

Performance

On ICX:

BASE_VERSION vs VBMI_UNAVAILABLE

image

BASE_VERSION vs VBMI_AVAILABLE
image

@dotnet-issue-labeler (bot) added the needs-area-label label on Sep 18, 2023
@ghost added the community-contribution label on Sep 18, 2023
@DeepakRajendrakumaran
Contributor Author

@BruceForstall @tannergooding @dotnet/avx512-contrib

@danmoseley
Member

Do we need a new entry in the third party notices file?

@EgorBo
Member

EgorBo commented Sep 19, 2023

> Do we need a new entry in the third party notices file?

We should already have them if I am not mistaken (unless avx512 uses a different article)

@DeepakRajendrakumaran
Contributor Author

> > Do we need a new entry in the third party notices file?
>
> We should already have them if I am not mistaken (unless avx512 uses a different article)

I'm not sure about this.

The original SSE and AVX implementations use a similar algorithm, and this is the reference provided: https://github.com/dotnet/runtime/blob/main/src/libraries/System.Private.CoreLib/src/System/Buffers/Text/Base64Encoder.cs#L13-L14

This implementation (the VBMI version) uses a modified version based on this (https://github.com/dotnet/runtime/pull/92241/files#diff-db463201901c2d83d2b563871ae11fafee9d5afe94e4d014b77212996b25f770R635): https://arxiv.org/pdf/1910.05109.pdf

There is an AVX512-based implementation in there (https://github.com/aklomp/base64/tree/master/lib/arch/avx512), but it has only the encoder, and it requires 'multishift', which we do not currently support.

@danmoseley
Member

We generally acknowledge significant reuse in the TPN file even if there's a link from the sources. I see "License notice for vectorized base64 encoding / decoding", but I'm not sure it points to that PDF (e.g. it doesn't include Lemire in the list).

But, whatever @EgorBo recommends.

@EgorBo
Member

EgorBo commented Sep 19, 2023

I am not an expert in THIRD-PARTY-NOTICES.TXT 🙂 but seems like the link to that article is worth adding here https://github.com/dotnet/runtime/blob/main/THIRD-PARTY-NOTICES.TXT#L345

> There is an AVX512 based implementation in there - https://github.com/aklomp/base64/tree/master/lib/arch/avx512.

Do we use any of the code from that repo? I see it has BSD-2 license

@DeepakRajendrakumaran
Contributor Author

> I am not an expert in THIRD-PARTY-NOTICES.TXT 🙂 but seems like the link to that article is worth adding here https://github.com/dotnet/runtime/blob/main/THIRD-PARTY-NOTICES.TXT#L345
>
> > There is an AVX512 based implementation in there - https://github.com/aklomp/base64/tree/master/lib/arch/avx512.
>
> Do we use any of the code from that repo? I see it has BSD-2 license

Not directly, but the logic (including the shuffle constants) is similar. That implementation uses '_mm512_multishift_epi64_epi8', though, and that's not the one I'm using. This one (https://github.com/WojciechMula/base64simd/blob/master/encode/encode.avx512vbmi.cpp) in particular is probably worth mentioning in hindsight, since I used the non-multishift version in my implementation with VBMI.

I used the resources below to understand the available implementations and options (they are all related), which raises the question of which of these exactly I should be referencing.

Especially this one, for understanding: http://0x80.pl/notesen/2016-04-03-avx512-base64.html#id29

https://github.com/WojciechMula/base64simd/tree/master
https://arxiv.org/pdf/1910.05109.pdf
https://github.com/lemire/fastbase64/blob/master/src/fastavx512bwbase64.c
https://github.com/aklomp/base64/tree/master/lib/arch/avx512
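For readers unfamiliar with the 'multishift' instruction mentioned above, the following is a scalar Python sketch of what `_mm512_multishift_epi64_epi8` does within a single 64-bit lane: each control byte selects 8 bits of the lane starting at an arbitrary bit offset, wrapping around at bit 64. The helper names are hypothetical, and packing the three input bytes into one 24-bit word is a simplifying assumption for illustration; the referenced implementations arrange the bytes differently and therefore use different shift constants.

```python
MASK64 = (1 << 64) - 1

def multishift_qword(control, qword):
    """Scalar model of one 64-bit lane of a multishift.

    For each control byte (hardware uses 8 per lane), take 8 bits of `qword`
    starting at bit (control & 63), wrapping around at bit 64.
    Returns the selected bytes in control order.
    """
    out = []
    for shift in control:
        s = shift & 63
        rotated = ((qword >> s) | (qword << (64 - s))) & MASK64  # rotate right by s
        out.append(rotated & 0xFF)
    return out

def sextets_via_multishift(b0, b1, b2):
    """Extract the four 6-bit Base64 indices of a 3-byte group."""
    word = (b0 << 16) | (b1 << 8) | b2
    fields = multishift_qword([18, 12, 6, 0], word)
    return [f & 63 for f in fields]  # keep the low 6 bits of each selected byte

indices = sextets_via_multishift(ord("M"), ord("a"), ord("n"))
```

The appeal of the instruction is that all four field extractions happen in one step per lane, where a non-multishift version needs a combination of shifts and masks.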

@ghost

ghost commented Sep 20, 2023

Tagging subscribers to this area: @dotnet/area-system-buffers
See info in area-owners.md if you want to be subscribed.

@DeepakRajendrakumaran
Contributor Author

@EgorBo I modified the reference to point to https://github.com/WojciechMula/base64simd/tree/master, which has the versions closest to the implementation I went with. Will this require me to add anything to the notice?

@DeepakRajendrakumaran
Contributor Author

I have updated the THIRD-PARTY-NOTICES file based on a conversation with @tannergooding. Removing the fallback AVX512BW path meant I had to add only 2 references.

Member

@EgorBo EgorBo left a comment

Thanks! Looks way simpler now

@BruceForstall added the avx512 label and removed the needs-area-label label on Sep 25, 2023
@BruceForstall added this to the 9.0.0 milestone on Sep 25, 2023
@DeepakRajendrakumaran
Contributor Author

@tannergooding @BruceForstall Any comments on this? It'd be great if we could move this forward this week.

@BruceForstall
Member

I'm not the right person to review this. If @tannergooding can't review, maybe @stephentoub can review (or pick an appropriate reviewer).


// This algorithm requires AVX512VBMI support.
// Vbmi was first introduced in CannonLake and is available from IceLake on.
// This makes it okay to use Vbmi instructions since Vector512.IsHardwareAccelerated returns True only from IceLake on.
Member

This comment isn't quite accurate.

Vector512.IsHardwareAccelerated can be made to return true on hardware from Skylake-X up to (but not including) IceLake via an environment variable. This is why the caller checks Vector512.IsHardwareAccelerated && Avx512Vbmi.IsSupported, and why this function has [CompExactlyDependsOn(typeof(Avx512Vbmi))].

We're fine with it not being usable pre-IceLake, since those CPUs often incur heavier downclocking and it's unnecessary complexity for a non-default scenario.
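The reasoning in this review comment can be modelled with a few lines of Python. This is a simplified sketch with hypothetical flag names standing in for Vector512.IsHardwareAccelerated and Avx512Vbmi.IsSupported; the real dispatch is C# and may consider other instruction sets that are left out here. The point is that neither flag implies the other, so the VBMI path needs both.

```python
def select_base64_path(vector512_accelerated, vbmi_supported, avx2_supported):
    """Pick a code path from capability flags (all names are illustrative)."""
    # Both checks are required: 512-bit acceleration can be switched on for
    # hardware without VBMI, and then the VBMI path must not be taken.
    if vector512_accelerated and vbmi_supported:
        return "avx512vbmi"
    if avx2_supported:
        return "avx2"
    return "scalar"

# Pre-IceLake AVX512 hardware with acceleration forced on: no VBMI, so AVX2 is used.
forced_on_without_vbmi = select_base64_path(True, False, True)
# IceLake or later with default settings.
icelake_default = select_base64_path(True, True, True)
```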

Contributor Author

Removed line 667

str = Avx512Vbmi.PermuteVar64x8(multiAdd2.AsByte(), vbmiPackedLanesControl).AsSByte();

AssertWrite<Vector512<sbyte>>(dest, destStart, destLength);
Vector512.Store(str.AsByte(), dest);
Member

nit: Most other places in the JIT we do str.Store(dest), since it's an extension method and can be accessed using instance syntax.

Contributor Author

I'm not sure I fully understand how this works

image

Member

It's likely conflicting because dest is a byte* while str is a Vector512<sbyte>, so the call can't be resolved.

str.Store((sbyte*)dest) should fix it, or str.AsByte().Store(dest). The former is less IL, most notably.

Contributor Author

Ah, I messed up and was using str.AsSByte().Store(dest). It's fixed now. Thank you.
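For context on the PermuteVar64x8 call in the snippet under review in this thread, here is a scalar Python sketch of a 64-byte permute used as a packing step at the end of decoding. It is a model under assumptions, not the PR's code: `decode_lanes` and `PACK_CONTROL` are hypothetical, the lane layout (24 decoded bits in the low three bytes of each little-endian 32-bit lane) is assumed for illustration, and `PACK_CONTROL` is not the PR's vbmiPackedLanesControl constant.

```python
import base64

ALPHABET = b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

def permute_var_64x8(source, control):
    """Scalar model of a 64-byte permute: out[i] = source[control[i] & 63]."""
    return bytes(source[c & 63] for c in control)

def decode_lanes(text):
    """Turn 64 Base64 characters into 16 little-endian 32-bit lanes.

    Each lane holds 24 decoded bits in its low three bytes; the top byte is 0.
    """
    assert len(text) == 64
    lanes = bytearray()
    for g in range(16):
        a, b, c, d = (ALPHABET.index(ch) for ch in text[4 * g : 4 * g + 4])
        word = (a << 18) | (b << 12) | (c << 6) | d
        lanes += word.to_bytes(4, "little")
    return bytes(lanes)

# Illustrative control: from each lane take bytes 2, 1, 0 (first decoded byte
# first). The last 16 entries are don't-care and simply point at byte 0.
PACK_CONTROL = [4 * lane + k for lane in range(16) for k in (2, 1, 0)] + [0] * 16

original = bytes(range(100, 148))
lanes = decode_lanes(base64.b64encode(original))
packed = permute_var_64x8(lanes, PACK_CONTROL)
```

In this model the permute returns a full 64-byte vector, of which only the first 48 bytes are decoded output.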

Member

@tannergooding tannergooding left a comment

LGTM. Just a request to cleanup a couple minor things.

@DeepakRajendrakumaran force-pushed the encoding branch 2 times, most recently from 4b2f7d1 to fc05beb, October 24, 2023 20:28
@DeepakRajendrakumaran
Contributor Author

> LGTM. Just a request to cleanup a couple minor things.

I've committed the clean-up changes. Please let me know if you want me to make any other changes

@tannergooding merged commit 9ad24ae into dotnet:main on Oct 25, 2023
175 checks passed
liveans pushed a commit to liveans/dotnet-runtime that referenced this pull request Nov 9, 2023
* Adding AVX512 path to Base64 encoding/Decoding

* Addressing review Comments.

Signed-off-by: Deepak Rajendrakumaran <[email protected]>

* Removing fallback path.

* Updating Third Party Notice.

* Addressing review comments

---------

Signed-off-by: Deepak Rajendrakumaran <[email protected]>
@ghost ghost locked as resolved and limited conversation to collaborators Nov 24, 2023
Labels
area-System.Buffers avx512 Related to the AVX-512 architecture community-contribution Indicates that the PR has been added by a community member
7 participants