-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Vectorized GetNonRandomizedHashCode
#98838
Conversation
Tagging subscribers to this area: @dotnet/area-system-runtime Issue DetailsStill WIP, Just to run through the CI for testing, no need to review.
|
88f2f40
to
39cfc20
Compare
uint hashed3 = hashVector.GetElement(2); | ||
uint hashed4 = hashVector.GetElement(3); | ||
|
||
while (length > 4) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure why you have changed this to while
loop. I think if
was perfectly fine here and more efficient too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The original main loop processes the data stream in the way:
while(length > 2)
{
length -= 4
}
There will be an extra iteration when length is 4n-3, e.g. 7. consuming an extra null terminator, and it also leaves the fact that the trailing string length can only 0/1/2, so it can be processed by the following if
statement.
While in the updated case, since the main loop is operating on a larger granularity, the same trick might violate the memory beyond the null terminator. So I used while
here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can the condition in the main while loop be while (length >= 8)
and the check if (length >= 4)
?
I understand that the existing code does tricks with the null terminator to save a few instructions. You do not have to match those tricks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, thanks for the suggestion, I can try that.
The failures have been fixed. I would post the perf numbers I have locally. I understand there are some discussion on the implementation right now, but I would suppose the suggested implementation would be at worst better than the current one, so I would say the number could be meaningful. Bare perf numbers with the method:Note: Micros with ConcurrentDictionaryNotes: |
Benchmark code used for the method: using System.Diagnostics;
using System.Numerics;
using System.Runtime.Intrinsics;
using System.Runtime;
using BenchmarkDotNet;
using BenchmarkDotNet.Running;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Environments;
using BenchmarkDotNet.Toolchains.CoreRun;
using BenchmarkDotNet.Jobs;
public unsafe class BingSNR
{
[Params(
8, // scalar-path
32, // vector-path
64,
128,
512,
10000
)]
public int Size;
private byte[] _bytes, _sameBytes, _bytesDifferentCase;
private char[] _characters, _sameCharacters, _charactersDifferentCase;
[GlobalSetup]
public void Setup()
{
_bytes = new byte[Size];
_bytesDifferentCase = new byte[Size];
for (int i = 0; i < Size; i++)
{
// let ToLower and ToUpper perform the same amount of work
_bytes[i] = i % 2 == 0 ? (byte)'a' : (byte)'A';
_bytesDifferentCase[i] = i % 2 == 0 ? (byte)'A' : (byte)'a';
}
_sameBytes = _bytes.ToArray();
_characters = _bytes.Select(b => (char)b).ToArray();
_sameCharacters = _characters.ToArray();
_charactersDifferentCase = _bytesDifferentCase.Select(b => (char)b).ToArray();
}
[Benchmark]
public unsafe int Hash_Vector()
{
fixed(char* pStr = _characters)
{
int ret = GetNonRandomizedHashCode_Vector(pStr, Size);
return ret;
}
}
[Benchmark]
public int Hash()
{
fixed(char* pStr = _characters)
{
int ret = GetNonRandomizedHashCode(pStr, Size);
return ret;
}
}
[Benchmark]
public unsafe int HashIgnoreCase_Vector()
{
fixed(char* pStr = _characters)
{
int ret = GetNonRandomizedHashCodeIgnoreCase_Vector(pStr, Size);
return ret;
}
}
[Benchmark]
public int HashIgnoreCase()
{
fixed(char* pStr = _characters)
{
int ret = GetNonRandomizedHashCodeIgnoreCase(pStr, Size);
return ret;
}
}
static unsafe int GetNonRandomizedHashCode_Vector(char* src, int length)
{
uint* ptr = (uint*)src;
uint hash1 = (5381 << 16) + 5381;
uint hash2 = hash1;
uint hash3 = hash1;
uint hash4 = hash1;
if((Vector128.IsHardwareAccelerated && length >= 2 * Vector128<ushort>.Count))
{
Vector128<uint> hashVector = Vector128.Create(hash1);
while (length > 8)
{
Vector128<uint> srcVec = Vector128.Load(ptr);
length -= 8;
hashVector = (hashVector + RotateLeft(hashVector, 5)) ^ srcVec;
ptr += 4;
}
uint hashed1 = hashVector.GetElement(0);
uint hashed2 = hashVector.GetElement(1);
uint hashed3 = hashVector.GetElement(2);
uint hashed4 = hashVector.GetElement(3);
while (length > 4)
{
uint p0 = ptr[0];
uint p1 = ptr[1];
length -= 4;
hashed3 = (BitOperations.RotateLeft(hashed3, 5) + hashed3) ^ (p0);
hashed4 = (BitOperations.RotateLeft(hashed4, 5) + hashed4) ^ (p1);
ptr += 2;
}
while (length > 0)
{
uint p0 = ptr[0];
length -= 2;
hashed4 = (BitOperations.RotateLeft(hashed4, 5) + hashed4) ^ (p0);
ptr += 1;
}
uint res = (((BitOperations.RotateLeft(hashed1, 5) + hashed1)) ^ hashed3) + 1566083941 * (((BitOperations.RotateLeft(hashed2, 5) + hashed2)) ^ hashed4);
return (int)res;
}
while (length > 8)
{
uint p0 = ptr[0];
uint p1 = ptr[1];
uint p2 = ptr[2];
uint p3 = ptr[3];
length -= 8;
// hashVector = (hashVector + RotateLeft(hashVector, 5)) ^ srcVec;
hash1 = (BitOperations.RotateLeft(hash1, 5) + hash1) ^ (p0);
hash2 = (BitOperations.RotateLeft(hash2, 5) + hash2) ^ (p1);
hash3 = (BitOperations.RotateLeft(hash3, 5) + hash3) ^ (p2);
hash4 = (BitOperations.RotateLeft(hash4, 5) + hash4) ^ (p3);
ptr += 4;
}
while (length > 4)
{
uint p0 = ptr[0];
uint p1 = ptr[1];
length -= 4;
// Where length is 4n-1 (e.g. 3,7,11,15,19) this additionally consumes the null terminator
hash3 = (BitOperations.RotateLeft(hash3, 5) + hash3) ^ (p0);
hash4 = (BitOperations.RotateLeft(hash4, 5) + hash4) ^ (p1);
ptr += 2;
}
while (length > 0)
{
uint p0 = ptr[0];
length -= 2;
hash4 = (BitOperations.RotateLeft(hash4, 5) + hash4) ^ (p0);
ptr += 1;
}
uint resOnScalarPath = (((BitOperations.RotateLeft(hash1, 5) + hash1)) ^ hash3) + 1566083941 * (((BitOperations.RotateLeft(hash2, 5) + hash2)) ^ hash4);
return (int)resOnScalarPath;
}
static unsafe int GetNonRandomizedHashCodeIgnoreCase_Vector(char* src, int length)
{
uint* ptr = (uint*)src;
uint hash1 = (5381 << 16) + 5381;
uint hash2 = hash1;
uint hash3 = hash1;
uint hash4 = hash1;
const uint NormalizeToLowercase = 0x0020_0020u;
if((Vector128.IsHardwareAccelerated && length >= 2 * Vector128<ushort>.Count))
{
Vector128<uint> hashVector = Vector128.Create(hash1);
Vector128<uint> NormalizeToLowercaseVec = Vector128.Create(NormalizeToLowercase);
while (length > 8)
{
Vector128<uint> srcVec = Vector128.Load(ptr);
length -= 8;
hashVector = (hashVector + RotateLeft(hashVector, 5)) ^ (srcVec | NormalizeToLowercaseVec);
ptr += 4;
}
uint hashed1 = hashVector.GetElement(0);
uint hashed2 = hashVector.GetElement(1);
uint hashed3 = hashVector.GetElement(2);
uint hashed4 = hashVector.GetElement(3);
while (length > 4)
{
uint p0 = ptr[0];
uint p1 = ptr[1];
length -= 4;
hashed3 = (BitOperations.RotateLeft(hashed3, 5) + hashed3) ^ (p0 | NormalizeToLowercase);
hashed4 = (BitOperations.RotateLeft(hashed4, 5) + hashed4) ^ (p1 | NormalizeToLowercase);
ptr += 2;
}
while (length > 0)
{
uint p0 = ptr[0];
length -= 2;
hashed4 = (BitOperations.RotateLeft(hashed4, 5) + hashed4) ^ (p0 | NormalizeToLowercase);
ptr += 1;
}
uint res = (((BitOperations.RotateLeft(hashed1, 5) + hashed1)) ^ hashed3) + 1566083941 * (((BitOperations.RotateLeft(hashed2, 5) + hashed2)) ^ hashed4);
return (int)res;
}
while (length > 8)
{
uint p0 = ptr[0];
uint p1 = ptr[1];
uint p2 = ptr[2];
uint p3 = ptr[3];
length -= 8;
// hashVector = (hashVector + RotateLeft(hashVector, 5)) ^ srcVec;
hash1 = (BitOperations.RotateLeft(hash1, 5) + hash1) ^ (p0 | NormalizeToLowercase);
hash2 = (BitOperations.RotateLeft(hash2, 5) + hash2) ^ (p1 | NormalizeToLowercase);
hash3 = (BitOperations.RotateLeft(hash3, 5) + hash3) ^ (p2 | NormalizeToLowercase);
hash4 = (BitOperations.RotateLeft(hash4, 5) + hash4) ^ (p3 | NormalizeToLowercase);
ptr += 4;
}
while (length > 4)
{
uint p0 = ptr[0];
uint p1 = ptr[1];
length -= 4;
// Where length is 4n-1 (e.g. 3,7,11,15,19) this additionally consumes the null terminator
hash3 = (BitOperations.RotateLeft(hash3, 5) + hash3) ^ (p0 | NormalizeToLowercase);
hash4 = (BitOperations.RotateLeft(hash4, 5) + hash4) ^ (p1 | NormalizeToLowercase);
ptr += 2;
}
while (length > 0)
{
uint p0 = ptr[0];
length -= 2;
hash4 = (BitOperations.RotateLeft(hash4, 5) + hash4) ^ (p0 | NormalizeToLowercase);
ptr += 1;
}
uint resOnScalarPath = (((BitOperations.RotateLeft(hash1, 5) + hash1)) ^ hash3) + 1566083941 * (((BitOperations.RotateLeft(hash2, 5) + hash2)) ^ hash4);
return (int)resOnScalarPath;
}
static Vector128<uint> RotateLeft(Vector128<uint> src, int control)
{
return Vector128.BitwiseOr(Vector128.ShiftLeft(src, control), Vector128.ShiftRightLogical(src, 32 - control));
}
static unsafe int GetNonRandomizedHashCode(char* src, int length)
{
uint hash1 = (5381 << 16) + 5381;
uint hash2 = hash1;
uint* ptr = (uint*)src;
while (length > 2)
{
length -= 4;
hash1 = (BitOperations.RotateLeft(hash1, 5) + hash1) ^ ptr[0];
hash2 = (BitOperations.RotateLeft(hash2, 5) + hash2) ^ ptr[1];
ptr += 2;
}
if (length > 0)
hash2 = (BitOperations.RotateLeft(hash2, 5) + hash2) ^ ptr[0];
return (int)(hash1 + (hash2 * 1566083941));
}
static unsafe int GetNonRandomizedHashCodeIgnoreCase(char* src, int length)
{
uint hash1 = (5381 << 16) + 5381;
uint hash2 = hash1;
const uint NormalizeToLowercase = 0x0020_0020u;
uint* ptr = (uint*)src;
while (length > 2)
{
length -= 4;
hash1 = (BitOperations.RotateLeft(hash1, 5) + hash1) ^ (ptr[0] | NormalizeToLowercase);
hash2 = (BitOperations.RotateLeft(hash2, 5) + hash2) ^ (ptr[1] | NormalizeToLowercase);
ptr += 2;
}
if (length > 0)
hash2 = (BitOperations.RotateLeft(hash2, 5) + hash2) ^ (ptr[0] | NormalizeToLowercase);
return (int)(hash1 + (hash2 * 1566083941));
}
static void Main(string[] args)
{
var switcher = new BenchmarkSwitcher(new[] {
typeof(BingSNR)
});
switcher.Run(args);
}
}
|
So the break-even point is around 32 characters, and it is a regression for anything smaller than that? What can be done to address the regression for smaller strings? Also, do we know what is the string length distribution in the Bing scenarios that you are trying to improve? |
The regression mostly comes from the branch misprediction, in this method, the string length is random, makes it hard for branch predictor to learn any pattern, I was not able to come up with good ideas to fix this.
We did collected the distribution directly from the app, and the benchmark numbers, but I am not sure if it is proper to share them here, considering those are closed-source. |
Where are the branch mispredictions for the bare perf numbers that show the worst regressions? The benchmarks are setup to run on the same length string in a loop. There should not be any branch mispredictions. Also, I have notice that your bare perf numbers are for |
|
Thanks for the feedbacks!
using the code given in the issue, I got the same collision count with the number provided, the generated hash code will be different from the ones generated by the original algorithm, but the collision count remained the same with the given setup - This implementation may not resolve the known collision issue.
Locally, I can reproduce similar numbers by simply calling the System.IO.Hashing lib function as suggested, XxHash3 is around 40% faster than the proposed implementation and 64% faster than the original implementation. (benchmark code attached below, the inputs are slightly different due to the lib function interface.)
I am not sure what [Params(
512
)]
public int Size;
private string _string;
[GlobalSetup]
public void Setup()
{
_string = "";
for (int i = 0; i < Size; i++)
{
_string += i % 2 == 0 ? 'a' : 'A';
}
}
[Benchmark]
public unsafe int Hash_Vector()
{
fixed(char* pStr = _string)
{
int ret = GetNonRandomizedHashCode_Vector(pStr, Size);
return ret;
}
}
[Benchmark]
public unsafe int Hash()
{
fixed(char* pStr = _string)
{
int ret = GetNonRandomizedHashCode(pStr, Size);
return ret;
}
}
[Benchmark]
public unsafe ulong Hash_XxHash()
{
ulong ret = XxHash3.HashToUInt64(MemoryMarshal.AsBytes(_string.AsSpan()));
return ret;
} |
It would not lead to mismatch as long as the existing algorithm is unconditionally used for small strings and the vectorized algorithm is unconditionally used for long strings. Something like:
Note that XxHash3 uses similar pattern to avoid paying the fixed vectorization overhead for small sizes: runtime/src/libraries/System.IO.Hashing/src/System/IO/Hashing/XxHash3.cs Lines 129 to 144 in 995989e
|
Hi @jkotas , trying to kindly follow up on this PR, is there any further AR on my side to move forward on this, e.g integrating the suggested changes into the library? |
Do you plan to address the regressions for small strings and make the performance for large strings even better by switching to XxHash3?
I am worried about your current implementation making the collision issues worse. Analyzing quality of hash functions is a non-trivial field. |
Failures look irrelevant.
The suggested changes to mitigate the small strings have been pushed to the PR. On the large input side, we currently do not have the plan to further optimize it with XxHash3.
I do understand the hash code quality analysis is non-trivial, and due to my limited experience within this domain, it could be challenging for me to perform analysis and get the conclusion. If this is not the right time for this PR, I understand. |
Right, it is why it is best to use one of the existing algorithms that has been analyzed. |
Given the discussion and the analysis required and the lack of movement on it for several months, I'm going to close this for now. But thanks for trying to improve things here! |
Description
This PR vectorizes the hashing algorithm used in
GetNonRandomizedHashCode
andGetNonRandomizedHashCodeIgnoreCase
withVector128
APIs. This is motivated by the fact that this method was spotted as the hot path in some first party workload.In terms of the implementation, the incoming string input will be hashed on the vector path as long as the length is above the threshold. This leads to a 2-path hashing, meaning short strings (length under threshold) and long strings (length above threshold) will be hashed in different way, I'm not sure if this discrepancy matters, and the given structure might make it hard to scale up to V256 and V512.
we are open to discuss and improve the implementation.