remote_cache/digest: add benchmark for sha256-simd #4547

Closed
sluongng wants to merge 2 commits

sluongng
Contributor

Provide a setup to compare minio/sha256-simd (Apache 2.0 license)
performance vs the Go standard library "crypto/sha256".

The sha256-simd library comes with 2 modes:

  • without server: automatically detects CPU features
  • with server: requires AVX512 CPU features

The ARM64 support is not tested.

Running the benchmark against our remote executor yields:

==================== Test output for //server/remote_cache/digest:simd_bench_test:
goos: linux
goarch: amd64
cpu: Intel(R) Xeon(R) CPU @ 3.10GHz

BenchmarkSIMDDigestCompute/without_SIMD/1-30                      255240              5042 ns/op
BenchmarkSIMDDigestCompute/with_SIMD_no_server/1-30               234526              5190 ns/op
BenchmarkSIMDDigestCompute/with_SIMD_with_server/1-30                388          36804140 ns/op

BenchmarkSIMDDigestCompute/without_SIMD/10-30                      10000            118668 ns/op
BenchmarkSIMDDigestCompute/with_SIMD_no_server/10-30               26204             64872 ns/op
BenchmarkSIMDDigestCompute/with_SIMD_with_server/10-30               100          62445228 ns/op

BenchmarkSIMDDigestCompute/without_SIMD/100-30                     10000            193471 ns/op
BenchmarkSIMDDigestCompute/with_SIMD_no_server/100-30              20247            135334 ns/op
BenchmarkSIMDDigestCompute/with_SIMD_with_server/100-30              100          64685802 ns/op

BenchmarkSIMDDigestCompute/without_SIMD/1000-30                    14314            188163 ns/op
BenchmarkSIMDDigestCompute/with_SIMD_no_server/1000-30             10000            176901 ns/op
BenchmarkSIMDDigestCompute/with_SIMD_with_server/1000-30             100         212289431 ns/op

BenchmarkSIMDDigestCompute/without_SIMD/10000-30                    9067            658089 ns/op
BenchmarkSIMDDigestCompute/with_SIMD_no_server/10000-30            10000            721403 ns/op
BenchmarkSIMDDigestCompute/with_SIMD_with_server/10000-30            100         234613900 ns/op

BenchmarkSIMDDigestCompute/without_SIMD/100000-30                   2685           1577976 ns/op
BenchmarkSIMDDigestCompute/with_SIMD_no_server/100000-30            1924           1079974 ns/op
BenchmarkSIMDDigestCompute/with_SIMD_with_server/100000-30           100         146595705 ns/op

BenchmarkSIMDDigestCompute/without_SIMD/1000000-30                   312           9117083 ns/op
BenchmarkSIMDDigestCompute/with_SIMD_no_server/1000000-30            298          13086220 ns/op
BenchmarkSIMDDigestCompute/with_SIMD_with_server/1000000-30           56         211401036 ns/op

PASS
================================================================================

Related issues: N/A

Comment on lines 18 to 21
func hasherWithServer() hash.Hash {
	server := sha256simd.NewAvx512Server()
	return sha256simd.NewAvx512(server)
}
Member

@bduffany bduffany Aug 14, 2023

In real usage, would we reuse this server across requests? I wonder if the server should be declared as a top-level var instead of creating a new server on every iteration.

Contributor Author

@sluongng sluongng Aug 14, 2023

https://github.com/minio/sha256-simd/blob/master/README.md#support-for-avx512

> Due to this different way of scheduling, we decided to use an explicit method to instantiate the AVX512 version.
> Essentially one or more AVX512 processing servers (Avx512Server) have to be created whereby each server can hash over 3 GB/s on a single core. An hash.Hash object (Avx512Digest) is then instantiated using one of these servers and used in the regular fashion:

I think the expectation here is to create one server per core? There are not a lot of examples 🤔

@sluongng
Contributor Author

==================== Test output for //server/remote_cache/digest:simd_bench_test:
goos: linux
goarch: amd64
cpu: Intel(R) Xeon(R) CPU @ 3.10GHz
BenchmarkSIMDDigestCompute/without_SIMD/1000000-30                          2180            556758 ns/op
BenchmarkSIMDDigestCompute/with_SIMD_no_server/1000000-30                   2282            674404 ns/op
BenchmarkSIMDDigestCompute/with_SIMD_with_server/1000000-30                  571           2252739 ns/op

BenchmarkSIMDDigestCompute/without_SIMD/10000000-30                          121           8813178 ns/op
BenchmarkSIMDDigestCompute/with_SIMD_no_server/10000000-30                   129           8296944 ns/op
BenchmarkSIMDDigestCompute/with_SIMD_with_server/10000000-30                  50          20486731 ns/op

BenchmarkSIMDDigestCompute/without_SIMD/100000000-30                          14          78125744 ns/op
BenchmarkSIMDDigestCompute/with_SIMD_no_server/100000000-30                   13          80598781 ns/op
BenchmarkSIMDDigestCompute/with_SIMD_with_server/100000000-30                  7         177869732 ns/op

BenchmarkSIMDDigestCompute/without_SIMD/1000000000-30                          1        4463899572 ns/op
BenchmarkSIMDDigestCompute/with_SIMD_no_server/1000000000-30                   1        1478056796 ns/op
BenchmarkSIMDDigestCompute/with_SIMD_with_server/1000000000-30                 1        1513101099 ns/op
================================================================================

Since the docs mention a speed-up for messages larger than 1 MB, I ran the test against some larger loads.

Overall, the constraints of (1) larger message sizes, (2) a 1:1 server-to-CPU-core mapping, and (3) message padding for alignment make it quite unattractive for our use case.

Gonna close this for now.
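
For reference, the relative speed-up in the largest case can be worked out directly from the ns/op columns. A small sketch, with `speedup` as a hypothetical helper and the inputs copied from the 1000000000 rows above. Those rows ran a single iteration each, so the ratios should be read as rough indications only.

```go
package main

import "fmt"

// speedup returns how many times faster the candidate is than the baseline,
// given ns/op for each. Values below 1 mean the candidate is slower.
func speedup(baselineNs, candidateNs float64) float64 {
	return baselineNs / candidateNs
}

func main() {
	// ns/op copied from the 1000000000 rows of the benchmark output.
	const (
		withoutSIMD = 4463899572
		noServer    = 1478056796
		withServer  = 1513101099
	)
	fmt.Printf("with_SIMD_no_server:   %.2fx\n", speedup(withoutSIMD, noServer))   // about 3.02x
	fmt.Printf("with_SIMD_with_server: %.2fx\n", speedup(withoutSIMD, withServer)) // about 2.95x
}
```

The same helper applied to the smaller sizes gives ratios at or below 1 for the server mode, which matches the conclusion above.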

@sluongng sluongng closed this Aug 14, 2023
@sluongng
Contributor Author

After digging into this a bit more, it seems that the CPUs we have on GCP, at least for our executors, do not include Intel's SHA extensions:

Name: Intel(R) Xeon(R) CPU @ 3.10GHz
PhysicalCores: 15
ThreadsPerCore: 2
LogicalCores: 30
Family 6 Model: 85 Vendor ID: Intel
Features: ADX,AESNI,AVX,AVX2,AVX512BW,AVX512CD,AVX512DQ,AVX512F,AVX512VL,AVX512VNNI,BMI1,BMI2,CLMUL,CMOV,CMPXCHG8,CX16,ERMS,F16C,FMA3,FXSR,FXSROPT,HLE,HTT,HYPERVISOR,IA32_ARCH_CAP,IBPB,LAHF,LZCNT,MD_CLEAR,MMX,MOVBE,MPX,NX,OSXSAVE,POPCNT,RDRAND,RDSEED,RDTSCP,RTM,SPEC_CTRL_SSBD,SSE,SSE2,SSE3,SSE4,SSE42,SSSE3,STIBP,SYSCALL,SYSEE,VMX,X87,XGETBV1,XSAVE,XSAVEC,XSAVEOPT,XSAVES
Cacheline bytes: 64
L1 Data Cache: 32768 bytes
L1 Instruction Cache: 32768 bytes
L2 Cache: 1048576 bytes
L3 Cache: 25952256 bytes
Frequency 3100000000 hz

The minio/sha256-simd code has this clause: https://github.com/minio/sha256-simd/blob/6096f891a77bfe490cbea7a424c821b5fdb92849/cpuid_other.go#L27

So when we use sha256simd.New(), it is essentially a thin wrapper around crypto/sha256, and thus the results showed no difference. If we ever switch to AMD Ryzen / EPYC, we could test this again.
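
The missing extension can be confirmed mechanically from a feature list like the one above. A minimal sketch, assuming the list is comma-separated as printed and that the SHA extensions would show up as a token named `SHA` (my assumption about the naming, not something shown in the output). `hasFeature` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"strings"
)

// hasFeature reports whether name appears as an exact token in a
// comma-separated CPU feature list. Exact matching matters here:
// a substring check would wrongly match "SSE" inside "SSE2".
func hasFeature(list, name string) bool {
	for _, f := range strings.Split(list, ",") {
		if strings.TrimSpace(f) == name {
			return true
		}
	}
	return false
}

func main() {
	// Excerpt of the feature list reported for the executor CPU.
	features := "AESNI,AVX,AVX2,AVX512F,SSE,SSE2,SSE3,SSE4,SSE42,SSSE3"
	fmt.Println("SHA extensions:", hasFeature(features, "SHA")) // false
	fmt.Println("AVX512F:", hasFeature(features, "AVX512F"))    // true
}
```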

The AVX512 implementation is mostly targeted toward hashing bigger files/messages and thus is not suitable for our use case for now.

The ARM64 implementation could be attractive for ARM64 executors (Linux / macOS) down the line, but my benchmark on an M1 laptop does not show a big speed-up.


Pushed my latest local setup to the branch so future me / other folks could replicate the experiment.

@sluongng sluongng mentioned this pull request Aug 25, 2023