Benchmark results #89

Open
ggerganov opened this issue Oct 25, 2022 · 168 comments
Labels
performance CPU and memory usage - results and comparisons

Comments

@ggerganov
Owner

ggerganov commented Oct 25, 2022

Encoder

Collection of bench results for various platforms and devices.
To submit results for your device, run the bench tool or the extra/bench-all.sh script and report the results in the comments below.

Suggestions for a better summary of the results are welcome.

| CPU | OS | Config | Model | Th | Load | Enc. | Commit |
| --- | --- | --- | --- | --- | --- | --- | --- |
| MacBook M1 Pro | MacOS 13.0.1 | NEON BLAS | tiny | 8 | 71 | 102 | 206fc93 |
| MacBook M1 Pro | MacOS 13.0.1 | NEON BLAS | base | 8 | 96 | 220 | 206fc93 |
| MacBook M1 Pro | MacOS 13.0.1 | NEON BLAS | small | 8 | 233 | 685 | 206fc93 |
| MacBook M1 Pro | MacOS 13.0.1 | NEON BLAS | medium | 8 | 603 | 1928 | 206fc93 |
| MacBook M1 Pro | MacOS 13.0.1 | NEON BLAS | large | 8 | 1158 | 3350 | 206fc93 |
| --- | | | | | | | |
| MacBook M1 Pro | MacOS 13.0.1 | NEON BLAS | small | 1 | 251 | 2605 | 206fc93 |
| MacBook M1 Pro | MacOS 13.0.1 | NEON BLAS | small | 4 | 255 | 884 | 206fc93 |
| --- | | | | | | | |
| Mac Mini M1 | MacOS | NEON BLAS | tiny | 4 | 62 | 194 | fcf515d |
| Mac Mini M1 | MacOS | NEON BLAS | base | 4 | 81 | 380 | fcf515d |
| Mac Mini M1 | MacOS | NEON BLAS | small | 4 | 204 | 1249 | fcf515d |
| Mac Mini M1 | MacOS | NEON BLAS | medium | 4 | 876 | 3980 | fcf515d |
| Mac Mini M1 | MacOS | NEON BLAS | large | 4 | 1876 | 7979 | fcf515d |
| --- | | | | | | | |
| Ryzen 9 3900X | Ubuntu 20.04 | AVX2 | tiny | 8 | 107 | 422 | fcf515d |
| Ryzen 9 3900X | Ubuntu 20.04 | AVX2 | base | 8 | 137 | 880 | fcf515d |
| Ryzen 9 3900X | Ubuntu 20.04 | AVX2 | small | 8 | 280 | 2874 | fcf515d |
| Ryzen 9 3900X | Ubuntu 20.04 | AVX2 | medium | 8 | 692 | 9610 | fcf515d |
| Ryzen 9 3900X | Ubuntu 20.04 | AVX2 | large | 8 | 1317 | 16917 | fcf515d |
| --- | | | | | | | |
| Ryzen 9 3900X | Ubuntu 20.04 | AVX2 BLAS | tiny | 4 | 120 | 780 | fcf515d |
| Ryzen 9 3900X | Ubuntu 20.04 | AVX2 BLAS | base | 4 | 151 | 1173 | fcf515d |
| Ryzen 9 3900X | Ubuntu 20.04 | AVX2 BLAS | small | 4 | 289 | 3062 | fcf515d |
| Ryzen 9 3900X | Ubuntu 20.04 | AVX2 BLAS | medium | 4 | 711 | 9175 | fcf515d |
| Ryzen 9 3900X | Ubuntu 20.04 | AVX2 BLAS | large | 4 | 1282 | 16050 | fcf515d |
| --- | | | | | | | |
| Ryzen 9 5950X | Ubuntu 22.04 | AVX2 | tiny | 8 | 135 | 197 | fcf515d |
| Ryzen 9 5950X | Ubuntu 22.04 | AVX2 | base | 8 | 176 | 421 | fcf515d |
| Ryzen 9 5950X | Ubuntu 22.04 | AVX2 | small | 8 | 357 | 1393 | fcf515d |
| Ryzen 9 5950X | Ubuntu 22.04 | AVX2 | medium | 8 | 855 | 4404 | fcf515d |
| Ryzen 9 5950X | Ubuntu 22.04 | AVX2 | large | 8 | 1576 | 8118 | fcf515d |
| --- | | | | | | | |
| Raspberry Pi 4 | | NEON | tiny | 4 | 1436 | 13839 | fcf515d |
| Raspberry Pi 4 | | NEON | base | 4 | 1894 | 30552 | fcf515d |
| --- | | | | | | | |
| iPhone 13 Mini | iOS 16.0 | NEON BLAS | base | 4 | 97 | 1091 | fcf515d |
| --- | | | | | | | |
| MacBook M1 Pro | Vivaldi | WASM | tiny | 8 | 133 | 3785 | fcf515d |
| MacBook M1 Pro | Vivaldi | WASM | base | 8 | 172 | 8253 | fcf515d |
| --- | | | | | | | |
| MacBook M1 Pro | Chrome | WASM | tiny | 8 | 134 | 3776 | fcf515d |
| MacBook M1 Pro | Chrome | WASM | base | 8 | 168 | 8200 | fcf515d |
| --- | | | | | | | |
| MacBook M1 Pro | Firefox | WASM | tiny | 8 | 137 | 2626 | fcf515d |
| MacBook M1 Pro | Firefox | WASM | base | 8 | 183 | 6226 | fcf515d |

memcpy

MacBook M1 Pro

./bench -w 1 -t 1
memcpy: 37.59 GB/s

Ryzen 9 5950X

./bench -w 1 -t 1
memcpy: 16.74 GB/s
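
For readers unfamiliar with what a figure like `memcpy: 16.74 GB/s` measures, here is a minimal sketch of a single-threaded copy-bandwidth measurement. It is not the bench tool's implementation: the buffer size, the repeat count, the use of the best run, and the convention of counting each byte once (rather than read plus write) are all assumptions made for illustration, and Python overhead means the numbers will not match the C tool.

```python
import time


def copy_bandwidth_gbs(size_bytes=64 * 1024 * 1024, repeats=5):
    """Return the best observed copy bandwidth in GB/s (1 GB = 1e9 bytes).

    Assumption: each copied byte is counted once.
    """
    src = bytearray(size_bytes)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        dst = bytes(src)  # one full copy of the buffer
        elapsed = time.perf_counter() - start
        assert len(dst) == size_bytes
        best = min(best, elapsed)
    # Guard against a zero reading on timers with coarse resolution.
    best = max(best, 1e-9)
    return size_bytes / best / 1e9


if __name__ == "__main__":
    print(f"copy: {copy_bandwidth_gbs():.2f} GB/s")
```

Taking the best of several runs reduces the influence of one-off scheduling noise; averaging would be an equally defensible choice.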

ggml_mul_mat

MacBook M1 Pro

./bench -w 2 -t 1
ggml_mul_mat:    64 x    64: F16    330.6 GFLOPS (128 runs) / F32    466.0 GFLOPS (128 runs)
ggml_mul_mat:   128 x   128: F16    737.5 GFLOPS (128 runs) / F32    838.9 GFLOPS (128 runs)
ggml_mul_mat:   256 x   256: F16    938.6 GFLOPS (128 runs) / F32   1062.3 GFLOPS (128 runs)
ggml_mul_mat:   512 x   512: F16   1312.5 GFLOPS (128 runs) / F32   1835.5 GFLOPS (128 runs)
ggml_mul_mat:  1024 x  1024: F16   1765.1 GFLOPS (128 runs) / F32   2041.4 GFLOPS (128 runs)
ggml_mul_mat:  2048 x  2048: F16   1784.3 GFLOPS (104 runs) / F32   1859.2 GFLOPS (109 runs)
ggml_mul_mat:  4096 x  4096: F16   1855.1 GFLOPS ( 14 runs) / F32   1873.3 GFLOPS ( 14 runs)

Ryzen 9 5950X

WHISPER_OPENBLAS=1 make -j bench && ./bench -w 2 -t 1
ggml_mul_mat:    64 x    64: F16     56.3 GFLOPS (128 runs) / F32     70.2 GFLOPS (128 runs)
ggml_mul_mat:   128 x   128: F16     47.8 GFLOPS (128 runs) / F32     67.0 GFLOPS (128 runs)
ggml_mul_mat:   256 x   256: F16    185.1 GFLOPS (128 runs) / F32    332.7 GFLOPS (128 runs)
ggml_mul_mat:   512 x   512: F16    386.4 GFLOPS (128 runs) / F32    658.6 GFLOPS (128 runs)
ggml_mul_mat:  1024 x  1024: F16    636.2 GFLOPS (128 runs) / F32   1012.0 GFLOPS (128 runs)
ggml_mul_mat:  2048 x  2048: F16    950.9 GFLOPS ( 56 runs) / F32   1296.8 GFLOPS ( 76 runs)
ggml_mul_mat:  4096 x  4096: F16   1168.6 GFLOPS (  9 runs) / F32   1403.1 GFLOPS ( 11 runs)
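
To relate the GFLOPS figures above to wall-clock time, here is a short sketch that assumes the conventional count of 2·N³ floating-point operations for an N x N by N x N matrix product. The thread does not show how the bench tool counts operations, so treat that count as an assumption; the function names are hypothetical.

```python
def matmul_gflops(n, seconds_per_run):
    """GFLOPS for one n x n by n x n product, assuming 2 * n**3 operations."""
    flops = 2.0 * n ** 3
    return flops / seconds_per_run / 1e9


def seconds_per_run(n, gflops):
    """Inverse of matmul_gflops: time for one product at a given GFLOPS rate."""
    return 2.0 * n ** 3 / (gflops * 1e9)


if __name__ == "__main__":
    # Ryzen 9 5950X, 4096 x 4096, F32 row from the output above.
    print(f"{seconds_per_run(4096, 1403.1) * 1000:.1f} ms per run")
```

Under this assumption, doubling N multiplies the work by eight, which is consistent with the run counts dropping sharply at the largest sizes.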
@ggerganov ggerganov added the performance CPU and memory usage - results and comparisons label Oct 25, 2022
@cdosoftei
Contributor

cdosoftei commented Oct 25, 2022

Results for Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz

CPU OS Config Model Threads Load [ms] Encode [ms]
i7-4790K Debian   tiny.en 4 165 808
i7-4790K Debian   tiny.en 8 165 783
i7-4790K Debian   base.en 4 212 1813
i7-4790K Debian   base.en 8 214 1746

@rjwilmsi

Results for Ryzen 5 4500U 6C/6T laptop CPU (I've just included one result for 8 threads as Encode time is much higher when threads > CPU cores).

CPU OS Config Model Threads Load [ms] Encode [ms]
Ryzen 5 4500U (6C/6T) Opensuse Leap tiny.en 4 170.00 829.43
Ryzen 5 4500U (6C/6T) Opensuse Leap tiny.en 6 143.03 671.74
Ryzen 5 4500U (6C/6T) Opensuse Leap base.en 4 305.92 2,092.39
Ryzen 5 4500U (6C/6T) Opensuse Leap base.en 6 188.05 1,495.61
Ryzen 5 4500U (6C/6T) Opensuse Leap small.en 4 408.03 6,919.31
Ryzen 5 4500U (6C/6T) Opensuse Leap small.en 6 359.23 6,370.83
Ryzen 5 4500U (6C/6T) Opensuse Leap medium.en 4 2,238.11 25,863.28
Ryzen 5 4500U (6C/6T) Opensuse Leap medium.en 6 1,113.04 19,672.63
Ryzen 5 4500U (6C/6T) Opensuse Leap medium.en 8 973.65 39,619.20
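
The effect of oversubscribing the cores can be read directly off the medium.en rows. The sketch below copies those three Encode values from the table above and reports the fastest thread count and how much slower each setting is relative to it; the helper names are hypothetical.

```python
# Encode times [ms] for medium.en on the Ryzen 5 4500U (6C/6T), from the table above.
encode_ms = {4: 25863.28, 6: 19672.63, 8: 39619.20}


def best_thread_count(times_ms):
    """Thread count with the lowest encode time."""
    return min(times_ms, key=times_ms.get)


def slowdown_vs_best(times_ms):
    """Ratio of each encode time to the fastest one (1.0 means fastest)."""
    best = min(times_ms.values())
    return {threads: ms / best for threads, ms in times_ms.items()}


if __name__ == "__main__":
    print("fastest:", best_thread_count(encode_ms), "threads")
    for threads, ratio in sorted(slowdown_vs_best(encode_ms).items()):
        print(f"{threads} threads: {ratio:.2f}x the best time")
```

On these numbers, 8 threads on the 6-core part take roughly twice as long as 6 threads.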

@ArtyomZemlyak

ArtyomZemlyak commented Oct 26, 2022

CPU OS Config Model Threads Load [ms] Encode [ms]
i7-11800H WSL2 Ubuntu AVX2 tiny 2 164.35 1087.61
i7-11800H WSL2 Ubuntu AVX2 tiny 4 128.94 733.24
i7-11800H WSL2 Ubuntu AVX2 tiny 8 137.57 619.88
i7-11800H WSL2 Ubuntu AVX2 AVX512 tiny 2 143.02 1087.15
i7-11800H WSL2 Ubuntu AVX2 AVX512 tiny 4 127.60 730.57
i7-11800H WSL2 Ubuntu AVX2 AVX512 tiny 8 125.62 616.27
i7-11800H WSL2 Ubuntu AVX2 AVX512 BLAS tiny 2 132.59 1511.38
i7-11800H WSL2 Ubuntu AVX2 AVX512 BLAS tiny 4 132.48 1407.49
i7-11800H WSL2 Ubuntu AVX2 AVX512 BLAS tiny 8 133.82 1458.27

@ArtyomZemlyak

ArtyomZemlyak commented Oct 26, 2022

CPU OS Config Model Threads Load [ms] Encode [ms]
i7-11800H WSL2 Ubuntu AVX2 base 2 174.34 2533.79
i7-11800H WSL2 Ubuntu AVX2 base 4 166.68 1830.67
i7-11800H WSL2 Ubuntu AVX2 base 8 165.53 1478.73
i7-11800H WSL2 Ubuntu AVX2 small 2 340.12 8714.24
i7-11800H WSL2 Ubuntu AVX2 small 4 394.32 6021.41
i7-11800H WSL2 Ubuntu AVX2 small 8 305.98 4828.84
i7-11800H WSL2 Ubuntu AVX2 large 2 3205.36 57109.10
i7-11800H WSL2 Ubuntu AVX2 large 4 2720.25 38519.89
i7-11800H WSL2 Ubuntu AVX2 large 8 3716.34 27739.99

@ArtyomZemlyak

ArtyomZemlyak commented Oct 26, 2022

CPU OS Config Model Threads Load [ms] Encode [ms]
i7-11800H WSL2 Ubuntu AVX2 AVX512 large 2 1954.21 54966.84
i7-11800H WSL2 Ubuntu AVX2 AVX512 large 4 1455.40 37320.62
i7-11800H WSL2 Ubuntu AVX2 AVX512 large 8 1372.58 27937.64

@ArtyomZemlyak

This performance is impressive!

M1 Pro | MacOS |   | large | 8 | 1973 | 4208

@ggerganov ggerganov pinned this issue Oct 26, 2022
@ggerganov
Owner Author

This performance is impressive!

Yes, there is a huge performance boost due to using the built-in BLAS implementation on these devices. I will soon add OpenBLAS support for x86 architectures and see how this compares.

By the way, AVX-512 is not supported on master. I have added initial support here, but I am not sure if it works: #95

@cristianglezm

cristianglezm commented Oct 28, 2022

CPU OS Config Model Threads Load[ms] encode[ms]
Intel® Core™ i5-8250U Win11 Home AVX2 Large 8 2226.85 61547.61

compiled with MinGW64 gcc 11.3

@tazz4843
Contributor

tazz4843 commented Oct 29, 2022

Valve Jupiter (AMD Custom APU 0405, Zen 2 microarch, 4c8t, 16GB DDR5 @ 5200 MT/s)

CPU OS Config Model Threads Load[ms] encode[ms]
AMD Custom APU 0405 SteamOS 3.2 AVX2 Base 8 326.32 2592.96

Compiled with cc (GCC) 11.3.0

The performance gains on jfk.wav since last test (two weeks or so ago) are extremely impressive, ~10-20x speedup from 40 to 2-4 seconds.

@yujinqiu

CPU OS Config Model Threads Load [ms] Encode [ms]
MacBook M1 Max macOS Ventura BLAS small 1 299.09 4166.00
MacBook M1 Max macOS Ventura BLAS small 4 329.45 1304.32
MacBook M1 Max macOS Ventura BLAS base 1 139.10 1302.17
MacBook M1 Max macOS Ventura BLAS base 4 135.96 399.45

@trholding

This comment was marked as outdated.

@trholding

This comment was marked as outdated.

@trholding

This comment was marked as outdated.

@ggerganov
Owner Author

@trholding
Thanks for the results.

You can generate a table with performance results by simply running the extra/bench-all.sh script.

Regarding the threads: yes, it seems that going beyond 8 threads does not help, regardless of how many cores you have. My guess is that the computation is memory-bound, which is why using more threads does not improve performance.
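
One way to see the diminishing returns is to compute speedup and parallel efficiency from the MacBook M1 Pro "small" rows in the first post (1, 4 and 8 threads). This is only a sketch over three data points; it shows that efficiency drops as threads are added, but it does not by itself establish that memory bandwidth is the cause.

```python
# Encode times [ms] for the "small" model on the MacBook M1 Pro, from the first post.
encode_ms = {1: 2605, 4: 884, 8: 685}


def scaling(times_ms):
    """Map thread count -> (speedup vs 1 thread, parallel efficiency)."""
    base = times_ms[1]
    result = {}
    for threads, ms in sorted(times_ms.items()):
        speedup = base / ms
        result[threads] = (speedup, speedup / threads)
    return result


if __name__ == "__main__":
    for threads, (speedup, efficiency) in scaling(encode_ms).items():
        print(f"{threads} threads: speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```

Going from 4 to 8 threads doubles the thread count but improves the encode time by well under 2x on these numbers.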

@trholding

This comment was marked as outdated.

@trholding
Contributor

You can generate a table with performance results by simply running the extra/bench-all.sh script.

Hey, sorry. That didn't pan out well: I ran the benchmark three times and then my account got deleted without notice. I couldn't get the logs because it was a web terminal. On the other hand, I am happy that this happened. I was seriously considering purchasing a GPU+CPU plan there, so checking CPU performance was equally important. Technically it was probably my fault, since I shouldn't have used a reverse shell and run benchmarks on a free trial, but how else does one know whether a service is really good or just vapor...

@rgerganov
Contributor

Dell Precision 5560 laptop results:

CPU OS Config Model Threads Load [ms] Encode [ms]
i7-11850H Ubuntu AVX2 tiny 4 115.87 538.43
i7-11850H Ubuntu AVX2 base 4 145.14 1241.84
i7-11850H Ubuntu AVX2 small 4 299.30 4343.57
i7-11850H Ubuntu AVX2 medium 4 760.98 15238.31
i7-11850H Ubuntu AVX2 large 4 1404.32 27476.86
i7-11850H Ubuntu AVX2 tiny 8 131.96 358.81
i7-11850H Ubuntu AVX2 base 8 166.61 839.31
i7-11850H Ubuntu AVX2 small 8 320.29 2854.86
i7-11850H Ubuntu AVX2 medium 8 756.20 9829.62
i7-11850H Ubuntu AVX2 large 8 1382.38 19872.81

@jaybinks
Contributor

jaybinks commented Nov 5, 2022

CPU OS Config Model Threads Load [ms] Encode [ms]
i9-9900K WSL2 Ubuntu (GCC) AVX2  tiny.en 4 85.71 601.56
i9-9900K WSL2 Ubuntu (GCC) AVX2  small.en 4 212.59 5146.23
i9-9900K OSX 10.14.1 (hackintosh - GCC) AVX2  tiny.en 4 198.17 455.12
i9-9900K OSX 10.14.1 (hackintosh - GCC) AVX2  base.en 4 272.62 909.71
i9-9900K OSX 10.14.1 (hackintosh - GCC) AVX2 small.en 4 598.75 2968.75
Xeon(R) Silver 4210R CPU @ 2.40GHz Virtual Machine - Debian Stretch (GCC - master branch) AVX2 avx512f avx512dq avx512cd avx512bw avx512vl small.en 4 776.56 12340.41
Xeon(R) Silver 4210R CPU @ 2.40GHz Virtual Machine - Debian Stretch (GCC - master branch) AVX2 avx512f avx512dq avx512cd avx512bw avx512vl tiny.en 4 295.54 1710.46

@mark-beeby

CPU OS Config Model Threads Load [ms] Encode [ms]
i9-11950H Pop!_OS 22.04 LTS AVX2 Tiny 4 124.28 656.41
i9-11950H Pop!_OS 22.04 LTS AVX2 Tiny 8 123.70 696.41
i9-11950H Pop!_OS 22.04 LTS AVX2 Base 4 159.91 1754.44
i9-11950H Pop!_OS 22.04 LTS AVX2 Base 8 164.47 1658.55
i9-11950H Pop!_OS 22.04 LTS AVX2 Small 4 330.91 6161.86
i9-11950H Pop!_OS 22.04 LTS AVX2 Small 8 346.22 5187.85

@niksedk
Contributor

niksedk commented Nov 9, 2022

CPU OS Config Model Threads Load [ms] Encode [ms]
i7-1065G7 Windows 11 - small.en 4 1,314.25 294,168.09

Compiled with VS 2022

Something is off, right?

@ggerganov
Owner Author

Yup - you are missing the AVX2 flag. See if some of the comments in #5 can help you resolve this.

@niksedk
Contributor

niksedk commented Nov 9, 2022

OK, the AVX2 flag seems to help :)

CPU OS Config Model Threads Load [ms] Encode [ms]
i7-1065G7 Windows 11 AVX2 small.en 4 527.59 9,648.67

Compiled with VS 2022

@j1nx

j1nx commented Nov 17, 2022

CPU OS Config Model Threads Load [ms] Encode [ms] Remarks
Raspberry Pi 4 - 2GB OpenVoiceOS NEON tiny 1 861.34 29428.21 With OVOS services running
Raspberry Pi 4 - 2GB OpenVoiceOS NEON BLAS tiny 1 843.80 16145.62 With OVOS services running
Raspberry Pi 4 - 2GB OpenVoiceOS NEON tiny 4 835.68 21509.08 With OVOS services running
Raspberry Pi 4 - 2GB OpenVoiceOS NEON BLAS tiny 4 824.24 13187.96 With OVOS services running
Raspberry Pi 4 - 2GB OpenVoiceOS NEON base 1 1146.02 87615.00 With OVOS services running
Raspberry Pi 4 - 2GB OpenVoiceOS NEON BLAS base 1 1103.39 52228.30 With OVOS services running
Raspberry Pi 4 - 2GB OpenVoiceOS NEON base 4 1183.47 55256.20 With OVOS services running
Raspberry Pi 4 - 2GB OpenVoiceOS NEON BLAS base 4 1161.32 29851.40 With OVOS services running
Raspberry Pi 4 - 2GB OpenVoiceOS NEON tiny 1 752.64 24018.10 Without OVOS services running
Raspberry Pi 4 - 2GB OpenVoiceOS NEON BLAS tiny 1 751.96 13082.95 Without OVOS services running
Raspberry Pi 4 - 2GB OpenVoiceOS NEON tiny 4 743.37 10122.80 Without OVOS services running
Raspberry Pi 4 - 2GB OpenVoiceOS NEON BLAS tiny 4 742.90 9564.89 Without OVOS services running
Raspberry Pi 4 - 2GB OpenVoiceOS NEON base 1 974.46 71587.61 Without OVOS services running
Raspberry Pi 4 - 2GB OpenVoiceOS NEON BLAS base 1 979.65 43852.07 Without OVOS services running
Raspberry Pi 4 - 2GB OpenVoiceOS NEON base 4 982.24 24814.62 Without OVOS services running
Raspberry Pi 4 - 2GB OpenVoiceOS NEON BLAS base 4 982.80 19910.19 Without OVOS services running

@StuartIanNaylor

StuartIanNaylor commented Nov 17, 2022

From the stream repo


| CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] |
| --- | --- | --- | --- | --- | --- | --- |
| RK3588 | Ubuntu 20.04 | NEON | tiny.en | 4 | 243.54 | 779.49 |
| RK3588 | Ubuntu 20.04 | NEON | base.en | 4 | 316.52 | 1821.06 |
| RK3588 | Ubuntu 20.04 | NEON | small.en | 4 | 618.93 | 7117.69 |
| RK3588 | Ubuntu 20.04 | NEON | medium.en | 4 | 1514.88 | 24139.92 |

| CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] |
| --- | --- | --- | --- | --- | --- | --- |
| RK3588 | Ubuntu 20.04 | NEON | tiny | 4 | 233.86 | 791.01 |
| RK3588 | Ubuntu 20.04 | NEON | base | 4 | 297.93 | 1813.69 |
| RK3588 | Ubuntu 20.04 | NEON | small | 4 | 592.18 | 7102.28 |
| RK3588 | Ubuntu 20.04 | NEON | medium | 4 | 1587.36 | 24147.87 |

| CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] |
| --- | --- | --- | --- | --- | --- | --- |
| RK3588 | Ubuntu 20.04 | NEON | tiny | 8 | 226.48 | 740.34 |
| RK3588 | Ubuntu 20.04 | NEON | base | 8 | 300.48 | 1723.42 |
| RK3588 | Ubuntu 20.04 | NEON | small | 8 | 620.58 | 6392.47 |
| RK3588 | Ubuntu 20.04 | NEON | medium | 8 | 1533.75 | 21899.08 |

I still haven't worked out the little (0-3) / big (4-7) core layout on this thing. If I pin to the big cores with taskset -c 4-7:

| CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] |
| --- | --- | --- | --- | --- | --- | --- |
| RK3588 | Ubuntu 20.04 | NEON | tiny.en | 4 | 234.14 | 681.53 |
| RK3588 | Ubuntu 20.04 | NEON | base.en | 4 | 297.08 | 1679.75 |
| RK3588 | Ubuntu 20.04 | NEON | small.en | 4 | 599.98 | 6867.66 |
| RK3588 | Ubuntu 20.04 | NEON | medium.en | 4 | 1492.73 | 23600.45 |

I tried to compile with OpenBLAS, but that seemed to kill the make.


From the master repo (I hadn't thought about which repo I was using after trying streaming input):

| CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] |
| --- | --- | --- | --- | --- | --- | --- |
| RK3588 | Ubuntu 20.04 | NEON | tiny | 8 | 226.48 | 2681.05 |
| RK3588 | Ubuntu 20.04 | NEON | base | 8 | 283.56 | 6132.44 |
| RK3588 | Ubuntu 20.04 | NEON | small | 8 | 583.39 | 24397.78 |
| RK3588 | Ubuntu 20.04 | NEON | medium | 8 | 1490.98 | 85099.45 |

@dodysw
Contributor

dodysw commented Nov 17, 2022

CPU OS Config Model Threads Load [ms] Encode [ms]
Ryzen 7 PRO 4750G Ubuntu 22.04 AVX2 tiny.en 8 136.29 454.52
Ryzen 7 PRO 4750G Ubuntu 22.04 AVX2 tiny 8 134.64 486.01
Ryzen 7 PRO 4750G Ubuntu 22.04 AVX2 base 8 180.22 1184.80
Ryzen 7 PRO 4750G Ubuntu 22.04 AVX2 base.en 8 192.86 1197.85
Ryzen 7 PRO 4750G Ubuntu 22.04 AVX2 small 8 367.55 4179.00
Ryzen 7 PRO 4750G Ubuntu 22.04 AVX2 small.en 8 378.27 4557.73
Ryzen 7 PRO 4750G Ubuntu 22.04 AVX2 medium 8 923.48 15552.61
Ryzen 7 PRO 4750G Ubuntu 22.04 AVX2 medium.en 8 952.48 15708.63
Ryzen 7 PRO 4750G Ubuntu 22.04 AVX2 large 8 1650.28 28357.09

8 threads seemed to be the fastest. However I managed to squeeze a bit more performance by pinning CPU:

$ taskset -c 0-15 ./extra/bench-all.sh 16
CPU OS Config Model Threads Load [ms] Encode [ms]
Ryzen 7 PRO 4750G Ubuntu 22.04 AVX2 tiny 16 143.17 437.73
Ryzen 7 PRO 4750G Ubuntu 22.04 AVX2 base 16 184.10 1061.14
Ryzen 7 PRO 4750G Ubuntu 22.04 AVX2 small 16 374.41 3645.64
Ryzen 7 PRO 4750G Ubuntu 22.04 AVX2 medium 16 935.45 13029.54
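
For anyone who wants to reproduce the `taskset -c 0-15` pinning from a script, here is a minimal Python sketch. `os.sched_setaffinity` is a real standard-library call but is only available on Linux, so the helper reports whether pinning was applied. `parse_cpu_list` and `pin_current_process` are hypothetical names, and the parser handles only the simple `a-b` and comma-separated forms, not every syntax taskset accepts.

```python
import os


def parse_cpu_list(spec):
    """Turn a CPU list such as '0-15' or '0,2-3' into a set of CPU indices."""
    cpus = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            low, high = part.split("-", 1)
            cpus.update(range(int(low), int(high) + 1))
        elif part:
            cpus.add(int(part))
    return cpus


def pin_current_process(spec):
    """Pin this process to the given CPUs; return False where unsupported."""
    if not hasattr(os, "sched_setaffinity"):
        return False  # e.g. macOS or Windows
    os.sched_setaffinity(0, parse_cpu_list(spec))  # pid 0 = current process
    return True


if __name__ == "__main__":
    print(sorted(parse_cpu_list("4-7")))
```

Child processes inherit the affinity mask on Linux, so pinning before launching the benchmark from the same script has a similar effect to prefixing the command with taskset.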

@matth

matth commented Nov 21, 2022

Results for AWS Graviton 3 Processor (c7g.4xlarge instance type).

Compiled with -march=native -ffast-math.

./extra/bench-all.sh 8

CPU OS Config Model Threads Load [ms] Encode [ms]
Graviton 3 Ubuntu 22.04 NEON tiny 8 125.92 230.33
Graviton 3 Ubuntu 22.04 NEON base 8 160.17 547.88
Graviton 3 Ubuntu 22.04 NEON small 8 299.59 2138.86
Graviton 3 Ubuntu 22.04 NEON medium 8 741.49 6999.33
Graviton 3 Ubuntu 22.04 NEON large 8 1313.95 14174.00

./extra/bench-all.sh 16

CPU OS Config Model Threads Load [ms] Encode [ms]
Graviton 3 Ubuntu 22.04 NEON tiny 16 121.92 158.61
Graviton 3 Ubuntu 22.04 NEON base 16 156.01 386.78
Graviton 3 Ubuntu 22.04 NEON small 16 299.85 1596.38
Graviton 3 Ubuntu 22.04 NEON medium 16 750.93 5351.24
Graviton 3 Ubuntu 22.04 NEON large 16 1313.82 11115.69

@ggerganov
Owner Author

@matth Do you observe significant performance difference with / without -march=native -ffast-math?

@matth

matth commented Nov 21, 2022

@ggerganov -ffast-math seems to make only a very small difference that could be noise between runs

-march=native does seem to make a big difference, without it FP16_VA is not reported as being enabled (I can get this with -march=armv8.4-a+bf16+fp16fml) - I think -march=native is enabling more intrinsics than this though.

Results without any -march or -ffast-math flags ...

./extra/bench-all.sh 16

CPU OS Config Model Threads Load [ms] Encode [ms]
Graviton 3 Ubuntu 22.04 NEON tiny 16 124.25 320.53
Graviton 3 Ubuntu 22.04 NEON base 16 156.91 734.22
Graviton 3 Ubuntu 22.04 NEON small 16 301.78 2812.75
Graviton 3 Ubuntu 22.04 NEON medium 16 714.23 9139.86
Graviton 3 Ubuntu 22.04 NEON large 16 1298.33 18147.47

I have tried to improve this by using OpenBLAS and armpl.h, but they both slow it down considerably - I'll keep trying with the latter.

Are there any possibilities for further optimisations in ggml.c that can take advantage of the situation where you have bf16 functions but not BLAS or Accelerate?
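
To put a number on the compiler-flag effect, the sketch below compares the two 16-thread Graviton 3 tables posted above (with and without `-march=native -ffast-math`). The values are copied from those tables. Since `-ffast-math` was reported to make little difference, most of the gap is presumably due to `-march=native`, but that attribution is an inference rather than something the two tables prove.

```python
# Encode times [ms], Graviton 3, 16 threads, copied from the two tables above.
with_flags = {"tiny": 158.61, "base": 386.78, "small": 1596.38,
              "medium": 5351.24, "large": 11115.69}
without_flags = {"tiny": 320.53, "base": 734.22, "small": 2812.75,
                 "medium": 9139.86, "large": 18147.47}


def slowdown(baseline, other):
    """Per-model ratio other / baseline (values above 1.0 mean 'other' is slower)."""
    return {model: other[model] / baseline[model] for model in baseline}


if __name__ == "__main__":
    for model, ratio in slowdown(with_flags, without_flags).items():
        print(f"{model}: {ratio:.2f}x slower without the flags")
```

On these numbers the build without the flags is between roughly 1.6x and 2.0x slower, with the largest relative gap on the smallest model.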

@nickovs

nickovs commented Nov 3, 2023

Results for the new Raspberry Pi 5. Tests performed on a board with the active cooler. uname -a output is:

Linux newpi 6.1.0-rpi4-rpi-2712 #1 SMP PREEMPT Debian 1:6.1.54-1+rpt2 (2023-10-05) aarch64 GNU/Linux

| CPU | OS | Config | Threads | Model | Encode | Decode | Commit |
| --- | --- | --- | --- | --- | --- | --- | --- |
| BCM2712 | Bookworm 12.2 | NEON | 4 | tiny | 1106.11 | 183.67 | 54c978c |
| BCM2712 | Bookworm 12.2 | NEON | 4 | tiny.en | 1109.66 | 201.3 | 54c978c |
| BCM2712 | Bookworm 12.2 | NEON | 4 | base | 2479.82 | 346.65 | 54c978c |
| BCM2712 | Bookworm 12.2 | NEON | 4 | base.en | 2465.12 | 363.86 | 54c978c |
| BCM2712 | Bookworm 12.2 | NEON | 4 | small | 8308.3 | 963.24 | 54c978c |
| BCM2712 | Bookworm 12.2 | NEON | 4 | small.en | 8342.25 | 1119.25 | 54c978c |
| BCM2712 | Bookworm 12.2 | NEON | 4 | medium.en | 26407.77 | 2893.55 | 54c978c |
| BCM2712 | Bookworm 12.2 | NEON | 4 | medium | 26468.86 | 2919.43 | 54c978c |

These results are 4.5 to 6.2 times faster than the Raspberry Pi 4.

NOTE: The packaged version of OpenBLAS has not been recompiled for the new CPU architecture, so it is about 50% slower than whisper.cpp's native NEON implementation. I will post benchmarks using OpenBLAS once I have built a version for the new CPU.

The memcpy and ggml_mul_mat benchmarks show:

memcpy: 4.64 GB/s (1 thread)
sum:    136902081526.000000

  64 x   64: Q4_0     5.5 GFLOPS (128 runs) | Q4_1     5.1 GFLOPS (128 runs)
  64 x   64: Q5_0     4.7 GFLOPS (128 runs) | Q5_1     4.9 GFLOPS (128 runs) | Q8_0     5.0 GFLOPS (128 runs)
  64 x   64: F16      5.0 GFLOPS (128 runs) | F32      4.9 GFLOPS (128 runs)
 128 x  128: Q4_0    22.9 GFLOPS (128 runs) | Q4_1    22.6 GFLOPS (128 runs)
 128 x  128: Q5_0    19.7 GFLOPS (128 runs) | Q5_1    20.3 GFLOPS (128 runs) | Q8_0    23.9 GFLOPS (128 runs)
 128 x  128: F16     26.3 GFLOPS (128 runs) | F32     13.3 GFLOPS (128 runs)
 256 x  256: Q4_0    39.0 GFLOPS (128 runs) | Q4_1    49.4 GFLOPS (128 runs)
 256 x  256: Q5_0    33.0 GFLOPS (128 runs) | Q5_1    37.5 GFLOPS (128 runs) | Q8_0    58.6 GFLOPS (128 runs)
 256 x  256: F16     64.1 GFLOPS (128 runs) | F32     48.4 GFLOPS (128 runs)
 512 x  512: Q4_0    62.6 GFLOPS (128 runs) | Q4_1    62.3 GFLOPS (128 runs)
 512 x  512: Q5_0    49.9 GFLOPS (128 runs) | Q5_1    46.1 GFLOPS (128 runs) | Q8_0    76.2 GFLOPS (128 runs)
 512 x  512: F16     80.1 GFLOPS (128 runs) | F32     51.1 GFLOPS (128 runs)
1024 x 1024: Q4_0    67.9 GFLOPS ( 32 runs) | Q4_1    67.6 GFLOPS ( 32 runs)
1024 x 1024: Q5_0    53.5 GFLOPS ( 25 runs) | Q5_1    50.4 GFLOPS ( 24 runs) | Q8_0    85.4 GFLOPS ( 40 runs)
1024 x 1024: F16     92.9 GFLOPS ( 44 runs) | F32     48.0 GFLOPS ( 23 runs)
2048 x 2048: Q4_0    71.0 GFLOPS (  5 runs) | Q4_1    72.2 GFLOPS (  5 runs)
2048 x 2048: Q5_0    55.7 GFLOPS (  4 runs) | Q5_1    52.3 GFLOPS (  4 runs) | Q8_0    87.6 GFLOPS (  6 runs)
2048 x 2048: F16     93.1 GFLOPS (  6 runs) | F32     43.9 GFLOPS (  3 runs)
4096 x 4096: Q4_0    72.2 GFLOPS (  3 runs) | Q4_1    73.9 GFLOPS (  3 runs)
4096 x 4096: Q5_0    55.9 GFLOPS (  3 runs) | Q5_1    52.7 GFLOPS (  3 runs) | Q8_0    86.9 GFLOPS (  3 runs)
4096 x 4096: F16     86.8 GFLOPS (  3 runs) | F32     38.4 GFLOPS (  3 runs)

@marjisound

CPU details: Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
GPU name: NVIDIA Tesla T4
OS: Linux 14 22.04.1-Ubuntu
Compiler: cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0

WHISPER_CUBLAS=1 make -j bench && ./extra/bench-all.sh

I whisper.cpp build info:
I UNAME_S:  Linux
I UNAME_P:  x86_64
I UNAME_M:  x86_64
I CFLAGS:   -I.              -O3 -DNDEBUG -std=c11   -fPIC -pthread -mavx2 -mfma -mf16c -mavx -msse3 -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -pthread -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include
I LDFLAGS:  -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib
I CC:       cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
I CXX:      g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0

make: 'bench' is up to date.
Usage: ./bench.sh [n_threads]

Running memcpy benchmark with 1 thread

memcpy: 5.05 GB/s
sum:    -536870997.000000

Running ggml_mul_mat benchmark with 4 threads

ggml_mul_mat:   64 x   64: Q4_0     3.8 GFLOPS (128 runs) / Q4_1     3.8 GFLOPS (128 runs) / F16     3.8 GFLOPS (128 runs) / F32     3.9 GFLOPS (128 runs)
ggml_mul_mat:  128 x  128: Q4_0    23.6 GFLOPS (128 runs) / Q4_1    24.0 GFLOPS (128 runs) / F16    22.1 GFLOPS (128 runs) / F32    22.4 GFLOPS (128 runs)
ggml_mul_mat:  256 x  256: Q4_0    90.3 GFLOPS (128 runs) / Q4_1   100.0 GFLOPS (128 runs) / F16    92.0 GFLOPS (128 runs) / F32    92.3 GFLOPS (128 runs)
ggml_mul_mat:  512 x  512: Q4_0   278.8 GFLOPS (128 runs) / Q4_1   277.6 GFLOPS (128 runs) / F16   244.9 GFLOPS (128 runs) / F32   242.1 GFLOPS (128 runs)
ggml_mul_mat: 1024 x 1024: Q4_0   859.2 GFLOPS (128 runs) / Q4_1   853.6 GFLOPS (128 runs) / F16   648.3 GFLOPS (128 runs) / F32   685.4 GFLOPS (128 runs)
ggml_mul_mat: 2048 x 2048: Q4_0  1583.4 GFLOPS ( 93 runs) / Q4_1  1585.1 GFLOPS ( 93 runs) / F16  1383.9 GFLOPS ( 81 runs) / F32  1359.7 GFLOPS ( 80 runs)
ggml_mul_mat: 4096 x 4096: Q4_0  2525.9 GFLOPS ( 19 runs) / Q4_1  2658.6 GFLOPS ( 20 runs) / F16  2716.0 GFLOPS ( 20 runs) / F32  2302.7 GFLOPS ( 17 runs)

Running benchmark for all models
This can take a while!

CPU OS Config Model Th Load Enc. Commit
Xeon(R) Ubuntu AVX2 BLAS tiny 4 429 550 fa8dbdc
Xeon(R) Ubuntu AVX2 BLAS base 4 521 1133 fa8dbdc
Xeon(R) Ubuntu AVX2 BLAS small 4 798 3025 fa8dbdc
Xeon(R) Ubuntu AVX2 BLAS medium 4 1701 7639 fa8dbdc
Xeon(R) Ubuntu AVX2 BLAS large 4 2966 12927 fa8dbdc

@StuartIanNaylor

StuartIanNaylor commented Nov 3, 2023

What's happening with commit 8a2bee6?
I was just interested in comparing the Opi5 and the Rpi5 on the same master, but I seem to have an extra PP column that I am sure I will find a use for.
Rpi 5gb
Linux raspberrypi 6.1.0-rpi4-rpi-2712 #1 SMP PREEMPT Debian 1:6.1.54-1+rpt2 (2023-10-05) aarch64 GNU/Linux

memcpy: 5.32 GB/s (1 thread)
sum:    136902081526.000000

Running ggml_mul_mat benchmark with 4 threads

  64 x   64: Q4_0     6.0 GFLOPS (128 runs) | Q4_1     5.9 GFLOPS (128 runs)
  64 x   64: Q5_0     5.3 GFLOPS (128 runs) | Q5_1     4.9 GFLOPS (128 runs) | Q8_0     1.9 GFLOPS (128 runs)
  64 x   64: F16      6.0 GFLOPS (128 runs) | F32      5.8 GFLOPS (128 runs)
 128 x  128: Q4_0    23.9 GFLOPS (128 runs) | Q4_1    22.6 GFLOPS (128 runs)
 128 x  128: Q5_0    21.4 GFLOPS (128 runs) | Q5_1    20.4 GFLOPS (128 runs) | Q8_0    11.4 GFLOPS (128 runs)
 128 x  128: F16     28.6 GFLOPS (128 runs) | F32     26.2 GFLOPS (128 runs)
 256 x  256: Q4_0    49.8 GFLOPS (128 runs) | Q4_1    49.6 GFLOPS (128 runs)
 256 x  256: Q5_0    40.9 GFLOPS (128 runs) | Q5_1    24.8 GFLOPS (128 runs) | Q8_0    59.0 GFLOPS (128 runs)
 256 x  256: F16     63.0 GFLOPS (128 runs) | F32     29.6 GFLOPS (128 runs)
 512 x  512: Q4_0    56.6 GFLOPS (128 runs) | Q4_1    56.5 GFLOPS (128 runs)
 512 x  512: Q5_0    30.4 GFLOPS (114 runs) | Q5_1    36.5 GFLOPS (128 runs) | Q8_0    71.2 GFLOPS (128 runs)
 512 x  512: F16     64.6 GFLOPS (128 runs) | F32     35.2 GFLOPS (128 runs)
1024 x 1024: Q4_0    67.4 GFLOPS ( 32 runs) | Q4_1    68.7 GFLOPS ( 32 runs)
1024 x 1024: Q5_0    38.1 GFLOPS ( 18 runs) | Q5_1    32.3 GFLOPS ( 16 runs) | Q8_0    61.3 GFLOPS ( 29 runs)
1024 x 1024: F16     71.7 GFLOPS ( 34 runs) | F32     35.1 GFLOPS ( 17 runs)
2048 x 2048: Q4_0    71.4 GFLOPS (  5 runs) | Q4_1    71.5 GFLOPS (  5 runs)
2048 x 2048: Q5_0    38.1 GFLOPS (  3 runs) | Q5_1    36.9 GFLOPS (  3 runs) | Q8_0    63.5 GFLOPS (  4 runs)
2048 x 2048: F16     68.6 GFLOPS (  4 runs) | F32     32.0 GFLOPS (  3 runs)
4096 x 4096: Q4_0    66.8 GFLOPS (  3 runs) | Q4_1    62.4 GFLOPS (  3 runs)
4096 x 4096: Q5_0    39.5 GFLOPS (  3 runs) | Q5_1    37.0 GFLOPS (  3 runs) | Q8_0    62.7 GFLOPS (  3 runs)
4096 x 4096: F16     61.5 GFLOPS (  3 runs) | F32     29.1 GFLOPS (  3 runs)

Running benchmark for all models
This can take a while!

|    CPU |     OS |           Config |       Model |  Th |    Enc. |    Dec. |      PP |  Commit |
|    --- |    --- |              --- |         --- | --- |     --- |     --- |     --- |     --- |
| Rpi5 BCM2712 | bookworm |             NEON |        tiny |   4 | 1206.23 |    6.67 |  198.84 | 8a2bee6 |
| Rpi5 BCM2712 | bookworm |             NEON |        base |   4 | 2862.56 |   11.74 |  466.51 | 8a2bee6 |
| Rpi5 BCM2712 | bookworm |             NEON |       small |   4 | 9630.88 |   32.81 | 1650.18 | 8a2bee6 |
| Rpi5 BCM2712 | bookworm |             NEON |      medium |   4 |      ms |   99.64 | 5601.57 | 8a2bee6 |

Opi5 4gb
Linux ubuntu 6.6.0 #1 SMP PREEMPT Mon Oct 30 22:54:25 GMT 2023 aarch64 aarch64 aarch64 GNU/Linux
Mainline Linux rather than the Rockchip BSP: https://github.com/Joshua-Riek/ubuntu-rockchip/releases/tag/v1.29.1

memcpy: 10.93 GB/s (1 thread)
sum:    136902081526.000000

Running ggml_mul_mat benchmark with 4 threads

  64 x   64: Q4_0     6.8 GFLOPS (128 runs) | Q4_1     4.1 GFLOPS (128 runs)
  64 x   64: Q5_0     5.9 GFLOPS (128 runs) | Q5_1     6.0 GFLOPS (128 runs) | Q8_0     6.6 GFLOPS (128 runs)
  64 x   64: F16      4.1 GFLOPS (128 runs) | F32      6.8 GFLOPS (128 runs)
 128 x  128: Q4_0    14.0 GFLOPS (128 runs) | Q4_1    19.1 GFLOPS (128 runs)
 128 x  128: Q5_0    15.5 GFLOPS (128 runs) | Q5_1    12.7 GFLOPS (128 runs) | Q8_0    26.6 GFLOPS (128 runs)
 128 x  128: F16     22.1 GFLOPS (128 runs) | F32     21.2 GFLOPS (128 runs)
 256 x  256: Q4_0    45.0 GFLOPS (128 runs) | Q4_1    45.0 GFLOPS (128 runs)
 256 x  256: Q5_0    29.0 GFLOPS (128 runs) | Q5_1    29.6 GFLOPS (128 runs) | Q8_0    42.8 GFLOPS (128 runs)
 256 x  256: F16     42.5 GFLOPS (128 runs) | F32     42.6 GFLOPS (128 runs)
 512 x  512: Q4_0    55.8 GFLOPS (128 runs) | Q4_1    56.0 GFLOPS (128 runs)
 512 x  512: Q5_0    35.5 GFLOPS (128 runs) | Q5_1    36.7 GFLOPS (128 runs) | Q8_0    61.9 GFLOPS (128 runs)
 512 x  512: F16     80.7 GFLOPS (128 runs) | F32     49.6 GFLOPS (128 runs)
1024 x 1024: Q4_0    60.6 GFLOPS ( 29 runs) | Q4_1    61.4 GFLOPS ( 29 runs)
1024 x 1024: Q5_0    37.6 GFLOPS ( 18 runs) | Q5_1    39.3 GFLOPS ( 19 runs) | Q8_0    68.2 GFLOPS ( 32 runs)
1024 x 1024: F16     93.1 GFLOPS ( 44 runs) | F32     46.4 GFLOPS ( 22 runs)
2048 x 2048: Q4_0    63.1 GFLOPS (  4 runs) | Q4_1    64.1 GFLOPS (  4 runs)
2048 x 2048: Q5_0    39.2 GFLOPS (  3 runs) | Q5_1    41.0 GFLOPS (  3 runs) | Q8_0    70.9 GFLOPS (  5 runs)
2048 x 2048: F16     87.9 GFLOPS (  6 runs) | F32     41.4 GFLOPS (  3 runs)
4096 x 4096: Q4_0    64.2 GFLOPS (  3 runs) | Q4_1    65.3 GFLOPS (  3 runs)
4096 x 4096: Q5_0    39.7 GFLOPS (  3 runs) | Q5_1    41.7 GFLOPS (  3 runs) | Q8_0    70.7 GFLOPS (  3 runs)
4096 x 4096: F16     80.7 GFLOPS (  3 runs) | F32     38.7 GFLOPS (  3 runs)

Running benchmark for all models
This can take a while!

|    CPU |     OS |           Config |       Model |  Th |    Enc. |    Dec. |      PP |  Commit |
|    --- |    --- |              --- |         --- | --- |     --- |     --- |     --- |     --- |
| Opi5 Rk3588s | 22.04.3 LTS (Jammy Jellyfish) |             NEON |        tiny |   4 |  782.52 |    3.10 |  135.25 | 8a2bee6 |
| Opi5 Rk3588s | 22.04.3 LTS (Jammy Jellyfish) |             NEON |        base |   4 | 1754.69 |   11.81 |  304.06 | 8a2bee6 |
| Opi5 Rk3588s | 22.04.3 LTS (Jammy Jellyfish) |             NEON |       small |   4 | 6226.10 |   15.26 | 1075.54 | 8a2bee6 |
| Opi5 Rk3588s | 22.04.3 LTS (Jammy Jellyfish) |             NEON |      medium |   4 |      ms |   44.75 | 3425.05 | 8a2bee6 |

@ggerganov
Owner Author

ggerganov commented Nov 3, 2023

@nickovs These are some very interesting results. Looking forward to the OpenBLAS results as well.

@StuartIanNaylor The PP timing is the "prompt processing" time for a prompt of 256 tokens. As we transcribe with whisper, the context (i.e. the previously transcribed text) grows up to n_text_ctx. For each new audio segment that we process, we have to process the context. This processing is very similar to the token-by-token text generation during decoding, but it is much faster since we process 256 tokens at once.
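
To make the "much faster since we process 256 tokens at once" point concrete, the sketch below divides the PP time by 256 and compares it with the Dec. column, using the tiny-model rows posted above for the Rpi5 and Opi5. It assumes that Dec. is the average time for one decoded token, which the thread does not state explicitly, so treat the ratio as indicative only.

```python
PROMPT_TOKENS = 256  # prompt length used for the PP timing, per the explanation above

# (Dec. [ms], PP [ms]) for the tiny model at commit 8a2bee6, from the tables above.
rows = {
    "Rpi5 BCM2712": (6.67, 198.84),
    "Opi5 Rk3588s": (3.10, 135.25),
}


def pp_ms_per_token(pp_ms, n_tokens=PROMPT_TOKENS):
    """Amortised prompt-processing cost per token."""
    return pp_ms / n_tokens


def batch_advantage(dec_ms, pp_ms):
    """How many times cheaper a token is in batched PP than in decoding.

    Assumption: dec_ms is the time for a single decoded token.
    """
    return dec_ms / pp_ms_per_token(pp_ms)


if __name__ == "__main__":
    for name, (dec_ms, pp_ms) in rows.items():
        print(f"{name}: {pp_ms_per_token(pp_ms):.2f} ms/token in PP, "
              f"{batch_advantage(dec_ms, pp_ms):.1f}x cheaper than decoding")
```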

@nickovs

nickovs commented Nov 3, 2023

By way of comparison to the benchmarks I posted above, here are the matrix multiplication numbers for the same Raspberry Pi 5 using OpenBLAS. It is notable that whisper.cpp's native NEON code outperforms OpenBLAS on the Pi5 for everything except FP32, where OpenBLAS wins by some margin.

  64 x   64: Q4_0     4.4 GFLOPS (128 runs) | Q4_1     4.3 GFLOPS (128 runs)
  64 x   64: Q5_0     3.7 GFLOPS (128 runs) | Q5_1     4.2 GFLOPS (128 runs) | Q8_0     4.1 GFLOPS (128 runs)
  64 x   64: F16      4.0 GFLOPS (128 runs) | F32      4.1 GFLOPS (128 runs)
 128 x  128: Q4_0     0.9 GFLOPS (128 runs) | Q4_1     0.9 GFLOPS (128 runs)
 128 x  128: Q5_0     0.9 GFLOPS (128 runs) | Q5_1     0.9 GFLOPS (128 runs) | Q8_0     0.9 GFLOPS (128 runs)
 128 x  128: F16      0.9 GFLOPS (128 runs) | F32      0.9 GFLOPS (128 runs)
 256 x  256: Q4_0     6.3 GFLOPS (128 runs) | Q4_1     6.4 GFLOPS (128 runs)
 256 x  256: Q5_0     6.4 GFLOPS (128 runs) | Q5_1     6.3 GFLOPS (128 runs) | Q8_0     6.4 GFLOPS (128 runs)
 256 x  256: F16      6.4 GFLOPS (128 runs) | F32      6.5 GFLOPS (128 runs)
 512 x  512: Q4_0    19.7 GFLOPS ( 74 runs) | Q4_1    20.4 GFLOPS ( 76 runs)
 512 x  512: Q5_0    23.7 GFLOPS ( 89 runs) | Q5_1    23.5 GFLOPS ( 89 runs) | Q8_0    23.7 GFLOPS ( 89 runs)
 512 x  512: F16     24.0 GFLOPS ( 90 runs) | F32     25.3 GFLOPS ( 95 runs)
1024 x 1024: Q4_0    35.5 GFLOPS ( 17 runs) | Q4_1    36.5 GFLOPS ( 17 runs)
1024 x 1024: Q5_0    38.9 GFLOPS ( 19 runs) | Q5_1    39.1 GFLOPS ( 19 runs) | Q8_0    38.7 GFLOPS ( 19 runs)
1024 x 1024: F16     39.3 GFLOPS ( 19 runs) | F32     40.9 GFLOPS ( 20 runs)
2048 x 2048: Q4_0    52.8 GFLOPS (  4 runs) | Q4_1    55.4 GFLOPS (  4 runs)
2048 x 2048: Q5_0    56.8 GFLOPS (  4 runs) | Q5_1    55.6 GFLOPS (  4 runs) | Q8_0    56.5 GFLOPS (  4 runs)
2048 x 2048: F16     56.1 GFLOPS (  4 runs) | F32     56.4 GFLOPS (  4 runs)
4096 x 4096: Q4_0    55.3 GFLOPS (  3 runs) | Q4_1    56.9 GFLOPS (  3 runs)
4096 x 4096: Q5_0    58.9 GFLOPS (  3 runs) | Q5_1    60.0 GFLOPS (  3 runs) | Q8_0    61.4 GFLOPS (  3 runs)
4096 x 4096: F16     59.3 GFLOPS (  3 runs) | F32     60.4 GFLOPS (  3 runs)

I have not tried all the tuning options in OpenBLAS, but the options I did try didn't really change the performance compared to the pre-packaged version.

@StuartIanNaylor

StuartIanNaylor commented Nov 4, 2023

I have not tried all the tuning options in OpenBLAS, but the options I did try didn't really change the performance compared to the pre-packaged version.

I think this is where we benefit from ArmV8.2 and from being a subgroup of the "Apple Silicon first-class citizen - optimized via ARM NEON" target.
If you run lscpu:
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp
So I guess we benefit from GGML being optimised around the V8.2+ architecture.
What should be interesting with https://github.com/ggerganov/whisper.cpp#opencl-gpu-support-via-clblast is that the GPU on the Pi5 & Rk3588(s) should be able to use OpenCL, but in testing I am finding the same thing and wondering if the cause is similar.
I never worked out whether it's due to the serial nature of Whisper that you only get a speedup if the GPU is faster than the CPU, but in testing I get a huge slowdown, whilst in other ML tests the supposed FP32 610.6 GFLOPS of the Mali G610 works mightily at approx 75% of the CPU with ArmNN tests using the GPU TFLite OpenCL delegate.
I am presuming CLBlast is somewhat similar and may not be well optimised for some data types?

These results are 4.5 to 6.2 times faster than the Raspberry Pi 4.
Not too sure about that, as the same commit would likely have to be tested; I seem to remember thinking the RK3588s was < 5x the Pi4 and, likely due to memory bandwidth, quite a bit faster than a Pi5.

Linux ubuntu 6.6.0 #1 SMP PREEMPT Opi5 4GB performance governor

memcpy: 10.50 GB/s (1 thread)
sum:    136902081526.000000

Running ggml_mul_mat benchmark with 4 threads

  64 x   64: Q4_0     3.5 GFLOPS (128 runs) | Q4_1     3.2 GFLOPS (128 runs)
  64 x   64: Q5_0     2.8 GFLOPS (128 runs) | Q5_1     2.7 GFLOPS (128 runs) | Q8_0     3.5 GFLOPS (128 runs)
  64 x   64: F16      3.4 GFLOPS (128 runs) | F32      3.4 GFLOPS (128 runs)
 128 x  128: Q4_0     7.9 GFLOPS (128 runs) | Q4_1     8.1 GFLOPS (128 runs)
 128 x  128: Q5_0     6.2 GFLOPS (128 runs) | Q5_1     6.5 GFLOPS (128 runs) | Q8_0     7.9 GFLOPS (128 runs)
 128 x  128: F16      9.4 GFLOPS (128 runs) | F32      7.5 GFLOPS (128 runs)
 256 x  256: Q4_0    10.5 GFLOPS (128 runs) | Q4_1    11.1 GFLOPS (128 runs)
 256 x  256: Q5_0     7.9 GFLOPS (128 runs) | Q5_1     8.5 GFLOPS (128 runs) | Q8_0    10.3 GFLOPS (128 runs)
 256 x  256: F16     14.5 GFLOPS (128 runs) | F32      9.3 GFLOPS (128 runs)
 512 x  512: Q4_0    11.7 GFLOPS ( 44 runs) | Q4_1    12.4 GFLOPS ( 47 runs)
 512 x  512: Q5_0     8.8 GFLOPS ( 33 runs) | Q5_1     9.7 GFLOPS ( 37 runs) | Q8_0    11.4 GFLOPS ( 43 runs)
 512 x  512: F16     17.8 GFLOPS ( 67 runs) | F32      9.2 GFLOPS ( 35 runs)
1024 x 1024: Q4_0    32.2 GFLOPS ( 15 runs) | Q4_1    33.2 GFLOPS ( 16 runs)
1024 x 1024: Q5_0    24.9 GFLOPS ( 12 runs) | Q5_1    25.7 GFLOPS ( 12 runs) | Q8_0    35.2 GFLOPS ( 17 runs)
1024 x 1024: F16     38.0 GFLOPS ( 18 runs) | F32     27.5 GFLOPS ( 13 runs)
2048 x 2048: Q4_0    57.7 GFLOPS (  4 runs) | Q4_1    59.5 GFLOPS (  4 runs)
2048 x 2048: Q5_0    38.0 GFLOPS (  3 runs) | Q5_1    39.3 GFLOPS (  3 runs) | Q8_0    64.3 GFLOPS (  4 runs)
2048 x 2048: F16     77.9 GFLOPS (  5 runs) | F32     38.4 GFLOPS (  3 runs)
4096 x 4096: Q4_0    63.4 GFLOPS (  3 runs) | Q4_1    64.7 GFLOPS (  3 runs)
4096 x 4096: Q5_0    39.9 GFLOPS (  3 runs) | Q5_1    41.7 GFLOPS (  3 runs) | Q8_0    70.3 GFLOPS (  3 runs)
4096 x 4096: F16     78.6 GFLOPS (  3 runs) | F32     37.2 GFLOPS (  3 runs)

Running benchmark for all models
This can take a while!

|    CPU |     OS |           Config |       Model |  Th |    Enc. |    Dec. |      PP |  Commit |
|    --- |    --- |              --- |         --- | --- |     --- |     --- |     --- |     --- |
| <todo> | <todo> |             NEON |        tiny |   4 |  853.56 |    7.37 |  161.81 | f96e1c5 |
| <todo> | <todo> |             NEON |        base |   4 | 1847.86 |   13.00 |  338.18 | f96e1c5 |
| <todo> | <todo> |             NEON |       small |   4 | 6289.17 |   39.19 | 1109.25 | f96e1c5 |
| <todo> | <todo> |             NEON |      medium |   4 |      ms |   67.99 | 3454.96 | f96e1c5 |
| <todo> | <todo> |             NEON |       large |   4 |      ms |  107.50 | 6541.15 | f96e1c5 |

Linux raspberrypi 6.1.0-rpi4-rpi-2712 Rpi5 4GB performance governor

memcpy: 6.03 GB/s (1 thread)
sum:    136902081526.000000

Running ggml_mul_mat benchmark with 4 threads

  64 x   64: Q4_0     5.7 GFLOPS (128 runs) | Q4_1     5.5 GFLOPS (128 runs)
  64 x   64: Q5_0     5.3 GFLOPS (128 runs) | Q5_1     5.1 GFLOPS (128 runs) | Q8_0     5.6 GFLOPS (128 runs)
  64 x   64: F16      5.6 GFLOPS (128 runs) | F32      5.7 GFLOPS (128 runs)
 128 x  128: Q4_0    22.8 GFLOPS (128 runs) | Q4_1    24.1 GFLOPS (128 runs)
 128 x  128: Q5_0    12.3 GFLOPS (128 runs) | Q5_1    11.8 GFLOPS (128 runs) | Q8_0    11.3 GFLOPS (128 runs)
 128 x  128: F16     15.4 GFLOPS (128 runs) | F32     26.5 GFLOPS (128 runs)
 256 x  256: Q4_0    49.7 GFLOPS (128 runs) | Q4_1    50.3 GFLOPS (128 runs)
 256 x  256: Q5_0    41.8 GFLOPS (128 runs) | Q5_1    39.0 GFLOPS (128 runs) | Q8_0    59.7 GFLOPS (128 runs)
 256 x  256: F16     65.2 GFLOPS (128 runs) | F32     48.7 GFLOPS (128 runs)
 512 x  512: Q4_0    63.0 GFLOPS (128 runs) | Q4_1    63.6 GFLOPS (128 runs)
 512 x  512: Q5_0    50.5 GFLOPS (128 runs) | Q5_1    47.3 GFLOPS (128 runs) | Q8_0    77.7 GFLOPS (128 runs)
 512 x  512: F16     85.6 GFLOPS (128 runs) | F32     53.3 GFLOPS (128 runs)
1024 x 1024: Q4_0    68.1 GFLOPS ( 32 runs) | Q4_1    69.8 GFLOPS ( 33 runs)
1024 x 1024: Q5_0    54.1 GFLOPS ( 26 runs) | Q5_1    51.2 GFLOPS ( 24 runs) | Q8_0    86.0 GFLOPS ( 41 runs)
1024 x 1024: F16     93.6 GFLOPS ( 44 runs) | F32     49.0 GFLOPS ( 23 runs)
2048 x 2048: Q4_0    70.8 GFLOPS (  5 runs) | Q4_1    72.8 GFLOPS (  5 runs)
2048 x 2048: Q5_0    56.1 GFLOPS (  4 runs) | Q5_1    53.0 GFLOPS (  4 runs) | Q8_0    88.1 GFLOPS (  6 runs)
2048 x 2048: F16     93.7 GFLOPS (  6 runs) | F32     44.4 GFLOPS (  3 runs)
4096 x 4096: Q4_0    72.6 GFLOPS (  3 runs) | Q4_1    74.8 GFLOPS (  3 runs)
4096 x 4096: Q5_0    56.2 GFLOPS (  3 runs) | Q5_1    53.3 GFLOPS (  3 runs) | Q8_0    88.4 GFLOPS (  3 runs)
4096 x 4096: F16     86.7 GFLOPS (  3 runs) | F32     39.7 GFLOPS (  3 runs)

Running benchmark for all models
This can take a while!

|    CPU |     OS |           Config |       Model |  Th |    Enc. |    Dec. |      PP |  Commit |
|    --- |    --- |              --- |         --- | --- |     --- |     --- |     --- |     --- |
| <todo> | <todo> |             NEON |        tiny |   4 | 1049.00 |    6.74 |  149.32 | f96e1c5 |
| <todo> | <todo> |             NEON |        base |   4 | 2362.92 |   12.60 |  361.37 | f96e1c5 |
| <todo> | <todo> |             NEON |       small |   4 | 8081.87 |   35.65 | 1283.34 | f96e1c5 |
| <todo> | <todo> |             NEON |      medium |   4 |      ms |  105.77 | 4360.80 | f96e1c5 |
| <todo> | <todo> |             NEON |       large |   4 |      ms |  189.93 | 8158.78 | f96e1c5 |

To be honest, I don't know why the GFLOPS are higher on the Pi5 while Enc, the biggest chunk of the processing, is faster on the Opi5; maybe memory bandwidth?
It's like-for-like with the performance governor, since I prefer running Whisper that way (race-to-idle).

@nickovs

nickovs commented Nov 4, 2023

@StuartIanNaylor Here is a straight up comparison of the same 54c978c commit between the Pi4 and the Pi5, both running the code compiled on the Pi4 on the Pi5 and then also recompiling the same commit on the Pi5.

| Model | Pi4 Encode | Pi4 Decode | Pi4 code on Pi5 Encode | Pi4 code on Pi5 Decode | Speedup on same compilation (Encode) | Speedup on same compilation (Decode) | Recompiled on Pi5 Encode | Recompiled on Pi5 Decode | Speedup on recompiled code (Encode) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| tiny | 5246.14 | 510.57 | 2694.38 | 188.38 | 1.95 | 2.71 | 1106.11 | 183.67 | 4.74 |
| tiny.en | 5264.76 | 551.17 | 2744.80 | 203.94 | 1.92 | 2.70 | 1109.66 | 201.3 | 4.74 |
| base.en | 12473.07 | 1004.23 | 6345.28 | 363.15 | 1.97 | 2.77 | 2479.82 | 346.65 | 5.03 |
| base | 12453.04 | 972.29 | 6399.54 | 348.33 | 1.95 | 2.79 | 2465.12 | 363.86 | 5.05 |
| small.en | 48849.9 | 3316.15 | 24127.58 | 961.75 | 2.02 | 3.45 | 8308.3 | 963.24 | 5.88 |
| small | 49671.25 | 2953 | 24134.46 | 1109.70 | 2.06 | 2.66 | 8342.25 | 1119.25 | 5.95 |
| medium.en | 169889.39 | 8451.51 | 79045.66 | 2815.81 | 2.15 | 3.00 | 26407.77 | 2893.55 | 6.43 |
| medium | 173236.92 | 8531.94 | 79075.19 | 2836.38 | 2.19 | 3.01 | 26468.86 | 2919.43 | 6.54 |

This suggests that there is a little better than a 2-fold performance improvement on encode, and more like a 2.8-fold improvement on decode, just from moving the code from the Pi4 to the Pi5. Recompiling on the Pi5 raises the encode performance to between 4.74 and 6.54 times faster than on the Pi4, but the decode performance remains only about 2.8 times faster than the Pi4 and doesn't benefit a great deal from the recompilation.
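
For anyone who wants to reproduce the ratios, here is a minimal sketch that recomputes the speedups from three rows of the table. The timing values are copied from the table; the dictionary layout and the `speedups` helper are only illustrative names, not part of whisper.cpp. It also prints the decode speedup on recompiled code, which is the column that does not fit in the table.

```python
# Encode/decode timings copied from the Pi4 vs Pi5 table (commit 54c978c).
# Tuple order: (Pi4, Pi4 build run on Pi5, recompiled on Pi5).
timings = {
    "tiny":   {"encode": (5246.14, 2694.38, 1106.11),
               "decode": (510.57, 188.38, 183.67)},
    "small":  {"encode": (49671.25, 24134.46, 8342.25),
               "decode": (2953.00, 1109.70, 1119.25)},
    "medium": {"encode": (173236.92, 79075.19, 26468.86),
               "decode": (8531.94, 2836.38, 2919.43)},
}

def speedups(pi4, same_build, rebuilt):
    """Return (speedup with the Pi4 binary, speedup after recompiling)."""
    return round(pi4 / same_build, 2), round(pi4 / rebuilt, 2)

for model, stages in timings.items():
    for stage, values in stages.items():
        same, rebuilt = speedups(*values)
        print(f"{model:>6} {stage}: {same:.2f}x same binary, "
              f"{rebuilt:.2f}x recompiled")
```

The speedup is simply the Pi4 time divided by the Pi5 time, so values above 1 mean the Pi5 is faster.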

(Note that this table hits GitHub's 10 column limit, so the decode speedup may not be displayed, but the numbers are in the comment source.)

The key thing here as far as I'm concerned is that on the Pi5 the small model runs in better than real time, whereas on the Pi4 you were stuck using the tiny model for real-time work.

@jwinarske

It would be great to have a test results db for this. I'm thinking similar to what DRM info does
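
As a rough illustration of what such a results db could look like, here is a minimal sketch that loads rows of a bench table into SQLite. Everything in it is an assumption made for the example: the `results` table, the column names, and the `parse_rows` and `store` helpers do not exist in whisper.cpp. Values are stored as text because the Enc. column sometimes contains a bare `ms` instead of a number.

```python
import sqlite3

# Column names are illustrative; they mirror the 9-column bench table header.
COLUMNS = ["cpu", "os", "config", "model", "threads",
           "enc", "dec", "pp", "commit_hash"]

def parse_rows(table_text):
    """Return one tuple of cell strings per data row of a markdown bench table."""
    rows = []
    for line in table_text.strip().splitlines():
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        # Skip the header row, the '---' separator row, and malformed lines.
        if len(cells) != len(COLUMNS) or cells[0] in ("CPU", "---"):
            continue
        rows.append(tuple(cells))
    return rows

def store(conn, rows):
    """Create the (hypothetical) results table if needed and insert the rows."""
    column_defs = ", ".join(f"{name} TEXT" for name in COLUMNS)
    placeholders = ", ".join("?" for _ in COLUMNS)
    conn.execute(f"CREATE TABLE IF NOT EXISTS results ({column_defs})")
    conn.executemany(f"INSERT INTO results VALUES ({placeholders})", rows)
    conn.commit()

# Two rows copied from a Pi5 table posted earlier in this thread.
sample = """
|    CPU |     OS | Config | Model |  Th |    Enc. |  Dec. |     PP |  Commit |
|    --- |    --- |    --- |   --- | --- |     --- |   --- |    --- |     --- |
| <todo> | <todo> |   NEON |  tiny |   4 | 1049.00 |  6.74 | 149.32 | f96e1c5 |
| <todo> | <todo> |   NEON |  base |   4 | 2362.92 | 12.60 | 361.37 | f96e1c5 |
"""

conn = sqlite3.connect(":memory:")
store(conn, parse_rows(sample))
for row in conn.execute("SELECT model, enc, commit_hash FROM results"):
    print(row)
```

A real database would also need the CPU and OS fields filled in, since the bench script leaves them as `<todo>`.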

@StuartIanNaylor

StuartIanNaylor commented Nov 5, 2023

@jwinarske That would be great, maybe as a separate repo of fixed commits, as we are benchmarking the hardware rather than the software.
The Llama bench would be a good inclusion as the openLlama3b-q4 manages 20 Tokens/s on a Rk3588s-4gb.
I also like https://github.com/Tencent/ncnn/tree/master/benchmark as a pretty easy install and has a ready made list of smaller yolo type models.

Linux ubuntu 6.6.0 #1 SMP PREEMPT Opi5 4GB performance governor 54c978c

memcpy: 11.18 GB/s (1 thread)
sum:    136902081526.000000

Running ggml_mul_mat benchmark with 4 threads

  64 x   64: Q4_0     3.5 GFLOPS (128 runs) | Q4_1     3.2 GFLOPS (128 runs)
  64 x   64: Q5_0     2.7 GFLOPS (128 runs) | Q5_1     2.8 GFLOPS (128 runs) | Q8_0     3.1 GFLOPS (128 runs)
  64 x   64: F16      3.3 GFLOPS (128 runs) | F32      3.2 GFLOPS (128 runs)
 128 x  128: Q4_0     7.8 GFLOPS (128 runs) | Q4_1     8.0 GFLOPS (128 runs)
 128 x  128: Q5_0     6.2 GFLOPS (128 runs) | Q5_1     6.5 GFLOPS (128 runs) | Q8_0     7.8 GFLOPS (128 runs)
 128 x  128: F16      9.5 GFLOPS (128 runs) | F32      7.5 GFLOPS (128 runs)
 256 x  256: Q4_0    10.6 GFLOPS (128 runs) | Q4_1    11.0 GFLOPS (128 runs)
 256 x  256: Q5_0     7.9 GFLOPS (128 runs) | Q5_1     8.4 GFLOPS (128 runs) | Q8_0    10.3 GFLOPS (128 runs)
 256 x  256: F16     14.8 GFLOPS (128 runs) | F32      9.3 GFLOPS (128 runs)
 512 x  512: Q4_0    11.8 GFLOPS ( 44 runs) | Q4_1    12.4 GFLOPS ( 47 runs)
 512 x  512: Q5_0     8.9 GFLOPS ( 34 runs) | Q5_1     9.7 GFLOPS ( 37 runs) | Q8_0    11.5 GFLOPS ( 43 runs)
 512 x  512: F16     17.8 GFLOPS ( 67 runs) | F32      9.6 GFLOPS ( 36 runs)
1024 x 1024: Q4_0    32.7 GFLOPS ( 16 runs) | Q4_1    33.3 GFLOPS ( 16 runs)
1024 x 1024: Q5_0    25.2 GFLOPS ( 12 runs) | Q5_1    27.0 GFLOPS ( 13 runs) | Q8_0    36.0 GFLOPS ( 17 runs)
1024 x 1024: F16     39.4 GFLOPS ( 19 runs) | F32     28.1 GFLOPS ( 14 runs)
2048 x 2048: Q4_0    58.2 GFLOPS (  4 runs) | Q4_1    60.0 GFLOPS (  4 runs)
2048 x 2048: Q5_0    37.2 GFLOPS (  3 runs) | Q5_1    38.8 GFLOPS (  3 runs) | Q8_0    63.3 GFLOPS (  4 runs)
2048 x 2048: F16     78.3 GFLOPS (  5 runs) | F32     38.0 GFLOPS (  3 runs)
4096 x 4096: Q4_0    63.9 GFLOPS (  3 runs) | Q4_1    64.9 GFLOPS (  3 runs)
4096 x 4096: Q5_0    39.6 GFLOPS (  3 runs) | Q5_1    41.5 GFLOPS (  3 runs) | Q8_0    70.3 GFLOPS (  3 runs)
4096 x 4096: F16     78.6 GFLOPS (  3 runs) | F32     35.1 GFLOPS (  3 runs)

Running benchmark for all models
This can take a while!

|    CPU |     OS |           Config |       Model |  Th |    Enc. |    Dec. |      PP |  Commit |
|    --- |    --- |              --- |         --- | --- |     --- |     --- |     --- |     --- |
| <todo> | <todo> |             NEON |        tiny |   4 |  885.27 |    7.35 |  166.54 | 54c978c |
| <todo> | <todo> |             NEON |        base |   4 | 1888.93 |   12.61 |  347.61 | 54c978c |
| <todo> | <todo> |             NEON |       small |   4 | 6397.88 |   38.49 | 1111.82 | 54c978c |
| <todo> | <todo> |             NEON |      medium |   4 |      ms |   68.98 | 3511.72 | 54c978c |

@nickovs I don't know. As before, since the A76 gets vector mat/mul and the code is optimised for ArmV8.2+, the poor Pi4 with OpenBLAS was approx < 5 times slower than an RK3588s.
The above is just the same commit on an Opi5-4GB, so zram and swap come into play with the bigger models, but from audio in to text out, last time I pegged the Pi4 as just under 5x slower and ignored models it didn't manage in real time.
I guess further optimisations have happened; the decode is less important to the overall time than the Enc, as that is the biggest process.

(venv) pi@raspberrypi:~/llama.cpp $ ./llama-bench -m  models/3b/open-llama-3b-q4_0.gguf -t 4
| model                          |       size |     params | backend    |    threads | test       |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ---------: | ---------- | ---------------: |
| llama 3B mostly Q4_0           |   1.84 GiB |     3.43 B | CPU        |          4 | pp 512     |      9.77 ± 0.01 |
| llama 3B mostly Q4_0           |   1.84 GiB |     3.43 B | CPU        |          4 | tg 128     |      5.42 ± 0.00 |

build: c41ea36 (1487)

ubuntu@ubuntu:~/llama.cpp$ ./llama-bench -m models/3b/open-llama-3b-q4_0.gguf -t 4
| model                          |       size |     params | backend    |    threads | test       |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ---------: | ---------- | ---------------: |
| llama 3B mostly Q4_0           |   1.84 GiB |     3.43 B | CPU        |          4 | pp 512     |      9.14 ± 0.01 |
| llama 3B mostly Q4_0           |   1.84 GiB |     3.43 B | CPU        |          4 | tg 128     |      7.06 ± 0.05 |

ryanrapp pushed a commit to ryanrapp/attentional-ios that referenced this issue Jan 9, 2024
"lib" is needed for windows.

With this change, you can build whisper.cpp with OpenBLAS's prebuilt DLL.
1. extract a zip from https://github.com/xianyi/OpenBLAS/releases
2. copy the headers in (openblas)/include to the root directory of whisper.cpp
3. invoke cmake with -DCMAKE_LIBRARY_PATH=(openblas)\lib -DWHISPER_SUPPORT_OPENBLAS=ON
4. copy (openblas)/bin/libopenblas.dll to the same directory of whisper.dll after msbuild

ggerganov/whisper.cpp#89 (comment)
kultivator-consulting pushed a commit to KultivatorConsulting/whisper.cpp that referenced this issue Feb 12, 2024
@petterreinholdtsen
Contributor

Here is the result for NVIDIA GeForce GT 755M on Debian GNU/Linux 12 Bookworm using GCC 12.2.0 build with -DWHISPER_CLBLAST=ON:

whisper_init_from_file_with_params_no_state: loading model from '../nb-large-ggml-model.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51866
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 1280
whisper_model_load: n_audio_head  = 20
whisper_model_load: n_audio_layer = 32
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 1280
whisper_model_load: n_text_head   = 20
whisper_model_load: n_text_layer  = 32
whisper_model_load: n_mels        = 128
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 5 (large v3)
whisper_model_load: adding 1609 extra tokens
whisper_model_load: n_langs       = 100
ggml_opencl: selecting platform: 'NVIDIA CUDA'
ggml_opencl: selecting device: 'NVIDIA GeForce GT 755M'
ggml_opencl: device FP16 support: false
whisper_model_load:      CPU buffer size =  3094.86 MB
whisper_model_load: model size    = 3094.36 MB
whisper_init_state: kv self size  =  220.20 MB
whisper_init_state: kv cross size =  245.76 MB
whisper_init_state: compute buffer (conv)   =   32.42 MB
whisper_init_state: compute buffer (encode) =  212.42 MB
whisper_init_state: compute buffer (cross)  =    9.38 MB
whisper_init_state: compute buffer (decode) =   99.24 MB

system_info: n_threads = 4 / 4 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | CUDA = 0 | COREML = 0 | OPENVINO = 0 | 

whisper_print_timings:     load time =   712.98 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =     0.00 ms
whisper_print_timings:   sample time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:   encode time = 29405.07 ms /     1 runs (29405.07 ms per run)
whisper_print_timings:   decode time = 25138.65 ms /   256 runs (   98.20 ms per run)
whisper_print_timings:   batchd time = 15522.25 ms /   320 runs (   48.51 ms per run)
whisper_print_timings:   prompt time = 120379.20 ms /  4096 runs (   29.39 ms per run)
whisper_print_timings:    total time = 190447.95 ms
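
For collecting numbers like these from many runs, here is a minimal sketch that pulls the stage timings out of a `whisper_print_timings` log. The regular expression is written against the line format shown in this thread only, and `parse_timings` is a hypothetical helper, not a whisper.cpp function.

```python
import re

# Matches lines such as:
#   whisper_print_timings:   encode time = 29405.07 ms /     1 runs (...)
#   whisper_print_timings:     load time =   712.98 ms
TIMING_RE = re.compile(
    r"whisper_print_timings:\s+(\w+) time =\s*([\d.]+) ms(?: /\s*(\d+) runs)?"
)

def parse_timings(log_text):
    """Map stage name -> (total_ms, runs or None) for each timing line."""
    result = {}
    for match in TIMING_RE.finditer(log_text):
        stage, total_ms, runs = match.groups()
        result[stage] = (float(total_ms), int(runs) if runs else None)
    return result

# Lines copied from the GT 755M log in this comment.
log = """\
whisper_print_timings:     load time =   712.98 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:   encode time = 29405.07 ms /     1 runs (29405.07 ms per run)
whisper_print_timings:   decode time = 25138.65 ms /   256 runs (   98.20 ms per run)
whisper_print_timings:   prompt time = 120379.20 ms /  4096 runs (   29.39 ms per run)
whisper_print_timings:    total time = 190447.95 ms
"""

timings = parse_timings(log)
for stage, (total_ms, runs) in timings.items():
    per_run = f"{total_ms / runs:.2f} ms/run" if runs else "n/a"
    print(f"{stage:>7}: {total_ms:10.2f} ms total, {per_run}")
```

Lines without a `time =` field, such as `fallbacks`, are skipped.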

@zhouwg
Contributor

zhouwg commented Mar 6, 2024

benchmark result with 11th Gen Intel Core(TM) i7-11700F @ 2.50GHz + Ubuntu 20.04 + gcc version 9.4.0


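| CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] |
| --- | --- | --- | --- | --- | --- | --- |
| i7-11700F | Ubuntu 20.04 |  | tiny.en | 4 | 46.72 | 4654.39 |
| i7-11700F | Ubuntu 20.04 |  | tiny.en | 8 | 49.85 | 2981.43 |
| i7-11700F | Ubuntu 20.04 |  | small.en | 4 | 175.02 | 51381.51 |
| i7-11700F | Ubuntu 20.04 |  | small.en | 8 | 161.98 | 29662.80 |
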
./bench  -m ./models/ggml-small.en.bin -t 8 -w 2
  64 x   64: Q4_0     4.3 GFLOPS (128 runs) | Q4_1     4.4 GFLOPS (128 runs)
  64 x   64: Q5_0     4.0 GFLOPS (128 runs) | Q5_1     3.5 GFLOPS (128 runs) | Q8_0     4.7 GFLOPS (128 runs)
  64 x   64: F16      4.2 GFLOPS (128 runs) | F32      2.1 GFLOPS (128 runs)
 128 x  128: Q4_0    15.0 GFLOPS (128 runs) | Q4_1    15.3 GFLOPS (128 runs)
 128 x  128: Q5_0    11.9 GFLOPS (128 runs) | Q5_1    12.3 GFLOPS (128 runs) | Q8_0    21.0 GFLOPS (128 runs)
 128 x  128: F16     11.1 GFLOPS (128 runs) | F32      8.7 GFLOPS (128 runs)
 256 x  256: Q4_0    25.4 GFLOPS (128 runs) | Q4_1    29.1 GFLOPS (128 runs)
 256 x  256: Q5_0    17.4 GFLOPS (128 runs) | Q5_1    18.7 GFLOPS (128 runs) | Q8_0    49.1 GFLOPS (128 runs)
 256 x  256: F16     13.8 GFLOPS (128 runs) | F32     10.4 GFLOPS (128 runs)
 512 x  512: Q4_0    31.1 GFLOPS (116 runs) | Q4_1    33.0 GFLOPS (124 runs)
 512 x  512: Q5_0    17.1 GFLOPS ( 64 runs) | Q5_1    20.5 GFLOPS ( 77 runs) | Q8_0    66.3 GFLOPS (128 runs)
 512 x  512: F16     14.0 GFLOPS ( 53 runs) | F32      9.3 GFLOPS ( 35 runs)
1024 x 1024: Q4_0    31.9 GFLOPS ( 16 runs) | Q4_1    31.0 GFLOPS ( 15 runs)
1024 x 1024: Q5_0    20.0 GFLOPS ( 10 runs) | Q5_1    22.9 GFLOPS ( 11 runs) | Q8_0    80.1 GFLOPS ( 38 runs)
1024 x 1024: F16     14.6 GFLOPS (  7 runs) | F32      8.9 GFLOPS (  5 runs)
2048 x 2048: Q4_0    35.9 GFLOPS (  3 runs) | Q4_1    40.1 GFLOPS (  3 runs)
2048 x 2048: Q5_0    21.2 GFLOPS (  3 runs) | Q5_1    23.6 GFLOPS (  3 runs) | Q8_0    88.0 GFLOPS (  6 runs)
2048 x 2048: F16     14.4 GFLOPS (  3 runs) | F32      8.6 GFLOPS (  3 runs)
4096 x 4096: Q4_0    35.4 GFLOPS (  3 runs) | Q4_1    39.2 GFLOPS (  3 runs)
4096 x 4096: Q5_0    20.0 GFLOPS (  3 runs) | Q5_1    21.2 GFLOPS (  3 runs) | Q8_0    85.0 GFLOPS (  3 runs)
4096 x 4096: F16     13.5 GFLOPS (  3 runs) | F32      8.2 GFLOPS (  3 runs)

./bench  -m ./models/ggml-small.en.bin -t 8 -w 1
memcpy:    9.43 GB/s (heat-up)
memcpy:    9.31 GB/s ( 1 thread)
memcpy:    9.15 GB/s ( 1 thread)
memcpy:    8.74 GB/s ( 2 thread)
memcpy:    8.67 GB/s ( 3 thread)
memcpy:    8.43 GB/s ( 4 thread)
memcpy:    8.42 GB/s ( 5 thread)
memcpy:    8.70 GB/s ( 6 thread)
memcpy:    8.63 GB/s ( 7 thread)
memcpy:    8.32 GB/s ( 8 thread)
sum:    -5119997019.000000
 ./bench-all.sh 
Usage: ./bench.sh [n_threads] [encoder-only]

Running memcpy benchmark

memcpy:    9.38 GB/s (heat-up)
memcpy:    9.39 GB/s ( 1 thread)
memcpy:    9.39 GB/s ( 1 thread)
memcpy:    9.12 GB/s ( 2 thread)
memcpy:    9.05 GB/s ( 3 thread)
memcpy:    8.68 GB/s ( 4 thread)
sum:    -3071998678.000000

memcpy:    9.38 GB/s (heat-up)
memcpy:    9.39 GB/s ( 1 thread)
memcpy:    9.39 GB/s ( 1 thread)
memcpy:    9.12 GB/s ( 2 thread)
memcpy:    9.05 GB/s ( 3 thread)
memcpy:    8.68 GB/s ( 4 thread)
sum:    -3071998678.000000

Running ggml_mul_mat benchmark with 4 threads

  64 x   64: Q4_0     7.3 GFLOPS (128 runs) | Q4_1     7.8 GFLOPS (128 runs)
  64 x   64: Q5_0     6.3 GFLOPS (128 runs) | Q5_1     6.5 GFLOPS (128 runs) | Q8_0     9.4 GFLOPS (128 runs)
  64 x   64: F16      6.2 GFLOPS (128 runs) | F32      2.4 GFLOPS (128 runs)
 128 x  128: Q4_0    15.4 GFLOPS (128 runs) | Q4_1    16.6 GFLOPS (128 runs)
 128 x  128: Q5_0    10.6 GFLOPS (128 runs) | Q5_1    11.5 GFLOPS (128 runs) | Q8_0    25.9 GFLOPS (128 runs)
 128 x  128: F16      9.0 GFLOPS (128 runs) | F32      5.8 GFLOPS (128 runs)
 256 x  256: Q4_0    19.9 GFLOPS (128 runs) | Q4_1    22.8 GFLOPS (128 runs)
 256 x  256: Q5_0    12.8 GFLOPS (128 runs) | Q5_1    13.9 GFLOPS (128 runs) | Q8_0    44.2 GFLOPS (128 runs)
 256 x  256: F16      9.4 GFLOPS (128 runs) | F32      7.6 GFLOPS (128 runs)
 512 x  512: Q4_0    21.7 GFLOPS ( 81 runs) | Q4_1    23.0 GFLOPS ( 86 runs)
 512 x  512: Q5_0    12.9 GFLOPS ( 48 runs) | Q5_1    13.9 GFLOPS ( 52 runs) | Q8_0    48.6 GFLOPS (128 runs)
 512 x  512: F16      8.9 GFLOPS ( 34 runs) | F32      6.8 GFLOPS ( 26 runs)
1024 x 1024: Q4_0    22.1 GFLOPS ( 11 runs) | Q4_1    24.9 GFLOPS ( 12 runs)
1024 x 1024: Q5_0    13.1 GFLOPS (  7 runs) | Q5_1    14.0 GFLOPS (  7 runs) | Q8_0    53.4 GFLOPS ( 25 runs)
1024 x 1024: F16      8.8 GFLOPS (  5 runs) | F32      6.5 GFLOPS (  4 runs)
2048 x 2048: Q4_0    22.6 GFLOPS (  3 runs) | Q4_1    25.7 GFLOPS (  3 runs)
2048 x 2048: Q5_0    13.1 GFLOPS (  3 runs) | Q5_1    14.7 GFLOPS (  3 runs) | Q8_0    57.1 GFLOPS (  4 runs)
2048 x 2048: F16      8.7 GFLOPS (  3 runs) | F32      6.3 GFLOPS (  3 runs)
4096 x 4096: Q4_0    21.5 GFLOPS (  3 runs) | Q4_1    23.4 GFLOPS (  3 runs)
4096 x 4096: Q5_0    12.2 GFLOPS (  3 runs) | Q5_1    13.5 GFLOPS (  3 runs) | Q8_0    53.9 GFLOPS (  3 runs)
4096 x 4096: F16      8.0 GFLOPS (  3 runs) | F32      5.7 GFLOPS (  3 runs)

Running benchmark for all models
This can take a while!
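
| CPU | OS | Config | Model | Th | Enc. | Dec. | Bch5 | PP | Commit |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| i7-11700F | Ubuntu 20.04 |  | base | 4 | ms | 15.82 | 15.05 | 15.71 | 31989a5a |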

There is an impressive benchmark result (compared to the result above, from a PC that was purchased for RMB 12000 (about USD 1700) a few years ago) with the Xiaomi 14's powerful mobile SoC, the Qualcomm SM8650-AB Snapdragon 8 Gen 3 (4 nm), running Xiaomi's HyperOS (derived from Android 14) and built with Android NDK r21e:

image

Updated on 03-20-2024: Xiaomi 14 + Android NDK r26c (NDK r26c is required for special build optimization: https://github.com/cdeos/kantv/blob/master/external/whispercpp/CMakeLists.txt#L60)

image

image

image

@obeone

obeone commented Apr 24, 2024

| CPU | OS | Config | Model | Th | Enc. | Dec. | Bch5 | PP | Commit |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Macbook M3 Pro | Sonoma 14.5 | NEON BLAS METAL | tiny | 4 | 34.15 | 1.45 | 0.47 | 0.03 | 858452d |
| Macbook M3 Pro | Sonoma 14.5 | NEON BLAS METAL | base | 4 | 59.32 | 2.27 | 0.79 | 0.05 | 858452d |
| Macbook M3 Pro | Sonoma 14.5 | NEON BLAS METAL | small | 4 | 200.45 | 5.50 | 1.75 | 0.15 | 858452d |
| Macbook M3 Pro | Sonoma 14.5 | NEON BLAS METAL | medium | 4 | 534.54 | 12.88 | 3.90 | 0.37 | 858452d |
| Macbook M3 Pro | Sonoma 14.5 | NEON BLAS METAL | large-v1 | 4 | 989.45 | 22.29 | 6.58 | 0.64 | 858452d |
| Macbook M3 Pro | Sonoma 14.5 | NEON BLAS METAL | large-v2 | 4 | 962.34 | 22.38 | 6.61 | 0.64 | 858452d |
| Macbook M3 Pro | Sonoma 14.5 | NEON BLAS METAL | large-v3 | 4 | 969.27 | 22.23 | 6.59 | 0.64 | 858452d |

@nanocosmos-ol

Different results for different code commits - older version is much faster!

CPU: AMD Ryzen 9 7950X3D 16-Core

  • commit 858452d Date: Wed Apr 24 14:56:30 2024 +0300

system_info: n_threads = 4 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | CUDA = 0 | COREML = 0 | OPENVINO = 0

whisper_print_timings: load time = 64.61 ms
whisper_print_timings: fallbacks = 0 p / 0 h
whisper_print_timings: mel time = 0.00 ms
whisper_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per run)
whisper_print_timings: encode time = 878.59 ms / 1 runs ( 878.59 ms per run)
whisper_print_timings: decode time = 935.20 ms / 256 runs ( 3.65 ms per run)
whisper_print_timings: batchd time = 544.69 ms / 320 runs ( 1.70 ms per run)
whisper_print_timings: prompt time = 3865.51 ms / 4096 runs ( 0.94 ms per run)
whisper_print_timings: total time = 6225.76 ms

  • commit d03c60d Date: Wed Nov 8 04:53:31 2023 +0700

system_info: n_threads = 4 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | COREML = 0 | OPENVINO = 0 |

whisper_print_timings: load time = 83.24 ms
whisper_print_timings: fallbacks = 0 p / 0 h
whisper_print_timings: mel time = 0.00 ms
whisper_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per run)
whisper_print_timings: encode time = 693.48 ms / 1 runs ( 693.48 ms per run)
whisper_print_timings: decode time = 874.80 ms / 256 runs ( 3.42 ms per run)
whisper_print_timings: prompt time = 2249.08 ms / 16 runs ( 140.57 ms per run)
whisper_print_timings: total time = 3817.54 ms
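
To put numbers on the difference, here is a minimal sketch comparing the two logs per run rather than by total time. The values are copied from the logs above; the `per_run` helper is just an illustrative name. Note that, going by the logs, the two builds do not appear to run the same workload: the newer one reports an extra `batchd` stage and 4096 prompt runs where the older one reports 16, so the total times are not like for like. Only encode and decode have matching run counts.

```python
# (total ms, runs) copied from the two whisper_print_timings logs above.
new = {"encode": (878.59, 1), "decode": (935.20, 256), "prompt": (3865.51, 4096)}  # 858452d
old = {"encode": (693.48, 1), "decode": (874.80, 256), "prompt": (2249.08, 16)}    # d03c60d

def per_run(entry):
    """Average time in ms for a single run of a stage."""
    total_ms, runs = entry
    return total_ms / runs

# Only compare stages whose run counts match in both logs.
for stage in ("encode", "decode"):
    assert new[stage][1] == old[stage][1]
    ratio = per_run(new[stage]) / per_run(old[stage])
    print(f"{stage}: 858452d takes {ratio:.2f}x the time of d03c60d")
```

On these figures the regression in the comparable stages is smaller than the gap in total time suggests.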

@dwindibank

A quick question: When would you want us to run this / report results?

For context, we're looking at using space on one of our old nodes to run a large number of files through Whisper (cpp). It's a server with multiple RTX 2080 Tis clustered together. I just don't know if knowing that whisper.cpp runs fast on this out-of-date (but high-spec for its time) setup is useful.

Thanks!

@BBC-Esq

BBC-Esq commented Jun 6, 2024

Hello all, I'm trying to benchmark all whisper backends but am having trouble benchmarking whisper.cpp. Since I'm unfamiliar with "compiling" I'm forced to use python bindings. I'm only aware of the following bindings but they all either haven't been updated in a long time or don't implement gpu acceleration:

Also, does whisper.cpp have "batching" by chance? Here's a sample graph I've created. Any feedback would be welcome regarding either how I'm graphing as well as how to test fairly with identical parameters and what not. Thanks!

image

P.S. faster-whisper doesn't have batching yet so, obviously, that's why there's only one graph for it...

@GrantLau1226

benchmark result with 11th Gen Intel Core(TM) i7-11700F @ 2.50GHz + Ubuntu 20.04 + gcc version 9.4.0

CPU OS Config Mode Threads Load [ms] Encode [ms]
i7-11700F Ubuntu 20.04 tiny.en 4 46.72 4654.39
i7-11700F Ubuntu 20.04 tiny.en 8 49.85 2981.43
i7-11700F Ubuntu 20.04 small.en 4 175.02 51381.51
i7-11700F Ubuntu 20.04 small.en 8 161.98 29662.80

./bench  -m ./models/ggml-small.en.bin -t 8 -w 2
  64 x   64: Q4_0     4.3 GFLOPS (128 runs) | Q4_1     4.4 GFLOPS (128 runs)
  64 x   64: Q5_0     4.0 GFLOPS (128 runs) | Q5_1     3.5 GFLOPS (128 runs) | Q8_0     4.7 GFLOPS (128 runs)
  64 x   64: F16      4.2 GFLOPS (128 runs) | F32      2.1 GFLOPS (128 runs)
 128 x  128: Q4_0    15.0 GFLOPS (128 runs) | Q4_1    15.3 GFLOPS (128 runs)
 128 x  128: Q5_0    11.9 GFLOPS (128 runs) | Q5_1    12.3 GFLOPS (128 runs) | Q8_0    21.0 GFLOPS (128 runs)
 128 x  128: F16     11.1 GFLOPS (128 runs) | F32      8.7 GFLOPS (128 runs)
 256 x  256: Q4_0    25.4 GFLOPS (128 runs) | Q4_1    29.1 GFLOPS (128 runs)
 256 x  256: Q5_0    17.4 GFLOPS (128 runs) | Q5_1    18.7 GFLOPS (128 runs) | Q8_0    49.1 GFLOPS (128 runs)
 256 x  256: F16     13.8 GFLOPS (128 runs) | F32     10.4 GFLOPS (128 runs)
 512 x  512: Q4_0    31.1 GFLOPS (116 runs) | Q4_1    33.0 GFLOPS (124 runs)
 512 x  512: Q5_0    17.1 GFLOPS ( 64 runs) | Q5_1    20.5 GFLOPS ( 77 runs) | Q8_0    66.3 GFLOPS (128 runs)
 512 x  512: F16     14.0 GFLOPS ( 53 runs) | F32      9.3 GFLOPS ( 35 runs)
1024 x 1024: Q4_0    31.9 GFLOPS ( 16 runs) | Q4_1    31.0 GFLOPS ( 15 runs)
1024 x 1024: Q5_0    20.0 GFLOPS ( 10 runs) | Q5_1    22.9 GFLOPS ( 11 runs) | Q8_0    80.1 GFLOPS ( 38 runs)
1024 x 1024: F16     14.6 GFLOPS (  7 runs) | F32      8.9 GFLOPS (  5 runs)
2048 x 2048: Q4_0    35.9 GFLOPS (  3 runs) | Q4_1    40.1 GFLOPS (  3 runs)
2048 x 2048: Q5_0    21.2 GFLOPS (  3 runs) | Q5_1    23.6 GFLOPS (  3 runs) | Q8_0    88.0 GFLOPS (  6 runs)
2048 x 2048: F16     14.4 GFLOPS (  3 runs) | F32      8.6 GFLOPS (  3 runs)
4096 x 4096: Q4_0    35.4 GFLOPS (  3 runs) | Q4_1    39.2 GFLOPS (  3 runs)
4096 x 4096: Q5_0    20.0 GFLOPS (  3 runs) | Q5_1    21.2 GFLOPS (  3 runs) | Q8_0    85.0 GFLOPS (  3 runs)
4096 x 4096: F16     13.5 GFLOPS (  3 runs) | F32      8.2 GFLOPS (  3 runs)
./bench  -m ./models/ggml-small.en.bin -t 8 -w 1
memcpy:    9.43 GB/s (heat-up)
memcpy:    9.31 GB/s ( 1 thread)
memcpy:    9.15 GB/s ( 1 thread)
memcpy:    8.74 GB/s ( 2 thread)
memcpy:    8.67 GB/s ( 3 thread)
memcpy:    8.43 GB/s ( 4 thread)
memcpy:    8.42 GB/s ( 5 thread)
memcpy:    8.70 GB/s ( 6 thread)
memcpy:    8.63 GB/s ( 7 thread)
memcpy:    8.32 GB/s ( 8 thread)
sum:    -5119997019.000000
 ./bench-all.sh 
Usage: ./bench.sh [n_threads] [encoder-only]

Running memcpy benchmark

memcpy:    9.38 GB/s (heat-up)
memcpy:    9.39 GB/s ( 1 thread)
memcpy:    9.39 GB/s ( 1 thread)
memcpy:    9.12 GB/s ( 2 thread)
memcpy:    9.05 GB/s ( 3 thread)
memcpy:    8.68 GB/s ( 4 thread)
sum:    -3071998678.000000

memcpy:    9.38 GB/s (heat-up)
memcpy:    9.39 GB/s ( 1 thread)
memcpy:    9.39 GB/s ( 1 thread)
memcpy:    9.12 GB/s ( 2 thread)
memcpy:    9.05 GB/s ( 3 thread)
memcpy:    8.68 GB/s ( 4 thread)
sum:    -3071998678.000000

Running ggml_mul_mat benchmark with 4 threads

  64 x   64: Q4_0     7.3 GFLOPS (128 runs) | Q4_1     7.8 GFLOPS (128 runs)
  64 x   64: Q5_0     6.3 GFLOPS (128 runs) | Q5_1     6.5 GFLOPS (128 runs) | Q8_0     9.4 GFLOPS (128 runs)
  64 x   64: F16      6.2 GFLOPS (128 runs) | F32      2.4 GFLOPS (128 runs)
 128 x  128: Q4_0    15.4 GFLOPS (128 runs) | Q4_1    16.6 GFLOPS (128 runs)
 128 x  128: Q5_0    10.6 GFLOPS (128 runs) | Q5_1    11.5 GFLOPS (128 runs) | Q8_0    25.9 GFLOPS (128 runs)
 128 x  128: F16      9.0 GFLOPS (128 runs) | F32      5.8 GFLOPS (128 runs)
 256 x  256: Q4_0    19.9 GFLOPS (128 runs) | Q4_1    22.8 GFLOPS (128 runs)
 256 x  256: Q5_0    12.8 GFLOPS (128 runs) | Q5_1    13.9 GFLOPS (128 runs) | Q8_0    44.2 GFLOPS (128 runs)
 256 x  256: F16      9.4 GFLOPS (128 runs) | F32      7.6 GFLOPS (128 runs)
 512 x  512: Q4_0    21.7 GFLOPS ( 81 runs) | Q4_1    23.0 GFLOPS ( 86 runs)
 512 x  512: Q5_0    12.9 GFLOPS ( 48 runs) | Q5_1    13.9 GFLOPS ( 52 runs) | Q8_0    48.6 GFLOPS (128 runs)
 512 x  512: F16      8.9 GFLOPS ( 34 runs) | F32      6.8 GFLOPS ( 26 runs)
1024 x 1024: Q4_0    22.1 GFLOPS ( 11 runs) | Q4_1    24.9 GFLOPS ( 12 runs)
1024 x 1024: Q5_0    13.1 GFLOPS (  7 runs) | Q5_1    14.0 GFLOPS (  7 runs) | Q8_0    53.4 GFLOPS ( 25 runs)
1024 x 1024: F16      8.8 GFLOPS (  5 runs) | F32      6.5 GFLOPS (  4 runs)
2048 x 2048: Q4_0    22.6 GFLOPS (  3 runs) | Q4_1    25.7 GFLOPS (  3 runs)
2048 x 2048: Q5_0    13.1 GFLOPS (  3 runs) | Q5_1    14.7 GFLOPS (  3 runs) | Q8_0    57.1 GFLOPS (  4 runs)
2048 x 2048: F16      8.7 GFLOPS (  3 runs) | F32      6.3 GFLOPS (  3 runs)
4096 x 4096: Q4_0    21.5 GFLOPS (  3 runs) | Q4_1    23.4 GFLOPS (  3 runs)
4096 x 4096: Q5_0    12.2 GFLOPS (  3 runs) | Q5_1    13.5 GFLOPS (  3 runs) | Q8_0    53.9 GFLOPS (  3 runs)
4096 x 4096: F16      8.0 GFLOPS (  3 runs) | F32      5.7 GFLOPS (  3 runs)

Running benchmark for all models
This can take a while!

CPU OS Config Model Th Enc. Dec. Bch5 PP Commit
i7-11700F Ubuntu 20.04 base 4 ms 15.82 15.05 15.71 31989a5a
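
For anyone who wants to compare the ggml_mul_mat lines above across machines without reading them by eye, here is a minimal parsing sketch. It assumes the line layout shown in this thread (`<N> x <N>: <TYPE> <value> GFLOPS (<n> runs) | ...`); the function name `parse_mul_mat` is made up for illustration and is not part of whisper.cpp.

```python
import re

# Matches one "<TYPE> <value> GFLOPS (<n> runs)" entry, e.g. "Q4_0    22.1 GFLOPS ( 11 runs)".
ENTRY_RE = re.compile(r"(\w+)\s+([\d.]+)\s+GFLOPS\s+\(\s*(\d+)\s+runs\)")
# Matches the leading matrix size, e.g. "1024 x 1024:".
SIZE_RE = re.compile(r"^\s*(\d+)\s+x\s+(\d+):")


def parse_mul_mat(text):
    """Return {(rows, cols): {type_name: (gflops, runs)}} for each matching line."""
    results = {}
    for line in text.splitlines():
        size = SIZE_RE.match(line)
        if not size:
            continue  # skip headers, blank lines, and memcpy lines
        key = (int(size.group(1)), int(size.group(2)))
        for name, gflops, runs in ENTRY_RE.findall(line):
            results.setdefault(key, {})[name] = (float(gflops), int(runs))
    return results


# Three lines copied from the 4-thread i7-11700F output above.
sample = """\
4096 x 4096: Q4_0    21.5 GFLOPS (  3 runs) | Q4_1    23.4 GFLOPS (  3 runs)
4096 x 4096: Q5_0    12.2 GFLOPS (  3 runs) | Q5_1    13.5 GFLOPS (  3 runs) | Q8_0    53.9 GFLOPS (  3 runs)
4096 x 4096: F16      8.0 GFLOPS (  3 runs) | F32      5.7 GFLOPS (  3 runs)
"""

parsed = parse_mul_mat(sample)
# Pick the type with the highest GFLOPS at 4096 x 4096.
best = max(parsed[(4096, 4096)].items(), key=lambda kv: kv[1][0])
print(best)
```

Keying on the matrix size keeps the three output lines per size merged into one record, so results from different comments can be compared type by type.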
Here is an impressive benchmark result (compared with the result above, which comes from a PC bought a few years ago for RMB 12,000, about USD 1,700) from the Xiaomi 14's powerful mobile SoC: Qualcomm SM8650-AB Snapdragon 8 Gen 3 (4 nm) + Xiaomi's HyperOS (derived from Android 14) + Android NDK r21e:

[image 2106054156]

Updated on 03-20-2024: Xiaomi 14 + Android NDK r26c (NDK r26c is required for a special build optimization: https://github.com/cdeos/kantv/blob/master/external/whispercpp/CMakeLists.txt#L60)

[image 514487122]

[image 1755516838]

[image 30679635]

Excuse me, may I ask how you built the benchmark app? I am stuck because I am not able to run the benchmark on my phone. Thanks for your answer.

@zhouwg
Contributor

zhouwg commented Jun 8, 2024

I ran the original "bench" tool (the one generated by the original build system of the whisper.cpp project) on x86 Linux (Ubuntu 20.04).

Benchmarking on an Android phone is a separate topic and scenario. The official whisper.cpp project doesn't cover it: they focus on the core implementation and its improvement, and on macOS (iOS)/Windows/Linux (I personally think of Android as another special Linux distribution).

I maintain a dedicated ggml learning and study project focused on Android, and it also provides some benchmark items.
[image 1997071773]
[image 652081312]

BTW, the code for the two benchmark items above is essentially the same as the original benchmark code in the whisper.cpp project.

@StuartIanNaylor

StuartIanNaylor commented Jun 13, 2024

Hello all, I'm trying to benchmark all whisper backends but am having trouble benchmarking whisper.cpp. Since I'm unfamiliar with "compiling", I'm forced to use Python bindings. I'm only aware of the following bindings, but they all either haven't been updated in a long time or don't implement GPU acceleration:

Also, does whisper.cpp have "batching" by chance? Here's a sample graph I've created. Any feedback would be welcome, both on how I'm graphing and on how to test fairly with identical parameters and whatnot. Thanks!

P.S. faster-whisper doesn't have batching yet so, obviously, that's why there's only one graph for it...
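
On the question of testing fairly, one common approach is to run every backend through the same timing harness, with warm-up runs discarded and the median of several timed runs reported. The sketch below is only an illustration of that idea: `time_call` and `fake_transcribe` are hypothetical names, and the workload is a stand-in rather than a call into any real whisper binding.

```python
import statistics
import time


def time_call(fn, warmup=1, repeats=5):
    """Run fn() `warmup` times untimed, then `repeats` times timed; return the median in seconds."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def fake_transcribe():
    # Hypothetical stand-in for a real backend call. For a fair comparison,
    # each backend would get the same audio file, model size, precision,
    # and beam size.
    sum(i * i for i in range(10_000))


median_s = time_call(fake_transcribe, warmup=2, repeats=5)
print(f"median: {median_s * 1000:.2f} ms")
```

The median is used instead of the mean so that a single slow run (for example, one hit by thermal throttling) does not dominate the reported figure.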

https://github.com/ggerganov/whisper.cpp?tab=readme-ov-file#quick-start

@aleksas

aleksas commented Sep 28, 2024

System Info

  • The C compiler: GNU 12.2.0
  • The CXX compiler: GNU 12.2.0
  • Docker container image: debian:bookworm
  • Docker host: Ubuntu 22.04.4 LTS
  • CPU: Intel(R) Core(TM) i7-8650U CPU @ 1.90GHz
n_threads AVX AVX2 AVX512 FMA NEON ARM_FMA METAL F16C FP16_VA WASM_SIMD BLAS SSE3 SSSE3 VSX CUDA COREML OPENVINO
4 / 8 1 1 0 1 0 0 0 1 0 0 0 1 1 0 0 0 0

memcpy

./bench -w 1 -t 1
memcpy:    4.48 GB/s (heat-up)
memcpy:    5.13 GB/s ( 1 thread)
memcpy:    5.48 GB/s ( 1 thread)
sum:    -1535998239.000000

ggml_mul_mat

./bench -w 2 -t 1
  64 x   64: Q4_0     2.6 GFLOPS (128 runs) | Q4_1     2.6 GFLOPS (128 runs)
  64 x   64: Q5_0     2.4 GFLOPS (128 runs) | Q5_1     2.3 GFLOPS (128 runs) | Q8_0     2.8 GFLOPS (128 runs)
  64 x   64: F16      3.2 GFLOPS (128 runs) | F32      0.7 GFLOPS (128 runs)
 128 x  128: Q4_0     4.3 GFLOPS (128 runs) | Q4_1     4.5 GFLOPS (128 runs)
 128 x  128: Q5_0     4.2 GFLOPS (128 runs) | Q5_1     4.0 GFLOPS (128 runs) | Q8_0     5.4 GFLOPS (128 runs)
 128 x  128: F16      5.7 GFLOPS (128 runs) | F32      2.9 GFLOPS (128 runs)
 256 x  256: Q4_0     6.9 GFLOPS (128 runs) | Q4_1     6.0 GFLOPS (128 runs)
 256 x  256: Q5_0     6.0 GFLOPS (128 runs) | Q5_1     5.4 GFLOPS (128 runs) | Q8_0     9.5 GFLOPS (128 runs)
 256 x  256: F16      8.3 GFLOPS (128 runs) | F32      5.4 GFLOPS (128 runs)
 512 x  512: Q4_0     9.2 GFLOPS ( 35 runs) | Q4_1     8.0 GFLOPS ( 30 runs)
 512 x  512: Q5_0     7.1 GFLOPS ( 27 runs) | Q5_1     7.1 GFLOPS ( 27 runs) | Q8_0    11.2 GFLOPS ( 42 runs)
 512 x  512: F16      9.0 GFLOPS ( 34 runs) | F32      5.0 GFLOPS ( 19 runs)
1024 x 1024: Q4_0    10.2 GFLOPS (  5 runs) | Q4_1     9.1 GFLOPS (  5 runs)
1024 x 1024: Q5_0     8.4 GFLOPS (  4 runs) | Q5_1     8.1 GFLOPS (  4 runs) | Q8_0    13.4 GFLOPS (  7 runs)
1024 x 1024: F16      8.8 GFLOPS (  5 runs) | F32      4.0 GFLOPS (  3 runs)
2048 x 2048: Q4_0    11.4 GFLOPS (  3 runs) | Q4_1    10.2 GFLOPS (  3 runs)
2048 x 2048: Q5_0     7.9 GFLOPS (  3 runs) | Q5_1     7.5 GFLOPS (  3 runs) | Q8_0    11.3 GFLOPS (  3 runs)
2048 x 2048: F16      7.8 GFLOPS (  3 runs) | F32      4.4 GFLOPS (  3 runs)
4096 x 4096: Q4_0     9.7 GFLOPS (  3 runs) | Q4_1     9.7 GFLOPS (  3 runs)
4096 x 4096: Q5_0     7.9 GFLOPS (  3 runs) | Q5_1     7.4 GFLOPS (  3 runs) | Q8_0    11.5 GFLOPS (  3 runs)

CPU OS Config Model Th Load Enc. Commit
i7-8650U Ubuntu 22.04.4 LTS AVX2 tiny 4 251.99 2350.88 c7b6988
i7-8650U Ubuntu 22.04.4 LTS AVX2 base 4 387.53 5259.46 c7b6988
i7-8650U Ubuntu 22.04.4 LTS AVX2 small 4 939.49 22799.70 c7b6988
i7-8650U Ubuntu 22.04.4 LTS AVX2 medium 4 2679.57 62713.56 c7b6988

@aleksas

aleksas commented Oct 1, 2024

System Info

  • The C compiler: gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
  • The CXX compiler: g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
  • Docker container image: nvidia/cuda:12.2.2-cudnn8-devel-ubuntu22.04
  • Docker host: Ubuntu 24.04 LTS
  • CPU: Intel(R) Xeon(R) CPU E5-1660 v3 @ 3.00GHz
  • GPU: NVIDIA GeForce RTX 4090
n_threads AVX AVX2 AVX512 FMA NEON ARM_FMA METAL F16C FP16_VA WASM_SIMD BLAS SSE3 SSSE3 VSX CUDA COREML OPENVINO
8 / 16 1 1 0 1 0 0 0 1 0 0 1 1 1 0 1 0 0

memcpy

./bench -w 1 -t 1
memcpy:   13.44 GB/s (heat-up)
memcpy:   13.53 GB/s ( 1 thread)
memcpy:   13.49 GB/s ( 1 thread)
sum:    -1535998239.000000

ggml_mul_mat

./bench -w 2 -t 1
  64 x   64: Q4_0    10.3 GFLOPS (128 runs) | Q4_1     9.8 GFLOPS (128 runs)
  64 x   64: Q5_0     9.3 GFLOPS (128 runs) | Q5_1     8.7 GFLOPS (128 runs) | Q8_0    11.0 GFLOPS (128 runs)
  64 x   64: F16     11.0 GFLOPS (128 runs) | F32      3.0 GFLOPS (128 runs)
 128 x  128: Q4_0    15.5 GFLOPS (128 runs) | Q4_1    15.1 GFLOPS (128 runs)
 128 x  128: Q5_0    13.7 GFLOPS (128 runs) | Q5_1    13.2 GFLOPS (128 runs) | Q8_0    17.6 GFLOPS (128 runs)
 128 x  128: F16     15.6 GFLOPS (128 runs) | F32      9.7 GFLOPS (128 runs)
 256 x  256: Q4_0    20.0 GFLOPS (128 runs) | Q4_1    19.1 GFLOPS (128 runs)
 256 x  256: Q5_0    16.5 GFLOPS (128 runs) | Q5_1    16.0 GFLOPS (128 runs) | Q8_0    23.3 GFLOPS (128 runs)
 256 x  256: F16     19.4 GFLOPS (128 runs) | F32     14.5 GFLOPS (128 runs)
 512 x  512: Q4_0    24.0 GFLOPS ( 90 runs) | Q4_1    23.8 GFLOPS ( 89 runs)
 512 x  512: Q5_0    20.1 GFLOPS ( 76 runs) | Q5_1    19.7 GFLOPS ( 74 runs) | Q8_0    27.8 GFLOPS (104 runs)
 512 x  512: F16     22.9 GFLOPS ( 86 runs) | F32     13.6 GFLOPS ( 51 runs)
1024 x 1024: Q4_0    26.6 GFLOPS ( 13 runs) | Q4_1    27.1 GFLOPS ( 13 runs)
1024 x 1024: Q5_0    21.7 GFLOPS ( 11 runs) | Q5_1    21.5 GFLOPS ( 11 runs) | Q8_0    32.3 GFLOPS ( 16 runs)
1024 x 1024: F16     23.9 GFLOPS ( 12 runs) | F32     13.2 GFLOPS (  7 runs)
2048 x 2048: Q4_0    28.0 GFLOPS (  3 runs) | Q4_1    29.1 GFLOPS (  3 runs)
2048 x 2048: Q5_0    22.4 GFLOPS (  3 runs) | Q5_1    23.3 GFLOPS (  3 runs) | Q8_0    34.4 GFLOPS (  3 runs)
2048 x 2048: F16     24.6 GFLOPS (  3 runs) | F32     12.7 GFLOPS (  3 runs)
4096 x 4096: Q4_0    29.3 GFLOPS (  3 runs) | Q4_1    30.3 GFLOPS (  3 runs)
4096 x 4096: Q5_0    22.9 GFLOPS (  3 runs) | Q5_1    24.0 GFLOPS (  3 runs) | Q8_0    35.3 GFLOPS (  3 runs)
4096 x 4096: F16     24.4 GFLOPS (  3 runs) | F32     11.1 GFLOPS (  3 runs)

Config Model Th Load Enc. Commit
CUDA tiny 8 196.73 2.67
CUDA base 8 213.35 5.36
CUDA small 8 313.44 16.15
CUDA medium 8 570.86 41.80

@aleksas

aleksas commented Oct 1, 2024

System Info

  • The C compiler: gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
  • The CXX compiler: g++ (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
  • Docker container image: nvidia/cuda:11.4.3-cudnn8-devel-ubuntu20.04
  • Docker host: Ubuntu 24.04 LTS
  • CPU: Intel(R) Core(TM) i5-6600K CPU @ 3.50GHz
  • GPU: NVIDIA GeForce GTX 1080 Ti
n_threads AVX AVX2 AVX512 FMA NEON ARM_FMA METAL F16C FP16_VA WASM_SIMD BLAS SSE3 SSSE3 VSX CUDA COREML OPENVINO
4 / 4 1 1 0 1 0 0 0 1 0 0 1 1 1 0 1 0 0

memcpy

./bench -w 1 -t 1
memcpy:   13.62 GB/s (heat-up)
memcpy:   13.54 GB/s ( 1 thread)
memcpy:   13.62 GB/s ( 1 thread)
sum:    -1535998239.000000

ggml_mul_mat

./bench -w 2 -t 1
  64 x   64: Q4_0    12.2 GFLOPS (128 runs) | Q4_1    11.3 GFLOPS (128 runs)
  64 x   64: Q5_0    11.3 GFLOPS (128 runs) | Q5_1    10.1 GFLOPS (128 runs) | Q8_0    13.2 GFLOPS (128 runs)
  64 x   64: F16     15.3 GFLOPS (128 runs) | F32      3.6 GFLOPS (128 runs)
 128 x  128: Q4_0    19.4 GFLOPS (128 runs) | Q4_1    16.9 GFLOPS (128 runs)
 128 x  128: Q5_0    17.0 GFLOPS (128 runs) | Q5_1    15.5 GFLOPS (128 runs) | Q8_0    21.5 GFLOPS (128 runs)
 128 x  128: F16     22.3 GFLOPS (128 runs) | F32     10.7 GFLOPS (128 runs)
 256 x  256: Q4_0    24.7 GFLOPS (128 runs) | Q4_1    20.5 GFLOPS (128 runs)
 256 x  256: Q5_0    20.4 GFLOPS (128 runs) | Q5_1    18.8 GFLOPS (128 runs) | Q8_0    28.2 GFLOPS (128 runs)
 256 x  256: F16     29.2 GFLOPS (128 runs) | F32     15.4 GFLOPS (128 runs)
 512 x  512: Q4_0    28.9 GFLOPS (108 runs) | Q4_1    25.7 GFLOPS ( 96 runs)
 512 x  512: Q5_0    24.9 GFLOPS ( 93 runs) | Q5_1    23.4 GFLOPS ( 87 runs) | Q8_0    34.3 GFLOPS (128 runs)
 512 x  512: F16     35.0 GFLOPS (128 runs) | F32     13.8 GFLOPS ( 52 runs)
1024 x 1024: Q4_0    33.6 GFLOPS ( 16 runs) | Q4_1    30.2 GFLOPS ( 15 runs)
1024 x 1024: Q5_0    28.3 GFLOPS ( 14 runs) | Q5_1    26.9 GFLOPS ( 13 runs) | Q8_0    40.4 GFLOPS ( 19 runs)
1024 x 1024: F16     33.3 GFLOPS ( 16 runs) | F32     12.9 GFLOPS (  7 runs)
2048 x 2048: Q4_0    36.1 GFLOPS (  3 runs) | Q4_1    32.8 GFLOPS (  3 runs)
2048 x 2048: Q5_0    29.5 GFLOPS (  3 runs) | Q5_1    28.5 GFLOPS (  3 runs) | Q8_0    42.6 GFLOPS (  3 runs)
2048 x 2048: F16     31.0 GFLOPS (  3 runs) | F32     12.2 GFLOPS (  3 runs)
4096 x 4096: Q4_0    36.6 GFLOPS (  3 runs) | Q4_1    33.6 GFLOPS (  3 runs)
4096 x 4096: Q5_0    30.7 GFLOPS (  3 runs) | Q5_1    29.5 GFLOPS (  3 runs) | Q8_0    42.9 GFLOPS (  3 runs)
4096 x 4096: F16     30.1 GFLOPS (  3 runs) | F32     11.7 GFLOPS (  3 runs)

Config Model Th Load Enc. Commit
CUDA tiny 4 196.22 13.52
CUDA base 4 211.40 27.67
CUDA small 4 310.89 91.85
CUDA medium 4 861.47 233.79
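
To put the three encoder tables above side by side, here is a small worked comparison. It only divides the reported Enc. values, assuming they are all in milliseconds as printed by the bench tool; the machine labels and the `speedup` helper are shorthand introduced here. The three runs come from different hosts and thread counts, so the ratios mix CPU, GPU, and build differences.

```python
# Encoder times copied from the three result tables above (assumed to be milliseconds).
enc_ms = {
    "i7-8650U CPU": {"tiny": 2350.88, "base": 5259.46, "small": 22799.70, "medium": 62713.56},
    "1080 Ti CUDA": {"tiny": 13.52, "base": 27.67, "small": 91.85, "medium": 233.79},
    "4090 CUDA": {"tiny": 2.67, "base": 5.36, "small": 16.15, "medium": 41.80},
}


def speedup(baseline, other):
    """Per-model ratio baseline_time / other_time (values > 1 mean `other` is faster)."""
    return {m: enc_ms[baseline][m] / enc_ms[other][m] for m in enc_ms[baseline]}


cpu_vs_4090 = speedup("i7-8650U CPU", "4090 CUDA")
ti_vs_4090 = speedup("1080 Ti CUDA", "4090 CUDA")

for model in enc_ms["i7-8650U CPU"]:
    print(f"{model:>6}: 4090 vs CPU {cpu_vs_4090[model]:7.1f}x, "
          f"4090 vs 1080 Ti {ti_vs_4090[model]:4.1f}x")
```

On these numbers the 4090 run is roughly five to six times faster than the 1080 Ti run for every model size, while the gap to the laptop CPU is in the hundreds to low thousands.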

@slavanorm

Which is faster on a Mac M1: turbo compiled with CoreML, or turbo_q5 without it?

@magnacartatron

magnacartatron commented Nov 11, 2024

M4 Mac Mini (base model) - CoreML flags

./bench -w 1 -t 1 
memcpy:   34.01 GB/s (heat-up)
memcpy:   41.20 GB/s ( 1 thread)
memcpy:   41.44 GB/s ( 1 thread)
sum:    -1536000387.000000
./bench -w 2 -t 1
  64 x   64: Q4_0    40.8 GFLOPS (128 runs) | Q4_1    38.2 GFLOPS (128 runs)
  64 x   64: Q5_0    29.6 GFLOPS (128 runs) | Q5_1    28.8 GFLOPS (128 runs) | Q8_0    44.9 GFLOPS (128 runs)
  64 x   64: F16     43.4 GFLOPS (128 runs) | F32     30.3 GFLOPS (128 runs)
 128 x  128: Q4_0    71.0 GFLOPS (128 runs) | Q4_1    62.6 GFLOPS (128 runs)
 128 x  128: Q5_0    45.7 GFLOPS (128 runs) | Q5_1    42.2 GFLOPS (128 runs) | Q8_0    73.7 GFLOPS (128 runs)
 128 x  128: F16     65.4 GFLOPS (128 runs) | F32     35.2 GFLOPS (128 runs)
 256 x  256: Q4_0    81.4 GFLOPS (128 runs) | Q4_1    72.9 GFLOPS (128 runs)
 256 x  256: Q5_0    57.0 GFLOPS (128 runs) | Q5_1    50.9 GFLOPS (128 runs) | Q8_0    92.7 GFLOPS (128 runs)
 256 x  256: F16     69.4 GFLOPS (128 runs) | F32     40.9 GFLOPS (128 runs)
 512 x  512: Q4_0    85.5 GFLOPS (128 runs) | Q4_1    76.9 GFLOPS (128 runs)
 512 x  512: Q5_0    62.1 GFLOPS (128 runs) | Q5_1    54.1 GFLOPS (128 runs) | Q8_0   100.8 GFLOPS (128 runs)
 512 x  512: F16     81.0 GFLOPS (128 runs) | F32     44.7 GFLOPS (128 runs)
1024 x 1024: Q4_0    89.6 GFLOPS ( 42 runs) | Q4_1    80.0 GFLOPS ( 38 runs)
1024 x 1024: Q5_0    65.8 GFLOPS ( 31 runs) | Q5_1    56.9 GFLOPS ( 27 runs) | Q8_0   110.5 GFLOPS ( 52 runs)
1024 x 1024: F16     88.0 GFLOPS ( 41 runs) | F32     43.4 GFLOPS ( 21 runs)
2048 x 2048: Q4_0    92.2 GFLOPS (  6 runs) | Q4_1    81.4 GFLOPS (  5 runs)
2048 x 2048: Q5_0    67.2 GFLOPS (  4 runs) | Q5_1    57.9 GFLOPS (  4 runs) | Q8_0   116.6 GFLOPS (  7 runs)
2048 x 2048: F16     83.7 GFLOPS (  5 runs) | F32     37.3 GFLOPS (  3 runs)
4096 x 4096: Q4_0    92.7 GFLOPS (  3 runs) | Q4_1    81.5 GFLOPS (  3 runs)
4096 x 4096: Q5_0    67.9 GFLOPS (  3 runs) | Q5_1    58.2 GFLOPS (  3 runs) | Q8_0   119.7 GFLOPS (  3 runs)
4096 x 4096: F16     73.4 GFLOPS (  3 runs) | F32     35.8 GFLOPS (  3 runs)

M1 Ultra 48 Core GPU 64 GB - Standard Metal

./bench -w 1 -t 1
memcpy:   36.55 GB/s (heat-up)
memcpy:   41.00 GB/s ( 1 thread)
memcpy:   41.86 GB/s ( 1 thread)
sum:    -1536000387.000000
./bench -w 2 -t 1
  64 x   64: Q4_0    20.0 GFLOPS (128 runs) | Q4_1    18.4 GFLOPS (128 runs)
  64 x   64: Q5_0    15.3 GFLOPS (128 runs) | Q5_1    14.8 GFLOPS (128 runs) | Q8_0    21.1 GFLOPS (128 runs)
  64 x   64: F16     21.3 GFLOPS (128 runs) | F32     16.4 GFLOPS (128 runs)
 128 x  128: Q4_0    40.9 GFLOPS (128 runs) | Q4_1    37.3 GFLOPS (128 runs)
 128 x  128: Q5_0    27.5 GFLOPS (128 runs) | Q5_1    26.0 GFLOPS (128 runs) | Q8_0    44.1 GFLOPS (128 runs)
 128 x  128: F16     43.8 GFLOPS (128 runs) | F32     27.4 GFLOPS (128 runs)
 256 x  256: Q4_0    51.0 GFLOPS (128 runs) | Q4_1    45.2 GFLOPS (128 runs)
 256 x  256: Q5_0    33.4 GFLOPS (128 runs) | Q5_1    31.3 GFLOPS (128 runs) | Q8_0    58.6 GFLOPS (128 runs)
 256 x  256: F16     53.3 GFLOPS (128 runs) | F32     30.5 GFLOPS (128 runs)
 512 x  512: Q4_0    59.9 GFLOPS (128 runs) | Q4_1    53.1 GFLOPS (128 runs)
 512 x  512: Q5_0    39.3 GFLOPS (128 runs) | Q5_1    35.4 GFLOPS (128 runs) | Q8_0    71.5 GFLOPS (128 runs)
 512 x  512: F16     62.8 GFLOPS (128 runs) | F32     33.4 GFLOPS (125 runs)
1024 x 1024: Q4_0    65.0 GFLOPS ( 31 runs) | Q4_1    58.1 GFLOPS ( 28 runs)
1024 x 1024: Q5_0    42.6 GFLOPS ( 20 runs) | Q5_1    38.3 GFLOPS ( 18 runs) | Q8_0    80.2 GFLOPS ( 38 runs)
1024 x 1024: F16     64.9 GFLOPS ( 31 runs) | F32     30.5 GFLOPS ( 15 runs)
2048 x 2048: Q4_0    68.1 GFLOPS (  4 runs) | Q4_1    60.3 GFLOPS (  4 runs)
2048 x 2048: Q5_0    44.1 GFLOPS (  3 runs) | Q5_1    39.5 GFLOPS (  3 runs) | Q8_0    85.8 GFLOPS (  5 runs)
2048 x 2048: F16     60.0 GFLOPS (  4 runs) | F32     25.4 GFLOPS (  3 runs)
4096 x 4096: Q4_0    69.4 GFLOPS (  3 runs) | Q4_1    61.0 GFLOPS (  3 runs)
4096 x 4096: Q5_0    44.9 GFLOPS (  3 runs) | Q5_1    39.8 GFLOPS (  3 runs) | Q8_0    85.8 GFLOPS (  3 runs)
4096 x 4096: F16     50.8 GFLOPS (  3 runs) | F32     25.3 GFLOPS (  3 runs)

i5-14600K, 4070 Ti Super 16GB (555 drivers), 32GB, Ubuntu 24.04 - CUDA version

./bench -w 1 -t 1
memcpy:   23.88 GB/s (heat-up)
memcpy:   24.09 GB/s ( 1 thread)
memcpy:   24.25 GB/s ( 1 thread)
sum:    -1535998239.000000
./bench -w 2 -t 1
  64 x   64: Q4_0    26.7 GFLOPS (128 runs) | Q4_1    26.1 GFLOPS (128 runs)
  64 x   64: Q5_0    24.3 GFLOPS (128 runs) | Q5_1    22.6 GFLOPS (128 runs) | Q8_0    29.9 GFLOPS (128 runs)
  64 x   64: F16     34.6 GFLOPS (128 runs) | F32      6.5 GFLOPS (128 runs)
 128 x  128: Q4_0    43.0 GFLOPS (128 runs) | Q4_1    41.5 GFLOPS (128 runs)
 128 x  128: Q5_0    37.8 GFLOPS (128 runs) | Q5_1    34.9 GFLOPS (128 runs) | Q8_0    51.7 GFLOPS (128 runs)
 128 x  128: F16     45.8 GFLOPS (128 runs) | F32     14.7 GFLOPS (128 runs)
 256 x  256: Q4_0    57.1 GFLOPS (128 runs) | Q4_1    53.5 GFLOPS (128 runs)
 256 x  256: Q5_0    47.7 GFLOPS (128 runs) | Q5_1    44.9 GFLOPS (128 runs) | Q8_0    69.9 GFLOPS (128 runs)
 256 x  256: F16     50.6 GFLOPS (128 runs) | F32     23.1 GFLOPS (128 runs)
 512 x  512: Q4_0    66.7 GFLOPS (128 runs) | Q4_1    61.3 GFLOPS (128 runs)
 512 x  512: Q5_0    53.2 GFLOPS (128 runs) | Q5_1    50.6 GFLOPS (128 runs) | Q8_0    80.3 GFLOPS (128 runs)
 512 x  512: F16     53.2 GFLOPS (128 runs) | F32     26.2 GFLOPS ( 98 runs)
1024 x 1024: Q4_0    74.2 GFLOPS ( 35 runs) | Q4_1    67.3 GFLOPS ( 32 runs)
1024 x 1024: Q5_0    58.4 GFLOPS ( 28 runs) | Q5_1    54.0 GFLOPS ( 26 runs) | Q8_0    82.9 GFLOPS ( 39 runs)
1024 x 1024: F16     54.7 GFLOPS ( 26 runs) | F32     25.9 GFLOPS ( 13 runs)
2048 x 2048: Q4_0    78.4 GFLOPS (  5 runs) | Q4_1    70.3 GFLOPS (  5 runs)
2048 x 2048: Q5_0    61.0 GFLOPS (  4 runs) | Q5_1    55.5 GFLOPS (  4 runs) | Q8_0    85.6 GFLOPS (  5 runs)
2048 x 2048: F16     55.1 GFLOPS (  4 runs) | F32     25.9 GFLOPS (  3 runs)
4096 x 4096: Q4_0    79.4 GFLOPS (  3 runs) | Q4_1    71.7 GFLOPS (  3 runs)
4096 x 4096: Q5_0    62.5 GFLOPS (  3 runs) | Q5_1    55.1 GFLOPS (  3 runs) | Q8_0    83.0 GFLOPS (  3 runs)
4096 x 4096: F16     51.4 GFLOPS (  3 runs) | F32     24.4 GFLOPS (  3 runs)

What is strange is that in the standard ./bench -m models/ggml-large-v3-turbo.bin run the M1 Ultra is about twice as fast. In practice, though, the M4 Mac Mini (with CoreML-converted models) is only about 10% slower on hour-long transcriptions than the M1 Ultra 48-core (using the standard Metal build, make -j).

So the M4 is quite a beefy CPU. The ANE is nice, though limited in what it can do, and the GPU is about 2x M1 performance when running MLX models: e.g. 24 tokens per second on M1 vs 45 on M4 vs 120 on M1 Ultra, using Llama 3.2 3B 4-bit MLX. I'm surprised that the 4_k quant running on a 4070 Ti Super also gets about 120 tokens/s.
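
As a quick check of the "about 2x" figure, the tokens-per-second numbers quoted above can be expressed relative to the M1. This is just arithmetic on the numbers in the comment; the dictionary keys are labels chosen here.

```python
# Tokens per second as quoted in the comment above (Llama 3.2 3B 4-bit MLX).
tokens_per_s = {"M1": 24, "M4": 45, "M1 Ultra": 120}

# Throughput of each chip relative to the M1.
ratios = {name: tps / tokens_per_s["M1"] for name, tps in tokens_per_s.items()}
for name, r in ratios.items():
    print(f"{name}: {r:.2f}x M1")
```

45 / 24 is 1.875, which is where "about 2x" comes from, and the M1 Ultra at 120 tokens/s is exactly 5x the M1 figure.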
