WIP: Learning high-performance Go by doing the One Billion Row Challenge
Find more details about:
- the One Billion Row Challenge at https://1brc.dev/
- the leaderboard and how to generate or run the solutions at https://github.com/gunnarmorling/1brc
My system information (Dell Inspiron 5567 with Ubuntu 22.04.1 LTS):
$ lscpu | sed -n '1p;5p;8p;11p;20,23p'
Architecture: x86_64
CPU(s): 4
Model name: Intel(R) Core(TM) i7-7500U CPU @ 2.70GHz
Thread(s) per core: 2
L1d cache: 64 KiB (2 instances)
L1i cache: 64 KiB (2 instances)
L2 cache: 512 KiB (2 instances)
L3 cache: 4 MiB (1 instance)
$ free -g -h -t | grep Total | cut -b 17-20
17Gi
Apple M1 Pro:
$ system_profiler SPHardwareDataType | sed -n '8,9p'
Chip: Apple M1 Pro
Total Number of Cores: 10 (8 performance and 2 efficiency)
$ uname -m
arm64
$ sysctl -a | grep cache | sed -n '32,35p'
hw.cachelinesize: 128
hw.l1icachesize: 131072
hw.l1dcachesize: 65536
hw.l2cachesize: 4194304
$ sysctl -n hw.memsize | numfmt --to=si
35G
Reference implementations on my machine:
Execution time for 1B rows
$ time ./calculate_average_baseline.sh > /dev/null
378,93s user 12,82s system 99% cpu 6:32,96 total
$ time ./calculate_average_thomaswue.sh > /dev/null
0,23s user 0,17s system 0% cpu 1:02,58 total
$ time ./calculate_average_AlexanderYastrebov.sh > /dev/null
167,48s user 15,02s system 307% cpu 59,309 total
Execution time for 1M rows
$ time ./calculate_average_baseline.sh
1,39s user 0,30s system 149% cpu 1,132 total
$ time ./calculate_average_thomaswue.sh > /dev/null
1,16s user 0,22s system 200% cpu 0,689 total
$ time ./calculate_average_AlexanderYastrebov.sh
0,24s user 0,10s system 171% cpu 0,194 total
Reference implementations on M1 Pro:
Execution time for 1B rows
$ time ./calculate_average_baseline.sh > /dev/null
206.19s user 6.63s system 100% cpu 3:32.62 total
$ time ./calculate_average_thomaswue.sh > /dev/null
24.19s user 2.20s system 674% cpu 3.914 total
$ time ./calculate_average_AlexanderYastrebov.sh > /dev/null
45.35s user 1.90s system 858% cpu 5.505 total
Execution time for 1M rows
$ time ./calculate_average_baseline.sh
0.57s user 0.08s system 99% cpu 0.653 total
$ time ./calculate_average_thomaswue.sh > /dev/null
0.69s user 0.06s system 137% cpu 0.552 total
$ time ./calculate_average_AlexanderYastrebov.sh
0.13s user 0.03s system 137% cpu 0.115 total
For reference:
- 1BRC Baseline: 392.96s (M1: 212.62s)
- Thomas Wuerthinger: 62.58s (M1: 3.91s)
- Alexander Yastrebov (Go): 59.30s (M1: 5.50s)
My solution steps:
Step | Description | Exec. time | Improvement (vs prev.) | Improvement (vs baseline) | Commit |
---|---|---|---|---|---|
1 | Naive approach | 286s (M1: 167s) | - | - | 6bc5f94 |
2 | Parallel measurement processors | 243s (M1: 164s) | 1.177x (M1: 1.018x) | 1.177x (M1: 1.018x) | b652f32 |
3 | Batch read file lines | 167s (M1: 62s) | 1.455x (M1: 2.645x) | 1.712x (M1: 2.693x) | 66f92ce |
4 | Batch process lines | 168s (M1: 68s) | 0.994x (M1: 0.911x) | 1.702x (M1: 2.455x) | a9a44aa |
5 | Use L3 size chunks | 93s (M1: 16s) | 1.806x (M1: 4.25x) | 3.075x (M1: 10.437x) | 2668364 |
6 | Refactor to parallel read and process | 100s (M1: 17s) | 0.930x (M1: 0.941x) | 2.860x (M1: 9.823x) | b0fac51 |
7 | Optimize map allocation for processing | 97s (M1: 16s) | 1.031x (M1: 1.062x) | 2.948x (M1: 10.437x) | 6ba10e5 |
8 | PGO and GC tuning | 86s (M1: 15s) | 1.128x (M1: 1.066x) | 3.325x (M1: 11.133x) | 14b9f35 |
Comments for the steps:
- Naive approach: Sequential file read and processing using 1 CPU core.
- Parallel measurement processors: Sequential file read with multiple parallel measurement processors. The processors are sharded, and one station's measurements will always be processed by the same processor. The results are merged and printed at the end. No concurrency optimizations were made at this point.
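A minimal sketch of the sharding idea; the worker count, channel sizes, and `process` are illustrative, not the actual code from commit b652f32:

```go
import (
	"hash/fnv"
	"strings"
	"sync"
)

// process stands in for the real per-worker aggregation of one
// "station;temperature" line.
func process(line string) { /* aggregate min/max/sum/count */ }

// shardLines pins every station to one worker by hashing the station
// name, so a given station is always handled by the same goroutine.
// Assumes well-formed lines containing a ';'.
func shardLines(lines <-chan string, numWorkers int) {
	shards := make([]chan string, numWorkers)
	var wg sync.WaitGroup
	for i := range shards {
		shards[i] = make(chan string, 1024)
		wg.Add(1)
		go func(in <-chan string) {
			defer wg.Done()
			for line := range in {
				process(line)
			}
		}(shards[i])
	}
	for line := range lines {
		station := line[:strings.IndexByte(line, ';')]
		h := fnv.New32a()
		h.Write([]byte(station))
		shards[h.Sum32()%uint32(numWorkers)] <- line
	}
	for _, ch := range shards {
		close(ch)
	}
	wg.Wait()
}
```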
- Batch read file lines: After doing a trace analysis (using a file with 1M lines) we could see that the `processMeasurements` function takes ~48% of the total time and spends more than half of its time waiting in `chanrecv` (the sharding is also suboptimal: less processing happens than expected). On the other hand, `readFileLines` takes ~22% of the total time but does a lot of IO reads with waits between them. The next step was optimizing only(!) the file read to read lines in batches, skipping the splitting and the conversion to strings. After this, `readFileLines` took 1% of the total time (~20% of it waiting, half of that in syscalls) and `processMeasurements` took 77% (~20% waiting). `processMeasurements` changed mostly in its hashing, and a station can now land on multiple workers, so the results need to be merged at the end. With these changes the CPU cores are busier and `readFileLines` spends more of its time in syscalls, reading bigger chunks. Benchmark before and after for `readFileLines` (a sketch of the batched reader follows the benchmark):
❯ benchstat stats.old.txt stats.txt | tail -n 11
│ stats.old.txt │ stats.txt │
│ sec/op │ sec/op vs base │
ReadFileLines-4 3960.8µ ± 9% 130.9µ ± 2% -96.70% (p=0.002 n=6)
│ stats.old.txt │ stats.txt │
│ B/op │ B/op vs base │
ReadFileLines-4 15636.2Ki ± 0% 1008.1Ki ± 0% -93.55% (p=0.002 n=6)
│ stats.old.txt │ stats.txt │
│ allocs/op │ allocs/op vs base │
ReadFileLines-4 5.000 ± 0% 5.000 ± 0% ~ (p=1.000 n=6) ¹
¹ all samples are equal
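A sketch of what the batched read looks like after this step, assuming a 1 MiB buffer and a `chan []byte` handoff; the actual sizes and signatures in commit 66f92ce may differ:

```go
import (
	"bytes"
	"io"
	"os"
)

// readChunks reads large byte chunks, cuts each one at the last full
// line, and carries the partial tail line into the next read. No line
// splitting and no string conversion happens here.
func readChunks(f *os.File, out chan<- []byte) error {
	defer close(out)
	const bufSize = 1 << 20 // 1 MiB per read; an assumption
	buf := make([]byte, bufSize)
	var leftover []byte
	for {
		n, err := f.Read(buf)
		if n > 0 {
			data := append(leftover, buf[:n]...)
			if i := bytes.LastIndexByte(data, '\n'); i >= 0 {
				// copy out the whole lines; the backing array is reused
				chunk := make([]byte, i+1)
				copy(chunk, data[:i+1])
				out <- chunk
				leftover = append(data[:0], data[i+1:]...)
			} else {
				leftover = data
			}
		}
		if err == io.EOF {
			if len(leftover) > 0 {
				out <- leftover
			}
			return nil
		}
		if err != nil {
			return err
		}
	}
}
```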
- Batch process lines: Read chunks in `getMeasurements` and distribute them to parallel workers that process the measurements (`processMeasurements`). The `processMeasurements` function splits the lines and aggregates the chunks. The result is sent back to `getMeasurements`, where the aggregated subresults are merged (see the merge sketch after the benchmarks below). Both `processMeasurements` (`pm_stats`) and `getMeasurements` (`gm_stats`) improved, but the channel synchronization degrades performance (something to fix next time).
❯ benchstat pm_stats.orig.txt pm_stats.txt | tail -n 11
│ pm_stats.orig.txt │ pm_stats.txt │
│ sec/op │ sec/op vs base │
ProcessMeasurements-4 3.905µ ± 1% 1.942µ ± 1% -50.26% (p=0.002 n=6)
│ pm_stats.orig.txt │ pm_stats.txt │
│ B/op │ B/op vs base │
ProcessMeasurements-4 680.0 ± 0% 7064.0 ± 0% +938.82% (p=0.002 n=6)
│ pm_stats.orig.txt │ pm_stats.txt │
│ allocs/op │ allocs/op vs base │
ProcessMeasurements-4 8.000 ± 0% 4.000 ± 0% -50.00% (p=0.002 n=6)
❯ benchstat gm_stats.orig.txt gm_stats.txt | tail -n 11
│ gm_stats.orig.txt │ gm_stats.txt │
│ sec/op │ sec/op vs base │
GetMeasurements-4 126.02µ ± 23% 11.13µ ± 2% -91.16% (p=0.002 n=6)
│ gm_stats.orig.txt │ gm_stats.txt │
│ B/op │ B/op vs base │
GetMeasurements-4 633.69Ki ± 0% 25.71Ki ± 0% -95.94% (p=0.002 n=6)
│ gm_stats.orig.txt │ gm_stats.txt │
│ allocs/op │ allocs/op vs base │
GetMeasurements-4 17.00 ± 0% 21.00 ± 0% +23.53% (p=0.002 n=6)
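The final merge then looks roughly like the sketch below; the `stats` type and its fields are stand-ins for the real aggregate:

```go
// stats is a per-station aggregate (illustrative field layout).
type stats struct {
	min, max, sum float64
	count         int64
}

// mergeResults folds each worker's partial map into one final map,
// since a station may now appear in several workers' results.
func mergeResults(parts []map[string]*stats) map[string]*stats {
	merged := make(map[string]*stats)
	for _, part := range parts {
		for station, s := range part {
			agg, ok := merged[station]
			if !ok {
				merged[station] = s
				continue
			}
			if s.min < agg.min {
				agg.min = s.min
			}
			if s.max > agg.max {
				agg.max = s.max
			}
			agg.sum += s.sum
			agg.count += s.count
		}
	}
	return merged
}
```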
- Use L3 size chunks (see the sizing sketch below).
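My reading of this step: size the chunks so the data being scanned stays resident in the shared L3 cache (4 MiB on the i7-7500U above). A sketch of such sizing, with the cache size hard-coded as a per-machine assumption:

```go
// l3Size matches the machine's shared L3 cache; hard-coded here, so it
// is a per-machine assumption rather than something detected at runtime.
const l3Size = 4 << 20 // 4 MiB

// chunkSize leaves headroom so the chunks in flight across all workers
// still fit in L3 together.
func chunkSize(workers int) int {
	return l3Size / workers
}
```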
- Refactor to parallel read and process: Refactor from a concurrency-heavy approach to a parallelism-heavy one. Read and process the file in equally distributed parts across all the CPU cores and merge the results at the end (a partitioning sketch follows below). This turned out to be a bit slower, but the goroutines need less synchronization, spend ~15% more time executing, and the program uses far less system memory (visible in top). This code looks easier to improve and should gain more performance with a higher core count.
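A sketch of the equal-part split, aligning each boundary to the next newline so no line straddles two workers; the names and the 128-byte scan window are assumptions, not the code from commit b0fac51:

```go
import (
	"bytes"
	"fmt"
	"io"
	"os"
)

// partition splits [0, size) into `parts` byte ranges of nearly equal
// length, moving each boundary forward to just past the next '\n'.
func partition(f *os.File, size int64, parts int) ([][2]int64, error) {
	bounds := make([][2]int64, 0, parts)
	start := int64(0)
	buf := make([]byte, 128) // assumes every line is shorter than this
	for i := 1; i <= parts; i++ {
		end := size * int64(i) / int64(parts)
		if i == parts {
			end = size
		} else {
			n, err := f.ReadAt(buf, end)
			if err != nil && err != io.EOF {
				return nil, err
			}
			j := bytes.IndexByte(buf[:n], '\n')
			if j < 0 {
				return nil, fmt.Errorf("no newline near offset %d", end)
			}
			end += int64(j) + 1
		}
		bounds = append(bounds, [2]int64{start, end})
		start = end
	}
	return bounds, nil
}
```

Each worker then scans its own `[start, end)` range independently, and the per-worker maps are merged once at the end.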
- Optimize map allocation for processing: After more profiling we can see that the `processMeasurements` function does a lot of allocations (for the `measurement` object) and map lookups. Using an FNV-1a hash as the key (`int32`) could improve the execution time by ~10%, but recovering the station names from it isn't possible without adding extra complexity, which eats up the gain. So the only improvement I could see here is to properly pre-allocate the map so no re-allocations happen during execution (a sketch follows the benchmark below). On a small dataset this single change gives a 55% gain at the function level.
❯ benchstat pm_stats.orig.txt pm_stats.txt | tail -n 11
│ pm_stats.orig.txt │ pm_stats.txt │
│ sec/op │ sec/op vs base │
ProcessMeasurements-4 3.565µ ± 3% 1.574µ ± 4% -55.85% (p=0.002 n=6)
│ pm_stats.orig.txt │ pm_stats.txt │
│ B/op │ B/op vs base │
ProcessMeasurements-4 7.008Ki ± 0% 1.109Ki ± 0% -84.17% (p=0.002 n=6)
│ pm_stats.orig.txt │ pm_stats.txt │
│ allocs/op │ allocs/op vs base │
ProcessMeasurements-4 4.000 ± 0% 3.000 ± 0% -25.00% (p=0.002 n=6)
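The pre-allocation itself is small; 10,000 is the maximum number of distinct stations in the 1BRC data set, and `measurement` here is a stand-in for the real type:

```go
// measurement is an illustrative stand-in for the real aggregate type.
type measurement struct {
	min, max, sum float64
	count         int64
}

// Pre-sizing the map lets the runtime allocate its buckets once up
// front instead of growing and rehashing while stations are inserted.
func newStationMap() map[string]*measurement {
	return make(map[string]*measurement, 10_000)
}
```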
I also optimized the `getResults` function, but it's not a big overall improvement because it only runs at the end of the process for a short time (I did it just for the fun of it :) ).
❯ benchstat gr_stats.orig.txt gr_stats.txt | tail -n 11
│ gr_stats.orig.txt │ gr_stats.txt │
│ sec/op │ sec/op vs base │
GetResult-4 3629.5n ± 2% 668.3n ± 1% -81.59% (p=0.002 n=6)
│ gr_stats.orig.txt │ gr_stats.txt │
│ B/op │ B/op vs base │
GetResult-4 560.0 ± 0% 576.0 ± 0% +2.86% (p=0.002 n=6)
│ gr_stats.orig.txt │ gr_stats.txt │
│ allocs/op │ allocs/op vs base │
GetResult-4 25.000 ± 0% 2.000 ± 0% -92.00% (p=0.002 n=6)
Looks like this part did a lot of allocations:
❯ benchstat main_stats.orig.txt main_stats.txt | tail -n 3
│ main_stats.orig.txt │ main_stats.txt │
│ allocs/op │ allocs/op vs base │
Main-4 53023.0 ± 0% 125.0 ± 2% -99.76% (p=0.002 n=6)
- PGO and GC tuning: This is not part of the official challenge because it contains non-code optimizations. Profile-Guided Optimization resulted in a 4% gain. The GC was configured to run ~100x less often (triggering at around 6-8GB on my machine), which raised system memory usage to ~50% instead of ~1% and brought another 8% gain. The gain on the M1 was minimal: less than 0.5s with PGO and nearly 1s with GC tuning.
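For reproduction: since Go 1.21, `go build` picks up a `default.pgo` profile in the main package directory automatically. The GC setting corresponds to something like this sketch, whether applied via the `GOGC` environment variable or in code:

```go
import "runtime/debug"

func init() {
	// Equivalent to GOGC=10000: let the heap grow ~100x between
	// collections, so the GC runs about 100x less often and trades
	// memory (several GB here) for CPU time.
	debug.SetGCPercent(10_000)
}
```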
Out of curiosity I also solved the same problem with GitHub Copilot. I tried to feed it the description from the 1brc challenge GitHub page with as few changes as possible. The generated programs worked and needed only tiny modifications to the output formatting and the mean calculation (I had this issue as well, so I basically copied my version). They were ready to run within a couple of minutes.
- The first generated program (like most of the suggestions) was single-threaded. It ran for 424 seconds (148s on M1).
- For the second version I picked a suggestion that tried to use some concurrency. It used around 2.5-3 CPU cores, but it ate up all the system memory and stopped responding after nearly 15 minutes (it finished on the M1 in 443s).
I also wrote the measurement file generator, but my logic for generating measurements was not as sophisticated as the original one. On my machine I could reduce the generation time to 1:46 min compared to the 4:27 min used by `create_measurements3.sh`. It was interesting that the same code behaved quite differently on the M1. On my machine the threads were busy and didn't have to wait much for goroutine scheduling (~360% CPU used), but on the M1 this was much worse (~510% CPU used). Because of this, my multithreaded version was slower on the M1 (2:40) than the single-threaded original (2:09). (I leave a TODO here to investigate and improve this.)