-
-
Notifications
You must be signed in to change notification settings - Fork 816
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
walk: Send WorkerResults in batches #1422
Conversation
tavianator@tachyon $ hyperfine -w2 fd{,-{master,batch}}" -u '^$' /tmp/empty"
Benchmark 1: fd -u '^$' /tmp/empty
Time (mean ± σ): 143.1 ms ± 5.3 ms [User: 9.0 ms, System: 132.6 ms]
Range (min … max): 134.2 ms … 152.7 ms 21 runs
Benchmark 2: fd-master -u '^$' /tmp/empty
Time (mean ± σ): 63.6 ms ± 8.6 ms [User: 6.0 ms, System: 58.2 ms]
Range (min … max): 51.7 ms … 80.0 ms 43 runs
Benchmark 3: fd-batch -u '^$' /tmp/empty
Time (mean ± σ): 6.8 ms ± 2.0 ms [User: 2.6 ms, System: 7.6 ms]
Range (min … max): 4.3 ms … 12.4 ms 283 runs
Warning: Command took less than 5 ms to complete. Note that the results might be inaccurate because hyperfine can not calibrate the shell startup time much more precise than this limit. You can try to use the `-N`/`--shell=none` option to disable the shell completely.
Summary
fd-batch -u '^$' /tmp/empty ran
9.33 ± 3.08 times faster than fd-master -u '^$' /tmp/empty
20.99 ± 6.36 times faster than fd -u '^$' /tmp/empty |
let items = batch.as_mut().unwrap(); | ||
items.push(item); | ||
|
||
if items.len() == 1 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if it would be better to send over batches after reaching a certain size. That would mean we could know how large to set the initial capacity of the Vec, and possibly avoid contention on the mutex. However, it means receiver threads could end up waiting longer to get results, especially if it takes a while to find more results in the sender threads, so what you have might be better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just measured the average batch size: 209. I don't think we have to try too hard to send larger batches :)
The risk of a minimum batch size is it could stall a very long time if most results are being filtered out. We could do something like always send the batch after N entries are encountered, regardless of whether they're added to the batch. But I doubt it's worth it.
There isn't very much mutex contention with this design anyway. It's only between the receiver and at most one sender, and the receiver critical section is extremely short (just lock().take().unwrap().into_iter()
). I actually just checked with perf trace
and the receiver only blocked 136 times over the whole Chromium benchmark (2.1M files). Each sender blocked between 3-12 times.
So I am comparing master @ d62bbbb with this branch @ 815b3b1. After struggling a bit with thermal throttling affecting the benchmark results, I now have clean results — and they look great!
|
Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
---|---|---|---|---|
./fd-master --hidden --no-ignore '' '/some/folder' |
798.4 ± 33.6 | 746.0 | 842.6 | 1.18 ± 0.05 |
./fd-1422 --hidden --no-ignore '' '/some/folder' |
677.7 ± 8.3 | 667.9 | 698.9 | 1.00 |
Simple pattern
Command | Mean [s] | Min [s] | Max [s] | Relative |
---|---|---|---|---|
./fd-master '.*[0-9]\.jpg$' '/some/folder' |
1.492 ± 0.006 | 1.483 | 1.504 | 1.01 ± 0.01 |
./fd-1422 '.*[0-9]\.jpg$' '/some/folder' |
1.479 ± 0.008 | 1.466 | 1.494 | 1.00 |
Simple pattern (-HI)
Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
---|---|---|---|---|
./fd-master -HI '.*[0-9]\.jpg$' '/some/folder' |
590.3 ± 2.8 | 585.7 | 594.3 | 1.05 ± 0.01 |
./fd-1422 -HI '.*[0-9]\.jpg$' '/some/folder' |
562.8 ± 2.4 | 557.5 | 566.0 | 1.00 |
File extension
Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
---|---|---|---|---|
./fd-master -HI --extension jpg '' '/some/folder' |
644.9 ± 2.7 | 640.1 | 647.6 | 1.04 ± 0.01 |
./fd-1422 -HI --extension jpg '' '/some/folder' |
620.1 ± 2.4 | 616.3 | 622.9 | 1.00 |
File type
Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
---|---|---|---|---|
./fd-master -HI --type l '' '/some/folder' |
605.1 ± 8.6 | 599.0 | 628.6 | 1.05 ± 0.02 |
./fd-1422 -HI --type l '' '/some/folder' |
575.2 ± 4.0 | 570.2 | 580.7 | 1.00 |
Command execution
Command | Mean [s] | Min [s] | Max [s] | Relative |
---|---|---|---|---|
./fd-master 'ab' '/some/folder' --exec echo |
4.711 ± 0.031 | 4.688 | 4.792 | 1.00 |
./fd-1422 'ab' '/some/folder' --exec echo |
5.109 ± 0.036 | 5.057 | 5.176 | 1.08 ± 0.01 |
Command execution (large output)
Command | Mean [s] | Min [s] | Max [s] | Relative |
---|---|---|---|---|
./fd-master -tf 'ab' '/some/folder' --exec cat |
4.719 ± 0.028 | 4.682 | 4.779 | 1.00 |
./fd-1422 -tf 'ab' '/some/folder' --exec cat |
5.015 ± 0.053 | 4.915 | 5.087 | 1.06 ± 0.01 |
Empty folder benchmark
Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
---|---|---|---|---|
./fd-master -u . /tmp/empty |
28.2 ± 0.6 | 27.3 | 31.2 | 7.65 ± 0.44 |
./fd-1422 -u . /tmp/empty |
3.7 ± 0.2 | 3.5 | 5.3 | 1.00 |
Yeah I see the |
Actually the problem is not contention, its that results are not evenly distributed to the receivers because the batch sizes vary wildly. So one
Maybe lowering the max batch size for |
Yep, that works! |
I was actually thinking that for the |
Unfortunately, it looks like things got worse for me with 1469bf3
I was also wondering what happens in case of (few search results but) longer running programs? Would batching lead to situations where one receiver would run significantly longer than the rest? |
Interesting! Doesn't reproduce for me, even with a similar test ( tavianator@tachyon $ hyperfine -w2 fd-{batch,master}" --search-path ~/code/linux -d4 -x echo"
Benchmark 1: fd-batch --search-path ~/code/linux -d4 -x echo
Time (mean ± σ): 3.622 s ± 0.012 s [User: 36.389 s, System: 32.154 s]
Range (min … max): 3.602 s … 3.640 s 10 runs
Benchmark 2: fd-master --search-path ~/code/linux -d4 -x echo
Time (mean ± σ): 3.693 s ± 0.011 s [User: 36.529 s, System: 32.544 s]
Range (min … max): 3.674 s … 3.707 s 10 runs
Summary
fd-batch --search-path ~/code/linux -d4 -x echo ran
1.02 ± 0.00 times faster than fd-master --search-path ~/code/linux -d4 -x echo
In the degenerate case, any batch size N > 1 can lead to an Nx slowdown: if there are exactly N matches, all ending up in the same batch (unlikely but possible), they will be executed sequentially rather than in parallel. Would you mind trying some variations?
Neither of those makes a big difference for me, but they may help on your machine. |
That's worth trying! |
It's not that unlikely. I often use
and then:
I think that would have the same problem. We really want to decouple the search from the command execution in order to balance the load in both of these cases: (1) long search and fast task execution (2) fast search and long task execution. |
@sharkdp Actually it seems like both the changes I suggested in #1422 (comment) are beneficial in general, so I included them in this PR. Can you give it another try? |
let mut results: Vec<ExitCode> = Vec::new(); | ||
loop { | ||
let mut ret = ExitCode::Success; | ||
for result in results { | ||
// Obtain the next result from the receiver, else if the channel | ||
// has closed, exit from the loop |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// has closed, exit from the loop | |
// has closed, exit from the loop. |
@@ -36,13 +36,91 @@ enum ReceiverMode { | |||
|
|||
/// The Worker threads can result in a valid entry having PathBuf or an error. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// The Worker threads can result in a valid entry having PathBuf or an error. | |
/// The Worker threads can result in a valid entry having `PathBuf` or an error. |
pub enum WorkerResult { | ||
// Errors should be rare, so it's probably better to allow large_enum_variant than | ||
// to box the Entry variant | ||
Entry(DirEntry), | ||
Error(ignore::Error), | ||
} | ||
|
||
/// A batch of WorkerResults to send over a channel. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// A batch of WorkerResults to send over a channel. | |
/// A batch of `WorkerResult`s to send over a channel. |
Ok(()) | ||
} | ||
} | ||
|
||
/// Maximum size of the output buffer before flushing results to the console |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// Maximum size of the output buffer before flushing results to the console | |
/// Maximum size of the output buffer before flushing results to the console. |
@@ -319,13 +403,13 @@ impl WorkerState { | |||
|
|||
/// Run the receiver work, either on this thread or a pool of background | |||
/// threads (for --exec). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// threads (for --exec). | |
/// threads (for `--exec`). |
I re-ran the benchmarks, comparing master @ 5903dec with this branch @ b8a5f95.
But it might be worth noting that the speedups for longer-running searches that I saw in previous iterations are now apparently gone. This still seems worth merging though to increase the startup speed! |
Should we look into this comment, now that this is merged? #1412 (comment) |
Yeah probably. I'd be curious to see how it scales on different CPUs. Here's my results for $ hyperfine -w1 -P j 4 48 -D 2 "fd -j{j} -u --search-path ~/code/bfs/bench/corpus/chromium" In both cases the sweet spot is near the physical core count, so maybe we should try
I assume the high variance at |
Mainly I'm curious about lower core count CPUs. E.g. on a 1-core, 2-thread CPU, I assume -j2 is better than -j1. On a 2-core, 4-thread CPU, -j4 is probably better than -j2. On a 4-core, 8-thread CPU? I don't really know. Where's the crossover point where we want to stop using hyperthreads? |
Well this is interesting. I tried to simulate this using Even when pinned to a single core/thread, |
Intel Core i7-10850H, 6 cores, 12 threads Does not seem to be so clear in this case. Setting it to |
And this is from a server with |
Fixes #1408, fixes #1362.
Benchmark results
Complete traversal
linux v6.5 (86,380 files)
bfs bench/corpus/linux -false
find bench/corpus/linux -false
fd -u '^$' bench/corpus/linux
fd-master -u '^$' bench/corpus/linux
fd-batch -u '^$' bench/corpus/linux
rust 1.72.1 (192,714 files)
bfs bench/corpus/rust -false
find bench/corpus/rust -false
fd -u '^$' bench/corpus/rust
fd-master -u '^$' bench/corpus/rust
fd-batch -u '^$' bench/corpus/rust
chromium 119.0.6036.2 (2,119,292 files)
bfs bench/corpus/chromium -false
find bench/corpus/chromium -false
fd -u '^$' bench/corpus/chromium
fd-master -u '^$' bench/corpus/chromium
fd-batch -u '^$' bench/corpus/chromium
Printing paths
Without colors
linux v6.5
bfs bench/corpus/linux
find bench/corpus/linux
fd -u --search-path bench/corpus/linux
fd-master -u --search-path bench/corpus/linux
fd-batch -u --search-path bench/corpus/linux
chromium 119.0.6036.2
bfs bench/corpus/chromium
find bench/corpus/chromium
fd -u --search-path bench/corpus/chromium
fd-master -u --search-path bench/corpus/chromium
fd-batch -u --search-path bench/corpus/chromium
With colors
linux v6.5
bfs bench/corpus/linux -color
fd -u --search-path bench/corpus/linux --color=always
fd-master -u --search-path bench/corpus/linux --color=always
fd-batch -u --search-path bench/corpus/linux --color=always
chromium 119.0.6036.2
bfs bench/corpus/chromium -color
fd -u --search-path bench/corpus/chromium --color=always
fd-master -u --search-path bench/corpus/chromium --color=always
fd-batch -u --search-path bench/corpus/chromium --color=always
Parallelism
rust 1.72.1
-j1
bfs -j1 bench/corpus/rust -false
fd -j1 -u '^$' bench/corpus/rust
fd-master -j1 -u '^$' bench/corpus/rust
fd-batch -j1 -u '^$' bench/corpus/rust
-j2
bfs -j2 bench/corpus/rust -false
fd -j2 -u '^$' bench/corpus/rust
fd-master -j2 -u '^$' bench/corpus/rust
fd-batch -j2 -u '^$' bench/corpus/rust
-j3
bfs -j3 bench/corpus/rust -false
fd -j3 -u '^$' bench/corpus/rust
fd-master -j3 -u '^$' bench/corpus/rust
fd-batch -j3 -u '^$' bench/corpus/rust
-j4
bfs -j4 bench/corpus/rust -false
fd -j4 -u '^$' bench/corpus/rust
fd-master -j4 -u '^$' bench/corpus/rust
fd-batch -j4 -u '^$' bench/corpus/rust
-j6
bfs -j6 bench/corpus/rust -false
fd -j6 -u '^$' bench/corpus/rust
fd-master -j6 -u '^$' bench/corpus/rust
fd-batch -j6 -u '^$' bench/corpus/rust
-j8
bfs -j8 bench/corpus/rust -false
fd -j8 -u '^$' bench/corpus/rust
fd-master -j8 -u '^$' bench/corpus/rust
fd-batch -j8 -u '^$' bench/corpus/rust
-j12
bfs -j12 bench/corpus/rust -false
fd -j12 -u '^$' bench/corpus/rust
fd-master -j12 -u '^$' bench/corpus/rust
fd-batch -j12 -u '^$' bench/corpus/rust
-j16
bfs -j16 bench/corpus/rust -false
fd -j16 -u '^$' bench/corpus/rust
fd-master -j16 -u '^$' bench/corpus/rust
fd-batch -j16 -u '^$' bench/corpus/rust
Process spawning
linux v6.5
One file per process
bfs bench/corpus/linux -maxdepth 2 -exec true -- {} \;
find bench/corpus/linux -maxdepth 2 -exec true -- {} \;
fd -u --search-path bench/corpus/linux --max-depth=2 -x true --
fd -j1 -u --search-path bench/corpus/linux --max-depth=2 -x true --
fd-master -u --search-path bench/corpus/linux --max-depth=2 -x true --
fd-master -j1 -u --search-path bench/corpus/linux --max-depth=2 -x true --
fd-batch -u --search-path bench/corpus/linux --max-depth=2 -x true --
fd-batch -j1 -u --search-path bench/corpus/linux --max-depth=2 -x true --
Many files per process
bfs bench/corpus/linux -exec true -- {} +
find bench/corpus/linux -exec true -- {} +
fd -u --search-path bench/corpus/linux -X true --
fd-master -u --search-path bench/corpus/linux -X true --
fd-batch -u --search-path bench/corpus/linux -X true --
Spawn in parent directory
bfs bench/corpus/linux -maxdepth 3 -execdir true -- {} +
find bench/corpus/linux -maxdepth 3 -execdir true -- {} +
Details
Versions