walk: Send WorkerResults in batches #1422

tavianator · 2023-11-05T19:57:45Z

Fixes #1408, fixes #1362.

Benchmark results

Complete traversal

linux v6.5 (86,380 files)

Command	Mean [ms]	Min [ms]	Max [ms]	Relative
`bfs bench/corpus/linux -false`	19.3 ± 0.4	18.6	20.3	1.02 ± 0.08
`find bench/corpus/linux -false`	96.7 ± 0.4	96.0	97.2	5.14 ± 0.39
`fd -u '^$' bench/corpus/linux`	229.0 ± 29.4	135.6	239.2	12.16 ± 1.82
`fd-master -u '^$' bench/corpus/linux`	61.4 ± 18.2	29.7	74.4	3.26 ± 1.00
`fd-batch -u '^$' bench/corpus/linux`	18.8 ± 1.4	16.0	20.8	1.00

rust 1.72.1 (192,714 files)

Command	Mean [ms]	Min [ms]	Max [ms]	Relative
`bfs bench/corpus/rust -false`	53.7 ± 1.9	51.2	58.4	1.57 ± 0.11
`find bench/corpus/rust -false`	304.5 ± 0.9	302.7	305.8	8.91 ± 0.51
`fd -u '^$' bench/corpus/rust`	360.0 ± 0.9	358.7	361.4	10.53 ± 0.61
`fd-master -u '^$' bench/corpus/rust`	70.9 ± 20.6	44.5	91.1	2.07 ± 0.62
`fd-batch -u '^$' bench/corpus/rust`	34.2 ± 2.0	30.9	37.3	1.00

chromium 119.0.6036.2 (2,119,292 files)

Command	Mean [ms]	Min [ms]	Max [ms]	Relative
`bfs bench/corpus/chromium -false`	516.9 ± 9.1	504.1	532.8	2.10 ± 0.04
`find bench/corpus/chromium -false`	3218.6 ± 9.7	3205.6	3242.2	13.07 ± 0.12
`fd -u '^$' bench/corpus/chromium`	2522.9 ± 50.4	2484.3	2602.1	10.25 ± 0.22
`fd-master -u '^$' bench/corpus/chromium`	281.3 ± 21.3	259.5	306.2	1.14 ± 0.09
`fd-batch -u '^$' bench/corpus/chromium`	246.2 ± 2.1	243.5	250.1	1.00

Printing paths

Without colors

linux v6.5

Command	Mean [ms]	Min [ms]	Max [ms]	Relative
`bfs bench/corpus/linux`	32.0 ± 1.5	29.4	34.8	1.47 ± 0.11
`find bench/corpus/linux`	102.4 ± 1.0	101.1	104.5	4.70 ± 0.27
`fd -u --search-path bench/corpus/linux`	152.3 ± 41.8	132.7	248.7	7.00 ± 1.96
`fd-master -u --search-path bench/corpus/linux`	72.4 ± 22.0	46.3	98.7	3.33 ± 1.03
`fd-batch -u --search-path bench/corpus/linux`	21.8 ± 1.2	19.4	23.7	1.00

chromium 119.0.6036.2

Command	Mean [ms]	Min [ms]	Max [ms]	Relative
`bfs bench/corpus/chromium`	707.0 ± 28.2	668.0	768.4	2.32 ± 0.10
`find bench/corpus/chromium`	3378.1 ± 11.0	3368.4	3399.6	11.09 ± 0.12
`fd -u --search-path bench/corpus/chromium`	2495.7 ± 64.0	2440.6	2577.8	8.20 ± 0.23
`fd-master -u --search-path bench/corpus/chromium`	776.0 ± 19.8	742.4	820.3	2.55 ± 0.07
`fd-batch -u --search-path bench/corpus/chromium`	304.5 ± 3.1	297.2	307.3	1.00

With colors

linux v6.5

Command	Mean [ms]	Min [ms]	Max [ms]	Relative
`bfs bench/corpus/linux -color`	221.5 ± 2.8	215.2	226.3	4.43 ± 0.26
`fd -u --search-path bench/corpus/linux --color=always`	172.4 ± 51.1	133.9	251.0	3.45 ± 1.04
`fd-master -u --search-path bench/corpus/linux --color=always`	81.1 ± 18.3	69.0	120.2	1.62 ± 0.38
`fd-batch -u --search-path bench/corpus/linux --color=always`	50.0 ± 2.9	47.4	56.9	1.00

chromium 119.0.6036.2

Command	Mean [s]	Min [s]	Max [s]	Relative
`bfs bench/corpus/chromium -color`	5.644 ± 0.022	5.612	5.685	4.64 ± 0.07
`fd -u --search-path bench/corpus/chromium --color=always`	2.502 ± 0.072	2.448	2.614	2.06 ± 0.07
`fd-master -u --search-path bench/corpus/chromium --color=always`	4.738 ± 0.156	4.496	5.037	3.89 ± 0.14
`fd-batch -u --search-path bench/corpus/chromium --color=always`	1.218 ± 0.018	1.199	1.250	1.00

Parallelism

rust 1.72.1

`-j1`

Command	Mean [ms]	Min [ms]	Max [ms]	Relative
`bfs -j1 bench/corpus/rust -false`	219.0 ± 1.7	216.7	221.5	1.00
`fd -j1 -u '^$' bench/corpus/rust`	271.1 ± 2.7	269.2	278.8	1.24 ± 0.02
`fd-master -j1 -u '^$' bench/corpus/rust`	275.5 ± 1.7	273.6	279.0	1.26 ± 0.01
`fd-batch -j1 -u '^$' bench/corpus/rust`	275.2 ± 2.3	271.8	278.8	1.26 ± 0.01

`-j2`

Command	Mean [ms]	Min [ms]	Max [ms]	Relative
`bfs -j2 bench/corpus/rust -false`	199.7 ± 3.3	196.1	207.9	1.30 ± 0.03
`fd -j2 -u '^$' bench/corpus/rust`	219.3 ± 6.8	210.0	229.6	1.42 ± 0.05
`fd-master -j2 -u '^$' bench/corpus/rust`	158.4 ± 1.7	155.0	160.6	1.03 ± 0.02
`fd-batch -j2 -u '^$' bench/corpus/rust`	154.2 ± 2.0	152.0	159.8	1.00

`-j3`

Command	Mean [ms]	Min [ms]	Max [ms]	Relative
`bfs -j3 bench/corpus/rust -false`	118.0 ± 7.0	110.3	129.7	1.08 ± 0.07
`fd -j3 -u '^$' bench/corpus/rust`	214.3 ± 7.4	198.6	224.2	1.95 ± 0.07
`fd-master -j3 -u '^$' bench/corpus/rust`	115.7 ± 3.5	111.0	120.3	1.06 ± 0.04
`fd-batch -j3 -u '^$' bench/corpus/rust`	109.6 ± 1.7	105.9	112.5	1.00

`-j4`

Command	Mean [ms]	Min [ms]	Max [ms]	Relative
`bfs -j4 bench/corpus/rust -false`	85.7 ± 5.2	79.0	95.4	1.00
`fd -j4 -u '^$' bench/corpus/rust`	221.8 ± 6.8	204.5	234.2	2.59 ± 0.18
`fd-master -j4 -u '^$' bench/corpus/rust`	94.1 ± 4.3	88.4	100.0	1.10 ± 0.08
`fd-batch -j4 -u '^$' bench/corpus/rust`	86.5 ± 1.3	82.7	88.8	1.01 ± 0.06

`-j6`

Command	Mean [ms]	Min [ms]	Max [ms]	Relative
`bfs -j6 bench/corpus/rust -false`	62.4 ± 1.8	59.8	65.5	1.00
`fd -j6 -u '^$' bench/corpus/rust`	231.9 ± 13.6	201.8	246.2	3.71 ± 0.24
`fd-master -j6 -u '^$' bench/corpus/rust`	76.2 ± 5.7	66.1	81.8	1.22 ± 0.10
`fd-batch -j6 -u '^$' bench/corpus/rust`	62.5 ± 1.3	60.6	66.7	1.00 ± 0.04

`-j8`

Command	Mean [ms]	Min [ms]	Max [ms]	Relative
`bfs -j8 bench/corpus/rust -false`	53.0 ± 1.3	51.0	55.9	1.03 ± 0.04
`fd -j8 -u '^$' bench/corpus/rust`	230.0 ± 4.4	223.8	237.5	4.49 ± 0.14
`fd-master -j8 -u '^$' bench/corpus/rust`	59.4 ± 6.6	53.9	76.3	1.16 ± 0.13
`fd-batch -j8 -u '^$' bench/corpus/rust`	51.2 ± 1.2	49.3	53.4	1.00

`-j12`

Command	Mean [ms]	Min [ms]	Max [ms]	Relative
`bfs -j12 bench/corpus/rust -false`	53.6 ± 2.8	48.2	62.7	1.32 ± 0.08
`fd -j12 -u '^$' bench/corpus/rust`	245.5 ± 14.9	224.8	268.6	6.03 ± 0.40
`fd-master -j12 -u '^$' bench/corpus/rust`	56.5 ± 11.8	48.2	75.8	1.39 ± 0.29
`fd-batch -j12 -u '^$' bench/corpus/rust`	40.7 ± 1.1	39.0	43.8	1.00

`-j16`

Command	Mean [ms]	Min [ms]	Max [ms]	Relative
`bfs -j16 bench/corpus/rust -false`	68.8 ± 7.0	54.7	79.2	1.88 ± 0.21
`fd -j16 -u '^$' bench/corpus/rust`	246.9 ± 10.5	238.5	276.0	6.76 ± 0.42
`fd-master -j16 -u '^$' bench/corpus/rust`	54.6 ± 14.9	45.2	81.1	1.50 ± 0.41
`fd-batch -j16 -u '^$' bench/corpus/rust`	36.5 ± 1.7	33.7	39.7	1.00

Process spawning

linux v6.5

One file per process

Command	Mean [s]	Min [s]	Max [s]	Relative
`bfs bench/corpus/linux -maxdepth 2 -exec true -- {} \;`	1.391 ± 0.066	1.309	1.469	9.65 ± 0.75
`find bench/corpus/linux -maxdepth 2 -exec true -- {} \;`	1.351 ± 0.028	1.312	1.396	9.37 ± 0.61
`fd -u --search-path bench/corpus/linux --max-depth=2 -x true --`	0.274 ± 0.059	0.182	0.349	1.90 ± 0.42
`fd -j1 -u --search-path bench/corpus/linux --max-depth=2 -x true --`	1.015 ± 0.028	0.970	1.050	7.04 ± 0.48
`fd-master -u --search-path bench/corpus/linux --max-depth=2 -x true --`	0.157 ± 0.021	0.136	0.191	1.09 ± 0.16
`fd-master -j1 -u --search-path bench/corpus/linux --max-depth=2 -x true --`	1.013 ± 0.015	0.998	1.047	7.03 ± 0.44
`fd-batch -u --search-path bench/corpus/linux --max-depth=2 -x true --`	0.144 ± 0.009	0.127	0.158	1.00
`fd-batch -j1 -u --search-path bench/corpus/linux --max-depth=2 -x true --`	1.013 ± 0.028	0.973	1.064	7.03 ± 0.47

Many files per process

Command	Mean [ms]	Min [ms]	Max [ms]	Relative
`bfs bench/corpus/linux -exec true -- {} +`	73.0 ± 1.8	69.8	76.3	1.00
`find bench/corpus/linux -exec true -- {} +`	256.1 ± 0.7	255.2	257.3	3.51 ± 0.08
`fd -u --search-path bench/corpus/linux -X true --`	311.0 ± 33.1	217.1	328.5	4.26 ± 0.47
`fd-master -u --search-path bench/corpus/linux -X true --`	198.9 ± 20.4	170.4	215.6	2.72 ± 0.29
`fd-batch -u --search-path bench/corpus/linux -X true --`	146.5 ± 1.8	144.3	152.0	2.01 ± 0.05

Spawn in parent directory

Command	Mean [ms]	Min [ms]	Max [ms]	Relative
`bfs bench/corpus/linux -maxdepth 3 -execdir true -- {} +`	972.9 ± 15.2	943.3	995.2	1.00
`find bench/corpus/linux -maxdepth 3 -execdir true -- {} +`	1030.5 ± 25.0	991.8	1062.4	1.06 ± 0.03

Details

Versions

$ bfs --version | head -n1
bfs 3.0.4
$ find --version | head -n1
find (GNU findutils) 4.9.0
$ fd --version
fd 8.7.1
$ fd-master --version
fd 8.7.1
$ fd-batch --version
fd 8.7.1

tavianator · 2023-11-05T20:12:47Z

tavianator@tachyon $ hyperfine -w2 fd{,-{master,batch}}" -u '^$' /tmp/empty"
Benchmark 1: fd -u '^$' /tmp/empty
  Time (mean ± σ):     143.1 ms ±   5.3 ms    [User: 9.0 ms, System: 132.6 ms]
  Range (min … max):   134.2 ms … 152.7 ms    21 runs
 
Benchmark 2: fd-master -u '^$' /tmp/empty
  Time (mean ± σ):      63.6 ms ±   8.6 ms    [User: 6.0 ms, System: 58.2 ms]
  Range (min … max):    51.7 ms …  80.0 ms    43 runs
 
Benchmark 3: fd-batch -u '^$' /tmp/empty
  Time (mean ± σ):       6.8 ms ±   2.0 ms    [User: 2.6 ms, System: 7.6 ms]
  Range (min … max):     4.3 ms …  12.4 ms    283 runs
 
  Warning: Command took less than 5 ms to complete. Note that the results might be inaccurate because hyperfine can not calibrate the shell startup time much more precise than this limit. You can try to use the `-N`/`--shell=none` option to disable the shell completely.
 
Summary
  fd-batch -u '^$' /tmp/empty ran
    9.33 ± 3.08 times faster than fd-master -u '^$' /tmp/empty
   20.99 ± 6.36 times faster than fd -u '^$' /tmp/empty

tmccombs · 2023-11-07T07:05:01Z

src/walk.rs

+        let items = batch.as_mut().unwrap();
+        items.push(item);
+
+        if items.len() == 1 {


I wonder if it would be better to send over batches after reaching a certain size. That would mean we could know how large to set the initial capacity of the Vec, and possibly avoid contention on the mutex. However, it means receiver threads could end up waiting longer to get results, especially if it takes a while to find more results in the sender threads, so what you have might be better.

tavianator · 2023-11-07

I just measured the average batch size: 209. I don't think we have to try too hard to send larger batches :)

The risk of a minimum batch size is it could stall a very long time if most results are being filtered out. We could do something like always send the batch after N entries are encountered, regardless of whether they're added to the batch. But I doubt it's worth it.

There isn't very much mutex contention with this design anyway. It's only between the receiver and at most one sender, and the receiver critical section is extremely short (just `lock().take().unwrap().into_iter()`). I actually just checked with `perf trace` and the receiver only blocked 136 times over the whole Chromium benchmark (2.1M files). Each sender blocked between 3 and 12 times.
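For readers following the thread, here is a minimal sketch of the batching idea discussed above, not the PR's actual code. It assumes a batch is an `Option<Vec<T>>` behind a mutex, that a sender puts a batch on the channel only when it pushes the first item, and that the receiver takes the whole `Vec` in one short critical section. The names `Batch`, `BatchSender`, `limit` and `demo` are illustrative, and `std::sync::mpsc` stands in for whatever channel the project uses.

```rust
use std::sync::mpsc::{channel, SendError, Sender};
use std::sync::{Arc, Mutex};

/// A batch shared between one sender and the receiver.
/// `None` means the receiver has already taken the contents.
struct Batch<T> {
    items: Arc<Mutex<Option<Vec<T>>>>,
}

impl<T> Batch<T> {
    fn new() -> Self {
        Batch { items: Arc::new(Mutex::new(Some(Vec::new()))) }
    }
}

// Manual impl so that cloning only copies the handle, with no `T: Clone` bound.
impl<T> Clone for Batch<T> {
    fn clone(&self) -> Self {
        Batch { items: Arc::clone(&self.items) }
    }
}

impl<T> IntoIterator for Batch<T> {
    type Item = T;
    type IntoIter = std::vec::IntoIter<T>;

    fn into_iter(self) -> Self::IntoIter {
        // Receiver critical section: take the whole Vec, then release the lock.
        let taken = self.items.lock().unwrap().take();
        taken.unwrap().into_iter()
    }
}

struct BatchSender<T> {
    batch: Batch<T>,
    tx: Sender<Batch<T>>,
    limit: usize, // assumed maximum batch size
}

impl<T> BatchSender<T> {
    fn new(tx: Sender<Batch<T>>, limit: usize) -> Self {
        BatchSender { batch: Batch::new(), tx, limit }
    }

    fn send(&mut self, item: T) -> Result<(), SendError<Batch<T>>> {
        let mut batch = self.batch.items.lock().unwrap();

        // Start a fresh batch if the receiver took the old one or it is full.
        let needs_new = match &*batch {
            Some(items) => items.len() >= self.limit,
            None => true,
        };
        if needs_new {
            drop(batch);
            self.batch = Batch::new();
            batch = self.batch.items.lock().unwrap();
        }

        let items = batch.as_mut().unwrap();
        items.push(item);

        // Only the first item of a batch triggers a channel send; later items
        // are appended to a batch the receiver can already reach.
        if items.len() == 1 {
            self.tx.send(self.batch.clone())?;
        }
        Ok(())
    }
}

/// Single-threaded walk-through of the protocol.
fn demo() -> (Vec<u32>, Vec<u32>) {
    let (tx, rx) = channel();
    let mut sender = BatchSender::new(tx, 1000);
    for i in 1..=3 {
        sender.send(i).unwrap();
    }
    // One channel message carries all three items.
    let first: Vec<u32> = rx.recv().unwrap().into_iter().collect();
    // The old batch was taken, so this starts (and sends) a new one.
    sender.send(4).unwrap();
    let second: Vec<u32> = rx.recv().unwrap().into_iter().collect();
    (first, second)
}
```

In this sketch the batch size is not chosen up front. It is however many items the sender manages to push before the receiver gets around to taking the batch, which is one way to read the measured average of 209.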

tavianator · 2023-11-07T18:29:03Z

Thanks @tmccombs! I'll let @sharkdp review it too since he has his own benchmarks and he found a fatal flaw in the last attempt :)

sharkdp · 2023-11-08T07:58:05Z

So I am comparing master @ d62bbbb with this branch @ 815b3b1. After struggling a bit with thermal throttling affecting the benchmark results, I now have clean results — and they look great!

I see the expected massive improvement in startup speed (last benchmark below), which is great
I see a large (20%) performance gain on searches with lots of results, which is great
I see a small (< 5%) performance gain on searches with few results, which is also great
I see a medium (6-8%) performance loss on command execution benchmarks, which might be acceptable — but maybe something we could still look into?

`fd` regression benchmark

No pattern

Command	Mean [ms]	Min [ms]	Max [ms]	Relative
`./fd-master --hidden --no-ignore '' '/some/folder'`	798.4 ± 33.6	746.0	842.6	1.18 ± 0.05
`./fd-1422 --hidden --no-ignore '' '/some/folder'`	677.7 ± 8.3	667.9	698.9	1.00

Simple pattern

Command	Mean [s]	Min [s]	Max [s]	Relative
`./fd-master '.*[0-9]\.jpg$' '/some/folder'`	1.492 ± 0.006	1.483	1.504	1.01 ± 0.01
`./fd-1422 '.*[0-9]\.jpg$' '/some/folder'`	1.479 ± 0.008	1.466	1.494	1.00

Simple pattern (-HI)

Command	Mean [ms]	Min [ms]	Max [ms]	Relative
`./fd-master -HI '.*[0-9]\.jpg$' '/some/folder'`	590.3 ± 2.8	585.7	594.3	1.05 ± 0.01
`./fd-1422 -HI '.*[0-9]\.jpg$' '/some/folder'`	562.8 ± 2.4	557.5	566.0	1.00

File extension

Command	Mean [ms]	Min [ms]	Max [ms]	Relative
`./fd-master -HI --extension jpg '' '/some/folder'`	644.9 ± 2.7	640.1	647.6	1.04 ± 0.01
`./fd-1422 -HI --extension jpg '' '/some/folder'`	620.1 ± 2.4	616.3	622.9	1.00

File type

Command	Mean [ms]	Min [ms]	Max [ms]	Relative
`./fd-master -HI --type l '' '/some/folder'`	605.1 ± 8.6	599.0	628.6	1.05 ± 0.02
`./fd-1422 -HI --type l '' '/some/folder'`	575.2 ± 4.0	570.2	580.7	1.00

Command execution

Command	Mean [s]	Min [s]	Max [s]	Relative
`./fd-master 'ab' '/some/folder' --exec echo`	4.711 ± 0.031	4.688	4.792	1.00
`./fd-1422 'ab' '/some/folder' --exec echo`	5.109 ± 0.036	5.057	5.176	1.08 ± 0.01

Command execution (large output)

Command	Mean [s]	Min [s]	Max [s]	Relative
`./fd-master -tf 'ab' '/some/folder' --exec cat`	4.719 ± 0.028	4.682	4.779	1.00
`./fd-1422 -tf 'ab' '/some/folder' --exec cat`	5.015 ± 0.053	4.915	5.087	1.06 ± 0.01

Empty folder benchmark

Command	Mean [ms]	Min [ms]	Max [ms]	Relative
`./fd-master -u . /tmp/empty`	28.2 ± 0.6	27.3	31.2	7.65 ± 0.44
`./fd-1422 -u . /tmp/empty`	3.7 ± 0.2	3.5	5.3	1.00

tavianator · 2023-11-08T14:42:21Z

Yeah I see the --exec regression in my benchmarks too. I'm guessing it's because we have N receiver threads in that case, not just 1, so there's more contention. I'm testing a fix.

tavianator · 2023-11-08T15:13:02Z

Actually the problem is not contention, it's that results are not evenly distributed to the receivers because the batch sizes vary wildly. So one exec::job() can get way more results to process than the others. E.g. I just saw this distribution:

-x: 0
-x: 0
-x: 0
-x: 3
-x: 2
-x: 33
-x: 45
-x: 55
-x: 63
-x: 66
-x: 47
-x: 73
-x: 116
-x: 133
-x: 132
-x: 115
-x: 125
-x: 151
-x: 186
-x: 421

Maybe lowering the max batch size for -x will help.
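To put a number on that skew, the following sketch computes the total, the busiest count, and the ratio of the busiest count to the mean. It reads each `-x:` line above as the number of results one receiver handled, which is an interpretation rather than something the comment states outright. The helper name `imbalance` is illustrative.

```rust
/// The twenty `-x:` values quoted above, in order.
const COUNTS: [u32; 20] = [
    0, 0, 0, 3, 2, 33, 45, 55, 63, 66, 47, 73, 116, 133, 132, 115, 125, 151, 186, 421,
];

/// Returns (total, max, max / mean) for a list of per-receiver counts.
/// The ratio is NaN for an empty slice.
fn imbalance(counts: &[u32]) -> (u32, u32, f64) {
    let total: u32 = counts.iter().sum();
    let max = counts.iter().copied().max().unwrap_or(0);
    let mean = total as f64 / counts.len() as f64;
    (total, max, max as f64 / mean)
}
```

For these values the total is 1766 and the mean is 88.3, so the busiest entry (421) is roughly 4.8 times the mean. If every result costs about the same to process, that entry alone bounds the wall time.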

tavianator · 2023-11-08T15:41:25Z

Maybe lowering the max batch size for -x will help.

Yep, that works! --exec perf is now better than master, and nothing else seems to have regressed. New benchmark results are in the top comment.

tmccombs · 2023-11-08T16:53:23Z

I was actually thinking that for the --exec case, it might be more performant to execute on the sender thread instead of having N receiver threads. I'm not sure if that would be worth the complexity though.

sharkdp · 2023-11-08T19:07:30Z

Unfortunately, it looks like things got worse for me with 1469bf3

Command	Mean [s]	Min [s]	Max [s]	Relative
`./fd-master 'ab' '/some/folder' --exec echo`	4.742 ± 0.099	4.691	5.022	1.00
`./fd-1422 'ab' '/some/folder' --exec echo`	5.791 ± 0.031	5.759	5.846	1.22 ± 0.03

I was also wondering what happens in case of (few search results but) longer running programs? Would batching lead to situations where one receiver would run significantly longer than the rest?

tavianator · 2023-11-08T20:11:51Z

Unfortunately, it looks like things got worse for me with 1469bf3

Interesting! Doesn't reproduce for me, even with a similar test (echo instead of true, no -u, similar execution time):

tavianator@tachyon $ hyperfine -w2 fd-{batch,master}" --search-path ~/code/linux -d4 -x echo"
Benchmark 1: fd-batch --search-path ~/code/linux -d4 -x echo
  Time (mean ± σ):      3.622 s ±  0.012 s    [User: 36.389 s, System: 32.154 s]
  Range (min … max):    3.602 s …  3.640 s    10 runs
 
Benchmark 2: fd-master --search-path ~/code/linux -d4 -x echo
  Time (mean ± σ):      3.693 s ±  0.011 s    [User: 36.529 s, System: 32.544 s]
  Range (min … max):    3.674 s …  3.707 s    10 runs
 
Summary
  fd-batch --search-path ~/code/linux -d4 -x echo ran
    1.02 ± 0.00 times faster than fd-master --search-path ~/code/linux -d4 -x echo

I was also wondering what happens in case of (few search results but) longer running programs? Would batching lead to situations where one receiver would run significantly longer than the rest?

In the degenerate case, any batch size N > 1 can lead to an Nx slowdown: if there are exactly N matches, all ending up in the same batch (unlikely but possible), they will be executed sequentially rather than in parallel.
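The degenerate case can be checked with a toy model. The sketch below assumes every batch is already queued, every item takes the same time, and each batch goes to whichever worker becomes free first. None of this is fd's real scheduler, and `makespan` is a hypothetical helper.

```rust
/// Finishing time of the last worker when `workers` threads pull whole
/// batches from a shared queue and run the items of a batch sequentially.
fn makespan(batches: &[u32], workers: usize, secs_per_item: u32) -> u32 {
    assert!(workers > 0, "need at least one worker");
    let mut busy_until = vec![0u32; workers];
    for &len in batches {
        // The worker that becomes free first takes the next batch.
        let next_free = busy_until.iter_mut().min().unwrap();
        *next_free += len * secs_per_item;
    }
    busy_until.into_iter().max().unwrap()
}
```

With 8 one-second tasks and 8 workers, `makespan(&[8], 8, 1)` is 8 while `makespan(&[1; 8], 8, 1)` is 1. That is the Nx slowdown for N = 8 in this model.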

Would you mind trying some variations?

Set the batch size to 1 in --exec mode (line 453)
Set the channel capacity to 2 * config.threads (line 641)
Both of the above

Neither of those makes a big difference for me, but they may help on your machine.
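As an illustration of the second variation, here is a sketch of a bounded channel whose capacity is twice the thread count. It uses the standard library's `sync_channel` as a stand-in, and the project's own channel type may differ. `result_channel` and `fill_until_full` are hypothetical names.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender, TrySendError};

/// Bounded channel holding at most `2 * threads` pending messages.
fn result_channel<T>(threads: usize) -> (SyncSender<T>, Receiver<T>) {
    sync_channel(2 * threads)
}

/// Counts how many messages the channel accepts before a sender would block.
fn fill_until_full(threads: usize) -> usize {
    // Keep the receiver alive so the channel stays connected.
    let (tx, _rx) = result_channel::<u32>(threads);
    let mut accepted = 0;
    loop {
        match tx.try_send(0) {
            Ok(()) => accepted += 1,
            Err(TrySendError::Full(_)) => break,
            Err(TrySendError::Disconnected(_)) => panic!("receiver dropped"),
        }
    }
    accepted
}
```

With `threads = 4` the channel accepts 8 messages before reporting that it is full. A blocking `send` would wait at that point, which is the backpressure a bounded channel gives the senders.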

tavianator · 2023-11-08T20:13:06Z

I was actually thinking that for the --exec case, it might be more performant to execute on the sender thread instead of having N receiver threads. I'm not sure if that would be worth the complexity though.

That's worth trying!

sharkdp · 2023-11-09T07:24:58Z

if there are exactly N matches, all ending up in the same batch (unlikely but possible), they will be executed sequentially rather than in parallel

It's not that unlikely. I often use fd -x to simply perform a task in parallel across a list of files in the current directory (or "closeby", i.e. the search is extremely fast compared to the executed tasks). For example:

> mkdir /tmp/test
> cd /tmp/test
> touch $(seq 12)

and then:

> fd-master -x bash -c "sleep 1 && echo {}"
[takes 1 second]

> fd-1422 -x bash -c "sleep 1 && echo {}"
[takes 11 seconds]

I was actually thinking that for the --exec case, it might be more performant to execute on the sender thread instead of having N receiver threads.

I think that would have the same problem. We really want to decouple the search from the command execution in order to balance the load in both of these cases: (1) long search and fast task execution (2) fast search and long task execution.

tavianator · 2023-11-13T14:57:56Z

@sharkdp Actually it seems like both the changes I suggested in #1422 (comment) are beneficial in general, so I included them in this PR. Can you give it another try?

ghost · 2023-11-29T14:29:35Z

src/exec/job.rs

-    let mut results: Vec<ExitCode> = Vec::new();
-    loop {
+    let mut ret = ExitCode::Success;
+    for result in results {
        // Obtain the next result from the receiver, else if the channel
        // has closed, exit from the loop


Suggested change

// has closed, exit from the loop

// has closed, exit from the loop.

ghost · 2023-11-29T14:31:03Z

src/walk.rs

@@ -36,13 +36,91 @@ enum ReceiverMode {

 /// The Worker threads can result in a valid entry having PathBuf or an error.


Suggested change

/// The Worker threads can result in a valid entry having PathBuf or an error.

/// The Worker threads can result in a valid entry having `PathBuf` or an error.

ghost · 2023-11-29T14:31:28Z

src/walk.rs

 pub enum WorkerResult {
    // Errors should be rare, so it's probably better to allow large_enum_variant than
    // to box the Entry variant
    Entry(DirEntry),
    Error(ignore::Error),
 }

+/// A batch of WorkerResults to send over a channel.


Suggested change

/// A batch of WorkerResults to send over a channel.

/// A batch of `WorkerResult`s to send over a channel.

ghost · 2023-11-29T14:33:53Z

src/walk.rs

+        Ok(())
+    }
+}
+
 /// Maximum size of the output buffer before flushing results to the console


Suggested change

/// Maximum size of the output buffer before flushing results to the console

/// Maximum size of the output buffer before flushing results to the console.

ghost · 2023-11-29T14:35:01Z

src/walk.rs

@@ -319,13 +403,13 @@ impl WorkerState {

    /// Run the receiver work, either on this thread or a pool of background
    /// threads (for --exec).


Suggested change

/// threads (for --exec).

/// threads (for `--exec`).

sharkdp · 2023-11-29T15:44:05Z

I re-ran the benchmarks, comparing master @ 5903dec with this branch @ b8a5f95.

I do see comparable performance between this branch and master for larger searches
I do see comparable performance for --exec scenarios
I still see the massive speedup for small/empty folders.

But it might be worth noting that the speedups for longer-running searches that I saw in previous iterations are now apparently gone. This still seems worth merging though to increase the startup speed!

sharkdp · 2023-11-29T15:57:36Z

Should we look into this comment, now that this is merged? #1412 (comment)

tavianator · 2023-11-29T16:39:38Z

Should we look into this comment, now that this is merged? #1412 (comment)

Yeah probably. I'd be curious to see how it scales on different CPUs. Here's my results for

$ hyperfine -w1 -P j 4 48 -D 2 "fd -j{j} -u --search-path ~/code/bfs/bench/corpus/chromium"

In both cases the sweet spot is near the physical core count, so maybe we should try num_cpus::get_physical() as the default (with a higher cap like 64).

Threadripper 3960X (24 cores, 48 threads)

Intel Core i7-1280P (14 cores (6 P-cores, 8 E-cores), 20 threads)

I assume the high variance at -j14 is due to sometimes getting scheduled on a hyperthread.
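The default floated above (physical core count, capped) could be sketched as follows. The standard library does not expose a physical core count, so this takes it as a parameter. In practice it would come from something like `num_cpus::get_physical()` in the external `num_cpus` crate. `default_threads` and `logical_cpus` are hypothetical helpers, not fd's actual logic.

```rust
use std::thread::available_parallelism;

/// Hypothetical default: the physical core count, at least 1 and at most `cap`.
fn default_threads(physical_cores: usize, cap: usize) -> usize {
    physical_cores.clamp(1, cap.max(1))
}

/// Logical CPU count from the standard library, falling back to 1 if unknown.
fn logical_cpus() -> usize {
    available_parallelism().map(|n| n.get()).unwrap_or(1)
}
```

With a cap of 64, `default_threads(24, 64)` gives 24 for the 24-core machine listed above.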

tavianator · 2023-11-29T17:00:23Z

Mainly I'm curious about lower core count CPUs. E.g. on a 1-core, 2-thread CPU, I assume -j2 is better than -j1. On a 2-core, 4-thread CPU, -j4 is probably better than -j2. On a 4-core, 8-thread CPU? I don't really know. Where's the crossover point where we want to stop using hyperthreads?

tavianator · 2023-11-29T19:10:01Z

Mainly I'm curious about lower core count CPUs. E.g. on a 1-core, 2-thread CPU, I assume -j2 is better than -j1. On a 2-core, 4-thread CPU, -j4 is probably better than -j2. On a 4-core, 8-thread CPU? I don't really know. Where's the crossover point where we want to stop using hyperthreads?

Well this is interesting. I tried to simulate this using taskset to pin fd to various CPUs. Not sure how closely this matches real CPUs with fewer cores, but here's what I found:

Even when pinned to a single core/thread, -j8 is much better than -j1. I don't really understand why yet. I mean, -j8 lets it do more I/O in parallel, but everything is cached in this benchmark so that shouldn't matter.

sharkdp · 2023-11-29T19:14:48Z

Intel Core i7-10850H, 6 cores, 12 threads

Does not seem to be so clear in this case. Setting it to num_cpus::get_physical() == 6 in this case would not lead to the best performance.

sharkdp · 2023-11-29T19:54:38Z

And this is from a server with nproc == 6. Apparently it's an Intel Xeon E5-2680, which should have 8/16 cores/threads, but I guess there is some virtualization going on(?).

tavianator force-pushed the batch branch from c384492 to 1d341b8 Compare November 5, 2023 19:59

tmccombs reviewed Nov 7, 2023

tavianator force-pushed the batch branch from 1d341b8 to 815b3b1 Compare November 7, 2023 13:50

tmccombs approved these changes Nov 7, 2023

tavianator force-pushed the batch branch from ace2a71 to 1469bf3 Compare November 8, 2023 15:41

tavianator force-pushed the batch branch from 1469bf3 to e7e16a4 Compare November 13, 2023 14:55

tavianator added 2 commits November 28, 2023 15:51

walk: Send WorkerResults in batches

73260c0

walk: Limit batch sizes in --exec mode

b8a5f95

tavianator force-pushed the batch branch from e7e16a4 to b8a5f95 Compare November 28, 2023 21:19

ghost reviewed Nov 29, 2023

sharkdp approved these changes Nov 29, 2023

tavianator merged commit 84f032e into sharkdp:master Nov 29, 2023
15 checks passed

tavianator deleted the batch branch November 29, 2023 15:57

tavianator mentioned this pull request Nov 29, 2023

cli: Tweak default thread count logic #1431

Merged

tavianator mentioned this pull request Sep 16, 2024

[BUG] 🐌 fd can be much slower than GNU find in some cases #1614

Open
