`spawnWindows`: Improve worst-case performance considerably + tests #13993

squeek502 · 2022-12-18T11:17:42Z

Follow up to #13983.

First, the eye-popping benchmark:

Benchmark 1: winpathbench-perf.exe
  Time (mean ± σ):     854.3 ms ±   6.0 ms    [User: 116.9 ms, System: 731.9 ms]
  Range (min … max):   845.3 ms … 867.6 ms    10 runs

Benchmark 2: winpathbench-master.exe
  Time (mean ± σ):      8.188 s ±  0.038 s    [User: 0.684 s, System: 7.479 s]
  Range (min … max):    8.137 s …  8.254 s    10 runs

Benchmark 3: winpathbench.bat
  Time (mean ± σ):     851.4 ms ±   4.9 ms    [User: 90.3 ms, System: 741.3 ms]
  Range (min … max):   843.9 ms … 861.1 ms    10 runs

Summary
  'winpathbench.bat' ran
    1.00 ± 0.01 times faster than 'winpathbench-perf.exe'
    9.62 ± 0.07 times faster than 'winpathbench-master.exe'

Benchmark code

Compiled with -lc -OReleaseFast:

const std = @import("std");

pub fn main() !void {
    const allocator = std.heap.c_allocator;
    var i: usize = 0;
    const iterations = 100;
    while (i < iterations) : (i += 1) {
        var result = std.ChildProcess.exec(.{
            .allocator = allocator,
            .argv = &[_][]const u8{"hello"},
        }) catch continue;
        defer allocator.free(result.stdout);
        defer allocator.free(result.stderr);

        if (result.term != .Exited or result.term.Exited != 0) {
            return error.CommandFailed;
        }
    }
}

The .bat file for comparison:

FOR /L %%i IN (1,1,100) DO hello

However, this is not representative, and in fact I think it's basically impossible to get a 'representative' benchmark, as the benchmark mostly depends on the number of search paths in the PATH env var that must be searched before finding the command, which varies hugely (order of PATH, number of values in PATH, the particular command, etc).

The benchmark above is using a PATH with 211 entries (artificially inflated), and it finds the command after searching through 197 of them. Here's a benchmark with only the directory that the command is found in on the PATH (so it only checks the cwd and then the PATH entry where it finds it):

Benchmark 1: winpathbench-perf.exe
  Time (mean ± σ):     237.9 ms ±   4.1 ms    [User: 48.4 ms, System: 179.9 ms]
  Range (min … max):   231.0 ms … 244.0 ms    11 runs

Benchmark 2: winpathbench-master.exe
  Time (mean ± σ):     271.3 ms ±   1.4 ms    [User: 47.5 ms, System: 195.0 ms]
  Range (min … max):   269.4 ms … 274.1 ms    10 runs

Benchmark 3: winpathbench.bat
  Time (mean ± σ):     278.9 ms ±   1.7 ms    [User: 69.4 ms, System: 179.4 ms]
  Range (min … max):   276.0 ms … 281.1 ms    10 runs

Summary
  'winpathbench-perf.exe' ran
    1.14 ± 0.02 times faster than 'winpathbench-master.exe'
    1.17 ± 0.02 times faster than 'winpathbench.bat'

Still slightly faster, but it should demonstrate that this PR greatly improves the worst-case performance but not necessarily the performance of the happy path.

That out of the way, here's some details:

The name of the game here is to avoid CreateProcessW calls at all costs, and only ever try calling it when we have a real candidate for execution. Secondarily, we want to minimize the number of syscalls used when checking for each PATHEXT-appended version of the app name.

An overview of the technique used (it was inspired by running NtTrace on the winpathbench.bat file and finding that this is what it was doing):

Open the search directory for iteration (either cwd or a path from PATH)
Use NtQueryDirectoryFile with a wildcard filename of <app name>* to check if anything that could possibly match either the unappended version of the app name or any of the versions with a PATHEXT value appended exists.
If the wildcard NtQueryDirectoryFile call found nothing, we can exit early without needing to use PATHEXT at all.

This allows us to use a <open dir, NtQueryDirectoryFile, close dir> sequence for any directory that doesn't contain any possible matches, instead of having to use a separate look up for each individual filename combination (unappended + each PATHEXT appended). For directories where the wildcard does match something, we only need to do a maximum of <number of supported PATHEXT extensions> more NtQueryDirectoryFile calls.

In addition, we now only evaluate the extensions in PATHEXT that we know we can handle (.COM, .EXE, .BAT, .CMD) and ignore the rest (see #13983 (comment) for more details about this).

This also adds a standalone test and makes two edge cases match Windows behavior:

If an invalid executable has the extension .exe and it is attempted to be executed, that is now treated as unrecoverable and InvalidExe is immediately returned no matter where the .exe is (cwd or in the PATH). This matches the behavior of the Windows cmd.exe.
If the app name contains more than just a filename (e.g. it has path separators), then it is excluded from PATH searching and only does a cwd search. This matches the behavior of Windows cmd.exe.

Seems to happen if the command trying to be executed has the extension .exe and it's an invalid executable.

The name of the game here is to avoid CreateProcessW calls at all costs, and only ever try calling it when we have a real candidate for execution. Secondarily, we want to minimize the number of syscalls used when checking for each PATHEXT-appended version of the app name. An overview of the technique used: - Open the search directory for iteration (either cwd or a path from PATH) - Use NtQueryDirectoryFile with a wildcard filename of `<app name>*` to check if anything that could possibly match either the unappended version of the app name or any of the versions with a PATHEXT value appended exists. - If the wildcard NtQueryDirectoryFile call found nothing, we can exit early without needing to use PATHEXT at all. This allows us to use a <open dir, NtQueryDirectoryFile, close dir> sequence for any directory that doesn't contain any possible matches, instead of having to use a separate look up for each individual filename combination (unappended + each PATHEXT appended). For directories where the wildcard *does* match something, we only need to do a maximum of <number of supported PATHEXT extensions> more NtQueryDirectoryFile calls. --- In addition, we now only evaluate the extensions in PATHEXT that we know we can handle (.COM, .EXE, .BAT, .CMD) and ignore the rest. --- This commit also makes two edge cases match Windows behavior: - If an app name has the extension .exe and it is attempted to be executed, that is now treated as unrecoverable and InvalidExe is immediately returned no matter where the .exe is (cwd or in the PATH). This matches the behavior of the Windows cmd.exe. - If the app name contains more than just a filename (e.g. it has path separators), then it is excluded from PATH searching and only does a cwd search. This matches the behavior of Windows cmd.exe.

squeek502 · 2022-12-18T11:30:31Z

An addendum about some tradeoffs:

Out of curiosity, I tried comparing against a Rust version of the benchmark, and I noticed that the Rust version performs considerably better (this is the benchmark with 211 search paths and the command being found after 197 searched paths):

Benchmark 1: winpathbench-perf.exe
  Time (mean ± σ):     848.9 ms ±   3.6 ms    [User: 136.6 ms, System: 702.8 ms]
  Range (min … max):   841.5 ms … 853.3 ms    10 runs

Benchmark 2: winpathbench-rust.exe
  Time (mean ± σ):     545.4 ms ±   3.8 ms    [User: 81.9 ms, System: 440.3 ms]
  Range (min … max):   539.6 ms … 551.5 ms    10 runs

Summary
  'winpathbench-rust.exe' ran
    1.56 ± 0.01 times faster than 'winpathbench-perf.exe'

In looking into why this is, I noticed that Rust makes some tradeoffs in terms of matching Windows behavior and what it thinks is the most likely scenario:

When resolving a command without an extension, it will append .exe and then only search for that (no other extensions from PATHEXT). This means that it will only check exactly one file per directory (and therefore just does a single NtQueryAttributesFile call instead of the open dir, NtQueryDirectoryFile, close dir that we do), which I believe is what makes the difference in the benchmark. The tradeoff here is that it will fail to find commands that Windows would find, e.g. if there is a something.bat in the PATH and you call something Rust will not find it.
Rust also gets the Windows\system32 directory and checks it early regardless of where it is in PATH or if it's there at all. This hugely speeds up finding things when they are in system32, but makes it so that Rust would find things in system32 even if PATH is set to an empty string.

So, my conclusion is that performance could be increased if these sorts of tradeoffs are acceptable, but I think I personally prefer matching the Windows behavior even if it comes at a cost.

For completeness, here's the benchmark with only the directory that the command is found in on the PATH:

Benchmark 1: winpathbench-perf.exe
  Time (mean ± σ):     238.4 ms ±   3.7 ms    [User: 45.5 ms, System: 173.9 ms]
  Range (min … max):   231.8 ms … 243.0 ms    11 runs

Benchmark 2: winpathbench-rust.exe
  Time (mean ± σ):     230.5 ms ±   4.5 ms    [User: 38.9 ms, System: 178.2 ms]
  Range (min … max):   223.9 ms … 238.9 ms    11 runs

Summary
  'winpathbench-rust.exe' ran
    1.03 ± 0.03 times faster than 'winpathbench-perf.exe'

Tests a decent amount of edge cases dealing with how PATH and PATHEXT searching is handled.

matu3ba

some nits.

matu3ba · 2022-12-18T14:56:57Z

lib/std/child_process.zig

@@ -1126,6 +1351,64 @@ fn windowsCreateProcess(app_name: [*:0]u16, cmd_line: [*:0]u16, envp_ptr: ?[*]u1
    );
 }

+/// Case-insenstive UTF-16 lookup


nit: The origin/how these known extensions are given of this table should be included here.

matu3ba · 2022-12-18T14:58:37Z

lib/std/child_process.zig

@@ -1094,6 +1090,235 @@ pub const ChildProcess = struct {
    }
 };

+/// Expects `app_buf` to contain exactly the app name, and `dir_buf` to contain exactly the dir path.
+/// After return, `app_buf` will always contain exactly the app name and `dir_buf` will always contain exactly the dir path.
+/// Note: `app_buf` should not contain any leading path separators.


nit: https://ziglang.org/documentation/master/#toc-Doc-Comment-Guidance

matu3ba · 2022-12-18T15:10:33Z

lib/std/child_process.zig

+    try app_buf.append(allocator, 0);
+    const app_name_wildcard = app_buf.items[0 .. app_buf.items.len - 1 :0];
+
+    // Enough for the FILE_DIRECTORY_INFORMATION + (NAME_MAX UTF-16 code units [2 bytes each]).


This doesnt explain why 2*NAME_MAX. context https://github.com/ziglang/zig/blob/master/lib/std/os/windows.zig#L3142

matu3ba · 2022-12-18T15:40:16Z

So, my conclusion is that performance could be increased if these sorts of tradeoffs are acceptable, but I think I personally prefer matching the Windows behavior even if it comes at a cost.

I do assume that absolute paths have the same performance?
Yes, Rust doing the wrong thing for kernel32 routines is not nice.
For relative paths: For a large amount of spawns performance this can be important, for example if one does a pile of panic tests. It sounds okayish, if libstd offers less performance, but not so okay, if creating the code means lots of boilerplate. Or what do you think?

Aside, has very thorough test coverage.

squeek502 · 2022-12-18T16:02:43Z

I do assume that absolute paths have the same performance?

If the app name is absolute (e.g. argv = &.{"C:\some\path\something.exe"}, then it will always do at most 1 + <num of supported PATHEXT extensions> NtQueryDirectoryFile calls, and will not search the PATH at all. So, absolute paths should always be extremely quick. If the cwd is absolute, though, this does not hold (or should not once #13994 is merged).

For relative paths: For a large amount of spawns performance this can be important, for example if one does a pile of panic tests. It sounds okayish, if libstd offers less performance, but not so okay, if creating the code means lots of boilerplate. Or what do you think?

Not sure exactly what you're asking, but I'll try to clarify how this PR will impact normal use cases:

This will significantly speed up the case of a command not being found in the PATH. However, this likely won't be too noticeable in real use cases since continuously execing missing commands is not something people try to do very often.
This will definitely speed up the case of a command being found somewhere near the end of the PATH (e.g. if C:\Windows\system32 is the last entry in the PATH and the PATH has a lot of entries). This is a much more common/likely scenario and will be the main benefit of this change for normal use.

If you're asking about if there would be a benefit to taking some of the sorts of tradeoffs that Rust makes, they would have some effect, but again (in terms of 'normal' use) probably only in the situations in the two points above. The most noticeable differences are in the outlier cases where there are a ton of entries in the PATH that need to be searched, so I'm not convinced that sacrificing correctness (in terms of conforming to how Windows behaves) is worth the speedups in mostly outlier scenarios.

The usage of the wildcard with NtQueryDirectoryFile gets us close enough in terms of performance I think while still being conformant with Windows behavior (without this trick, I might have a different opinion; I first tried just doing NtQueryAttributesFile calls for each possible command location (1 verbatim and 4 appended with each supported PATHEXT extension) and the performance was about 2x worse than it is currently)

…nsion Previously, the implementation would essentially check `startsWith` instead of `eql` (e.g. it would return true for `.exec` because it erroneously 'matched' `.exe`). Follow up to ziglang#13993

…nsion Previously, the implementation would essentially check `startsWith` instead of `eql` (e.g. it would return true for `.exec` because it erroneously 'matched' `.exe`). Follow up to #13993

This fixes a regression caused by ziglang#13993 As an optimization, the first call to `NtQueryDirectoryFile` would only ask for a single result and assume that if the result returned did not match the app_name exactly, then the unappended app_name did not exist. However, this relied on the assumption that the unappended app_name would always be returned first, but that only seems to be the same on NTFS. On FAT filesystems, the order of returned files can be different, which meant that it could assume the unappended file doesn't exist when it actually does. This commit fixes that by fully iterating the wildcard matches via `NtQueryDirectoryFile` and taking note of any unappended/PATHEXT-appended filenames it finds. In practice, this strategy does not introduce a speed regression compared to the previous (buggy) implementation. Benchmark 1 (10 runs): winpathbench-master.exe measurement mean ± σ min … max outliers delta wall_time 508ms ± 4.08ms 502ms … 517ms 1 (10%) 0% peak_rss 3.62MB ± 2.76KB 3.62MB … 3.63MB 0 ( 0%) 0% Benchmark 2 (10 runs): winpathbench-fat32-fix.exe measurement mean ± σ min … max outliers delta wall_time 500ms ± 21.4ms 480ms … 535ms 0 ( 0%) - 1.5% ± 2.8% peak_rss 3.62MB ± 2.76KB 3.62MB … 3.63MB 0 ( 0%) - 0.0% ± 0.1% --- Partially addresses ziglang#16374 (it fixes `zig build` on FAT32 when no `zig-cache` is present)

This fixes a regression caused by ziglang#13993 As an optimization, the first call to `NtQueryDirectoryFile` would only ask for a single result and assume that if the result returned did not match the app_name exactly, then the unappended app_name did not exist. However, this relied on the assumption that the unappended app_name would always be returned first, but that only seems to be the case on NTFS. On FAT filesystems, the order of returned files can be different, which meant that it could assume the unappended file doesn't exist when it actually does. This commit fixes that by fully iterating the wildcard matches via `NtQueryDirectoryFile` and taking note of any unappended/PATHEXT-appended filenames it finds. In practice, this strategy does not introduce a speed regression compared to the previous (buggy) implementation. Benchmark 1 (10 runs): winpathbench-master.exe measurement mean ± σ min … max outliers delta wall_time 508ms ± 4.08ms 502ms … 517ms 1 (10%) 0% peak_rss 3.62MB ± 2.76KB 3.62MB … 3.63MB 0 ( 0%) 0% Benchmark 2 (10 runs): winpathbench-fat32-fix.exe measurement mean ± σ min … max outliers delta wall_time 500ms ± 21.4ms 480ms … 535ms 0 ( 0%) - 1.5% ± 2.8% peak_rss 3.62MB ± 2.76KB 3.62MB … 3.63MB 0 ( 0%) - 0.0% ± 0.1% --- Partially addresses ziglang#16374 (it fixes `zig build` on FAT32 when no `zig-cache` is present)

This fixes a regression caused by #13993 As an optimization, the first call to `NtQueryDirectoryFile` would only ask for a single result and assume that if the result returned did not match the app_name exactly, then the unappended app_name did not exist. However, this relied on the assumption that the unappended app_name would always be returned first, but that only seems to be the case on NTFS. On FAT filesystems, the order of returned files can be different, which meant that it could assume the unappended file doesn't exist when it actually does. This commit fixes that by fully iterating the wildcard matches via `NtQueryDirectoryFile` and taking note of any unappended/PATHEXT-appended filenames it finds. In practice, this strategy does not introduce a speed regression compared to the previous (buggy) implementation. Benchmark 1 (10 runs): winpathbench-master.exe measurement mean ± σ min … max outliers delta wall_time 508ms ± 4.08ms 502ms … 517ms 1 (10%) 0% peak_rss 3.62MB ± 2.76KB 3.62MB … 3.63MB 0 ( 0%) 0% Benchmark 2 (10 runs): winpathbench-fat32-fix.exe measurement mean ± σ min … max outliers delta wall_time 500ms ± 21.4ms 480ms … 535ms 0 ( 0%) - 1.5% ± 2.8% peak_rss 3.62MB ± 2.76KB 3.62MB … 3.63MB 0 ( 0%) - 0.0% ± 0.1% --- Partially addresses #16374 (it fixes `zig build` on FAT32 when no `zig-cache` is present)

squeek502 added 3 commits December 17, 2022 22:06

windows: Extract RtlEqualUnicodeString usage into to a helper function

6e22b63

windows: Map EXE_MACHINE_TYPE_MISMATCH to InvalidExe

3ee8c49

Seems to happen if the command trying to be executed has the extension .exe and it's an invalid executable.

standalone tests: Add windows spawn test

0cbc59f

Tests a decent amount of edge cases dealing with how PATH and PATHEXT searching is handled.

squeek502 force-pushed the windows-childprocess-perf branch from 2c46323 to 0cbc59f Compare December 18, 2022 11:55

squeek502 mentioned this pull request Dec 18, 2022

spawnWindows: Fix PATH searching when cwd is absolute #13994

Merged

matu3ba reviewed Dec 18, 2022

View reviewed changes

andrewrk merged commit d93edad into ziglang:master Dec 18, 2022

squeek502 mentioned this pull request Dec 31, 2022

ChildProcess: fix false positives in windowsCreateProcessSupportsExtension #14138

Merged

squeek502 mentioned this pull request Jul 23, 2023

child_process: Fix regression on Windows for FAT filesystems #16492

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`spawnWindows`: Improve worst-case performance considerably + tests #13993

`spawnWindows`: Improve worst-case performance considerably + tests #13993

squeek502 commented Dec 18, 2022 •

edited

Loading

squeek502 commented Dec 18, 2022 •

edited

Loading

matu3ba left a comment

matu3ba Dec 18, 2022

matu3ba Dec 18, 2022

matu3ba Dec 18, 2022

matu3ba commented Dec 18, 2022

squeek502 commented Dec 18, 2022 •

edited

Loading

spawnWindows: Improve worst-case performance considerably + tests #13993

spawnWindows: Improve worst-case performance considerably + tests #13993

Conversation

squeek502 commented Dec 18, 2022 • edited Loading

squeek502 commented Dec 18, 2022 • edited Loading

matu3ba left a comment

Choose a reason for hiding this comment

matu3ba Dec 18, 2022

Choose a reason for hiding this comment

matu3ba Dec 18, 2022

Choose a reason for hiding this comment

matu3ba Dec 18, 2022

Choose a reason for hiding this comment

matu3ba commented Dec 18, 2022

squeek502 commented Dec 18, 2022 • edited Loading

`spawnWindows`: Improve worst-case performance considerably + tests #13993

`spawnWindows`: Improve worst-case performance considerably + tests #13993

squeek502 commented Dec 18, 2022 •

edited

Loading

squeek502 commented Dec 18, 2022 •

edited

Loading

squeek502 commented Dec 18, 2022 •

edited

Loading