Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

spawnWindows: Improve worst-case performance considerably + tests #13993

Merged
merged 4 commits into from
Dec 18, 2022

Conversation

squeek502
Copy link
Collaborator

@squeek502 squeek502 commented Dec 18, 2022

Follow up to #13983.

First, the eye-popping benchmark:

Benchmark 1: winpathbench-perf.exe
  Time (mean ± σ):     854.3 ms ±   6.0 ms    [User: 116.9 ms, System: 731.9 ms]
  Range (min … max):   845.3 ms … 867.6 ms    10 runs

Benchmark 2: winpathbench-master.exe
  Time (mean ± σ):      8.188 s ±  0.038 s    [User: 0.684 s, System: 7.479 s]
  Range (min … max):    8.137 s …  8.254 s    10 runs

Benchmark 3: winpathbench.bat
  Time (mean ± σ):     851.4 ms ±   4.9 ms    [User: 90.3 ms, System: 741.3 ms]
  Range (min … max):   843.9 ms … 861.1 ms    10 runs

Summary
  'winpathbench.bat' ran
    1.00 ± 0.01 times faster than 'winpathbench-perf.exe'
    9.62 ± 0.07 times faster than 'winpathbench-master.exe'
Benchmark code

Compiled with -lc -OReleaseFast:

const std = @import("std");

pub fn main() !void {
    const allocator = std.heap.c_allocator;
    var i: usize = 0;
    const iterations = 100;
    while (i < iterations) : (i += 1) {
        var result = std.ChildProcess.exec(.{
            .allocator = allocator,
            .argv = &[_][]const u8{"hello"},
        }) catch continue;
        defer allocator.free(result.stdout);
        defer allocator.free(result.stderr);

        if (result.term != .Exited or result.term.Exited != 0) {
            return error.CommandFailed;
        }
    }
}

The .bat file for comparison:

FOR /L %%i IN (1,1,100) DO hello

However, this is not representative, and in fact I think it's basically impossible to get a 'representative' benchmark, as the benchmark mostly depends on the number of search paths in the PATH env var that must be searched before finding the command, which varies hugely (order of PATH, number of values in PATH, the particular command, etc).

The benchmark above is using a PATH with 211 entries (artificially inflated), and it finds the command after searching through 197 of them. Here's a benchmark with only the directory that the command is found in on the PATH (so it only checks the cwd and then the PATH entry where it finds it):

Benchmark 1: winpathbench-perf.exe
  Time (mean ± σ):     237.9 ms ±   4.1 ms    [User: 48.4 ms, System: 179.9 ms]
  Range (min … max):   231.0 ms … 244.0 ms    11 runs

Benchmark 2: winpathbench-master.exe
  Time (mean ± σ):     271.3 ms ±   1.4 ms    [User: 47.5 ms, System: 195.0 ms]
  Range (min … max):   269.4 ms … 274.1 ms    10 runs

Benchmark 3: winpathbench.bat
  Time (mean ± σ):     278.9 ms ±   1.7 ms    [User: 69.4 ms, System: 179.4 ms]
  Range (min … max):   276.0 ms … 281.1 ms    10 runs

Summary
  'winpathbench-perf.exe' ran
    1.14 ± 0.02 times faster than 'winpathbench-master.exe'
    1.17 ± 0.02 times faster than 'winpathbench.bat'

Still slightly faster, but it should demonstrate that this PR greatly improves the worst-case performance but not necessarily the performance of the happy path.


That out of the way, here's some details:

The name of the game here is to avoid CreateProcessW calls at all costs, and only ever try calling it when we have a real candidate for execution. Secondarily, we want to minimize the number of syscalls used when checking for each PATHEXT-appended version of the app name.

An overview of the technique used (it was inspired by running NtTrace on the winpathbench.bat file and finding that this is what it was doing):

  • Open the search directory for iteration (either cwd or a path from PATH)
  • Use NtQueryDirectoryFile with a wildcard filename of <app name>* to check if anything that could possibly match either the unappended version of the app name or any of the versions with a PATHEXT value appended exists.
  • If the wildcard NtQueryDirectoryFile call found nothing, we can exit early without needing to use PATHEXT at all.

This allows us to use a <open dir, NtQueryDirectoryFile, close dir> sequence for any directory that doesn't contain any possible matches, instead of having to use a separate look up for each individual filename combination (unappended + each PATHEXT appended). For directories where the wildcard does match something, we only need to do a maximum of <number of supported PATHEXT extensions> more NtQueryDirectoryFile calls.

In addition, we now only evaluate the extensions in PATHEXT that we know we can handle (.COM, .EXE, .BAT, .CMD) and ignore the rest (see #13983 (comment) for more details about this).


This also adds a standalone test and makes two edge cases match Windows behavior:

  • If an invalid executable has the extension .exe and it is attempted to be executed, that is now treated as unrecoverable and InvalidExe is immediately returned no matter where the .exe is (cwd or in the PATH). This matches the behavior of the Windows cmd.exe.
  • If the app name contains more than just a filename (e.g. it has path separators), then it is excluded from PATH searching and only does a cwd search. This matches the behavior of Windows cmd.exe.

Seems to happen if the command trying to be executed has the extension .exe and it's an invalid executable.
The name of the game here is to avoid CreateProcessW calls at all costs,
and only ever try calling it when we have a real candidate for execution.
Secondarily, we want to minimize the number of syscalls used when checking
for each PATHEXT-appended version of the app name.

An overview of the technique used:
- Open the search directory for iteration (either cwd or a path from PATH)
- Use NtQueryDirectoryFile with a wildcard filename of `<app name>*` to
  check if anything that could possibly match either the unappended version
  of the app name or any of the versions with a PATHEXT value appended exists.
- If the wildcard NtQueryDirectoryFile call found nothing, we can exit early
  without needing to use PATHEXT at all.

This allows us to use a <open dir, NtQueryDirectoryFile, close dir> sequence
for any directory that doesn't contain any possible matches, instead of having
to use a separate look up for each individual filename combination (unappended +
each PATHEXT appended). For directories where the wildcard *does* match something,
we only need to do a maximum of <number of supported PATHEXT extensions> more
NtQueryDirectoryFile calls.

---

In addition, we now only evaluate the extensions in PATHEXT that we know we can handle (.COM, .EXE, .BAT, .CMD) and ignore the rest.

---

This commit also makes two edge cases match Windows behavior:

- If an app name has the extension .exe and it is attempted to be executed, that is now treated as unrecoverable and InvalidExe is immediately returned no matter where the .exe is (cwd or in the PATH). This matches the behavior of the Windows cmd.exe.
- If the app name contains more than just a filename (e.g. it has path separators), then it is excluded from PATH searching and only does a cwd search. This matches the behavior of Windows cmd.exe.
@squeek502
Copy link
Collaborator Author

squeek502 commented Dec 18, 2022

An addendum about some tradeoffs:

Out of curiosity, I tried comparing against a Rust version of the benchmark, and I noticed that the Rust version performs considerably better (this is the benchmark with 211 search paths and the command being found after 197 searched paths):

Benchmark 1: winpathbench-perf.exe
  Time (mean ± σ):     848.9 ms ±   3.6 ms    [User: 136.6 ms, System: 702.8 ms]
  Range (min … max):   841.5 ms … 853.3 ms    10 runs

Benchmark 2: winpathbench-rust.exe
  Time (mean ± σ):     545.4 ms ±   3.8 ms    [User: 81.9 ms, System: 440.3 ms]
  Range (min … max):   539.6 ms … 551.5 ms    10 runs

Summary
  'winpathbench-rust.exe' ran
    1.56 ± 0.01 times faster than 'winpathbench-perf.exe'

In looking into why this is, I noticed that Rust makes some tradeoffs in terms of matching Windows behavior and what it thinks is the most likely scenario:

So, my conclusion is that performance could be increased if these sorts of tradeoffs are acceptable, but I think I personally prefer matching the Windows behavior even if it comes at a cost.


For completeness, here's the benchmark with only the directory that the command is found in on the PATH:

Benchmark 1: winpathbench-perf.exe
  Time (mean ± σ):     238.4 ms ±   3.7 ms    [User: 45.5 ms, System: 173.9 ms]
  Range (min … max):   231.8 ms … 243.0 ms    11 runs

Benchmark 2: winpathbench-rust.exe
  Time (mean ± σ):     230.5 ms ±   4.5 ms    [User: 38.9 ms, System: 178.2 ms]
  Range (min … max):   223.9 ms … 238.9 ms    11 runs

Summary
  'winpathbench-rust.exe' ran
    1.03 ± 0.03 times faster than 'winpathbench-perf.exe'

Tests a decent amount of edge cases dealing with how PATH and PATHEXT searching is handled.
Copy link
Contributor

@matu3ba matu3ba left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some nits.

@@ -1126,6 +1351,64 @@ fn windowsCreateProcess(app_name: [*:0]u16, cmd_line: [*:0]u16, envp_ptr: ?[*]u1
);
}

/// Case-insenstive UTF-16 lookup
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: The origin/how these known extensions are given of this table should be included here.

@@ -1094,6 +1090,235 @@ pub const ChildProcess = struct {
}
};

/// Expects `app_buf` to contain exactly the app name, and `dir_buf` to contain exactly the dir path.
/// After return, `app_buf` will always contain exactly the app name and `dir_buf` will always contain exactly the dir path.
/// Note: `app_buf` should not contain any leading path separators.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

try app_buf.append(allocator, 0);
const app_name_wildcard = app_buf.items[0 .. app_buf.items.len - 1 :0];

// Enough for the FILE_DIRECTORY_INFORMATION + (NAME_MAX UTF-16 code units [2 bytes each]).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@matu3ba
Copy link
Contributor

matu3ba commented Dec 18, 2022

So, my conclusion is that performance could be increased if these sorts of tradeoffs are acceptable, but I think I personally prefer matching the Windows behavior even if it comes at a cost.

  1. I do assume that absolute paths have the same performance?
  2. Yes, Rust doing the wrong thing for kernel32 routines is not nice.
  3. For relative paths: For a large amount of spawns performance this can be important, for example if one does a pile of panic tests. It sounds okayish, if libstd offers less performance, but not so okay, if creating the code means lots of boilerplate. Or what do you think?

Aside, has very thorough test coverage.

@squeek502
Copy link
Collaborator Author

squeek502 commented Dec 18, 2022

I do assume that absolute paths have the same performance?

If the app name is absolute (e.g. argv = &.{"C:\some\path\something.exe"}, then it will always do at most 1 + <num of supported PATHEXT extensions> NtQueryDirectoryFile calls, and will not search the PATH at all. So, absolute paths should always be extremely quick. If the cwd is absolute, though, this does not hold (or should not once #13994 is merged).

For relative paths: For a large amount of spawns performance this can be important, for example if one does a pile of panic tests. It sounds okayish, if libstd offers less performance, but not so okay, if creating the code means lots of boilerplate. Or what do you think?

Not sure exactly what you're asking, but I'll try to clarify how this PR will impact normal use cases:

  • This will significantly speed up the case of a command not being found in the PATH. However, this likely won't be too noticeable in real use cases since continuously execing missing commands is not something people try to do very often.
  • This will definitely speed up the case of a command being found somewhere near the end of the PATH (e.g. if C:\Windows\system32 is the last entry in the PATH and the PATH has a lot of entries). This is a much more common/likely scenario and will be the main benefit of this change for normal use.

If you're asking about if there would be a benefit to taking some of the sorts of tradeoffs that Rust makes, they would have some effect, but again (in terms of 'normal' use) probably only in the situations in the two points above. The most noticeable differences are in the outlier cases where there are a ton of entries in the PATH that need to be searched, so I'm not convinced that sacrificing correctness (in terms of conforming to how Windows behaves) is worth the speedups in mostly outlier scenarios.

The usage of the wildcard with NtQueryDirectoryFile gets us close enough in terms of performance I think while still being conformant with Windows behavior (without this trick, I might have a different opinion; I first tried just doing NtQueryAttributesFile calls for each possible command location (1 verbatim and 4 appended with each supported PATHEXT extension) and the performance was about 2x worse than it is currently)

@andrewrk andrewrk merged commit d93edad into ziglang:master Dec 18, 2022
squeek502 added a commit to squeek502/zig that referenced this pull request Dec 31, 2022
…nsion

Previously, the implementation would essentially check `startsWith` instead of `eql` (e.g. it would return true for `.exec` because it erroneously 'matched' `.exe`).

Follow up to ziglang#13993
andrewrk pushed a commit that referenced this pull request Jan 1, 2023
…nsion

Previously, the implementation would essentially check `startsWith` instead of `eql` (e.g. it would return true for `.exec` because it erroneously 'matched' `.exe`).

Follow up to #13993
squeek502 added a commit to squeek502/zig that referenced this pull request Jul 23, 2023
This fixes a regression caused by ziglang#13993

As an optimization, the first call to `NtQueryDirectoryFile` would only ask for a single result and assume that if the result returned did not match the app_name exactly, then the unappended app_name did not exist. However, this relied on the assumption that the unappended app_name would always be returned first, but that only seems to be the same on NTFS. On FAT filesystems, the order of returned files can be different, which meant that it could assume the unappended file doesn't exist when it actually does.

This commit fixes that by fully iterating the wildcard matches via `NtQueryDirectoryFile` and taking note of any unappended/PATHEXT-appended filenames it finds. In practice, this strategy does not introduce a speed regression compared to the previous (buggy) implementation.

Benchmark 1 (10 runs): winpathbench-master.exe
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           508ms ± 4.08ms     502ms …  517ms          1 (10%)        0%
  peak_rss           3.62MB ± 2.76KB    3.62MB … 3.63MB          0 ( 0%)        0%
Benchmark 2 (10 runs): winpathbench-fat32-fix.exe
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           500ms ± 21.4ms     480ms …  535ms          0 ( 0%)          -  1.5% ±  2.8%
  peak_rss           3.62MB ± 2.76KB    3.62MB … 3.63MB          0 ( 0%)          -  0.0% ±  0.1%

---

Partially addresses ziglang#16374 (it fixes `zig build` on FAT32 when no `zig-cache` is present)
squeek502 added a commit to squeek502/zig that referenced this pull request Jul 23, 2023
This fixes a regression caused by ziglang#13993

As an optimization, the first call to `NtQueryDirectoryFile` would only ask for a single result and assume that if the result returned did not match the app_name exactly, then the unappended app_name did not exist. However, this relied on the assumption that the unappended app_name would always be returned first, but that only seems to be the case on NTFS. On FAT filesystems, the order of returned files can be different, which meant that it could assume the unappended file doesn't exist when it actually does.

This commit fixes that by fully iterating the wildcard matches via `NtQueryDirectoryFile` and taking note of any unappended/PATHEXT-appended filenames it finds. In practice, this strategy does not introduce a speed regression compared to the previous (buggy) implementation.

Benchmark 1 (10 runs): winpathbench-master.exe
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           508ms ± 4.08ms     502ms …  517ms          1 (10%)        0%
  peak_rss           3.62MB ± 2.76KB    3.62MB … 3.63MB          0 ( 0%)        0%
Benchmark 2 (10 runs): winpathbench-fat32-fix.exe
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           500ms ± 21.4ms     480ms …  535ms          0 ( 0%)          -  1.5% ±  2.8%
  peak_rss           3.62MB ± 2.76KB    3.62MB … 3.63MB          0 ( 0%)          -  0.0% ±  0.1%

---

Partially addresses ziglang#16374 (it fixes `zig build` on FAT32 when no `zig-cache` is present)
andrewrk pushed a commit that referenced this pull request Jul 24, 2023
This fixes a regression caused by #13993

As an optimization, the first call to `NtQueryDirectoryFile` would only ask for a single result and assume that if the result returned did not match the app_name exactly, then the unappended app_name did not exist. However, this relied on the assumption that the unappended app_name would always be returned first, but that only seems to be the case on NTFS. On FAT filesystems, the order of returned files can be different, which meant that it could assume the unappended file doesn't exist when it actually does.

This commit fixes that by fully iterating the wildcard matches via `NtQueryDirectoryFile` and taking note of any unappended/PATHEXT-appended filenames it finds. In practice, this strategy does not introduce a speed regression compared to the previous (buggy) implementation.

Benchmark 1 (10 runs): winpathbench-master.exe
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           508ms ± 4.08ms     502ms …  517ms          1 (10%)        0%
  peak_rss           3.62MB ± 2.76KB    3.62MB … 3.63MB          0 ( 0%)        0%
Benchmark 2 (10 runs): winpathbench-fat32-fix.exe
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           500ms ± 21.4ms     480ms …  535ms          0 ( 0%)          -  1.5% ±  2.8%
  peak_rss           3.62MB ± 2.76KB    3.62MB … 3.63MB          0 ( 0%)          -  0.0% ±  0.1%

---

Partially addresses #16374 (it fixes `zig build` on FAT32 when no `zig-cache` is present)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants