Skip to content

Commit

Permalink
perf(cargo-package): match certain path prefix with pathspec (#14997)
Browse files Browse the repository at this point in the history
### What does this PR try to resolve?

This revives #14962. See benchmark chart in
<#14962 (comment)>.
#14962 was closed because we found more bugs in `cargo package`, and
#14962 could potentially make them even harder to fix. Two of them have
been fixed so this is good to ship IMO with its own good.

---

An improvement #14955.

`check_repo_state` checks the entire git repo status. This is usually
fine if you have only a few packages in a workspace.

For huge monorepos, it may hit performance issues.

For example,
on awslabs/aws-sdk-rust@2cbd34d
the workspace has roughly 434 members to publish.
`git ls-files` reported us 204379 files in this Git repository. That
means git may need to check status of all files 434 times. That would be
`204379 * 434 = 88,700,486` checks!

Moreover, the current algorithm is finding the intersection of
`PathSource::list_files` and `git status`.
It is an `O(n^2)` check.
Let's assume files are evenly distributed into each package, so roughly
470 files per package.
If we're unlucky to have some dirty files, say 100 files. We will have
to do `470 * 100 = 47,000` times of path comparisons.

Even worse, because we `git status` everything in the repo, we'll have
to it for all members,
even when those dirty files are not part of the current package in
question. So it becomes `470 * 100 * 434 = 20,398,000`!

#### Solution

Instead of comparing with the status of the entire repository, this
patch use the magic pathspec[1] to tell git only reports paths that
match a certain path prefix.

This wouldn't help the `O(n^2)` algorithm,
but at least it won't check dirty files outside the current package.
Also, we don't `git status` against entire git worktree/index anymore.

[1]:
https://git-scm.com/docs/gitglossary#Documentation/gitglossary.txt-aiddefpathspecapathspec

### How should we test and review this PR?

Run this command against
awslabs/aws-sdk-rust@2cbd34d,
and see if it is getting better.

```
CARGO_LOG_PROFILE=1 cargor package --no-verify --offline --allow-dirty -p aws-sdk-accessanalyzer -p aws-sdk-apigateway
```

I've verified checksums of `.crate` files generated from master
(d85d761) and this commit
(3dabdcd). They are the same.

### Additional information

There are some other alternatives, like making `PathSource::list_files`
additionally reports dirty files. While we already have rooms to do it,
this approach should be the most straightforward one at this moment.

Some other approaches like

* Switch to gitoxide (I tried and it didn't as good as expected. Maybe I
did something wrong).
* A flag `--no-vcs` to skip vcs at all
* Improve the `O(n^2)` algorithm
  • Loading branch information
epage authored Jan 13, 2025
2 parents f15df8f + 3dabdcd commit 54df3c7
Showing 1 changed file with 21 additions and 4 deletions.
25 changes: 21 additions & 4 deletions src/cargo/ops/cargo_package/vcs.rs
Original file line number Diff line number Diff line change
@@ -1,6 +1,7 @@
//! Helpers to gather the VCS information for `cargo package`.
use std::collections::HashSet;
use std::path::Path;
use std::path::PathBuf;

use anyhow::Context as _;
Expand Down Expand Up @@ -173,7 +174,9 @@ fn git(
// - ignored (in case the user has an `include` directive that
// conflicts with .gitignore).
let mut dirty_files = Vec::new();
collect_statuses(repo, &mut dirty_files)?;
let pathspec = relative_pathspec(repo, pkg.root());
collect_statuses(repo, &[pathspec.as_str()], &mut dirty_files)?;

// Include each submodule so that the error message can provide
// specifically *which* files in a submodule are modified.
status_submodules(repo, &mut dirty_files)?;
Expand Down Expand Up @@ -263,12 +266,18 @@ fn dirty_files_outside_pkg_root(
}

/// Helper to collect dirty statuses for a single repo.
fn collect_statuses(repo: &git2::Repository, dirty_files: &mut Vec<PathBuf>) -> CargoResult<()> {
fn collect_statuses(
repo: &git2::Repository,
pathspecs: &[&str],
dirty_files: &mut Vec<PathBuf>,
) -> CargoResult<()> {
let mut status_opts = git2::StatusOptions::new();
// Exclude submodules, as they are being handled manually by recursing
// into each one so that details about specific files can be
// retrieved.
status_opts
pathspecs
.iter()
.fold(&mut status_opts, git2::StatusOptions::pathspec)
.exclude_submodules(true)
.include_ignored(true)
.include_untracked(true);
Expand Down Expand Up @@ -300,8 +309,16 @@ fn status_submodules(repo: &git2::Repository, dirty_files: &mut Vec<PathBuf>) ->
// If its files are required, then the verification step should fail.
if let Ok(sub_repo) = submodule.open() {
status_submodules(&sub_repo, dirty_files)?;
collect_statuses(&sub_repo, dirty_files)?;
collect_statuses(&sub_repo, &[], dirty_files)?;
}
}
Ok(())
}

/// Use pathspec so git only matches a certain path prefix
fn relative_pathspec(repo: &git2::Repository, pkg_root: &Path) -> String {
let workdir = repo.workdir().unwrap();
let relpath = pkg_root.strip_prefix(workdir).unwrap_or(Path::new(""));
// to unix separators
relpath.to_str().unwrap().replace('\\', "/")
}

0 comments on commit 54df3c7

Please sign in to comment.