Unconditionally query crates.io HTTP index for publish status #693

Jake-Shadle · 2023-08-22T10:19:41Z

This replaces the crates-index with tame-index, which adds built-in support for sparse registries, resolving a couple of issues, as well as using gix instead of git2 which has improved functionality for authentication, resolving another issue (or at least, I believe it should).

~~This unfortunately drastically increases the number of dependencies for this crate, due to gix and reqwest being pulled in, but there are a couple of options to mitigate this.~~

1. Replace the usage of git2 with gix in a follow up PR, which would remove some crates completely and not have 2 implementations of git. There is at least one missing feature in gix that I am aware of that this crate uses (dirty checking of the repository), but since this crate already uses the git cli for some operations like fetch anyways, that would be easy enough to replicate with the cli until gix gets that functionality in the future.
2. Just drop support for git registry indices altogether. This may not be possible if using a custom registry that only supports the git sparse protocol, but some alternative registry sources such as cloudsmith already support the sparse protocol, so maybe not that big of an issue. Only using the crates.io sparse registry, even if the user uses an old cargo that doesn't support it, is irrelevant for this crate's usage as it is only checking for the existence of the crate, or the published version.

Resolves: #218
Resolves: #679
Resolves: #692

epage

Thanks for exploring thos!

epage · 2023-08-22T13:24:30Z

src/steps/index.rs

+    let index = tame_index::index::ComboIndexCache::new(tame_index::IndexLocation::new(
+        tame_index::IndexUrl::crates_io(None, None, None)?,
+    ))?;
+
+    let index = match index {
+        tame_index::index::ComboIndexCache::Git(gi) => {
+            tame_index::index::RemoteGitIndex::new(gi)?.into()
+        }
+        tame_index::index::ComboIndexCache::Sparse(si) => {
+            tame_index::index::RemoteSparseIndex::new(
+                si,
+                tame_index::external::reqwest::blocking::Client::builder()
+                    .http2_prior_knowledge()
+                    .build()
+                    .map_err(tame_index::Error::from)?,
+            )
+            .into()
+        }
+    };


Not too thrilled with the amount of ceremony from this API...

Understandable, but this API was a response to crates-index having a much too rigid API that made testing and integration into other crates much more difficult, and some scenarios impossible without replicating pieces of logic that were not public in crates-index.

That being said it's trivial to to add helper functions to tame-index to open the crates.io index with zero knobs.

A lot of libraries have a builder for complex operations and then have a convenience new function on the built type. I would assume something like that would exist here.

epage · 2023-08-22T13:51:10Z

src/ops/cargo.rs

-        .flat_map(|c| c.versions().iter())
-        .any(|v| v.version() == version)
+pub fn is_published(index: &tame_index::index::ComboIndex, name: &str, version: &str) -> bool {
+    match index.krate(name.try_into().expect("crate name is invalid"), true) {


Does this call always do a network operation for sparse index?

Inside of cargo, we unconditionally cache lookups in-memory (and on-disk) and it requires an explicit call to force a new lookup. If the caller has to explicitly decide to do a krate call and then cached_krate, that seems onerous.

If this doesn't always do a network operation, then this is broken for sparse index as we need to ensure we get the latest published status.
(granted, I should soon just say we don' support those older cargos)

For git it does no network I/O only reading from the cache or git blob if the cache is outdated, for sparse it always does a network call, using the local cache entry to set the e-tag so that the response will only be 200 if the crate metadata has been updated.

epage · 2023-08-22T13:54:42Z

src/ops/cargo.rs

-            if let Err(e) = index.update() {
-                log::debug!("crate index update failed with {}", e);
+            if let tame_index::index::ComboIndex::Git(gi) = index {
+                if let Err(e) = gi.fetch() {
+                    log::debug!("crate index update failed with {}", e);
+                }


What is the locking scheme?

Granted, I'm not correctly handling it for crates_index today but its what has stopped me from migrating from v1 to v2 in cargo-edit is the lack of clarify on this.

In cargo-edit, we've had to implement a retry loop for locking. I see that tame-index advertises (b)locking support but its unclear to me which locks are involved seeing as tame-index is directly reading and manipulating internal cargo caches (not really thrilled with that btw). Is it using the same locking policy as cargo?

So the opening of a git index is locked via gix's mechanism, but tame-index doesn't currently support the global package lock mechanism that cargo uses. I'll open an issue to add that eventually, but I think the easier thing to do for this crate is to simply do one of the options I mentioned in the PR, and just only use the sparse index, and do zero disk operations at all, and remove the gix dependency.

I'm hesitant about the idea of exclusively doing network operations to the point that I'd need a strong justification to move this PR forward.

Why is that a problem? The current code is doing far more network I/O due to always performing a git fetch, changing it to do a single HTTP request for the crate being released with its etag means drastically reduced and faster checking for the publish status of the crate.

Jake-Shadle · 2023-08-23T07:51:59Z

So I refactored this to still use tame-index, but without the sparse or git features enabled, in favor of doing "manual" HTTP calls to the crates.io index and zero disk I/O, eliminating the concern around cargo's global package lock, and completely sidestepping all of the issues around the git index. This massively reduces the dependencies added by this PR by no longer using gix (though I would be willing to do a follow up PR to replace this crate's usage of git2 with gix since rust > not-rust IMO).

AFAICT there are no tests in this crate around index communication, which makes sense, so I ran a quick test by just releasing one of my own crates with the changes in this PR.

   Compiling cargo-fetcher v0.14.6 (/home/jake/code/cargo-fetcher/target/package/cargo-fetcher-0.14.6)
    Finished dev [unoptimized + debuginfo] target(s) in 54.48s
    Packaged 52 files, 415.4KiB (107.8KiB compressed)
   Uploading cargo-fetcher v0.14.6 (/home/jake/code/cargo-fetcher)
    Uploaded cargo-fetcher v0.14.6 to registry `crates-io`
note: Waiting for `cargo-fetcher v0.14.6` to be available at registry `crates-io`.
You may press ctrl-c to skip waiting; the crate should be available shortly.
   Published cargo-fetcher v0.14.6 at registry `crates-io`
     Pushing Pushing 0.14.6, main to origin

Obviously you can't tell from the output but the waiting was < 2s, and most importantly, I had blown away ~/.cargo/registry/index/github.com-1ecc6299db9ec823 to ensure that a missing/outdated crates.io git index didn't affect cargo-release, which was my initial reason for opening this PR, seeing that not having a crates.io git index (which will be more and more common due it no longer being the default) caused a massive stall in cargo-release.

Jake-Shadle · 2023-08-23T08:08:54Z

So I had to bump the MSRV, as tame-index requires 1.67.0 due to gix also having a 1.67.0 MSRV, if that's an issue I can close this PR.

Jake-Shadle · 2023-08-23T09:28:48Z

Actually I lied, bumped it to 1.67.1 to get rust-lang/rust-clippy#10265 and avoid a bunch of churn that only applies to 1.67.0.

As pointed out in #30, gix's locking mechanism is insufficient in the face of SIGKILL or other process interruption that can't be caught by gix's signal handler, in addition to the giant footgun that tame-index doesn't actually hook that signal handler for you, meaning applications using it that forget to hook it also run into issues. In addition, I raised #17 while doing crate-ci/cargo-release#693 (https://github.com/crate-ci/cargo-release/pull/693/files/012d3e9a7be23db14096e6a0d41cea7528f9348c#r1301688472) as right now tame-index can perform mutation concurrently with cargo itself, or potentially read partial data while it is being written. This PR resolves both of these issues by forcing all R/W operations on indexes to take a `&FileLock` argument, as well as providing a `LockOptions` "builder" to create file locks. These locks are created using flock on unix and LockFileEx on windows, meaning they can properly be cleaned up by the OS in all situations, including SIGKILL and power loss etc, unlike gix's locks, and is the same mechanism that cargo uses for its global package lock, meaning downstream users can ensure they play nicely with cargo. The lock facilities are part of the public API of tame-index as I opted to roll my own implementation instead of using fslock, as it is very outdated, and doesn't support timeouts. This does mean a lot of unsafe has been added, but it is tested and not _too_ bad. This can potentially be moved out to a separate crate in the future, but is fine for now. This means it could be used to resolve rustsec/rustsec#1011, and is something I will use in cargo-deny for the same thing, protecting access to the advisory database during mutation. It should also be noted that one can also just construct a `FileLock::unlocked()` to satisfy the API, without actually performing any locking, for cases where it's not needed/testing/etc. Resolves: #17 Resolves: #30

epage · 2023-11-01T21:54:33Z

Finally giving in to going with this approach. I made some tweaks in #719.

epage · 2023-11-02T00:49:26Z

So I refactored this to still use tame-index, but without the sparse or git features enabled, in favor of doing "manual" HTTP calls to the crates.io index and zero disk I/O, eliminating the concern around cargo's global package lock, and completely sidestepping all of the issues around the git index.

In doing the upgrades (#722), it drew attention that this wasn't correct and disk IO is done, even in 0.5.1 which this PR is based on.

Jake-Shadle · 2023-11-02T01:08:46Z

In doing the upgrades (#722), it drew attention that this wasn't correct and disk IO is done, even in 0.5.1 which this PR is based on.

No, it is only done if you don't set the etag, which this PR always did.

Replace crates-index with tame-index

012d3e9

Jake-Shadle mentioned this pull request Aug 22, 2023

Embark's pull requests EmbarkStudios/rust-ecosystem#20

Open

26 tasks

epage mentioned this pull request Aug 22, 2023

local feature is not additive EmbarkStudios/tame-index#16

Closed

epage reviewed Aug 22, 2023

View reviewed changes

Change to purely use remote sparse index

145bb50

Jake-Shadle changed the title ~~Replace crates-index with tame-index~~ Unconditionally query crates.io HTTP index for publish status Aug 23, 2023

Jake-Shadle added 2 commits August 23, 2023 09:58

Return MSRV to 1.66.0

44b023b

Update MSRV

d3253ce

Bump to 1.67.1

db48047

Jake-Shadle closed this Sep 5, 2023

Jake-Shadle mentioned this pull request Sep 29, 2023

Add support for OS locking EmbarkStudios/tame-index#33

Merged

epage mentioned this pull request Nov 1, 2023

fix: Always use sparse index #719

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Unconditionally query crates.io HTTP index for publish status #693

Unconditionally query crates.io HTTP index for publish status #693

Jake-Shadle commented Aug 22, 2023 •

edited

Loading

epage left a comment

epage Aug 22, 2023

Jake-Shadle Aug 22, 2023

epage Aug 22, 2023

epage Aug 22, 2023

Jake-Shadle Aug 22, 2023

epage Aug 22, 2023

Jake-Shadle Aug 22, 2023

epage Aug 22, 2023

Jake-Shadle Aug 22, 2023

Jake-Shadle commented Aug 23, 2023

Jake-Shadle commented Aug 23, 2023

Jake-Shadle commented Aug 23, 2023

epage commented Nov 1, 2023

epage commented Nov 2, 2023

Jake-Shadle commented Nov 2, 2023

Unconditionally query crates.io HTTP index for publish status #693

Unconditionally query crates.io HTTP index for publish status #693

Conversation

Jake-Shadle commented Aug 22, 2023 • edited Loading

epage left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Jake-Shadle commented Aug 23, 2023

Jake-Shadle commented Aug 23, 2023

Jake-Shadle commented Aug 23, 2023

epage commented Nov 1, 2023

epage commented Nov 2, 2023

Jake-Shadle commented Nov 2, 2023

Jake-Shadle commented Aug 22, 2023 •

edited

Loading