Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SourceId: Test and fix ambiguous serialization of Git references #11086

Closed
wants to merge 2 commits into from

Conversation

g2p
Copy link
Contributor

@g2p g2p commented Sep 14, 2022

For some branch names (or tags or refs), generate_lockfile will serialize poorly. Parsing the Cargo.lock will refer to a branch that wasn't the intended one in Cargo.toml.

Because a branch name can contain the '#' character, the precise reference could be corrupted through a branch/tag/ref as well.

First commit adds a test showing the problem, second commit fixes it by having the serializer match the deserializer by encoding a x-www-form-urlencoded query string.

Fixes #11085.

@rust-highfive
Copy link

Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @ehuss (or someone else) soon.

Please see the contribution instructions for more information.

@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Sep 14, 2022
@g2p
Copy link
Contributor Author

g2p commented Sep 14, 2022

Re compatibility, this changes the way some git urls are written to Cargo.lock if they contain certain characters (anything outside of [a-zA-Z0-9*._-]).

Running cargo update before/after switching cargo versions will write different git urls when they contain characters that are now escaped. The Cargo.lock stays compatible, because parsing is unchanged: URLs that used to round trip still do, URLs that were generated incorrectly but without breaking precise references (git sha1s) seem to behave the same (cargo seems to use the precise reference, at least for building), minus spurious messages on cargo update (from compare_dependency_graphs). I haven't checked how URLs that broke precise references behaved, but they will be unbroken.

Copy link
Member

@weihanglo weihanglo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the pull request! It is really unfortunate that Cargo parses it incorrectly 😞.

The first thing is Cargo should display a friendly readable URL. Looks like domain and path parts are already displayed in encoded format. I would argue that maybe EncodablePackageId should use a dedicated function other than relying on Display trait to perform encoding, so that we can decode URI to get a pretty display.

The other thing I feel uncertain is the change of lockfile serialization. As you mentioned, lockfile will be updated after this change. cargo build --locked will start failing after this PR, since Cargo detects changes in lockfile. Outside Cargo world, there may exist other dependency analysis tools would get confused as well.

@@ -273,7 +273,7 @@ fn cargo_compile_git_dep_pull_request() {
.cargo("build")
.with_stderr(&format!(
"[UPDATING] git repository `{}`\n\
[COMPILING] dep1 v0.5.0 ({}?rev=refs/pull/330/head#[..])\n\
[COMPILING] dep1 v0.5.0 ({}?rev=refs%2Fpull%2F330%2Fhead#[..])\n\
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is unlucky that we display an encoded format 😢
If a branch name contains non-ASCII character it will be hard to read.

Cargo.toml Outdated Show resolved Hide resolved
g2p added 2 commits September 15, 2022 19:50
This adds a failing test for rust-lang#11085.

For some branch names (or tags or refs), generate_lockfile will
serialize poorly.  Parsing the Cargo.lock will refer to a branch
that wasn't the intended one in Cargo.toml.

Because a branch name can contain the '#' character, the precise
reference could be corrupted through a branch/tag/ref as well.

This is because serialisation is done with a custom Display
implementation that concatenates strings as-is and is unaware
that some characters may need escaping; on the other hand,
deserialization uses url::Url::query_pairs which assumes
<https://url.spec.whatwg.org/#application/x-www-form-urlencoded>.
Use x-www-form-urlencoded which is the standard for query strings,
matching what the deserializer expects.
https://url.spec.whatwg.org/#application/x-www-form-urlencoded

Fixes rust-lang#11085.
@weihanglo weihanglo added the A-lockfile Area: Cargo.lock issues label Sep 15, 2022
Copy link
Member

@weihanglo weihanglo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, sorry for taking such a long time to revisit this.

The Cargo team discussed this a while back. This is definitely an overlook from our side. However, we tend not to change encoding for this that work today. Besides breaking the current lockfile foramt, I forgot the other specific reasons we concluded during the meeting 😅.

Looking into this today again. In my opinion, I feel like this issue

  • affects a relatively small portion(?) using special characters in their branch/tag/rev.
  • has a simple workaround by changing the name of the Git reference.

Your solution is not wrong. My own take is we wait for more lockfile format changes emerging, and then we do a version bump of lockfile format for a certain amount of changes all at once.

What if we want a fix now?

To fix it now before the next lockfile version bump, I'd recommend implementing an ad-hoc URL parser. It shouldn't be too hard to have a naive one here. It may look like:

  1. rsplit_once('#') to get the precise #<commit-sha>.
  2. rsplit_once(&["branch=", "rev=", "tag="]) to get the Git reference.
  3. If none of the above works, fallback to the old logic.

We may need to check in what scenario #<commit-sha> is missing in source field in lockfile, as well as any assumption we will make in this simple parser.

@g2p
Copy link
Contributor Author

g2p commented Nov 15, 2022

I appreciate the review feedback, and I would be happy to see where the cargo team discussed this, if there's a transcript or a good discussion channel.

I want to stress once more that this is not a parsing problem and that the parsing path should not be changed.

Serializing two different inputs to the same serial form is always a problem on the serialisation side. The information is lost after serialisation, and deserialisation will never be able to divine the intended branch name no matter how ad-hoc the parser. Let's keep the perfectly good deserialisation path, it follows the WHATWG URL standard, it should stay there.

So, here is what I'm suggesting:
when writing a version 2 lockfile, this PR can be amended to change the serialised form for fewer special characters, escaping only those that produce known breakage (either breaking round-tripping or spilling into other fields).
Doing simpler, fully conformant serialisation can be conditional on a Cargo.lock version bump.

Deserialisation will stay the same pre and post version bump, no ambiguities there.

This way will minimise churn to VCS-tracked lockfiles.

@weihanglo
Copy link
Member

True that we should keep deserialization correct. I'll label this as https://github.com/rust-lang/cargo/labels/S-blocked at this moment. We'll revisit once we decide to have a lockfile version bump. Thank you for your efforts and suggestions so far!

@weihanglo weihanglo added S-blocked and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Nov 15, 2022
@bors
Copy link
Contributor

bors commented Dec 19, 2022

☔ The latest upstream changes (presumably #11501) made this pull request unmergeable. Please resolve the merge conflicts.

@weihanglo weihanglo added S-needs-design Status: Needs someone to work further on the design for the feature or fix. NOT YET accepted. and removed S-blocked labels Apr 18, 2023
@weihanglo weihanglo marked this pull request as draft April 18, 2023 21:56
@weihanglo
Copy link
Member

#12120 is an issue tracking for all bugfixes needing a lockfile version bump. Given this PR isn't likely to merge in its current status, I'll go ahead closing it. We can track its progress in #11085 and #12120. If I got time I'll have this fixed within a lockfile version bump (or someone gets it faster than me). Thank you for calling out the ambiguity and the fix!

@weihanglo weihanglo closed this May 11, 2023
bors added a commit that referenced this pull request Jul 23, 2023
fix: encode URL params correctly for SourceId in Cargo.lock

We use [`form_urlencoded::byte_serialize`](https://docs.rs/form_urlencoded/1.2.0/form_urlencoded/fn.byte_serialize.html), which is re-exported by `url` crate. Tests are copied from <#11086>. Kudos to the original author!
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-lockfile Area: Cargo.lock issues S-needs-design Status: Needs someone to work further on the design for the feature or fix. NOT YET accepted.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

SourceId serialization is ambiguous due to lack of escaping
5 participants