UTF-32 missing encoding for ansible.git @ salsa #1798

Open
gl-yziquel opened this issue Jan 23, 2025 · 8 comments
Labels
acknowledged an issue is accepted as shortcoming to be fixed

Comments

@gl-yziquel

Current behavior 😯

Hi.

gix clone on https://salsa.debian.org/python-team/packages/ansible.git fails with the following error:

Error: The encoding named 'UTF-32' isn't available

Expected behavior 🤔

It should succeed in cloning.

Git behavior

git clone succeeds.

Steps to reproduce 🕹

gix clone https://salsa.debian.org/python-team/packages/ansible.git

The command fails with the error message shown above and exits with status 1.

@EliahKagan
Member

I can reproduce this on Arch Linux with gitoxide 0.41.0. git clone works on that URL, as does gix clone --bare, but a non-bare gix clone fails on that URL. As expected, a non-bare gix clone also fails on a file:// URL to a local bare clone.

The error occurs in gix-filter:

#[error("The encoding named '{name}' isn't available")]
UnknownEncoding { name: BString },

The problem is that extract_encoding uses the encoding_rs crate:

StateRef::Value(name) => encoding_rs::Encoding::for_label(name.as_bstr())
    .ok_or(configuration::Error::UnknownEncoding {
        name: name.as_bstr().to_owned(),
    })
    .map(|encoding| {

By design, encoding_rs does not support UTF-32:

//! The UTF-32 family of Unicode encoding schemes is not supported
//! by this crate. The Encoding Standard doesn't define any UTF-32
//! family encodings, since they aren't necessary for consuming Web
//! content.

I don't know what the best fix would be.

@Byron
Member

Byron commented Jan 23, 2025

Thanks for reporting, and thanks for doing a first analysis, @EliahKagan!

My intuition here would be to see what Git does if it encounters an unknown encoding. My guess would be that instead of failing everything, it will continue to do as much as possible, and report failed files in the end similar to how case-folded files are reported.

Doing so would make gix clone usable at least.

Edit: Maybe on top of that, it would and should be possible to specifically handle UTF32, there is a widestring crate that could do the trick.
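
As a rough illustration of how small dedicated UTF-32 handling could be, here is a minimal standard-library-only sketch. It does not use the widestring crate and is not gitoxide code: the names `Utf32Error`, `decode_utf32` and `encode_utf32_be_with_bom` are hypothetical, and treating BOM-less input as big-endian is an assumption of this sketch, not a statement about what Git does.

```rust
/// Errors that can occur while decoding UTF-32 bytes (hypothetical type).
#[derive(Debug, PartialEq)]
pub enum Utf32Error {
    /// The input length is not a multiple of four bytes.
    TruncatedInput,
    /// A code unit is not a Unicode scalar value (a surrogate, or above U+10FFFF).
    InvalidScalar(u32),
}

/// Decode UTF-32 bytes into a `String`.
/// A leading BOM selects the byte order and is dropped.
/// Assumption of this sketch: input without a BOM is read as big-endian.
pub fn decode_utf32(bytes: &[u8]) -> Result<String, Utf32Error> {
    if bytes.len() % 4 != 0 {
        return Err(Utf32Error::TruncatedInput);
    }
    let (little_endian, body) = match bytes {
        [0xFF, 0xFE, 0x00, 0x00, rest @ ..] => (true, rest),
        [0x00, 0x00, 0xFE, 0xFF, rest @ ..] => (false, rest),
        _ => (false, bytes),
    };
    body.chunks_exact(4)
        .map(|chunk| {
            let unit = [chunk[0], chunk[1], chunk[2], chunk[3]];
            let value = if little_endian {
                u32::from_le_bytes(unit)
            } else {
                u32::from_be_bytes(unit)
            };
            // `char::from_u32` rejects surrogates and out-of-range values.
            char::from_u32(value).ok_or(Utf32Error::InvalidScalar(value))
        })
        .collect()
}

/// Encode UTF-8 text as big-endian UTF-32 with a leading BOM.
pub fn encode_utf32_be_with_bom(text: &str) -> Vec<u8> {
    let mut out = Vec::with_capacity(4 * (text.chars().count() + 1));
    out.extend_from_slice(&[0x00, 0x00, 0xFE, 0xFF]);
    for ch in text.chars() {
        out.extend_from_slice(&(ch as u32).to_be_bytes());
    }
    out
}
```

Round-tripping a string through `encode_utf32_be_with_bom` and `decode_utf32` should return the original text. A real filter would also have to settle which byte order and BOM policy each of the labels UTF-32, UTF-32LE and UTF-32BE implies, which this sketch leaves open.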

@Byron Byron added the acknowledged an issue is accepted as shortcoming to be fixed label Jan 23, 2025
@EliahKagan
Member

EliahKagan commented Jan 23, 2025

> My intuition here would be to see what Git does if it encounters an unknown encoding. My guess would be that instead of failing everything, it will continue to do as much as possible, and report failed files in the end similar to how case-folded files are reported.

In my experiment (#1798 (comment)), git clone reported success and did not show any error messages. This was with Git 2.48.1 on Arch Linux.

> Maybe on top of that, it would and should be possible to specifically handle UTF32, there is a widestring crate that could do the trick.

That sounds like it could be a good fix. Are there any disadvantages to having it as a dependency?

@Byron
Member

Byron commented Jan 23, 2025

> In my experiment (#1798 (comment)), git clone reported success and did not show any error messages. This was with Git 2.48.1 on Arch Linux.

Right! My thought was to find out what Git does if it encounters an unknown encoding. Does it just fail, or continue with as much as it can handle? gitoxide should probably do no worse, and I personally would like it to keep going while reporting files with issues later.

> That sounds like it could be a good fix. Are there any disadvantages to having it as a dependency?

Probably not, even though I would probably gate additional encodings and their dependencies behind feature flags. Otherwise the amount of additional dependencies needed to fully support everything that Git can handle is an unknown and possibly large number.
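
One way to picture the feature-flag idea is a label lookup whose extra encodings exist only when a Cargo feature is enabled. This is a minimal sketch under stated assumptions: the feature name `utf32`, the `ExtraEncoding` enum and `extra_encoding_for_label` are all hypothetical and do not exist in gitoxide.

```rust
/// Encodings beyond the built-in baseline. Each one is available only when
/// its (hypothetical) Cargo feature is enabled at compile time.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum ExtraEncoding {
    Utf32,
}

/// Look up an optional encoding by label, ignoring ASCII case and
/// surrounding whitespace. Returns `None` for unknown labels and for
/// encodings whose feature was not compiled in.
pub fn extra_encoding_for_label(label: &str) -> Option<ExtraEncoding> {
    let normalized = label.trim().to_ascii_lowercase();
    match normalized.as_str() {
        // `cfg!` evaluates to a compile-time boolean, so without the
        // feature this arm never matches.
        "utf-32" | "utf32" if cfg!(feature = "utf32") => Some(ExtraEncoding::Utf32),
        _ => None,
    }
}
```

In a real crate the feature would be declared in `Cargo.toml` and would pull in its optional dependency, so builds that do not need UTF-32 pay nothing for it.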

@EliahKagan
Member

EliahKagan commented Jan 23, 2025

> My thought was to find out what Git does if it encounters an unknown encoding. Does it just fail, or continue with as much as it can handle?

This is not decisive--maybe git doesn't check if the encoding exists when the file is empty, or maybe it doesn't use it at all if it doesn't have to normalize line endings--but it looks like git will silently clone a repository with anything specified as an encoding, even if it is not a real encoding, and that it does not print any messages when doing so.

In contrast, gix clone will report the same error shown in the description here, fail, and delete its partial clone (i.e. not leave a directory) if any encoding it does not recognize is specified as the encoding of at least one file in .gitattributes, and that file exists. I have verified that, if the file does not exist, the error does not occur. All my testing so far has been with very simple test repositories as described above, and on the same Arch Linux system. Also, I have not tried a sparse checkout; my guess, from my understanding of what the code is trying to do (and since bare clones are fine), is that if a file need not be checked out, then it does not produce the error.

This is to say that a repository that triggers this bug can be produced as follows:

git init has-utf32-encoding
cd has-utf32-encoding
echo 'a text working-tree-encoding=UTF-32' >.gitattributes
touch a
git add .
git commit -m 'Initial commit'
cd ..
git clone has-utf32-encoding hue  # Verify that git silently works.
rm -rf hue
gix clone has-utf32-encoding hue

The last command produces:

 05:00:09 indexing done 7.0 objects in 0.00s (49.5K objects/s)
 05:00:09 decompressing done 767B in 0.00s (5.0MB/s)
 05:00:09     Resolving done 7.0 objects in 0.05s (138.0 objects/s)
 05:00:09      Decoding done 868B in 0.05s (17.2KB/s)
 05:00:09 writing index file done 1.3KB in 0.00s (15.8MB/s)
 05:00:09  create index file done 7.0 objects in 0.05s (137.0 objects/s)
 05:00:09          read pack done 658B in 0.05s (12.0KB/s)
Error: The encoding named 'UTF-32' isn't available

Likewise, with an encoding pretty much guaranteed not to be recognized:

git init has-unrecognized-encoding
cd has-unrecognized-encoding
echo 'a text working-tree-encoding=wait-a-minute-this-is-not-a-real-encoding' >.gitattributes
touch a
git add .
git commit -m 'Initial commit'
cd ..
git clone has-unrecognized-encoding hue  # Verify that git silently works, even with this.
rm -rf hue
gix clone has-unrecognized-encoding hue

The last command produces:

 05:03:14 indexing done 7.0 objects in 0.00s (55.6K objects/s)
 05:03:14 decompressing done 881B in 0.00s (6.5MB/s)
 05:03:14     Resolving done 7.0 objects in 0.05s (138.0 objects/s)
 05:03:14      Decoding done 982B in 0.05s (19.4KB/s)
 05:03:14 writing index file done 1.3KB in 0.00s (30.7MB/s)
 05:03:14  create index file done 7.0 objects in 0.05s (137.0 objects/s)
 05:03:14          read pack done 706B in 0.05s (13.0KB/s)
Error: The encoding named 'wait-a-minute-this-is-not-a-real-encoding' isn't available

The effects in both git and gix are the same with file:// URLs and with https:// URLs to two GitHub repositories I made that correspond to those except that they each have a second commit with a readme in it:

So if one does not want to make a repository, one can use the repository presented in the description. But to avoid waiting for it to clone (since the fetch part has no errors, just the checkout), or to check with an encoding that truly does not exist, those test repositories can be used.

(I may at some point move my test repositories on GitHub into an organization that exists to hold them and distinguish them from others. But links with them under EliahKagan should continue to work fine even after that is done, for everything except testing things that relate to moving repositories into organizations on GitHub. Most testing for which small test repositories are useful will of course have nothing to do with that.)


Edit: In the https://salsa.debian.org/python-team/packages/ansible.git repository, it looks like the .gitattributes file that triggers this error in gix clone is ansible_collections/community/windows/tests/integration/targets/win_lineinfile/files/expectations/.gitattributes. That file's contents are:

*.text text eol=LF
*.txt text eol=CRLF
*.txt16 text working-tree-encoding=UTF-16 eol=CRLF
*.txt32	text working-tree-encoding=UTF-32 eol=CRLF

(It also seems to have a blank line at the end, but I do not think that is relevant to this issue.)

There are two files in that directory that match *.txt32: 27_utf32.txt32 and 28_utf32_line_added.txt32.
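
To see which lines of such a file request an encoding, the `.gitattributes` contents quoted above can be scanned with a deliberately simplified parser. This is only a sketch: `working_tree_encoding` is a hypothetical helper, it splits on whitespace, takes the first `working-tree-encoding=` token on a line, and ignores quoted patterns, macros and everything else a real attributes parser handles.

```rust
/// Return `(pattern, encoding)` if a `.gitattributes` line assigns
/// `working-tree-encoding`. Blank lines and `#` comments yield `None`.
/// Simplification: tokens are split on whitespace only, and the first
/// matching assignment on the line is used.
pub fn working_tree_encoding(line: &str) -> Option<(&str, &str)> {
    let line = line.trim();
    if line.is_empty() || line.starts_with('#') {
        return None;
    }
    let mut tokens = line.split_whitespace();
    // The first token is the path pattern, the rest are attributes.
    let pattern = tokens.next()?;
    let encoding = tokens.find_map(|token| token.strip_prefix("working-tree-encoding="))?;
    Some((pattern, encoding))
}
```

Applied line by line to the file above, this yields `*.txt16` with `UTF-16` and `*.txt32` with `UTF-32`. The tab after `*.txt32` is handled because `split_whitespace` treats tabs like spaces.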

@Byron
Member

Byron commented Jan 25, 2025

Thanks a lot for researching this!

> This is not decisive--maybe git doesn't check if the encoding exists when the file is empty, or maybe it doesn't use it at all if it doesn't have to normalize line endings--but it looks like git will silently clone a repository with anything specified as an encoding, even if it is not a real encoding, and that it does not print any messages when doing so.

To me this looks like Git is silently ignoring unknown encodings, and I think that gitoxide can be more verbose here once this is fixed. Also, as it stands, this is definitely a bug, since Git never fails on unknown encodings. Here is the relevant code in Git.

@EliahKagan
Member

EliahKagan commented Jan 26, 2025

Would it be a bug not to fail to check out files for which an unrecognized encoding is specified, even if no line-ending normalization or other transformation would require knowledge of the encoding? Or should that be allowed and other commands (git diff, maybe?) produce errors or warnings if their operation is degraded by the uncertain encoding?

Another thing I am unclear on is how stringently encodings need to be respected as applied to .gitattributes files. Does a .gitattributes file that specifies an encoding for itself need to be respected in that regard, by attempting to interpret it under various encodings to figure out what, if anything, it says that implies its encoding? If a higher-up .gitattributes file specifies an encoding for a subordinate one, must that be respected and, if so, what happens if the lower .gitattributes file specifies a different encoding for itself?

I am not familiar enough with the semantics of encodings in Git and gitoxide to know if these questions point to any actual difficulties (other than difficulties from my own limited knowledge of this topic).

@Byron
Member

Byron commented Jan 26, 2025

> Would it be a bug not to fail to check out files for which an unrecognized encoding is specified, even if no line-ending normalization or other transformation would require knowledge of the encoding? Or should that be allowed and other commands (git diff, maybe?) produce errors or warnings if their operation is degraded by the uncertain encoding?

I would also think that the filter should fail as usual, but that certain failure types should be ignored where needed. So the checkout would probably ignore the failure, report it separately, and, like Git, probably check out what's stored in Git instead.
Diffing would do nothing of that kind and would just fail by default, as it does now.
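
The checkout behavior described here can be sketched as a loop that never aborts on a conversion failure: it falls back to the bytes stored in Git and collects the failures for a report at the end. This is a minimal sketch under stated assumptions, not gitoxide's checkout code: `Entry`, `ConvertError`, `CheckoutReport` and `checkout_lenient` are hypothetical, and the converter is passed in as a closure so the sketch does not depend on any particular encoding library.

```rust
/// Why a file could not be converted (hypothetical error type).
#[derive(Debug, Clone, PartialEq)]
pub enum ConvertError {
    UnknownEncoding(String),
}

/// One file to check out: its path, the `working-tree-encoding` that
/// applies to it (if any), and the blob content stored in Git.
pub struct Entry<'a> {
    pub path: &'a str,
    pub encoding: Option<&'a str>,
    pub blob: &'a [u8],
}

/// What a lenient checkout produced.
#[derive(Debug, Default, PartialEq)]
pub struct CheckoutReport {
    /// Every file and the bytes that would be written for it.
    pub written: Vec<(String, Vec<u8>)>,
    /// Files that were written unconverted, with the reason.
    pub problems: Vec<(String, ConvertError)>,
}

/// Check out all entries, never stopping at a conversion failure.
pub fn checkout_lenient<C>(entries: &[Entry<'_>], convert: C) -> CheckoutReport
where
    C: Fn(&[u8], &str) -> Result<Vec<u8>, ConvertError>,
{
    let mut report = CheckoutReport::default();
    for entry in entries {
        let bytes = match entry.encoding {
            None => entry.blob.to_vec(),
            Some(name) => match convert(entry.blob, name) {
                Ok(converted) => converted,
                Err(err) => {
                    // Keep going: use the stored bytes and remember the failure.
                    report.problems.push((entry.path.to_string(), err));
                    entry.blob.to_vec()
                }
            },
        };
        report.written.push((entry.path.to_string(), bytes));
    }
    report
}
```

A caller such as a clone command could print `problems` after the checkout finishes, while a stricter caller such as diffing could treat the first problem as a hard error.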

> Another thing I am unclear on is how stringently encodings need to be respected as applied to .gitattributes files. Does a .gitattributes file that specifies an encoding for itself need to be respected in that regard, by attempting to interpret it under various encodings to figure out what, if anything, it says that implies its encoding? If a higher-up .gitattributes file specifies an encoding for a subordinate one, must that be respected and, if so, what happens if the lower .gitattributes file specifies a different encoding for itself?

That's an interesting thought, I never thought about it! Right now the top-level file would control the subordinate file's encoding. A .gitattributes file can't change its own encoding right now, and I don't think Git does so either.

3 participants