-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Tar APIs pending feedback to address #68230
Comments
I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label. |
Tagging subscribers to this area: @dotnet/area-system-io Issue DetailsI received feedback in my first Tar PR #67883 that I would like to address in a follow-up PR.
|
On .NET 7 Preview 4,
Hope that helps. |
Thank you for your report, @Bio2hazard ! Any chance you can share a sample tar for each one of your experiments? |
@carlossanlop yes I was able to reproduce it without sensitive information, thankfully. Here it is: It has 3 files in it:
I was trying to benchmark it, so I'll just drop the benchmarking code in here. Goal is just to get the content using System.Formats.Tar;
using System.IO.Compression;
using BenchmarkDotNet.Attributes;
namespace Benchmarks7;
[GcServer(true)]
[GcForce(true)]
[MemoryDiagnoser]
public class ReadFromTarGz
{
private static byte[] fileData;
public ReadFromTarGz()
{
fileData = File.ReadAllBytes(@"testarchive.tar.gz");
}
[Benchmark]
public List<byte[]> WithSharpCompress()
{
var ret = new List<byte[]>();
using (var fileStream = new MemoryStream(fileData))
using (var stream = new System.IO.Compression.GZipStream(fileStream, CompressionMode.Decompress))
using (var tarReader = SharpCompress.Readers.Tar.TarReader.Open(stream))
{
while (tarReader.MoveToNextEntry())
{
var entry = tarReader.Entry;
if (entry.IsDirectory)
{
continue;
}
byte[] entryData = new byte[entry.Size];
using (var fileMemoryStream = new MemoryStream(entryData))
using (var entryStream = tarReader.OpenEntryStream())
{
entryStream.CopyTo(fileMemoryStream);
}
ret.Add(entryData);
}
}
return ret;
}
[Benchmark]
public List<byte[]> WithSystemsFormatTar_PreDecompressed()
{
var ret = new List<byte[]>();
using var decompressedStream = new MemoryStream();
using (var fileStream = new MemoryStream(fileData))
using (var stream = new System.IO.Compression.GZipStream(fileStream, CompressionMode.Decompress))
{
stream.CopyTo(decompressedStream);
}
decompressedStream.Position = 0;
using (var tarReader = new System.Formats.Tar.TarReader(decompressedStream))
{
System.Formats.Tar.TarEntry entry;
while ((entry = tarReader.GetNextEntry()) != null)
{
if (entry.EntryType == TarEntryType.Directory)
{
continue;
}
byte[] entryData = new byte[entry.Length];
using (var fileMemoryStream = new MemoryStream(entryData))
{
entry.DataStream.CopyTo(fileMemoryStream);
}
ret.Add(entryData);
}
}
return ret;
}
[Benchmark]
public List<byte[]> WithSystemsFormatTar_StreamedDecompression()
{
var ret = new List<byte[]>();
using (var fileStream = new MemoryStream(fileData))
using (var stream = new System.IO.Compression.GZipStream(fileStream, CompressionMode.Decompress))
using (var tarReader = new System.Formats.Tar.TarReader(stream))
{
System.Formats.Tar.TarEntry entry;
while ((entry = tarReader.GetNextEntry()) != null)
{
if (entry.EntryType == TarEntryType.Directory)
{
continue;
}
byte[] entryData = new byte[entry.Length];
using (var fileMemoryStream = new MemoryStream(entryData))
{
entry.DataStream.CopyTo(fileMemoryStream);
}
ret.Add(entryData);
}
}
return ret;
}
[Benchmark]
public List<byte[]> WithSystemsFormatTar_StreamedDecompression_WithCopyData()
{
var ret = new List<byte[]>();
using (var fileStream = new MemoryStream(fileData))
using (var stream = new System.IO.Compression.GZipStream(fileStream, CompressionMode.Decompress))
using (var tarReader = new System.Formats.Tar.TarReader(stream))
{
System.Formats.Tar.TarEntry entry;
while ((entry = tarReader.GetNextEntry(true)) != null)
{
if (entry.EntryType == TarEntryType.Directory)
{
continue;
}
byte[] entryData = new byte[entry.Length];
using (var fileMemoryStream = new MemoryStream(entryData))
{
entry.DataStream.CopyTo(fileMemoryStream);
}
ret.Add(entryData);
}
}
return ret;
}
}
Exception:
I hope that helps! |
Hi again @Bio2hazard , What tool did you use to generate the tar file inside your gz file? I inspected the internal tar file with the Hex Editor HxD. The Here is what the spec has to say about the For POSIX archives (ustar and pax)
And then a few lines below it says:
For GNU
In other words:
I'm surprised that SharpCompress and 7-zip are able to open this malformed archive. This tells me that they are flexible when it comes to mixed format entries. The spec does not explicitly indicate the entries should all be in the same format, but I interpreted it this way considering there are metadata entries like Based on the above, I see one public API structure change we should make: If archives are allowed to have mixed entries as we are seeing in your example (or in other words, should not expect all entries to be of the same format), then we would have to remove the public Note: The |
@Bio2hazard I'm really curious how you generated your archive, because even if I address the above by allowing intermixing formats, the |
Yikes, I was unaware that the archive is in a malformed format. So the original file that I first encountered this problem with was created by TeamCity Enterprise. A build agent designated a folder as the output artifact, and TeamCity automatically turned the folder into a These are the relevant build logs:
Of course, I can't exactly share our internal build artifacts, so what I did was:
My hope here was that modifying the file in this manner would cause 7z to use the same tar format/style as the original archive produced by TeamCity. While I understand that that is a fairly ... unorthodox way of creating an archive, I am actually fairly confident it is a faithful reproduction. Debugging the actual TeamCity archive and the supplied test archive shows that they both fail in the same way with the same stack traces. Furthermore, I had actually as a proof of concept created a very rudimentary manual tar extractor based on the GNU spec and it also got tripped up and failed at the long link part 😄 TeamCity is a big player in the CI/CD space, and the artifact publisher that produced this artifact is built-in and native. Given that I could not find any notable user discussions around problems with the artifact archives produced by TeamCity, I have to assume that it is very widely supported despite being malformed. Otherwise I am sure there would be plenty of discussions about it. Please let me know if you need anything else. If you are concerned that the reproduction archive isn't accurate enough I can provide the header bytes of the original archive. Edit: To further make sure the reproduction is accurate:
|
Thanks for the information, @Bio2hazard . A few more questions:
Do you happen to know what was the original format of the archive? I couldn't find it in the documentation you shared. The reason I ask is because you then mentioned:
Which I interpret as: you deleted all the original files that the TeamCity tool generated, then added new files using 7-zip. So it's 7-zip who's adding the (seemingly) malformed files, not TeamCity.
Do you mean you tried reading the entries using the new
I don't think the archive created with the TeamCity tool is generating malformed entries. I think it was 7-zip, since you did not preserve any of the original artifacts.
These 3 steps are confusing: Are you saying that some of the files in the .tar.gz file you shared are the original ones that were added by the TeamCity tool? If yes, which one was not removed? If all the files were replaced using 7-zip, then 7-zip is making a bad job at detecting the format of existing entries, and adding entries in whatever format it wants. Still, if you're saying 7-zip and SharpCompress both support intermixed format entries in a single archive, then we should support them as well. So I'll start working on that. |
Sorry, you may be right. I'll try to clarify. I created a LINQPad Script (gist) to read the headers of the files in the tar archive and compare the headers between 2 archives. I was trying to determine if 7z changes the tar headers when files are added through the 7z GUI, as you are suspecting. I extracted and untar'd the original archive as produced by teamcity, then made a copy of it and through the 7z GUI deleted, and then re-added a single file (from the extracted original archive). In that test, the header did not show up as changed by 7z, but I'm not convinced it was working right. I performed the experiment again today, this time removing and re-adding every file, and now the headers do show up differently, so 7z is modifying the tar headers. At a glance, I can tell that some differences are that user / group are removed and that the file modes are different. So for comparison, here is the Base64 of a tar header as created directly by TeamCity: And the Base64 of the tar header for the same file as created by 7z: Here is the Base64 of a long link tar header as created directly by TeamCity: And the corresponding Base64 of a long link tar header as created by 7z: Finally, here is the Base64 of a long file name entry as created directly by TeamCity: And here is the corresponding Base64 of a long file name entry as created by 7z: I think that should cover all cases as far as header differences goes. It bears repeating that the original, completely unmodified, downloaded straight from teamcity Hope that helps and clears things up. I'm an SDE, not DevOps so I didn't have the option of producing a sharable/sanitized archive directly in teamcity. |
Thanks for spending time helping with this, @Bio2hazard. I appreciate it a lot.
Ah good to know, my suspicion was correct.
This might be because you were doing this on a Windows machine. The concept of mode, uname and gname are from Unix. This is ok.
This is interesting, and I think it's the only thing left for us to clarify, if that's ok. Do you mind sharing the exception message/callstack when using TarReader with the original TeamCity archive? I think the exception messages you shared in your previous comments came from the modified file, which contained entries that were added by 7-zip, not TeamCity. Feel free to obfuscate any sensitive information if needed. Actually, if you have time and the chance to help a bit more: if you can download HxD Hex Editor and open the Can you share the magic and version fields of the LongLink entry in the TeamCity archive? This would help us determine if it's malformed or not. Here's what I get from the first entry: Here are the And again, thank you so much for testing this and for your patience. |
@carlossanlop I'll get those stack traces for you
I wanted to quickly add that you should be able to just base64 decode the headers in my last message. They should have exactly what you are looking for. It's the first 512 bytes of the entries, from both teamcity and 7z. |
Oh cool, never done that. Allow me to figure it out then, and will come back to you if I don't find the info I wanted. EDIT: this is AWESOME! https://www.base64decode.org/ |
@carlossanlop nice looks like it worked out! Base64 is just a quick and convenient way to share byte[]s through a text format. You can use I got the stack traces for you: WithSystemsFormatTar_PreDecompressed()In this variant, the TeamCity Archive:
Test Archive:
Stack Traces are identical WithSystemsFormatTar_StreamedDecompression()In this variant, the TeamCity Archive:
Test Archive:
The stack traces are again identical. This is the relevant failure point in // Empty checksum means this is an invalid (all blank) entry, finish early
Span<byte> spanChecksum = buffer.Slice(FieldLocations.Checksum, FieldLengths.Checksum);
if (TarHelpers.IsAllNullBytes(spanChecksum))
{
return false;
} The debugger showed all bytes of the checksum being null, which bubbles up and causes WithSystemsFormatTar_StreamedDecompression_WithCopyData()In this variant, the TeamCity Archive:
Test Archive:
This is the only test case where the result was different between the two. I hope that helps! |
* Address some System.Formats.Tar TODOs (infra and syscalls) (#69107) Re-submitting the changes approved in PR: #69107 Addresses: #68230 Includes an extra change to prevent the android build failure with the addition of the new native call to getgrgid_r. Co-authored-by: carlossanlop <[email protected]> Co-authored-by: Adam Sitnik <[email protected]>
Initial PR changes were introduced here: #67883
The following is pending feedback I need to address (not urgent, for 8.0, mostly just additional testing coverage):
Perf improvements:
Additional test coverage:
Done:
.\
. Tar entries should be able to handle paths that begin with "./", including the root path itself #70516Format
detection in the reader when an archive has entries of multiple formats. SharpCompress and 7-zip handle this, we should too.The text was updated successfully, but these errors were encountered: