Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[API Proposal]: TarReader should not assume same format for all entries in a tar file #69544

Closed
carlossanlop opened this issue May 19, 2022 · 6 comments
Assignees
Labels
api-approved API was approved in API review, it can be implemented area-System.Formats.Tar
Milestone

Comments

@carlossanlop
Copy link
Member

carlossanlop commented May 19, 2022

Background and motivation

The user @Bio2hazard was kind enough to report this bug when testing the new System.Formats.Tar APIs published in .NET 7 Preview4 with a tar file he generated with their own work tools. He shared the file here.

The tar file contains 4 entries:

  • 2 consecutive RegularFile ustar entries.
  • Then a gnu metadata entry of type LongLink.
  • Then the RegularFile entry with the actual data that the previous LongLink metadata entry describes.

I confirmed this by inspecting the archive with a Hex Editor.

The archive also shows that all 4 entries have magic and a version metadata fields following the ustar rules. Even the 3rd entry, which is clearly not supported by ustar because the entry type is L. In the gnu format, the magic and version fields are slightly different to those from POSIX formats (ustar and pax). I describe those differences here.

The current behavior of our TarReader is to throw FormatException when initially reading an entry of a particular format, and then encountering an subsequent entry in a different format, or the entry type is unsupported by the initially-assumed format of the whole archive.

Surprisingly, if this same file is opened and traversed with SharpCompress or 7-zip, they can traverse all the entries without problems. This means that:

  • They do not assume the whole archive is in the same format, and allow intermixing entries of different formats.
  • They do not mind if the entry has a magic and a version that belongs to a particular format, but the entry type is for another format.

I am opening this proposal to discuss the possibility of becoming as flexible as SharpCompress and 7-zip, at least when it comes to the TarReader.

The TarWriter should keep its current behavior: The user should specify the initial format in the constructor, and if an unsupported entry is inserted, it should be converted to that format if possible, or throw an exception if the file is unsupported. This behavior aligns with the Unix tar tool, which fails if, for example, the user tries to add a block device entry, or a long path entry, when creating a v7 or ustar archive (these two formats do not support those types of files).

@bartonjs @stephentoub @eerhardt @adamsitnik @jozkee @jeffhandley @baronfel

API Proposal

Remove the Format property from TarReader to stop assuming all entries are in the same format:

Remove:

namespace System.Formats.Tar;

public class TarReader : IDisposable
{
-      public TarFormat Format { get; }
}

Add:

public class TarEntry
{
+    public TarFormat Format { get; }
}

API Usage

using FileStream fs = File.OpenRead("archive.tar"); // Archive with intermixed entries
using TarReader reader = new(fs);
TarEntry entry = reader.GetNextEntry();

// The entry format can be detected by using the Format property
switch (entry.Format)
{
    case TarFormat.V7:
        //...
        break;
    case TarFormat.Ustar:
        //...
        break;
    case TarFormat.Pax:
        //...
        break;
    case TarFormat.Gnu:
        //...
        break;
    case TarFormat.Unknown:
        //...
        break;
}

Alternative Designs

One alternative is to just not remove anything, and if we encounter an entry of a different format than the first one, we just switch the Format property to Unknown, but keep returning entries to the user when they call GetNextEntry. We would have to document this very clearly.

Risks

Low.

  • We are in still in preview, so we have time for adjustments.
  • We want to be as helpful as SharpCompress and 7-zip, especially if they are able to handle these cases and we don't.
  • The spec does not explicitly indicate that all entries are expected to be in the same format, but it does specify clear rules about the metadata differences between formats, which should help ensure we properly detect the format of each entry individually and independently from the others.
@carlossanlop carlossanlop added api-suggestion Early API idea and discussion, it is NOT ready for implementation area-System.IO labels May 19, 2022
@carlossanlop carlossanlop added this to the 7.0.0 milestone May 19, 2022
@carlossanlop carlossanlop self-assigned this May 19, 2022
@ghost
Copy link

ghost commented May 19, 2022

Tagging subscribers to this area: @dotnet/area-system-io
See info in area-owners.md if you want to be subscribed.

Issue Details

Background and motivation

The user @Bio2hazard was kind enough to report this bug when testing the new System.Formats.Tar APIs published in .NET 7 Preview4 with a tar file he generated with their own work tools. He shared the file here.

The tar file contains 4 entries:

  • 2 consecutive RegularFile ustar entries.
  • Then a gnu metadata entry of type LongLink.
  • Then the RegularFile entry with the actual data that the previous LongLink metadata entry describes.

I confirmed this by inspecting the archive with a Hex Editor.

The archive also shows that all 4 entries have magic and a version metadata fields following the ustar rules. Even the 3rd entry, which is clearly not supported by ustar because the entry type is L. In the gnu format, the magic and version fields are slightly different to those from POSIX formats (ustar and pax). I describe those differences here.

The current behavior of our TarReader is to throw FormatException when initially reading an entry of a particular format, and then encountering an subsequent entry in a different format, or the entry type is unsupported by the initially-assumed format of the whole archive.

Surprisingly, if this same file is opened and traversed with SharpCompress or 7-zip, they can traverse all the entries without problems. This means that:

  • They do not assume the whole archive is in the same format, and allow intermixing entries of different formats.
  • They do not mind if the entry has a magic and a version that belongs to a particular format, but the entry type is for another format.

I am opening this proposal to discuss the possibility of becoming as flexible as SharpCompress and 7-zip, at least when it comes to the TarReader.

The TarWriter should keep its current behavior: The user should specify the initial format in the constructor, and if an unsupported entry is inserted, it should be converted to that format if possible, or throw an exception if the file is unsupported. This behavior aligns with the Unix tar tool, which fails if, for example, the user tries to add a block device entry, or a long path entry, when creating a v7 or ustar archive (these two formats do not support those types of files).

@bartonjs @stephentoub @eerhardt @adamsitnik @jozkee @jeffhandley @baronfel

API Proposal

Remove the Format property from TarReader to stop assuming all entries are in the same format:

namespace System.Formats.Tar;

public class TarReader : IDisposable
{
-      public TarFormat Format { get; private set; }
}

API Usage

using FileStream fs = File.OpenRead("archive.tar"); // Archive with intermixed entries
using TarReader reader = new(fs);
TarEntry entry = reader.GetNextEntry();

if (entry is V7TarEntry v7Entry)
{
  // do something if v7
}
else if (entry is UstarTarEntry ustarEntry)
{
  // do something if ustar
}
else if (entry is PaxTarEntry paxEntry)
{
  // do something if pax
}
else if (entry is GnuTarEntry gnuEntry)
{
  // do something if gnu
}

Alternative Designs

Optionally, but not really needed, we could move the Format property to the TarEntry base class so the user can easily determine to which subclass they should cast the entries they get with TarReader.GetNextEntry():

namespace System.Formats.Tar;

public abstract class TarEntry
{
+      public TarFormat Format { get; private set; }
}

Which would allow using the property instead of casting:

using FileStream fs = File.OpenRead("archive.tar"); // Archive with intermixed entries
using TarReader reader = new(fs);
TarEntry entry = reader.GetNextEntry();

switch (entry.Format)
{
  case TarFormat.V7:
    // do something if v7
    break;
  case TarFormat.Ustar:
    // do something if ustar
    break;
  case TarFormat.Pax:
    // do something if pax
    break;
  case TarFormat.Gnu:
    // do something if gnu
    break;
}

Risks

Low.

  • We are in still in preview, so we have time for adjustments.
  • We want to be as helpful as SharpCompress and 7-zip, especially if they are able to handle these cases and we don't.
  • The spec does not explicitly indicate that all entries are expected to be in the same format, but it does specify clear rules about the metadata differences between formats, which should help ensure we properly detect the format of each entry individually and independently from the others.
Author: carlossanlop
Assignees: carlossanlop
Labels:

api-suggestion, area-System.IO

Milestone: 7.0.0

@bartonjs
Copy link
Member

I think adding the TarFormat kind on the entry is fine... it has goodness (jump table instead of cast-a-fall) and precedent.

I assume GNU-format global attributes still have to be literally the first entry to work as expected, and not simply the first GNU-format entry in the archive. If not, is there a way of making the behavior apparent to a user who might want to read from them?

What does this do to the writer? Now that there's evidence that the various existing TAR programs don't care about hybrid format files, it seems like the writer shouldn't convert TarEntry objects by itself; just using an initial mode for what kind of format to use for the easy methods. Once it's NOT converting, then maybe there's missing API to perform the conversion, like each TarEntry type gaining a .ctor(TarEntry other), or some such.

@bartonjs
Copy link
Member

Re-reading the top I see you already talked about the writer continuing to be converting. To me that made sense when an archive had to use only one family of entries, but makes less sense if the majority of tools support hybrid.

The ability to convert probably has some tidiness/goodness; but it feels like it should maybe just be extra/special API (like converting copy-style ctors).

@carlossanlop
Copy link
Member Author

carlossanlop commented May 28, 2022

From our offline conversation, Jeremy, removing the format property from TarReader and moving it to TarEntry make the most sense, so I'll edit the proposal and mark it as ready for review.

@carlossanlop carlossanlop added blocking Marks issues that we want to fast track in order to unblock other important work api-ready-for-review API is ready for review, it is NOT ready for implementation and removed api-suggestion Early API idea and discussion, it is NOT ready for implementation labels May 28, 2022
@bartonjs
Copy link
Member

bartonjs commented May 31, 2022

Video

Looks good as proposed.

namespace System.Formats.Tar
{
    public partial class TarReader
    {
-       public TarFormat Format { get; }
    }

    public partial class TarEntry
    {
+       public TarFormat Format { get; }
    }
}

@bartonjs bartonjs added api-approved API was approved in API review, it can be implemented and removed blocking Marks issues that we want to fast track in order to unblock other important work api-ready-for-review API is ready for review, it is NOT ready for implementation labels May 31, 2022
@ghost ghost added in-pr There is an active PR which will close this issue when it is merged and removed in-pr There is an active PR which will close this issue when it is merged labels Jun 1, 2022
@carlossanlop
Copy link
Member Author

Fixed with #70313 and #70325.

@ghost ghost locked as resolved and limited conversation to collaborators Aug 6, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
api-approved API was approved in API review, it can be implemented area-System.Formats.Tar
Projects
None yet
3 participants