Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[API Proposal]: Allow multiple Global Extended Attributes entries in Tar archives with PAX format #69935

Closed
carlossanlop opened this issue May 28, 2022 · 6 comments
Assignees
Labels
api-approved API was approved in API review, it can be implemented area-System.Formats.Tar
Milestone

Comments

@carlossanlop
Copy link
Member

Background and motivation

In the initial Tar proposal, the assumption was that the PAX format allowed one special entry known as Global Extended Attributes (GEA from now on), at the beginning of the archive, to allow overriding the extended attributes all the subsequent entries in the archive. This was incorrect.

The FreeBSD tar spec does not explain much about the GEA entries, except that they exist.

Both the OpenGroup pax manual and the GNU tar manual explain the format of the name in the GEA entry, which is $TMPDIR/GlobalHead.%p.%n, and describes the suffix number as:

  %n                 An integer that represents the
                        sequence number of the global extended
                        header record in the archive, starting
                        at 1.

But there is no mention of when to expect more than 1 entry and what they mean.

Then I recently found this spec: IBM z/OS 2.5.0 pax manual, which has a clear and detailed description of how the GEA entry should work:

g
Represents global extended header records for the following files in the archive.
[...]
Each value shall affect all subsequent files that do not override that value in their own extended header record and until another global extended header record is reached that provides another value for the same field.

API Proposal

New:

namespace System.Formats.Tar
{
    public sealed partial class PaxGlobalExtendedAttributesTarEntry : System.Formats.Tar.PosixTarEntry
    {
        public PaxGlobalExtendedAttributesTarEntry(System.Collections.Generic.IEnumerable<System.Collections.Generic.KeyValuePair<string, string>> globalExtendedAttributes) { }
        public System.Collections.Generic.IReadOnlyDictionary<string, string> GlobalExtendedAttributes { get { throw null; } }
    }

Remove:

    public sealed partial class TarReader : System.IDisposable
    {
-        public System.Collections.Generic.IReadOnlyDictionary<string, string>? GlobalExtendedAttributes { get { throw null; } }
    }

Modify:

    public sealed partial class TarWriter : System.IDisposable
    {
-        public TarWriter(System.IO.Stream archiveStream, System.Collections.Generic.IEnumerable<System.Collections.Generic.KeyValuePair<string, string>>? globalExtendedAttributes = null, bool leaveOpen = false) { }
+        public TarWriter(System.IO.Stream archiveStream, bool leaveOpen = false) { }
    }

API Usage

Before

We could only add one GEA entry:

// Write
Dictionary<string, string> attributes = new();
attributes["SomeAttributeName"] = "I'm an extended attribute!";

using MemoryStream ms = new();
using TarWriter writer = new(ms, attributes, leaveOpen: true)
{
   // Add some more entries if desired
}

ms.Position = 0;
using TarReader reader = new(ms, leaveOpen: false);
reader.GetNextEntry(); // Advance the reader to detect format and GEA entry
// Access the values of the single GEA
Console.WriteLine(reader.GlobalExtendedAttributes["SomeAttributeName"]); // "I'm an extended attribute!"

After

With this proposed change, we could now add multiple GEA entries:

// Write
Dictionary<string, string> attributes1 = new();
attributes["attr1"] = "I'm extended attribute 1!";
PaxGlobalExtendedAttributesTarEntry gea1 = new(attributes1);

Dictionary<string, string> attributes2 = new();
attributes["attr2"] = "I'm extended attribute 2!";
PaxGlobalExtendedAttributesTarEntry gea2 = new(attributes2);

using MemoryStream ms = new();
using TarWriter writer = new(ms, leaveOpen: true) // Default format is PAX for this constructor
{
  writer.WriteEntry(gea1);
  // Add some more entries if desired, they'll be affected by gea1
  writer.WriteEntry(gea2);
  // Add some more entries if desired, they'll be affected by gea2
}

ms.Position = 0;
using TarReader reader = new(ms, leaveOpen: false);
PaxGlobalExtendedAttributesTarEntry readGea1 = reader.GetNextEntry() as PaxGlobalExtendedAttributesTarEntry;
Console.WriteLine(readGea1.GlobalExtendedAttributes["attr1"]); // "I'm extended attribute 1!"

// Multiple calls of GetNextEntry for the other entries, until reaching the next GEA entry

PaxGlobalExtendedAttributesTarEntry readGea2 = reader.GetNextEntry() as PaxGlobalExtendedAttributesTarEntry;
Console.WriteLine(readGea2.GlobalExtendedAttributes["attr2"]); // "I'm extended attribute 2!"

Alternative Designs

Reuse PaxTarEntry

We could avoid adding a new class to represent a GEA entry, and instead reuse the existing PaxTarEntry class. But there's a problem: it would be confusing to create a TarEntryType.GlobalExtendedAttributes entry, because the constructor expects an entryName argument, and in a GEA entry, the name is created internally by TarWriter: the name depends on the platforms $TmpDir, on the process ID, and the current GEA entry number, which is stored internally by TarWriter.

Having the entryName argument isn't really necessary, since we expose it in its own property with a getter and a setter. So there's a clean way of reusing PaxTarEntry if we do the following modifications:

  • None of the constructors should take a entryName as an argument, and the user should set it manually later. If the user attempts to pass an entry without a name to TarWriter.WriteEntry, an exception is thrown, except if the entry is of TarEntryType.GlobalExtendedAttributes, because TarWriter is in charge of writing the name.
  • The PaxTarEntry.Name field would throw if the user attempts to set it on a GEA entry.
   public sealed partial class GnuTarEntry : System.Formats.Tar.PosixTarEntry
    {
-        public GnuTarEntry(System.Formats.Tar.TarEntryType entryType, string entryName) { }
+        public GnuTarEntry(System.Formats.Tar.TarEntryType entryType) { }
    }
    public sealed partial class PaxTarEntry : System.Formats.Tar.PosixTarEntry
    {
-        public PaxTarEntry(System.Formats.Tar.TarEntryType entryType, string entryName) { }
+        public PaxTarEntry(System.Formats.Tar.TarEntryType entryType) { }
-        public PaxTarEntry(System.Formats.Tar.TarEntryType entryType, string entryName, System.Collections.Generic.IEnumerable<System.Collections.Generic.KeyValuePair<string, string>> extendedAttributes) { }
+        public PaxTarEntry(System.Formats.Tar.TarEntryType entryType, System.Collections.Generic.IEnumerable<System.Collections.Generic.KeyValuePair<string, string>> extendedAttributes) { }
    }
    public sealed partial class UstarTarEntry : System.Formats.Tar.PosixTarEntry
    {
-       public UstarTarEntry(System.Formats.Tar.TarEntryType entryType, string entryName) { }
+       public UstarTarEntry(System.Formats.Tar.TarEntryType entryType) { }
    }
    public sealed partial class V7TarEntry : System.Formats.Tar.TarEntry
    {
-       public V7TarEntry(System.Formats.Tar.TarEntryType entryType, string entryName) { }
+       public V7TarEntry(System.Formats.Tar.TarEntryType entryType) { }
    }

Discarded design

I also considered avoiding adding a new entry type, and instead write the dictionary directly:

Dictionary<string, string> attributes = new();
attributes["hello"] = "world";
writer.WriteGlobalExtendedAttributes(attributes);

But then how should we give the user the GEA entries in a reader? GetNextEntry returns a TarEntry. Having a dictionary of dictionaries that could hold the GEA dictionaries would be too messy and confusing.

Risks

Low. The APIs are new in 7.0, we are on time to improve them.

@bartonjs @jeffhandley @adamsitnik @jozkee @tmds

@carlossanlop carlossanlop added api-suggestion Early API idea and discussion, it is NOT ready for implementation area-System.IO labels May 28, 2022
@carlossanlop carlossanlop added this to the 7.0.0 milestone May 28, 2022
@carlossanlop carlossanlop self-assigned this May 28, 2022
@ghost
Copy link

ghost commented May 28, 2022

Tagging subscribers to this area: @dotnet/area-system-io
See info in area-owners.md if you want to be subscribed.

Issue Details

Background and motivation

In the initial Tar proposal, the assumption was that the PAX format allowed one special entry known as Global Extended Attributes (GEA from now on), at the beginning of the archive, to allow overriding the extended attributes all the subsequent entries in the archive. This was incorrect.

The FreeBSD tar spec does not explain much about the GEA entries, except that they exist.

Both the OpenGroup pax manual and the GNU tar manual explain the format of the name in the GEA entry, which is $TMPDIR/GlobalHead.%p.%n, and describes the suffix number as:

  %n                 An integer that represents the
                        sequence number of the global extended
                        header record in the archive, starting
                        at 1.

But there is no mention of when to expect more than 1 entry and what they mean.

Then I recently found this spec: IBM z/OS 2.5.0 pax manual, which has a clear and detailed description of how the GEA entry should work:

g
Represents global extended header records for the following files in the archive.
[...]
Each value shall affect all subsequent files that do not override that value in their own extended header record and until another global extended header record is reached that provides another value for the same field.

API Proposal

New:

namespace System.Formats.Tar
{
    public sealed partial class PaxGlobalExtendedAttributesTarEntry : System.Formats.Tar.PosixTarEntry
    {
        public PaxGlobalExtendedAttributesTarEntry(System.Collections.Generic.IEnumerable<System.Collections.Generic.KeyValuePair<string, string>> globalExtendedAttributes) { }
        public System.Collections.Generic.IReadOnlyDictionary<string, string> GlobalExtendedAttributes { get { throw null; } }
    }

Remove:

    public sealed partial class TarReader : System.IDisposable
    {
-        public System.Collections.Generic.IReadOnlyDictionary<string, string>? GlobalExtendedAttributes { get { throw null; } }
    }

Modify:

    public sealed partial class TarWriter : System.IDisposable
    {
-        public TarWriter(System.IO.Stream archiveStream, System.Collections.Generic.IEnumerable<System.Collections.Generic.KeyValuePair<string, string>>? globalExtendedAttributes = null, bool leaveOpen = false) { }
+        public TarWriter(System.IO.Stream archiveStream, bool leaveOpen = false) { }
    }

API Usage

Before

We could only add one GEA entry:

// Write
Dictionary<string, string> attributes = new();
attributes["SomeAttributeName"] = "I'm an extended attribute!";

using MemoryStream ms = new();
using TarWriter writer = new(ms, attributes, leaveOpen: true)
{
   // Add some more entries if desired
}

ms.Position = 0;
using TarReader reader = new(ms, leaveOpen: false);
reader.GetNextEntry(); // Advance the reader to detect format and GEA entry
// Access the values of the single GEA
Console.WriteLine(reader.GlobalExtendedAttributes["SomeAttributeName"]); // "I'm an extended attribute!"

After

With this proposed change, we could now add multiple GEA entries:

// Write
Dictionary<string, string> attributes1 = new();
attributes["attr1"] = "I'm extended attribute 1!";
PaxGlobalExtendedAttributesTarEntry gea1 = new(attributes1);

Dictionary<string, string> attributes2 = new();
attributes["attr2"] = "I'm extended attribute 2!";
PaxGlobalExtendedAttributesTarEntry gea2 = new(attributes2);

using MemoryStream ms = new();
using TarWriter writer = new(ms, leaveOpen: true) // Default format is PAX for this constructor
{
  writer.WriteEntry(gea1);
  // Add some more entries if desired, they'll be affected by gea1
  writer.WriteEntry(gea2);
  // Add some more entries if desired, they'll be affected by gea2
}

ms.Position = 0;
using TarReader reader = new(ms, leaveOpen: false);
PaxGlobalExtendedAttributesTarEntry readGea1 = reader.GetNextEntry() as PaxGlobalExtendedAttributesTarEntry;
Console.WriteLine(readGea1.GlobalExtendedAttributes["attr1"]); // "I'm extended attribute 1!"

// Multiple calls of GetNextEntry for the other entries, until reaching the next GEA entry

PaxGlobalExtendedAttributesTarEntry readGea2 = reader.GetNextEntry() as PaxGlobalExtendedAttributesTarEntry;
Console.WriteLine(readGea2.GlobalExtendedAttributes["attr2"]); // "I'm extended attribute 2!"

Alternative Designs

Reuse PaxTarEntry

We could avoid adding a new class to represent a GEA entry, and instead reuse the existing PaxTarEntry class. But there's a problem: it would be confusing to create a TarEntryType.GlobalExtendedAttributes entry, because the constructor expects an entryName argument, and in a GEA entry, the name is created internally by TarWriter: the name depends on the platforms $TmpDir, on the process ID, and the current GEA entry number, which is stored internally by TarWriter.

Having the entryName argument isn't really necessary, since we expose it in its own property with a getter and a setter. So there's a clean way of reusing PaxTarEntry if we do the following modifications:

  • None of the constructors should take a entryName as an argument, and the user should set it manually later. If the user attempts to pass an entry without a name to TarWriter.WriteEntry, an exception is thrown, except if the entry is of TarEntryType.GlobalExtendedAttributes, because TarWriter is in charge of writing the name.
  • The PaxTarEntry.Name field would throw if the user attempts to set it on a GEA entry.
   public sealed partial class GnuTarEntry : System.Formats.Tar.PosixTarEntry
    {
-        public GnuTarEntry(System.Formats.Tar.TarEntryType entryType, string entryName) { }
+        public GnuTarEntry(System.Formats.Tar.TarEntryType entryType) { }
    }
    public sealed partial class PaxTarEntry : System.Formats.Tar.PosixTarEntry
    {
-        public PaxTarEntry(System.Formats.Tar.TarEntryType entryType, string entryName) { }
+        public PaxTarEntry(System.Formats.Tar.TarEntryType entryType) { }
-        public PaxTarEntry(System.Formats.Tar.TarEntryType entryType, string entryName, System.Collections.Generic.IEnumerable<System.Collections.Generic.KeyValuePair<string, string>> extendedAttributes) { }
+        public PaxTarEntry(System.Formats.Tar.TarEntryType entryType, System.Collections.Generic.IEnumerable<System.Collections.Generic.KeyValuePair<string, string>> extendedAttributes) { }
    }
    public sealed partial class UstarTarEntry : System.Formats.Tar.PosixTarEntry
    {
-       public UstarTarEntry(System.Formats.Tar.TarEntryType entryType, string entryName) { }
+       public UstarTarEntry(System.Formats.Tar.TarEntryType entryType) { }
    }
    public sealed partial class V7TarEntry : System.Formats.Tar.TarEntry
    {
-       public V7TarEntry(System.Formats.Tar.TarEntryType entryType, string entryName) { }
+       public V7TarEntry(System.Formats.Tar.TarEntryType entryType) { }
    }

Discarded design

I also considered avoiding adding a new entry type, and instead write the dictionary directly:

Dictionary<string, string> attributes = new();
attributes["hello"] = "world";
writer.WriteGlobalExtendedAttributes(attributes);

But then how should we give the user the GEA entries in a reader? GetNextEntry returns a TarEntry. Having a dictionary of dictionaries that could hold the GEA dictionaries would be too messy and confusing.

Risks

Low. The APIs are new in 7.0, we are on time to improve them.

@bartonjs @jeffhandley @adamsitnik @jozkee @tmds

Author: carlossanlop
Assignees: carlossanlop
Labels:

api-suggestion, area-System.IO

Milestone: 7.0.0

@carlossanlop
Copy link
Member Author

Had a short conversation with @bartonjs offline about this, the main proposal makes more sense than the alternative. Marking this as ready.

@carlossanlop carlossanlop added blocking Marks issues that we want to fast track in order to unblock other important work api-ready-for-review API is ready for review, it is NOT ready for implementation and removed api-suggestion Early API idea and discussion, it is NOT ready for implementation labels May 31, 2022
@bartonjs
Copy link
Member

bartonjs commented May 31, 2022

Video

  • The new PaxGlobalExtendedAttributesTarEntry looks good as proposed
  • Combined with the previous update that moved the TarFormat property from TarReader to TarEntry we made some systematic updates while we were here
    • Instead of the TarWriter doing automatic conversion of any TarEntry into the ctor-provided TarFormat, conversion must be done explicitly by the caller (new conversion constructors), and the TarWriter will just write whatever it's given.
    • TarFormat is now TarEntryFormat since it only applies to entries
    • We kept TarWriter.Format and the constructor parameter, because they control the format for the TarWriter.Write(string, string) method.
      • But we cleaned up the constructors+defaults to take this global extended attributes change into account.
namespace System.Formats.Tar
{
    public sealed partial class PaxGlobalExtendedAttributesTarEntry : System.Formats.Tar.PosixTarEntry
    {
        public PaxGlobalExtendedAttributesTarEntry(System.Collections.Generic.IEnumerable<System.Collections.Generic.KeyValuePair<string, string>> globalExtendedAttributes) { }
        public System.Collections.Generic.IReadOnlyDictionary<string, string> GlobalExtendedAttributes { get { throw null; } }
    }

    // Conversion constructors.  They move the DataStream value from other to [new].
    public partial class GnuTarEntry
    {
        public GnuTarEntry(TarEntry other);
    }
    public partial class PaxTarEntry
    {
        public PaxTarEntry(TarEntry other);
    }
    public partial class UstarTarEntry
    {
        public UstarTarEntry(TarEntry other);
    }
    public partial class V7TarEntry
    {
        public V7TarEntry(TarEntry other);
    }
}
namespace System.Formats.Tar
{
    public sealed partial class TarReader : System.IDisposable
    {
-        public System.Collections.Generic.IReadOnlyDictionary<string, string>? GlobalExtendedAttributes { get { throw null; } }
    }

    public sealed partial class TarWriter : System.IDisposable
    {
-        public TarWriter(System.IO.Stream archiveStream, System.Collections.Generic.IEnumerable<System.Collections.Generic.KeyValuePair<string, string>>? globalExtendedAttributes = null, bool leaveOpen = false) { }
+        public TarWriter(System.IO.Stream archiveStream) {}

-        public TarWriter(Stream archiveStream, TarEntryFormat archiveFormat, bool leaveOpen = false);
+        public TarWriter(Stream archiveStream, TarEntryFormat archiveFormat = TarEntryFormat.Pax, bool leaveOpen = false);
    }

-   public partial enum TarFormat { ... }
+   public partial enum TarEntryFormat { ... }
}

@bartonjs bartonjs added api-approved API was approved in API review, it can be implemented and removed blocking Marks issues that we want to fast track in order to unblock other important work api-ready-for-review API is ready for review, it is NOT ready for implementation labels May 31, 2022
@dersia
Copy link

dersia commented May 31, 2022

Shouldn't the default to TarEntryFormat.Pax better be on TarWriter.WriteEntry(string fileName, string entryName) since it is only used there now?
This way the constructor on TarWriter will also be much easier.
And also since the TarFormatEntry is now setable on each TarEntry this would allow setting the Format for string based entries too and not have one for all on the writer.

So it should look like this:

namespace System.Formats.Tar
{
    public sealed partial class TarReader : System.IDisposable
    {
-        public System.Collections.Generic.IReadOnlyDictionary<string, string>? GlobalExtendedAttributes { get { throw null; } }
    }

    public sealed partial class TarWriter : System.IDisposable
    {
-        public TarWriter(System.IO.Stream archiveStream, System.Collections.Generic.IEnumerable<System.Collections.Generic.KeyValuePair<string, string>>? globalExtendedAttributes = null, bool leaveOpen = false) { }
-        public TarWriter(Stream archiveStream, TarEntryFormat archiveFormat, bool leaveOpen = false);
+        public TarWriter(Stream archiveStream, bool leaveOpen = false);

-        public void WriteEntry(string fileName, string entryName);
+        public void WriteEntry(string fileName, string entryName, TarEntryFormat archiveFormat = TarEntryFormat.Pax); 
    }

-   public partial enum TarFormat { ... }
+   public partial enum TarEntryFormat { ... }
} 

@carlossanlop
Copy link
Member Author

@dersia:

Shouldn't the default to TarEntryFormat.Pax better be on TarWriter.WriteEntry(string fileName, string entryName) since it is only used there now?

No, we are going to default to writing fileName-based entries to what was defined in the TarWriter constructor:

using TarWriter writer = new(archiveStream, TarEntryFormat.Pax, leaveOpen: false);

// A) Saves in the default format specified in the writer constructor
writer.WriteEntry("path/to/file.txt", "entryName.txt");

// B) If the user wants to write entries in different formats (extremely rare, but valid scenario), they'll need to do this:
GnuTarEntry gnuEntry = new(TarEntryType.RegularFile, "entryName2.txt");

using FileStream fs = File.OpenRead("path/to/file2.txt");
gnuEntry.DataStream = fs;

writer.WriteEntry(gnuEntry); // Written in gnu format

And also since the TarFormatEntry is now setable on each TarEntry

TarEntryFormat is not setable. It matches the class that was used to construct the entry: #69544 (comment)

GnuTarEntry entry = new(TarEntryType.RegularFile, "entryName.txt");
Console.WriteLine(entry.Format); // Getter only, prints 'Gnu'

this would allow setting the Format for string based entries too and not have one for all on the writer.

See comment B in my first code example.

@ghost ghost added in-pr There is an active PR which will close this issue when it is merged and removed in-pr There is an active PR which will close this issue when it is merged labels Jun 1, 2022
@carlossanlop
Copy link
Member Author

TarEntryFormat, Conversion and GEA changes have been fixed.

@ghost ghost locked as resolved and limited conversation to collaborators Jul 22, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
api-approved API was approved in API review, it can be implemented area-System.Formats.Tar
Projects
None yet
4 participants