compression on blob export should be an array #2347

Open
tonistiigi opened this issue Sep 6, 2021 · 6 comments

@tonistiigi
Member

There are currently two fields that control the compression of exported blobs: compression and force-compression=bool.

The first should be an array instead. The compression field checks whether the existing compression appears in the array; if it does not, a new blob is created with the compression of the first element in the array. If force-compression is false, the effective compression array is compression,<any>.

The new default parameters for images would be compression=gzip;uncompressed,force-compression=true. Notice there is no zstd: at the moment we don't want to push zstd unless the user knows what they are doing, due to limited support. If the user sets the compression field, then the default force-compression is false, as it is today.

This makes it possible for a remote cache backend to start using compression=zstd;gzip;uncompressed,force-compression=true as its default and get faster compression for the cache. Should the same blob later be exported to an image, it gets converted to gzip. We might also want to change the remote cache to allow exporting blobs with multiple compressions at the same time.

Are there better ways to achieve these use cases and start using zstd while not breaking everyone else? Notice that -o compression=gzip,force-compression=false would still get zstd blobs if they already exist, but there is little reason to use it instead of the default if you don't want this behavior.
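
To make the rule concrete, something like this (just a sketch; the names are illustrative, not actual BuildKit API, and it only covers blobs that already exist in some compression):

package main

import "fmt"

// Illustrative compression types; BuildKit's real ones live in util/compression.
type Type string

const (
	Uncompressed Type = "uncompressed"
	Gzip         Type = "gzip"
	Zstd         Type = "zstd"
)

// pickCompression applies the rule above to a blob that already exists with
// some compression: reuse it if its compression is in the allowed array (or
// if force-compression is false, since the effective array is then
// compression,<any>); otherwise convert to the first, preferred element.
func pickCompression(existing Type, allowed []Type, force bool) (target Type, convert bool) {
	for _, t := range allowed {
		if t == existing {
			return existing, false
		}
	}
	if !force {
		return existing, false
	}
	return allowed[0], true
}

func main() {
	// Proposed image default: compression=gzip;uncompressed,force-compression=true.
	// An existing zstd blob gets converted to gzip on image export.
	fmt.Println(pickCompression(Zstd, []Type{Gzip, Uncompressed}, true)) // gzip true
}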

@ktock

@tonistiigi
Member Author

While working on #2350 I had some more thoughts on this and #2346.

The current ref/remote design was very much based on the idea that there is only one blob per ref, which is obviously not true anymore, with different compressions and different blobs extracting to the same diffID. This leads to bugs, e.g. the remote cache issue I discovered in #2350, and hacky code with lots of exceptions.

As a first step, to unblock the cache issue, I think the GetRemote() signature needs to be changed to:

GetRemotes(ctx context.Context, createIfNeeded bool, compression CompressionOpt, all bool, s session.Group) ([]*solver.Remote, error)

type CompressionOpt struct {
	Type  []compression.Type // allowed compressions, in priority order
	Force bool               // convert when no allowed variant exists
	Level int                // compression level
}

So the changes from the current #2350 are that CompressionOpt.Type is an array, an all bool parameter is added, and an array of remotes is returned if all is set to true.

This enables getting all the blobs (chains) that match any of the compressions. The cache can then export all available blobs or find the one that matches an inline cache. If all=false, the first element of the array would be used. Most callers only need one result, so I think the parameter is needed to allow an early return.
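
For illustration, a caller on the cache side could look roughly like this (ref, g, and exportRemote are placeholders, not real API; zstd in compression.Type assumes the support from #2350):

opt := CompressionOpt{
	Type:  []compression.Type{compression.Zstd, compression.Gzip, compression.Uncompressed},
	Force: true,
}
// all=true: return every blob chain matching one of the allowed
// compressions, so the cache can export all available variants or pick
// the one matching an inline-cache descriptor.
remotes, err := ref.GetRemotes(ctx, true /* createIfNeeded */, opt, true /* all */, g)
if err != nil {
	return err
}
for _, remote := range remotes {
	exportRemote(remote) // hypothetical helper; with all=false only remotes[0] exists
}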

Phase 2

It would be better to get rid of the current label-based linking between blobs. I explained some of the current problems in #2346. One way would be for every ref to create an additional "blob lease" where the blob is put. There is no "main blob" or ref.getBlob(). When a new compression variant appears, it is just put into the same blob lease. When a new pull or diff results in an already-existing diffID, the ref just gets the same blob-lease ID. When we look up a blob for a compression, we just run lease.ListResources. While a ref is active, all the blobs are kept as well. Deleting blobs decreases the reference count for the blob lease.
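
A rough sketch of that bookkeeping using containerd's leases API (github.com/containerd/containerd/leases); how the lease ID gets attached to refs is hypothetical:

// addBlobVariant puts a new compression variant into the ref's shared blob
// lease instead of linking blobs together via labels.
func addBlobVariant(ctx context.Context, lm leases.Manager, blobLeaseID string, dgst digest.Digest) error {
	return lm.AddResource(ctx, leases.Lease{ID: blobLeaseID}, leases.Resource{
		ID:   dgst.String(),
		Type: "content", // content-store blobs are lease resources of type "content"
	})
}

// blobsForLease lists everything held by the blob lease; the caller then
// matches the returned resources against the wanted compression.
func blobsForLease(ctx context.Context, lm leases.Manager, blobLeaseID string) ([]leases.Resource, error) {
	return lm.ListResources(ctx, leases.Lease{ID: blobLeaseID})
}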

I think we can also get rid of the blobchainid etc. I don't think we need a migration for this: we can just check both formats in GetByBlob and only set the ID without the blob ID, deprecating the old format, since the new and old formats never collide anyway.

wdyt? @ktock @sipsma

@ktock
Collaborator

ktock commented Sep 9, 2021

Overall SGTM. Some minor questions:

  • What is the difference between (all, force) = (true, false) and (true, true)? Does (true, false) allow blobs of non-target compression types to be mixed into the returned *solver.Remote?
  • How will it work with DescHandlers?

@tonistiigi
Member Author

What is the difference between (all, force) = (true, false) and (true, true)? Does (true, false) allow blobs of non-target compression types to be mixed into the returned *solver.Remote?

Both can return mixed results. Mixed results can be avoided if compression is a single-element array and force==true. With all it can return multiple instances of Remote; otherwise only one. If force is true, then all the returned remotes need to have a compression from the allowed compressions array.
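
Illustratively, the force check amounts to a filter like this (allDescriptorsIn is a hypothetical helper that checks every layer descriptor of a remote):

func filterRemotes(remotes []*solver.Remote, allowed []compression.Type, force bool) []*solver.Remote {
	if !force {
		return remotes // mixed compressions are acceptable
	}
	var out []*solver.Remote
	for _, r := range remotes {
		// force==true: keep only remotes whose every layer uses an
		// allowed compression.
		if allDescriptorsIn(r, allowed) {
			out = append(out, r)
		}
	}
	return out
}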

Question: how do multiple instances of remotes work with parent chains? It tries to find a parent with the same compression and, if it can't, takes the next based on priority order?

How will it work with DescHandlers?

Not sure. How are multiple DescHandlers handled today? I assume the case you have in mind is one where one handler provides blobs for one digest, and another for a different compression.

I think we can also get rid of the blobchainid etc.

Actually, I'm not that sure about it anymore. E.g. if an image is pulled and layers are put on top of it, then even if another blob for the same diffID existed, we should make sure that the blob that was pulled gets uploaded, not another blob that extracts to the same diffID. At the same time, diff should always reuse and reference-count existing blobs if the diffID matches.

@ktock
Collaborator

ktock commented Sep 9, 2021

It tries to find a parent with the same compression and, if it can't, takes the next based on priority order?

SGTM.

How will it work with DescHandlers?

Not sure. How are multiple DescHandlers handled today? I assume the case you have in mind is one where one handler provides blobs for one digest, and another for a different compression.

DescHandler is keyed by the digest of the blob to be unlazied (https://github.com/moby/buildkit/blob/master/cache/blobs.go#L339), so if we want to get rid of the "main blob" I think we need some metadata somewhere that indicates which blob needs to be pulled from the registry.
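
Roughly the current keying, plus one possible shape for that metadata (refPullInfo/pullDigest are hypothetical, not existing code):

// Current: handlers are found by the digest of the blob to be unlazied.
type DescHandlers map[digest.Digest]*DescHandler

// Hypothetical: each ref records which variant the registry actually
// serves, so unlazying knows which digest/handler to use.
type refPullInfo struct {
	pullDigest digest.Digest
}

func (r refPullInfo) handlerFor(dh DescHandlers) (*DescHandler, bool) {
	h, ok := dh[r.pullDigest]
	return h, ok
}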

@sipsma WDYT?

@sipsma
Collaborator

sipsma commented Sep 9, 2021

The idea of using leases to track blobs instead of labels SGTM.

In terms of DescHandler, my thoughts are:

  1. Currently, DescHandlers only get created in the pull code here and in the worker FromRemote code here. It makes sense that they are only set up for a single compression variant in those places, because only a single compression variant is known at those times.
  2. Additionally, I can't find any place in the cache package where it's strictly required that we get the DescHandler for a specific variant. We only ever need to get a single variant so that we can pull down the data. From there, any conversions can be performed locally. For example, in the code you linked to, @ktock, even though there wouldn't be a main blob anymore, DescHandlers would still let us unlazy one of the variants of the blob, which is all we need.
    • This works because DescHandlers are tracked per Ref, not per cacheRecord, so even though cacheRecords are de-duped by blobchainid, refs aren't and can thus track how to pull down the specific variant that they were created for in the pull code or in worker.FromRemote.

So, given the above, I can't see that it's strictly necessary to make any changes to DescHandlers; we can leave them keyed by the digest of a single compression variant's blob. Let me know if I'm missing something.

The only reason I could see for making a change is to add a new optimization where we track the providers for each compression variant as part of the cacheRecord, which would then allow us to sometimes pull down a specific variant directly rather than pulling down whatever variant the ref has a provider for and doing a local conversion. I don't know if this optimization is worth it, given that it's a fairly obscure situation and it would probably complicate the code significantly further.

@sipsma
Collaborator

sipsma commented Sep 9, 2021

Actually, I just realized that if we are making these updates to immutableRef.ociDesc(...), we will probably need to add a parameter specifying which compression variant we want, in which case we will have to grab that from the DescHandler. So there may need to be a small change to add MediaType to DescHandler, which should be easy. Other than that I don't think any modifications are strictly required.
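
Something like this, sketched against cache.DescHandler (the existing field set here is approximate; MediaType is the only new addition):

type DescHandler struct {
	Provider       func(session.Group) content.Provider
	Progress       progress.Controller
	SnapshotLabels map[string]string
	MediaType      string // new: lets immutableRef.ociDesc(...) pick the variant's media type
}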
