This repository has been archived by the owner on Jun 29, 2022. It is now read-only.

IPLD and compression #76

Closed
vmx opened this issue Nov 5, 2018 · 22 comments · Fixed by #189

Comments

@vmx
Member

vmx commented Nov 5, 2018

Compression has already come up several times. I think there should be some kind of compression for IPLD.

I can think of two different ways which serve different purposes.

On the application level

Add a new container format for compression that embeds the original one. There are several ways of doing that; one would be to add a new format per compression method, whose implementation extracts the data so you can access the underlying format. Hashes would then be those of the compressed data.

This approach suits use cases where you are mostly collecting data (e.g. from sensors) and don't look at it very often.
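For illustration, a minimal Go sketch of such a wrapper, assuming deflate as the compression method; the WrappedBlock type and its Unwrap helper are hypothetical names, not an existing IPLD codec API:

```go
// Hypothetical sketch of an application-level compression wrapper format.
// The hash/CID would be computed over the compressed bytes, as described above.
package main

import (
	"bytes"
	"compress/flate"
	"fmt"
	"io"
)

// WrappedBlock holds the deflate-compressed payload of some inner format
// (e.g. dag-cbor).
type WrappedBlock struct {
	Compressed []byte
}

// Unwrap decompresses the payload so the inner format can be decoded and
// traversed as usual.
func (b *WrappedBlock) Unwrap() ([]byte, error) {
	r := flate.NewReader(bytes.NewReader(b.Compressed))
	defer r.Close()
	return io.ReadAll(r)
}

func main() {
	// Wrap: compress the encoded inner block before hashing/storing it.
	var buf bytes.Buffer
	w, _ := flate.NewWriter(&buf, flate.BestCompression)
	w.Write([]byte("...dag-cbor encoded node bytes...")) // placeholder payload
	w.Close()

	blk := &WrappedBlock{Compressed: buf.Bytes()}
	inner, _ := blk.Unwrap()
	fmt.Printf("recovered %d inner bytes\n", len(inner))
}
```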

Deep inside IPLD/IPFS

Compression could also happen on a layer below IPLD. Then the hashes would be based on the actual (uncompressed) content. The idea I had is that you would implement it on two layers: storage and transport.

The storage layer (the repo) would have compression enabled and just compress all the data that comes in. When you retrieve the data, it would be decompressed by default, so things would work as they do today. There could also be an option to return the compressed data, which could then be passed on to the transport layer.

The transport layer takes an uncompressed block by default, compresses it for transport, and decompresses it again on the other side. There could also be an option to take the compressed data directly from the storage layer and transmit it as-is; it would then only need to be decompressed on the receiving side to verify that the hashes match.
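A rough Go sketch of the storage-layer behaviour described above, using gzip and a simplified stand-in for the blockstore interface (the Put/Get/GetCompressed names are hypothetical, not the actual go-ipfs API):

```go
// Sketch of compression below IPLD: the hash stays that of the uncompressed
// block, the repo stores compressed bytes, and the transport can either take
// the compressed bytes directly or decompress first.
package compstore

import (
	"bytes"
	"compress/gzip"
	"io"
)

type compressingStore struct {
	data map[string][]byte // keyed by CID string; stores compressed bytes
}

// Put compresses incoming block data before writing it to the repo.
func (s *compressingStore) Put(cid string, block []byte) error {
	var buf bytes.Buffer
	w := gzip.NewWriter(&buf)
	if _, err := w.Write(block); err != nil {
		return err
	}
	if err := w.Close(); err != nil {
		return err
	}
	s.data[cid] = buf.Bytes()
	return nil
}

// Get returns the uncompressed block by default, so callers see no change.
func (s *compressingStore) Get(cid string) ([]byte, error) {
	r, err := gzip.NewReader(bytes.NewReader(s.data[cid]))
	if err != nil {
		return nil, err
	}
	defer r.Close()
	return io.ReadAll(r)
}

// GetCompressed hands the stored bytes to the transport as-is; the receiver
// decompresses and verifies the hash against the uncompressed content.
func (s *compressingStore) GetCompressed(cid string) ([]byte, error) {
	return s.data[cid], nil
}
```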

@mikeal
Contributor

mikeal commented Nov 5, 2018

I've had several conversations about this as well. My current view is that we need to put some work into transport level compression and storage level compression soon. The longer we wait the more incentive there will be for people to solve this at the codec layer which we actively don't want because it will make transport and storage level compression redundant.

@rklaehn

rklaehn commented Nov 5, 2018

A few remarks from somebody who really needs this:

  • doing the compression transparently at the storage and transport layer has the disadvantage that you will often try to compress data where compression is futile. E.g. in our applications we sometimes have gigabytes of videos (which do not compress at all since they are already compressed), in addition to lots of IPLD data that compresses very well. If the store or transport layer is unaware of the nature of the data, it will spend a lot of time trying in vain to compress video data.

  • storage compression and transport compression should be the same. Decompressing data from storage and then compressing it again would be horribly inefficient.

  • having the hash depend on the compression is not great, but also not the end of the world. If you encode in dag-json or dag-cbor you also get different hashes for the same data.

So a quick and dirty solution would be to just have another multiformats format like cbor-deflate. But then of course you have an explosion of formats for different encodings x different compression algorithms.

A better solution might be: when you store something using the block api, you can hint whether compression is likely to be useful. This is so you can disable compression when storing something that is already compressed, like images and videos. You get the same hash regardless of which, if any, compression is used.

For storage and transfer, you offer and accept the data for a hash in different representations. Then both participants of an exchange figure out the best exchange format and use that. E.g. A offers Qm1234 in the formats uncompressed, deflate, and zstd; B accepts uncompressed and deflate; so they settle on deflate and transfer the raw deflated bytes exactly as they are in storage. This would work a bit like content negotiation in HTTP. Old nodes only understand uncompressed, so they don't benefit from compression but still work.
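For illustration, the negotiation itself could be as simple as picking the most preferred representation both sides support; the preference order and encoding names below are assumptions, not part of any existing protocol:

```go
// Sketch of the HTTP-style negotiation described above: the sender offers the
// representations it has on disk, the receiver states what it accepts, and
// both settle on the best common one.
package negotiate

// Preference from most to least desirable; "uncompressed" is the fallback
// every node understands, so old nodes keep working.
var preference = []string{"zstd", "deflate", "uncompressed"}

// Pick returns the best representation both sides support.
func Pick(offered, accepted []string) string {
	has := func(list []string, v string) bool {
		for _, x := range list {
			if x == v {
				return true
			}
		}
		return false
	}
	for _, enc := range preference {
		if has(offered, enc) && has(accepted, enc) {
			return enc
		}
	}
	return "uncompressed"
}

// Example: Pick([]string{"uncompressed", "deflate", "zstd"},
//               []string{"uncompressed", "deflate"}) == "deflate"
```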

However, doing this correctly, with no unnecessary transcoding between storage and transmission, will probably require quite a bit of change in many layers, whereas just adding cbor-deflate to https://github.com/multiformats/multicodec/blob/master/table.csv and implementing it is a very small and isolated code change. So it might be better to do this first before opening up the huge can of worms that is doing this correctly...

@mikeal
Contributor

mikeal commented Nov 5, 2018

doing the compression transparently at the storage and transport layer has the disadvantage that you will often try to compress data where compression is futile.

Compression at the storage layer should be configurable. Futile transport layer compression is pretty common; the vast majority of HTTP traffic is gzip compressed even when it's video and image content :/ Also, even if the binary data in blocks is already compressed there's plenty of other stuff going on in the transport that will benefit from compression.

@rklaehn

rklaehn commented Nov 5, 2018

doing the compression transparently at the storage and transport layer has the disadvantage that you will often try to compress data where compression is futile.

Compression at the storage layer should be configurable. Futile transport layer compression is pretty common; the vast majority of HTTP traffic is gzip compressed even when it's video and image content :/

I am aware of that, but gzip is not going to be able to extract any redundancy out of a chunk of an mp4 file, so why do it?

Also, even if the binary data in blocks is already compressed there's plenty of other stuff going on in the transport that will benefit from compression.

Sure. But I think the benefit of sending bytes over the wire exactly as they are in storage probably outweighs the advantage of saving a few bytes by compressing the protocol overhead. Depends on the exact use case, of course.

@mikeal
Contributor

mikeal commented Nov 5, 2018

I am aware of that, but gzip is not going to be able to extract any redundancy out of a chunk of an mp4 file, so why do it?
Sure. But I think the benefit of sending bytes over the wire exactly as they are in storage probably outweighs the advantage of saving a few bytes by compressing the protocol overhead. Depends on the exact use case, of course.

The answer to both of these is the same: because it's very easy to just compress the entire transport stream. It would be much much harder to compress only parts of the transport stream and to negotiate each part by pulling extra metadata out of the storage format.
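For illustration, this is roughly what "compress the entire transport stream" amounts to: wrapping the connection's reader and writer in deflate, independent of what the blocks contain. A minimal sketch using the Go standard library; the stream type and wrap helper are hypothetical, not a libp2p API:

```go
// Whole-stream compression: everything written is deflated and everything
// read is inflated, regardless of content. Error handling and flushing for
// interactive use are omitted for brevity.
package streamcomp

import (
	"compress/flate"
	"io"
)

// compressedStream wraps any duplex stream (e.g. a network connection) with
// transparent deflate compression in both directions.
type compressedStream struct {
	io.Reader
	*flate.Writer
}

func wrap(rw io.ReadWriter) (*compressedStream, error) {
	w, err := flate.NewWriter(rw, flate.DefaultCompression)
	if err != nil {
		return nil, err
	}
	return &compressedStream{
		Reader: flate.NewReader(rw),
		Writer: w,
	}, nil
}
```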

@Stebalien
Contributor

So, the key issue here is that compressing before hashing causes some significant issues with deduplication. Otherwise, our options are effectively equivalent in terms of performance at the end of the day; they just move complexity around.

Even if we compress at the transport/storage layer, we can still have smart transports and smart storage systems that:

  • Compress based on types/application hints (we can feed this information through from the application).
  • Compress based on built-in heuristics (even ML). It shouldn't be that expensive to check for compressibility before compressing with an expensive compression algorithm (a cheap probe is sketched after this list).
  • Pass compressed objects through directly from the transport to the storage layer. This does require a smarter transport, but that's something that can be negotiated.
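As an illustration of the heuristics point above, a cheap compressibility probe could look like this; the sample size and savings threshold are arbitrary assumptions, not values from any implementation:

```go
// Probe a small sample of a block with a fast deflate level and only hand the
// block to an expensive compressor if the sample shrinks noticeably.
package heuristic

import (
	"bytes"
	"compress/flate"
)

// WorthCompressing reports whether block looks compressible enough to be
// worth running a slower, stronger compressor on.
func WorthCompressing(block []byte) bool {
	sample := block
	if len(sample) > 4096 {
		sample = sample[:4096] // probe only the first 4 KiB
	}
	var buf bytes.Buffer
	w, _ := flate.NewWriter(&buf, flate.BestSpeed)
	w.Write(sample)
	w.Close()
	// Require at least ~10% savings on the sample before bothering.
	return buf.Len() < len(sample)*9/10
}
```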

@rklaehn

rklaehn commented Nov 5, 2018

I am aware of that, but gzip is not going to be able to extract any redundancy out of a chunk of an mp4 file, so why do it?
Sure. But I think the benefit of sending bytes over the wire exactly as they are in storage probably outweighs the advantage of saving a few bytes by compressing the protocol overhead. Depends on the exact use case, of course.

The answer to both of these is the same: because it's very easy to just compress the entire transport stream. It would be much much harder to compress only parts of the transport stream and to negotiate each part by pulling extra metadata out of the storage format.

I agree about this being easier. You can just pipe the entire transport stream through inflate/deflate. But I would think that doing so for data that cannot be reasonably compressed, such as video data, would slow things down and/or increase energy consumption for no good reason. This will matter especially when running on low-powered devices.

You will have a tradeoff between using a fast but less space-efficient compression algorithm like LZ4 and a slow but space-efficient one like bzip2. You will also have an overhead for the compression/decompression state of each connection. And all that might be for nothing if the majority of the content that is being exchanged is already compressed (videos, images etc.), which might frequently be the case.

If you limit your efforts to data where you expect a large gain from compression, such as IPLD data encoded as CBOR, and make sure not to transcode between storage and transport, things should be more efficient. I think this can still be done in a way that keeps the storage and transport layers generic.

@rklaehn

rklaehn commented Nov 5, 2018

So, the key issue here is that compressing before hashing causes some significant issues with deduplication.

Agree with most of what you say. But personally I don't care that much about deduplication not working due to different multiformats encodings.

Let's say I have an application that adds some data in a hypothetical multiformats format cbor-deflate. My application is always going to use cbor-deflate, and the likelihood of somebody else adding the exact same data is essentially zero. So the only downside would be that I would have to re-encode my IPLD when switching to, say cbor-zstd. That's a price I would be willing to pay for getting compression soon.

What I do care about most is that the data is still IPLD data that can be traversed and inspected, instead of essentially a binary blob (which is what I have implemented now because I really need compression for my use case).

@vmx
Member Author

vmx commented Nov 6, 2018

So a quick and dirty solution would be to just have another multiformats format like cbor-deflate. But then of course you have an explosion of formats for different encodings x different compression algorithms.

That's the reason why I didn't bring that up as a possible solution. I'd rather go with the format wrapping, so you only have one new format per compression algorithm. You'd also get inspection/traversing capabilities from IPLD.

@vmx
Member Author

vmx commented Nov 6, 2018

  • Compress based on types/application hints (we can feed this information through from the application).

  • Compress based on built-in heuristics (even ML). It shouldn't be that expensive to check for compressibility before compressing with an expensive compression algorithm.

This sounds nice, but if I think about the implementation, I fear it'll make things quite complicated/error-prone, as you somehow need to keep track of which items are compressed (and keep that in sync). Though we can always start simple and go from there.

@Stebalien
Contributor

Though we can always start simple and go from there.

Definitely. Really, I'd hide this behind types. That is, a "smart" transport would produce normal Block (or Node) objects; however, one would be able to cast to a CompressedNode interface to extract the compressed data without re-compressing.
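A rough Go sketch of what that could look like; the Node/CompressedNode interfaces here are illustrative only, not an existing go-ipld API:

```go
// Hiding compression behind types: everything produces plain nodes, but a
// "smart" transport or store can type-assert to get at the compressed bytes
// without re-compressing.
package smartblock

// Node is the ordinary interface everyone sees.
type Node interface {
	RawData() []byte // uncompressed bytes
	Cid() string
}

// CompressedNode is the optional extra interface a smart implementation can
// satisfy.
type CompressedNode interface {
	Node
	CompressedData() []byte // bytes exactly as stored / received
	Compression() string    // e.g. "zstd", "deflate"
}

// passThrough shows how a transport might use the cast.
func passThrough(n Node) ([]byte, string) {
	if cn, ok := n.(CompressedNode); ok {
		// Send the already-compressed bytes as-is.
		return cn.CompressedData(), cn.Compression()
	}
	// Fall back to the uncompressed representation.
	return n.RawData(), "uncompressed"
}
```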

@mikeal
Contributor

mikeal commented Dec 3, 2018

I just spent about half a day running a simulation for npm-on-ipfs. ipfs-shipyard/ipfs-npm-registry-mirror#6

TL;DR: A use case practically designed to show the benefits of de-duplication doesn't even touch the performance gains of just compressing an entire archive on every release and not de-duplicating anything.

I'm still processing the results of this investigation but it's starting to make me think that we might be focusing on the wrong performance cases in IPLD/IPFS across the board. Compression is incredibly powerful and I'm starting to get skeptical of any numbers we're putting together without transport or storage compression in place. Reducing round trips is going to help out no matter what, but compression is probably the biggest thing we could do to impact performance of most use cases.

@Stebalien
Contributor

focusing on the wrong performance cases in IPLD/IPFS across the board

If we were bandwidth-constrained, that would be the issue. However, I've never seen a bandwidth-constrained test except on localhost.

I do agree that we need to deal with this ASAP.

@mikeal
Contributor

mikeal commented Dec 4, 2018

If we were bandwidth-constrained, that would be the issue. However, I've never seen a bandwidth-constrained test except on localhost.

Not necessarily.

  • The faster we get early file resources, the quicker we can make subsequent requests in parallel (with each layer of the graph we traverse, we gain parallelism). Even if we haven't saturated the connection, compressing the resources will speed them up, which will then reduce the time it takes for us to actually saturate the connection.
  • Depending on the congestion control algorithm, early requests are much slower than subsequent requests. TCP (and TFRC when using UDP) use loss-based congestion control algorithms that ramp up, increasing the send rate until they see loss. Because we connect to multiple peers, we hit this in every connection, and we hit it hardest in the initial stages of a connection, unless we're using an alternative congestion control algorithm I don't know about.
  • Mobile packet loss plays havoc with these algorithms, and mobile infrastructure tries to compensate by keeping a buffer in the network layer. In the long run this helps, but it tends to make the initial connection speed fluctuate with spikes up and down, and sending larger amounts of data before this normalizes tends to make it worse.

@Stebalien
Contributor

WRT packet loss, a large issue there is that go-ipfs currently sends out way too many packets (we need to buffer better).

WRT compression, I'd be surprised if intermediate nodes were all that compressible. They tend to mostly be composed of hashes.

vmx added a commit that referenced this issue Sep 5, 2019

This report was originally issue #76.

Closes #76.

@vmx vmx closed this as completed in #189 Sep 9, 2019
vmx added a commit that referenced this issue Sep 9, 2019
rvagg pushed a commit that referenced this issue Sep 10, 2019
@RubenKelevra

RubenKelevra commented Mar 31, 2020

I'm sorry I obviously overlooked this ticket when I created ipld/legacy-unixfs-v2#34.

My idea was mentioned (partly) before:

  • hash the content
  • compress it on the datastore level
  • if a node can accept transport-level compression, send the compressed data
  • if the node cannot accept transport-level compression, decompress first

Security issues with transport compression

Compression used inside an encrypted connection is prone to be a security issue; see the BREACH and CRIME exploits. If you want to use transport compression, it needs to be carefully designed so that it doesn't leak private content.

This isn't an issue when we take the already-compressed data from storage and send it through a connection, or decompress the data (which we have stored compressed) and send it through a connection.

My thoughts on a possible implementation

The only addition would be the implementation details:

There are three types of compression:

  • With a predefined dictionary
  • With a dictionary built at compression time
  • With a dictionary built before compression time

The last two apply to zstd (zstd can use/create an external dictionary); the first applies to Brotli.

Brotli and zstd

Brotli is interesting to support because it's supported by browsers, compresses pretty well, and doesn't need a dictionary inside the archive.

Zstd is extremely interesting because it can analyze a dataset, store a dictionary, and compress the dataset with it. This makes it possible to analyze, say, a large number of small HTML files, create a dictionary, store it in IPFS, and put a reference to the dictionary in the file/folder metadata.

All further writes containing HTML files could use the dictionary and therefore increase the compression/decompression speed massively:
[charts dict-cs and dict-ds: dictionary compression/decompression speed benchmarks]

Drawbacks on using a dictionary

The information about which dictionary was used can also be stored inside the zstd format. This allows us to easily look up the correct dictionary in a list of all dictionaries used on the local node.

If a remote node is requesting blocks, it needs to have a way to request the matching CID for the dictionary ID of the blocks received.

By default, the dictionary ID used on generation is a 4-byte random number, but it can also be set to a specific value. Keep in mind that this value needs to be stored in every compressed file, so it shouldn't be an IPNS hash, for example.

A dictionary ID is a locally unique ID that a decoder can use to verify it is using the right dictionary. By default, zstd will create a 4-bytes random number ID. It's possible to give a precise number instead. Short numbers have an advantage: an ID < 256 will only need 1 byte in the compressed frame header, and an ID < 65536 will only need 2 bytes. This compares favorably to 4 bytes default. However, it's up to the dictionary manager to not assign twice the same ID to 2 different dictionaries.

Drawbacks on using no dictionary

We have a lot of small chunks in IPFS, since the default maximum block size is 256 KByte. Without a dictionary, a compression algorithm needs to create a dictionary and store it inside the compressed archive for each block. This decreases the efficiency of the compression.

With an external dictionary, we can have both: good efficiency and the ability to decode each block on its own (together with the dictionary).
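For illustration, compressing and decompressing individual blocks against a shared, pre-trained dictionary might look roughly like this in Go, assuming the github.com/klauspost/compress/zstd API (WithEncoderDict/WithDecoderDicts); the dictionary itself would be trained offline (e.g. with `zstd --train`) and distributed separately:

```go
// Sketch of per-block zstd compression with an external dictionary, so no
// per-block dictionary has to be embedded in the frame.
package dictcomp

import (
	"github.com/klauspost/compress/zstd"
)

// CompressBlock compresses one block with a pre-trained dictionary.
func CompressBlock(block, dict []byte) ([]byte, error) {
	enc, err := zstd.NewWriter(nil,
		zstd.WithEncoderDict(dict),
		zstd.WithEncoderLevel(zstd.SpeedBestCompression))
	if err != nil {
		return nil, err
	}
	defer enc.Close()
	return enc.EncodeAll(block, nil), nil
}

// DecompressBlock decodes a single frame; the decoder matches the dictionary
// against the ID stored in the frame header.
func DecompressBlock(frame, dict []byte) ([]byte, error) {
	dec, err := zstd.NewReader(nil, zstd.WithDecoderDicts(dict))
	if err != nil {
		return nil, err
	}
	defer dec.Close()
	return dec.DecodeAll(frame, nil)
}
```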

Compression test

Since we care about a specific scenario I tried zstd on a datastore with various settings.

The corpus was the Linux git repository (without hidden folders/files) added with standard settings on ipfs 0.4.23 to an empty flatfs datastore. The commit was 906c40438bb669b253d0daeaf5f37a9f78a81b41.

Uncompressed 866M (with du --apparent-size -h)

Highest compression ratio with zstd (1MB dictionary): 5.931x
Highest compression ratio with zstd (110KB dictionary): 5.628x
Highest compression ratio with zstd (no dictionary): 4.731x

Apart from the quoted level 22, I also checked compression levels 3, 9, and 15. The picture is the same, except that at lower compression levels a 1 MB dictionary no longer makes sense.

Offer pretrained dictionaries

The best thing about using zstd is that we can provide pre-trained dictionaries for common applications, like compressing C code, JS code, Go code, HTML/CSS files, etc., since the file metadata would just contain a reference to the dictionary. IPFS can just use the dictionaries provided via the IPFS network, while we don't have to add them to the binaries.

Brotli

On the same dataset, Brotli achieves 5.249x at its highest setting, with obviously much slower compression and decompression speeds compared to zstd with a dictionary.

Other compression algorithms for comparison:

Highest compression ratio with bzip2: 4.595
Highest compression ratio with gzip: 4.409

@vmx
Member Author

vmx commented Apr 1, 2020

Thanks for the detailed analysis.

Custom dictionaries would be really cool, though making it actually work sounds pretty difficult as you need to transfer the information and the dict somehow.

If someone wants to prototype this, I'd probably start with no-dict or predefined-dict compression.

@RubenKelevra

I wouldn't use zstd without a dictionary at all. It would slow down operations significantly.

The easiest way to start would be to create a set of dictionaries and assign custom numbers to them. If we use IDs below 256 for this, it will increase the size of the files by just 1 byte compared to using no dictionary ID.

We could simply throw a lot of files from the web for each MIME type into the analyzer and generate up to 256 dictionaries, one per MIME type.

We would store just a list of the IDs and the corresponding CIDs in the IPFS code; the dictionaries themselves are fetched from the network when needed.

This way we can make sure all nodes use the same dictionaries, while we don't blow up the IPFS binary.
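A sketch of what such a built-in registry could look like; the IDs, CIDs, and names below are placeholders, not real dictionaries:

```go
// Fixed mapping from small dictionary IDs (as found in zstd frame headers)
// to the CIDs of trained dictionaries published on IPFS.
package dictregistry

import "fmt"

// builtinDictionaries maps a 1-byte dictionary ID to the CID of the trained
// dictionary. The entries are hypothetical placeholders.
var builtinDictionaries = map[uint8]string{
	1: "bafy...html", // dictionary trained on HTML/CSS (placeholder CID)
	2: "bafy...c",    // dictionary trained on C source (placeholder CID)
	3: "bafy...js",   // dictionary trained on JS source (placeholder CID)
}

// DictionaryCID resolves the ID found in a compressed block to the CID a node
// would fetch from the network if it doesn't have the dictionary yet.
func DictionaryCID(id uint8) (string, error) {
	cid, ok := builtinDictionaries[id]
	if !ok {
		return "", fmt.Errorf("unknown dictionary id %d", id)
	}
	return cid, nil
}
```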

Custom dictionaries created by the user, on the other hand, would need a lot of additional work on the protocol and in the definition of objects. To avoid having to specify the whole CID on each block or file, it would make sense to add an optional field to the 'folder' definition in UnixFSv2 for a dictionary ID and a corresponding CID.

But this gets pretty difficult when a file is pulled by a client that doesn't want to hold the full folder, since we need to transmit this information efficiently if the client doesn't have the dictionary already.

TL;DR: stick to 256 MIME dictionaries and sort out the mess later


checksums inside the zstd header

zstd can also store a checksum of the data inside the archive - I would deactivate this feature, since it uses space unnecessarily.

@RubenKelevra

Should I create a new ticket for this under go-ipfs?

@vmx
Member Author

vmx commented Apr 2, 2020

@RubenKelevra I don't think a new ticket is needed. You could just link from a PR to this issue.

@vmx
Member Author

vmx commented Apr 2, 2020

@RubenKelevra Though if you want to discuss Go-specific things or want to make sure such a PR would be accepted before working on it, it makes sense to open an issue there.

@RubenKelevra

In this case, this ticket should be reopened :)

I'll create a ticket just for the creation of the dictionaries; the further stuff can be discussed either here or in a new ticket, if necessary.
