Add exploration report about compression

design/history/exploration-reports/2018.11-compression.md

This document was archived from https://github.com/ipld/specs/issues/76.

#76: IPLD and compression
-------------------------
Opened 2018-11-05T16:24:20Z by vmx

Compression has already come up [several](https://github.com/ipfs/ipfs/issues/47) [times](https://discuss.ipfs.io/t/any-work-on-compression-of-ipld-data/2505). I think there should be some kind of compression for IPLD.

I can think of two different approaches, which serve different purposes.

### On the application level

Add a new container format for compression that embeds the original one. There are several ways of doing that; one would be to add a new format per compression algorithm. The format implementation would then decompress the data so that you could access the underlying format. Hashes would then be those of the compressed data.

Use cases for doing it this way are when you are mostly collecting data (e.g. from sensors) and don't look at it too often.

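A minimal Go sketch of this wrapping approach, with gzip and SHA-256 standing in for whatever compression codec and multihash an implementation would actually register; the `WrappedBlock` type and the `Wrap`/`Unwrap` helpers are hypothetical, not an existing IPLD API:

```go
package main

import (
	"bytes"
	"compress/gzip"
	"crypto/sha256"
	"fmt"
)

// WrappedBlock is a hypothetical "compression container" format: the bytes
// that get stored and hashed are the compressed payload, and the codec of
// the embedded data is carried alongside so a decoder knows which format
// implementation to hand the decompressed bytes to.
type WrappedBlock struct {
	InnerCodec string // e.g. "dag-cbor" (illustrative, not a registered code)
	Compressed []byte // gzip-compressed encoding of the inner node
}

// Wrap compresses an encoded inner node and returns the wrapper plus the
// digest that would back its CID. Note the hash covers the *compressed* bytes.
func Wrap(innerCodec string, inner []byte) (WrappedBlock, [32]byte, error) {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write(inner); err != nil {
		return WrappedBlock{}, [32]byte{}, err
	}
	if err := zw.Close(); err != nil {
		return WrappedBlock{}, [32]byte{}, err
	}
	blk := WrappedBlock{InnerCodec: innerCodec, Compressed: buf.Bytes()}
	return blk, sha256.Sum256(blk.Compressed), nil
}

// Unwrap decompresses the payload so the caller can decode the inner format.
func Unwrap(blk WrappedBlock) ([]byte, error) {
	zr, err := gzip.NewReader(bytes.NewReader(blk.Compressed))
	if err != nil {
		return nil, err
	}
	defer zr.Close()
	var out bytes.Buffer
	if _, err := out.ReadFrom(zr); err != nil {
		return nil, err
	}
	return out.Bytes(), nil
}

func main() {
	inner := []byte(`{"hello":"world"}`) // stand-in for an encoded IPLD node
	blk, digest, err := Wrap("dag-json", inner)
	if err != nil {
		panic(err)
	}
	fmt.Printf("hash of compressed data: %x\n", digest[:8])
	decoded, err := Unwrap(blk)
	if err != nil {
		panic(err)
	}
	fmt.Printf("inner codec %s, inner bytes: %s\n", blk.InnerCodec, decoded)
}
```

Because the digest covers the compressed bytes, two peers that compress the same content differently end up with different hashes, which is the deduplication trade-off discussed further down the thread.
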
### Deep inside IPLD/IPFS

Compression could also happen on a layer below IPLD. Then the hashes would be based on the actual (uncompressed) content. The idea is that you would implement it on two layers: storage and transport.

The storage layer (the repo) would have compression enabled and would simply compress all the data that comes in. When you retrieve the data, it would be decompressed by default, so things would work as they do today. There could also be an option to return the compressed data, which could then be passed on to the transport layer.

The transport layer takes an uncompressed block by default, compresses it for transport, and decompresses it again on the other side. There could also be an option for it to take the compressed data directly from the storage layer and transmit that. The data would then only need to be decompressed on the receiving side to verify that the hashes match.

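As a rough illustration of the storage-layer variant, here is a hedged Go sketch in which the hash covers the uncompressed content, compression stays inside the repo, and an optional accessor exposes the stored bytes for a transport that can ship them as-is. The `compressingStore` type and its `GetCompressed` method are invented for this sketch and are not a go-ipfs API:

```go
package main

import (
	"bytes"
	"compress/gzip"
	"crypto/sha256"
	"fmt"
)

// key is the hash of the *uncompressed* block, so CIDs stay the same
// whether or not the repo compresses.
type key [32]byte

// compressingStore keeps every block gzip-compressed (here: in a map)
// but hands back plain bytes by default, so callers work as they do today.
type compressingStore struct {
	blocks map[key][]byte // compressed bytes, keyed by uncompressed hash
}

func newStore() *compressingStore {
	return &compressingStore{blocks: make(map[key][]byte)}
}

// Put hashes the uncompressed content, then stores it compressed.
func (s *compressingStore) Put(data []byte) (key, error) {
	k := key(sha256.Sum256(data))
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write(data); err != nil {
		return key{}, err
	}
	if err := zw.Close(); err != nil {
		return key{}, err
	}
	s.blocks[k] = buf.Bytes()
	return k, nil
}

// Get decompresses, so existing callers see exactly what they stored.
func (s *compressingStore) Get(k key) ([]byte, error) {
	compressed, ok := s.blocks[k]
	if !ok {
		return nil, fmt.Errorf("block not found")
	}
	zr, err := gzip.NewReader(bytes.NewReader(compressed))
	if err != nil {
		return nil, err
	}
	defer zr.Close()
	var out bytes.Buffer
	if _, err := out.ReadFrom(zr); err != nil {
		return nil, err
	}
	return out.Bytes(), nil
}

// GetCompressed is the optional escape hatch: hand the stored bytes to a
// transport that can ship them without re-compressing.
func (s *compressingStore) GetCompressed(k key) ([]byte, bool) {
	c, ok := s.blocks[k]
	return c, ok
}

func main() {
	s := newStore()
	data := bytes.Repeat([]byte("a very repetitive dag-cbor-ish block "), 50)
	k, err := s.Put(data)
	if err != nil {
		panic(err)
	}
	plain, _ := s.Get(k)
	stored, _ := s.GetCompressed(k)
	fmt.Printf("plain %d bytes, stored %d compressed bytes\n", len(plain), len(stored))
}
```

The pass-through accessor is what would let storage and transport share one compression step instead of decompressing and re-compressing at the boundary.
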
---
#### (2018-11-05T17:27:34Z) mikeal:
I've had several conversations about this as well. My current view is that we need to put some work into transport-level compression and storage-level compression soon. The longer we wait, the more incentive there will be for people to solve this at the codec layer, which we actively don't want because it will make transport- and storage-level compression redundant.

---
#### (2018-11-05T18:59:58Z) rklaehn:
A few remarks from somebody who *really* needs this:

- doing the compression transparently at the storage and transport layer has the disadvantage that you will often try to compress data where compression is futile. E.g. in our applications we sometimes have gigabytes of videos (which do not compress at all, since they are already compressed), in addition to lots of IPLD data that compresses very well. If the store or transport layer is unaware of the nature of the data, it will spend a lot of time trying in vain to *compress video data*.

- storage compression and transport compression should be the same. Decompressing data from storage and then compressing it again would be horribly inefficient.

- having the hash depend on the compression is not great, but also not the end of the world. If you encode in dag-json or dag-cbor you also get different hashes for the same data.

So a quick and dirty solution would be to just have another multiformats format like cbor-deflate. But then of course you get an explosion of formats: different encodings × different compression algorithms.

A better solution might be: when you store something using the block API, you can hint whether compression is likely to be useful. This is so you can disable compression when storing something that is already compressed, like images and videos. You get the same hash regardless of which compression, if any, is used.

For storage and transfer, you offer and accept the data for a hash in different representations. Then both participants of an exchange figure out the best exchange format and use that. E.g. `a` offers `Qm1234` in the formats `uncompressed`, `deflate` and `zstd`; `b` accepts `uncompressed` and `deflate`; so they settle on `deflate` and transfer the raw deflated bytes exactly as they are in storage. This would work a bit like content negotiation in HTTP. Old nodes only understand `uncompressed`, so they don't benefit from compression but still work.

However, doing this correctly, with no unnecessary transcoding between storage and transmission, will probably require quite a bit of change in many layers, whereas just adding cbor-deflate to https://github.com/multiformats/multicodec/blob/master/table.csv and implementing it is a very small and isolated code change. So it might be better to do that first before opening up the huge can of worms that is doing this correctly...

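A hedged Go sketch of that representation negotiation, in the spirit of HTTP content negotiation; the representation names and the offerer-preference rule are assumptions made for illustration:

```go
package main

import "fmt"

// negotiate picks a representation both sides understand, preferring the
// offerer's ordering, and falls back to "uncompressed", which every node
// (including old ones) is assumed to support.
func negotiate(offered, accepted []string) string {
	acceptSet := make(map[string]bool, len(accepted))
	for _, rep := range accepted {
		acceptSet[rep] = true
	}
	for _, rep := range offered { // offered is ordered best-first
		if acceptSet[rep] {
			return rep
		}
	}
	return "uncompressed"
}

func main() {
	// a offers Qm1234 as zstd, deflate or uncompressed; b only accepts
	// deflate and uncompressed, so they settle on deflate and ship the raw
	// deflated bytes exactly as they sit in a's storage.
	offered := []string{"zstd", "deflate", "uncompressed"}
	accepted := []string{"uncompressed", "deflate"}
	fmt.Println(negotiate(offered, accepted)) // -> deflate
}
```
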
---
#### (2018-11-05T19:07:47Z) mikeal:
> doing the compression transparently at the storage and transport layer has the disadvantage that you will often try to compress data where compression is futile.

Compression at the storage layer should be configurable. Futile transport-layer compression is pretty common; the vast majority of HTTP traffic is gzip compressed even when it's video and image content :/ Also, even if the binary data in blocks is already compressed, there's plenty of other stuff going on in the transport that will benefit from compression.

---
#### (2018-11-05T19:20:47Z) rklaehn:
> > doing the compression transparently at the storage and transport layer has the disadvantage that you will often try to compress data where compression is futile.
>
> Compression at the storage layer should be configurable. Futile transport-layer compression is pretty common; the vast majority of HTTP traffic is gzip compressed even when it's video and image content :/

I am aware of that, but gzip is not going to be able to extract any redundancy out of a chunk of an mp4 file, so why do it?

> Also, even if the binary data in blocks is already compressed, there's plenty of other stuff going on in the transport that will benefit from compression.

Sure. But I think the benefit of sending bytes over the wire exactly as they are in storage probably outweighs the advantage of saving a few bytes by compressing the protocol overhead. Depends on the exact use case, of course.

---
#### (2018-11-05T20:18:11Z) mikeal:
> I am aware of that, but gzip is not going to be able to extract any redundancy out of a chunk of an mp4 file, so why do it?
>
> Sure. But I think the benefit of sending bytes over the wire exactly as they are in storage probably outweighs the advantage of saving a few bytes by compressing the protocol overhead. Depends on the exact use case, of course.

The answer to both of these is the same: because it's very easy to just compress the entire transport stream. It would be much, much harder to compress only *parts* of the transport stream and to negotiate each part by pulling extra metadata out of the storage format.

---
#### (2018-11-05T20:49:51Z) Stebalien:
So, the key issue here is that compressing before hashing causes some significant issues with deduplication. Otherwise, our options are effectively equivalent in terms of performance at the end of the day; they just move complexity around.

Even if we compress at the transport/storage layer, we can still have smart transports and smart storage systems that:

* Compress based on types/application hints (we can feed this information through from the application).
* Compress based on built-in heuristics (even ML). It shouldn't be that expensive to check for compressibility before compressing with an expensive compression algorithm (a rough sketch of such a check follows below).
* Pass compressed objects directly through from the transport to the storage layer. This *does* require a smarter transport, but that's something that can be negotiated.

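Here is a rough Go sketch of the compressibility check from the second bullet: a fast `flate` pass over a small sample decides whether an expensive algorithm is worth running at all. The 4 KiB sample size and 10% threshold are arbitrary values chosen for this sketch, not tuned numbers from any IPFS implementation:

```go
package main

import (
	"bytes"
	"compress/flate"
	"fmt"
)

// looksCompressible runs a cheap, fast compression pass over a small
// sample of the block and reports whether it shrank meaningfully.
func looksCompressible(data []byte) bool {
	sample := data
	if len(sample) > 4096 {
		sample = sample[:4096]
	}
	var buf bytes.Buffer
	zw, err := flate.NewWriter(&buf, flate.BestSpeed)
	if err != nil {
		return false
	}
	if _, err := zw.Write(sample); err != nil {
		return false
	}
	if err := zw.Close(); err != nil {
		return false
	}
	// Require at least ~10% savings on the sample before bothering.
	return buf.Len() < len(sample)*9/10
}

func main() {
	text := bytes.Repeat([]byte("dag-cbor nodes full of repeated keys "), 200)
	fmt.Println(looksCompressible(text)) // true: highly redundant
	// Already-compressed media would typically fail the check and be
	// stored/sent as-is, skipping the expensive algorithm entirely.
}
```
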
---
#### (2018-11-05T20:52:43Z) rklaehn:
> > I am aware of that, but gzip is not going to be able to extract any redundancy out of a chunk of an mp4 file, so why do it?
> > Sure. But I think the benefit of sending bytes over the wire exactly as they are in storage probably outweighs the advantage of saving a few bytes by compressing the protocol overhead. Depends on the exact use case, of course.
>
> The answer to both of these is the same: because it's very easy to just compress the entire transport stream. It would be much, much harder to compress only _parts_ of the transport stream and to negotiate each part by pulling extra metadata out of the storage format.

I agree that this is *easier*: you can just pipe the entire transport stream through inflate/deflate. But I would think that doing so for data that cannot reasonably be compressed, such as video data, would slow things down and/or increase energy consumption for no good reason. This will matter especially when running on low-powered devices.

You will have a tradeoff between a fast but inefficient compression algorithm like LZ4 and a slow but space-efficient one like bzip2. You will also have an overhead for the compression/decompression state of each connection. And all that might be for nothing if the majority of the content being exchanged is already compressed (videos, images etc.), which might frequently be the case.

If you limit your efforts to data where you expect a large gain from compression, such as IPLD data encoded as CBOR, and make sure not to transcode between storage and transport, things should be more efficient. I think this can still be done in a way that keeps the storage and transport layers generic.

---
#### (2018-11-05T21:06:19Z) rklaehn:
> So, the key issue here is that compressing before hashing causes some significant issues with deduplication.

I agree with most of what you say. But personally I don't care *that much* about deduplication not working due to different multiformats encodings.

Let's say I have an application that adds some data in a hypothetical multiformats format `cbor-deflate`. My application is always going to use `cbor-deflate`, and the likelihood of somebody else adding the exact same data is essentially zero. So the only downside would be that I would have to re-encode my IPLD when switching to, say, `cbor-zstd`. That's a price I would be willing to pay for getting compression soon.

What I *do* care about most is that the data is still IPLD data that can be traversed and inspected, instead of essentially a binary blob (which is what I have implemented now, because I *really* need compression for my use case).

---
#### (2018-11-06T09:08:34Z) vmx:
> So a quick and dirty solution would be to just have another multiformats format like cbor-deflate. But then of course you get an explosion of formats: different encodings × different compression algorithms.

That's the reason why I didn't bring that up as a possible solution. I'd rather go with the format wrapping, so you only have one new format per compression algorithm. You'd also get inspection/traversal capabilities from IPLD.

---
#### (2018-11-06T09:13:42Z) vmx:
> * Compress based on types/application hints (we can feed this information through from the application).
>
> * Compress based on built-in heuristics (even ML). It shouldn't be that expensive to check for compressibility before compressing with an expensive compression algorithm.

This sounds nice, but when I think about the implementation, I fear it'll make things quite complicated/error-prone, as you somehow need to keep track of which items are compressed and which are not (and keep that in sync). Though we can always start simple and go from there.

---
#### (2018-11-06T19:54:17Z) Stebalien:
> Though we can always start simple and go from there.

Definitely. Really, I'd hide this behind types. That is, a "smart" transport would produce normal `Block` (or `Node`) objects; however, one would be able to cast to a `CompressedNode` interface to extract the compressed data without re-compressing.

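A minimal Go sketch of that type-based idea: a transport-produced block can be type-asserted to an optional `CompressedNode` interface to get at the already-compressed bytes. The interfaces here are illustrative stand-ins, not the actual go-ipld-format definitions:

```go
package main

import "fmt"

// Block and CompressedNode mirror the idea above; illustrative only.
type Block interface {
	RawData() []byte
}

type CompressedNode interface {
	Block
	CompressedData() []byte // bytes as stored/transported, no re-compression
}

// basicBlock is an ordinary block with only uncompressed bytes.
type basicBlock struct{ data []byte }

func (b basicBlock) RawData() []byte { return b.data }

// gzipBlock came from a "smart" transport and still carries the
// compressed representation alongside the decoded bytes.
type gzipBlock struct {
	data       []byte
	compressed []byte
}

func (b gzipBlock) RawData() []byte        { return b.data }
func (b gzipBlock) CompressedData() []byte { return b.compressed }

// bytesForWire prefers the already-compressed form when a block offers one.
func bytesForWire(blk Block) []byte {
	if cn, ok := blk.(CompressedNode); ok {
		return cn.CompressedData() // pass through, skip re-compressing
	}
	return blk.RawData()
}

func main() {
	plain := basicBlock{data: []byte("plain block")}
	smart := gzipBlock{
		data:       []byte("decoded node"),
		compressed: []byte{0x1f, 0x8b}, // placeholder bytes for the sketch
	}
	fmt.Println(len(bytesForWire(plain)), len(bytesForWire(smart)))
}
```

Plain blocks keep working unchanged; only transports and stores that know about the extra interface get the pass-through benefit.
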
---
#### (2018-12-03T23:55:06Z) mikeal:
I just spent about half a day running a simulation for `npm-on-ipfs`: https://github.com/ipfs-shipyard/ipfs-npm-registry-mirror/issues/6

TL;DR: a use case practically designed to show the benefits of de-duplication doesn't even touch the performance gains of just compressing an entire archive on every release and not de-duplicating **anything**.

I'm still processing the results of this investigation, but it's starting to make me think that we might be focusing on the wrong performance cases in IPLD/IPFS across the board. Compression is incredibly powerful, and I'm starting to get skeptical of any numbers we put together without transport or storage compression in place. Reducing round trips is going to help no matter what, but compression is probably the biggest thing we could do to improve the performance of most use cases.

---
#### (2018-12-04T00:44:38Z) Stebalien:
> focusing on the wrong performance cases in IPLD/IPFS across the board

If we were *bandwidth* constrained, that would be the issue. However, I've never seen a bandwidth-constrained test *except* on localhost.

I do agree that we need to deal with this ASAP.

---
#### (2018-12-04T16:42:27Z) mikeal:
> If we were bandwidth constrained, that would be the issue. However, I've never seen a bandwidth-constrained test except on localhost.

Not necessarily:

* The faster we get early file resources, the quicker we can make subsequent requests in parallel (with each layer of the graph we traverse, we gain parallelism). Even if we haven't saturated the connection, compressing the resources will speed them up, which in turn reduces the time it takes to actually saturate the connection.
* Depending on the congestion control algorithm, early requests are much slower than subsequent requests. TCP (and TFRC when using UDP) use a loss-based congestion control algorithm that ramps up, increasing the send rate until it sees loss. Because we connect to multiple peers we hit this in every connection, and we hit it hardest in the initial stages of a connection, unless we're using an alternative congestion control algorithm I don't know about.
* Mobile packet loss plays havoc with these algorithms, and mobile infrastructure tries to compensate by keeping a buffer in the network layer. In the long run this helps, but it tends to make the initial connection speed fluctuate with spikes up and down, and sending larger amounts of data before this normalizes tends to make it worse.

---
#### (2018-12-04T18:28:10Z) Stebalien:
WRT packet loss, a large issue there is that go-ipfs currently sends out *way* too many packets (we need to buffer better).

WRT compression, I'd be surprised if intermediate nodes were all that compressible. They tend to *mostly* be composed of hashes.