[2.x] Replace ParallelGzipOutputStream with Ichoran's implementation #1456
Conversation
TODO: add a bunch of unit tests.
Non-concurrent tests:
Concurrent tests:
```scala
val compressedSize = d.length()
val compressionRatio = compressedSize.toDouble / uncompressedSize.toDouble
assert(compressionRatio < 0.85)
// compression rate for each data: 0.8185090254676337, 0.7247774786370688, 0.8346021341469837
```
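The ratio assertion above can be reproduced with plain `java.util.zip`. Here is a minimal sketch; the helper name and sample data are my own illustrations, not from the PR:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPOutputStream;

public class CompressionRatioCheck {
    // Hypothetical helper (not from the PR): gzip `data` in memory and
    // return compressed size / uncompressed size.
    static double gzipRatio(byte[] data) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(data);
        }
        return bos.size() / (double) data.length;
    }

    public static void main(String[] args) throws IOException {
        // Text-like input; the PR's real analysis files gave ratios around 0.72-0.83
        byte[] sample = "the quick brown fox jumps over the lazy dog\n".repeat(1000).getBytes();
        System.out.println("ratio = " + gzipRatio(sample));
    }
}
```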
The compression rate looks somewhat underwhelming... but we only have 3 data points. Maybe Stefan's internal testing at Databricks showed better compression rates that motivated the introduction of GZIP compression.
Potentially related comment on this #1326 (comment)
The gzip compression is very slow relative to everything else, despite gzip with native zlib being fast and using the best performance/size tradeoff in the compression settings as determined by benchmarks. Or maybe I should say: the new serializer and deserializer are very fast compared to gzip.
I reran a quick benchmark. Writing without compression takes 153ms, writing with Java's standard GZIPOutputStream is 402ms, writing with ParallelGzipOutputStream is 157ms. The latter is on a 12-core M2 Max (but with only 230% CPU usage; that's enough to offload all the compression to background threads, making it almost free in terms of wall clock time).
The compression level currently set is:
```scala
private val compression = Deflater.DEFAULT_COMPRESSION
```
So my read is that by using parallel gzip, it's not like we get an amazing compression but we get it almost without additional time penalty.
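For context on the performance/size tradeoff mentioned above, a small sketch comparing how `Deflater`'s levels behave on repetitive data (the helper and sample data are illustrative assumptions, not benchmark code from the PR):

```java
import java.util.zip.Deflater;

public class LevelTradeoff {
    // Hypothetical helper: deflate `data` at the given level, return output size.
    static int deflatedSize(byte[] data, int level) {
        Deflater d = new Deflater(level);
        d.setInput(data);
        d.finish();
        byte[] buf = new byte[8192];
        int total = 0;
        while (!d.finished()) {
            total += d.deflate(buf); // may take several calls for large input
        }
        d.end();
        return total;
    }

    public static void main(String[] args) {
        byte[] sample = "zinc analysis file contents repeat a lot\n".repeat(2000).getBytes();
        System.out.println("BEST_SPEED:          " + deflatedSize(sample, Deflater.BEST_SPEED));
        System.out.println("DEFAULT_COMPRESSION: " + deflatedSize(sample, Deflater.DEFAULT_COMPRESSION));
        System.out.println("BEST_COMPRESSION:    " + deflatedSize(sample, Deflater.BEST_COMPRESSION));
    }
}
```

Higher levels spend more CPU per byte; with the parallel stream that extra CPU lands on background threads, which is why the default level is an acceptable middle ground.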
Transient CI failure observed
Well, that error isn't very informative :/ Is there any way to get your hands on the file so we can inspect what went wrong? "Not in GZip format" could be anything. Is that an sbt message, or is it wrapping the Java error without the full stack trace?
Here's the full stack trace for documentation. It does not tell much other than the fact that the GZIP is not properly compressed.
I made a test branch to gather more information. I managed to reproduce another failure involving small data (which may be helpful for debugging).
(Size refers to the size of the uncompressed data; i.e., the failure happened when an empty byte array was compressed.) The uncompressed data and the improperly compressed gzip for the above failure are in https://github.com/Friendseeker/zinc/actions/runs/11334286709/artifacts/2054937941. The improperly compressed GZIP has 10 bytes.
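Those 10 bytes are likely just the fixed gzip header, which is exactly 10 bytes: a well-formed gzip stream of even zero-byte input must also contain an empty deflate block and the 8-byte CRC32/ISIZE trailer. A quick sketch (my own illustration) of what `java.util.zip` produces for empty input:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class EmptyGzip {
    public static void main(String[] args) throws IOException {
        // Write a gzip stream with no payload at all
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        new GZIPOutputStream(bos).close();
        byte[] out = bos.toByteArray();
        // Must be longer than the 10-byte header seen in the failing artifact
        System.out.println("length = " + out.length);
        // gzip magic number: 1f 8b
        System.out.printf("magic  = %02x %02x%n", out[0] & 0xff, out[1] & 0xff);
        // Round-trips to an empty payload
        GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(out));
        System.out.println("empty  = " + (in.read() == -1));
    }
}
```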
|
I think I might know what is happening. Hopefully that is the only failure.
Oh, it's possible to have this try to write a zero-byte file? I didn't even consider that as a use case. "Cache nothing by writing it into a file" didn't seem like a desirable way to run things. I can look over my code again to make sure it handles that anyway, if you'd like.
Yeah, I don't think it happens. I was just doing smoke testing with some boundary inputs. It is fine leaving this alone then. I can add a comment saying it does not support 0-byte input.
Thanks both!
Were you able to identify the situation where my original implementation causes a deadlock? We're building around 10 million targets per day with it and haven't encountered any problems.
I think the Zinc benchmark got stuck? Also, I wonder if it might be more to do with the ExecutionContext implementation, like thread pool vs. fork/join.
I read through the code carefully and I didn't see any conceptual problems with it. It was JVM 8 specific and behavior varied with the execution context; Friendseeker tested with a small fixed-pool execution context and it blocked every time. There's really no reason that should happen at all: the futures themselves do nothing that ought to block, and the calling thread blocks until the lead future is done, which should at the very worst happen when all work is done (e.g. if there is one thread to work on and everything happens in the worst order). So I could only assume that behind the scenes something about the ThreadLocal access was somehow using blocking operations that required more threads to unwind the problem (ForkJoinPool had intermittent failures).
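The fixed-pool blocking described here has the classic pool-starvation shape. A minimal illustration (my own sketch, not the PR's code) of a task that waits on a sub-task submitted to the same exhausted pool:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class PoolStarvation {
    public static void main(String[] args) throws Exception {
        // A task that blocks on a sub-task submitted to the same
        // single-thread pool starves: the sub-task can never get a worker.
        ExecutorService pool = Executors.newFixedThreadPool(1);
        Future<String> outer = pool.submit(() -> {
            Future<String> inner = pool.submit(() -> "done");
            return inner.get(); // occupies the only worker while waiting
        });
        try {
            outer.get(500, TimeUnit.MILLISECONDS);
            System.out.println("completed");
        } catch (TimeoutException e) {
            System.out.println("starved"); // expected with a one-thread pool
        } finally {
            pool.shutdownNow();
        }
    }
}
```

Raw dedicated threads sidestep this entirely, since the workers never share a bounded pool with the code that waits on them.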
@Ichoran thank you for your contribution!
Overview
This PR features Ichoran's implementation of ParallelGzipOutputStream, which provides binary-identical output and similar performance compared to Stefan's ParallelGzipOutputStream, but uses raw threads directly to avoid a concurrency issue in Stefan's implementation.
Validation
New unit tests are added to stress test ParallelGzipOutputStream against a variety of data sizes & parallelism settings. Tests in which multiple threads each create & use a ParallelGzipOutputStream are added to simulate Zinc's multithreaded usage.
Flow Diagram of Ichoran's ParallelGzipOutputStream algorithm
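The diagram itself is not reproduced here. As a rough sketch of the general dispatch-and-reorder shape such a block-parallel compressor uses (my simplified illustration, not the PR's actual code; note that independently deflated blocks as shown cannot simply be concatenated into one valid gzip stream, since each carries its own final-block marker):

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.zip.Deflater;

public class OrderedBlocks {
    // Raw-deflate one block independently (illustrative only).
    static byte[] deflateBlock(byte[] block) {
        Deflater d = new Deflater(Deflater.DEFAULT_COMPRESSION, true); // raw deflate, no zlib header
        d.setInput(block);
        d.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!d.finished()) {
            out.write(buf, 0, d.deflate(buf));
        }
        d.end();
        return out.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        byte[] data = "some repetitive build output line\n".repeat(4000).getBytes();
        int blockSize = 32 * 1024;
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<Future<byte[]>> pending = new ArrayList<>();
        // Dispatch: workers compress fixed-size blocks in parallel
        for (int off = 0; off < data.length; off += blockSize) {
            byte[] block = Arrays.copyOfRange(data, off, Math.min(off + blockSize, data.length));
            pending.add(pool.submit(() -> deflateBlock(block)));
        }
        // Reorder: drain futures in submission order, so output is deterministic
        // regardless of which worker finishes first
        int total = 0;
        for (Future<byte[]> f : pending) {
            total += f.get().length;
        }
        pool.shutdown();
        System.out.println("compressed " + data.length + " bytes into " + total);
    }
}
```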