-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[JS] Implement IPC RecordBatch body buffer compression from ARROW-300 #24833
Comments
eric mauviere: |
Dominik Moritz / @domoritz: |
Kyle Barron: I think there are various reasons why we might not want to pull in LZ4 and ZSTD implementations by default:
|
Dominik Moritz / @domoritz: Could you look at the available js libraries and see what their sizes are? Also, is lz4 or zstd much more common than the other? We also should look into how much benefit we actually get from compression since most servers already support transparent gzip compression and so compressing an already compressed file will just incur overhead. If the libraries are too heavy, we can think about a plugin system. We could make our registry be synchronous. I definitely don't want to pull in wasm into the library as it will break people's workflows. |
Kyle Barron:
Example of writing table to buffer uncompressed, then using
Example of writing table to buffer with lz4 compression:
Example of writing table to buffer with zstd compression:
|
Maybe. Let's make sure to compare native gzip compression that a web server uses with js lz4/zstd compression.
It would unfortunately also preclude people from putting decompression into a worker. Maybe we can make the relevant IPC methods return return promises when the compression/decompression method is async (returns a promise).
We could consider inlining the wasm code with base64 if it's tiny but I suspect it will not. Worth considering, though. Anyway, I think it makes sense to work on this and send a pull request. We should definitely have a way to pass in/register compression algorithms. Then let's look into whether we want to bundle any algorithms. Let's start with lz4 and try a few libraries (e.g. https://github.com/gorhill/lz4-wasm, https://github.com/Benzinga/lz4js, https://github.com/pierrec/node-lz4). If they are small enough, I would consider including a default lz4 implementation. Sounds good? |
Dominik Moritz / @domoritz: |
I'm most familiar with fastapi, which is probably the third most-popular Python web server framework after Django and Flask. Its suggested gzip middleware uses the standard library's gzip implementation so I don't think my example above was completely out of place. The lzbench native benchmarks still have lz4 and zstd as 4-6x faster than zlib. But I think these performance discussions are more of a side discussion; given that the Arrow IPC format allows for compression, I'd love to find a way for Arrow JS to support these files.
That's a very good point. If we implement a registry of some sort, we could consider allowing both sync and async of compression. Then the
Sounds good! I'll try to find time soon to put up a draft. |
Hey folks! Great to see the discussion on this issue. We've been looking at using arrow to send some data over the wire from C# to JS. Right now when we try to read record batches from the stream, we see an error It seems like the js library is the only one missing buffer compression support compared to other libraries. This can essentially block interop when passing arrow data from another language to js, with compression enabled. What's the extent of compression support in the js library right now? I did see a file called |
Any update on the timelines for JS implementation to support buffer compression |
This may not be a hard requirement for JS because this would require pulling in implementations of LZ4 and ZSTD which not all users may want
Reporter: Wes McKinney / @wesm
PRs and other links:
Note: This issue was originally created as ARROW-8674. Please see the migration documentation for further details.
The text was updated successfully, but these errors were encountered: