Initial Snowflake Arrow prototype #1

Closed
wants to merge 6 commits into from


joshuataylor
Owner

@joshuataylor joshuataylor commented Jun 4, 2022

Hi!

Thanks so much for checking out this repository, and this PR.

I don't have much experience with Rust (only a few small apps here and there to learn the basics), so this has been a fantastic learning experience.

Please see this discussion for a rundown about my initial thoughts, and a fantastic response from @jorgecarleitao.

This pull request adds support for decoding the Arrow IPC stream files that Snowflake returns from its REST API, using Rustler from within Elixir. This allows req_snowflake to use this library as an optional dependency to decode Arrow responses alongside the JSON files Snowflake sends.

Approach

My initial approach converts the Arrow types to Rust types, which can then be serialised to Elixir types easily. I'm not sure how memcpy would work in this case, or whether there is a good way to do this zero-copy. Right now it feels like Arrow2 parses the data and then we iterate over it a second time to encode it. I also have a feeling Rustler might be encoding the data as well?
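To make the "Arrow types to Rust types" step concrete, here is a minimal sketch of a single-pass column conversion. It uses only the standard library: `Column`, `Value`, and `column_to_values` are hypothetical stand-ins invented for illustration, not the Arrow2 or Rustler APIs, and the real code in lib.rs may be structured differently.

```rust
/// Hypothetical stand-in for a decoded Arrow column: typed values plus a
/// validity mask (true = value present). This is NOT Arrow2's API.
#[derive(Debug, Clone, PartialEq)]
pub enum Column {
    Int64 { values: Vec<i64>, validity: Vec<bool> },
    Utf8 { values: Vec<String>, validity: Vec<bool> },
}

/// Hypothetical stand-in for the term that would be handed back to Elixir.
#[derive(Debug, Clone, PartialEq)]
pub enum Value {
    Nil,
    Int(i64),
    Str(String),
}

/// Converts one column in a single pass, mapping null slots to `Value::Nil`.
pub fn column_to_values(column: &Column) -> Vec<Value> {
    match column {
        Column::Int64 { values, validity } => values
            .iter()
            .zip(validity)
            .map(|(v, &ok)| if ok { Value::Int(*v) } else { Value::Nil })
            .collect(),
        Column::Utf8 { values, validity } => values
            .iter()
            .zip(validity)
            // Strings are cloned here, so this sketch is not zero-copy.
            .map(|(v, &ok)| if ok { Value::Str(v.clone()) } else { Value::Nil })
            .collect(),
    }
}
```

Each value is visited once on the Rust side; whether a further copy happens when the result crosses into the BEAM is exactly the open question above.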

Feedback

I would love feedback around the following areas:

1/ What is the best way to return Rust types to Elixir as Elixir types, with casting as an option, so that we could return a Snowflake date as an Elixir Date without having to convert it in Elixir?
2/ Is the way I'm doing the encoding good? I think I could just move the encoder into the main function and return a Vec of Terms, right? At the moment it feels a bit like pass the parcel: the data is sent off to be encoded, and that step then encodes it in yet another function.
3/ Would it be worthwhile using something like multithreading (rayon or similar)? I'm guessing not unless it's a huge amount of data, and from what I can see Snowflake will only send us less than 20 MB.
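For question 1, one option is to do the date arithmetic on the Rust side. Arrow's Date32 type stores a date as a count of days since 1970-01-01, so a minimal sketch, assuming Snowflake's DATE columns arrive in that form, only needs a days-to-calendar conversion. The name `civil_from_days` and the tuple return type are illustrative choices. Turning the tuple into an Elixir `Date` struct inside the NIF is not shown.

```rust
/// Converts days since 1970-01-01 into (year, month, day) in the proleptic
/// Gregorian calendar. Assumes the input is an Arrow Date32-style day count.
pub fn civil_from_days(days: i32) -> (i64, u32, u32) {
    // Shift the epoch to 0000-03-01 so leap days fall at the end of a "year".
    let z = i64::from(days) + 719_468;
    let era = z.div_euclid(146_097); // 400-year cycles
    let doe = z.rem_euclid(146_097); // day of era, 0..=146_096
    let yoe = (doe - doe / 1_460 + doe / 36_524 - doe / 146_096) / 365; // 0..=399
    let doy = doe - (365 * yoe + yoe / 4 - yoe / 100); // day of year, 0..=365
    let mp = (5 * doy + 2) / 153; // month index starting at March, 0..=11
    let day = (doy - (153 * mp + 2) / 5 + 1) as u32;
    let month = (if mp < 10 { mp + 3 } else { mp - 9 }) as u32;
    // January and February belong to the following calendar year.
    let year = yoe + era * 400 + if month <= 2 { 1 } else { 0 };
    (year, month, day)
}
```

For example, `civil_from_days(0)` gives `(1970, 1, 1)` and `civil_from_days(19147)` gives `(2022, 6, 4)`, the date this PR was opened.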

I've added tests to check that the data converts correctly and that the NIF works, so there are a lot of extra files, sorry.

Main files:
lib.rs:

https://github.com/joshuataylor/snowflake_arrow/pull/1/files#diff-a329bc6502ab443b8ba9f5ce11b9d2cef36f441771d5d11437574d6e9b1c58a4

Here are some initial benchmarking results. These don't cast into Elixir structs etc. yet:

Laptop, Apple MacBook M1

CPU Information: Apple M1
Number of Available Cores: 8
Available memory: 16 GB
Elixir 1.13.4
Erlang 25.0

Benchmark suite executing with the following configuration:
warmup: 2 s
time: 10 s
memory time: 0 ns
reduction time: 0 ns
parallel: 1
inputs: none specified
Estimated total run time: 24 s

Benchmarking arrow_large (9.4mb) ...
Benchmarking arrow_small (368kb) ...

Name                          ips        average  deviation         median         99th %
arrow_small (368kb)        1.15 K        0.87 ms    ±21.48%        0.79 ms        1.20 ms
arrow_large (9.4mb)       0.148 K        6.76 ms    ±25.01%        5.91 ms        9.32 ms

Comparison:
arrow_small (368kb)        1.15 K
arrow_large (9.4mb)       0.148 K - 7.77x slower +5.89 ms

Desktop (slowish single-core performance)

Operating System: Linux
CPU Information: AMD Ryzen Threadripper 2990WX 32-Core Processor
Number of Available Cores: 64
Available memory: 125.82 GB
Elixir 1.13.4
Erlang 25.0

Benchmark suite executing with the following configuration:
warmup: 2 s
time: 10 s
memory time: 0 ns
reduction time: 0 ns
parallel: 1
inputs: none specified
Estimated total run time: 24 s

Benchmarking arrow_large (9.4mb) ...
Benchmarking arrow_small (368kb) ...

Name                          ips        average  deviation         median         99th %
arrow_small (368kb)        488.32        2.05 ms    ±44.27%        1.45 ms        3.80 ms
arrow_large (9.4mb)         56.52       17.69 ms     ±9.90%       17.45 ms       24.26 ms

Comparison: 
arrow_small (368kb)        488.32
arrow_large (9.4mb)         56.52 - 8.64x slower +15.65 ms

Desktop, AMD Ryzen 5 5600X

Operating System: Linux
CPU Information: AMD Ryzen 5 5600X 6-Core Processor
Number of Available Cores: 12
Available memory: 31.33 GB
Elixir 1.13.4
Erlang 25.0

Benchmark suite executing with the following configuration:
warmup: 2 s
time: 10 s
memory time: 0 ns
reduction time: 0 ns
parallel: 1
inputs: none specified
Estimated total run time: 24 s

Benchmarking arrow_large (9.4mb) ...
Benchmarking arrow_small (368kb) ...

Name                          ips        average  deviation         median         99th %
arrow_small (368kb)        1.08 K        0.92 ms    ±22.65%        0.83 ms        1.28 ms
arrow_large (9.4mb)       0.121 K        8.25 ms    ±28.22%        7.47 ms       11.54 ms

Comparison: 
arrow_small (368kb)        1.08 K
arrow_large (9.4mb)       0.121 K - 8.94x slower +7.33 ms
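The comparison lines can be checked by hand from the ips and average columns. The sketch below shows the arithmetic and adds a rough throughput figure. The function names are illustrative, and the throughput calculation assumes the "9.4mb" and "368kb" labels are decimal megabytes and kilobytes, which the benchmark names do not confirm.

```rust
/// How many times slower the slow job is, from iterations per second.
pub fn slowdown(fast_ips: f64, slow_ips: f64) -> f64 {
    fast_ips / slow_ips
}

/// Extra average time per run of the slow job, in milliseconds.
pub fn extra_ms(fast_avg_ms: f64, slow_avg_ms: f64) -> f64 {
    slow_avg_ms - fast_avg_ms
}

/// Decode throughput in MB/s, assuming `size_mb` is in decimal megabytes.
pub fn throughput_mb_per_s(size_mb: f64, avg_ms: f64) -> f64 {
    size_mb / (avg_ms / 1000.0)
}
```

With the M1 numbers, `slowdown(1150.0, 148.0)` is about 7.77 and `extra_ms(0.87, 6.76)` is about 5.89, matching the table, while `throughput_mb_per_s(9.4, 6.76)` comes out near 1390 MB/s. Because the table shows rounded values, recomputed ratios can differ from the printed ones in the last digit, as with the 5600X row.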
