JIT-compille DataFusion expression with column name #2124

Dandandan · 2022-03-29T19:26:25Z

Which issue does this PR close?

Closes #2123

Rationale for this change

Allowing to compile more realistic expressions, like (a+1) as a step towards generating code to run on Arrays.

What changes are included in this PR?

Are there any user-facing changes?

yjshen · 2022-03-30T02:16:31Z

Thanks again @Dandandan for working on JIT! I hope I can have time to work with you on jit soon.

FWIW, one of our blockers is Cranelift has no ability to inline assembly, so it causes performance regression for columnar to row conversion (as recognized in #1975). A possible solution is to use LLVM instead of Cranelift because PostgreSQL and Impala have proved that inline is feasible in LLVM. I would pick it up if it's not taken already at the time.

Dandandan · 2022-03-30T06:25:40Z

Thanks again @Dandandan for working on JIT! I hope I can have time to work with you on jit soon.

FWIW, one of our blockers is Cranelift has no ability to inline assembly, so it causes performance regression for columnar to row conversion (as recognized in #1975). A possible solution is to use LLVM instead of Cranelift because PostgreSQL and Impala have proved that inline is feasible in LLVM. I would pick it up if it's not taken already at the time.

Cool, thanks for the context.

I was thinking for generating the code for compiling expressions to cranelift jit, this would not be bad, as we could generate the whole loop instead to generate the new array contents which doesn't need to call any function. I am not sure if we run into the same problem as that thread, as it is about accessors for datastructures? It seems if we can operate on arrays with primitive datatypes this wouldn't be problematic?

I was thinking of the following strategy:

Get the start address, array length and the bytes for the primitive type.
Pre-allocate an output array, pass the address
Use this info to generate the full loop where the loop uses the length, the assignment uses the addresses and increases the pointer with the number of bytes.

Is this somewhere going to get stuck?

yjshen · 2022-03-30T17:45:01Z

I'm a bit confused here. Do you mean jorgecarleitao/arrow2#627 chaining expressions and avoid repeated array allocation through Jit？

alamb · 2022-03-30T18:33:31Z

chaining expressions and avoid repeated array allocation through Jit？

FWIW I think if we "JIT'd" the entire expression, an approach like jorgecarleitao/arrow2#627 will likely not help all that much (as the temporaries will be in registers rather than being materialized in intermediate arrow arrays)

Dandandan · 2022-03-30T19:19:32Z

I'm a bit confused here. Do you mean jorgecarleitao/arrow2#627 chaining expressions and avoid repeated array allocation through Jit？

The goal in that issue is similar yes, but here I suggest using generated code instead of reusing arrays.

The idea is that we can compile the entire loop.
The compiled expression a + b would roughly compile to something like the following pseudo code:

i = 0
while i < length {
    *item = *a + *b;

    a += size_a;
    b += size_b;
    item += size_item;
}

Here item is the pointer to items in the target array and a / b are pointing to items in arrays a and b.
In this case we only need to allocate one target array, instead of intermediate arrays.

houqp · 2022-03-31T05:17:26Z

I think we are good as long as we don't invoke rust functions from the JIT'd code. I am curious if cranelift will be smart enough to auto vectorize the loop though :)

If we keep the code gen API similar, then when we switch to LLVM, the migration shouldn't be too hard.

Dandandan · 2022-03-31T10:09:03Z

I think we are good as long as we don't invoke rust functions from the JIT'd code. I am curious if cranelift will be smart enough to auto vectorize the loop though :)

If we keep the code gen API similar, then when we switch to LLVM, the migration shouldn't be too hard.

💯 Maybe it could event coexist together (both a Cranelift / LLVM backend).

Also not sure on the autovectorization, though I am hopeful we might be able to instrument cranelift enough to have it compile to SIMD instructions, given that we now more about the data types and code than a generic for-loop.
(there are for example SIMD types https://docs.rs/cranelift/latest/cranelift/prelude/types/index.html).

alamb · 2022-04-02T10:30:19Z

I am sorry I am not following along this work this too carefully, but I really like where this is headed. JIT compiling predicate evaluation is definitely state of the art and I love to see it arriving in DataFusion. I think we can potentially get pretty far just JIT compiling the most common predicates even if for more complicated one we (initially) fall back to vectorized evaluation

So thank you all for pushing this forward

Add code to create expression based on schema information

9d358b6

Dandandan force-pushed the create_jit_from_column_reference branch from 5aea815 to 9d358b6 Compare March 29, 2022 19:41

github-actions bot added the datafusion Changes in the datafusion crate label Mar 29, 2022

Dandandan added 2 commits March 29, 2022 21:42

Add spaces

8d4f212

Move to top

f55d4ed

Dandandan changed the title ~~JIT-compille DataFusion expression with alias / column name #2123~~ JIT-compille DataFusion expression with column name #2123 Mar 29, 2022

Clippy

2af1436

Dandandan changed the title ~~JIT-compille DataFusion expression with column name #2123~~ JIT-compille DataFusion expression with column name Mar 29, 2022

yjshen approved these changes Mar 30, 2022

View reviewed changes

Dandandan merged commit 4c2320e into apache:master Mar 31, 2022

waynexia mentioned this pull request May 22, 2022

Evaluate JIT'd expression over arrays #2587

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

JIT-compille DataFusion expression with column name #2124

JIT-compille DataFusion expression with column name #2124

Dandandan commented Mar 29, 2022

yjshen commented Mar 30, 2022

Dandandan commented Mar 30, 2022

yjshen commented Mar 30, 2022

alamb commented Mar 30, 2022

Dandandan commented Mar 30, 2022 •

edited

Loading

houqp commented Mar 31, 2022

Dandandan commented Mar 31, 2022 •

edited

Loading

alamb commented Apr 2, 2022

JIT-compille DataFusion expression with column name #2124

JIT-compille DataFusion expression with column name #2124

Conversation

Dandandan commented Mar 29, 2022

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

yjshen commented Mar 30, 2022

Dandandan commented Mar 30, 2022

yjshen commented Mar 30, 2022

alamb commented Mar 30, 2022

Dandandan commented Mar 30, 2022 • edited Loading

houqp commented Mar 31, 2022

Dandandan commented Mar 31, 2022 • edited Loading

alamb commented Apr 2, 2022

Dandandan commented Mar 30, 2022 •

edited

Loading

Dandandan commented Mar 31, 2022 •

edited

Loading