Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JIT-compille DataFusion expression with column name #2124

Merged
merged 4 commits into from
Mar 31, 2022

Conversation

Dandandan
Copy link
Contributor

Which issue does this PR close?

Closes #2123

Rationale for this change

Allowing to compile more realistic expressions, like (a+1) as a step towards generating code to run on Arrays.

What changes are included in this PR?

Are there any user-facing changes?

@Dandandan Dandandan force-pushed the create_jit_from_column_reference branch from 5aea815 to 9d358b6 Compare March 29, 2022 19:41
@github-actions github-actions bot added the datafusion Changes in the datafusion crate label Mar 29, 2022
@Dandandan Dandandan changed the title JIT-compille DataFusion expression with alias / column name #2123 JIT-compille DataFusion expression with column name #2123 Mar 29, 2022
@Dandandan Dandandan changed the title JIT-compille DataFusion expression with column name #2123 JIT-compille DataFusion expression with column name Mar 29, 2022
@yjshen
Copy link
Member

yjshen commented Mar 30, 2022

Thanks again @Dandandan for working on JIT! I hope I can have time to work with you on jit soon.

FWIW, one of our blockers is Cranelift has no ability to inline assembly, so it causes performance regression for columnar to row conversion (as recognized in #1975). A possible solution is to use LLVM instead of Cranelift because PostgreSQL and Impala have proved that inline is feasible in LLVM. I would pick it up if it's not taken already at the time.

@Dandandan
Copy link
Contributor Author

Thanks again @Dandandan for working on JIT! I hope I can have time to work with you on jit soon.

FWIW, one of our blockers is Cranelift has no ability to inline assembly, so it causes performance regression for columnar to row conversion (as recognized in #1975). A possible solution is to use LLVM instead of Cranelift because PostgreSQL and Impala have proved that inline is feasible in LLVM. I would pick it up if it's not taken already at the time.

Cool, thanks for the context.

I was thinking for generating the code for compiling expressions to cranelift jit, this would not be bad, as we could generate the whole loop instead to generate the new array contents which doesn't need to call any function. I am not sure if we run into the same problem as that thread, as it is about accessors for datastructures? It seems if we can operate on arrays with primitive datatypes this wouldn't be problematic?

I was thinking of the following strategy:

  • Get the start address, array length and the bytes for the primitive type.
  • Pre-allocate an output array, pass the address
  • Use this info to generate the full loop where the loop uses the length, the assignment uses the addresses and increases the pointer with the number of bytes.

Is this somewhere going to get stuck?

@yjshen
Copy link
Member

yjshen commented Mar 30, 2022

I'm a bit confused here. Do you mean jorgecarleitao/arrow2#627 chaining expressions and avoid repeated array allocation through Jit?

@alamb
Copy link
Contributor

alamb commented Mar 30, 2022

chaining expressions and avoid repeated array allocation through Jit?

FWIW I think if we "JIT'd" the entire expression, an approach like jorgecarleitao/arrow2#627 will likely not help all that much (as the temporaries will be in registers rather than being materialized in intermediate arrow arrays)

@Dandandan
Copy link
Contributor Author

Dandandan commented Mar 30, 2022

I'm a bit confused here. Do you mean jorgecarleitao/arrow2#627 chaining expressions and avoid repeated array allocation through Jit?

The goal in that issue is similar yes, but here I suggest using generated code instead of reusing arrays.

The idea is that we can compile the entire loop.
The compiled expression a + b would roughly compile to something like the following pseudo code:

i = 0
while i < length {
    *item = *a + *b;

    a += size_a;
    b += size_b;
    item += size_item;
}

Here item is the pointer to items in the target array and a / b are pointing to items in arrays a and b.
In this case we only need to allocate one target array, instead of intermediate arrays.

@houqp
Copy link
Member

houqp commented Mar 31, 2022

I think we are good as long as we don't invoke rust functions from the JIT'd code. I am curious if cranelift will be smart enough to auto vectorize the loop though :)

If we keep the code gen API similar, then when we switch to LLVM, the migration shouldn't be too hard.

@Dandandan
Copy link
Contributor Author

Dandandan commented Mar 31, 2022

I think we are good as long as we don't invoke rust functions from the JIT'd code. I am curious if cranelift will be smart enough to auto vectorize the loop though :)

If we keep the code gen API similar, then when we switch to LLVM, the migration shouldn't be too hard.

💯 Maybe it could event coexist together (both a Cranelift / LLVM backend).

Also not sure on the autovectorization, though I am hopeful we might be able to instrument cranelift enough to have it compile to SIMD instructions, given that we now more about the data types and code than a generic for-loop.
(there are for example SIMD types https://docs.rs/cranelift/latest/cranelift/prelude/types/index.html).

@Dandandan Dandandan merged commit 4c2320e into apache:master Mar 31, 2022
@alamb
Copy link
Contributor

alamb commented Apr 2, 2022

I am sorry I am not following along this work this too carefully, but I really like where this is headed. JIT compiling predicate evaluation is definitely state of the art and I love to see it arriving in DataFusion. I think we can potentially get pretty far just JIT compiling the most common predicates even if for more complicated one we (initially) fall back to vectorized evaluation

So thank you all for pushing this forward

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
datafusion Changes in the datafusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

JIT-compille DataFusion expression with column name
4 participants