Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix tpcds planning stack overflows - Join planning refactoring #9962

Merged
merged 2 commits into from
Apr 5, 2024

Conversation

Jefffrey
Copy link
Contributor

@Jefffrey Jefffrey commented Apr 5, 2024

Which issue does this PR close?

Closes #8837

Rationale for this change

Thanks to the previous analysis from #1047 and with some lldb debugging, was able to identify that create_initial_plan(...) and it's sibling create_initial_plan_multi(...) were the culprits for the stack overflow (also identified by this comment #8837 (comment))

So went looking through it and found a place were there was a large number of local variables before the recursive call to create_initial_plan(...)/create_initial_plan_multi(...) and tested out extracting those into a separate function. This seems to resolve the issue.

What changes are included in this PR?

Extract local variables and functionality in physical join planning in create_initial_plan(...) to a separate function to try reduce stack size before the recursive create_initial_plan(...) call

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added the core Core DataFusion crate label Apr 5, 2024
@Jefffrey Jefffrey marked this pull request as ready for review April 5, 2024 13:27
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @Jefffrey -- this is a great improvement

I tested it locally too. On main:

$ cargo test --test tpcds_planning
...
test tpcds_physical_q5 ... ok

thread 'tpcds_physical_q54' has overflowed its stack
fatal runtime error: stack overflow
error: test failed, to rerun pass `-p datafusion --test tpcds_planning`

On this branch:

$ cargo test --test tpcds_planning

test result: ok. 192 passed; 0 failed; 6 ignored; 0 measured; 0 filtered out; finished in 1.08s

run: |
echo "RUSTC_WRAPPER=sccache" >> $GITHUB_ENV
echo "SCCACHE_GHA_ENABLED=true" >> $GITHUB_ENV
echo "RUST_BACKTRACE=1" >> $GITHUB_ENV
echo "RUST_MIN_STACK=3000000" >> $GITHUB_ENV
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😍

@alamb
Copy link
Contributor

alamb commented Apr 5, 2024

FYI @comphead

@alamb alamb merged commit dd8a5d3 into apache:main Apr 5, 2024
24 checks passed
@comphead
Copy link
Contributor

comphead commented Apr 5, 2024

wow!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[CI] Investigate stack overflow errors on TPC-DS queries
3 participants