Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Large OR list overflows the stack #9375

Closed
alamb opened this issue Feb 27, 2024 · 26 comments · Fixed by #13376
Closed

Large OR list overflows the stack #9375

alamb opened this issue Feb 27, 2024 · 26 comments · Fixed by #13376
Assignees
Labels
bug Something isn't working

Comments

@alamb
Copy link
Contributor

alamb commented Feb 27, 2024

Describe the bug

In InfluxDB we saw people issue queries with many OR chains that caused a stack overflow

SELECT ... WHERE x = 1 OR x = 2 OR ..... x = 10000

To Reproduce

blowout2.zip

Download: blowout.zip

And run

datafusion-cli -f blowout2.sql

This results in

andrewlamb@Andrews-MacBook-Pro Downloads % datafusion-cli  -f blowout2.sql
datafusion-cli  -f blowout2.sql
DataFusion CLI v36.0.0
0 rows in set. Query took 0.015 seconds.

+-------+
| count |
+-------+
| 1     |
+-------+
1 row in set. Query took 0.001 seconds.


thread 'main' has overflowed its stack
fatal runtime error: stack overflow

The query looks like this

SELECT *
FROM foo
WHERE x = 1
  OR x = 1
  OR x = 1
  OR x = 1
  OR x = 1
....

Expected behavior

a runtime error rather than stack overflow. Bonus points if the query actually completed

Additional context

Here is the stack trace in a release build:

datafusion_common::tree_node::TreeNode::transform_up 0x0000000103cb96f4
datafusion_expr::tree_node::expr::transform_boxed 0x0000000103d86c7c
datafusion_common::tree_node::TreeNode::transform_up 0x0000000103cb9d34
.... MANY MORE ....
datafusion_expr::tree_node::expr::transform_boxed 0x0000000103d86c7c
datafusion_common::tree_node::TreeNode::transform_up 0x0000000103cb9d34
datafusion_expr::tree_node::expr::transform_boxed 0x0000000103d86c7c
datafusion_common::tree_node::TreeNode::transform_up 0x0000000103cb9d34
datafusion_expr::tree_node::expr::transform_boxed 0x0000000103d86c7c
datafusion_common::tree_node::TreeNode::transform_up 0x0000000103cb9d34
datafusion_expr::tree_node::expr::transform_boxed 0x0000000103d86c7c
datafusion_common::tree_node::TreeNode::transform_up 0x0000000103cb9d34
datafusion_expr::tree_node::expr::transform_boxed 0x0000000103d86c7c
datafusion_common::tree_node::TreeNode::transform_up 0x0000000103cb9d34
datafusion_expr::tree_node::expr::transform_boxed 0x0000000103d86c7c
datafusion_common::tree_node::TreeNode::transform_up 0x0000000103cb9d34
datafusion_expr::expr::Expr::infer_placeholder_types 0x0000000103cd30e8
datafusion_sql::expr::<impl datafusion_sql::planner::SqlToRel<S>>::sql_to_expr 0x0000000102b4b180
datafusion_sql::select::<impl datafusion_sql::planner::SqlToRel<S>>::plan_selection 0x0000000102b63980
datafusion_sql::select::<impl datafusion_sql::planner::SqlToRel<S>>::select_to_plan 0x0000000102b64b5c
datafusion_sql::set_expr::<impl datafusion_sql::planner::SqlToRel<S>>::set_expr_to_plan 0x0000000102b710b4
datafusion_sql::query::<impl datafusion_sql::planner::SqlToRel<S>>::query_to_plan_with_schema 0x0000000102b60414
datafusion_sql::statement::<impl datafusion_sql::planner::SqlToRel<S>>::sql_statement_to_plan_with_context_impl 0x0000000102b7ab30
datafusion_sql::statement::<impl datafusion_sql::planner::SqlToRel<S>>::statement_to_plan 0x0000000102b76b84
datafusion::execution::context::SessionState::statement_to_plan::{{closure}} 0x0000000102bded28
datafusion_cli::exec::exec_and_print::{{closure}} 0x0000000102be8428
datafusion_cli::exec::exec_from_lines::{{closure}} 0x0000000102be9efc
datafusion_cli::exec::exec_from_files::{{closure}} 0x0000000102be9a38
datafusion_cli::main_inner::{{closure}} 0x0000000102c051b4
tokio::runtime::park::CachedParkThread::block_on 0x0000000102bfef64
tokio::runtime::runtime::Runtime::block_on 0x0000000102d1ae7c
datafusion_cli::main 0x0000000102ca84d0
std::sys_common::backtrace::__rust_begin_short_backtrace 0x0000000102cd44f0
std::rt::lang_start::{{closure}} 0x0000000102d033f0
[Inlined] core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &F>::call_once function.rs:284
[Inlined] std::panicking::try::do_call panicking.rs:552
[Inlined] std::panicking::try panicking.rs:516
[Inlined] std::panic::catch_unwind panic.rs:142
[Inlined] std::rt::lang_start_internal::{{closure}} rt.rs:148
[Inlined] std::panicking::try::do_call panicking.rs:552
[Inlined] std::panicking::try panicking.rs:516
[Inlined] std::panic::catch_unwind panic.rs:142
std::rt::lang_start_internal rt.rs:148
main 0x0000000102ca85d8
start 0x00000001825fd0e0
@alamb alamb added the bug Something isn't working label Feb 27, 2024
@Lordworms
Copy link
Contributor

take

@alamb
Copy link
Contributor Author

alamb commented Feb 28, 2024

FYI @Lordworms I think this will be a pretty challenging task

Also, @peter-toth has an outstanding substantial change to TreeNode APIs here: #8891 which you should be aware of.

@Lordworms
Copy link
Contributor

FYI @Lordworms I think this will be a pretty challenging task

Also, @peter-toth has an outstanding substantial change to TreeNode APIs here: #8891 which you should be aware of.

Yes, My basic plan for this is to change recursion in treenode into iteration, I will test my implementation first and read this PR #8891

@Lordworms
Copy link
Contributor

start to do this one, the current plan is to rewrite the infer_placeholder_types function to iteration.

@jayzhan211
Copy link
Contributor

@Lordworms Are you still working on this?

@Lordworms
Copy link
Contributor

@Lordworms Are you still working on this?

Sorry I forget about this one, you can go for it.

@jayzhan211
Copy link
Contributor

jayzhan211 commented Jun 26, 2024

@alamb I'm thinking of introducing CommutativeExpr that specialize these kind of query to make them possible to transform iteratively. What do you think about this?

#[derive(Clone, PartialEq, Eq, Hash, Debug)]
pub struct CommutativeExpr {
    /// Expressions
    pub exprs: Vec<Expr>,
    /// The operator that is commutative (order-insensitive), like OR, AND, StringConcat, BitWise Op
    pub op: Operator,
}

CommutativeExpr { vec![a, b, c], op: OR} is equivalent to Binary(Binary(a,b,OR), c, OR)

@alamb
Copy link
Contributor Author

alamb commented Jun 26, 2024

@alamb I'm thinking of introducing CommutativeExpr that specialize these kind of query to make them possible to transform iteratively. What do you think about this?

I think that sounds like a good idea to me

One thing that would be nice would be to somehow avoid having two potential representions for the same expression

So like A OR B would that be represented as BinaryExpr { left: A, op: Or, rigith: B} or CommutativeExpr { exprs: [A, B], op: Or} 🤔

But that is starting to sound like a large change 🤔

@jayzhan211
Copy link
Contributor

@alamb I'm thinking of introducing CommutativeExpr that specialize these kind of query to make them possible to transform iteratively. What do you think about this?

I think that sounds like a good idea to me

One thing that would be nice would be to somehow avoid having two potential representions for the same expression

So like A OR B would that be represented as BinaryExpr { left: A, op: Or, rigith: B} or CommutativeExpr { exprs: [A, B], op: Or} 🤔

But that is starting to sound like a large change 🤔

we could restrict the minimum elements of CommutativeExpr to 3.

@alamb
Copy link
Contributor Author

alamb commented Jun 26, 2024

we could restrict the minimum elements of CommutativeExpr to 3.

That could make sense

My biggest concern with this proposal is its potential impact on backwards compatibility / causing API churn to solve a very narrow usecase

I wonder if you have considered the approach to turning a stack overflow into an error?

So like maybe add a configuration flag like "max_expression_depth = 10" or something and then if that depth is exceeded in SqlToRel raise an error?

That would protect against crashes/ stack overflows but still allow people who wanted more complex expressions (and were willing to raise their stack sizes) to run

@jayzhan211
Copy link
Contributor

jayzhan211 commented Jun 26, 2024

we could restrict the minimum elements of CommutativeExpr to 3.

That could make sense

My biggest concern with this proposal is its potential impact on backwards compatibility / causing API churn to solve a very narrow usecase

I wonder if you have considered the approach to turning a stack overflow into an error?

So like maybe add a configuration flag like "max_expression_depth = 10" or something and then if that depth is exceeded in SqlToRel raise an error?

That would protect against crashes/ stack overflows but still allow people who wanted more complex expressions (and were willing to raise their stack sizes) to run

I'm able to run the large or list without stack overflow. As far as I can tell. I think we don't need to worry about the backwards compatibility, this does not need to break any existing API.

causing API churn for narrow usecase

I think we can deal with simple case for the query that has the same operator like large OR list, large AND list, but we can't deal with mixing case like A OR B AND C OR D AND E.... For this case, we still need to raise runtime error.

If complete large OR / AND list query for the same operator is worth to add a new Expr, I think we could go for it.
Otherwise we should count the transformation and raise the error.

@alamb
Copy link
Contributor Author

alamb commented Jun 27, 2024

Otherwise we should count the transformation and raise the error.

I recommend we start with the error (to avoid a stack overflow) and then if someone comes with a usecase when they need a super deep OR tree we can figure out if it is worth adding additional code

I think one common case of large OR chains is generated SQL like col = 1 OR col = 2 OR col = 3 OR ... and it is typically better to change the application to generate SQL like col IN (1, 2, 3, ...) instead which is both less SQL (and less nested) and faster

@jayzhan211
Copy link
Contributor

jayzhan211 commented Jun 28, 2024

It seems to support configurable max recursive depth involved huge breaking change for tree traversal transfrom_* function given it is used widely already. I would like to avoid this huge change just for early return error. 😢

Given stack size is not always the same, I'm not sure constant max recursive depth is a good idea too.

@alamb
Copy link
Contributor Author

alamb commented Jun 28, 2024

Here is how the recursion guard is implemented in sqlparser: https://github.com/sqlparser-rs/sqlparser-rs/blob/f9ab8dcc27fd2d55030b9c5fa71e41d5c08dd601/src/parser/mod.rs#L67-L127

And then at the start of each major statement, this gets called:

https://github.com/sqlparser-rs/sqlparser-rs/blob/f9ab8dcc27fd2d55030b9c5fa71e41d5c08dd601/src/parser/mod.rs#L469-L470

So I don't think we would have to change the transform_* functions, but simply update the closure that was being called by those functions

@peter-toth
Copy link
Contributor

peter-toth commented Jun 28, 2024

I came across stacker (https://docs.rs/stacker/latest/stacker/index.html) and a convenience crate (https://docs.rs/recursive/latest/recursive/index.html) above stacker to dynamically increase stack sizes during deep recursions. According to their documentation it isn't zero cost to use these, but could be worth to try and see them in action in some benchmarks.

@alamb
Copy link
Contributor Author

alamb commented Jun 28, 2024

I evaluated stacker as part of sqlparser and I thought it was doing some too crazy stuff that made it hard to use in embedded / wasm environments. Maybe that is better now

@gruuya
Copy link
Contributor

gruuya commented Nov 12, 2024

So I got curious and wanted to test #13310 on this example too. It doesn't seem to resolve it, though now instead of a stack overflow I see a bus error on Mac M2—anyone seeing that too?

@peter-toth
Copy link
Contributor

peter-toth commented Nov 12, 2024

So I got curious and wanted to test #13310 on this example too. It doesn't seem to resolve it, though now instead of a stack overflow I see a bus error on Mac M2—anyone seeing that too?

Bus error is similar to stack overflow in this case. The stacktrace looks like the following with the blowout2 example on latest main (Mac M3 / debug build):

Process 31194 stopped
* thread #1, name = 'main', queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=2, address=0x16f607940)
    frame #0: 0x0000000100ebc86c datafusion-cli`_$LT$sqlparser..ast..Expr$u20$as$u20$sqlparser..ast..visitor..Visit$GT$::visit::hece9cb577e66b52f(self=<unavailable>, visitor=<unavailable>) at mod.rs:532
   529 	#[cfg_attr(feature = "serde", derive(Serialize, Deserialize))]
   530 	#[cfg_attr(
   531 	    feature = "visitor",
-> 532 	    derive(Visit, VisitMut),
   533 	    visit(with = "visit_expr")
   534 	)]
   535 	pub enum Expr {
Target 0: (datafusion-cli) stopped.
(lldb) thread backtrace
* thread #1, name = 'main', queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=2, address=0x16f607940)
  * frame #0: 0x0000000100ebc86c datafusion-cli`_$LT$sqlparser..ast..Expr$u20$as$u20$sqlparser..ast..visitor..Visit$GT$::visit::hece9cb577e66b52f(self=<unavailable>, visitor=<unavailable>) at mod.rs:532
    frame #1: 0x0000000100dc2bf4 datafusion-cli`_$LT$alloc..boxed..Box$LT$T$GT$$u20$as$u20$sqlparser..ast..visitor..Visit$GT$::visit::hf156d31d19b0b1a1(self=0x000005506c4bca28, visitor=0x000000016fd8e3a0) at visitor.rs:68:9
    frame #2: 0x0000000100ebcd28 datafusion-cli`_$LT$sqlparser..ast..Expr$u20$as$u20$sqlparser..ast..visitor..Visit$GT$::visit::hece9cb577e66b52f(self=0x000005506c4bca00, visitor=0x000000016fd8e3a0) at mod.rs:603:9
    frame #3: 0x0000000100dc2bf4 datafusion-cli`_$LT$alloc..boxed..Box$LT$T$GT$$u20$as$u20$sqlparser..ast..visitor..Visit$GT$::visit::hf156d31d19b0b1a1(self=0x000005506c4bcf28, visitor=0x000000016fd8e3a0) at visitor.rs:68:9
    frame #4: 0x0000000100ebcd28 datafusion-cli`_$LT$sqlparser..ast..Expr$u20$as$u20$sqlparser..ast..visitor..Visit$GT$::visit::hece9cb577e66b52f(self=0x000005506c4bcf00, visitor=0x000000016fd8e3a0) at mod.rs:603:9
    frame #5: 0x0000000100dc2bf4 datafusion-cli`_$LT$alloc..boxed..Box$LT$T$GT$$u20$as$u20$sqlparser..ast..visitor..Visit$GT$::visit::hf156d31d19b0b1a1(self=0x000005506c4bd428, visitor=0x000000016fd8e3a0) at visitor.rs:68:9

... repeating frames ...

    frame #7575: 0x0000000100dc2bf4 datafusion-cli`_$LT$alloc..boxed..Box$LT$T$GT$$u20$as$u20$sqlparser..ast..visitor..Visit$GT$::visit::hf156d31d19b0b1a1(self=0x000005506cc46a28, visitor=0x000000016fd8e3a0) at visitor.rs:68:9
    frame #7576: 0x0000000100ebcd28 datafusion-cli`_$LT$sqlparser..ast..Expr$u20$as$u20$sqlparser..ast..visitor..Visit$GT$::visit::hece9cb577e66b52f(self=0x000005506cc46a00, visitor=0x000000016fd8e3a0) at mod.rs:603:9
    frame #7577: 0x0000000100dc2bf4 datafusion-cli`_$LT$alloc..boxed..Box$LT$T$GT$$u20$as$u20$sqlparser..ast..visitor..Visit$GT$::visit::hf156d31d19b0b1a1(self=0x000005506cc46f28, visitor=0x000000016fd8e3a0) at visitor.rs:68:9
    frame #7578: 0x0000000100ebcd28 datafusion-cli`_$LT$sqlparser..ast..Expr$u20$as$u20$sqlparser..ast..visitor..Visit$GT$::visit::hece9cb577e66b52f(self=0x000005506cc46f00, visitor=0x000000016fd8e3a0) at mod.rs:603:9
    frame #7579: 0x0000000100dc2bf4 datafusion-cli`_$LT$alloc..boxed..Box$LT$T$GT$$u20$as$u20$sqlparser..ast..visitor..Visit$GT$::visit::hf156d31d19b0b1a1(self=0x000005506c260550, visitor=0x000000016fd8e3a0) at visitor.rs:68:9
    frame #7580: 0x0000000100ebcd28 datafusion-cli`_$LT$sqlparser..ast..Expr$u20$as$u20$sqlparser..ast..visitor..Visit$GT$::visit::hece9cb577e66b52f(self=0x000005506c260528, visitor=0x000000016fd8e3a0) at mod.rs:603:9
    frame #7581: 0x0000000100c6d070 datafusion-cli`_$LT$core..option..Option$LT$T$GT$$u20$as$u20$sqlparser..ast..visitor..Visit$GT$::visit::h0e35f5cc1d3482a7(self=0x000005506c260528, visitor=0x000000016fd8e3a0) at visitor.rs:51:13
    frame #7582: 0x0000000100c4c96c datafusion-cli`_$LT$sqlparser..ast..query..Select$u20$as$u20$sqlparser..ast..visitor..Visit$GT$::visit::h7328cfc81f60a4d1(self=0x000005506c260400, visitor=0x000000016fd8e3a0) at query.rs:290:5
    frame #7583: 0x0000000100dc2ad4 datafusion-cli`_$LT$alloc..boxed..Box$LT$T$GT$$u20$as$u20$sqlparser..ast..visitor..Visit$GT$::visit::h9d95e625eaf3c604(self=0x000005506cc50208, visitor=0x000000016fd8e3a0) at visitor.rs:68:9
    frame #7584: 0x0000000100c4ce10 datafusion-cli`_$LT$sqlparser..ast..query..SetExpr$u20$as$u20$sqlparser..ast..visitor..Visit$GT$::visit::h188e4027db64c4f9(self=0x000005506cc50200, visitor=0x000000016fd8e3a0) at query.rs:136:12
    frame #7585: 0x0000000100dc2a44 datafusion-cli`_$LT$alloc..boxed..Box$LT$T$GT$$u20$as$u20$sqlparser..ast..visitor..Visit$GT$::visit::h1ad78804abfe3e37(self=0x000005506c0d1280, visitor=0x000000016fd8e3a0) at visitor.rs:68:9
    frame #7586: 0x0000000100c4c3f0 datafusion-cli`_$LT$sqlparser..ast..query..Query$u20$as$u20$sqlparser..ast..visitor..Visit$GT$::visit::h57b8fe7313e25a1f(self=0x000005506c0d0e00, visitor=0x000000016fd8e3a0) at query.rs:33:5
    frame #7587: 0x0000000100dc2b94 datafusion-cli`_$LT$alloc..boxed..Box$LT$T$GT$$u20$as$u20$sqlparser..ast..visitor..Visit$GT$::visit::he5ad221bedc68f4e(self=0x000005506cc51008, visitor=0x000000016fd8e3a0) at visitor.rs:68:9
    frame #7588: 0x0000000100ebf70c datafusion-cli`_$LT$sqlparser..ast..Statement$u20$as$u20$sqlparser..ast..visitor..Visit$GT$::visit::hf760c8f9e44438fb(self=0x000005506cc51000, visitor=0x000000016fd8e3a0) at mod.rs:2189:11
    frame #7589: 0x0000000100c4be50 datafusion-cli`datafusion::catalog_common::resolve_table_references::visit_statement::he584f15ccc7bb0e6(statement=0x000005506c1d54d0, visitor=0x000000016fd8e3a0) at mod.rs:181:25
    frame #7590: 0x0000000100c4b50c datafusion-cli`datafusion::catalog_common::resolve_table_references::h514b49e9b68ba0eb(statement=0x000005506c1d54d0, enable_ident_normalization=true) at mod.rs:198:5
    frame #7591: 0x0000000100afe9e8 datafusion-cli`datafusion::execution::session_state::SessionState::resolve_table_references::h86bbf64d45254dee(self=0x000005506c1d4890, statement=0x000005506c1d54d0) at session_state.rs:523:31
    frame #7592: 0x000000010002b6f8 datafusion-cli`datafusion::execution::session_state::SessionState::statement_to_plan::_$u7b$$u7b$closure$u7d$$u7d$::hfaecfb2dc677af56((null)=0x000000016fdec970) at session_state.rs:535:26
    frame #7593: 0x000000010008dc3c datafusion-cli`datafusion_cli::exec::create_plan::_$u7b$$u7b$closure$u7d$$u7d$::h67927c394fa1e801((null)=0x000000016fdec970) at exec.rs:309:69
    frame #7594: 0x000000010008eedc datafusion-cli`datafusion_cli::exec::exec_and_print::_$u7b$$u7b$closure$u7d$$u7d$::h5dfb40af5fb21031((null)=0x000000016fdec970) at exec.rs:231:48
    frame #7595: 0x000000010009250c datafusion-cli`datafusion_cli::exec::exec_from_lines::_$u7b$$u7b$closure$u7d$$u7d$::h8deef9186caf1cbf((null)=0x000000016fdec970) at exec.rs:83:69
    frame #7596: 0x0000000100091f8c datafusion-cli`datafusion_cli::exec::exec_from_files::_$u7b$$u7b$closure$u7d$$u7d$::h3d4217517257fe10((null)=0x000000016fdec970) at exec.rs:119:58
    frame #7597: 0x00000001000077d8 datafusion-cli`datafusion_cli::main_inner::_$u7b$$u7b$closure$u7d$$u7d$::hab6d74ec69080026((null)=0x000000016fdec970) at main.rs:224:60
    frame #7598: 0x000000010000879c datafusion-cli`datafusion_cli::main::_$u7b$$u7b$closure$u7d$$u7d$::h697d34ccd1e284f7((null)=0x000000016fdec970) at main.rs:131:34
    frame #7599: 0x0000000100015d54 datafusion-cli`_$LT$core..pin..Pin$LT$P$GT$$u20$as$u20$core..future..future..Future$GT$::poll::h9ab635cc6d96255f(self=Pin<&mut core::pin::Pin<alloc::boxed::Box<datafusion_cli::main::{async_block_env#0}, alloc::alloc::Global>>> @ 0x000000016fdec860, cx=0x000000016fdec970) at future.rs:123:9
    frame #7600: 0x0000000100096bdc datafusion-cli`tokio::runtime::park::CachedParkThread::block_on::_$u7b$$u7b$closure$u7d$$u7d$::hcd3f5dd7af5ce2ca at park.rs:281:63
    frame #7601: 0x00000001000962a8 datafusion-cli`tokio::runtime::park::CachedParkThread::block_on::h4d185a378e617ce3 at coop.rs:107:5
    frame #7602: 0x0000000100096250 datafusion-cli`tokio::runtime::park::CachedParkThread::block_on::h4d185a378e617ce3 [inlined] tokio::runtime::coop::budget::h4645f10741670d69(f={closure_env#0}<core::pin::Pin<alloc::boxed::Box<datafusion_cli::main::{async_block_env#0}, alloc::alloc::Global>>> @ 0x000000016fdec9e8) at coop.rs:73:5
    frame #7603: 0x00000001000961d0 datafusion-cli`tokio::runtime::park::CachedParkThread::block_on::h4d185a378e617ce3(self=0x000000016fdeca76, f=(__pointer = 0x000005506c1d3000)) at park.rs:281:31
    frame #7604: 0x000000010006e0e4 datafusion-cli`tokio::runtime::context::blocking::BlockingRegionGuard::block_on::hf507318e5fc00a2b(self=0x000000016fdecb40, f=(__pointer = 0x000005506c1d3000)) at blocking.rs:66:9
    frame #7605: 0x00000001000a2004 datafusion-cli`tokio::runtime::scheduler::multi_thread::MultiThread::block_on::_$u7b$$u7b$closure$u7d$$u7d$::hb0419ba227ee0cc1(blocking=0x000000016fdecb40) at mod.rs:87:13
    frame #7606: 0x00000001000b0384 datafusion-cli`tokio::runtime::context::runtime::enter_runtime::h5827c5aa2d755146(handle=0x000000016fdfa458, allow_block_in_place=true, f={closure_env#0}<core::pin::Pin<alloc::boxed::Box<datafusion_cli::main::{async_block_env#0}, alloc::alloc::Global>>> @ 0x000000016fdecaf8) at runtime.rs:65:16
    frame #7607: 0x00000001000a1ea4 datafusion-cli`tokio::runtime::scheduler::multi_thread::MultiThread::block_on::h90c18f1478cd44ff(self=0x000000016fdfa430, handle=0x000000016fdfa458, future=(__pointer = 0x000005506c1d3000)) at mod.rs:86:9
    frame #7608: 0x000000010008a594 datafusion-cli`tokio::runtime::runtime::Runtime::block_on_inner::ha6441b1a0292cd22(self=0x000000016fdfa428, future=(__pointer = 0x000005506c1d3000), (null)=(_pd = core::marker::PhantomData<void *> @ 0x000000016fdecc2f)) at runtime.rs:370:45
    frame #7609: 0x000000010008a9f4 datafusion-cli`tokio::runtime::runtime::Runtime::block_on::h53200a97a01420ac(self=0x000000016fdfa428, future={async_block_env#0} @ 0x000000016fdfa5b0) at runtime.rs:340:13
    frame #7610: 0x000000010001b178 datafusion-cli`datafusion_cli::main::hf402cd722afe9c8d at main.rs:131:5
    frame #7611: 0x0000000100061ca8 datafusion-cli`core::ops::function::FnOnce::call_once::hfd3d15becb98955b((null)=(datafusion-cli`datafusion_cli::main::hf402cd722afe9c8d at main.rs:130), (null)=<unavailable>) at function.rs:250:5
    frame #7612: 0x000000010001ca0c datafusion-cli`std::sys::backtrace::__rust_begin_short_backtrace::hf7f21602a5b409b2(f=(datafusion-cli`datafusion_cli::main::hf402cd722afe9c8d at main.rs:130)) at backtrace.rs:154:18
    frame #7613: 0x00000001000b2c0c datafusion-cli`std::rt::lang_start::_$u7b$$u7b$closure$u7d$$u7d$::h8d9a4a5fc7b2d781 at rt.rs:164:18
    frame #7614: 0x0000000104c99590 datafusion-cli`std::rt::lang_start_internal::h9e88109c8deb8787 [inlined] core::ops::function::impls::_$LT$impl$u20$core..ops..function..FnOnce$LT$A$GT$$u20$for$u20$$RF$F$GT$::call_once::hf77a1752ba39c45f at function.rs:284:13 [opt]
    frame #7615: 0x0000000104c99588 datafusion-cli`std::rt::lang_start_internal::h9e88109c8deb8787 [inlined] std::panicking::try::do_call::hf02556a6b145ecfc at panicking.rs:554:40 [opt]
    frame #7616: 0x0000000104c99584 datafusion-cli`std::rt::lang_start_internal::h9e88109c8deb8787 [inlined] std::panicking::try::h2bb23dba91be7e3b at panicking.rs:518:19 [opt]
    frame #7617: 0x0000000104c99584 datafusion-cli`std::rt::lang_start_internal::h9e88109c8deb8787 [inlined] std::panic::catch_unwind::h1844bc6507215052 at panic.rs:345:14 [opt]
    frame #7618: 0x0000000104c99584 datafusion-cli`std::rt::lang_start_internal::h9e88109c8deb8787 [inlined] std::rt::lang_start_internal::_$u7b$$u7b$closure$u7d$$u7d$::ha90e2c319598814e at rt.rs:143:48 [opt]
    frame #7619: 0x0000000104c99584 datafusion-cli`std::rt::lang_start_internal::h9e88109c8deb8787 [inlined] std::panicking::try::do_call::h7de69f625a47132a at panicking.rs:554:40 [opt]
    frame #7620: 0x0000000104c99584 datafusion-cli`std::rt::lang_start_internal::h9e88109c8deb8787 [inlined] std::panicking::try::h2198f44c68c232f7 at panicking.rs:518:19 [opt]
    frame #7621: 0x0000000104c99584 datafusion-cli`std::rt::lang_start_internal::h9e88109c8deb8787 [inlined] std::panic::catch_unwind::h40a34eeb64f44ac6 at panic.rs:345:14 [opt]
    frame #7622: 0x0000000104c99584 datafusion-cli`std::rt::lang_start_internal::h9e88109c8deb8787 at rt.rs:143:20 [opt]
    frame #7623: 0x00000001000b2bd8 datafusion-cli`std::rt::lang_start::hafc0eb4d69b5d0cb(main=(datafusion-cli`datafusion_cli::main::hf402cd722afe9c8d at main.rs:130), argc=3, argv=0x000000016fdff170, sigpipe='\0') at rt.rs:163:17
    frame #7624: 0x000000010001b224 datafusion-cli`main + 36
    frame #7625: 0x00000001996eb154 dyld`start + 2476

So it seems this stack overflow comes from sqlparser: https://github.com/apache/datafusion-sqlparser-rs/blob/v0.51.0/src/ast/visitor.rs#L68
Shall we add stacker/recursive there as well?

@peter-toth
Copy link
Contributor

Actually, the release build has a different issue. Let me check...

@peter-toth
Copy link
Contributor

This is the stacktrace with a release build. Looks like we forgot to annotate get_type with recursive.
The last annotated method is TreeNode::rewrite in frame #179 so the last check for >128KB stack remained was there. But then each get_type eats up 704 bytes...

(lldb) thread backtrace
* thread #1, name = 'main', queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=2, address=0x1064ffe40)
  * frame #0: sp=0x00000001064ffea0 fp=0x0000000106500150 pc=0x0000000101d1d4cc datafusion-cli`_$LT$datafusion_expr..expr..Expr$u20$as$u20$datafusion_expr..expr_schema..ExprSchemable$GT$::get_type::hd93d0d7d195d9517
    frame #1: sp=0x00000001064ffea0 fp=0x0000000106500150 pc=0x0000000101d1d810 datafusion-cli`_$LT$datafusion_expr..expr..Expr$u20$as$u20$datafusion_expr..expr_schema..ExprSchemable$GT$::get_type::hd93d0d7d195d9517 + 836
    frame #2: sp=0x0000000106500160 fp=0x0000000106500410 pc=0x0000000101d1d810 datafusion-cli`_$LT$datafusion_expr..expr..Expr$u20$as$u20$datafusion_expr..expr_schema..ExprSchemable$GT$::get_type::hd93d0d7d195d9517 + 836

...

    frame #175: sp=0x000000010651dd20 fp=0x000000010651dfd0 pc=0x0000000101d1d810 datafusion-cli`_$LT$datafusion_expr..expr..Expr$u20$as$u20$datafusion_expr..expr_schema..ExprSchemable$GT$::get_type::hd93d0d7d195d9517 + 836
    frame #176: sp=0x000000010651dfe0 fp=0x000000010651e290 pc=0x0000000101d1d810 datafusion-cli`_$LT$datafusion_expr..expr..Expr$u20$as$u20$datafusion_expr..expr_schema..ExprSchemable$GT$::get_type::hd93d0d7d195d9517 + 836
    frame #177: sp=0x000000010651e2a0 fp=0x000000010651e790 pc=0x00000001013fe568 datafusion-cli`datafusion_optimizer::analyzer::type_coercion::TypeCoercionRewriter::coerce_binary_op::ha0b41d96e2a388be + 80
    frame #178: sp=0x000000010651e7a0 fp=0x0000000106520780 pc=0x00000001013ff8a0 datafusion-cli`_$LT$datafusion_optimizer..analyzer..type_coercion..TypeCoercionRewriter$u20$as$u20$datafusion_common..tree_node..TreeNodeRewriter$GT$::f_up::hf15debdf22e96478 + 3704
    frame #179: sp=0x0000000106520790 fp=0x0000000106520ce0 pc=0x00000001010ed6a0 datafusion-cli`datafusion_common::tree_node::TreeNode::rewrite::h5b31f9ce53898b08 + 864
    frame #180: sp=0x0000000106520cf0 fp=0x0000000106520fd0 pc=0x00000001013ebc60 datafusion-cli`datafusion_expr::tree_node::transform_box::h94edc72df1434b7e + 68
    frame #181: sp=0x0000000106520fe0 fp=0x0000000106521920 pc=0x00000001010c8390 datafusion-cli`datafusion_expr::tree_node::_$LT$impl$u20$datafusion_common..tree_node..TreeNode$u20$for$u20$datafusion_expr..expr..Expr$GT$::map_children::hc6cf396dc1ddffb4 + 732
    frame #182: sp=0x0000000106521930 fp=0x0000000106521e80 pc=0x00000001010ed498 datafusion-cli`datafusion_common::tree_node::TreeNode::rewrite::h5b31f9ce53898b08 + 344
    frame #183: sp=0x0000000106521e90 fp=0x0000000106522170 pc=0x00000001013ebc60 datafusion-cli`datafusion_expr::tree_node::transform_box::h94edc72df1434b7e + 68
    frame #184: sp=0x0000000106522180 fp=0x0000000106522ac0 pc=0x00000001010c8390 datafusion-cli`datafusion_expr::tree_node::_$LT$impl$u20$datafusion_common..tree_node..TreeNode$u20$for$u20$datafusion_expr..expr..Expr$GT$::map_children::hc6cf396dc1ddffb4 + 732
    frame #185: sp=0x0000000106522ad0 fp=0x0000000106523020 pc=0x00000001010ed498 datafusion-cli`datafusion_common::tree_node::TreeNode::rewrite::h5b31f9ce53898b08 + 344
    frame #186: sp=0x0000000106523030 fp=0x0000000106523310 pc=0x00000001013ebc60 datafusion-cli`datafusion_expr::tree_node::transform_box::h94edc72df1434b7e + 68

@peter-toth
Copy link
Contributor

Yeah, works if I add #[recursive] to get_type(). Let me open a follow-up PR.

@findepi
Copy link
Member

findepi commented Nov 12, 2024

#[recursive] is a good thing, but we could do better.
in the IR, we could make OR a multi-child operation (not binary). This would naturally solve tree depth problem (just like #[recursive] but with lower memory needs), and would also improve other processing.

(alternatively, and more simply, we could convert left- or right-deep OR tree to a balanced one, since it doesn't change semantics)

@peter-toth
Copy link
Contributor

peter-toth commented Nov 12, 2024

#[recursive] is a good thing, but we could do better. in the IR, we could make OR a multi-child operation (not binary). This would naturally solve tree depth problem (just like #[recursive] but with lower memory needs), and would also improve other processing.

(alternatively, and more simply, we could convert left- or right-deep OR tree to a balanced one, since it doesn't change semantics)

Those are nice optimizations, and indeed we should do them, but #[recursive] is more general fix and works with non-associative operations as well.

@findepi
Copy link
Member

findepi commented Nov 12, 2024

i am not saying we shouldn't do this
(OTOH i believe the stack overflows on generates queries, and usually because long associative operations. it's harder to generate otherwise)
let's do #[recursive] since it's simple to do and brings immediate benefit, but it's likely same benefits (and other perf benefits) would be available by restructuring IR. likely worth exploring as a follow-up to #[recursive]

@alamb
Copy link
Contributor Author

alamb commented Nov 13, 2024

Thank you all for working on this

@findepi
Copy link
Member

findepi commented Nov 14, 2024

BTW the fact that we needed to make get_type recursive that we can spend quite some time recalculating types. If an optimizer code calls expr.get_type, this can easily get quadratic on left-deep or right-deep rexpressions like the one in this issue.
Would be great to have IR with types: #12604

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants