Fix nil pointer dereference in event handling node removal #104
At the moment the code is unsafe: it uses the 'host' object before checking whether its 'get' operation succeeded. When a Scylla node is forcibly terminated, this creates a race condition that can, with some probability, cause the following panic:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x632814]

goroutine 1723 [running]:
github.com/gocql/gocql.(*HostInfo).HostID(0xc0001500b8)
  %full-path%/[email protected]/host_source.go:271 +0x34
github.com/gocql/gocql.(*Session).handleRemovedNode(0xc000150000, {0xc0134c10d8, 0xc0093d69b0, 0xa}, 0x45264d)
  %full-path%/[email protected]/events.go:243 +0x5f
github.com/gocql/gocql.(*Session).handleNodeEvent(0xc000150000, {0xc01701a000, 0x1, 0x43e305})
  %full-path%/[email protected]/events.go:176 +0x28e
created by github.com/gocql/gocql.(*eventDebouncer).flush
  %full-path%/[email protected]/events.go:67 +0xb5

Fix it by checking the result first and only then using the 'host' object.
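To illustrate the pattern described above, here is a minimal sketch of checking a lookup result before touching the returned pointer. It is not the gocql code: `hostInfo`, `ring`, `getHost`, and `removeHost` are hypothetical stand-ins invented for this example, and only the check-before-use ordering is taken from the description.

```go
package main

import "fmt"

// hostInfo is a hypothetical stand-in for a driver's host record.
type hostInfo struct {
	hostID string
}

// HostID dereferences the receiver, so calling it on a nil *hostInfo panics
// with a nil pointer dereference.
func (h *hostInfo) HostID() string {
	return h.hostID
}

// ring is a hypothetical host registry keyed by IP address.
type ring struct {
	hosts map[string]*hostInfo
}

// getHost returns the host for ip and reports whether it was found.
// When the host is missing, the returned pointer is nil.
func (r *ring) getHost(ip string) (*hostInfo, bool) {
	h, ok := r.hosts[ip]
	return h, ok
}

// removeHost checks the lookup result first and only then uses the host.
// It returns the removed host's ID and true, or "" and false if the host
// was already gone (for example, removed by a concurrent event).
func (r *ring) removeHost(ip string) (string, bool) {
	host, ok := r.getHost(ip)
	if !ok || host == nil {
		return "", false
	}
	id := host.HostID() // safe: host is known to be non-nil here
	delete(r.hosts, ip)
	return id, true
}

func main() {
	r := &ring{hosts: map[string]*hostInfo{
		"10.0.0.1": &hostInfo{hostID: "a1"},
	}}

	id, ok := r.removeHost("10.0.0.1")
	fmt.Println(id, ok)

	// A second removal of the same address finds nothing and must not panic.
	id, ok = r.removeHost("10.0.0.1")
	fmt.Println(id, ok)
}
```

The second call models the race: the node is already gone by the time the removal event is handled, so the handler has to return early instead of calling a method on a nil pointer.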
An example of the failure without this fix: https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/valerii/job/longevity-quorum-failure-reproducer-scylla-bench-small-partitions-3h/9/consoleFull
And successful runs with the fix:
If we get We return At least that is what I think the reasons for ignoring those errors are, so I wonder in which cases we would want to retry such errors. Now, if the Where does the canceled context come from in your case?
I didn't check the "where" part. I believe that terminating a Scylla member causes latency spikes on the remaining live nodes, which can cause "DeadlineExceeded" and lead to the context cancellation. In such a case I expect the retry to work and the fix Our expectation is simple: if we set a retry policy, we do not expect errors before the retry timeout expires.
Force-pushed from 7000727 to e3917ed
This fix is not sufficient to solve the problem. The condition to enter the retries block return The bug is that every query call is counted as an attempt, so after a short period of load the attempt count is always greater than the number of retries. P.S. My previous CI jobs passed because the first patch set in this PR didn't return from the loop even when it did not reach the retries block. And such
I agree with Martin: a cancelled context means no further work should be done. Let's first understand where these context errors are coming from before deciding whether they should be retried.
Force-pushed from e3917ed to db56502
@martin-sucha, @zimnx So, from this PR I only need the first commit, the one with the fix for the removed host handling.
apply patch for gocql issue scylladb/gocql#104 Ref: scylladb/gocql#104
@mmatczuk @martin-sucha @zimnx @piodul I also created a PR for the upstream project: apache#1652
@zimnx it was merged upstream in apache#1652; can you take a look?
LGTM, thanks!
Thanks for merging it.
done