Fix nil pointer dereference in event handling node removal #104

vponomaryov · 2022-08-28T12:49:30Z

Currently the code is unsafe: it uses the 'host' object
before checking whether its 'get' operation succeeded.

When a Scylla node is forcibly terminated there is a race condition
that can, with some probability, cause the following panic:

      panic: runtime error: invalid memory address or nil pointer dereference
      [signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x632814]
    
      goroutine 1723 [running]:
      github.com/gocql/gocql.(*HostInfo).HostID(0xc0001500b8)
        %full-path%/[email protected]/host_source.go:271 +0x34
      github.com/gocql/gocql.(*Session).handleRemovedNode(
            0xc000150000, {0xc0134c10d8, 0xc0093d69b0, 0xa}, 0x45264d)
        %full-path%/[email protected]/events.go:243 +0x5f
      github.com/gocql/gocql.(*Session).handleNodeEvent(
            0xc000150000, {0xc01701a000, 0x1, 0x43e305})
        %full-path%/[email protected]/events.go:176 +0x28e
      created by github.com/gocql/gocql.(*eventDebouncer).flush
        %full-path%/[email protected]/events.go:67 +0xb5

So, fix it by checking the result first and only then using the 'host' object.
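
The pattern described above can be sketched roughly as follows. This is only an illustration under stated assumptions: `hostInfo`, `ring`, `getHost` and `removedHostID` are hypothetical stand-ins, not gocql's actual types or the actual patch.

```go
package main

import "fmt"

// hostInfo and ring are hypothetical stand-ins, not gocql's real types.
type hostInfo struct {
	hostID string
}

// HostID dereferences the receiver, so calling it on a nil *hostInfo panics.
func (h *hostInfo) HostID() string {
	return h.hostID
}

type ring struct {
	hosts map[string]*hostInfo
}

// getHost returns (nil, false) when the address is unknown, for example
// because a concurrent event already removed the host.
func (r *ring) getHost(addr string) (*hostInfo, bool) {
	h, ok := r.hosts[addr]
	return h, ok
}

// removedHostID checks the lookup result first and only then uses the host.
func removedHostID(r *ring, addr string) (string, bool) {
	host, ok := r.getHost(addr)
	if !ok || host == nil {
		return "", false // nothing to do: the host is already gone
	}
	return host.HostID(), true
}

func main() {
	r := &ring{hosts: map[string]*hostInfo{"10.0.0.1": {hostID: "a1"}}}

	id, ok := removedHostID(r, "10.0.0.1")
	fmt.Println(id, ok)

	id, ok = removedHostID(r, "10.0.0.2") // unknown host: skipped, no panic
	fmt.Println(id, ok)
}
```

Calling `HostID()` before looking at `ok` would dereference a nil pointer for the unknown address, which is the shape of the panic in the trace.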


vponomaryov · 2022-08-28T12:58:20Z

Example of the failure without this fix is here: https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/valerii/job/longevity-quorum-failure-reproducer-scylla-bench-small-partitions-3h/9/consoleFull :

09:34:39  ----- LAST ERROR EVENT -------------------------------------------------------
09:34:39  2022-08-26 05:09:00.560 <2022-08-26 05:08:36.000>: (ScyllaBenchLogEvent Severity.ERROR) period_type=one-time event_id=9980d2b7-9a58-4bd9-9a14-90e28597f5c4: type=ConsistencyError regex=received only line_number=5691 node=Node longevity-large-partitions-3h-dev-loader-node-e94f5d93-4 [34.254.187.129 | 10.4.2.43] (seed: False)
09:34:39  2022/08/26 05:08:36 Operation timed out for scylla_bench.test - received only 1 responses from 2 CL=QUORUM.

And successful runs with the fix:

martin-sucha · 2022-08-30T12:30:33Z

If we get context.Canceled and context.DeadlineExceeded, that means that the context is canceled and any work associated with the context should stop because the caller is not waiting for the result anymore.

We return ErrNotFound only in methods that expect to return values from the first row of the result. It does not seem we should retry in this case, since the database returned a successful response; it was just empty.

At least that is what I think the reasons for ignoring those errors are, so I wonder what are the cases when we would want to retry such errors.

Now, if the context.Canceled error comes from some internal gocql context, we might need to retry that. But it seems we should not retry all context.Canceled/context.DeadlineExceeded errors. Specifically, errors from any context passed in by the user should not be retried, as the user is not waiting for the result anymore, whereas an internal context error might need a retry.

Where does the canceled context come from in your case?
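
A minimal sketch of the rule described in this comment, assuming hypothetical names (`shouldRetry`, `errNotFound`, `errNodeUnavailable`, `canceledContextErr`) that are not part of gocql. It deliberately does not distinguish user contexts from internal ones, which is the open question raised above.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// Hypothetical sentinel errors used only for this illustration.
var (
	errNotFound        = errors.New("not found")
	errNodeUnavailable = errors.New("node unavailable")
)

// shouldRetry reports whether an error is worth another attempt.
func shouldRetry(err error) bool {
	switch {
	case err == nil:
		return false // nothing failed
	case errors.Is(err, context.Canceled), errors.Is(err, context.DeadlineExceeded):
		return false // the caller is no longer waiting for the result
	case errors.Is(err, errNotFound):
		return false // the response was successful, just empty
	default:
		return true
	}
}

// canceledContextErr returns the error produced by an already canceled context.
func canceledContextErr() error {
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	return ctx.Err()
}

func main() {
	wrapped := fmt.Errorf("query failed: %w", canceledContextErr())
	fmt.Println(shouldRetry(wrapped))            // canceled context, even when wrapped
	fmt.Println(shouldRetry(errNotFound))        // empty result
	fmt.Println(shouldRetry(errNodeUnavailable)) // any other failure
}
```

Using `errors.Is` rather than `==` keeps the check working when the context error has been wrapped by an intermediate layer.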

vponomaryov · 2022-08-30T12:51:56Z

> Where does the canceled context come from in your case?

I didn't check the "where" part.
But the root cause of the query failure is the sudden termination of a Scylla node.
Our loaders are always alive, so there are always waiting callers in place.

I believe that the termination of a Scylla member causes latency spikes on the other live nodes, which can cause "DeadlineExceeded", which in turn leads to the context cancellation. In such a case I expect the retry to work, and the fix proves it: in the many CI jobs that tried to reproduce the original problem, everything worked OK in 100% of cases.

Our expectation is simple: if we set a retry policy, we do not expect errors before the retries time out.

vponomaryov · 2022-08-31T08:27:08Z

This fix is not sufficient to solve the problem.

The condition for entering the retries block returns false because of another bug, mentioned in the bug report here: #101

The bug is that every query call is counted as an attempt, so after a short period of load the attempt count is always bigger than the number of retries.
So, the complete fix should also include a fix for the query attempts calculation: it must count only retries, not every query call, since counting the latter is useless.

P.S. My previous CI jobs passed because my first patch-set for this PR didn't return from the loop even when it did not reach the retries block. That missing return was the reason for the 2 failed unit tests and was, de facto, an endless retry.

zimnx · 2022-08-31T13:15:25Z

I agree with Martin, a cancelled context means no further work shall be done. Let's first understand where these context errors are coming from before you decide that they shall be retried.

vponomaryov · 2022-09-01T07:30:06Z

@martin-sucha , @zimnx
Thank you for the review. I finally investigated the real root cause.
The root cause was incorrect (multiple) usage of query objects.

So, from this PR I only need the first commit: the one with the fix for the removed host handling.
It is needed for anyone who uses retry logic.
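
One way reusing a query object can defeat a retry policy, sketched with hypothetical types (`query`, `simpleRetryPolicy`, `execute`, `failOnce`); this assumes an attempt counter that lives on the query object and a policy that compares it with a retry limit, and is not taken from gocql's code.

```go
package main

import (
	"errors"
	"fmt"
)

// query is a hypothetical stand-in whose attempt counter lives on the object.
type query struct {
	attempts int
}

// simpleRetryPolicy is a hypothetical policy: retry while attempts <= numRetries.
type simpleRetryPolicy struct {
	numRetries int
}

func (p simpleRetryPolicy) allowRetry(q *query) bool {
	return q.attempts <= p.numRetries
}

// execute calls fn until it succeeds or the policy refuses another attempt.
// It returns how many times fn was called during this execution.
func execute(q *query, p simpleRetryPolicy, fn func() error) (calls int, err error) {
	for {
		q.attempts++ // counted on the query object, across executions
		calls++
		if err = fn(); err == nil {
			return calls, nil
		}
		if !p.allowRetry(q) {
			return calls, err
		}
	}
}

// failOnce returns an operation that fails on its first call only.
func failOnce() func() error {
	failed := false
	return func() error {
		if !failed {
			failed = true
			return errors.New("transient failure")
		}
		return nil
	}
}

func main() {
	p := simpleRetryPolicy{numRetries: 3}

	// A shared query object accumulates attempts from earlier successful calls.
	shared := &query{}
	for i := 0; i < 5; i++ {
		execute(shared, p, func() error { return nil })
	}
	calls, err := execute(shared, p, failOnce())
	fmt.Println("shared query:", calls, err) // gives up after the first failure

	// A fresh query object starts from zero, so the retry is allowed.
	calls, err = execute(&query{}, p, failOnce())
	fmt.Println("fresh query:", calls, err)
}
```

With the shared object the counter is already above the limit when the first failure arrives, so the policy refuses to retry even though no retry has happened for that execution.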

apply patch for gocql issue scylladb/gocql#104 Ref: scylladb/gocql#104

vponomaryov · 2022-09-02T11:09:05Z

@mmatczuk @martin-sucha @zimnx @piodul
Please review this one-liner.

Also, I created a PR for the upstream project here: apache#1652

roydahan · 2022-09-04T14:22:48Z

@zimnx / @piodul can we merge this one?
It was already merged in upstream.

fruch · 2022-09-04T14:27:27Z

@zimnx it was merged upstream in apache#1652, can you take a look?

zimnx

LGTM, thanks!

vponomaryov · 2022-09-05T09:10:08Z

> LGTM, thanks!

Thanks for merging it.
Is it possible to make a tag with it? So we could start using it ASAP.

zimnx · 2022-09-05T09:33:24Z

> > LGTM, thanks!
>
> Thanks for merging it. Is it possible to make a tag with it? So we could start using it ASAP.

done

vponomaryov requested review from zimnx and mmatczuk August 28, 2022 12:53

vponomaryov mentioned this pull request Aug 28, 2022

"received only 1 responses from 2 CL=QUORUM" query error having RF=3 and only 1 down node #101

Closed

vponomaryov force-pushed the fix-query-retries branch from 7000727 to e3917ed Compare August 30, 2022 13:42

vponomaryov force-pushed the fix-query-retries branch from e3917ed to db56502 Compare September 1, 2022 07:25

vponomaryov changed the title ~~fix: Make query retryPolicy be respected always~~ Fix nil pointer dereference in events.go handling node removal Sep 1, 2022

vponomaryov mentioned this pull request Sep 1, 2022

Add possibility to configure retry policy scylladb/scylla-bench#96

Merged

fruch mentioned this pull request Sep 1, 2022

use cluster.Keyspace before connecting to the nodes scylladb/scylla-bench#97

Merged

fruch added a commit to fruch/scylla-bench that referenced this pull request Sep 1, 2022

gocql: Fix nil pointer dereference in events.go handling node removal

be0deab

apply patch for gocql issue scylladb/gocql#104 Ref: scylladb/gocql#104

vponomaryov requested review from piodul and removed request for mmatczuk September 2, 2022 11:07

zimnx approved these changes Sep 5, 2022


zimnx changed the title ~~Fix nil pointer dereference in events.go handling node removal~~ Fix nil pointer dereference in event handling node removal Sep 5, 2022

zimnx merged commit e5a83d2 into scylladb:master Sep 5, 2022
