Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix nil pointer dereference in event handling node removal #104

Merged
merged 1 commit into from
Sep 5, 2022

Conversation

vponomaryov
Copy link

@vponomaryov vponomaryov commented Aug 28, 2022

For the moment we have unsafe coding where we try to use 'host' object
before checking the success of it's 'get' operation.

When a Scylla node gets forcibly terminated we have a race condition
and to some probability to get the following panic:

      panic: runtime error: invalid memory address or nil pointer dereference
      [signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x632814]
    
      goroutine 1723 [running]:
      github.com/gocql/gocql.(*HostInfo).HostID(0xc0001500b8)
        %full-path%/[email protected]/host_source.go:271 +0x34
      github.com/gocql/gocql.(*Session).handleRemovedNode(
            0xc000150000, {0xc0134c10d8, 0xc0093d69b0, 0xa}, 0x45264d)
        %full-path%/[email protected]/events.go:243 +0x5f
      github.com/gocql/gocql.(*Session).handleNodeEvent(
            0xc000150000, {0xc01701a000, 0x1, 0x43e305})
        %full-path%/[email protected]/events.go:176 +0x28e
      created by github.com/gocql/gocql.(*eventDebouncer).flush
        %full-path%/[email protected]/events.go:67 +0xb5

So, fix it by checking the result first and only then use 'host' object.

For the moment we have unsafe coding where we try to use 'host' object
before checking the success of it's 'get' operation.

When a Scylla node gets forcibly terminated we have a race condition
and to some probability to get the following panic:

  panic: runtime error: invalid memory address or nil pointer dereference
  [signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x632814]

  goroutine 1723 [running]:
  github.com/gocql/gocql.(*HostInfo).HostID(0xc0001500b8)
    %full-path%/[email protected]/host_source.go:271 +0x34
  github.com/gocql/gocql.(*Session).handleRemovedNode(
        0xc000150000, {0xc0134c10d8, 0xc0093d69b0, 0xa}, 0x45264d)
    %full-path%/[email protected]/events.go:243 +0x5f
  github.com/gocql/gocql.(*Session).handleNodeEvent(
        0xc000150000, {0xc01701a000, 0x1, 0x43e305})
    %full-path%/[email protected]/events.go:176 +0x28e
  created by github.com/gocql/gocql.(*eventDebouncer).flush
    %full-path%/[email protected]/events.go:67 +0xb5

So, fix it by checking the result first and only then use 'host' object.
@vponomaryov
Copy link
Author

Example of the failure without this fix is here: https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/valerii/job/longevity-quorum-failure-reproducer-scylla-bench-small-partitions-3h/9/consoleFull :

09:34:39  ----- LAST ERROR EVENT -------------------------------------------------------
09:34:39  2022-08-26 05:09:00.560 <2022-08-26 05:08:36.000>: (ScyllaBenchLogEvent Severity.ERROR) period_type=one-time event_id=9980d2b7-9a58-4bd9-9a14-90e28597f5c4: type=ConsistencyError regex=received only line_number=5691 node=Node longevity-large-partitions-3h-dev-loader-node-e94f5d93-4 [34.254.187.129 | 10.4.2.43] (seed: False)
09:34:39  2022/08/26 05:08:36 Operation timed out for scylla_bench.test - received only 1 responses from 2 CL=QUORUM.

And successful runs with the fix:

  1. https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/valerii/job/longevity-quorum-failure-reproducer-scylla-bench-small-partitions-3h/12/consoleFull
  2. https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/valerii/job/longevity-quorum-failure-reproducer-scylla-bench-small-partitions-3h/13/consoleFull

@martin-sucha
Copy link

If we get context.Canceled and context.DeadlineExceeded, that means that the context is canceled and any work associated with the context should stop because the caller is not waiting for the result anymore.

We return ErrNotFound only in methods that expect to return values from the first row of the result. It does not seem we should retry in this case, since the database returned successful response, it was just empty.

At least that is what I think the reasons for ignoring those errors are, so I wonder what are the cases when we would want to retry such errors.

Now, if the context.Canceled error comes from some internal gocql context, we might need to retry that. But it seems we should not retry all context.Canceled/context.DeadlineExceeded errors. Specifically, any context passed from the user should not be retried as the user not waiting for the result anymore and internal context error might need retry.

Where does the canceled context come from in your case?

@vponomaryov
Copy link
Author

vponomaryov commented Aug 30, 2022

Where does the canceled context come from in your case?

I didn't check the "where" part.
But the root cause of the query failure is sudden termination of a Scylla node.
Our loaders are always alive, so, there are always the waiting callers in place.

I believe that the termination of a Scylla member makes other alive nodes have latency spikes which can cause "DeadlineExceeded" which leads to the context cancellation. And in such a case I expect retry to work and the fix
proves it in the lots of the CI jobs which were trying to reproduce the original problem, but everything worked OK in 100% cases.

Our expectation is simple - if we set retry policy, we do not expect errors before the timeout for retries.

@vponomaryov
Copy link
Author

vponomaryov commented Aug 31, 2022

This fix is not complete to solve the problem.

The condition to enter the retries block return false because of the another bug mentioned in the bugreport here: #101

The bug is that any query call gets calculated as an attempt and attempt is always bigger than the number of retries after some small period of load time.
So, the complete fix should also include the fix for the query attempts calculation, it must store only retries and not any query call where the last one is useless.

P.S. my previous CI jobs passed because my first patch-set to this PR didn't return from the loop even not reaching the retries block. And such not return was the reason for 2 failed unit tests and it was, de-facto, endless retry.

@zimnx
Copy link
Collaborator

zimnx commented Aug 31, 2022

I agree with Martin, cancelled context means no further work shall be done. Let's first undersantd from where these contexts errors are coming from before you decide that they shall be retried.

@vponomaryov vponomaryov changed the title fix: Make query retryPolicy be respected always Fix nil pointer dereference in events.go handling node removal Sep 1, 2022
@vponomaryov
Copy link
Author

@martin-sucha , @zimnx
Thank you for the review. I finally investigated the real root cause.
The root cause was in incorrect/multiple usage of query objects.

So, from this PR I only need first commit - the one with fix for the removed host handling.
It is needed for anyone who uses retry logic.

@vponomaryov
Copy link
Author

vponomaryov commented Sep 2, 2022

@mmatczuk @martin-sucha @zimnx @piodul
Please, review this oneliner.

Also, created PR for the upstream project here: apache#1652

@roydahan
Copy link
Collaborator

roydahan commented Sep 4, 2022

@zimnx / @piodul can we merge this one?
It was already merged in upstream.

@fruch
Copy link

fruch commented Sep 4, 2022

@zimnx it merge on the upstream apache#1652, can you take a look ?

Copy link
Collaborator

@zimnx zimnx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

@zimnx zimnx changed the title Fix nil pointer dereference in events.go handling node removal Fix nil pointer dereference in event handling node removal Sep 5, 2022
@zimnx zimnx merged commit e5a83d2 into scylladb:master Sep 5, 2022
@vponomaryov
Copy link
Author

LGTM, thanks!

Thanks for merging it.
Is it possible to make a tag with it? So we could start using it ASAP.

@zimnx
Copy link
Collaborator

zimnx commented Sep 5, 2022

LGTM, thanks!

Thanks for merging it. Is it possible to make a tag with it? So we could start using it ASAP.

done

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants