
Fix bug that will cause concurrency access to search attributes map #6262

Merged

Conversation

shijiesheng
Contributor

What changed?
Copy search attributes instead of mutating them.

Why?
The previous code mutated the shared search attributes map, causing concurrent map access that crashed production instances.
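
A minimal sketch of that copy-before-merge idea (the helper name, signature, and header-format constant are assumptions, not the repository's exact code):

      package task

      import "fmt"

      // headerFormat is a stand-in for definition.HeaderFormat in the real code.
      const headerFormat = "Header_%s"

      // mergeContextHeadersIntoCopy merges context headers into a fresh map and
      // never writes to the caller's map. As discussed below, the copy still
      // reads the shared map in this goroutine.
      func mergeContextHeadersIntoCopy(attr, headers map[string][]byte) map[string][]byte {
        merged := make(map[string][]byte, len(attr)+len(headers))
        for k, v := range attr { // defensive copy of the existing attributes
          merged[k] = v
        }
        for sanitizedKey, v := range headers {
          key := fmt.Sprintf(headerFormat, sanitizedKey)
          if _, ok := merged[key]; ok { // skip if key already exists
            continue
          }
          merged[key] = v
        }
        return merged
      }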

How did you test it?

Potential risks

Release notes

Documentation Changes


codecov bot commented Sep 2, 2024

Codecov Report

Attention: Patch coverage is 71.42857% with 2 lines in your changes missing coverage. Please review.

Project coverage is 73.02%. Comparing base (3f1f3a6) to head (0cb770c).
Report is 2 commits behind head on master.

Files with missing lines | Patch % | Lines
...ervice/history/task/transfer_task_executor_base.go | 71.42% | 1 Missing and 1 partial ⚠️

Additional details and impacted files

Files with missing lines | Coverage Δ
...ervice/history/task/transfer_task_executor_base.go | 80.64% <71.42%> (+0.23%) ⬆️

... and 8 files with indirect coverage changes


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3f1f3a6...0cb770c.

@shijiesheng shijiesheng merged commit 30d403f into cadence-workflow:master Sep 3, 2024
19 of 20 checks passed
@shijiesheng
Contributor Author

I merged it for a hotfix. I'll add more tests to this package later.
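
For those follow-up tests, a minimal sketch of what they might assert, reusing the hypothetical mergeContextHeadersIntoCopy helper sketched in the PR description above (so the key format and names are assumptions):

      package task

      import "testing"

      func TestMergeDoesNotMutateOriginal(t *testing.T) {
        original := map[string][]byte{"existing": []byte("v")}
        headers := map[string][]byte{"trace-id": []byte("abc")}

        merged := mergeContextHeadersIntoCopy(original, headers)

        if len(original) != 1 { // the shared map must be left untouched
          t.Fatalf("original map was mutated: %v", original)
        }
        if _, ok := merged["Header_trace-id"]; !ok { // the header lands only in the copy
          t.Fatalf("merged map is missing the context header: %v", merged)
        }
      }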

@Groxx
Member

Groxx commented Sep 5, 2024

From digging around in the code with the stack trace we caught on v1.2.13-prerelease05:

fatal error: concurrent map read and map write

goroutine 15047 [running]:
github.com/uber/cadence/service/history/task.appendContextHeaderToSearchAttributes(0x0?, 0xc0141a8db0?, 0xc00503c810)
	external/com_github_uber_cadence/service/history/task/transfer_task_executor_base.go:447 +0x1ef
github.com/uber/cadence/service/history/task.(*transferTaskExecutorBase).recordWorkflowClosed(0xc00cd9fd80, {0x29129d0, 0xc00d41fc00}, {0xc010637e00, 0x24}, {0xc010637d40, 0x24}, {0xc010637e30, 0x24}, {0xc013ac8a08, ...}, ...)
	external/com_github_uber_cadence/service/history/task/transfer_task_executor_base.go:343 +0x2b0
github.com/uber/cadence/service/history/task.(*transferStandbyTaskExecutor).processCloseExecution.func1({0x29129d0, 0xc00d41fc00}, {0x29562b0?, 0xc0131b4dc0?}, {0x2990898, 0xc01439b208})
	external/com_github_uber_cadence/service/history/task/transfer_standby_task_executor.go:276 +0x77d
github.com/uber/cadence/service/history/task.(*transferStandbyTaskExecutor).processTransfer(0xc009ece3f0, {0x29129d0, 0xc00d41fc00}, 0x1, {0x292d420, 0xc01062db00}, 0xc00ecc1ba0, 0x207a750)
	external/com_github_uber_cadence/service/history/task/transfer_standby_task_executor.go:580 +0x30b
github.com/uber/cadence/service/history/task.(*transferStandbyTaskExecutor).processCloseExecution(0x29127d8?, {0x29129d0?, 0xc00d41fc00?}, 0xc007a62e80?)
	external/com_github_uber_cadence/service/history/task/transfer_standby_task_executor.go:298 +0x5d
github.com/uber/cadence/service/history/task.(*transferStandbyTaskExecutor).Execute(0xc009ece3f0, {0x294d440?, 0xc01089c1e0?}, 0x1)
	external/com_github_uber_cadence/service/history/task/transfer_standby_task_executor.go:96 +0x26f
github.com/uber/cadence/service/history/task.(*taskImpl).Execute(0xc01089c1e0)
	external/com_github_uber_cadence/service/history/task/task.go:213 +0x2ba
github.com/uber/cadence/common/task.(*parallelTaskProcessorImpl).executeTask.func2()
	external/com_github_uber_cadence/common/task/parallel_task_processor.go:166 +0x26
github.com/uber/cadence/common/backoff.(*ThrottleRetry).Do(0xc0149e7040, {0x29127d8, 0x9416f20}, 0xc00ecc1e88)
	external/com_github_uber_cadence/common/backoff/retry.go:172 +0xbe
github.com/uber/cadence/common/task.(*parallelTaskProcessorImpl).executeTask(0xc008250d20, {0x7ff7bc5ff9a8, 0xc01089c1e0}, 0xc00791e7e0)
	external/com_github_uber_cadence/common/task/parallel_task_processor.go:186 +0x2d2
github.com/uber/cadence/common/task.(*parallelTaskProcessorImpl).taskWorker(0xc008250d20, 0xc00791e7e0)
	external/com_github_uber_cadence/common/task/parallel_task_processor.go:148 +0x89
created by github.com/uber/cadence/common/task.(*parallelTaskProcessorImpl).Start in goroutine 14860
	external/com_github_uber_cadence/common/task/parallel_task_processor.go:96 +0x7f

this is definitely not a fix. the crash is occurring on this read on line 447:

      key := fmt.Sprintf(definition.HeaderFormat, sanitizedKey)
>>>   if _, ok := attr[key]; ok { // skip if key already exists
          continue
      }

which means the write must be happening concurrently somewhere else, and copying the map in this goroutine only moves the point where the still-dangerous read occurs.


since this is run in a goroutine (which is at the bottom of the stack trace):

      func (p *parallelTaskProcessorImpl) taskWorker(shutdownCh chan struct{}) {
        defer p.shutdownWG.Done()
      
        for {
          select {
          case <-shutdownCh:
            return
>>>       case task := <-p.tasksCh:
            p.executeTask(task, shutdownCh)
          }
        }
      }

we'll need to perform the copy before it crosses threads, i.e. before that channel is written to.

so we've got code somewhere that's pushing a task to a concurrent processor, and then mutating what it pushed. that's pretty much always unsafe, and that's the flaw that needs to be fixed.
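
A minimal sketch of that pattern, with entirely hypothetical names: the producer pushes a task onto the processor's channel and then keeps mutating the map it just shared, versus copying before the send.

      package sketch

      type visibilityTask struct {
        searchAttributes map[string][]byte
      }

      // unsafeSubmit races: the worker draining tasksCh sees the same map the
      // producer keeps writing to after the send.
      func unsafeSubmit(tasksCh chan<- *visibilityTask, attrs map[string][]byte) {
        tasksCh <- &visibilityTask{searchAttributes: attrs}
        attrs["CustomKeywordField"] = []byte("later") // concurrent write to a shared map
      }

      // safeSubmit copies before the value crosses goroutines, so later writes
      // only touch the producer's own map.
      func safeSubmit(tasksCh chan<- *visibilityTask, attrs map[string][]byte) {
        copied := make(map[string][]byte, len(attrs))
        for k, v := range attrs {
          copied[k] = v
        }
        tasksCh <- &visibilityTask{searchAttributes: copied}
        attrs["CustomKeywordField"] = []byte("later") // no longer shared with the worker
      }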


aaah. and this data is coming from mutable state's GetExecutionInfo() which is....... called all over the place.

well. we already knew that thing is a mess of race conditions due to no defensive copying. this is probably just another to add to the pile.

@taylanisikdemir
Member

Good digging. I didn't check the stack trace of the crash, assuming it was the recent change in this file that caused it. Looks like the write is somewhere else.
We had a similar concurrent read/write issue with the maps returned by mutable state on the frontend side. I wonder if mutable state should always return a copy; otherwise some caller somewhere will miss that requirement and try to write to the map.
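
A minimal sketch of that "always return a copy" idea (hypothetical type; the real mutable state interface is far larger, and a shallow copy would still share the []byte values):

      package sketch

      type executionInfo struct {
        searchAttributes map[string][]byte
      }

      // GetSearchAttributesCopy hands each caller its own map, so a caller that
      // writes to the result cannot race with readers of the canonical map.
      func (e *executionInfo) GetSearchAttributesCopy() map[string][]byte {
        if e.searchAttributes == nil {
          return nil
        }
        copied := make(map[string][]byte, len(e.searchAttributes))
        for k, v := range e.searchAttributes {
          copied[k] = v
        }
        return copied
      }

The trade-off is an allocation on every call, which is presumably why the current accessors return the internal map directly.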

@Groxx
Member

Groxx commented Sep 6, 2024

From more collaborative digging, we are currently at three possibilities:

  1. The newly-added request-header-to-search-attr merging is the cause: it's racing on itself, and a copy will indeed fix it (by preventing all writes to the shared value).
    • If true, this PR is a fix for this crash.
  2. Something else is writing to the search-attr map while this visibility processing happens.
    • If true, this PR is not a fix. It doesn't remove the dangerous read operation, it only moves it.
    • ^ this was my original assumption.
  3. We have the same instance of this search-attr map (and probably other data) in the cache under more than one key; different locks are allowed by design, but the entries share state unsafely.

There's also the suspicion that this may be failover-related, due to timing.


1 seems to require flawed locking around the mutable state, but the lock does seem to be getting acquired during this code's execution: https://github.com/uber/cadence/blob/c0cd4c51116edde8542e993460fe425034975ae7/service/history/task/transfer_standby_task_executor.go#L551

If ^ that lock is working correctly, there would be no self-concurrent read/write, because only one task would be processing this execution at a time.
In that case, this change amounts to "do not add the request-header search attrs to mutable state", which does seem to be what we want. They aren't "real" search attrs, and are not held in that field in the history.

Either way, this is a useful PR: that map probably should not have these headers added. We're treating them roughly like workflow status or close time: a pseudo-attr that can be searched.


2 and 3 are hard to trace down. There's a LOT of shallow-copying of search-attrs, and mutable state info as a whole, through multiple types (hundreds of references, plus transitives).

All of the ones I've looked at appear to just be passed through to things reading it, and are probably protected by the lock at a higher level. I've looked at probably less than 10% of them though, and those were only the easier ones.

I'm inclined to believe that 3 is not happening, because copying data from one workflow to another seems unlikely....
... except perhaps for resets 😟. Hopefully we just load the new record from scratch though, and then the run-ID part of the key should be good enough protection.


If we ignore 3 and assume that 1's linked locking code should be correct, that means we either have a flawed mutable-state lock (i.e. recordWorkflowClosed is written correctly but is self-racing, or at least capable of it, due to higher level issues), or we have some other code writing to the map without holding the lock.

Both of them mean that there is another deeper problem, and fixing that would also fix this crash (and may be causing other crashes).

The former looks... probably fine? There's a lot of indirection, but the core lock looks alright and the surrounding stuff seems reasonable to me.
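
For anyone digging further, a minimal sketch of a -race harness (hypothetical, not a test from the repository) that reproduces the class of race being discussed: one goroutine reading the shared attributes map, as the visibility task does, while another writes to it without holding the lock.

      package sketch

      import (
        "sync"
        "testing"
      )

      // Run with: go test -race
      // If any goroutine writes the shared map while another reads it, the race
      // detector flags the read, even if the reader only builds a copy.
      func TestSharedSearchAttributesRace(t *testing.T) {
        shared := map[string][]byte{"existing": []byte("v")}

        var wg sync.WaitGroup
        wg.Add(2)
        go func() { // stands in for the visibility task worker's read
          defer wg.Done()
          for k := range shared {
            _ = shared[k]
          }
        }()
        go func() { // stands in for the unlocked writer being hunted for
          defer wg.Done()
          shared["other"] = []byte("w")
        }()
        wg.Wait()
      }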
