
Fix handling of CurrentWorkflowConditionFailedError when creating a workflow #3349

Merged: 3 commits into master from fixconcurrent on Jun 24, 2020

Conversation

@vancexu (Contributor) commented on Jun 18, 2020

What changed?
Fix handling of CurrentWorkflowConditionFailedError when creating a workflow.

Why?
When CurrentWorkflowConditionFailedError happens during workflow creation, it doesn't make sense for the shard to renew its range (renewRange): renewal doesn't help this error case and is likely to cause more errors.

How did you test it?
Existing tests.
Bench test on staging.

Potential risks
Limited. In the worst case, create workflow will return an unexpected error.
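
To make the change concrete, here is a minimal, self-contained sketch of the behavior this PR encodes, assuming stub error types and an illustrative createWorkflowWithRetry helper (these are not the actual Cadence identifiers): known non-retriable errors, now including CurrentWorkflowConditionFailedError, are returned to the caller directly instead of triggering a shard range renewal.

```go
package main

import (
	"errors"
	"fmt"
)

// Stub error types standing in for shared.WorkflowExecutionAlreadyStartedError
// and persistence.CurrentWorkflowConditionFailedError in the real code.
type WorkflowExecutionAlreadyStartedError struct{}
type CurrentWorkflowConditionFailedError struct{}

func (*WorkflowExecutionAlreadyStartedError) Error() string { return "workflow already started" }
func (*CurrentWorkflowConditionFailedError) Error() string {
	return "current workflow condition failed"
}

// createWorkflowWithRetry sketches the shape of the create-workflow loop after
// this change: known non-retriable errors are returned to the caller directly,
// instead of falling through to the shard range renewal path.
func createWorkflowWithRetry(create func() error, renewRange func() error, maxAttempts int) error {
	for attempt := 0; attempt < maxAttempts; attempt++ {
		err := create()
		if err == nil {
			return nil
		}
		switch err.(type) {
		case *WorkflowExecutionAlreadyStartedError,
			*CurrentWorkflowConditionFailedError: // added by this PR: no range renewal for this error
			return err
		default:
			// Unrecognized error: renew the shard range and retry the create.
			if rErr := renewRange(); rErr != nil {
				return rErr
			}
		}
	}
	return errors.New("create workflow: max attempts exceeded")
}

func main() {
	err := createWorkflowWithRetry(
		func() error { return &CurrentWorkflowConditionFailedError{} },
		func() error { return nil },
		3,
	)
	fmt.Println(err) // returned immediately, without renewing the shard range
}
```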

@vancexu requested a review from yux0 on June 18, 2020
@@ -496,6 +496,7 @@ Create_Loop:
 	switch err.(type) {
 	case *shared.WorkflowExecutionAlreadyStartedError,
 		*persistence.WorkflowExecutionAlreadyStartedError,
+		*persistence.CurrentWorkflowConditionFailedError,
@yycptt (Contributor):

If the range ID changes, will it throw this CurrentWorkflowConditionFailedError? If that is the case, why not update the range ID?

@vancexu (Contributor, Author):

CurrentWorkflowConditionFailedError in StartWorkflow only happens when concurrent requests mess up the current-workflow record. A RangeID change will not lead to this error. That's my understanding based on reading https://github.com/uber/cadence/blob/master/common/persistence/cassandra/cassandraPersistence.go#L1229 - L1242

Contributor:

I think @yycptt makes a good point. Should we retry on this error?

@vancexu (Contributor, Author):

This error will be retried by the client; from the history service's point of view it should currently be non-retriable.
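
A minimal sketch of that division of responsibility, assuming an illustrative startWithClientRetry helper and a stub error value (this is not the Cadence client API): the history service surfaces the error as-is, and the caller retries it with a short backoff.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errCurrentWorkflowConditionFailed stands in for the persistence error that the
// history service now surfaces directly instead of renewing its shard range.
var errCurrentWorkflowConditionFailed = errors.New("current workflow condition failed")

// startWithClientRetry sketches caller-side behavior: the client retries the
// request with a short backoff when it sees this error, while any other error
// is returned immediately.
func startWithClientRetry(call func() error, maxAttempts int, backoff time.Duration) error {
	var err error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		if err = call(); err == nil {
			return nil
		}
		if !errors.Is(err, errCurrentWorkflowConditionFailed) {
			return err
		}
		time.Sleep(backoff)
	}
	return err
}

func main() {
	attempts := 0
	err := startWithClientRetry(func() error {
		attempts++
		if attempts < 3 {
			// Simulate losing the race against a concurrent start, then succeeding.
			return errCurrentWorkflowConditionFailed
		}
		return nil
	}, 5, 10*time.Millisecond)
	fmt.Printf("err=%v after %d attempts\n", err, attempts) // err=<nil> after 3 attempts
}
```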

@vancexu requested a review from yux0 on June 22, 2020
@coveralls (Coverage Status): Coverage increased (+0.1%) to 67.234% when pulling 190c01c on fixconcurrent into fa3155e on master.

@vancexu merged commit b5ce9c7 into master on Jun 24, 2020
@vancexu deleted the fixconcurrent branch on June 24, 2020
@mkolodezny self-requested a review on June 24, 2020
@mkolodezny (Contributor) left a comment:

What's the metric to check once this lands?

@vancexu (Contributor, Author) commented on Jun 24, 2020:

> What's the metric to check once this lands?

  1. The error log for CurrentWorkflowConditionFailedError should no longer show the same failing SignalWithStart request repeatedly retried with an increasing rangeID.
  2. History service SignalWithStart latency should drop.
