Try to fix synchronization in `SemaphoreStep` #258

dwnusbaum · 2024-03-28T20:58:08Z

Even after #235, I saw a test flake because a semaphore step hung forever even though SemaphoreStep.success was called. The logs show an order of operations that exposes a bug in the existing synchronization:

  10.238 [id=1884]	INFO	o.j.p.w.t.s.SemaphoreStep$Execution#start: Blocking wait/1
  10.239 [id=1839]	INFO	o.j.p.w.test.steps.SemaphoreStep#success: Planning to unblock wait/1 as success

Here is my hypothesis as to what happened. We have two threads: thread 1 calls SemaphoreStep.Execution.start and thread 2 calls SemaphoreStep.success. They race for the State monitor. I think that thread 1 acquired the State monitor first and ran this entire block:

workflow-support-plugin/src/test/java/org/jenkinsci/plugins/workflow/test/steps/SemaphoreStep.java

Lines 192 to 200 in 1e84e32

    
    synchronized (s) {
        if (s.returnValues.containsKey(k)) {
            success = true;
            returnValue = s.returnValues.get(k);
        } else if (s.errors.containsKey(k)) {
            failure = true;
            error = s.errors.get(k);
        }
    }

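Thread 2 then acquired the State monitor and ran the equivalent of this block in SemaphoreStep.success (the permalink below shows the one in SemaphoreStep.failure, which has the same structure but logs "as failure" and writes to s.errors instead of s.returnValues):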

workflow-support-plugin/src/test/java/org/jenkinsci/plugins/workflow/test/steps/SemaphoreStep.java

Lines 136 to 143 in 1e84e32

    
    synchronized (s) {
        if (!s.contexts.containsKey(k)) {
            LOGGER.info(() -> "Planning to unblock " + k + " as failure");
            s.errors.put(k, error);
            return;
        }
        c = getContext(s, k);
    }

At this point, s.contexts does not contain k, so Planning to unblock wait/1 as success gets logged, and the monitor is released, but thread 1 already checked s.returnValues and will never check it again. Thread 1 in the meantime proceeded to this synchronized block and ran it once thread 2 released the monitor:

workflow-support-plugin/src/test/java/org/jenkinsci/plugins/workflow/test/steps/SemaphoreStep.java

Lines 210 to 212 in 1e84e32

    
    synchronized (s) {
        s.contexts.put(k, c);
    }

At this point though, this does nothing. Thread 2 already ran SemaphoreStep.success, and the context did not exist, so it put a value in State.returnValues. Thread 1 had already checked State.returnValues though, and will not check it again, so the step just hangs forever and the test times out.
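The interleaving described above can be replayed deterministically on a single thread. The following is only a minimal sketch under stated assumptions: `RaceSketch` and its methods are hypothetical stand-ins for the State maps and the synchronized blocks quoted above, not the plugin's actual code, and the errors map is left out for brevity.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical, simplified model of the State maps; not the plugin's real class. */
class RaceSketch {
    private final Map<String, Object> returnValues = new HashMap<>();
    private final Map<String, String> contexts = new HashMap<>();

    /** Models the first synchronized block of start(): look for a planned result. */
    synchronized boolean startCheck(String k) {
        return returnValues.containsKey(k);
    }

    /** Models the later, separate synchronized block of start(): register the context. */
    synchronized void startRegister(String k, String c) {
        contexts.put(k, c);
    }

    /**
     * Models success(): returns true if a registered context could be unblocked
     * directly, otherwise stores the value for a later start() and returns false.
     */
    synchronized boolean success(String k, Object value) {
        if (!contexts.containsKey(k)) {
            returnValues.put(k, value);
            return false;
        }
        return true;
    }

    /** Replays the problematic ordering; returns true if nobody completes the step. */
    static boolean replayHangs() {
        RaceSketch s = new RaceSketch();
        boolean sawResult = s.startCheck("wait/1");     // thread 1: nothing planned yet
        boolean delivered = s.success("wait/1", "v");   // thread 2: no context yet, value is stored
        s.startRegister("wait/1", "<context/>");        // thread 1: registers too late
        return !sawResult && !delivered;
    }

    public static void main(String[] args) {
        System.out.println("step hangs: " + replayHangs());
    }
}
```

In this model the stored value is never read again after `startRegister`, which matches the hang seen in the logs.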

CC @jgreffe

dwnusbaum · 2024-03-28T20:58:30Z

src/test/java/org/jenkinsci/plugins/workflow/test/steps/SemaphoreStep.java

@@ -189,13 +189,16 @@ public static class Execution extends AbstractStepExecutionImpl {
            Object returnValue = null;
            Throwable error = null;
            boolean success = false, failure = false, sync = true;
+            String c = Jenkins.XSTREAM.toXML(getContext());


Just keeping this out of the synchronized block as before.

dwnusbaum · 2024-03-28T21:01:07Z

src/test/java/org/jenkinsci/plugins/workflow/test/steps/SemaphoreStep.java

            synchronized (s) {
                if (s.returnValues.containsKey(k)) {
                    success = true;
                    returnValue = s.returnValues.get(k);
                } else if (s.errors.containsKey(k)) {
                    failure = true;
                    error = s.errors.get(k);
+                } else {
+                    s.contexts.put(k, c);


This is the fix. Other threads now only ever see State in one of two modes: either contexts does not contain the relevant key, and returnValues and errors should be used and will be checked here, or contexts does contain the key, and the StepContext should be used instead. Previously, there was a case where contexts did not contain the key yet but returnValues and errors had already been checked.
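As a rough illustration of the two modes described above, here is a minimal sketch in which the check and the registration share one critical section. `FixedSketch` is a hypothetical simplification (no errors map, no real StepContext), not the code in this PR.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical, simplified model of the fixed start(); not the plugin's real class. */
class FixedSketch {
    private final Map<String, Object> returnValues = new HashMap<>();
    private final Map<String, String> contexts = new HashMap<>();

    /**
     * Returns true if a result was already planned (the step can complete
     * synchronously); otherwise registers the context before releasing the monitor.
     */
    synchronized boolean start(String k, String c) {
        if (returnValues.containsKey(k)) {
            return true;
        }
        contexts.put(k, c);
        return false;
    }

    /** Returns true if a registered context was found, otherwise stores the value. */
    synchronized boolean success(String k, Object value) {
        if (!contexts.containsKey(k)) {
            returnValues.put(k, value);
            return false;
        }
        return true;
    }

    /** Runs both calls in the given order and reports whether the step gets completed. */
    static boolean completes(boolean startFirst) {
        FixedSketch s = new FixedSketch();
        boolean sync;
        boolean delivered;
        if (startFirst) {
            sync = s.start("wait/1", "<context/>");
            delivered = s.success("wait/1", "v");
        } else {
            delivered = s.success("wait/1", "v");
            sync = s.start("wait/1", "<context/>");
        }
        return sync || delivered;
    }

    public static void main(String[] args) {
        System.out.println(completes(true) + " " + completes(false));
    }
}
```

Because each method holds the monitor for its whole body, these two orderings are the only ones another thread can observe in this model, and the step is completed in both.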

dwnusbaum · 2024-03-28T21:04:06Z

src/test/java/org/jenkinsci/plugins/workflow/test/steps/SemaphoreStep.java

-                String c = Jenkins.XSTREAM.toXML(getContext());
-                synchronized (s) {
-                    s.contexts.put(k, c);
-                }
                sync = false;


FWIW I think there may be other pre-existing bugs here. What happens if another thread calls success or failure after the first synchronized block above but before this completes? Those threads will call StepContext.onSuccess/StepContext.onFailure before StepExecution.start has completed, which seems unusual, but maybe it's fine.

Maybe we should change the overall approach, drop support for the synchronous completion cases, and have the step just poll continuously on a background thread looking for results. It would be slower and less efficient, but it also seems less prone to complex synchronization issues.
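To make the polling idea concrete, here is a minimal sketch assuming a plain `ScheduledExecutorService` as the background thread. `PollingSketch`, its method names, and the 10 ms delay are all hypothetical and are not taken from #259. One limitation of this sketch is that `ConcurrentHashMap` rejects null values, so a null return value would need a wrapper.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

/** Hypothetical polling variant: the step never completes synchronously. */
class PollingSketch {
    private final Map<String, Object> results = new ConcurrentHashMap<>();

    /** Called by the test thread; only records the result. */
    void success(String k, Object value) {
        results.put(k, value);
    }

    /** Called by the step; polls on the timer until a result for k shows up. */
    CompletableFuture<Object> start(String k, ScheduledExecutorService timer) {
        CompletableFuture<Object> done = new CompletableFuture<>();
        ScheduledFuture<?> poll = timer.scheduleWithFixedDelay(() -> {
            Object v = results.remove(k);
            if (v != null) {
                done.complete(v);
            }
        }, 0, 10, TimeUnit.MILLISECONDS);
        // Stop polling once the step has been completed.
        done.whenComplete((v, t) -> poll.cancel(false));
        return done;
    }

    /** Starts a step, reports success from the calling thread, and waits for the poller. */
    static Object demo() {
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        try {
            PollingSketch s = new PollingSketch();
            CompletableFuture<Object> step = s.start("wait/1", timer);
            s.success("wait/1", "ok");
            return step.join();
        } finally {
            timer.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The writer and the poller only share a single concurrent map here, so there is no check-then-register window, at the cost of up to one polling interval of latency per step.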

Maybe we should change the overall approach

FWIW, I filed #259 with an example of what that could look like.

jglick · 2024-03-28T21:33:26Z

jenkinsci/support-core-plugin#515 (comment) ?

jglick · 2024-03-28T21:35:39Z

Perhaps it would be easier to reason about if instead of three maps there were just one, with values being a union of Blocked | Success(Object) | Error(Throwable)?

dwnusbaum · 2024-03-28T21:37:27Z

jenkinsci/support-core-plugin#515 (comment) ?

I saw the issue in a proprietary plugin, but indeed the logs for that test failure look the same:

   9.938 [id=63]	INFO	o.j.p.w.t.s.SemaphoreStep$Execution#start: Blocking wait/1
   9.939 [id=141]	INFO	o.j.p.w.test.steps.SemaphoreStep#success: Planning to unblock wait/1 as success

jglick · 2024-03-28T21:41:57Z

Sorry, four maps, I forgot about started.

dwnusbaum · 2024-03-28T21:44:03Z

Perhaps it would be easier to reason about if instead of [four] maps there were just one, with values being a union...

Yes, but this pattern is a bit awkward in Java (at least pre-sealed classes/interfaces and future related improvements). I can file another PR with that approach for comparison.
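For comparison, a single-map variant might look roughly like the sketch below, which emulates the union with a closed class hierarchy that also compiles before sealed classes. Everything in it is hypothetical and is not the code in #260: `SingleMapSketch`, `Entry`, and the use of `Failed` in place of `Error` (renamed here to avoid shadowing `java.lang.Error`).

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical single-map model; each key is in exactly one state at a time. */
class SingleMapSketch {
    /** Closed hierarchy emulated with a private constructor. */
    abstract static class Entry {
        private Entry() {}
    }

    static final class Blocked extends Entry {
        final String context;
        Blocked(String context) { this.context = context; }
    }

    static final class Success extends Entry {
        final Object value;
        Success(Object value) { this.value = value; }
    }

    static final class Failed extends Entry {
        final Throwable error;
        Failed(Throwable error) { this.error = error; }
    }

    private final Map<String, Entry> entries = new HashMap<>();

    /** Returns a planned outcome if one exists, otherwise records Blocked and returns null. */
    synchronized Entry start(String k, String context) {
        Entry e = entries.get(k);
        if (e instanceof Success || e instanceof Failed) {
            return e;
        }
        entries.put(k, new Blocked(context));
        return null;
    }

    /** Records the outcome; returns the Blocked entry to resume, or null if none was waiting. */
    synchronized Blocked success(String k, Object value) {
        Entry previous = entries.put(k, new Success(value));
        return previous instanceof Blocked ? (Blocked) previous : null;
    }

    /** Same as success(), but for a failure outcome. */
    synchronized Blocked failure(String k, Throwable error) {
        Entry previous = entries.put(k, new Failed(error));
        return previous instanceof Blocked ? (Blocked) previous : null;
    }
}
```

With one map, the "neither registered nor planned" state is simply the absence of a key, so there is no way for two maps to disagree about a given semaphore.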

dwnusbaum · 2024-03-28T22:43:37Z

See #260.

jglick

I have a general preference for something like #260 as a refactoring, but OTOH this is fairly simple and looks like it may solve the practical issues, so you should go ahead and get something tested & released.

jglick · 2024-03-29T12:40:52Z

src/test/java/org/jenkinsci/plugins/workflow/test/steps/SemaphoreStep.java

-                String c = Jenkins.XSTREAM.toXML(getContext());
-                synchronized (s) {
-                    s.contexts.put(k, c);
-                }
                sync = false;


Those threads will call StepContext.onSuccess/StepContext.onFailure before StepExecution.start has completed

That would be normal if false is returned. It would be weird if start later returns true, but I do not think it is harmful. TBH I do not recall the logic here in workflow-cps very well. Anyway this seems unlikely in a normal test, because it should call wait and then do some stuff before calling success.

dwnusbaum · 2024-03-29T17:11:02Z

@jglick From some PCT testing, this PR looks fine, but #260 has a few issues that would need to be addressed (#260 (comment)). I can fix them, but is it ok with you if we just go ahead and merge/release this for now?

jglick · 2024-03-29T17:29:54Z

Feel free to release this PR if it seems to pass PCT. No need to waste days on this.

Try to fix synchronization in SemaphoreStep

5f06b21

dwnusbaum requested a review from a team as a code owner March 28, 2024 20:58

dwnusbaum changed the title ~~Try to fix synchronization in SemaphoreStep~~ Try to fix synchronization in `SemaphoreStep` Mar 28, 2024

jglick added the developer label Mar 28, 2024

dwnusbaum mentioned this pull request Mar 28, 2024

Alternate approach to SemaphoreStep using Timer #259

Closed

dwnusbaum mentioned this pull request Mar 28, 2024

Move SemaphoreStep states into a single map to try to simplify the implementation #260

Closed

jglick approved these changes Mar 29, 2024

dwnusbaum merged commit 175aa9c into jenkinsci:master Mar 29, 2024
14 checks passed

dwnusbaum deleted the synchronization-fix branch March 29, 2024 17:30

jglick mentioned this pull request Apr 4, 2024

Added SemaphoreStep.waitForStart() for workflow to wait until semaphore release before continuing. jenkinsci/support-core-plugin#530

Closed
