Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[JENKINS-73835] Do not allow builds to be deleted while they are still running and ensure build discarders run after builds are fully complete #9810

Merged
merged 6 commits into from
Oct 12, 2024

Conversation

dwnusbaum
Copy link
Member

@dwnusbaum dwnusbaum commented Oct 1, 2024

See JENKINS-73835 and jenkinsci/workflow-job-plugin#470. I noticed this while investigating #9790, but it is a distinct issue.

This PR makes Run.delete throw an exception if it is called on a build which has not yet completed. It also adjusts some logic related to LogRotator, which is the main programmatic caller of Run.delete to avoid some race conditions and related issues.

Testing done

See new automated PRs.

Proposed changelog entries

  • Do not allow builds to be deleted while they are still building
  • Ensure build discarders only process builds which have fully completed

Proposed upgrade guidelines

N/A

Submitter checklist

Desired reviewers

Before the changes are marked as ready-for-merge:

Maintainer checklist

@dwnusbaum dwnusbaum added the bug For changelog: Minor bug. Will be listed after features label Oct 1, 2024
Comment on lines -1886 to -1893

try {
getParent().logRotate();
} catch (Exception e) {
LOGGER.log(Level.SEVERE, "Failed to rotate log", e);
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved this into GlobalBuildDiscarderListener so that isLogUpdated() == true when log rotation runs. It still runs synchronously, it just runs a little bit later now down in onEndBuilding. This allows us to check isLogUpdated() elsewhere to avoid race conditions with Pipeline builds.

@@ -250,7 +250,7 @@ private boolean shouldKeepRun(Run r, Run lsb, Run lstb) {
LOGGER.log(FINER, "{0} is not to be removed or purged of artifacts because it’s the last stable build", r);
return true;
}
if (r.isBuilding()) {
if (r.isLogUpdated()) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is to avoid race conditions involving Pipeline builds. Previously it was possible for log rotation to delete builds which had not yet fully completed, which meant that a build.xml file could be written back out into the build directory after log rotation had deleted the build. I will try to demonstrate this downstream in workflow-job once I have an incremental build here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you confident that this is safe for freestyle builds? I.e., that !logUpdated → !building or conversely that building → logUpdated? The lifecycles for AbstractBuild and WorkflowRun are pretty different and the Javadoc has always been vague.

Copy link
Member Author

@dwnusbaum dwnusbaum Oct 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am pretty sure it is safe, but if you prefer we can switch to isBuilding() || isLogUpdated() like the check for the UI in Jelly here (or maybe we should simplify that to only check isLogUpdated as well). See the non-Pipeline implementations here and the state enum here.

For !logUpdated → !building,!isLogUpdated is only true in State.COMPLETED, which ensures !isBuilding since COMPLETED is after POST_PRODUCTION.

The converse building → logUpdated should be ok as well, since if isBuilding then we must be in State.NOT_STARTED or State.BUILIDING, both of which are not COMPLETED, so isLogUpdated will be true.

For Pipeline things are simpler since WorkflowRun.isLogUpdated delegates to WorkflowRun.isBuilding.

Also just as a data point, if isLogUpdated could not be used on its own for non-pipelines, I think we would have seen it cause issues in tests due to the usage in JenkinsRule.waitForCompletion.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(And as far as I know the states for a Run can only move forward, although looking at the code I guess a subclass that calls some of the protected methods incorrectly could cause random state changes.)

Copy link
Member

@jglick jglick Oct 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as far as I know the states for a Run can only move forward

That looks right.

like the check for the UI in Jelly

Not sure what you mean.

<j:if test="${!it.building and !it.keepLog}">
is only checking building and ought to be switched to only check logUpdated.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤦 I misread keepLog as logUpdated. I'll update that.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I fixed that in f676be5.

} catch (Exception e) {
LOGGER.log(Level.WARNING, e, () -> "Failed to rotate log for " + run);
}
// Avoid calling Job.logRotate twice in case JobGlobalBuildDiscarderStrategy is configured globally.
Copy link
Member Author

@dwnusbaum dwnusbaum Oct 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit confusing, but essentially the old behavior was that Run.run called Job.logRotate while the build was in POST_PRODUCTION state, but also ever since #4368 if the configuration of GlobalBuildDiscarderConfiguration included JobGlobalBuildDiscarderStrategy (which it does by default), we also called Job.logRotate here while the build was in COMPLETED state. For backwards compatibility I think we need to ensure it gets called at least once regardless of the global configuration, and it's preferable to call it here in onFinalized so that we can check Run.isLogUpdated in LogRotator.shouldKeepRun and Run.delete to avoid race conditions with Pipeline builds. We avoid the redundant call by filtering out the relevant strategy when processing the globally-configured discarders.

Also, the old behavior is why I am not concerned about removing this call in workflow-job and also why I am not worried about making the above call asynchronous for the sake of Pipeline builds. BackgroundGlobalBuildDiscarder.processJob has been (redundantly) calling logRotate synchronously for years now in default configurations, so it seems that the asynchronous behavior in jenkinsci/workflow-job-plugin@63fdbe8 is no longer critical.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(jenkinsci/workflow-job-plugin#70 for better linking)

test/src/test/java/hudson/model/RunTest.java Show resolved Hide resolved
@@ -96,6 +101,17 @@ public void successVsFailureWithRemoveLastBuild() throws Exception {
assertEquals(2, numberOf(project.getLastFailedBuild()));
}

@Test
public void ableToDeleteCurrentBuild() throws Exception {
Copy link
Member Author

@dwnusbaum dwnusbaum Oct 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, this was to check that the switch to isLogUpdated in LogRotator.shouldKeepRun did not break anything, since previously the call in Run.run would not have been able to delete the build with the switch. Because of the redundant call via GlobalBuildDiscarderListener though it worked anyway, the only difference was a log warning from the deleted code in Run.run, which we can't assert against any more. So maybe this test is no longer checking anything useful.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess, in principle, someone could want to keep zero builds if they were using publishers (GitHub Checks, Slack, etc.) to track failures?

…essEvenTheBuildIsRunning to match new behavior
@@ -139,24 +136,18 @@ public class DeleteBuildsCommandTest {
assertThat(result.stdout(), containsString("Deleted 0 builds"));
}

@Test public void deleteBuildsShouldSuccessEvenTheBuildIsRunning() throws Exception {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not notice this test before. It was added in #2310 along with all of the other tests for delete-builds.

I still think the previous behavior was undesirable - deleting builds that are still running and reachable in memory does not make sense, can lead to tons of errors in the Jenkins logs, and any builds deleted while running are very likely to come back after a Jenkins restart given that build completion will cause build.xml to be rewritten.

I can see the argument though that Run.delete should work like Job.delete and cancel the build if it is running, waiting up to 15 seconds and only throwing an exception if the build still hasn't completed at that point. @jglick suggested this approach as well, but I was hoping to avoid adding complexity here unless we really think that behavior is necessary. Any opinions on this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No strong opinion. I think the proposed error behavior is intuitive enough. You can explicitly abort the build first if that is what you intended.

@dwnusbaum
Copy link
Member Author

/label ready-for-merge


This PR is now ready for merge, after ~24 hours, we will merge it if there's no negative feedback.

Thanks!

@comment-ops-bot comment-ops-bot bot added the ready-for-merge The PR is ready to go, and it will be merged soon if there is no negative feedback label Oct 11, 2024
@MarkEWaite MarkEWaite merged commit d34b17e into jenkinsci:master Oct 12, 2024
16 checks passed
@dwnusbaum dwnusbaum deleted the JENKINS-73835 branch October 14, 2024 16:03
@basil
Copy link
Member

basil commented Oct 15, 2024

[ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 7.344 s <<< FAILURE! -- in org.jenkins.plugins.lockableresources.LockStepTest
[ERROR] org.jenkins.plugins.lockableresources.LockStepTest.deleteRunningBuildNewBuildClearsLock -- Time elapsed: 7.330 s <<< ERROR!
java.io.IOException: Unable to delete p #1 because it is still running
	at hudson.model.Run.delete(Run.java:1555)
	at org.jenkins.plugins.lockableresources.LockStepTest.deleteRunningBuildNewBuildClearsLock(LockStepTest.java:725)
	at java.base/java.lang.reflect.Method.invoke(Method.java:569)
	at org.jvnet.hudson.test.JenkinsRule$1.evaluate(JenkinsRule.java:658)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.lang.Thread.run(Thread.java:840)

@dwnusbaum
Copy link
Member Author

I'll take a look at LockStepTest.deleteRunningBuildNewBuildClearsLock. Perhaps the test can just be deleted now.

@dwnusbaum
Copy link
Member Author

jenkinsci/lockable-resources-plugin#716 deletes the test.

basil pushed a commit to basil/jenkins that referenced this pull request Oct 15, 2024
…l running and ensure build discarders run after builds are fully complete (jenkinsci#9810)

* [JENKINS-73835] Do not allow builds to be deleted while they are still running

* [JENKINS-73835] Avoid redundant calls to Job.logRotate when builds complete and always call Job.logRotate after build finalization

* [JENKINS-73835] Add issue reference to RunTest.buildsMayNotBeDeletedWhileRunning

* [JENKINS-73835] Adjust DeleteBuildsCommandTest.deleteBuildsShouldSuccessEvenTheBuildIsRunning to match new behavior

* [JENKINS-73835] Run/delete.jelly should check Run.isLogUpdated, not Run.isBuilding
basil pushed a commit to basil/jenkins that referenced this pull request Oct 15, 2024
…l running and ensure build discarders run after builds are fully complete (jenkinsci#9810)

* [JENKINS-73835] Do not allow builds to be deleted while they are still running

* [JENKINS-73835] Avoid redundant calls to Job.logRotate when builds complete and always call Job.logRotate after build finalization

* [JENKINS-73835] Add issue reference to RunTest.buildsMayNotBeDeletedWhileRunning

* [JENKINS-73835] Adjust DeleteBuildsCommandTest.deleteBuildsShouldSuccessEvenTheBuildIsRunning to match new behavior

* [JENKINS-73835] Run/delete.jelly should check Run.isLogUpdated, not Run.isBuilding

(cherry picked from commit d34b17e)
@basil
Copy link
Member

basil commented Oct 15, 2024

Ever since this PR was merged, hudson.cli.DeleteBuildsCommandTest#deleteBuildsShouldFailIfTheBuildIsRunning has been consistently failing on Windows CI builds with:

java.nio.file.DirectoryNotEmptyException: C:\Jenkins\agent\workspace\Core_jenkins_master\test\target\j h14545459714006339336\jobs\aProject\builds\1
	at java.base/sun.nio.fs.WindowsFileSystemProvider.implDelete(WindowsFileSystemProvider.java:272)
	at java.base/sun.nio.fs.AbstractFileSystemProvider.deleteIfExists(AbstractFileSystemProvider.java:110)
	at java.base/java.nio.file.Files.deleteIfExists(Files.java:1191)
	at org.jvnet.hudson.test.TemporaryDirectoryAllocator.delete(TemporaryDirectoryAllocator.java:146)
	at org.jvnet.hudson.test.TemporaryDirectoryAllocator.delete(TemporaryDirectoryAllocator.java:136)
	at org.jvnet.hudson.test.TemporaryDirectoryAllocator.delete(TemporaryDirectoryAllocator.java:136)
	at org.jvnet.hudson.test.TemporaryDirectoryAllocator.delete(TemporaryDirectoryAllocator.java:136)
	at org.jvnet.hudson.test.TemporaryDirectoryAllocator.delete(TemporaryDirectoryAllocator.java:136)
	at org.jvnet.hudson.test.TemporaryDirectoryAllocator.dispose(TemporaryDirectoryAllocator.java:104)
	at org.jvnet.hudson.test.TestEnvironment.dispose(TestEnvironment.java:84)
	at org.jvnet.hudson.test.JenkinsRule.after(JenkinsRule.java:538)
	at org.jvnet.hudson.test.JenkinsRule$1.evaluate(JenkinsRule.java:676)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.lang.Thread.run(Thread.java:840)
	Suppressed: java.io.IOException: These files still exist : log
		at org.jvnet.hudson.test.TemporaryDirectoryAllocator.delete(TemporaryDirectoryAllocator.java:150)
		... 10 more

First failure was https://ci.jenkins.io/job/Core/job/jenkins/job/master/6632/testReport/junit/hudson.cli/DeleteBuildsCommandTest/windows_jdk17___Windows___JDK_17___Build___Test___deleteBuildsShouldFailIfTheBuildIsRunning/

@dwnusbaum
Copy link
Member Author

Yes, looks like the test needs to forcibly stop the build. I will file a PR to fix it in a bit.

@dwnusbaum
Copy link
Member Author

I filed #9876 to stabilize that test.

logRotator.setRemoveLastBuild(true);
p.setBuildDiscarder(logRotator);
j.buildAndAssertStatus(Result.SUCCESS, p);
assertNull(p.getBuildByNumber(1));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know what we would await for, because log rotation runs synchronously during build completion. Unless this is a test-specific timing issue due to JenkinsRule.buildAndAssertSuccess just waiting for QueueTaskFuture.get instead of !Run.isLogUpdated, then this test probably just needs to be skipped on Windows.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From some testing, j.buildAndAssertStatus(Result.SUCCESS, p) is guaranteed to block until GlobalBuildDiscarderListener.onFinalized has completed. Perhaps the BuildWatcher background thread tried to copy the logs at an inopportune time and caused log rotation to fail. Seems best to delete the BuildWatcher here in case it made any of the other tests flaky on Windows and then probably skip this test explicitly on Windows as well.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also the AccessDeniedException seems strange. Either way, I filed #9923.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug For changelog: Minor bug. Will be listed after features ready-for-merge The PR is ready to go, and it will be merged soon if there is no negative feedback
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants