Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Better handling of errors in scheduler poller #2233

Merged
merged 6 commits into from
Oct 20, 2021
Merged

Conversation

WH77
Copy link
Contributor

@WH77 WH77 commented Sep 22, 2021

Currently if the scheduler poller encounters an unexpected error it causes Singularity as a whole to abort. This PR changes the scheduler poller to try again in the event of unexpected errors, as well as modifying the scheduled task poll action to log errors for specific failed tasks instead of failing the entire batch.

This approach may swallow too much, but there are metrics that should still inform us if something is wrong with the scheduler as a whole.

cc - @pschoenfelder

Comment on lines 3846 to 3850
public void testBrokenQuartzCron() throws Exception {
String quartz = "0 0/60 * * * ?";
new CronExpression(quartz);
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure how useful this test is in the sense that this is technically not a valid expression, but library maintainers have made it work in the past and since decided to change that behavior. Maybe add a note about the quartz upgrade for background here in case this starts failing on a future upgrade?

@pschoenfelder
Copy link
Contributor

Have you tested this yet?

@WH77
Copy link
Contributor Author

WH77 commented Sep 28, 2021

No, figured I'd wait until after we decided whether to move forward with this approach.


@Override
protected boolean abortsOnError() {
return false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why set this to false?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To avoid killing the entire leader if the scheduler poller runs into an unexpected exception during a poll - currently it aborts at

@@ -3424,6 +3424,61 @@ public void testRequestedPorts() {
Assertions.assertEquals(1, taskManager.getActiveTaskIds().size());
}

@Test
public void testPortUsage() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this test doesnt seem related to the PR -- can we move it to a separate one if it's still necessary?

);
} catch (Exception e) {
LOG.error("Error handling pending requests for {}, skipping", deployKey, e);
unscheduledTasksMetric.mark();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Debated just reusing the lost tasks meter here, could be misleading (same task would get "lost" repeatedly) but would be fairly correct in spirit - @pschoenfelder you have any preference here?

@pschoenfelder
Copy link
Contributor

🚢

@WH77 WH77 merged commit ff606ca into master Oct 20, 2021
@ssalinas ssalinas added this to the 1.5.0 milestone May 4, 2022
@ssalinas ssalinas deleted the error-in-scheduler-loop branch May 4, 2022 16:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants