-
Notifications
You must be signed in to change notification settings - Fork 188
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Better handling of errors in scheduler poller #2233
Conversation
public void testBrokenQuartzCron() throws Exception { | ||
String quartz = "0 0/60 * * * ?"; | ||
new CronExpression(quartz); | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure how useful this test is in the sense that this is technically not a valid expression, but library maintainers have made it work in the past and since decided to change that behavior. Maybe add a note about the quartz upgrade for background here in case this starts failing on a future upgrade?
Have you tested this yet? |
No, figured I'd wait until after we decided whether to move forward with this approach. |
|
||
@Override | ||
protected boolean abortsOnError() { | ||
return false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why set this to false
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To avoid killing the entire leader if the scheduler poller runs into an unexpected exception during a poll - currently it aborts at
Line 150 in 1415dea
if (abortsOnError()) { |
@@ -3424,6 +3424,61 @@ public void testRequestedPorts() { | |||
Assertions.assertEquals(1, taskManager.getActiveTaskIds().size()); | |||
} | |||
|
|||
@Test | |||
public void testPortUsage() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this test doesnt seem related to the PR -- can we move it to a separate one if it's still necessary?
4ce4331
to
88d8307
Compare
); | ||
} catch (Exception e) { | ||
LOG.error("Error handling pending requests for {}, skipping", deployKey, e); | ||
unscheduledTasksMetric.mark(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Debated just reusing the lost tasks meter here, could be misleading (same task would get "lost" repeatedly) but would be fairly correct in spirit - @pschoenfelder you have any preference here?
🚢 |
Currently if the scheduler poller encounters an unexpected error it causes Singularity as a whole to abort. This PR changes the scheduler poller to try again in the event of unexpected errors, as well as modifying the scheduled task poll action to log errors for specific failed tasks instead of failing the entire batch.
This approach may swallow too much, but there are metrics that should still inform us if something is wrong with the scheduler as a whole.
cc - @pschoenfelder