Do the worker HB timeout check when HB's are updated #706

Merged


Contributor

@d2r d2r commented Oct 10, 2013

Instead of asking ZooKeeper for the latest heartbeat data for all topologies and only then checking which heartbeats have timed out, combine the timeout check with the heartbeat update.

This makes Nimbus more tolerant of a slow ZooKeeper server. With a slow server, updating the heartbeats can take so long that, by the time Nimbus actually examines the new heartbeats for timeouts, all executors in most of the earliest-updated topologies appear to have timed out.
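
To illustrate the idea for discussion, here is a minimal Python sketch of doing the timeout check in the same pass as the heartbeat update. It is not the Nimbus code touched by this pull request; the function name `update_heartbeat_cache`, the cache layout, and the field names are all hypothetical. The assumption is that each executor's timeout is judged against the local time at which its heartbeat was last seen to change, so time spent fetching other topologies' heartbeats does not count against it.

```python
from typing import Dict, Hashable, Iterable, Optional

# Hypothetical cache entry layout (an assumption for this sketch):
#   "reported_time": timestamp the worker wrote into its heartbeat
#   "local_time":    local clock value when that heartbeat was last seen to change
#   "timed_out":     result of the timeout check done during the update
Entry = Dict[str, object]


def update_heartbeat_cache(
    cache: Dict[Hashable, Entry],
    fetched: Dict[Hashable, float],
    executors: Iterable[Hashable],
    timeout_secs: float,
    now: float,
) -> Dict[Hashable, Entry]:
    """Merge freshly fetched heartbeats for one topology into the cache
    and flag timed-out executors in the same pass."""
    new_cache: Dict[Hashable, Entry] = {}
    for executor in executors:
        prev: Optional[Entry] = cache.get(executor)
        reported = fetched.get(executor)

        if prev is None:
            # First time we track this executor: start its clock now.
            local_time = now
        elif reported is not None and reported != prev["reported_time"]:
            # The heartbeat changed since the last update: reset its clock.
            local_time = now
        else:
            # No new heartbeat: keep the old clock and the old reported time.
            local_time = prev["local_time"]
            reported = prev["reported_time"]

        new_cache[executor] = {
            "reported_time": reported,
            "local_time": local_time,
            # The check happens here, immediately after this topology's
            # heartbeats are merged, not after all topologies are fetched.
            "timed_out": (now - local_time) >= timeout_secs,
        }
    return new_cache
```

In this sketch the caller would invoke `update_heartbeat_cache` once per topology, right after fetching that topology's heartbeats, passing the current local time as `now`.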

@d2r
Contributor Author

d2r commented Oct 10, 2013

This pull request is initially intended to start discussion.

Current unit tests pass with this change.

@d2r
Contributor Author

d2r commented Oct 10, 2013

This could mitigate a possible root cause for #689.

@brndnmtthws
Contributor

👍

Good stuff.

@ptgoetz
Collaborator

ptgoetz commented Oct 21, 2013

+1
(tested in distributed mode with 5-node cluster)

@nathanmarz
Owner

+1

@xumingming
Collaborator

+1

ptgoetz added a commit that referenced this pull request Oct 25, 2013
Do the worker HB timeout check when HB's are updated
@ptgoetz ptgoetz merged commit a0bc262 into nathanmarz:master Oct 25, 2013