broker: track RPC state and send error responses for lost peers #3800
Comments
garlick
added a commit
to garlick/flux-core
that referenced
this issue
Aug 15, 2021
Problem: when TBON children become lost, any pending RPCs passing through them may go unanswered, leading to hangs in other parts of the system.

Track pending RPCs for each TBON child. When a child's state transitions from an online state to offline/lost, error responses are generated for these RPCs.

An RPC is considered terminated when any of the following holds:
- the request has the NORESPONSE flag set
- the request has the STREAMING flag set, and a matching error response is received
- the request has neither flag set, and any matching response is received
- a disconnect request is received with the same sending UUID as the request

Note: this only affects RPCs where the next hop is in the downstream/leaves direction. Each broker along the path of a multi-hop RPC tracks RPCs routed to its downstream peer, but only the broker whose downstream peer transitions to lost or offline sends an error response. This PR does not address loss of the parent.

Fixes flux-framework#3800
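
The termination rules in the commit message above can be sketched as a small per-child tracker. This is only an illustration in Python, not the flux-core implementation (which is written in C): the `Request`, `Response`, and `ChildTracker` names, the flag values, keying pending RPCs by (sender UUID, matchtag), and the choice of `EHOSTUNREACH` as the error number are all assumptions made for the example.

```python
import errno
from dataclasses import dataclass

# Hypothetical flag values, chosen only for this sketch.
FLAG_NORESPONSE = 1
FLAG_STREAMING = 2


@dataclass(frozen=True)
class Request:
    sender: str      # UUID of the original sender (assumed field)
    matchtag: int    # tag pairing responses with requests (assumed field)
    flags: int = 0


@dataclass
class Response:
    sender: str
    matchtag: int
    errnum: int = 0  # 0 means success, nonzero means error response


class ChildTracker:
    """Tracks RPCs routed to one downstream (child) peer."""

    def __init__(self):
        self.pending = {}  # (sender, matchtag) -> Request

    def request_sent(self, req):
        # NORESPONSE requests terminate immediately: nothing to track.
        if req.flags & FLAG_NORESPONSE:
            return
        self.pending[(req.sender, req.matchtag)] = req

    def response_received(self, rsp):
        key = (rsp.sender, rsp.matchtag)
        req = self.pending.get(key)
        if req is None:
            return
        # A streaming RPC stays open until an error response arrives.
        if req.flags & FLAG_STREAMING and rsp.errnum == 0:
            return
        del self.pending[key]

    def disconnect(self, sender):
        # A disconnect request terminates every RPC from that sender.
        self.pending = {
            k: r for k, r in self.pending.items() if r.sender != sender
        }

    def child_lost(self, errnum=errno.EHOSTUNREACH):
        # Generate one error response per still-pending RPC, then forget them.
        responses = [
            Response(r.sender, r.matchtag, errnum)
            for r in self.pending.values()
        ]
        self.pending.clear()
        return responses
```

In this sketch a broker would keep one `ChildTracker` per child, call `child_lost()` when that child transitions to offline/lost, and route the returned error responses back toward the original senders.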
garlick
added a commit
to garlick/flux-core
that referenced
this issue
Aug 16, 2021
garlick
added a commit
to garlick/flux-core
that referenced
this issue
Aug 17, 2021
chu11
pushed a commit
to chu11/flux-core
that referenced
this issue
Sep 28, 2021
When a broker peer is lost, pending RPCs routed through it can never receive responses. To avoid hangs, brokers should track outstanding RPCs and send error responses when the next hop dies.