feat(failover): Recover rebooted compute nodes with notification and snapshot #1500
Signed-off-by: Bowen Zhou <[email protected]>
…o zbw/cn_recovery_with_notification Signed-off-by: Bowen Zhou <[email protected]>
Codecov Report
@@ Coverage Diff @@
## main #1500 +/- ##
============================================
- Coverage 70.54% 70.44% -0.10%
Complexity 2766 2766
============================================
Files 1029 1031 +2
Lines 90341 90472 +131
Branches 1790 1790
============================================
+ Hits 63729 63737 +8
- Misses 25721 25844 +123
Partials 891 891
Good job, LGTM except the naming.
@@ -296,6 +300,7 @@ message SubscribeResponse {
    catalog.Table table_v2 = 10;
    catalog.Source source = 11;
    MetaSnapshot fe_snapshot = 12;
Would it be better to name these SnapshotForFE and SnapshotForBE?
How about just calling them FrontendSnapshot and BackendSnapshot? 🤔
Not sure. Alternatively, we could keep the current name until we need to produce more info for the backend.
Should we also use the notification service to create sources on compute nodes, as we did for the catalog?
Yes, I will do this after this PR is merged.
LGTM!
Generally LGTM
for source in snapshot.sources {
    match source.info.unwrap() {
We may be able to reuse the code from create_source in the stream manager.
What's changed and what's your intention?
In #1215 and #1428, the recovery process in meta manually creates sources on compute nodes during failover. This may lead to duplicated creation, since a failure may be due to network isolation, in which case the compute node still holds its sources.
This PR introduces a similar ObserverManager in the compute node. After a reboot, the observer manager subscribes to meta's notification manager and waits for a snapshot containing source information. Thus, in the recovery process, we only need to wait for all compute nodes to come back online.
Checklist
Refer to a related PR or issue link (optional)
#1215, #1428
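
To make the recovery flow in the description more concrete, here is a minimal, std-only sketch of the idea: a compute-node observer subscribes to meta, receives a snapshot first, and creates only the sources it does not already hold. This is not the code from this PR. All types and methods here (`NotificationManager`, `Notification`, `SourceInfo`, `create_source`, `handle`) are simplified, hypothetical stand-ins, and a plain channel replaces the real RPC subscription.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{channel, Receiver, Sender};

/// Hypothetical stand-in for the source description carried in a snapshot.
#[derive(Clone, Debug, PartialEq)]
enum SourceInfo {
    StreamSource(String),
    TableSource(String),
}

#[derive(Clone, Debug)]
struct Source {
    id: u32,
    info: Option<SourceInfo>,
}

/// Messages pushed from meta to subscribers (assumed shape, not the real protobuf).
#[derive(Clone, Debug)]
enum Notification {
    Snapshot(Vec<Source>),
    AddSource(Source),
}

/// Meta side: every new subscriber first receives a full snapshot.
struct NotificationManager {
    sources: Vec<Source>,
    subscribers: Vec<Sender<Notification>>,
}

impl NotificationManager {
    fn subscribe(&mut self) -> Receiver<Notification> {
        let (tx, rx) = channel();
        // `rx` is still alive here, so this send cannot fail.
        tx.send(Notification::Snapshot(self.sources.clone()))
            .expect("receiver is alive");
        self.subscribers.push(tx);
        rx
    }

    fn notify_add_source(&mut self, source: Source) {
        self.sources.push(source.clone());
        // Drop subscribers whose receiving end has gone away.
        self.subscribers
            .retain(|tx| tx.send(Notification::AddSource(source.clone())).is_ok());
    }
}

/// Compute-node side: applies notifications idempotently.
#[derive(Default)]
struct ObserverManager {
    local_sources: HashMap<u32, SourceInfo>,
}

impl ObserverManager {
    /// Returns Ok(true) if the source was created, Ok(false) if it already existed.
    fn create_source(&mut self, source: Source) -> Result<bool, String> {
        // Report a missing `info` as an error instead of unwrapping.
        let info = source
            .info
            .ok_or_else(|| format!("source {} has no info", source.id))?;
        if self.local_sources.contains_key(&source.id) {
            return Ok(false);
        }
        self.local_sources.insert(source.id, info);
        Ok(true)
    }

    /// Returns how many sources this notification actually created.
    fn handle(&mut self, notification: Notification) -> Result<usize, String> {
        match notification {
            Notification::Snapshot(sources) => {
                let mut created = 0;
                for source in sources {
                    if self.create_source(source)? {
                        created += 1;
                    }
                }
                Ok(created)
            }
            Notification::AddSource(source) => Ok(self.create_source(source)? as usize),
        }
    }
}

fn main() {
    let s1 = Source { id: 1, info: Some(SourceInfo::StreamSource("s1".to_string())) };
    let mut meta = NotificationManager { sources: vec![s1], subscribers: Vec::new() };

    // A rebooted compute node subscribes and applies the snapshot.
    let rx = meta.subscribe();
    let mut observer = ObserverManager::default();
    assert_eq!(observer.handle(rx.recv().unwrap()), Ok(1));

    // Re-subscribing (e.g. after network isolation) creates nothing twice.
    let rx_again = meta.subscribe();
    assert_eq!(observer.handle(rx_again.recv().unwrap()), Ok(0));

    // Later sources arrive as incremental notifications.
    meta.notify_add_source(Source { id: 2, info: Some(SourceInfo::TableSource("t2".to_string())) });
    assert_eq!(observer.handle(rx.recv().unwrap()), Ok(1));
}
```

In this sketch, meta never pushes source creation to a node on its own initiative during recovery. The node pulls its state by subscribing, so a node that was only isolated and still holds its sources ends up unchanged. Routing both the snapshot and incremental notifications through one `create_source` helper also reflects the review suggestion about reusing existing creation code.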