Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(failover): Recover rebooted compute nodes with notification and snapshot #1500

Merged
merged 8 commits into from
Apr 1, 2022

Conversation

zbzbw
Copy link
Contributor

@zbzbw zbzbw commented Apr 1, 2022

What's changed and what's your intention?

In #1215 and #1428, the recovery process in meta will manually create sources on compute nodes during failover. This may lead to duplicated creation since failure may due to network isolation.
This PR introduced a similar ObserverManager in compute node. After reboot, the observer manager will subscribe to meta's notification manager, and wait for a snapshot containing source information. Thus in recovery process, we only need to wait for all compute node being back online.

Checklist

  • I have written necessary docs and comments
  • I have added necessary unit tests and integration tests

Refer to a related PR or issue link (optional)

#1215, #1428

Bowen Zhou added 7 commits March 31, 2022 16:44
Signed-off-by: Bowen Zhou <[email protected]>
…o zbw/cn_recovery_with_notification

Signed-off-by: Bowen Zhou <[email protected]>
Signed-off-by: Bowen Zhou <[email protected]>
Signed-off-by: Bowen Zhou <[email protected]>
Signed-off-by: Bowen Zhou <[email protected]>
@codecov
Copy link

codecov bot commented Apr 1, 2022

Codecov Report

Merging #1500 (e652a93) into main (9b9ea94) will decrease coverage by 0.09%.
The diff coverage is 0.00%.

@@             Coverage Diff              @@
##               main    #1500      +/-   ##
============================================
- Coverage     70.54%   70.44%   -0.10%     
  Complexity     2766     2766              
============================================
  Files          1029     1031       +2     
  Lines         90341    90472     +131     
  Branches       1790     1790              
============================================
+ Hits          63729    63737       +8     
- Misses        25721    25844     +123     
  Partials        891      891              
Flag Coverage Δ
java 61.01% <ø> (ø)
rust 72.45% <0.00%> (-0.12%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
rust/compute/src/server.rs 0.00% <0.00%> (ø)
rust/frontend/src/observer/observer_manager.rs 0.00% <0.00%> (ø)
rust/meta/src/barrier/mod.rs 79.67% <ø> (-0.17%) ⬇️
rust/meta/src/barrier/recovery.rs 0.00% <0.00%> (ø)
rust/meta/src/manager/catalog_v2.rs 0.00% <0.00%> (-19.65%) ⬇️
rust/meta/src/rpc/server.rs 0.00% <ø> (ø)
rust/meta/src/rpc/service/notification_service.rs 0.00% <0.00%> (ø)
rust/meta/src/stream/stream_manager.rs 75.72% <ø> (-0.12%) ⬇️
rust/source/src/manager.rs 82.88% <0.00%> (-6.71%) ⬇️
rust/stream/src/task/mod.rs 54.32% <ø> (ø)
... and 30 more

📣 Codecov can now indicate which changes are the most critical in Pull Requests. Learn more

Copy link
Member

@yezizp2012 yezizp2012 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good job, LGTM except the naming.

@@ -296,6 +300,7 @@ message SubscribeResponse {
catalog.Table table_v2 = 10;
catalog.Source source = 11;
MetaSnapshot fe_snapshot = 12;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will it be better to name these as SnapshotForFE and SnapshotForBE?

Copy link
Contributor Author

@zbzbw zbzbw Apr 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about just calling it FrontendSnapshot and BackendSnapshot 🤔

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure, or just keep it until we should produce more info for backend.

@yezizp2012 yezizp2012 requested a review from HuaHuaY April 1, 2022 03:59
Signed-off-by: Bowen Zhou <[email protected]>
@zbzbw zbzbw enabled auto-merge (squash) April 1, 2022 04:09
@BugenZhao
Copy link
Member

BugenZhao commented Apr 1, 2022

Should we also use the notification service to create sources on compute nodes, like we did in catalog?

@yezizp2012
Copy link
Member

Should we also use the notification service to create sources on compute nodes, like we did in catalog?

Yes, I will do this after this PR merged.

Copy link
Contributor

@HuaHuaY HuaHuaY left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Copy link
Member

@BugenZhao BugenZhao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally LGTM

Comment on lines +187 to +188
for source in snapshot.sources {
match source.info.unwrap() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may reuse the code from the create_source in stream manager.

@zbzbw zbzbw merged commit 4572418 into main Apr 1, 2022
@zbzbw zbzbw deleted the zbw/cn_recovery_with_notification branch April 1, 2022 04:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants