Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix:sealing-fsm:wait mutable fsm state for immutable sector upgrade error #9598

Merged
merged 7 commits into from
Nov 14, 2022

Conversation

ZenGround0
Copy link
Contributor

@ZenGround0 ZenGround0 commented Nov 7, 2022

Related Issues

#8496

Proposed Changes

WaitMutable state that nicely waits for deadline to be over before continuing

Additional Info

No tests yet so it's probably broken. But this is a good place to start from

Checklist

Before you mark the PR ready for review, please make sure that:

  • Commits have a clear commit message.
  • PR title is in the form of of <PR type>: <area>: <change being made>
    • example: fix: mempool: Introduce a cache for valid signatures
    • PR type: fix, feat, build, chore, ci, docs, perf, refactor, revert, style, test
    • area, e.g. api, chain, state, market, mempool, multisig, networking, paych, proving, sealing, wallet, deps
  • New features have usage guidelines and / or documentation updates in
  • Tests exist for new functionality or change in behavior
  • CI is green

@ZenGround0 ZenGround0 requested a review from a team as a code owner November 7, 2022 14:58
Copy link
Contributor

@magik6k magik6k left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Briefly skimming this it looks like it should work.

An itest shouldn't be that hard to write, basically start with sealing a sector, stop mining and manipulate which epoch we're at much like as is done in wdpost tests.

dlineEpochsRemaining := dlinfo.NextOpen() - ts.Height()
if sectorDeadlineOpen {
// sleep for remainder of deadline
time.Sleep(time.Duration(build.BlockDelaySecs) * time.Second * time.Duration(dlineEpochsRemaining))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should respect ctx.Context() here

@Reiers
Copy link

Reiers commented Nov 8, 2022

From last nights, stress test:

Sectors:
        Total: 52174
        Proving: 45353
        Available: 1780
        UpdateActivating: 92
        SnapDealsWaitDeals: 11
        ProveReplicaUpdate: 10
        WaitMutable: 2
        Removed: 4909

Every sector within the deadlines ended up in a WaitMutable state. Worth noting that some where active and some not.

32630  WaitMutable            YES      YES     1      
32633  WaitMutable            YES      NO      1      
32640  WaitMutable            YES      YES     1      
32641  WaitMutable            YES      YES     1      
32663  WaitMutable            YES      YES     1      
32665  WaitMutable            YES      YES     1      
32691  WaitMutable            YES      NO      1      
32695  WaitMutable            YES      YES     1      
32707  WaitMutable            YES      YES     1      
32716  WaitMutable            YES      YES     1  

This morning, the sectors that entered the state - where still in a WaitMutable state.
Active status NO:

47.	2022-11-08 02:33:47 +0100 CET:	[event;sealing.SectorFinalized]	{"User":{}}
48.	2022-11-08 02:33:47 +0100 CET:	[event;sealing.SectorDeadlineImmutable]	{"User":{}}

Active status YES:

139.	2022-11-08 02:13:05 +0100 CET:	[event;sealing.SectorFinalized]	{"User":{}}
140.	2022-11-08 02:13:05 +0100 CET:	[event;sealing.SectorDeadlineImmutable]	{"User":{}}

No difference in the sector log comparing the ones that was Active or Not.

Status right now,
Pulled and built the potential fix for the stuck sectors, every sector aborted the upgrade in a Panic:

1-08T14:05:00.038+0100	ERROR	sectors	pipeline/states_failed.go:260	sector marked for upgrade 32633 no longer active, aborting upgrade
2022-11-08T14:05:00.039+0100	WARN	sectors	pipeline/fsm.go:799	sector 32633 got error event sealing.SectorAbortUpgrade: %!v(PANIC=Error method: runtime error: invalid memory address or nil pointer dereference)
2022-11-08T14:05:00.039+0100	ERROR	sectors	pipeline/states_failed.go:260	sector marked for upgrade 32691 no longer active, aborting upgrade
2022-11-08T14:05:00.039+0100	WARN	sectors	pipeline/fsm.go:799	sector 32691 got error event sealing.SectorAbortUpgrade: %!v(PANIC=Error method: runtime error: invalid memory address or nil pointer dereference)

@Reiers
Copy link

Reiers commented Nov 10, 2022

Success ! Non stuck, no faulty, no errors:
Got about 30 more, but here are 2 sector logs:

67.	2022-11-10 03:29:39 +0100 CET:	[event;sealing.SectorFinalized]	{"User":{}}
68.	2022-11-10 03:29:39 +0100 CET:	[event;sealing.SectorDeadlineImmutable]	{"User":{}}
69.	2022-11-10 03:35:39 +0100 CET:	[event;sealing.SectorDeadlineMutable]	{"User":{}}
70.	2022-11-10 03:35:39 +0100 CET:	[event;sealing.SectorReplicaUpdateSubmitted]	{"User":{"Message":{"/":"bafy2bzaceb4tauhndqgu3ovobhl36bsysmwfwnwd5565xnzb2rtjosp5yw57u"}}}
71.	2022-11-10 03:38:59 +0100 CET:	[event;sealing.SectorReplicaUpdateLanded]	{"User":{}}
72.	2022-11-10 11:12:30 +0100 CET:	[event;sealing.SectorUpdateActive]	{"User":{}}
73.	2022-11-10 11:12:30 +0100 CET:	[event;sealing.SectorKeyReleased]	{"User":{}}
39.	2022-11-10 02:42:49 +0100 CET:	[event;sealing.SectorFinalized]	{"User":{}}
40.	2022-11-10 02:42:49 +0100 CET:	[event;sealing.SectorDeadlineImmutable]	{"User":{}}
41.	2022-11-10 03:35:49 +0100 CET:	[event;sealing.SectorDeadlineMutable]	{"User":{}}
42.	2022-11-10 03:35:49 +0100 CET:	[event;sealing.SectorReplicaUpdateSubmitted]	{"User":{"Message":{"/":"bafy2bzacebb6uzzfjj6gns63xhca4iehzywwzpazcdw3ub7zukvg6xph5cy2c"}}}
43.	2022-11-10 03:39:29 +0100 CET:	[event;sealing.SectorReplicaUpdateLanded]	{"User":{}}
44.	2022-11-10 11:13:00 +0100 CET:	[event;sealing.SectorUpdateActive]	{"User":{}}
45.	2022-11-10 11:13:00 +0100 CET:	[event;sealing.SectorKeyReleased]	{"User":{}}

@ZenGround0
Copy link
Contributor Author

@magik6k I'm not going to get around to an integration test anytime soon and @Reiers's live testing seems to check out. You think we can merge this one as it is?

}

select {
case <-time.After(waitTime):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: For integration tests we'll want to replace this sleep with events.ChainAt as in those time isn't really related to height

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@magik6k its worth you looking quickly to double check I'm using chain events right. Also @Reiers if it's not too much work running another test to make sure this didn't regress something would be nice. If it is a huge effort its probably not worth it though.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The events thing looks reasonable

Comment on lines +207 to +208
log.Errorf("handleWaitMutable: api error, not proceeding: %+v", err)
return nil
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably want to sleep a bit (~60s) and loop around, this can happen when the API is disconnected, which.. can happen

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes a ton of sense. Becuase I'm just copying this pattern from all over the place inside the fsm I think this issue is out of scope for this PR and should be handled across the board for all fsm states. Filed #9637 for this

@ZenGround0 ZenGround0 requested a review from magik6k November 14, 2022 15:51
@magik6k magik6k merged commit c79085e into master Nov 14, 2022
@magik6k magik6k deleted the fix/ancient-snap-deals-terrible branch November 14, 2022 18:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants