Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

wdPoSt scheduler notifs channel closes when lotus-miner is under heavy load #8362

Closed
10 of 18 tasks
Tracked by #10338
phoenixzhua opened this issue Mar 23, 2022 · 54 comments · Fixed by filecoin-project/go-jsonrpc#97 or #10395
Closed
10 of 18 tasks
Tracked by #10338
Assignees
Labels
area/proving Area: Proving kind/bug Kind: Bug kind/enhancement Kind: Enhancement P2 P2: Should be resolved

Comments

@phoenixzhua
Copy link

Checklist

  • This is not a security-related bug/issue. If it is, please follow please follow the security policy.
  • This is not a question or a support request. If you have any lotus related questions, please ask in the lotus forum.
  • This is not a new feature request. If it is, please file a feature request instead.
  • This is not an enhancement request. If it is, please file a improvement suggestion instead.
  • I have searched on the issue tracker and the lotus forum, and there is no existing related issue or discussion.
  • I am running the Latest release, or the most recent RC(release canadiate) for the upcoming release or the dev branch(master), or have an issue updating to any of these.
  • I did not make any code changes to lotus.

Lotus component

  • lotus daemon - chain sync
  • lotus miner - mining and block production
  • lotus miner/worker - sealing
  • lotus miner - proving(WindowPoSt)
  • lotus miner/market - storage deal
  • lotus miner/market - retrieval deal
  • lotus miner/market - data transfer
  • lotus client
  • lotus JSON-RPC API
  • lotus message management (mpool)
  • Other

Lotus Version

Daemon:  1.14.2+mainnet+git.a65386b49.dirty+api1.4.0
Local: lotus-miner version 1.14.2+mainnet+git.a65386b49.dirty

Describe the Bug

Before the network version v15 upgrade, my miner and daemon are on the same box. (only do the Finalize to get the final ~ 32GB data to long-term storage.)

  • CPU: 2 * EPYC 7402
  • MEM: 12 * 64GB
  • GPU: 2 * RTX 3090
  • OS: Ubuntu 20.04.3 LTS on DELL-MZ7LH960HBJR0D3 960GB
  • NVMe: Intel P4510 8TB

Everything goes well for 30+ days. We can seal about 500 sectors one day.

After the network version v15 upgrade, when I start sealing, the connection between miner and daemon will timout and close, and window post will fail.

Then I seprate daemon and miner to different boxes. When there is no sealing jobs, window post goes well for days.
But when I start sealing jobs for about 16-18 hours, the connection between miner and daemon will timout and close, and window post will fail.

It seems the miner box is not overload except the heavy buff/cache memory, more than 600GiB

Logging Information

{"level":"error","ts":"2022-03-23T03:56:26.448+0800","logger":"rpc","caller":"[email protected]/websocket.go:667","msg":"Connection timeout","remote":"10.0.20.127:1234"}
{"level":"error","ts":"2022-03-23T03:56:26.449+0800","logger":"rpc","caller":"[email protected]/websocket.go:648","msg":"sendReqest failed (Handle me): write tcp 10.0.20.121:38742->10.0.20.127:1234 use of closed network connection"}
{"level":"error","ts":"2022-03-23T03:56:26.449+0800","logger":"rpc","caller":"[email protected]/websocket.go:498","msg":"sending ping message: write tcp 10.0.20.121:38742->10.0.20.127:1234 use of closed network connection"}
{"level":"error","ts":"2022-03-23T03:56:26.449+0800","logger":"rpc","caller":"[email protected]/websocket.go:648","msg":"sendReqest failed (Handle me): write tcp 10.0.20.121:38742->10.0.20.127:1234 use of closed network connection"}
{"level":"error","ts":"2022-03-23T03:56:26.449+0800","logger":"rpc","caller":"[email protected]/websocket.go:97","msg":"setting read deadlineset tcp 10.0.20.121:38742: use of closed network connection"}
{"level":"error","ts":"2022-03-23T03:56:26.449+0800","logger":"rpc","caller":"[email protected]/websocket.go:648","msg":"sendReqest failed (Handle me): write tcp 10.0.20.121:38742->10.0.20.127:1234 use of closed network connection"}
{"level":"warn","ts":"2022-03-23T03:56:26.449+0800","logger":"advmgr","caller":"sector-storage/sched_worker.go:215","msg":"failed to check worker session","error":"RPC client error: sendRequest failed: Post \"http://10.0.20.128:4005/rpc/v0\": context deadline exceeded"}
{"level":"warn","ts":"2022-03-23T03:56:26.449+0800","logger":"advmgr","caller":"sector-storage/sched_worker.go:215","msg":"failed to check worker session","error":"RPC client error: sendRequest failed: Post \"http://10.0.20.128:4006/rpc/v0\": context deadline exceeded"}
{"level":"error","ts":"2022-03-23T03:56:26.449+0800","logger":"events","caller":"events/events_called.go:478","msg":"messagesForTs MessagesForBlock failed (ts.H=1655751, Bcid:bafy2bzacecidmf5omr6gigjcxom6kyzm24xytjon2lyvw3lewt7pvrswjd3z6, B.Mcid:bafy2bzacecheqipbxrgxbbg5ysw6pabjyfgf6qp36bf5kmo4rvjw6qtabtdba): handler: websocket connection closed"}
{"level":"error","ts":"2022-03-23T03:56:26.449+0800","logger":"miner","caller":"miner/miner.go:233","msg":"failed to get best mining candidate: handler: websocket connection closed"}
{"level":"warn","ts":"2022-03-23T03:56:26.449+0800","logger":"storageminer","caller":"storage/wdpost_sched.go:124","msg":"window post scheduler notifs channel closed"}

Repo Steps

  1. Run 'lotus daemon and lotus-miner on different boxes'
  2. Do 'sectors pledge script every 3 minutes for 16-18 hours'
  3. See error 'window post scheduler notifs channel closed'
@rjan90
Copy link
Contributor

rjan90 commented Mar 24, 2022

Hi @phoenixzhua!

Thanks for the report. I will add labels and will assign it to the right team for analysis.
We might tag you again for more information.

@rjan90 rjan90 added area/proving Area: Proving kind/enhancement Kind: Enhancement and removed area/proving Area: Proving labels Mar 24, 2022
@rjan90
Copy link
Contributor

rjan90 commented Mar 24, 2022

Adding some additional information here based on the Slack-thread:

  • It seems like the websocket closes when there is a lot of Finalize/GET-tasks running simultaneously. And if a lot these jobs are running when you are going to do windowPoSt, you will get window post scheduler notifs channel closed and end up not computing the windowPoSt for that deadline/partition.

@phoenixzhua
Copy link
Author

Adding some additional information here based on the Slack-thread:

  • It seems like the websocket closes when there is a lot of Finalize/GET-tasks running simultaneously. And if a lot these jobs are running when you are going to do windowPoSt, you will get window post scheduler notifs channel closed and end up not computing the windowPoSt for that deadline/partition.

If a lot these Finalize/GET-tasks are running when doing windowPoSt, I can see the computing the windowPoSt for current deadline has finished verify_window_post: finish in the rust log after window post scheduler notifs channel closed appears in the miner log.
But the SubmitWindowedPoSt message does not landed on chain, and next deadline windowPoSt does not start because I can't see something like this "logger":"storageminer","caller":"storage/wdpost_run.go:229","msg":"Checked sectors","checked":2349,"good":2349 in the miner log.

Some more infromation:
I can see lots of context deadline exceeded error in the miner log before window post scheduler notifs channel closed.

{"level":"warn","ts":"2022-03-23T03:50:21.046+0800","logger":"advmgr","caller":"sector-storage/sched_worker.go:215","msg":"failed to check worker session","error":"RPC client error: sendRequest failed: Post \"http://10.0.20.122:2003/rpc/v0\": context deadline exceeded"}
{"level":"warn","ts":"2022-03-23T03:50:21.046+0800","logger":"advmgr","caller":"sector-storage/sched_worker.go:215","msg":"failed to check worker session","error":"RPC client error: sendRequest failed: Post \"http://10.0.20.123:4006/rpc/v0\": context deadline exceeded"}
{"level":"warn","ts":"2022-03-23T03:50:21.046+0800","logger":"advmgr","caller":"sector-storage/sched_worker.go:215","msg":"failed to check worker session","error":"RPC client error: sendRequest failed: Post \"http://10.0.20.128:4006/rpc/v0\": context deadline exceeded"}
{"level":"warn","ts":"2022-03-23T03:50:21.046+0800","logger":"advmgr","caller":"sector-storage/sched_worker.go:215","msg":"failed to check worker session","error":"RPC client error: sendRequest failed: Post \"http://10.0.20.124:2003/rpc/v0\": context deadline exceeded"}
{"level":"warn","ts":"2022-03-23T03:50:21.088+0800","logger":"miner","caller":"miner/miner.go:477","msg":"completed mineOne","tookMilliseconds":20,"forRound":1655741,"baseEpoch":1655740,"baseDeltaSeconds":21,"nullRounds":0,"lateStart":true,"beaconEpoch":1751585,"lookbackEpochs":900,"networkPowerAtLookback":"18354297485952450560","minerPowerAtLookback":"2768364120309760","isEligible":true,"isWinner":false,"error":null}
{"level":"warn","ts":"2022-03-23T03:50:33.215+0800","logger":"advmgr","caller":"sector-storage/sched_worker.go:215","msg":"failed to check worker session","error":"RPC client error: sendRequest failed: Post \"http://10.0.20.123:4002/rpc/v0\": context deadline exceeded"}

@rjan90 rjan90 self-assigned this Mar 28, 2022
@cryptowhizzard
Copy link

cryptowhizzard commented Apr 12, 2022

This tends to occur under heavy load.

@phoenixzhua do you have swap configured and some statistics on this machine (load metrics) just before and after the error occurs? ( Grafana can give you nice info if you have that running )

Cpu IO and Wasted are good metrics. You can find these with the top command in linux.
Also disk metrics ( iostat -x 1 -m ) to see what the disks are doing where lotus and lotus-miner resides and where (if you have) swap resides.

Do you have NFS mounts configured on the machine? If so, do you see anything usefull in dmesg about NFS timeout errors?

@rjan90
Copy link
Contributor

rjan90 commented Apr 13, 2022

Hey!

So an update to this is that the devs are hopeful that this issue will be fixed by the upcoming PoSt workers and improved scheduling logic, which will be included in the v1.15.2 release (coming as a release candidate later today).

The PoSt workers will enable you to read the sealed sectors from local storage, or from any remote worker, which should reduce the load on the lotus-miner which is currently causing the channel to close. I will publish the PoSt worker docs later today, but here is a small preview

When a winningPoSt or windowPoSt worker connects to the lotus-miner, it will delegate all winningPoSt or windowPoSt tasks to that worker.
If both tasks are delegated to separate PoSt workers, the lotus-miner will execute no PoSt tasks locally on the miner instance. If a worker is stopped, the lotus-miner instance automatically switches back to local PoSt.

The windowPoSt process requires reading random leaves of all the sealed sectors in a proving deadline. When setting up windowPoSt workers, one needs to consider how the workers can access those files.
The PoSt workers can ask any other worker to read challenges for them, including the lotus-miner process, but it will prefer reading it from local paths.

@rjan90 rjan90 added the P2 P2: Should be resolved label Apr 20, 2022
@rjan90 rjan90 changed the title window post fail when sealing sectors wdPoSt scheduler notifs channel closes when lotus-miner is under heavy load Apr 20, 2022
@cryptowhizzard
Copy link

cryptowhizzard commented Apr 21, 2022

Following up running on 1.15.2-rc1 with a Winningpost worker attached and Windowpost worker attached the problem persists.

2022-04-20T07:44:41.398Z ERROR events events/events_called.go:352 event diff fn failed: handler: websocket connection closed
2022-04-20T07:44:41.402Z ERROR events events/events_called.go:478 messagesForTs MessagesForBlock failed (ts.H=1737807, Bcid:bafy2bzacec5wg5g2i4p43kcylslchj25plens6a5fhe7ot3tbgeddq5ikimyi, B.Mcid:bafy2bzaceclj7gmwplxtaj2dv3lulcxqoojlv4o6223j6dojdpacebxcsylxo): handler: websocket connection closed
2022-04-20T07:44:41.386Z WARN storageminer storage/wdpost_sched.go:124 window post scheduler notifs channel closed
2022-04-20T07:44:41.386Z DEBUG rcmgr [email protected]/scope.go:470 blocked connection from constraining edge {"scope": "conn-100213", "edge": "transient", "direction": "Inbound", "usefd": true, "stat": {"NumStreamsInbound":2,"NumStreamsOutbound":0,"NumConnsInbound":32,"NumConnsOutbound":0,"NumFD":32,"Memory":0}, "error": "transient: cannot reserve connection: resource limit exceeded"}
2022-04-20T07:44:04.931Z WARN advmgr sector-storage/sched_worker.go:209 failed to check worker session {"error": "RPC client error: sendRequest failed: Post "http://--------:2349/rpc/v0": context deadline exceeded"}
2022-04-20T07:44:41.409Z DEBUG advmgr sector-storage/sched.go:361 SCHED 9 queued; 80 open windows
2022-04-20T07:44:03.657Z WARN advmgr sector-storage/sched_post.go:177 failed to check worker session {"error": "RPC client error: sendRequest failed: Post "http://--------:2302/rpc/v0": context deadline exceeded"}
2022-04-20T07:44:41.395Z WARN advmgr sector-storage/sched_worker.go:209 failed to check worker session {"error": "RPC client error: sendRequest failed: Post "http://10.0.18.13:2349/rpc/v0\": context deadline exceeded"}

@rjan90
Copy link
Contributor

rjan90 commented Apr 21, 2022

Following up running on 1.15.2-rc1 with a Winningpost worker attached and Windowpost worker attached the problem persists.

@cryptowhizzard Can you clarify a bit if these PoSt workers used remote storage access to the sealed sectors, or if they had local paths to the sealed sectors on the servers? https://lotus.filecoin.io/storage-providers/seal-workers/post-workers/#remote-storage-access

@cryptowhizzard
Copy link

@rjan90 yes the workers are configured with dedicated access to the storage.

This issue causes the wpost scheduler not to commence at all so the post workers will not be involved.

@rjan90
Copy link
Contributor

rjan90 commented Apr 21, 2022

The reason why I´m asking is that we need a very detailed, and hopefully an easy way to get a good repro (reliable/easy/fast) on this issue. Just posting logs does not necessarily help us move forward with this issue.

So all the extra information everybody can provide about how their setup looks like, how many sectors are currently finilized when this issue happens, and so forth are very much needed for us to get to the bottom of it.

I have closed the duplicate issues, and will be tracking everything in this thread:

  1. Ideally we are looking for a very detailed, and an easy way to get a good repro (reliable/easy/fast) on this issue - so if anybody here has some insights to how we can easily repro it, please post the steps here.
  2. Goroutine dumps from both sides (lotus daemon & lotus-miner) could be helpful, but ideally we are looking for 1.

@f8-ptrk
Copy link
Contributor

f8-ptrk commented Apr 22, 2022

#5614 maybe look at this and check why the fix implemented doesn't work....

as the websockets obviously do not reconnect - @rjan90 you know as good as all of us that reliable/easy/fast is not what will happen here. it will take days or weeks to get a miner purposefully into a state to force this issue.

best bet in my eyes: 8 core, 256 ram, run chain and miner + pc2 worker and make sure it does as much work as possible - some APs, maybe a markets import in addition, some transfers - at some point this will occur

@f8-ptrk
Copy link
Contributor

f8-ptrk commented Apr 22, 2022

@cryptowhizzard reported this in v0.7.0 and it seems to be still around! #3830

might just be the same symptoms and a different disease

@rjan90
Copy link
Contributor

rjan90 commented Apr 22, 2022

@f8-ptrk It is suspected that it's either a go-jsonrpc bug or something in Lotus that uses it wrong! Meaning it could be really hard to uncover and therefore it's unfortunately not as easy as just going through an old PR and see why that did not fix the actual underlying issue, which might be in a dependency.

reliably/easy/fast is subjective, but repro-steps better then finalize a lot of sectors and general log output of the error will help.

@cryptowhizzard
Copy link

Giving this a second thought i got a hint from someone else.

The problem only occurs (here) when we import a lot of data ( Straight from CLI with lotus import ). Looking closer on my machines i had a few machines where ran scripts with the lotus-miner storage-deals import commands in parallel.

Looking in the log i see some notices about recourses not available ( FD' s ) ->

ERROR events events/events_called.go:352 event diff fn failed: handler: websocket connection closed
ERROR events events/events_called.go:478 messagesForTs MessagesForBlock failed (ts.H=1737807, Bcid:bafy2bzacec5wg5g2i4p43kcylslchj25plens6a5fhe7ot3tbgeddq5ikimyi, B.Mcid:bafy2bzaceclj7gmwplxtaj2dv3lulcxqoojlv4o6223j6dojdpacebxcsylxo): handler: websocket connection closed
2022-04-20T07:44:41.402Z WARN storageminer storage/wdpost_sched.go:124 window post scheduler notifs channel closed
022-04-20T07:44:41.386Z DEBUG rcmgr [email protected]/scope.go:470 blocked connection from constraining edge {"scope": "conn-100213", "edge": "transient", "direction": "Inbound", "usefd": true, "stat": {"NumStreamsInbound":2,"NumStreamsOutbound":0,"NumConnsInbound":32,"NumConnsOutbound":0,"NumFD":32,"Memory":0}, "error": "transient: cannot reserve connection: resource limit exceeded"}

This begs the questions , can parallel import cause resource restrains for the miner?

@Aloxaf
Copy link
Contributor

Aloxaf commented Apr 27, 2022

Have the same problem, window post scheduler keep restarting...

I modified the code to let the lotus-worker finalize the sector to storage path. But I still get this error.

@cryptowhizzard
Copy link

@hasanalaref @Aloxaf ... are you guys running with a separate market node or is this lotus-miner in monolith?

@Aloxaf
Copy link
Contributor

Aloxaf commented Apr 28, 2022

@cryptowhizzard In monolith.

I finally solved it by add a separate WindowPoSt worker.

@magik6k
Copy link
Contributor

magik6k commented Sep 28, 2022

That seems to point at an issue in go-jsonrpc, possibly on the server side - maybe in the websocket library something is single-threaded / blocking on large messages?

  • It would be really helpful to get both side logs with debug log level on the rpc logger.
  • It would also be very useful to see goroutine dumps on both sides while the rpc output message buffer are getting printed

@rjan90
Copy link
Contributor

rjan90 commented Sep 28, 2022

Sent you rpc debug logs from both sides, as well as goroutine dumps on Slack!

@magik6k
Copy link
Contributor

magik6k commented Feb 28, 2023

There was a good amount of improvements made to connection robustness in go-jsonrpc v0.2.0, and lotus v0.20.0 / master is now at go-jsonrpc v0.2.1, so it would be good to see if this issue has improved.

If it still happens, the new go-jsonrpc also added better debugging logs that would be very helpful in nailing down what the issue is here:

  • Setting the rpc logger to debug will enable most of those logs
  • JSONRPC_ENABLE_DEBUG_TRACE=1, together with the rpc logger in debug will enable very verbose rpc tracing which could tell us what exactly is happening
  • ^ those would be useful on both lotus and lotus-miner

@rjan90 rjan90 added this to the v1.21.0 milestone Feb 28, 2023
@rjan90
Copy link
Contributor

rjan90 commented Feb 28, 2023

I re-ran the repro of calling Filecoin.ListMarketDeal while running a lotus-miner proving compute window-post <deadline> on master, and where able to successfully compute the windowPoSt now, without any window post scheduler notifs channel closed or rpc output message buffer logs - just normal logs.

That makes me cautiously optimistic that it is fixed, but will await some further confirmations 👀

If it still happens, the new go-jsonrpc also added better debugging logs that would be very helpful in nailing down what the issue is here:

Setting the rpc logger to debug will enable most of those logs
JSONRPC_ENABLE_DEBUG_TRACE=1, together with the rpc logger in debug will enable very verbose rpc tracing which could tell us what exactly is happening
^ those would be useful on both lotus and lotus-miner

Will add these to the knowledge base, so we can point users there in case they meet any jsonrpc issues in the future

@magik6k
Copy link
Contributor

magik6k commented Mar 6, 2023

Should be closed after #10395 lands

@magik6k magik6k reopened this Mar 6, 2023
@rjan90 rjan90 moved this from 👀 In Review to ✅ Done - v1.21.0 in Lotus-Miner-V2 Mar 7, 2023
@RobQuistNL
Copy link
Contributor

The channel itself is not getting closed, but wdpost workers don't work in 1.20.3.

The workers do the work, and when they are done, they fail some random way (context canceled is not really... descriptive)

Then the miner itself tries to do it, but obviously is too late.

17:09:30.147Z wdpost/wdpost_run.go:260","msg":"starting PoSt cycle","cycle":"2023-03-11T17:09:30.147Z","manual":false,"ts":"[bafy2bzacebmdvyya4fqadp76p7bo75qdx53sh35qaboazrl6ncge3x73y2le6]","deadline":25}
17:09:30.147Z wdpost/wdpost_run.go:412","msg":"computing window post","cycle":"2023-03-11T16:44:00.329Z","batch":0,"elapsed":603.633691775,"skip":0,"err":"partitionCount:1 err:RPC client error: sendRequest failed: Post \"http://172.16.2.10:45810/rpc/v0\": context canceled","errVerbose":"partitionCount:1 err:RPC client error: sendRequest failed: Post \"http://172.16.2.10:45810/rpc/v0\": context canceled:\n    github.com/filecoin-project/lotus/storage/sealer.(*Manager).generateWindowPoSt.func2\n        /home/filecoinnew/lotus/storage/sealer/manager_post.go:193"}
17:09:30.147Z wdpost/wdpost_run.go:414","msg":"error generating window post: partitionCount:1 err:RPC client error: sendRequest failed: Post \"http://172.16.2.10:45810/rpc/v0\": context canceled","cycle":"2023-03-11T16:44:00.329Z"}
17:09:30.147Z wdpost/wdpost_run.go:262","msg":"post cycle done","cycle":"2023-03-11T16:44:00.329Z","took":1529.817936669}
17:09:30.147Z wdpost/wdpost_run.go:98","msg":"runPoStCycle failed: running window post failed:\n    github.com/filecoin-project/lotus/storage/wdpost.(*WindowPoStScheduler).runPoStCycle\n        /home/filecoinnew/lotus/storage/wdpost/wdpost_run.go:475\n  - partitionCount:1 err:RPC client error: sendRequest failed: Post \"http://172.16.2.10:45810/rpc/v0\": context canceled:\n    github.com/filecoin-project/lotus/storage/sealer.(*Manager).generateWindowPoSt.func2\n        /home/filecoinnew/lotus/storage/sealer/manager_post.go:193"}
17:09:30.147Z wdpost/wdpost_changehandler.go:254","msg":"Aborted window post Proving (Deadline: &{CurrentEpoch:2674888 PeriodStart:2673439 Index:24 Open:2674879 Close:2674939 Challenge:2674859 FaultCutoff:2674809 WPoStPeriodDeadlines:48 WPoStProvingPeriod:2880 WPoStChallengeWindow:60 WPoStChallengeLookback:20 FaultDeclarationCutoff:70})"}

@RobQuistNL
Copy link
Contributor

WDPost worker explodes with this error;

{"level":"error","ts":"2023-03-11T18:39:29.811Z","logger":"stores","caller":"paths/local.go:806","msg":"failed to generate valilla PoSt proof before context cancellation","err":"context canceled","duration":158.067953517,"cache-id":"xxx-storage2","sealed-id":"xxx-storage2","cache":"/mnt/xxx/storage2/cache/s-t0xxx-176645","sealed":"/mnt/xxx/storage2/sealed/s-t0xxx-176645"}

and

{"level":"error","ts":"2023-03-11T18:39:29.813Z","logger":"advmgr","caller":"sealer/worker_local.go:689","msg":"reading PoSt challenge for sector 176703, vlen:0, err: failed to generate vanilla proof before context cancellation: context canceled"}

@rjan90
Copy link
Contributor

rjan90 commented Mar 11, 2023

@RobQuistNL The fixes to the channel closed issues are not in v1.20.3, but are coming in v1.21.0 which updates go-jsonrpc to v0.2.2:

The v1.21.0 release will also include quite a bit of fixes for wdPoSt worker assigning logic

@cryptowhizzard
Copy link

@RobQuistNL The fixes to the channel closed issues are not in v1.20.3, but are coming in v1.21.0 which updates go-jsonrpc to v0.2.2:

The v1.21.0 release will also include quite a bit of fixes for wdPoSt worker assigning logic

@RobQuistNL

Rjan90 is not on the same page here. There are new thing introduced in lotus-miner.

Your window breaks because the default timeout for vanilla is 600 secs now. That is way to low. After setting it higher here the problem resolved. I set it to 22 minutes here.

cheers!

[Proving]
#ParallelCheckLimit = 128
SingleCheckTimeout = "22m0s"
#PartitionCheckTimeout = "20m0s"
DisableBuiltinWindowPoSt = false
DisableBuiltinWinningPoSt = false

#DisableWDPoStPreChecks = false
#MaxPartitionsPerPoStMessage = 0
#MaxPartitionsPerRecoveryMessage = 0
#SingleRecoveringPartitionPerPostMessage = false

@RobQuistNL
Copy link
Contributor

RobQuistNL commented Mar 11, 2023

Actually, you know what resolved it?

Removing the long-term storage from my PC1 worker machines.

Makes no sense, or does it? #10454

@donkabat
Copy link

@RobQuistNL
I use below setting for config with wdpost workers and long storage workers for several months without any problems:

[Proving]
ParallelCheckLimit = 66
SingleCheckTimeout = "20m0s"
#PartitionCheckTimeout = "20m0s"

DisableBuiltinWindowPoSt = true
DisableBuiltinWinningPoSt = true
DisableWDPoStPreChecks = true

and wdpost workers run with --post-parallel-reads=66

@RobQuistNL
Copy link
Contributor

I think the problem is that somehow workers request where a sector is (despite it never moving) through the miner, which is then scanning through thousands of listed items, and then continues to fetch through the miner, which is slowing things down.

Lotus' storage management is probably one of the major culprits. E.g. #8783

@RobQuistNL
Copy link
Contributor

I haven't changed any of those settings, and it simply boils down to this:
Have the PC1 workers attached to long term storage -> wdposts take too long & fail, miner has constant network usage, high CPU, (each pc2 worker has a constant 120mb/s network usage), scheduler hangs, lotus-miner sealing jobs takes 4 minutes to return, sched-diag takes 5 minutes

Remove the long term storage from the PC1 workers -> all works OK (except that sealed sectors are now sent over HTTP through FETCH jobs instead of just moving quickly over NFS shares)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/proving Area: Proving kind/bug Kind: Bug kind/enhancement Kind: Enhancement P2 P2: Should be resolved
Projects
Status: No status
12 participants