
Scheduler does not honour available disk space on long term storage #8566

Closed
8 of 18 tasks
Tracked by #10338
RobQuistNL opened this issue Apr 28, 2022 · 13 comments · Fixed by #10356
Labels
area/sealing need/analysis Hint: Needs Analysis

Comments

@RobQuistNL
Contributor

RobQuistNL commented Apr 28, 2022

Checklist

  • This is not a security-related bug/issue. If it is, please follow the security policy.
  • This is not a question or a support request. If you have any lotus related questions, please ask in the lotus forum.
  • This is not a new feature request. If it is, please file a feature request instead.
  • This is not an enhancement request. If it is, please file an improvement suggestion instead.
  • I have searched on the issue tracker and the lotus forum, and there is no existing related issue or discussion.
  • I am running the latest release, the most recent RC (release candidate) for the upcoming release, or the dev branch (master), or have an issue updating to any of these.
  • I did not make any code changes to lotus.

Lotus component

  • lotus daemon - chain sync
  • lotus miner - mining and block production
  • lotus miner/worker - sealing
  • lotus miner - proving(WindowPoSt)
  • lotus miner/market - storage deal
  • lotus miner/market - retrieval deal
  • lotus miner/market - data transfer
  • lotus client
  • lotus JSON-RPC API
  • lotus message management (mpool)
  • Other

Lotus Version

1.15.2-rc2

Describe the Bug

When running a lotus-worker with all task flags set to false - so we just have these tasks:

taskTypes = append(
  taskTypes, sealtasks.TTFetch, 
  sealtasks.TTCommit1, 
  sealtasks.TTProveReplicaUpdate1, 
  sealtasks.TTFinalize, 
  sealtasks.TTFinalizeReplicaUpdate
)

and with only long-term storage attached, the worker gets mass-assigned GET tasks, even though there are free workers with more free disk space.

Logging Information

-

Repo Steps

  1. Run worker on storage machine
  2. See all GETs be assigned to one or two hosts
  3. See scheduler does not honour free disk space on those attached storages
@RobQuistNL
Contributor Author

@magik6k this might be one of the only remaining issues in the scheduler logic

@rjan90 rjan90 added need/analysis Hint: Needs Analysis and removed kind/bug Kind: Bug labels Apr 29, 2022
@RobQuistNL
Contributor Author

For me this has quite high priority - apart from the fact that:

  • Load does not get balanced
  • WDPost load does not get balanced

it's keeping all your eggs in one basket.

stor-13:/storage0      162T   33G  162T   1% /mnt/stor-13/storage0
stor-13:/storage1      162T   42G  162T   1% /mnt/stor-13/storage1
stor-13:/storage2      162T   36G  162T   1% /mnt/stor-13/storage2
stor-13:/storage3      162T   76G  162T   1% /mnt/stor-13/storage3
stor-13:/storage4      162T   34G  162T   1% /mnt/stor-13/storage4
stor-13:/storage5      162T   33G  162T   1% /mnt/stor-13/storage5
stor-14:/storage0      162T  386G  162T   1% /mnt/stor-14/storage0
stor-14:/storage1      162T  355G  162T   1% /mnt/stor-14/storage1
stor-14:/storage2      162T  353G  162T   1% /mnt/stor-14/storage2
stor-14:/storage3      162T  356G  162T   1% /mnt/stor-14/storage3
stor-14:/storage4      162T  355G  162T   1% /mnt/stor-14/storage4
stor-14:/storage5      162T  333G  162T   1% /mnt/stor-14/storage5
stor-15:/storage0      162T   35G  162T   1% /mnt/stor-15/storage0
stor-15:/storage1      162T   34G  162T   1% /mnt/stor-15/storage1
stor-15:/storage2      162T  134G  162T   1% /mnt/stor-15/storage2
stor-15:/storage3      162T   33G  162T   1% /mnt/stor-15/storage3
stor-15:/storage4      162T   48G  162T   1% /mnt/stor-15/storage4
stor-15:/storage5      162T  129G  162T   1% /mnt/stor-15/storage5

Because stor-14 booted / registered first (or last) with the miner, that one's getting all the FETCH jobs.

All we need is round-robin task queueing - that would fix all of these issues (and possibly other scheduling issues too)
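The round-robin assignment suggested here could be sketched as follows. This is an illustrative model only - the `roundRobin` type, its methods, and the worker names are invented for the example and are not part of the lotus scheduler:

```go
package main

import "fmt"

// roundRobin hands out workers in a fixed rotation, so consecutive
// fetch tasks land on different hosts instead of piling onto one.
type roundRobin struct {
	workers []string
	next    int
}

// pick returns the next worker in the rotation.
func (r *roundRobin) pick() string {
	w := r.workers[r.next%len(r.workers)]
	r.next++
	return w
}

func main() {
	rr := &roundRobin{workers: []string{"stor-13", "stor-14", "stor-15"}}
	for i := 0; i < 6; i++ {
		fmt.Println(rr.pick()) // each host is hit once per full rotation
	}
}
```

With three workers, six consecutive picks touch each host exactly twice, so no single storage machine absorbs all the FETCH jobs.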

@rjan90
Contributor

rjan90 commented Sep 15, 2022

So I was able to reproduce this on a local network, but it takes a bit of configuration. I'm happy to hand you the login credentials to this server @shrenujbansal, so you can see the issue faster and hopefully find and confirm a potential fix 😄 It's based on the most recent master (lotus version 1.17.2-dev+2k+git.4e830a8c3). The steps to get to the issue are:

Create a local network.

Create 3 tmpfs with 100M that the storage only lotus-workers will use:
mkdir /root/storage-worker-1 && mount -t tmpfs -o size=100M tmpfs /root/storage-worker-1
mkdir /root/storage-worker-2 && mount -t tmpfs -o size=100M tmpfs /root/storage-worker-2
mkdir /root/storage-worker-3 && mount -t tmpfs -o size=100M tmpfs /root/storage-worker-3

Initialize the 3 storage-only lotus-workers (I set up a screen session for each, for easier management)
First storage-only lotus-worker (uses the default .lotusworker path)

  1. export MINER_API_INFO=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJBbGxvdyI6WyJyZWFkIiwid3JpdGUiLCJzaWduIiwiYWRtaW4iXX0.tRu0gcsde8KxZeEueVtkXo4Q2GxYanIIrkMowW0MGic:/ip4/127.0.0.1/tcp/2345/http

  2. nohup lotus-worker run --name=storage-only-worker-1 --no-local-storage=true --no-default=true > ~/storage-only-lotusworker1.log 2>&1 &

  3. lotus-worker storage attach --init --store /root/storage-worker-1

  4. Rename "id" in sectorstore.json in /root/storage-worker-1 to "storage-only-lotus-worker-1" for easier understanding of storage list. Restart worker.

Second storage-only lotus-worker (uses /root/storagelotusworker2)

  1. export LOTUS_WORKER_PATH=/root/storagelotusworker2
  2. export MINER_API_INFO=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJBbGxvdyI6WyJyZWFkIiwid3JpdGUiLCJzaWduIiwiYWRtaW4iXX0.tRu0gcsde8KxZeEueVtkXo4Q2GxYanIIrkMowW0MGic:/ip4/127.0.0.1/tcp/2345/http
  3. nohup lotus-worker run --listen=0.0.0.0:4567 --name=storage-only-worker-2 --no-local-storage=true --no-default=true > ~/storage-only-lotusworker2.log 2>&1 &
  4. lotus-worker storage attach --init --store /root/storage-worker-2
  5. Rename "id" in sectorstore.json in /root/storage-worker-2 to "storage-only-lotus-worker-2" for easier understanding of storage-id. Restart worker.

Third storage-only lotus-worker (uses /root/storagelotusworker3)

  1. export LOTUS_WORKER_PATH=/root/storagelotusworker3
  2. export MINER_API_INFO=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJBbGxvdyI6WyJyZWFkIiwid3JpdGUiLCJzaWduIiwiYWRtaW4iXX0.tRu0gcsde8KxZeEueVtkXo4Q2GxYanIIrkMowW0MGic:/ip4/127.0.0.1/tcp/2345/http
  3. nohup lotus-worker run --listen=0.0.0.0:5678 --name=storage-only-worker-3 --no-local-storage=true --no-default=true > ~/storage-only-lotusworker3.log 2>&1 &
  4. lotus-worker storage attach --init --store /root/storage-worker-3
  5. Rename "id" in sectorstore.json in /root/storage-worker-3 to "storage-only-lotus-worker-3" for easier understanding of storage-id. Restart worker.

Create a regular sealing-worker (uses /root/sealingworker):

  1. export LOTUS_WORKER_PATH=/root/sealingworker
  2. export MINER_API_INFO=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJBbGxvdyI6WyJyZWFkIiwid3JpdGUiLCJzaWduIiwiYWRtaW4iXX0.tRu0gcsde8KxZeEueVtkXo4Q2GxYanIIrkMowW0MGic:/ip4/127.0.0.1/tcp/2345/http
  3. nohup lotus-worker run --listen=0.0.0.0:6789 --name=sealing-worker --no-local-storage=true > ~/sealing-worker.log 2>&1 &
  4. Create sealing-space: mkdir /root/scratchspace && mount -t tmpfs -o size=200M tmpfs /root/scratchspace
  5. lotus-worker storage attach --init --seal /root/scratchspace
  6. Rename "id" in sectorstore.json in /root/scratchspace to "seal-worker" for easier understanding of storage-id. Restart worker.

Turn off storage and sealing on the lotus-miner process. The local-network SP is set up with both store and seal from the start, so its storage entry initially shows:

        Weight: 10; Use: Seal Store
        Local: /root/.lotus-miner-local-net

  1. Set Seal and Store to false in the sectorstore.json file in ~/.lotus-miner-local-net
  2. Set all the sealing tasks to false in the config.toml file in ~/.lotus-miner-local-net
  3. Turn off batching precommits and aggregating commits, so we do not have to think about publishing these.

And restart the whole system/workers. The output of lotus-miner storage list should now look like this:

lotus-miner storage list
287e2bca-863e-4155-8060-da0adeff2f30:
        [######################################            ] 333.6 GiB/436.3 GiB 76%
        Unsealed: 0; Sealed: 2; Caches: 2; Reserved: 0 B
        Use: ReadOnly
        Local: /root/.genesis-sectors
        URL: http://127.0.0.1:2345/remote

b7ad63af-56df-43e7-b90b-a14d5df6baae:
        [######################################            ] 333.6 GiB/436.3 GiB 76%
        Unsealed: 0; Sealed: 1; Caches: 1; Reserved: 0 B
        Use: ReadOnly
        Local: /root/.lotus-miner-local-net
        URL: http://127.0.0.1:2345/remote

scratch-space:
        [                                                  ] 4 KiB/200 MiB 0%
        Unsealed: 0; Sealed: 0; Caches: 0; Reserved: 0 B
        Weight: 10; Use: Seal 
        URL: http://127.0.0.1:6789/remote (latency: 600µs)

storage-only-lotus-worker-1:
        [                                                  ] 4 KiB/100 MiB 0%
        Unsealed: 0; Sealed: 0; Caches: 0; Reserved: 0 B
        Weight: 10; Use: Store
        URL: http://127.0.0.1:3456/remote (latency: 400µs)

storage-only-lotus-worker-2:
        [                                                  ] 4 KiB/100 MiB 0%
        Unsealed: 0; Sealed: 0; Caches: 0; Reserved: 0 B
        Weight: 10; Use: Store
        URL: http://127.0.0.1:4567/remote (latency: 400µs)

storage-only-lotus-worker-3:
        [                                                  ] 4 KiB/100 MiB 0%
        Unsealed: 0; Sealed: 0; Caches: 0; Reserved: 0 B
        Weight: 10; Use: Store
        URL: http://127.0.0.1:5678/remote (latency: 400µs)

The expected behaviour is now that sectors will seal on the sealing-worker, which will then send each sealed sector to one of the three storage-only lotus-workers based on the storage-picking logic (weight × % available space).
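The "weight × % available space" picking logic described above can be sketched roughly like this. This is a simplified model - the `storagePath` type, its fields, and the exact scoring formula are assumptions for illustration, not the actual lotus implementation:

```go
package main

import "fmt"

// storagePath models only the fields relevant to store selection.
type storagePath struct {
	id        string
	weight    float64
	capacity  float64 // total bytes
	available float64 // free bytes
}

// pickStore scores each path as weight * fraction of space still free
// and returns the id of the highest-scoring path.
func pickStore(paths []storagePath) string {
	best, bestScore := "", -1.0
	for _, p := range paths {
		score := p.weight * (p.available / p.capacity)
		if score > bestScore {
			best, bestScore = p.id, score
		}
	}
	return best
}

func main() {
	paths := []storagePath{
		{"storage-only-lotus-worker-1", 10, 100, 60}, // 60% free -> score 6
		{"storage-only-lotus-worker-2", 10, 100, 99}, // 99% free -> score 9.9
	}
	fmt.Println(pickStore(paths)) // the emptier path should win
}
```

Under this logic, a nearly empty path should consistently win over a fuller one with equal weight - which is exactly what the reproduction below shows not happening.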

If we then run a script to pledge a lot of sectors, after a while you should see that only one (or two) of the storage-only lotus-workers gets assigned sectors, even if the last storage-only lotus-worker has more available space. The current situation looks like this:

lotus-miner storage list
287e2bca-863e-4155-8060-da0adeff2f30:
        [######################################            ] 333.6 GiB/436.3 GiB 76%
        Unsealed: 0; Sealed: 2; Caches: 2; Reserved: 0 B
        Use: ReadOnly
        Local: /root/.genesis-sectors
        URL: http://127.0.0.1:2345/remote

b7ad63af-56df-43e7-b90b-a14d5df6baae:
        [######################################            ] 333.6 GiB/436.3 GiB 76%
        Unsealed: 0; Sealed: 1; Caches: 1; Reserved: 0 B
        Use: ReadOnly
        Local: /root/.lotus-miner-local-net
        URL: http://127.0.0.1:2345/remote

scratch-space:
        [                                                  ] 148 KiB/200 MiB 0%
        Unsealed: 4; Sealed: 4; Caches: 4; Reserved: 0 B
        Weight: 10; Use: Seal 
        URL: http://127.0.0.1:6789/remote (latency: 4ms)

storage-only-lotus-worker-1:
        [#                                                 ] 2.129 MiB/100 MiB 2%
        Unsealed: 0; Sealed: 136; Caches: 136; Reserved: 0 B
        Weight: 10; Use: Store
        URL: http://127.0.0.1:3456/remote (latency: 3.9ms)

storage-only-lotus-worker-2:
        [                                                  ] 4 KiB/100 MiB 0%
        Unsealed: 0; Sealed: 0; Caches: 0; Reserved: 0 B
        Weight: 10; Use: Store
        URL: http://127.0.0.1:4567/remote (latency: 3.9ms)

storage-only-lotus-worker-3:
        [                                                  ] 4 KiB/100 MiB 0%
        Unsealed: 0; Sealed: 0; Caches: 0; Reserved: 0 B
        Weight: 10; Use: Store
        URL: http://127.0.0.1:5678/remote (latency: 1.2ms)

Here it would be expected that storage-only-lotus-worker-2 or storage-only-lotus-worker-3 would start to be assigned sectors, but all sectors still go to storage-only-lotus-worker-1.

@shrenujbansal
Contributor

Discussing this issue with @magik6k and looking at the code: the scheduler currently schedules tasks based on compute utilization. Since GETs require very little compute, all such tasks get assigned to the first available worker, which is why we see sectors getting assigned to only a single worker rather than spread across all the available workers.
The code making the decisions based on utilization is here: https://github.com/filecoin-project/lotus/blob/master/storage/sealer/sched_assigner_utilization.go#L15

There are two options for resolving the above problem:

  • (Simpler but not as effective) Use the round-robin scheduler which is already available here: https://github.com/filecoin-project/lotus/blob/master/storage/sealer/sched_assigner_spread.go#L15
  • (Complex but probably the right way) Add storage utilization as a criterion for determining which worker a task gets farmed out to. My recommendation would be to have separate utilization checks for compute-heavy tasks vs storage-heavy tasks like GETs, to keep the logic simpler. The biggest challenge here is how to properly obtain the utilization information from the workers and account for in-flight tasks within the scheduler
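The second option could be sketched roughly as follows. This is a minimal model: the `workerInfo` type, its fields, and the storage-heavy/compute-heavy classification are all invented for illustration and do not match the actual lotus scheduler code:

```go
package main

import "fmt"

// workerInfo carries one utilization figure per axis; storageUtil is
// assumed to already account for tasks in flight (the hard part noted
// in the comment above).
type workerInfo struct {
	id          string
	cpuUtil     float64 // 0..1, fraction of compute resources in use
	storageUtil float64 // 0..1, fraction of attached store space used
}

// assign picks the least-utilized worker along the axis that matters
// for the task: storage for GET-like tasks, compute for sealing tasks.
func assign(storageHeavy bool, workers []workerInfo) string {
	best, bestUtil := "", 2.0
	for _, w := range workers {
		util := w.cpuUtil
		if storageHeavy {
			util = w.storageUtil
		}
		if util < bestUtil {
			best, bestUtil = w.id, util
		}
	}
	return best
}

func main() {
	workers := []workerInfo{
		{"worker-1", 0.1, 0.9}, // idle CPU, nearly full store
		{"worker-2", 0.8, 0.1}, // busy CPU, mostly empty store
	}
	fmt.Println(assign(true, workers))  // GET goes to the emptier store
	fmt.Println(assign(false, workers)) // compute task goes to the idle CPU
}
```

Splitting the two axes keeps each check simple: a GET never competes on CPU numbers, and a PreCommit never competes on disk numbers.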

@benjaminh83

Round robin makes very good sense from a performance-scaling perspective (getting more capacity and throughput by adding more workers/paths). It does come with some issues that need addressing:

  1. Weights on paths are basically not obeyed - this would look like a bug unless it is explicitly called out as "the exception to the usage of weight".
  2. It still needs to obey the limit set on the storage path. This might be more problematic if the scheduler selects a worker but then, after selecting, figures out that this worker does not actually have a path with available capacity. Not sure if it is a problem, but just noting that this must not happen.

Otherwise, using storage utilisation could be right for some use cases, but not for all. Say I have a storage cluster with 16 JBODs / 16 individual paths with a worker for each, all filled up to 50%. Now I add another, empty JBOD at 0%, and the scheduler - based on storage utilisation - will send every sector to this single worker/path. This would reduce the number of parallel GETs from using all workers to using only one, and throttle my sealing output a lot!

There could also be an approach that is not round robin, but maybe just "allow only x concurrent GETs per worker". Allowing a worker only one GET at a time would force the scheduler to move on to the next free worker that is not currently occupied with a GET, and it could then rank workers by storage utilisation. NOT EASY, but this would capture the "don't send ALL sectors to the new worker" behaviour while still spreading out the load.
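The per-worker concurrent-GET cap described here can be sketched with a non-blocking semaphore. This is illustrative only - it is not the lotus GET_*_MAX_CONCURRENT implementation, and the type and method names are made up:

```go
package main

import "fmt"

// getLimiter caps how many GETs a worker may run at once; when it is
// full, the scheduler should try the next worker instead of waiting.
type getLimiter struct {
	slots chan struct{}
}

func newGetLimiter(max int) *getLimiter {
	return &getLimiter{slots: make(chan struct{}, max)}
}

// tryAcquire returns false immediately if the worker is at its cap,
// letting the scheduler fall through to another worker.
func (l *getLimiter) tryAcquire() bool {
	select {
	case l.slots <- struct{}{}:
		return true
	default:
		return false
	}
}

// release frees a slot when a GET finishes.
func (l *getLimiter) release() { <-l.slots }

func main() {
	lim := newGetLimiter(1)
	fmt.Println(lim.tryAcquire()) // true: first GET admitted
	fmt.Println(lim.tryAcquire()) // false: worker busy, scheduler moves on
	lim.release()
	fmt.Println(lim.tryAcquire()) // true again after the GET finished
}
```

The key property is the non-blocking `tryAcquire`: a full worker is skipped rather than queued against, which is what forces the spread.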

Lastly, I think it would be optimal if the SP could choose between strategies, like we can with the current "assigner spread". In an ideal world we would have "utilization" and "spread", maybe combined with a concurrent-GET limiter, so the scheduler is automatically forced to spread out the load rather than only hitting the same path. It's not a problem that it wants to fill up the new empty storage; we just don't want it scheduling 30 GETs against a single worker while the rest are idle.

Not an easy ask, I know, but this would have a HUGE impact on storage efficiency and could remove the current misfit network-storage strategies - basically making it much easier for SPs to move beyond direct-attached storage, without the pitfalls of using network storage with lotus, which is very hard to do well.

@clinta
Contributor

clinta commented Sep 15, 2022

maybe combined with a concurrent GET limiter, so it automatically is forced to spread out the load, rather than only hitting the same path

Is this not already supposed to work, using GET_32G_MAX_CONCURRENT on the storage worker? As documented here: https://lotus.filecoin.io/storage-providers/seal-workers/seal-workers/#limit-tasks-run-in-parallel

@benjaminh83

@clinta It would certainly make sense to load-balance with the GET limiter, but I'm quite certain that functionality has issues in lotus as well. @rjan90 can confirm that we see GET limits not being enforced: #9213 (comment)

I also used to run our workers with the flag --parallel-fetch-limit=1, which also does not seem to have any effect (this would maybe not touch the scheduling, but would basically limit the network load by only allowing one job at a time). We need to fix this at scheduling time, but this was just to underline that all of these limiters currently seem to be broken for GETs.

@magik6k
Contributor

magik6k commented Feb 27, 2023

Experiments in #10356 may make this better

@rjan90
Contributor

rjan90 commented Feb 27, 2023

Just ran the above test with the experiment-spread-tasks-qcount assigner, and got these results on three different lotus-worker --no-default storage workers after running a pledge script for a while:

17b49f4a-0d06-4a52-ae4c-4cb66ae38abc:
        [                                                  ] 532 KiB/100 MiB 0%
        Unsealed: 0; Sealed: 33; Caches: 33; Updated: 0; Update-caches: 0; Reserved: 0 B
        Weight: 10; Use: Store
        URL: http://127.0.0.1:5678/remote (latency: 600µs)

904bda38-d0cc-4ed3-85fb-f09a84184eb7:
        [                                                  ] 868 KiB/100 MiB 0%
        Unsealed: 0; Sealed: 54; Caches: 54; Updated: 0; Update-caches: 0; Reserved: 0 B
        Weight: 10; Use: Store
        URL: http://127.0.0.1:3456/remote (latency: 700µs)

e3fb6816-8ad4-4ccb-a271-59d9a1c75c21:
        [                                                  ] 356 KiB/100 MiB 0%
        Unsealed: 0; Sealed: 22; Caches: 22; Updated: 0; Update-caches: 0; Reserved: 0 B
        Weight: 10; Use: Store
        URL: http://127.0.0.1:4568/remote (latency: 600µs)

It shows quite a good improvement over the current spread assigner. I will re-run the experiment with the experiment-random assigner and check the difference.
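The spread seen above is consistent with an assigner that picks the worker with the fewest tasks currently queued. A simplified sketch of that idea - not the actual experiment-spread-tasks-qcount code, and with a deterministic tie-break added for illustration:

```go
package main

import "fmt"

// pickFewestQueued chooses the worker with the fewest tasks already
// assigned (queued + running), spreading new tasks across workers.
// Ties are broken by id so the choice is deterministic.
func pickFewestQueued(queued map[string]int) string {
	best, bestCount := "", int(^uint(0)>>1) // start at max int
	for id, n := range queued {
		if n < bestCount || (n == bestCount && id < best) {
			best, bestCount = id, n
		}
	}
	return best
}

func main() {
	queued := map[string]int{"worker-a": 3, "worker-b": 1, "worker-c": 2}
	fmt.Println(pickFewestQueued(queued)) // the least-loaded worker wins
}
```

Counting queued tasks rather than disk utilization sidesteps the "empty new JBOD absorbs everything" problem raised earlier in the thread, since a worker's queue length rises as soon as tasks are assigned to it.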

@rjan90
Contributor

rjan90 commented Feb 27, 2023

With the experiment-random assigner I got these results, which I would say are a slightly better alternative, for storage workers at least:

91d3f551-37c6-4c86-ac2b-5998d085fc05:
        [                                                  ] 596 KiB/100 MiB 0%
        Unsealed: 0; Sealed: 37; Caches: 37; Updated: 0; Update-caches: 0; Reserved: 0 B
        Weight: 10; Use: Store
        URL: http://127.0.0.1:5678/remote (latency: 700µs)

bef3536f-b5e8-4857-bd75-33bd6da5054a:
        [                                                  ] 692 KiB/100 MiB 0%
        Unsealed: 0; Sealed: 43; Caches: 43; Updated: 0; Update-caches: 0; Reserved: 0 B
        Weight: 10; Use: Store
        URL: http://127.0.0.1:4567/remote (latency: 700µs)

dce91ab6-ee65-4f63-9b1d-2f3b462df73b:
        [                                                  ] 500 KiB/100 MiB 0%
        Unsealed: 0; Sealed: 31; Caches: 31; Updated: 0; Update-caches: 0; Reserved: 0 B
        Weight: 10; Use: Store
        URL: http://127.0.0.1:3456/remote (latency: 600µs)

@rjan90 rjan90 modified the milestones: v1.17.2, v1.21.0 Feb 28, 2023
@benjaminh83

Is the round-robin assigner not possible from a technical standpoint? It would somewhat ensure the longest possible time between hits on the same storage-worker.
In our testing we would ideally have two workers for each path; I wonder how that complicates things. I do not really like the random assigner, as you might still end up with congestion if many sectors randomly hit the same path for a while.

@rjan90 rjan90 moved this to 👀 In Review in Lotus-Miner-V2 Mar 2, 2023
@rjan90 rjan90 removed this from the v1.21.0 milestone Mar 2, 2023
@rjan90 rjan90 added this to the Lotus-Miner Backlog Sprint milestone Mar 2, 2023
@rjan90
Contributor

rjan90 commented Mar 3, 2023

Is the round-robin assigner not possible from a technical standpoint? It would somewhat ensure longest time between hitting the same storage-worker.

As far as I understand, it's not possible without rewriting the whole scheduler - which we are working towards in the [EPIC] Lotus Miner v2 - External task queue milestone.

These new experimental assigners can be seen as a mitigation to get storage-only lotus-workers to actually work in such setups while we work towards the task queue.

I do not really like the random, as you might still end up getting congestion if many sectors are randomly hitting the same path for a while.

Yeah, you might see multiple sectors hitting the same path with the experiment-random assigner, but the bug where GET_xx_MAX_CONCURRENT was not being enforced now has a fix as well, so that should alleviate the congestion consequences.

@rjan90 rjan90 linked a pull request Mar 3, 2023 that will close this issue
@rjan90 rjan90 moved this from 👀 In Review to ✅ Done - v1.21.0 in Lotus-Miner-V2 Mar 9, 2023
@strahe
Contributor

strahe commented Apr 27, 2023

Would it be worth considering another mode where an idle worker takes the initiative to claim tasks itself? That could take a lot of scheduling work off the miner.
