
Scheduler does not honour available disk space on long term storage #8566

Closed
8 of 18 tasks
Tracked by #10338
RobQuistNL opened this issue Apr 28, 2022 · 13 comments · Fixed by #10356
Labels
area/sealing need/analysis Hint: Needs Analysis

Comments

@RobQuistNL
Contributor

RobQuistNL commented Apr 28, 2022

Checklist

  • This is not a security-related bug/issue. If it is, please follow the security policy.
  • This is not a question or a support request. If you have any lotus related questions, please ask in the lotus forum.
  • This is not a new feature request. If it is, please file a feature request instead.
  • This is not an enhancement request. If it is, please file an improvement suggestion instead.
  • I have searched on the issue tracker and the lotus forum, and there is no existing related issue or discussion.
  • I am running the latest release, the most recent RC (release candidate) for the upcoming release, or the dev branch (master), or have an issue updating to any of these.
  • I did not make any code changes to lotus.

Lotus component

  • lotus daemon - chain sync
  • lotus miner - mining and block production
  • lotus miner/worker - sealing
  • lotus miner - proving(WindowPoSt)
  • lotus miner/market - storage deal
  • lotus miner/market - retrieval deal
  • lotus miner/market - data transfer
  • lotus client
  • lotus JSON-RPC API
  • lotus message management (mpool)
  • Other

Lotus Version

1.15.2-rc2

Describe the Bug

When running a lotus-worker with all task flags set to false - so we just have these tasks:

taskTypes = append(
  taskTypes, sealtasks.TTFetch, 
  sealtasks.TTCommit1, 
  sealtasks.TTProveReplicaUpdate1, 
  sealtasks.TTFinalize, 
  sealtasks.TTFinalizeReplicaUpdate
)

and with only long-term storage attached, the worker gets mass-assigned GET tasks, even though there are free workers with more free disk space.

Logging Information

-

Repo Steps

  1. Run worker on storage machine
  2. See all GETs be assigned to one or two hosts
  3. See scheduler does not honour free disk space on those attached storages
@RobQuistNL
Contributor Author

@magik6k this might be one of the only remaining issues in the scheduler logic

@rjan90 rjan90 added need/analysis Hint: Needs Analysis and removed kind/bug Kind: Bug labels Apr 29, 2022
@RobQuistNL
Contributor Author

For me this has quite high priority - apart from the fact that:

  • Load does not get balanced
  • WDPost load does not get balanced

it's keeping all your eggs in one basket.

stor-13:/storage0      162T   33G  162T   1% /mnt/stor-13/storage0
stor-13:/storage1      162T   42G  162T   1% /mnt/stor-13/storage1
stor-13:/storage2      162T   36G  162T   1% /mnt/stor-13/storage2
stor-13:/storage3      162T   76G  162T   1% /mnt/stor-13/storage3
stor-13:/storage4      162T   34G  162T   1% /mnt/stor-13/storage4
stor-13:/storage5      162T   33G  162T   1% /mnt/stor-13/storage5
stor-14:/storage0      162T  386G  162T   1% /mnt/stor-14/storage0
stor-14:/storage1      162T  355G  162T   1% /mnt/stor-14/storage1
stor-14:/storage2      162T  353G  162T   1% /mnt/stor-14/storage2
stor-14:/storage3      162T  356G  162T   1% /mnt/stor-14/storage3
stor-14:/storage4      162T  355G  162T   1% /mnt/stor-14/storage4
stor-14:/storage5      162T  333G  162T   1% /mnt/stor-14/storage5
stor-15:/storage0      162T   35G  162T   1% /mnt/stor-15/storage0
stor-15:/storage1      162T   34G  162T   1% /mnt/stor-15/storage1
stor-15:/storage2      162T  134G  162T   1% /mnt/stor-15/storage2
stor-15:/storage3      162T   33G  162T   1% /mnt/stor-15/storage3
stor-15:/storage4      162T   48G  162T   1% /mnt/stor-15/storage4
stor-15:/storage5      162T  129G  162T   1% /mnt/stor-15/storage5

Because stor-14 booted / registered first (or last) with the miner, that one's getting all the FETCH jobs.

All we need is round-robin task queueing - that would fix all of these issues (and possibly other scheduling issues too)
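The round-robin assignment suggested here could be sketched as follows. This is an illustrative model only - the `roundRobin` type, its methods, and the worker names are invented for the example and are not part of the lotus scheduler:

```go
package main

import "fmt"

// roundRobin hands out workers in a fixed rotation, so consecutive
// fetch tasks land on different hosts instead of piling onto one.
type roundRobin struct {
	workers []string
	next    int
}

// pick returns the next worker in the rotation.
func (r *roundRobin) pick() string {
	w := r.workers[r.next%len(r.workers)]
	r.next++
	return w
}

func main() {
	rr := &roundRobin{workers: []string{"stor-13", "stor-14", "stor-15"}}
	for i := 0; i < 6; i++ {
		fmt.Println(rr.pick()) // each host is hit once per full rotation
	}
}
```

With three workers, six consecutive picks touch each host exactly twice, so no single storage machine absorbs all the FETCH jobs.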

@rjan90
Contributor

rjan90 commented Sep 15, 2022

So I was able to reproduce this on a local network, but it takes a bit of configuration. I'm happy to hand you the login credentials to this server @shrenujbansal, so you can see the issue faster and hopefully find and confirm a potential fix 😄 It's based on the most recent master (lotus version 1.17.2-dev+2k+git.4e830a8c3). The steps to get to the issue are:

Create a local network.

Create 3 tmpfs with 100M that the storage only lotus-workers will use:
mkdir /root/storage-worker-1 && mount -t tmpfs -o size=100M tmpfs /root/storage-worker-1
mkdir /root/storage-worker-2 && mount -t tmpfs -o size=100M tmpfs /root/storage-worker-2
mkdir /root/storage-worker-3 && mount -t tmpfs -o size=100M tmpfs /root/storage-worker-3

Initialize the 3 storage-only lotus-workers (I set up a screen session for each, for easier management)
First storage-only lotus-worker (uses the default .lotusworker path)

  1. export MINER_API_INFO=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJBbGxvdyI6WyJyZWFkIiwid3JpdGUiLCJzaWduIiwiYWRtaW4iXX0.tRu0gcsde8KxZeEueVtkXo4Q2GxYanIIrkMowW0MGic:/ip4/127.0.0.1/tcp/2345/http

  2. nohup lotus-worker run --name=storage-only-worker-1 --no-local-storage=true --no-default=true > ~/storage-only-lotusworker1.log 2>&1 &

  3. lotus-worker storage attach --init --store /root/storage-worker-1

  4. Rename "id" in sectorstore.json in /root/storage-worker-1 to "storage-only-lotus-worker-1" for easier understanding of storage list. Restart worker.

Second storage-only lotus-worker (uses /root/storagelotusworker2)

  1. export LOTUS_WORKER_PATH=/root/storagelotusworker2
  2. export MINER_API_INFO=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJBbGxvdyI6WyJyZWFkIiwid3JpdGUiLCJzaWduIiwiYWRtaW4iXX0.tRu0gcsde8KxZeEueVtkXo4Q2GxYanIIrkMowW0MGic:/ip4/127.0.0.1/tcp/2345/http
  3. nohup lotus-worker run --listen=0.0.0.0:4567 --name=storage-only-worker-2 --no-local-storage=true --no-default=true > ~/storage-only-lotusworker2.log 2>&1 &
  4. lotus-worker storage attach --init --store /root/storage-worker-2
  5. Rename "id" in sectorstore.json in /root/storage-worker-2 to "storage-only-lotus-worker-2" for easier understanding of storage-id. Restart worker.

Third storage-only lotus-worker (uses /root/storagelotusworker3)

  1. export LOTUS_WORKER_PATH=/root/storagelotusworker3
  2. export MINER_API_INFO=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJBbGxvdyI6WyJyZWFkIiwid3JpdGUiLCJzaWduIiwiYWRtaW4iXX0.tRu0gcsde8KxZeEueVtkXo4Q2GxYanIIrkMowW0MGic:/ip4/127.0.0.1/tcp/2345/http
  3. nohup lotus-worker run --listen=0.0.0.0:5678 --name=storage-only-worker-3 --no-local-storage=true --no-default=true > ~/storage-only-lotusworker3.log 2>&1 &
  4. lotus-worker storage attach --init --store /root/storage-worker-3
  5. Rename "id" in sectorstore.json in /root/storage-worker-3 to "storage-only-lotus-worker-3" for easier understanding of storage-id. Restart worker.

Create a regular sealing-worker (uses /root/sealingworker):

  1. export LOTUS_WORKER_PATH=/root/sealingworker
  2. export MINER_API_INFO=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJBbGxvdyI6WyJyZWFkIiwid3JpdGUiLCJzaWduIiwiYWRtaW4iXX0.tRu0gcsde8KxZeEueVtkXo4Q2GxYanIIrkMowW0MGic:/ip4/127.0.0.1/tcp/2345/http
  3. nohup lotus-worker run --listen=0.0.0.0:6789 --name=sealing-worker --no-local-storage=true > ~/sealing-worker.log 2>&1 &
  4. Create sealing-space: mkdir /root/scratchspace && mount -t tmpfs -o size=200M tmpfs /root/scratchspace
  5. lotus-worker storage attach --init --seal /root/scratchspace
  6. Rename "id" in sectorstore.json in /root/scratchspace to "seal-worker" for easier understanding of storage-id. Restart worker.

Turn off storage and sealing on the lotus-miner process. The local-network SP is set up with both store and seal from the start, so its storage entry initially shows:

        Weight: 10; Use: Seal Store
        Local: /root/.lotus-miner-local-net

  1. Set Seal and Store to false in the sectorstore.json file in ~/.lotus-miner-local-net
  2. Set all the sealing tasks to false in the config.toml file in ~/.lotus-miner-local-net
  3. Turn off batching precommits and aggregating commits, so we do not have to think about publishing these.

And restart the whole system/workers. The output of lotus-miner storage list should now look like this:

lotus-miner storage list
287e2bca-863e-4155-8060-da0adeff2f30:
        [######################################            ] 333.6 GiB/436.3 GiB 76%
        Unsealed: 0; Sealed: 2; Caches: 2; Reserved: 0 B
        Use: ReadOnly
        Local: /root/.genesis-sectors
        URL: http://127.0.0.1:2345/remote

b7ad63af-56df-43e7-b90b-a14d5df6baae:
        [######################################            ] 333.6 GiB/436.3 GiB 76%
        Unsealed: 0; Sealed: 1; Caches: 1; Reserved: 0 B
        Use: ReadOnly
        Local: /root/.lotus-miner-local-net
        URL: http://127.0.0.1:2345/remote

scratch-space:
        [                                                  ] 4 KiB/200 MiB 0%
        Unsealed: 0; Sealed: 0; Caches: 0; Reserved: 0 B
        Weight: 10; Use: Seal 
        URL: http://127.0.0.1:6789/remote (latency: 600µs)

storage-only-lotus-worker-1:
        [                                                  ] 4 KiB/100 MiB 0%
        Unsealed: 0; Sealed: 0; Caches: 0; Reserved: 0 B
        Weight: 10; Use: Store
        URL: http://127.0.0.1:3456/remote (latency: 400µs)

storage-only-lotus-worker-2:
        [                                                  ] 4 KiB/100 MiB 0%
        Unsealed: 0; Sealed: 0; Caches: 0; Reserved: 0 B
        Weight: 10; Use: Store
        URL: http://127.0.0.1:4567/remote (latency: 400µs)

storage-only-lotus-worker-3:
        [                                                  ] 4 KiB/100 MiB 0%
        Unsealed: 0; Sealed: 0; Caches: 0; Reserved: 0 B
        Weight: 10; Use: Store
        URL: http://127.0.0.1:5678/remote (latency: 400µs)

The expected behaviour is now that sectors will seal on the sealing-worker, which will then send each sealed sector to one of the three storage-only lotus-workers based on the storage-picking logic (weight × % available space).
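The "weight × % available space" picking logic described above can be sketched roughly like this. This is a simplified model - the `storagePath` type, its fields, and the exact scoring formula are assumptions for illustration, not the actual lotus implementation:

```go
package main

import "fmt"

// storagePath models only the fields relevant to store selection.
type storagePath struct {
	id        string
	weight    float64
	capacity  float64 // total bytes
	available float64 // free bytes
}

// pickStore scores each path as weight * fraction of space still free
// and returns the id of the highest-scoring path.
func pickStore(paths []storagePath) string {
	best, bestScore := "", -1.0
	for _, p := range paths {
		score := p.weight * (p.available / p.capacity)
		if score > bestScore {
			best, bestScore = p.id, score
		}
	}
	return best
}

func main() {
	paths := []storagePath{
		{"storage-only-lotus-worker-1", 10, 100, 60}, // 60% free -> score 6
		{"storage-only-lotus-worker-2", 10, 100, 99}, // 99% free -> score 9.9
	}
	fmt.Println(pickStore(paths)) // the emptier path should win
}
```

Under this logic, a nearly empty path should consistently win over a fuller one with equal weight - which is exactly what the reproduction below shows not happening.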

If we then run a script to pledge a lot of sectors, after a while you should see that only one (or two) of the storage-only lotus-workers gets assigned sectors, even if the last storage-only lotus-worker has more available space. The current situation looks like this:

lotus-miner storage list
287e2bca-863e-4155-8060-da0adeff2f30:
        [######################################            ] 333.6 GiB/436.3 GiB 76%
        Unsealed: 0; Sealed: 2; Caches: 2; Reserved: 0 B
        Use: ReadOnly
        Local: /root/.genesis-sectors
        URL: http://127.0.0.1:2345/remote

b7ad63af-56df-43e7-b90b-a14d5df6baae:
        [######################################            ] 333.6 GiB/436.3 GiB 76%
        Unsealed: 0; Sealed: 1; Caches: 1; Reserved: 0 B
        Use: ReadOnly
        Local: /root/.lotus-miner-local-net
        URL: http://127.0.0.1:2345/remote

scratch-space:
        [                                                  ] 148 KiB/200 MiB 0%
        Unsealed: 4; Sealed: 4; Caches: 4; Reserved: 0 B
        Weight: 10; Use: Seal 
        URL: http://127.0.0.1:6789/remote (latency: 4ms)

storage-only-lotus-worker-1:
        [#                                                 ] 2.129 MiB/100 MiB 2%
        Unsealed: 0; Sealed: 136; Caches: 136; Reserved: 0 B
        Weight: 10; Use: Store
        URL: http://127.0.0.1:3456/remote (latency: 3.9ms)

storage-only-lotus-worker-2:
        [                                                  ] 4 KiB/100 MiB 0%
        Unsealed: 0; Sealed: 0; Caches: 0; Reserved: 0 B
        Weight: 10; Use: Store
        URL: http://127.0.0.1:4567/remote (latency: 3.9ms)

storage-only-lotus-worker-3:
        [                                                  ] 4 KiB/100 MiB 0%
        Unsealed: 0; Sealed: 0; Caches: 0; Reserved: 0 B
        Weight: 10; Use: Store
        URL: http://127.0.0.1:5678/remote (latency: 1.2ms)

Here it would be expected that storage-only-lotus-worker-2 or storage-only-lotus-worker-3 would start to be assigned sectors, but all sectors still go to storage-only-lotus-worker-1.

@shrenujbansal
Contributor

Discussing this issue with @magik6k and looking at the code: the scheduler currently schedules tasks based on compute utilization. Since GETs require very little compute, all such tasks get assigned to the first available worker, which is why we see sectors getting assigned to only a single worker rather than spread across all the available workers.
The code making the decisions based on utilization is here: https://github.com/filecoin-project/lotus/blob/master/storage/sealer/sched_assigner_utilization.go#L15

There are two options for resolving the above problem:

  • (Simpler but not as effective) Use the round-robin scheduler which is already available here: https://github.com/filecoin-project/lotus/blob/master/storage/sealer/sched_assigner_spread.go#L15
  • (Complex but probably the right way) Add storage utilization as a criterion for determining which worker a task gets farmed out to. My recommendation would be to have separate utilization checks for compute-heavy tasks vs storage-heavy tasks like GETs, to keep the logic simpler. The biggest challenge here is how to properly obtain the utilization information from the workers and account for in-flight tasks within the scheduler
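The second option could be sketched roughly as follows. This is a minimal model: the `workerInfo` type, its fields, and the storage-heavy/compute-heavy classification are all invented for illustration and do not match the actual lotus scheduler code:

```go
package main

import "fmt"

// workerInfo carries one utilization figure per axis; storageUtil is
// assumed to already account for tasks in flight (the hard part noted
// in the comment above).
type workerInfo struct {
	id          string
	cpuUtil     float64 // 0..1, fraction of compute resources in use
	storageUtil float64 // 0..1, fraction of attached store space used
}

// assign picks the least-utilized worker along the axis that matters
// for the task: storage for GET-like tasks, compute for sealing tasks.
func assign(storageHeavy bool, workers []workerInfo) string {
	best, bestUtil := "", 2.0
	for _, w := range workers {
		util := w.cpuUtil
		if storageHeavy {
			util = w.storageUtil
		}
		if util < bestUtil {
			best, bestUtil = w.id, util
		}
	}
	return best
}

func main() {
	workers := []workerInfo{
		{"worker-1", 0.1, 0.9}, // idle CPU, nearly full store
		{"worker-2", 0.8, 0.1}, // busy CPU, mostly empty store
	}
	fmt.Println(assign(true, workers))  // GET goes to the emptier store
	fmt.Println(assign(false, workers)) // compute task goes to the idle CPU
}
```

Splitting the two axes keeps each check simple: a GET never competes on CPU numbers, and a PreCommit never competes on disk numbers.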

@benjaminh83

Round robin makes very good sense from a performance-scaling perspective (getting more capacity and throughput by adding more workers/paths). It does come with some issues that need addressing:

  1. Weights on paths are basically not obeyed - this would look like a bug unless it is explicitly called out as "the exception to the usage of weight".
  2. It still needs to obey the limit set on the storage path. This might be more problematic if the scheduler selects a worker but then, after selecting, figures out that this worker does not actually have a path with available capacity. Not sure if it is a problem, but just noting that this must not happen.

Otherwise, using storage utilisation could be right for some use cases, but not for all. Say I have a storage cluster with 16 JBODs / 16 individual paths with a worker for each, all filled up to 50%. Now I add another, empty JBOD at 0%, and the scheduler - based on storage utilisation - will send every sector to this single worker/path. This would reduce the number of parallel GETs from using all workers to using only one, and throttle my sealing output a lot!

There could also be an approach that is not round robin, but maybe just "allow only x concurrent GETs per worker". Allowing a worker only one GET at a time would force the scheduler to move on to the next free worker that is not currently occupied with a GET, and it could then rank workers by storage utilisation. NOT EASY, but this would capture the "don't send ALL sectors to the new worker" behaviour while still spreading out the load.
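The per-worker concurrent-GET cap described here can be sketched with a non-blocking semaphore. This is illustrative only - it is not the lotus GET_*_MAX_CONCURRENT implementation, and the type and method names are made up:

```go
package main

import "fmt"

// getLimiter caps how many GETs a worker may run at once; when it is
// full, the scheduler should try the next worker instead of waiting.
type getLimiter struct {
	slots chan struct{}
}

func newGetLimiter(max int) *getLimiter {
	return &getLimiter{slots: make(chan struct{}, max)}
}

// tryAcquire returns false immediately if the worker is at its cap,
// letting the scheduler fall through to another worker.
func (l *getLimiter) tryAcquire() bool {
	select {
	case l.slots <- struct{}{}:
		return true
	default:
		return false
	}
}

// release frees a slot when a GET finishes.
func (l *getLimiter) release() { <-l.slots }

func main() {
	lim := newGetLimiter(1)
	fmt.Println(lim.tryAcquire()) // true: first GET admitted
	fmt.Println(lim.tryAcquire()) // false: worker busy, scheduler moves on
	lim.release()
	fmt.Println(lim.tryAcquire()) // true again after the GET finished
}
```

The key property is the non-blocking `tryAcquire`: a full worker is skipped rather than queued against, which is what forces the spread.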

Lastly, I think it would be optimal if the SP could choose between strategies, like we can with the current "assigner spread". In an ideal world we would have "utilization" and "spread", maybe combined with a concurrent-GET limiter, so the scheduler is automatically forced to spread out the load rather than only hitting the same path. It's not a problem that it wants to fill up the new empty storage; we just don't want it scheduling 30 GETs against a single worker while the rest are idle.

Not an easy ask, I know, but this would have a HUGE impact on storage efficiency and could remove the current misfit network-storage strategies - basically making it much easier for SPs to move beyond direct-attached storage, without the pitfalls of using network storage with lotus, which is very hard to do well.

@clinta
Contributor

clinta commented Sep 15, 2022

maybe combined with a concurrent GET limiter, so it automatically is forced to spread out the load, rather than only hitting the same path

Is this not already supposed to work, using GET_32G_MAX_CONCURRENT on the storage worker? As documented here: https://lotus.filecoin.io/storage-providers/seal-workers/seal-workers/#limit-tasks-run-in-parallel

@benjaminh83

@clinta It would certainly make sense to load-balance with the GET limiter, but I'm quite certain that functionality has issues in lotus as well. @rjan90 can confirm that we see GET limits not being enforced: #9213 (comment)

I also used to run our workers with the flag --parallel-fetch-limit=1, which also does not seem to have any effect (this would maybe not touch the scheduling, but would basically limit the network load by only allowing one job at a time). We need to fix this at scheduling time, but this was just to underline that all of these limiters currently seem to be broken for GETs.

@magik6k
Contributor

magik6k commented Feb 27, 2023

Experiments in #10356 may make this better

@rjan90
Contributor

rjan90 commented Feb 27, 2023

Just ran the above test with the experiment-spread-tasks-qcount assigner, and got these results on three different lotus-worker --no-default storage workers after running a pledge script for a while:

17b49f4a-0d06-4a52-ae4c-4cb66ae38abc:
        [                                                  ] 532 KiB/100 MiB 0%
        Unsealed: 0; Sealed: 33; Caches: 33; Updated: 0; Update-caches: 0; Reserved: 0 B
        Weight: 10; Use: Store
        URL: http://127.0.0.1:5678/remote (latency: 600µs)

904bda38-d0cc-4ed3-85fb-f09a84184eb7:
        [                                                  ] 868 KiB/100 MiB 0%
        Unsealed: 0; Sealed: 54; Caches: 54; Updated: 0; Update-caches: 0; Reserved: 0 B
        Weight: 10; Use: Store
        URL: http://127.0.0.1:3456/remote (latency: 700µs)

e3fb6816-8ad4-4ccb-a271-59d9a1c75c21:
        [                                                  ] 356 KiB/100 MiB 0%
        Unsealed: 0; Sealed: 22; Caches: 22; Updated: 0; Update-caches: 0; Reserved: 0 B
        Weight: 10; Use: Store
        URL: http://127.0.0.1:4568/remote (latency: 600µs)

It shows quite a good improvement over the current spread assigner. I will re-run the experiment with the experiment-random assigner and check the difference.
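The spread seen above is consistent with an assigner that picks the worker with the fewest tasks currently queued. A simplified sketch of that idea - not the actual experiment-spread-tasks-qcount code, and with a deterministic tie-break added for illustration:

```go
package main

import "fmt"

// pickFewestQueued chooses the worker with the fewest tasks already
// assigned (queued + running), spreading new tasks across workers.
// Ties are broken by id so the choice is deterministic.
func pickFewestQueued(queued map[string]int) string {
	best, bestCount := "", int(^uint(0)>>1) // start at max int
	for id, n := range queued {
		if n < bestCount || (n == bestCount && id < best) {
			best, bestCount = id, n
		}
	}
	return best
}

func main() {
	queued := map[string]int{"worker-a": 3, "worker-b": 1, "worker-c": 2}
	fmt.Println(pickFewestQueued(queued)) // the least-loaded worker wins
}
```

Counting queued tasks rather than disk utilization sidesteps the "empty new JBOD absorbs everything" problem raised earlier in the thread, since a worker's queue length rises as soon as tasks are assigned to it.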

@rjan90
Contributor

rjan90 commented Feb 27, 2023

With the experiment-random assigner I got these results, which I would say are a slightly better alternative, for storage workers at least:

91d3f551-37c6-4c86-ac2b-5998d085fc05:
        [                                                  ] 596 KiB/100 MiB 0%
        Unsealed: 0; Sealed: 37; Caches: 37; Updated: 0; Update-caches: 0; Reserved: 0 B
        Weight: 10; Use: Store
        URL: http://127.0.0.1:5678/remote (latency: 700µs)

bef3536f-b5e8-4857-bd75-33bd6da5054a:
        [                                                  ] 692 KiB/100 MiB 0%
        Unsealed: 0; Sealed: 43; Caches: 43; Updated: 0; Update-caches: 0; Reserved: 0 B
        Weight: 10; Use: Store
        URL: http://127.0.0.1:4567/remote (latency: 700µs)

dce91ab6-ee65-4f63-9b1d-2f3b462df73b:
        [                                                  ] 500 KiB/100 MiB 0%
        Unsealed: 0; Sealed: 31; Caches: 31; Updated: 0; Update-caches: 0; Reserved: 0 B
        Weight: 10; Use: Store
        URL: http://127.0.0.1:3456/remote (latency: 600µs)

@rjan90 rjan90 modified the milestones: v1.17.2, v1.21.0 Feb 28, 2023
@benjaminh83

Is the round-robin assigner not possible from a technical standpoint? It would somewhat ensure the longest possible time between hits on the same storage-worker.
In our testing we would ideally have two workers for each path; I wonder how that complicates things. I do not really like the random assigner, as you might still end up with congestion if many sectors randomly hit the same path for a while.

@rjan90 rjan90 moved this to 👀 In Review in Lotus-Miner-V2 Mar 2, 2023
@rjan90 rjan90 removed this from the v1.21.0 milestone Mar 2, 2023
@rjan90 rjan90 added this to the Lotus-Miner Backlog Sprint milestone Mar 2, 2023
@rjan90
Contributor

rjan90 commented Mar 3, 2023

Is the round-robin assigner not possible from a technical standpoint? It would somewhat ensure longest time between hitting the same storage-worker.

As far as I understand, it's not possible without rewriting the whole scheduler - which we are working towards in the [EPIC] Lotus Miner v2 - External task queue milestone.

These new experimental assigners can be seen as a mitigation to get storage-only lotus-workers to actually work in such setups while we work towards the task queue.

I do not really like the random, as you might still end up getting congestion if many sectors are randomly hitting the same path for a while.

Yeah, you might see multiple sectors hitting the same path with the experiment-random assigner, but the bug where GET_xx_MAX_CONCURRENT was not being enforced now has a fix as well, so that should alleviate the congestion consequences.

@rjan90 rjan90 linked a pull request Mar 3, 2023 that will close this issue
@rjan90 rjan90 moved this from 👀 In Review to ✅ Done - v1.21.0 in Lotus-Miner-V2 Mar 9, 2023
@strahe
Contributor

strahe commented Apr 27, 2023

Would it be worth considering another mode where an idle worker takes the initiative to claim tasks itself? That could take a lot of scheduling work off the miner.
