parallelize batch flushing #4296
1. Modern storage devices (i.e., SSDs) tend to be highly parallel.
2. Allows us to read and write at the same time (avoids pausing while flushing).

fixes #898 (comment)

License: MIT
Signed-off-by: Steven Allen <[email protected]>
We may want to reduce the parallelism. However, we should probably test badger first (it may work better with increased parallelism).
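To make the idea concrete, here is a minimal sketch of a batch that commits full buffers in the background while capping the number of commits in flight. It is not the code from this PR: `store`, `batch`, `putMany`, and `maxInFlight` are hypothetical names, the limit of 4 is arbitrary, and a buffered channel is used as a semaphore, which is not necessarily how the PR tracks its active commits.

```go
package main

import (
	"fmt"
	"sync"
)

// maxInFlight is a hypothetical cap on concurrent commits (the value is
// arbitrary); it plays the role the PR gives to ParallelBatchCommits.
const maxInFlight = 4

// store is a hypothetical stand-in for a datastore with a bulk write.
type store struct {
	mu      sync.Mutex
	written []string
}

func (s *store) putMany(items []string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.written = append(s.written, items...)
	return nil
}

// batch buffers items and commits full buffers in background goroutines.
type batch struct {
	st      *store
	maxSize int
	items   []string
	sem     chan struct{} // holds one token per in-flight commit
	wg      sync.WaitGroup
	errMu   sync.Mutex
	err     error // first commit error, if any
}

func newBatch(st *store, maxSize int) *batch {
	return &batch{st: st, maxSize: maxSize, sem: make(chan struct{}, maxInFlight)}
}

// add buffers one item and starts an asynchronous commit once the buffer is full.
func (b *batch) add(item string) {
	b.items = append(b.items, item)
	if len(b.items) >= b.maxSize {
		b.asyncCommit()
	}
}

// asyncCommit hands the current buffer to a goroutine. It blocks only when
// maxInFlight commits are already running.
func (b *batch) asyncCommit() {
	if len(b.items) == 0 {
		return
	}
	b.sem <- struct{}{}
	b.wg.Add(1)
	go func(items []string) {
		defer func() {
			<-b.sem
			b.wg.Done()
		}()
		if err := b.st.putMany(items); err != nil {
			b.errMu.Lock()
			if b.err == nil {
				b.err = err
			}
			b.errMu.Unlock()
		}
	}(b.items)
	b.items = nil // the goroutine now owns the old slice
}

// flush commits any remainder and waits for every in-flight commit to finish.
func (b *batch) flush() error {
	b.asyncCommit()
	b.wg.Wait()
	b.errMu.Lock()
	defer b.errMu.Unlock()
	return b.err
}

func main() {
	st := &store{}
	b := newBatch(st, 3)
	for i := 0; i < 10; i++ {
		b.add(fmt.Sprintf("block-%d", i))
	}
	if err := b.flush(); err != nil {
		fmt.Println("flush failed:", err)
		return
	}
	fmt.Println("blocks written:", len(st.written)) // prints "blocks written: 10"
}
```

The caller keeps filling a fresh buffer while earlier buffers are being written, and only stalls when the in-flight limit is reached. Lowering `maxInFlight` is the knob to turn if less parallelism turns out to be better for a given datastore.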
This makes adds with sync enabled almost as fast as adds with sync disabled with the badger datastore (and the same
This is really nice, great find here :)
merkledag/batch.go
)

// ParallelBatchCommits is the number of batch commits that can be in-flight before blocking.
// TODO: Experiment with multiple datastores, storage devices, and CPUs to find
Can you create an issue for this instead of an in-code TODO? (They just get forgotten.)
Fine! (...grumble... we'll never get to it anyways)
That might be true, but it is still better than having a TODO in code.
As you said in your issue, someone might just take a stab at it out of pure boredom.
You're right, I was just being lazy 🙂.
merkledag/batch.go
}(t.blocks)

t.activeCommits++
t.blocks = nil
I would preallocate a buffer of MaxBlocks here, as appending will expand the buffer and cause more allocations and copies.
Technically, the max size of this array is 128 pointers to blocks (2KiB). However, it will likely never be greater than 32 pointers (0.5KiB) assuming that we have 256KiB blocks. Does 32 sound like a reasonable default size?
Personally, I don't think that will make much of a difference. We already do 1 allocation per block so this will only add another log(n) allocations.
Actually, I just preallocated a blocks array of the same size as the one we just filled. That should be a reasonable guess.
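As a rough illustration of that guess, the sketch below hands the filled buffer off and allocates the next one with the same capacity. `handOff` and the `string` element type are hypothetical; the real code works on a slice of blocks.

```go
package main

import "fmt"

// handOff returns the filled buffer (for a committer to own) and a fresh,
// empty buffer whose capacity matches the length of the one just filled.
// The function name and element type are illustrative only.
func handOff(filled []string) (toCommit, next []string) {
	return filled, make([]string, 0, len(filled))
}

func main() {
	buf := []string{"a", "b", "c", "d", "e"}
	toCommit, next := handOff(buf)
	fmt.Println(len(toCommit), len(next), cap(next)) // prints "5 0 5"
}
```

If the next flush holds about as many blocks as the previous one, appends into `next` never need to grow it.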
As Go allocates the next power of 2, getting to 128 would be 9 reallocations and copies. IMO it is worth it.
You're right, my log(n) estimate was incorrect anyways. log(n) per batch but still n overall (7-15% allocation overhead depending on the block sizes).
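The reallocation cost discussed in this thread can be measured directly. The sketch below counts how often `append` changes a slice's capacity while adding 128 elements, once starting from nil and once starting from a preallocated buffer. `countGrowths` is a hypothetical helper and `*int` merely stands in for a pointer to a block; the exact count for the nil case depends on the Go version and element size, so no value is claimed for it here.

```go
package main

import "fmt"

// countGrowths appends n elements to buf and reports how many times the
// capacity changed, i.e. how many times append had to reallocate and copy.
func countGrowths(buf []*int, n int) int {
	growths := 0
	for i := 0; i < n; i++ {
		before := cap(buf)
		buf = append(buf, nil)
		if cap(buf) != before {
			growths++
		}
	}
	return growths
}

func main() {
	const n = 128
	// Starting from nil, append grows the backing array several times.
	fmt.Println("from nil:", countGrowths(nil, n))
	// With enough capacity up front, append never reallocates.
	fmt.Println("preallocated:", countGrowths(make([]*int, 0, n), n)) // prints "preallocated: 0"
}
```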
License: MIT Signed-off-by: Steven Allen <[email protected]>
It's probably safe to assume that this buffer will be about the same size each flush. This could cause 1 extra allocation (if this is the last commit) but that's unlikely to be an issue. License: MIT Signed-off-by: Steven Allen <[email protected]>
After further testing, the effect isn't nearly so pronounced for medium-size files (the tests above were on single large files) and is probably non-existent for small files as we create a new batch per file. We should consider using the same batch when adding multiple small files.
(ipfs/kubo#4296) 1. Modern storage devices (i.e., SSDs) tend to be highly parallel. 2. Allows us to read and write at the same time (avoids pausing while flushing). fixes ipfs/kubo#898 (comment)
This makes ipfs add --local ~3.5x faster with the flatfs datastore (untested with badger).
fixes #898 (comment)
License: MIT
Signed-off-by: Steven Allen [email protected]