-
Notifications
You must be signed in to change notification settings - Fork 110
forky: Fixed Chunk Data Size Store #2017
base: master
Are you sure you want to change the base?
Conversation
@@ -168,10 +167,6 @@ func dbImport(ctx *cli.Context) { | |||
} | |||
|
|||
func openLDBStore(path string, basekey []byte) (*localstore.DB, error) { | |||
if _, err := os.Stat(filepath.Join(path, "CURRENT")); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This check is removed in order to allow dbImport to create localstore directory if it does not exist, to simplify migration steps (remove the need to start swarm node in between export and import, as it was required in v0.5.0 release).
storage/fcds/fcds.go
Outdated
shards map[uint8]*os.File // relations with shard id and a shard file | ||
shardsMu map[uint8]*sync.Mutex // mutex for every shard file | ||
meta MetaStore // stores chunk offsets | ||
free map[uint8]struct{} // which shards have free offsets |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i would also just have a shardCount length bool array here
|
||
// offsetCache is a simple cache of offset integers | ||
// by shard files. | ||
type offsetCache struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i am a bit lost here. what do we need this for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the caching of free offsets in Store. It reduces the number of calls to MetaStore.FreeOffset in Store.Put to avoid disk i/o.
storage/fcds/fcds.go
Outdated
func (s *Store) getOffset(shard uint8) (offset int64, reclaimed bool, err error) { | ||
if !s.shardHasFreeOffsets(shard) { | ||
// shard does not have free offset | ||
return -1, false, err |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
err -> nil preferred
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, yes of course, my mistake.
I am confirming that smoke tests pass with the current state of this PR https://snapshot.raintank.io/dashboard/snapshot/BDyvSKICIk5a0AbAjXxBeJfgvEvvpiqP. These results are with deployment of 3 swarm pods per one c5.large ec2 node. The first two failures are because smoke test job started before all swarm nodes in the cluster. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, very nice 👏 I left a few minor comments, mostly cosmetic ones.
storage/fcds/fcds.go
Outdated
// If offset is less then 0, no free offsets are available. | ||
func (s *Store) getOffset(shard uint8) (offset int64, reclaimed bool, err error) { | ||
if !s.shardHasFreeOffsets(shard) { | ||
// shard does not have free offset |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would maybe reduce comments in this function as the function comment itself explains the behavior and the code itself with private function names are 100% self explanatory. So I think having only the code makes it even more readable, imo.
storage/fcds/fcds.go
Outdated
}() | ||
select { | ||
case <-done: | ||
case <-time.After(15 * time.Second): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe a debug log here?
storage/fcds/fcds.go
Outdated
default: | ||
} | ||
s.wg.Add(1) | ||
return s.wg.Done, nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not have a private Store.done()
function instead of returning done
function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The main reason is to make it more obvious that done
must be called in order to finish the protection and that it is only in the relation to protect function. But for this small codebase, I think that it is not so significant. Is it ok to call the new method unprotect
instead done
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Both make sense to me, I agree with your point as well, so go for what feels better for you. If you decide to make it a private function unprotect
sounds good to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The usage is nicer with a private method, with one less line. Thanks.
func (s *MetaStore) Get(addr chunk.Address) (m *fcds.Meta, err error) { | ||
s.mu.RLock() | ||
m = s.meta[string(addr)] | ||
s.mu.RUnlock() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
defer? Not needed functionally, just a subjective cosmetic suggestion :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Defer comes with a cost, and I wanted to have as good as possible performance for measurements. In general, I would agree, but here defer contributes with no insignificant relative cost.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense 👍
// Reclaimed flag denotes that the chunk is at the place of | ||
// already deleted chunk, not appended to the end of the file. | ||
func (s *MetaStore) Set(addr chunk.Address, shard uint8, reclaimed bool, m *fcds.Meta) (err error) { | ||
s.mu.Lock() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
defer unlock?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Defer comes with a cost, and I wanted to have as good as possible performance for measurements.
|
||
// offsetCache is a simple cache of offset integers | ||
// by shard files. | ||
type offsetCache struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would expiry be helpful to guard against memory "leak"? Is there a situation where the cache items are not removed for a long time?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. In the context of offsetCache, this is certainly a possibility, and would be a good improvement. In the wider context, removal depends on MetaStore.FreeOffset function, which is injected and cannot be trusted for any implementation. I will add basic TTL.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have added a basic TTL and cleanup, but I am not sure that it is actually needed while adding a bit more complexity and locking.
There is a case where offsetCashe may hold offsets for longer periods in case of no activity (uploads, retrievals or syncing). In that case expiration reduces the optimisation that offsetCache is here for, and I would not add it.
Maybe @zelig would like to share opinion on this subject also?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, makes sense. I do not think that a small TTL is needed, maybe like 24h one or something similar, just to prevent cache growing forever, but I am not sure it is needed at all as well. Do we expect to have this same cache for the whole time the app running? Sorry if it's a stupid question 🙈 If yes, and if the app is supposed to be running for long periods of time (like multiple days), then it might have be good to clean it up some time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have reverted the TTL feature after a brief discussion. We have concluded that if there would be a cache leak, then a much bigger problem would have cause it in this case, so the additional complexity is most likely not needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This a very good improvement.
I want to test it in a test cluster that we setup.
LGTM
Thanks, @santicomp2014. I have updated this branch with the current master and pushed docker image to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
great work on this PR @janos!
a few comments from my side:
- different chunk sizes - will we need different shards for different chunk sizes? how much effort are we talking in migrating this implementation to support different chunk size? i think that adding the infrastructure to this right now would save us some trouble down the road
- shard count may change for various reasons and i think we should not tie
getShard
function output to this number - max chunk size is not enforced and gracefully truncates data
- if we want to accommodate for a hypothetical future feature that reconstructs localstore from the actual data on the shard files (something which is a low hanging fruit with this kind of a design, and allows us to reconstruct the db from an inconsistent state) but also accommodate variable chunk size - we cannot since the shards do not encode chunk size data. this is just a sidemark, i'm sure you are well aware of this
- traditionally i am not so happy about manual migrations but i'm gonna put a sock in it
in any case i would prefer to have this in the next release just to give us some leeway to play around with this some more
<3
storage/fcds/fcds.go
Outdated
// Interface specifies methods required for FCDS implementation. | ||
// It can be used where alternative implementations are needed to | ||
// switch at runtime. | ||
type Interface interface { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we really want to call an interface Interface
? maybe Storer
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I took the inspiration from sort.Interface
, but Storer
is a good name. Will change, thanks.
case <-ctx.Done(): | ||
} | ||
} | ||
continue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this means we continue although there were errors, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, this behaviour should be fixed in the whole goroutine where the error is not nil. If there is an error, goroutine should return. I will fix that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @acud for a review it is very helpful to see there the implementation is weak.
- different chunk sizes - will we need different shards for different chunk sizes? how much effort are we talking in migrating this implementation to support different chunk size? i think that adding the infrastructure to this right now would save us some trouble down the road
I am not sure that there is a decision to have variable chunk sizes. This highly influences this and probably other components that deal with the chunks.
- shard count may change for various reasons and i think we should not tie
getShard
function output to this number
I am open for suggestions.
- max chunk size is not enforced and gracefully truncates data
This is a great find. Thanks.
- if we want to accommodate for a hypothetical future feature that reconstructs localstore from the actual data on the shard files (something which is a low hanging fruit with this kind of a design, and allows us to reconstruct the db from an inconsistent state) but also accommodate variable chunk size - we cannot since the shards do not encode chunk size data. this is just a sidemark, i'm sure you are well aware of this
Yes, and the name of the store reflects on assumption that there is a fixed maximal chunk data size constant in time.
- traditionally i am not so happy about manual migrations but i'm gonna put a sock in it
This is just as we did a year ago when we introduced a new localstore (I think that you implemented the migration), but without and external link to the migration steps.
in any case i would prefer to have this in the next release just to give us some leeway to play around with this some more
I agree.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM @janos. Thanks for addressing all of my comments
also please do not merge until the next release is out |
@acud could you reject this PR until the next release is released, to block merging that way? :) |
} | ||
|
||
for _, sh := range s.shards { | ||
if err := sh.f.Close(); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Keeping the fd open for a long duration without mmap is a disaster in the making. I am not sure if go file flush() does a os level fsync(). If it is, then we should flush() at regular intervals to save the contents against crash.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice observation. Go os.File.Sync() is flushing the content to the disk, I am not sure to which flush() are you referring to.
Flushing on (regular) intervals is what operating system is already doing. I am not sure how this would help agains crash unless we fsync on every write, which is quite costly. I have already tested fsync on every write and it makes fcds much slower, even compared to go-leveldb, as go-leveldb dos not fsync at all.
Mmap brings its own complexity, especially on different operating systems.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the reply.
Go os.File.Sync() is flushing the content to the disk, I am not sure to which flush() are you referring to.
Apologies if i confused you. There are two things...
- flush(), which flushes the application write buffers to OS.
- fsync(), the OS level sync which absolutely makes sure that buffers has gone to disk.
The first one is done by golang itself as pointed out by you. The second one is the one i am concerned.
Flushing on (regular) intervals is what operating system is already doing
It is usually done by OS disk drivers whenever they feel it is okay. The fsync() is expensive as pointed out by you. To avoid fsyncing on every commit, DB's usually implement WAL's which fsyncs it on very small regular intervals on the background. so that even if some data is lost it will be of for very small duration . If you fsync on foreground (query path) everytime that will be very expensive.
even compared to go-leveldb, as go-leveldb dos not fsync at all.
leveldb takes another strategy my doing a larger mmap file than required and written directly using memcopy, then on mmap driver takes care of writing the dirty pages to the disk.
All i am saying is, one way or other, if we want to evade crash and end up with corrupted files on bootup, We have to implement this otherwise the I am sure we can expect some gibberish files when you switch off power abruptly.
Mmap brings its own complexity, especially on different operating systems.
Yes. We should not reinvent the wheel.
As a remedy... we have two options
- Use already existing DB which takes care of this ( we can talk about badger if you like)
- In Forky, do a FIle.Sync() every X seconds ( where we can tolerate X seconds data loss) in a background thread for all files.
I found this intresting read on this topic
https://www.joeshaw.org/dont-defer-close-on-writable-files/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for explanations.
Use already existing DB which takes care of this ( we can talk about badger if you like)
I have already tried badger and it is slower then go-leveldb. Maybe you can find to do it more efficiently. And still it does not scale with more cpu cores.
In Forky, do a FIle.Sync() every X seconds ( where we can tolerate X seconds data loss) in a background thread for all files.
I am concerned if this is actually help us, as os is already doing that in some frequency. Also, as you pointed, the solution would be to implement WAL, which in some basic form I already did, with of course, some performance penalties.
As far as I can see go-leveldb is not using mmap.
I see your point very valid, but I think that your suggestions should be tested, even the ones that I already did, to revalidate. We can talk about possibilities, but it would be good to actually test resilience for the weak points that you described.
As this is a very important part of the swarm I am more in favour not to merge this PR until we are clear that it is reliable. Currently, I see that it creates more problems than it brings benefits.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did few changes in the badger configuration and did run TestLevelDBForky and TestBadgerSuite test cases. These are the results.
The iteration in Badger is somewhat slow... so i commented out the Iteration and did these test cases.. with an assumption that write/read/delete is more important and not iteration.
There are more write improvements i can make in badger.. like batching the Writes and so on.. but for now.. i will leave it like this...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, but I cannot say anything from the screenshots except to compare timings and conclude that TestBadgerSuite writes are slower. It would be very helpful to actually see the changes and what is TestBadgerSuite doing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Look at the last 2 commits in my fork https://github.com/jmozah/forky/commits/master
The only one that matters is SyncWrites = false, which makes Forky and badger apples to apples in my opinion. Others configs are more experimental.
Based on recent discussions and decision to go for a more reliable and in general better approach with using Badger for chunk data storage, I am closing this PR. Thank you all for investing time to improve, test and review this PR, and I am sorry for a long attempt that was so late identified as not the best approach to improve storage performance. At least the high level design, localstore integration and migration part could be reused. |
Please don't delete the branch. This should be used to test performance with the badger. |
@jmozah Of course. I deliberately did not delete the branch. |
Hi , could you please just summarize why the approach was abandoned? Not everyone is able to join in these discussions, still it would be nice to have it documented here. |
This PR changes the chunk data disk persistence structure. It is based on previous experiment https://github.com/janos/forky where multiple approaches were tried out. The most performant was chosen and added to swarm as
storage/fcds
package.This PR includes additional changes required for new storage to be used and optionally manually migrated.
FCDS is integrated into
storage/localstore
without breaking changes to the package API. That involves:At the roundtable discussion, it was decided to have an optional manual data migration. To achieve this,
instructions with steps are printed when the new swarm version is started with older data format. LocalStore
migrations are adjusted to make this functionality. Some issues to
getMigrations
function are discoveredthat are fixed and tested in this pr through additional tests. Schema name variables are now unexported
and names from the legacy ldbstore removed from migrations as they are not needed to be there.
In order for migration to be complete, import and export localstore functions needed to handle pinning
information. This functionality is added.
Measurements
Measurements are performed locally on 4 core MacBook Pro mid 2014.
Every run on a clean swarm data directory by uploading files with random data.
Local test1 - 1GB size - 5x speedup
Local test4 - 4GB size - 6.5x speedup
Smoke tests on cluster
Smoke tests were run multiple times for validation and to measure performance. However, the performance
gain is related to the number of cpus that ec2 node has and how many swarm nodes are running on the same ec2 node.
These are results from running one swarm node on 2 core c5.large https://snapshot.raintank.io/dashboard/snapshot/dD0JQruCpHpOjvKqwR6YN3ay3GmONLyr.
If there are two swarm processes on the same node, upload speed is about the half. It is noticeable that
garbage collection influences performance and this is the area where further adjustments can be made.