Fix persistence and add automatic journal flushing #98
base: master
Conversation
…e when writing large data

ResizableFile only extended itself once, even when the data to be written was larger than that single extension. For example, if the file is currently 1024 B and we're trying to write 1.5 KiB to it, it would get resized to 2048 B and the write would fail (because 2.5 KiB would be required).
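A minimal sketch of the fix described above, assuming the growth strategy is doubling (the function name and parameters are illustrative, not the actual ResizableFile code):

```python
def requiredFileSize(currentSize, writeOffset, dataLength):
    # Keep doubling until the file can hold the entire write, instead of
    # extending only once. E.g. a 1024 B file with a 1.5 KiB write at
    # offset 1024 needs 2560 B, so it must grow to 4096 B, not just 2048 B.
    newSize = currentSize
    while newSize < writeOffset + dataLength:
        newSize *= 2
    return newSize

assert requiredFileSize(1024, 1024, 1536) == 4096
```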
The FileJournal only stores the hash of the votedForNodeId because the size of the stored value has to be constant. Therefore, FileJournal.votedForNodeId does not return the actual node ID but a proxy object which can be compared against the node ID.
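For illustration, a minimal sketch of such a constant-size vote record (class and method names are hypothetical, not the PR's actual code):

```python
import hashlib

class VotedForHash(object):
    def __init__(self, digest):
        self._digest = digest  # 16-byte MD5 digest, always constant size

    @classmethod
    def fromNodeId(cls, nodeId):
        return cls(hashlib.md5(nodeId.encode('utf-8')).digest())

    def __eq__(self, other):
        # Allow comparing the stored hash directly against a node ID string.
        if isinstance(other, str):
            return self._digest == hashlib.md5(other.encode('utf-8')).digest()
        return isinstance(other, VotedForHash) and self._digest == other._digest
```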
…ed or no file journal is used
Interesting failures. The …
- The test was very sensitive to small timing differences. This updated version is still sensitive, but it should be much better now. I was unable to trigger any failures even under high load.
- o1 should be a follower after o2's first vote request due to the higher term in o2's message and its timeout not having been triggered again.
- Python 2's math.ceil returns a float instead of an int for whatever reason.
Alright, I blame the Travis environment. I'm unable to reproduce it on either 2.7 or 3.6 on my machine. Multiple runs always complete successfully. And everything passed on AppVeyor as well. These failures don't mean that the code is wrong, by the way. Instead, it seems that the test suite's … I can think of two workarounds. The first would be to increase the timeouts even more such that the difference between a timeout being triggered by …
The doTicks approach did not work well because it isn't very accurate and doesn't take into account the time spent outside of the actual ticking. So this replaces it with a direct loop which compares against the current time to ensure that exactly the right amount of time has elapsed since the creation of the SyncObj. This test may break again if the time spent inside SyncObj increases greatly; for example, if __init__ takes very long, the start time would not match the corresponding values inside the objects. Each individual tick must also not take too long.
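A rough sketch of such a loop, assuming a manual-tick API along the lines of PySyncObj's autoTick=False mode (names are placeholders; the real test helper differs):

```python
import time

def runTicks(objs, startTime, totalTickTime, sleepTime):
    # Tick until exactly totalTickTime has elapsed since startTime, so
    # time spent outside the ticks themselves is accounted for as well.
    while time.time() - startTime < totalTickTime:
        for obj in objs:
            obj.doTick(0.0)  # assumed manual tick entry point
        time.sleep(sleepTime)
```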
So everything except 3.5 passed on both Travis and AppVeyor, and I still can't reproduce that failure. I'm also not sure what could cause a timing issue with the current code.
Hard week. I'll try to look at this tomorrow. Thanks for the PR!
# We can't be sure whether this node voted in the current term yet, so don't vote until maxTimeout has elapsed.
# Factor 1.1 just to be safe.
# Note that an entire non-flushing cluster should always use the same maximum timeout to ensure that this method works!
self.__voteBlockTime = time.time() + 1.1 * self.__conf.raftMaxTimeout
I can't see any example of using a cluster where some of the nodes have flushing enabled and some don't. So there is no need to use this 1.1 constant. Also, I'm not sure this gives any guarantee; it looks like a hack.
I don't see how having such a cluster relates to this change.
You're right that this factor 1.1 is a hack. The entire voteBlockTime thing is a giant hack, really. The Raft algorithm requires that the term and vote (and the log) are stored to stable storage before responding to vote requests (or any other message), and using an in-memory journal or disabling flushing violates this requirement.
The Raft algorithm never relies on timing anywhere to guarantee consistency. Clock skew therefore won't affect reliability as long as the algorithm is implemented correctly. But in this particular case, clock skew will affect reliability (because the algorithm's assumptions are violated as mentioned above). For example, if the restarting node's clock is running faster than the rest of the cluster's clocks, it could still vote twice within a term. The factor 1.1 is a hacky fix for this issue in all but the most severe cases (if the clock is running more than 10 % too fast, something's seriously wrong).
In my opinion, using an in-memory journal or not flushing the journal file should not be an option in the first place, since it violates a key assumption in Raft. I realise that this has a severe performance impact, but well, that's the cost of reliability. I prefer a slightly slower, reliable solution over a more performant one with nasty timing bugs that are nearly impossible to reproduce or debug.
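For illustration, a minimal sketch of the persist-before-respond rule under discussion (the header offset and field layout are assumptions for the example, not PySyncObj's actual format):

```python
import os
import struct

def persistTermAndVote(journalFile, term, votedForHash):
    # Raft requires currentTerm and votedFor to be durable BEFORE the
    # node replies to a RequestVote RPC.
    journalFile.seek(8)                         # assumed header offset
    journalFile.write(struct.pack('<Q', term))  # 8-byte term
    journalFile.write(votedForHash)             # fixed-size MD5 digest
    journalFile.flush()                         # flush Python's buffers
    os.fsync(journalFile.fileno())              # force the data to disk
```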
Ok, but maybe it would be a good idea at least to move it to a config option (instead of hardcoding it in place)?
Sounds reasonable. I'll do that.
Looked briefly, left a few comments.
# Specifically, o2 is expected to ignore the messages until 1.1 * timeout, i.e. including the one sent by o1 after 2.5 seconds, except for updating its term.
# Note that o1 has flushing enabled but o2 doesn't!

def tick(o1, o2, startTime, totalTickTime, sleepTime):
Time-specific tests will randomly fail sometimes; you should mock time or not rely on time at all.
Oh, great idea. I'll implement that. 👍
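A minimal sketch of the mocking idea, assuming a hypothetical FakeClock helper patched in via unittest.mock:

```python
import time
from unittest import mock

class FakeClock(object):
    def __init__(self, start=0.0):
        self.now = start

    def time(self):
        return self.now

    def advance(self, seconds):
        self.now += seconds

def test_time_is_fully_controlled():
    clock = FakeClock(start=1000.0)
    with mock.patch('time.time', clock.time):
        assert time.time() == 1000.0
        clock.advance(2.5)           # deterministic, load-independent
        assert time.time() == 1002.5
```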
I finally had some time to benchmark this, and it doesn't look good. With flushing disabled, I managed 14k RPS on my test machine (with a request size of 10 and 3 nodes). With flushing enabled, I managed 40. Yes, you read that right: forty instead of fourteen thousand, i.e. about 350 times slower! This doesn't surprise me too much since flushing is a very expensive operation, although I'd have expected to at least reach a few hundred RPS.

I'm not sure how to improve this. Flushing only every so often doesn't make sense as mentioned in another comment already, and I can't think of anything other than reducing the flushing frequency. If you have any ideas, I'd be happy to hear about them.

As a side note, I also tested it with the journal file stored on a tmpfs. Obviously, that doesn't make any sense for actual usage, but it allows for testing the performance of the high-level flushing code. In this test, I'm able to reach the same ~14k RPS. This shows that it's the actual flushing to disk that murders the performance, not my code for calling the flushing. That shouldn't surprise anyone, but I still thought I'd mention it.

I tested this with the existing benchmarking code with a few changes to accommodate the journal file and flushing configuration settings. The patch file is here.
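As a rough illustration of where the time goes, here is a toy micro-benchmark comparing buffered writes against a per-write fsync (this is not the PR's benchmark; absolute numbers will vary wildly by disk):

```python
import os
import tempfile
import time

def writesPerSecond(n=1000, fsync=True):
    # Time n small appends, optionally forcing each one to disk.
    with tempfile.NamedTemporaryFile() as f:
        start = time.time()
        for _ in range(n):
            f.write(b'x' * 10)
            if fsync:
                f.flush()
                os.fsync(f.fileno())
        return n / (time.time() - start)

print(writesPerSecond(fsync=False))  # fast: buffered writes only
print(writesPerSecond(fsync=True))   # slow: one disk sync per write
```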
I think that we can try to flush once per multiple writes. But we shouldn't increment lastCommitIdx until the flush.
There isn't any benefit to that over just disabling flushing entirely though. The Raft spec requires that nodes do not respond to RPCs before persisting those values to disk. So if we only flush after a while, we're still in the same situation as master is now: there is no guarantee that a node hasn't voted, so it has to wait, which introduces timing issues into the cluster.
There is a batch mode: we already wait a little before sending commands. If we add another 0.1 s delay, that will be much better than reducing overall performance.
Ah, so the …
Ok, that helps, but unfortunately the batches seem to be rather small (fewer than 10 entries in my tests), so it only provides a fairly small improvement. My test machine now manages about 70 RPS instead of 40.
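A sketch of the batching idea being discussed, with one flush per batch and the acknowledgement deferred until after it (hypothetical names, not the PR's implementation):

```python
def appendEntriesBatch(journal, entries):
    for entry in entries:
        journal.add(entry)  # buffered write, not yet durable
    journal.flush()         # a single fsync covers the entire batch
    # Only now is it safe to respond to the RPC and advance lastCommitIdx.
```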
I'm not sure that it's possible to achieve that using the existing batch mode in append_entries. I think you should try the following: …
Yeah. It looks like the …
Any further progress on this issue? I am really keen to see this improvement in RPS.
This fixes #84.

A new configuration option, flushJournal, controls whether the journal should be flushed to disk on every write. This is enabled by default when using a file-based journal (i.e. when journalFile is not None). It is an error to specify flushJournal = True for the memory-based journal.

To store the currentTerm and the votedForNodeIdHash in the journal, the header had to be expanded, which also means that the journal file format is now version 2. Migration is performed upon restarting with the upgraded code (and there's also a test for this). Regarding details about votedForNodeIdHash (and why this is an MD5 hash), see my recent comment in #84. The FileJournal now also verifies that the journal file version is supported by the code and throws a RuntimeError if that is not the case; previously, it simply always tried to read it as a version 1 file.

If journal flushing is disabled, the node will not take part in elections until 1.1 times the maximum timeout has elapsed. This is necessary to prevent a node from voting twice in the same election, which could lead to two leaders being elected within a single term. It can, however, lead to a longer delay before the cluster is operational again after failover.

This furthermore fixes a crash in ResizableFile when performing a big write (compared to the current file size). See the commit message of 38d6adf for details.
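For illustration, a minimal sketch of the version check described above (the 4-byte magic plus 4-byte version header layout is an assumption for the example, not the actual format from this PR):

```python
import struct

SUPPORTED_VERSION = 2

def checkJournalVersion(f):
    # Read an assumed 8-byte header: 4-byte magic plus 4-byte version.
    f.seek(0)
    magic, version = struct.unpack('<4sI', f.read(8))
    if version > SUPPORTED_VERSION:
        raise RuntimeError('Unsupported journal file version: %d' % version)
    return version
```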