Shard download space fix #3604

undertome · 2020-09-09T21:56:53Z

High Level Overview of Change

With this change, the DatabaseShardImp will confirm that sufficient space is available to store all the shards queued for download, instead of confirming that each shard will fit without considering the additional space required by the others.

Context of Change

This PR is stacked on two unmerged PRs. The relevant code starts with commit 6e1b74e.

Type of Change

[ X] Bug fix (non-breaking change which fixes an issue)

scottschurr · 2020-09-11T19:45:24Z

Hi @undertome, it would be useful for me to have a little more background on this bug fix. It would be handy to know:

The user observed nature of the bug before the fix.
The nature of the improved user experience after the bug fix.

Also it looks like the fix is associated with being low on resources. Does the user have any recourse once they discover that they are too low on resources to handle the shards?

Thanks.

undertome · 2020-09-15T16:35:38Z

Hi @undertome, it would be useful for me to have a little more background on this bug fix. It would be handy to know:
* The user observed nature of the bug before the fix.

* The nature of the improved user experience after the bug fix.
Also it looks like the fix is associated with being low on resources. Does the user have any recourse once they discover that they are too low on resources to handle the shards?

Thanks.

Hi Scott,

Before the fix, the user would be able to initiate a batch of shard downloads when there was insufficient space to store all the shards, but sufficient space to store each shard in isolation. The user would see the first download complete, but subsequent downloads would fail.
With the fix, a request to download a batch of shards will fail immediately if there is insufficient space to store all the shards and a descriptive log message will be created.

If the user is low on resources, the DatabaseShardImp will log a warning, discontinue acquiring shards, and reject requests to download shards. At this point, it would be the user's responsibility to provide more storage to accommodate additional shards.

Hope that helps.

scottschurr · 2020-09-15T23:58:56Z

A couple of your commit messages don't follow these guidelines: https://chris.beams.io/posts/git-commit/ May I suggest rewordings? Consider:

Delay SSLHTTPDownloader dtor until current session finishes

and

Download queued shards only if there is space for them all

Or perhaps you can come up with something better.

scottschurr

👍 Looks good. I left notes about a few things for you to think about. Nothing critical, although covering a few more of the error cases in the unit tests would be really nice. All comments are at your discretion. Thanks for making life a little easier for ledger users!

src/ripple/nodestore/impl/DatabaseShardImp.cpp

src/ripple/net/HTTPStream.h

src/ripple/net/impl/HTTPStream.cpp

src/ripple/net/impl/HTTPDownloader.cpp

src/ripple/net/HTTPStream.h

src/ripple/net/impl/HTTPStream.cpp

scottschurr · 2020-09-18T16:55:32Z

FWIW, I was running the unit tests for code coverage on this branch using macOS and I got this error message during the test run:

ERR:ShardArchiveHandler async_connect
ERR:ShardArchiveHandler Downloading shard id 10 from URL 127.0.0.1/10.tar.lz4
rippled(2843,0x13c67c5c0) malloc: *** error for object 0x13eed4000: pointer being freed was not allocated
rippled(2843,0x13c67c5c0) malloc: *** set a breakpoint in malloc_error_break to debug
make[3]: *** [CMakeFiles/coverage_report] Abort trap: 6
make[2]: *** [CMakeFiles/coverage_report.dir/all] Error 2
make[1]: *** [CMakeFiles/coverage_report.dir/rule] Error 2
make: *** [coverage_report] Error 2

It did not leave a stack trace behind. I'll see if I can reproduce the problem.

scottschurr · 2020-09-18T17:38:52Z

Not surprisingly the bug is not easy to reproduce.

Running the unit tests alone (once) did not recreate it.
Running code coverage on just ShardArchiveHandler did not reproduce it.
Running all of the unit tests under code coverage again did not reproduce it.

I'm going to start a loop running the ShardArchiveHandler test to see if I can reproduce the bug with enough repetitions.

scottschurr

Thanks for the improved unit test coverage.

I'm guessing the non-reproducible unit test failure that I saw is not because of these changes.

src/test/rpc/ShardArchiveHandler_test.cpp

scottschurr · 2020-09-21T16:32:27Z

I continued running a test over the weekend to try and get more information about the crash observed on Friday. I ran just the ShardArchiveHandler unit test in a loop. After 5928 iterations I got a different crash from the originally reported crash. Here's link to the stack trace of this most recent crash in case it's useful: https://gist.github.com/scottschurr/cec16dc035d9ec83fe61e6411cee0ddf

scottschurr · 2020-09-21T17:43:06Z

Success! By running just the ShardArchiveHandler under code coverage in a loop I was able to reproduce the original symptom quickly. And this time I got a stack trace. On the console I see:

ERR:ShardArchiveHandler Downloading shard id 9 from URL 127.0.0.1/9.tar.lz4
ERR:ShardArchiveHandler async_connect
ERR:ShardArchiveHandler Downloading shard id 10 from URL 127.0.0.1/10.tar.lz4
rippled(2685,0x148d915c0) malloc: *** error for object 0x13f548000: pointer being freed was not allocated
rippled(2685,0x148d915c0) malloc: *** set a breakpoint in malloc_error_break to debug
make[3]: *** [CMakeFiles/coverage_report] Abort trap: 6
make[2]: *** [CMakeFiles/coverage_report.dir/all] Error 2
make[1]: *** [CMakeFiles/coverage_report.dir/rule] Error 2
make: *** [coverage_report] Error 2

You can find the stack trace here: https://gist.github.com/scottschurr/f43be7a61290562738728bb4b668a6e8

undertome · 2020-09-22T15:41:13Z

I continued running a test over the weekend to try and get more information about the crash observed on Friday. I ran just the ShardArchiveHandler unit test in a loop. After 5928 iterations I got a different crash from the originally reported crash. Here's link to the stack trace of this most recent crash in case it's useful: https://gist.github.com/scottschurr/cec16dc035d9ec83fe61e6411cee0ddf

Are you sure you're running code from this branch? That stack trace references SSLHTTPDownloader which no longer exists.

scottschurr · 2020-09-22T16:15:33Z

Are you sure you're running code from this branch? That stack trace references SSLHTTPDownloader which no longer exists.

That's a fair question. Here's the git log of the branch I've been testing:

commit fb1ca671319fcfdbeb3943f5d1fc554c3d7d230c (HEAD -> devon-shard-download-space-fix, devon/shard-download-space-fix)
Author: Devon White <[email protected]>
Date:   Thu Sep 17 17:26:02 2020 -0400

    [FOLD] Clang format patch

commit 09827143f62683907745a23842f6876d3612ade4
Author: Devon White <[email protected]>
Date:   Thu Sep 17 17:21:28 2020 -0400

    [FOLD] Add unit tests

commit f9b9647a5bd6b47133f2e58e1eb6585944c2d0fa
Author: Devon White <[email protected]>
Date:   Wed Sep 9 17:15:13 2020 -0400

    Download queued shards only if there is space for them all

commit e31c02ba89b6b7727eb2a8246d3d4ad4f227b104
Author: Devon White <[email protected]>
Date:   Fri Aug 14 15:45:56 2020 -0400

    Delay HTTPDownloader dtor until current session
    finishes

commit 649d4a741e112777575e9f9d10100c6588037d08
Author: Devon White <[email protected]>
Date:   Tue Aug 11 12:10:10 2020 -0400

    Allow shard downloads via HTTP or HTTPS

commit a8d481c2a54267d80bd539003f56481424b9ae63 (upstream/develop, origin/develop, develop)
Author: Nik Bougalis <[email protected]>
Date:   Tue Sep 1 10:43:26 2020 -0700

    Set version to 1.7.0-b1

scottschurr · 2020-09-22T17:05:04Z

That said, even though that's the source code currently in place, it's possible (but seems unlikely) that I did not do a proper build before running the ShardArchiveHandler unit test in a loop over the weekend. If that's the case please accept my apologies.

For the malloc failure under Codecov, however, I'm pretty confident that I have both the correct sources and that those are the source files I'm running. The rebuild is part of how Codecov is invoked.

undertome · 2020-09-22T20:34:58Z

That said, even though that's the source code currently in place, it's possible (but seems unlikely) that I did not do a proper build before running the ShardArchiveHandler unit test in a loop over the weekend. If that's the case please accept my apologies.

For the malloc failure under Codecov, however, I'm pretty confident that I have both the correct sources and that those are the source files I'm running. The rebuild is part of how Codecov is invoked.

The malloc failure seems current.

miguelportilla · 2020-09-24T19:26:37Z

I took a quick look at the stack @scottschurr produced and observed the following. The crash comes from within a journal. From ShardArchiveHandler_test::testRedundantShardFailure, on line 612, the test has only created Env called close() which calls RPCCall::fromNetwork. A possible culprit I see is that inside fromNetwork a local journal j is declared and passed into the callbacks and the HTTPClient object. This might be a problem as HTTPClient::request is non blocking and j's sink object may have gone out of scope when the callbacks or invokeComplete are executed. Its possible the sink may be still be valid at this point as it might be kept alive by someone but I am not sure.

undertome · 2020-09-24T19:36:16Z

I took a quick look at the stack @scottschurr produced and observed the following. The crash comes from within a journal. From ShardArchiveHandler_test::testRedundantShardFailure, on line 612, the test has only created Env called close() which calls RPCCall::fromNetwork. A possible culprit I see is that inside fromNetwork a local journal j is declared and passed into the callbacks and the HTTPClient object. This might be a problem as HTTPClient::request is non blocking and j's sink object may have gone out of scope when the callbacks or invokeComplete are executed. Its possible the sink may be still be valid at this point as it might be kept alive by someone but I am not sure.

Thanks for looking into it, I will see if this checks out.

src/ripple/net/impl/HTTPStream.cpp

codecov-io · 2020-10-18T23:44:25Z

Codecov Report

❗ No coverage uploaded for pull request base (develop@a8d481c). Click here to learn what that means.
The diff coverage is n/a.

miguelportilla

👍

undertome · 2020-11-01T19:24:12Z

New commit fixes a Travis error caused by an "unused" (it's used in an assert statement) variable. There was also a bug I discovered that was caused by the Checkpointer outliving a DatabaseCon.

undertome requested review from scottschurr and miguelportilla September 9, 2020 21:57

scottschurr assigned miguelportilla and scottschurr Sep 10, 2020

undertome force-pushed the shard-download-space-fix branch from 7cbbd6c to f9b9647 Compare September 16, 2020 18:29

scottschurr approved these changes Sep 16, 2020

View reviewed changes