Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Shard download space fix #3604

Closed
wants to merge 1 commit into from

Conversation

undertome
Copy link
Contributor

High Level Overview of Change

With this change, the DatabaseShardImp will confirm that sufficient space is available to store all the shards queued for download, instead of confirming that each shard will fit without considering the additional space required by the others.

Context of Change

This PR is stacked on two unmerged PRs. The relevant code starts with commit 6e1b74e.

Type of Change

  • [ X] Bug fix (non-breaking change which fixes an issue)

@scottschurr
Copy link
Collaborator

Hi @undertome, it would be useful for me to have a little more background on this bug fix. It would be handy to know:

  • The user observed nature of the bug before the fix.
  • The nature of the improved user experience after the bug fix.

Also it looks like the fix is associated with being low on resources. Does the user have any recourse once they discover that they are too low on resources to handle the shards?

Thanks.

@undertome
Copy link
Contributor Author

Hi @undertome, it would be useful for me to have a little more background on this bug fix. It would be handy to know:

* The user observed nature of the bug before the fix.

* The nature of the improved user experience after the bug fix.

Also it looks like the fix is associated with being low on resources. Does the user have any recourse once they discover that they are too low on resources to handle the shards?

Thanks.

Hi Scott,

  • Before the fix, the user would be able to initiate a batch of shard downloads when there was insufficient space to store all the shards, but sufficient space to store each shard in isolation. The user would see the first download complete, but subsequent downloads would fail.

  • With the fix, a request to download a batch of shards will fail immediately if there is insufficient space to store all the shards and a descriptive log message will be created.

If the user is low on resources, the DatabaseShardImp will log a warning, discontinue acquiring shards, and reject requests to download shards. At this point, it would be the user's responsibility to provide more storage to accommodate additional shards.

Hope that helps.

@scottschurr
Copy link
Collaborator

A couple of your commit messages don't follow these guidelines: https://chris.beams.io/posts/git-commit/ May I suggest rewordings? Consider:

Delay SSLHTTPDownloader dtor until current session finishes

and

Download queued shards only if there is space for them all

Or perhaps you can come up with something better.

@undertome undertome force-pushed the shard-download-space-fix branch from 7cbbd6c to f9b9647 Compare September 16, 2020 18:29
Copy link
Collaborator

@scottschurr scottschurr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Looks good. I left notes about a few things for you to think about. Nothing critical, although covering a few more of the error cases in the unit tests would be really nice. All comments are at your discretion. Thanks for making life a little easier for ledger users!

src/ripple/nodestore/impl/DatabaseShardImp.cpp Outdated Show resolved Hide resolved
src/ripple/nodestore/impl/DatabaseShardImp.cpp Outdated Show resolved Hide resolved
src/ripple/nodestore/impl/DatabaseShardImp.cpp Outdated Show resolved Hide resolved
src/ripple/nodestore/impl/DatabaseShardImp.cpp Outdated Show resolved Hide resolved
src/ripple/nodestore/impl/DatabaseShardImp.cpp Outdated Show resolved Hide resolved
@scottschurr
Copy link
Collaborator

FWIW, I was running the unit tests for code coverage on this branch using macOS and I got this error message during the test run:

ERR:ShardArchiveHandler async_connect
ERR:ShardArchiveHandler Downloading shard id 10 from URL 127.0.0.1/10.tar.lz4
rippled(2843,0x13c67c5c0) malloc: *** error for object 0x13eed4000: pointer being freed was not allocated
rippled(2843,0x13c67c5c0) malloc: *** set a breakpoint in malloc_error_break to debug
make[3]: *** [CMakeFiles/coverage_report] Abort trap: 6
make[2]: *** [CMakeFiles/coverage_report.dir/all] Error 2
make[1]: *** [CMakeFiles/coverage_report.dir/rule] Error 2
make: *** [coverage_report] Error 2

It did not leave a stack trace behind. I'll see if I can reproduce the problem.

@scottschurr
Copy link
Collaborator

Not surprisingly the bug is not easy to reproduce.

  • Running the unit tests alone (once) did not recreate it.
  • Running code coverage on just ShardArchiveHandler did not reproduce it.
  • Running all of the unit tests under code coverage again did not reproduce it.

I'm going to start a loop running the ShardArchiveHandler test to see if I can reproduce the bug with enough repetitions.

Copy link
Collaborator

@scottschurr scottschurr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the improved unit test coverage.

I'm guessing the non-reproducible unit test failure that I saw is not because of these changes.

src/test/rpc/ShardArchiveHandler_test.cpp Show resolved Hide resolved
@scottschurr
Copy link
Collaborator

I continued running a test over the weekend to try and get more information about the crash observed on Friday. I ran just the ShardArchiveHandler unit test in a loop. After 5928 iterations I got a different crash from the originally reported crash. Here's link to the stack trace of this most recent crash in case it's useful: https://gist.github.com/scottschurr/cec16dc035d9ec83fe61e6411cee0ddf

@scottschurr
Copy link
Collaborator

Success! By running just the ShardArchiveHandler under code coverage in a loop I was able to reproduce the original symptom quickly. And this time I got a stack trace. On the console I see:

ERR:ShardArchiveHandler Downloading shard id 9 from URL 127.0.0.1/9.tar.lz4
ERR:ShardArchiveHandler async_connect
ERR:ShardArchiveHandler Downloading shard id 10 from URL 127.0.0.1/10.tar.lz4
rippled(2685,0x148d915c0) malloc: *** error for object 0x13f548000: pointer being freed was not allocated
rippled(2685,0x148d915c0) malloc: *** set a breakpoint in malloc_error_break to debug
make[3]: *** [CMakeFiles/coverage_report] Abort trap: 6
make[2]: *** [CMakeFiles/coverage_report.dir/all] Error 2
make[1]: *** [CMakeFiles/coverage_report.dir/rule] Error 2
make: *** [coverage_report] Error 2

You can find the stack trace here: https://gist.github.com/scottschurr/f43be7a61290562738728bb4b668a6e8

@undertome
Copy link
Contributor Author

I continued running a test over the weekend to try and get more information about the crash observed on Friday. I ran just the ShardArchiveHandler unit test in a loop. After 5928 iterations I got a different crash from the originally reported crash. Here's link to the stack trace of this most recent crash in case it's useful: https://gist.github.com/scottschurr/cec16dc035d9ec83fe61e6411cee0ddf

Are you sure you're running code from this branch? That stack trace references SSLHTTPDownloader which no longer exists.

@scottschurr
Copy link
Collaborator

Are you sure you're running code from this branch? That stack trace references SSLHTTPDownloader which no longer exists.

That's a fair question. Here's the git log of the branch I've been testing:

commit fb1ca671319fcfdbeb3943f5d1fc554c3d7d230c (HEAD -> devon-shard-download-space-fix, devon/shard-download-space-fix)
Author: Devon White <[email protected]>
Date:   Thu Sep 17 17:26:02 2020 -0400

    [FOLD] Clang format patch

commit 09827143f62683907745a23842f6876d3612ade4
Author: Devon White <[email protected]>
Date:   Thu Sep 17 17:21:28 2020 -0400

    [FOLD] Add unit tests

commit f9b9647a5bd6b47133f2e58e1eb6585944c2d0fa
Author: Devon White <[email protected]>
Date:   Wed Sep 9 17:15:13 2020 -0400

    Download queued shards only if there is space for them all

commit e31c02ba89b6b7727eb2a8246d3d4ad4f227b104
Author: Devon White <[email protected]>
Date:   Fri Aug 14 15:45:56 2020 -0400

    Delay HTTPDownloader dtor until current session
    finishes

commit 649d4a741e112777575e9f9d10100c6588037d08
Author: Devon White <[email protected]>
Date:   Tue Aug 11 12:10:10 2020 -0400

    Allow shard downloads via HTTP or HTTPS

commit a8d481c2a54267d80bd539003f56481424b9ae63 (upstream/develop, origin/develop, develop)
Author: Nik Bougalis <[email protected]>
Date:   Tue Sep 1 10:43:26 2020 -0700

    Set version to 1.7.0-b1

@scottschurr
Copy link
Collaborator

That said, even though that's the source code currently in place, it's possible (but seems unlikely) that I did not do a proper build before running the ShardArchiveHandler unit test in a loop over the weekend. If that's the case please accept my apologies.

For the malloc failure under Codecov, however, I'm pretty confident that I have both the correct sources and that those are the source files I'm running. The rebuild is part of how Codecov is invoked.

@undertome
Copy link
Contributor Author

That said, even though that's the source code currently in place, it's possible (but seems unlikely) that I did not do a proper build before running the ShardArchiveHandler unit test in a loop over the weekend. If that's the case please accept my apologies.

For the malloc failure under Codecov, however, I'm pretty confident that I have both the correct sources and that those are the source files I'm running. The rebuild is part of how Codecov is invoked.

The malloc failure seems current.

@miguelportilla
Copy link
Contributor

I took a quick look at the stack @scottschurr produced and observed the following. The crash comes from within a journal. From ShardArchiveHandler_test::testRedundantShardFailure, on line 612, the test has only created Env called close() which calls RPCCall::fromNetwork. A possible culprit I see is that inside fromNetwork a local journal j is declared and passed into the callbacks and the HTTPClient object. This might be a problem as HTTPClient::request is non blocking and j's sink object may have gone out of scope when the callbacks or invokeComplete are executed. Its possible the sink may be still be valid at this point as it might be kept alive by someone but I am not sure.

@undertome
Copy link
Contributor Author

I took a quick look at the stack @scottschurr produced and observed the following. The crash comes from within a journal. From ShardArchiveHandler_test::testRedundantShardFailure, on line 612, the test has only created Env called close() which calls RPCCall::fromNetwork. A possible culprit I see is that inside fromNetwork a local journal j is declared and passed into the callbacks and the HTTPClient object. This might be a problem as HTTPClient::request is non blocking and j's sink object may have gone out of scope when the callbacks or invokeComplete are executed. Its possible the sink may be still be valid at this point as it might be kept alive by someone but I am not sure.

Thanks for looking into it, I will see if this checks out.

@codecov-io
Copy link

Codecov Report

❗ No coverage uploaded for pull request base (develop@a8d481c). Click here to learn what that means.
The diff coverage is n/a.

Copy link
Contributor

@miguelportilla miguelportilla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@undertome undertome force-pushed the shard-download-space-fix branch from 2265040 to 453894b Compare October 29, 2020 21:45
@undertome undertome added Passed Passed code review & PR owner thinks it's ready to merge. Perf sign-off may still be required. and removed Passed Passed code review & PR owner thinks it's ready to merge. Perf sign-off may still be required. labels Oct 29, 2020
@undertome
Copy link
Contributor Author

New commit fixes a Travis error caused by an "unused" (it's used in an assert statement) variable. There was also a bug I discovered that was caused by the Checkpointer outliving a DatabaseCon.

@undertome undertome force-pushed the shard-download-space-fix branch from f8f52d0 to 449cc29 Compare November 3, 2020 16:00
@carlhua carlhua added the Passed Passed code review & PR owner thinks it's ready to merge. Perf sign-off may still be required. label Nov 13, 2020
@manojsdoshi manojsdoshi mentioned this pull request Nov 19, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Passed Passed code review & PR owner thinks it's ready to merge. Perf sign-off may still be required.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants