Rippled locks up when client use pathing (all versions) #4224

Closed
shortthefomo opened this issue Jul 7, 2022 · 19 comments · Fixed by #4822
@shortthefomo

shortthefomo commented Jul 7, 2022

When clients connect to the node and use path finding, the node soon locks up.

Steps to Reproduce

Run a node and connect some clients via WebSocket that issue path_find requests (note that they need to be requesting path_find from this node). https://xrpl-pathfinding.netlify.app is a simple way to do this.

Expected Result

Node does not lock up.

Actual Result

The node fails with:
[Screenshot: Screen Shot 2022-07-07 at 18 38 19]
The node then fails to restart, and rippled server_info reports an error.
The only way I have found to fix this is to remove the DB and then restart the node.

Environment

No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 20.04.4 LTS
Release: 20.04
Codename: focal

rippled 1.9.1
installed from apt.

@WietseWind
Member

Looping in @ximinez

@ximinez
Collaborator

ximinez commented Jul 8, 2022

Still working on determining the cause here, but if you have fast_load=1 another workaround that doesn't require deleting your DB is to set fast_load=0.
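For anyone applying this workaround by script, here is a minimal sketch in Python. It assumes the setting appears in the config file as a line of the form `fast_load=1`. The helper name `disable_fast_load` and the sample stanza (`[node_db]`, `type=NuDB`) are illustrative assumptions, not taken from this thread.

```python
import re

# Hypothetical helper, not part of rippled: rewrite any "fast_load=1" line
# in the text of a config file to "fast_load=0", leaving other lines alone.
_FAST_LOAD = re.compile(r"^([ \t]*fast_load[ \t]*=[ \t]*)1([ \t]*)$", re.MULTILINE)


def disable_fast_load(cfg_text: str) -> str:
    """Return cfg_text with fast_load switched from 1 to 0."""
    return _FAST_LOAD.sub(r"\g<1>0\g<2>", cfg_text)


# Assumed sample stanza, for illustration only.
sample = "[node_db]\ntype=NuDB\nfast_load=1\n"
print(disable_fast_load(sample))
```

The function only touches lines whose value is exactly `1`, so a file that already has `fast_load=0` comes back unchanged.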

@WietseWind
Member

> Still working on determining the cause here, but if you have fast_load=1 another workaround that doesn't require deleting your DB is to set fast_load=0.

Interesting! Thank you, just deployed that to the pathfinding nodes. Let's see :) That saves some of the hassle.

@shortthefomo
Author

Yes, I had the same flag enabled here. I will give that a go as well.

@ximinez
Collaborator

ximinez commented Jul 12, 2022

Just to add a disclaimer, I wouldn't use this workaround on any full-history servers or servers that need a lot of history, because I don't know if there are any other side effects.

@shortthefomo
Author

shortthefomo commented Jul 13, 2022

Yup, I'm running pathing on a separate submission node, keeping the more critical nodes separate and only reachable via other validators.

I ran into this again this morning, so I've just flipped to fast_load=0. It has brought the node back up as described above :) Thanks for that tip, @ximinez.

I'm not sure when the next rippled release is scheduled, but I would suggest at minimum adding this comment about pathing and fast_load to the default rippled.cfg for anyone else. Hmm, strange: I just looked for that and can't find any mention of fast_load in the default cfg. Maybe this todo can be closed when that default documentation is added to the cfg.

@WietseWind
Member

WietseWind commented Jul 13, 2022

> Just to add a disclaimer, I wouldn't use this workaround on any full-history servers or servers that need a lot of history, because I don't know if there are any other side effects.

Of course :) I realized that. Thanks for the heads up. This indeed keeps the pathfinding nodes restarting when crashing without the node store error. Thanks. Crash not resolved, but at least the servers are back at it in a headless way.

@intelliot
Collaborator

Is this an out of memory (OOM) problem? Or is there something else going on here?

@intelliot intelliot added this to the memory milestone Sep 8, 2023
@shortthefomo
Author

shortthefomo commented Oct 21, 2023

It could well be. I do know that swap fills up rather than actual memory: I have 126 gigs available in this box, and the swap is usually at 100% after a while, but there are still many gigs free in physical RAM. [image]

@WietseWind
Member

WietseWind commented Oct 21, 2023

@lathanbritz Same here. No matter how much mem you give it (I tried 1TB 😂) it ends up eating it all and crashes.

You just give it more mem = more time. But it will be OOM killed.

[image]

^^ This one has a few hours left.

[image]

^^ This one is almost there.

For this reason we have a bunch of them for XRPLCluster and we take them down automatically, restart rippled and have them sync back up, add them back to the pool. They take turns this way. Crappy but somewhat functional.
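The "take turns" rotation described above could be sketched roughly as follows. This is an illustrative guess at such a policy, not XRPLCluster's actual tooling: the function name, the node names, the 0.9 threshold, and the minimum pool size are all assumptions.

```python
from typing import Dict, List


def pick_nodes_to_restart(mem_used_fraction: Dict[str, float],
                          threshold: float = 0.9,
                          min_in_pool: int = 1) -> List[str]:
    """Return the nodes to pull from the pool and restart, fullest first,
    while leaving at least min_in_pool nodes serving requests."""
    over = sorted(
        (name for name, used in mem_used_fraction.items() if used >= threshold),
        key=lambda name: mem_used_fraction[name],
        reverse=True,
    )
    # Never restart so many nodes that the pool drops below min_in_pool.
    max_restarts = max(0, len(mem_used_fraction) - min_in_pool)
    return over[:max_restarts]


# Hypothetical pool: fraction of memory in use per node.
pool = {"node-a": 0.95, "node-b": 0.40, "node-c": 0.92}
print(pick_nodes_to_restart(pool, min_in_pool=2))  # ['node-a']
```

A restarted node would be added back to the pool only after it has synced, at which point the next fullest node can take its turn.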

@sophiax851
Collaborator

sophiax851 commented Oct 22, 2023 via email

@github-project-automation github-project-automation bot moved this to 📋 Backlog in Core Ledger Oct 23, 2023
@WietseWind
Member

@sophiax851

The only calls directed to the machines are pathfinding calls, the subscription-ones, over Websocket.

Like what's started here:
https://xrpl-pathfinding.netlify.app/
https://xrpl.org/path_find.html

@sophiax851
Collaborator

@WietseWind

Since you can reproduce it reliably, I was hoping to get some sample request payloads, or sample addresses and tokens for the requests, so we can try them out in the lab. We could only reproduce it by manually setting up a very complex data model with interwoven trustlines on a single token, but that model eats up memory too quickly to allow us to conduct any analysis before the host goes OOM. Since you have a realistic case and the growth is gradual, it might be easier to debug. It's OK if it's difficult to share; we can recreate the synthetic data.

@shortthefomo
Author

@sophiax851

MAINNET

{
    "id": "example",
    "command": "path_find",
    "subcommand": "create",
    "source_account": "rThREeXrp54XTQueDowPV1RxmkEAGUmg8",
    "destination_account": "rThREeXrp54XTQueDowPV1RxmkEAGUmg8",
    "destination_amount": "1000"
}

This should be sufficient to reproduce it. https://github.com/WietseWind/Vue-Pathfinding-Demo/blob/249670cabf51e569baaeba80c62478ebffb66440/src/components/PathFinder.vue#L201
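For scripting the request above, here is a minimal sketch in Python that builds the same `create` payload plus the matching `close` request that ends the subscription. The helper names are made up for illustration. Sending the messages is left out because it needs a WebSocket client library, which is outside the standard library.

```python
import json


def path_find_create(source: str, destination: str, amount_drops: str,
                     req_id: str = "example") -> str:
    """Serialize a path_find 'create' request like the one in this comment.
    A plain string amount is interpreted as XRP in drops."""
    return json.dumps({
        "id": req_id,
        "command": "path_find",
        "subcommand": "create",
        "source_account": source,
        "destination_account": destination,
        "destination_amount": amount_drops,
    })


def path_find_close(req_id: str = "example") -> str:
    """Serialize the 'close' request that stops an open path_find."""
    return json.dumps({"id": req_id, "command": "path_find", "subcommand": "close"})


# Same account on both sides, exactly as in the sample above.
ACCOUNT = "rThREeXrp54XTQueDowPV1RxmkEAGUmg8"
print(path_find_create(ACCOUNT, ACCOUNT, "1000"))
print(path_find_close())
```

Each string would be sent as one text frame over the WebSocket connection to the node under test.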

@sophiax851
Collaborator

@lathanbritz Thanks for the sample. We've found the source of the memory growth and are working on a solution. Will keep you updated.

@WietseWind
Member

> @lathanbritz Thanks for the sample. We've found the source of the memory growth and are working on a solution. Will keep you updated.

Woahhhh!!! THANK YOU! That's AWESOME! :D

@ximinez
Collaborator

ximinez commented Nov 29, 2023

#4822 contains a fix that we're planning to release in 2.0.1 (within a month of 2.0.0). If anyone is able to build that branch and run it on some of their path finding nodes, I'd love to get feedback about how well it works.

@sophiax851
Collaborator

We conducted multiple rounds of testing and memory profiling to ensure that all objects created during the path_find request were cleared after the RPC request's exit and the termination of the WebSocket connection. Prior to the fix, using our test case, there was approximately 10GB of heap growth after just 21 path_find calls, resulting in about 12GB of RAM growth. With the fix in place, none of the previously accumulated objects show up in the memory snapshots taken during the test. So we believe the issue has been fixed, but it would be good for @WietseWind and @lathanbritz to also try it out and confirm, since you have dedicated servers running this with a broader range of requests.
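As a rough back-of-the-envelope reading of those pre-fix numbers, the sketch below works out the average growth per call. It assumes growth is linear per call, which this thread does not confirm, and the lab test case may be much heavier than production traffic, so treat the result as an order-of-magnitude illustration only. The 126 GB figure is the box size mentioned earlier in the thread.

```python
# Approximate pre-fix figures quoted in the comment above.
heap_growth_gb = 10.0
calls = 21

per_call_gb = heap_growth_gb / calls
print(f"~{per_call_gb:.2f} GB of heap growth per path_find call")


def calls_until_exhausted(ram_gb: float, growth_per_call_gb: float) -> int:
    """Number of calls that fit in ram_gb, ASSUMING linear growth per call."""
    return int(ram_gb // growth_per_call_gb)


# 126 GB is the box size mentioned earlier in this thread.
print(calls_until_exhausted(126.0, per_call_gb))
```

Under that assumption, adding RAM only raises the call count proportionally, which matches the "more mem = more time" observation earlier in the thread.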

@sophiax851
Collaborator

Actually, more precisely, the memory is cleared after each round of path updating, so even with long-lasting ongoing requests, rippled should not accumulate memory.

@intelliot intelliot changed the title from "Rippled locks up when client use pathing (Version: 1.9.1)" to "Rippled locks up when client use pathing (all versions)" Jan 10, 2024
8 participants
@intelliot @shortthefomo @WietseWind @ximinez @mtrippled @sophiax851 and others