Rippled locks up when clients use pathing (all versions) #4224
Comments
Looping in @ximinez
Still working on determining the cause here, but if you have
Interesting! Thank you, just deployed that to the pathfinding nodes. Let's see :) That saves some of the hassle.
Yes, I had the same flag enabled here; I'll be giving that a go as well.
Just to add a disclaimer, I wouldn't use this workaround on any full-history servers or servers that need a lot of history, because I don't know if there are any other side effects.
Yup, I'm running pathing on a separate submission node, keeping the more critical nodes separate and only reachable via other validators. I've run into this again this morning, so I've just flipped to I'm not sure when the next rippled release is scheduled, but I would suggest, at a minimum, adding this comment about pathing and fast_load to the default rippled.cfg for anyone else. Hmm, strange: I just looked for that and didn't find any mention of fast_load in the default cfg. Maybe this todo can go when that default documentation is added to the cfg.
Of course :) I realized that. Thanks for the heads up. This indeed keeps the pathfinding nodes restarting when crashing without the node store error. Thanks. Crash not resolved, but at least the servers are back at it in a headless way.
Is this an out of memory (OOM) problem? Or is there something else going on here?
@lathanbritz Same here. No matter how much memory you give it (I tried 1TB 😂), it ends up eating it all and crashing. Giving it more memory only buys more time; it will still be OOM killed. ^^ This one has a few hours left. ^^ This one is almost there. For this reason we have a bunch of them for XRPLCluster and we take them down automatically, restart
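The thread does not show how the automatic take-down and restart is done, so the following is only a hypothetical sketch of the idea: read the resident memory of the rippled process from Linux's `/proc/<pid>/status` and decide whether it has crossed a limit. The function names and the threshold are assumptions for illustration, not anything XRPLCluster is known to use.

```python
def parse_vm_rss_kib(status_text):
    """Return the VmRSS value (in KiB) from the text of /proc/<pid>/status.

    Returns None when no VmRSS line is present.
    """
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            # Line looks like: "VmRSS:\t  2048000 kB"
            return int(line.split()[1])
    return None


def should_restart(rss_kib, limit_kib):
    """Hypothetical policy: restart once resident memory reaches the limit."""
    return rss_kib is not None and rss_kib >= limit_kib


if __name__ == "__main__":
    # Inspect the current process as a stand-in for the rippled PID (Linux only).
    with open("/proc/self/status") as handle:
        rss = parse_vm_rss_kib(handle.read())
    print("resident KiB:", rss, "restart:", should_restart(rss, 64 * 1024 * 1024))
```

A supervisor could run this check on a timer and drain the node from the load balancer before restarting it, which avoids waiting for the kernel's OOM killer.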
The only calls directed to the machines are pathfinding calls, the subscription-ones, over Websocket. Like what's started here:
Since you can reproduce it reliably, I was hoping to get some sample request payloads, or sample addresses and tokens for the requests, to try them out in the lab. We could only reproduce it by manually setting up a very complex data model with interwoven trustlines on a single token, but this model eats up memory too quickly to allow us to conduct any analysis before the host hits OOM. Since you have a realistic case and the growth is gradual, it might be easier to debug. It's OK if it's difficult to share; we can recreate the synthetic data.
MAINNET
should be sufficient to produce this. https://github.com/WietseWind/Vue-Pathfinding-Demo/blob/249670cabf51e569baaeba80c62478ebffb66440/src/components/PathFinder.vue#L201
@lathanbritz Thanks for the sample. We've found the source of the memory growth and are working on a solution. Will keep you updated.
Woahhhh!!! THANK YOU! That's AWESOME! :D
#4822 contains a fix that we're planning to release in 2.0.1 (within a month of 2.0.0). If anyone is able to build that branch and run it on some of their path finding nodes, I'd love to get feedback about how well it works.
We conducted multiple rounds of testing and memory profiling to ensure that all objects created during the path_find request were cleared after the RPC request's exit and the termination of the WebSocket connection. Prior to the fix, using our test case, there was approximately 10GB of heap growth after just 21 path_find calls, resulting in about 12GB of RAM growth. With the fix in place, none of the previously accumulated objects show up in the memory snapshots taken during the test. So we believe the issue has been fixed, but it would be good for @WietseWind and @lathanbritz to also try it out and confirm, since you have dedicated servers running this with a broader range of requests.
Actually, more precisely, the memory was cleared after each round of path updating, so even with long-lasting ongoing requests, rippled should not accumulate memory.
When clients connect to the node using pathing, the node soon locks up.
Steps to Reproduce
Run a node and connect some clients via WebSocket that issue path_find requests (note that they need to be requesting path_find from this node). https://xrpl-pathfinding.netlify.app is a simple way to do this.
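For anyone reproducing this without the demo page, here is a minimal sketch of the WebSocket messages such a client would send. It only builds the JSON text for the `path_find` command's `create` and `close` subcommands; the helper names, the request ids, and the placeholder addresses are assumptions for illustration, and you would substitute real accounts and send the text over your own WebSocket connection to the node.

```python
import json


def build_path_find_create(request_id, source, destination, amount):
    """Build the JSON text of a path_find 'create' subscription request.

    `amount` is either a string of drops (XRP) or a dict with
    'currency', 'issuer' and 'value' keys (issued token).
    """
    return json.dumps({
        "id": request_id,
        "command": "path_find",
        "subcommand": "create",
        "source_account": source,
        "destination_account": destination,
        "destination_amount": amount,
    })


def build_path_find_close(request_id):
    """Build the JSON text that stops the connection's open path_find request."""
    return json.dumps({
        "id": request_id,
        "command": "path_find",
        "subcommand": "close",
    })


if __name__ == "__main__":
    # Placeholder addresses (hypothetical); replace with real accounts.
    print(build_path_find_create(1, "rSOURCE_PLACEHOLDER", "rDEST_PLACEHOLDER", "1000000"))
    print(build_path_find_close(2))
```

Leaving a `create` request open keeps the server recomputing paths for that connection, which matches the subscription-style load described in the comments.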
Expected Result
Node does not lock up.
Actual Result
Node fails with,
Now the node fails to restart and rippled server_info reports an error.
The only way I have found to fix this is to remove the DB and then restart the node.
Environment
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 20.04.4 LTS
Release: 20.04
Codename: focal
rippled 1.9.1
installed from apt.