Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sim-lib: continue simulation on send payment errors #104

Merged
merged 1 commit into from
Oct 4, 2023

Conversation

okjodom
Copy link
Collaborator

@okjodom okjodom commented Sep 15, 2023

Instead of stopping the simulation on the first send payment rpc error, we gracefully handle the error and report a default send payment result. Since send payment rpc might fail before dispatching a payment (i.e in CLN case), we allow activity result reporting to skip tracking of payment if there's no payment hash

Sample results with failed keysends. The simulation continues until stopped

source,destination,hash,amount_msat,dispatch_time,htlc_count,payment_outcome
023d45dc57ed99e6d793e2fed8dcf79d2f6ad2f32ec31b48102de445e245e08336,03b73874dde2c0af24619f03b6fc5d36adbf00ad6d25ce6479d4f5b60f648c8cb6,83d340ab8a7f7b383294dd06a8fc02bcbff6c77c73debaa8e590aad550216f98,1000,1695160575397,1,Success
02c0c854f511082f9d174f9b0e51ca14fc299c25de9d65dddadebe9ac1fd529758,03b73874dde2c0af24619f03b6fc5d36adbf00ad6d25ce6479d4f5b60f648c8cb6,Unknown,10000000000,1695160576400,0,Unknown

TODO:

  • Continue simulation on send payment errors
  • Formalize terminal error handling
  • Parse CLN send_payment rpc errors to determine terminal failures
  • Parse LND send_payment rpc errors to determine terminal failures

fixes #96

@okjodom okjodom requested a review from carlaKC September 19, 2023 22:00
@okjodom okjodom marked this pull request as ready for review September 19, 2023 22:00
Copy link
Contributor

@carlaKC carlaKC left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for picking this up! Sorry that I gave you rebase conflicts :')

sim-lib/src/cln.rs Outdated Show resolved Hide resolved
Copy link
Contributor

@carlaKC carlaKC left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approach looks good!

I do think that we need to take a slightly more subtle approach to handling errors (right now we'll just create failed events for any send_payment failure). In my mind there are two types of errors:

  • Terminal errors: the lightning node has shut down -> we can't recover and should also shut down.
  • Payment errors: the API has returned an error, but it's actually just a pathfinding error (eg: no route) -> we can recover from this, and should record the failure and try again.

Right now, if one of the lightning nodes shuts down we'll just continue to track failed payments when we should really just shut down. To address #96, we need to examine the error from CLN and figure out whether it's terminal or now. So basically all of this code as-is, with a little some more precise error handling in the CLN impl.

sim-lib/src/lib.rs Show resolved Hide resolved
sim-lib/src/lib.rs Outdated Show resolved Hide resolved
@okjodom
Copy link
Collaborator Author

okjodom commented Sep 21, 2023

I do think that we need to take a slightly more subtle approach to handling errors (right now we'll just create failed events for any send_payment failure). In my mind there are two types of errors:

  • Terminal errors: the lightning node has shut down -> we can't recover and should also shut down.
  • Payment errors: the API has returned an error, but it's actually just a pathfinding error (eg: no route) -> we can recover from this, and should record the failure and try again.

@carlaKC I agree with your sentiment above, but have a slightly different proposal regarding interpretation of Terminal errors

RE: Terminal errors: the lightning node has shut down -> we can't recover and should also shut down.

Instead of shutting down the entire simulation, I propose we stop simulating the activity or activity set affected by nodes we have seen terminal errors on. This shuts down parts of the simulation, but continues the still good parts.

Right now, if one of the lightning nodes shuts down we'll just continue to track failed payments when we should really just shut down. To address #96, we need to examine the error from CLN and figure out whether it's terminal or now. So basically all of this code as-is, with a little some more precise error handling in the CLN impl.

So building on this PR, my proposal is we stop simulating activity related to the CLN node if we observe terminal failure on the node.

Oh, and we terminate the simulation if there's no more activities to simulate.

sim-lib/src/lib.rs Outdated Show resolved Hide resolved
Copy link
Contributor

@carlaKC carlaKC left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just read through the shutdown strategy commit!

sim-lib/src/lib.rs Outdated Show resolved Hide resolved
sim-lib/src/lib.rs Outdated Show resolved Hide resolved
@okjodom okjodom force-pushed the continue-sim branch 2 times, most recently from e2b66c8 to f89506a Compare September 25, 2023 22:02
@carlaKC
Copy link
Contributor

carlaKC commented Sep 26, 2023

f89506a looking nice 👌 thanks for going through a few iterations to get it there!

I you want to go ahead and fill in the parsing I think we can do this all in one PR?

@okjodom
Copy link
Collaborator Author

okjodom commented Sep 26, 2023

f89506a looking nice 👌 thanks for going through a few iterations to get it there!

I you want to go ahead and fill in the parsing I think we can do this all in one PR?

I was planning the parsing as follow-ups when I get a chance to build on this.

Starting w cln because that's the most offending

@carlaKC
Copy link
Contributor

carlaKC commented Sep 26, 2023

I was planning the parsing as follow-ups when I get a chance to build on this.

SGTM, no strong feelings.

Starting w cln because that's the most offending

GLHF :')

@okjodom okjodom marked this pull request as draft September 26, 2023 18:38
@okjodom okjodom force-pushed the continue-sim branch 2 times, most recently from d433227 to f5e9eaa Compare September 28, 2023 19:48
@okjodom okjodom marked this pull request as ready for review September 28, 2023 19:49
Copy link
Member

@sr-gi sr-gi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks pretty good overall. Check some comments inline.

f5e9eaa and 5bae876 should be squashable with f89506a

sim-lib/src/cln.rs Outdated Show resolved Hide resolved
sim-lib/src/lib.rs Outdated Show resolved Hide resolved
sim-lib/src/lib.rs Outdated Show resolved Hide resolved
sim-lib/src/lib.rs Outdated Show resolved Hide resolved
sim-lib/src/lib.rs Show resolved Hide resolved
sim-lib/src/lib.rs Outdated Show resolved Hide resolved
sim-lib/src/lib.rs Outdated Show resolved Hide resolved
sim-lib/src/cln.rs Outdated Show resolved Hide resolved
sim-lib/src/cln.rs Outdated Show resolved Hide resolved
Copy link
Contributor

@carlaKC carlaKC left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tACK against the error originally described in #96

 ERROR [sim_lib] Error while sending payment 0327b7826a232ba4c7032f659ae0b65528f9b9f3191c54cbea423a207cff150029 -> 0346b6eeeaa8f13a64e4bff924a2cb9db94e01b3cc1819970c1f3e8cb4eabb3f5f

Simulation output:

0327b7826a232ba4c7032f659ae0b65528f9b9f3191c54cbea423a207cff150029,0346b6eeeaa8f13a64e4bff924a2cb9db94e01b3cc1819970c1f3e8cb4eabb3f5f,2000,Unknown,1696256242579,0,Unknown

LGTM pending @sr-gi's comments. Nothing blocking in my review except for shutting down if we can't send a payment result in produce_simulation_results (I'm also happy to pick up in followup, as shutdown hygiene needs some general work IMO).

sim-lib/src/lib.rs Outdated Show resolved Hide resolved
sim-lib/src/lib.rs Outdated Show resolved Hide resolved
sim-lib/src/lib.rs Outdated Show resolved Hide resolved
Comment on lines 694 to 706
if results.clone().send((payment, result)).await.is_err() {
log::debug!("Could not send payment result");
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is pre-existing, but I think we should break if we can't send into the results channel? Because something has exit downstream.

Note to future selves: We don't currently do this for trackpayment but I don't think that's actually correct. I want to do a run-through of all our shutdown logic because I think using channels has made it overly complex, so can deal with trackpayment separately.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

made the same note

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is pre-existing, but I think we should break if we can't send into the results channel? Because something has exit downstream.

I think we should only break out of these listeners if shutdown is called. If something exits with finality, it should trigger shutdown, which leads to break(s), which can lead to channel send fail (because we bias to shutdowns?), but we log these send failures like we do here

sim-lib/src/lib.rs Outdated Show resolved Hide resolved
sim-lib/src/cln.rs Show resolved Hide resolved
@okjodom okjodom force-pushed the continue-sim branch 2 times, most recently from 6993085 to 37652cf Compare October 2, 2023 18:57
Copy link
Member

@sr-gi sr-gi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few comments, mostly nits, LGTM otherwise

sim-lib/src/lib.rs Outdated Show resolved Hide resolved
sim-lib/src/lib.rs Outdated Show resolved Hide resolved
set.spawn(track_payment_result(
source_node,results.clone(),simulation_output, shutdown.clone(),
source_node,results.clone(),payment, shutdown.clone(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure why fmt is not catching this, but there are some spaces missing here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fmt is unreliable in select :(

Copy link
Collaborator Author

@okjodom okjodom Oct 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hm big facepalm

 instead of stopping the simulation on the first send payment rpc error,
 we gracefully handle the error and report a default send payment
result. since send payment rpc might fail before dispatching a payment
(i.e in CLN case), we allow activity result reporting to skip tracking
of payment if there's no payment hash
Copy link
Contributor

@carlaKC carlaKC left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tACK fb981a2 🎉

@sr-gi
Copy link
Member

sr-gi commented Oct 3, 2023

ACK fb981a2

@carlaKC carlaKC merged commit bc0f8ed into bitcoin-dev-project:main Oct 4, 2023
@carlaKC carlaKC mentioned this pull request Oct 4, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Bug: Handle Send Errors for CLN
3 participants