-
Notifications
You must be signed in to change notification settings - Fork 30
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
sim-lib: continue simulation on send payment errors #104
Conversation
1766ffd
to
bacc835
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for picking this up! Sorry that I gave you rebase conflicts :')
bacc835
to
0615f4b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approach looks good!
I do think that we need to take a slightly more subtle approach to handling errors (right now we'll just create failed events for any send_payment
failure). In my mind there are two types of errors:
- Terminal errors: the lightning node has shut down -> we can't recover and should also shut down.
- Payment errors: the API has returned an error, but it's actually just a pathfinding error (eg: no route) -> we can recover from this, and should record the failure and try again.
Right now, if one of the lightning nodes shuts down we'll just continue to track failed payments when we should really just shut down. To address #96, we need to examine the error from CLN and figure out whether it's terminal or now. So basically all of this code as-is, with a little some more precise error handling in the CLN impl.
@carlaKC I agree with your sentiment above, but have a slightly different proposal regarding interpretation of Terminal errors
Instead of shutting down the entire simulation, I propose we stop simulating the activity or activity set affected by nodes we have seen terminal errors on. This shuts down parts of the simulation, but continues the still good parts.
So building on this PR, my proposal is we stop simulating activity related to the CLN node if we observe terminal failure on the node. Oh, and we terminate the simulation if there's no more activities to simulate. |
0615f4b
to
988f908
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just read through the shutdown strategy commit!
e2b66c8
to
f89506a
Compare
f89506a looking nice 👌 thanks for going through a few iterations to get it there! I you want to go ahead and fill in the parsing I think we can do this all in one PR? |
I was planning the parsing as follow-ups when I get a chance to build on this. Starting w cln because that's the most offending |
SGTM, no strong feelings.
GLHF :') |
d433227
to
f5e9eaa
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tACK against the error originally described in #96
ERROR [sim_lib] Error while sending payment 0327b7826a232ba4c7032f659ae0b65528f9b9f3191c54cbea423a207cff150029 -> 0346b6eeeaa8f13a64e4bff924a2cb9db94e01b3cc1819970c1f3e8cb4eabb3f5f
Simulation output:
0327b7826a232ba4c7032f659ae0b65528f9b9f3191c54cbea423a207cff150029,0346b6eeeaa8f13a64e4bff924a2cb9db94e01b3cc1819970c1f3e8cb4eabb3f5f,2000,Unknown,1696256242579,0,Unknown
LGTM pending @sr-gi's comments. Nothing blocking in my review except for shutting down if we can't send a payment result in produce_simulation_results
(I'm also happy to pick up in followup, as shutdown hygiene needs some general work IMO).
sim-lib/src/lib.rs
Outdated
if results.clone().send((payment, result)).await.is_err() { | ||
log::debug!("Could not send payment result"); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is pre-existing, but I think we should break
if we can't send
into the results channel? Because something has exit downstream.
Note to future selves: We don't currently do this for trackpayment
but I don't think that's actually correct. I want to do a run-through of all our shutdown logic because I think using channels has made it overly complex, so can deal with trackpayment
separately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
made the same note
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is pre-existing, but I think we should break if we can't send into the results channel? Because something has exit downstream.
I think we should only break out of these listeners if shutdown is called. If something exits with finality, it should trigger shutdown, which leads to break(s), which can lead to channel send fail (because we bias to shutdowns?), but we log these send failures like we do here
6993085
to
37652cf
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few comments, mostly nits, LGTM otherwise
sim-lib/src/lib.rs
Outdated
set.spawn(track_payment_result( | ||
source_node,results.clone(),simulation_output, shutdown.clone(), | ||
source_node,results.clone(),payment, shutdown.clone(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure why fmt
is not catching this, but there are some spaces missing here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fmt
is unreliable in select
:(
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hm big facepalm
instead of stopping the simulation on the first send payment rpc error, we gracefully handle the error and report a default send payment result. since send payment rpc might fail before dispatching a payment (i.e in CLN case), we allow activity result reporting to skip tracking of payment if there's no payment hash
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tACK fb981a2 🎉
ACK fb981a2 |
Instead of stopping the simulation on the first send payment rpc error, we gracefully handle the error and report a default send payment result. Since send payment rpc might fail before dispatching a payment (i.e in CLN case), we allow activity result reporting to skip tracking of payment if there's no payment hash
Sample results with failed keysends. The simulation continues until stopped
TODO:
send_payment
rpc errors to determine terminal failuressend_payment
rpc errors to determine terminal failuresfixes #96