Note: This bug report is somewhat speculative; I am not 100% certain that the description here is correct.
I think I know why this PR (#25588) does not make any difference.
These are the machine charts. A few observations can be made:

- There are always 128-150 pending (outstanding) requests.
- The Snap trienode handle times are most commonly in the 3-7s bucket, but quite often up to 20 seconds.
With ~128 responses queued and each one taking roughly 10 seconds to handle, if all our responses have already arrived, handling them will take ~1280 seconds, or about 21 minutes. So once we actually get to handle them, they will have timed out in the snap layer, even though they were fine in the p2p packet tracker layer.
This is wasteful: we are making requests at a higher rate than we can handle, and thus ignoring responses and refetching stuff. We need to adjust the mechanism so that we do not request more than we can handle -- alternatively, fix the timeout management so that we do not time out deliveries which have already arrived and are just waiting in the queue.
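To make the arithmetic concrete, here is a back-of-the-envelope sketch in Go. The figures (128 queued responses, ~10s mean handle time, a timeout on the order of 15 minutes) are taken from the charts and comments in this issue; the names are illustrative and are not go-ethereum identifiers.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Illustrative figures pulled from the charts and comments in this issue;
	// these are not go-ethereum constants.
	const (
		pendingRequests = 128              // responses sitting in the handling queue
		meanHandleTime  = 10 * time.Second // rough midpoint of the 3-20s handle times
		requestTimeout  = 15 * time.Minute // order of magnitude of the re-request delay
	)
	// With a single loop draining the queue, the last queued response waits
	// for every earlier one to be handled first.
	drainTime := time.Duration(pendingRequests) * meanHandleTime
	fmt.Printf("estimated drain time: %v\n", drainTime) // 21m20s
	if drainTime > requestTimeout {
		fmt.Println("responses at the back of the queue time out before they are handled")
	}
}
```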
In sync.go, every time the loop runs, we will assign/issue another trienode heal request. So once an item has been lying on the queue for ~15 minutes, we issue and send out a new request. It may be served quickly by the remote peer, but it will not be handled until after it has timed out, 15-20 minutes later.
A better model in this case would be to tune the ~128 pending requests down to maybe 10. The node in question has ~300 peers -- I guess this problem doesn't normally occur unless the node has a lot of peers.
Note: this is a consequence of disk IO speed, not network lag.
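For illustration, a minimal sketch of the "tune down the pending count" idea, assuming a scheduling loop that only issues new trienode heal requests while fewer than maxPending are outstanding. The types and function names are hypothetical, not go-ethereum's actual sync.go code:

```go
package main

import "fmt"

// healRequest stands in for a snap trienode heal request; the type and the
// scheduler below are sketches, not go-ethereum's actual code.
type healRequest struct{ id int }

// scheduleHeals issues new requests only while fewer than maxPending are
// outstanding, so the request rate is bounded by the handling rate rather
// than by how fast peers can respond.
func scheduleHeals(maxPending int, pending map[int]*healRequest, issue func() *healRequest) {
	for len(pending) < maxPending {
		req := issue()
		if req == nil {
			return // nothing left to request
		}
		pending[req.id] = req
	}
}

func main() {
	pending := make(map[int]*healRequest)
	next := 0
	issue := func() *healRequest {
		next++
		return &healRequest{id: next}
	}
	// With maxPending tuned down from ~128 to 10, at most 10 responses can
	// ever be waiting in the handling queue at once.
	scheduleHeals(10, pending, issue)
	fmt.Println("outstanding requests:", len(pending)) // 10
}
```

The point is simply that the cap on in-flight requests is what bounds how long a delivered response can sit unhandled, regardless of how many peers are available.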
It places the cap statically, at the point where we schedule requests. It could be applied in several different places, and could also be made dynamic -- e.g. adjusted so that maxPending goes down if the mean handle-time goes up. Suggestions appreciated.
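One possible shape for the dynamic variant: shrink the cap as the mean handle time grows, so that maxPending * meanHandleTime stays below some drain-time budget. This is only a sketch; pendingLimiter, targetDrain and the EWMA weighting are made up for illustration and do not exist in go-ethereum:

```go
package main

import (
	"fmt"
	"time"
)

// pendingLimiter is a hypothetical dynamic cap: it shrinks maxPending when the
// (exponentially weighted) mean handle time grows, and lets it grow back when
// handling speeds up. None of these names exist in go-ethereum.
type pendingLimiter struct {
	maxPending     int
	minPending     int
	hardCap        int
	targetDrain    time.Duration // how long the queued responses may take to drain
	meanHandleTime time.Duration // EWMA of observed handle times
}

// observe folds a new handle-time sample into the EWMA (weight 1/8) and
// recomputes the cap so that maxPending * meanHandleTime stays below targetDrain.
func (l *pendingLimiter) observe(sample time.Duration) {
	if l.meanHandleTime == 0 {
		l.meanHandleTime = sample
	} else {
		l.meanHandleTime = (l.meanHandleTime*7 + sample) / 8
	}
	limit := int(l.targetDrain / l.meanHandleTime)
	if limit < l.minPending {
		limit = l.minPending
	}
	if limit > l.hardCap {
		limit = l.hardCap
	}
	l.maxPending = limit
}

func main() {
	l := &pendingLimiter{minPending: 2, hardCap: 128, targetDrain: 2 * time.Minute}
	for _, sample := range []time.Duration{3 * time.Second, 7 * time.Second, 20 * time.Second} {
		l.observe(sample)
		fmt.Printf("mean handle time %v -> maxPending %d\n", l.meanHandleTime, l.maxPending)
	}
}
```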