Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Outgoing request error rate #4661

Closed
twoeths opened this issue Oct 17, 2022 · 11 comments
Closed

Outgoing request error rate #4661

twoeths opened this issue Oct 17, 2022 · 11 comments
Assignees
Labels
prio-medium Resolve this some time soon (tm). scope-networking All issues related to networking, gossip, and libp2p.
Milestone

Comments

@twoeths
Copy link
Contributor

twoeths commented Oct 17, 2022

Describe the bug

Starting from Oct 14, the "Outgoing request error rate" metric spiked on unstable

Screen Shot 2022-10-17 at 14 10 38

There are logs like:

ct-16 00:00:03.150[network]       ^[[36mverbose^[[39m: Req  error method=ping, encoding=ssz_snappy, client=Unknown, peer=16...1fmVye, requestId=33248 code=REQUEST_ERROR_DIAL_ERROR, error=The dial request has no valid addresses
Error: The dial request has no valid addresses
    at DefaultDialer.dial (file:///usr/app/node_modules/libp2p/src/connection-manager/dialer/index.ts:182:21)
    at runMicrotasks (<anonymous>)
    at processTicksAndRejections (node:internal/process/task_queues:96:5)
    at runNextTicks (node:internal/process/task_queues:65:3)
    at processTimers (node:internal/timers:499:9)
    at EventTarget.openConnection (file:///usr/app/node_modules/libp2p/src/connection-manager/index.ts:521:26)
    at EventTarget.dial (file:///usr/app/node_modules/libp2p/src/libp2p.ts:382:12)
    at EventTarget.dialProtocol (file:///usr/app/node_modules/libp2p/src/libp2p.ts:396:24)
    at file:///usr/app/packages/beacon-node/src/network/reqresp/request/index.ts:91:22
    at withTimeout (file:///usr/app/packages/utils/src/timeout.ts:19:12)
Error: REQUEST_ERROR_DIAL_ERROR
    at file:///usr/app/packages/beacon-node/src/network/reqresp/request/index.ts:102:15
    at runMicrotasks (<anonymous>)
    at processTicksAndRejections (node:internal/process/task_queues:96:5)
    at runNextTicks (node:internal/process/task_queues:65:3)
    at processTimers (node:internal/timers:499:9)
    at sendRequest (file:///usr/app/packages/beacon-node/src/network/reqresp/request/index.ts:88:20)
    at ReqResp.sendRequest (file:///usr/app/packages/beacon-node/src/network/reqresp/reqResp.ts:157:22)
    at ReqResp.ping (file:///usr/app/packages/beacon-node/src/network/reqresp/reqResp.ts:103:12)
    at PeerManager.requestPing (file:///usr/app/packages/beacon-node/src/network/peers/peerManager.ts:348:25)
Oct-16 00:00:03.150[network]       ^[[36mverbose^[[39m: Req  error method=status, encoding=ssz_snappy, client=Unknown, peer=16...1fmVye, requestId=33249 code=REQUEST_ERROR_DIAL_ERROR, error=The dial request has no valid addresses

@g11tech this could be related to #4658 as I don't see other commits around this time

Expected behavior

No error

@g11tech
Copy link
Contributor

g11tech commented Oct 17, 2022

checking

@g11tech
Copy link
Contributor

g11tech commented Oct 17, 2022

Documenting the investigation so far:

log of a peer id where this error was thrown

  id: Secp256k1PeerIdImpl [PeerId(16Uiu2HAmVPLt1XZLrafBnTK3i1sKf8qQ5zYTokQyNEm4R581Bbnq)] {
    type: 'secp256k1',
    multihash: Digest {
      code: 0,
      size: 37,
      digest: <Buffer 08 02 12 21 03 f8 94 99 06 b1 3b b6 3c 15 b8 01 7b 25 b5 10 ea d5 43 1b 8a 9d 11 c2 de 6a 7b 76 de 66 21 b2 1a>,
      bytes: [Uint8Array]
    },
    privateKey: undefined,
    publicKey: <Buffer 08 02 12 21 03 f8 94 99 06 b1 3b b6 3c 15 b8 01 7b 25 b5 10 ea d5 43 1b 8a 9d 11 c2 de 6a 7b 76 de 66 21 b2 1a>
  },
  multiaddrs: [],
  dialTarget: {
    id: '16Uiu2HAmVPLt1XZLrafBnTK3i1sKf8qQ5zYTokQyNEm4R581Bbnq',
    addrs: []
  }
}

clearly this peerId didn't have gave any multiaddrs and hence there were no addrs to dail,

but on a fetch of the info of the peer via api:

{"data":{"peer_id":"16Uiu2HAmVPLt1XZLrafBnTK3i1sKf8qQ5zYTokQyNEm4R581Bbnq","enr":"","last_seen_p2p_address":"/ip4/34.221.99.113/tcp/9000/p2p/16Uiu2HAmVPLt1XZLrafBnTK3i1sKf8qQ5zYTokQyNEm4R581Bbnq","direction":"outbound","state":"connected"}}

The IP doesn't seem to belong to any of lodestar deployments, so unlikely that #4658 is the cause.

But will observe and see if there is any coherent pattern that can be observed on this error.

@g11tech
Copy link
Contributor

g11tech commented Oct 17, 2022

UPDATE:
for unstable lg1 this error rate has come drastically down to previous normals on container restart it seems

image

while feat1 errors on new deployment have since gone up

@g11tech
Copy link
Contributor

g11tech commented Oct 17, 2022

on feat 1 lg1 since deployment of unstable, errors seemed to have increased however, but not the dail errors so much (so no valid addresses are really the concern for e,g. on 3 instances of such errors like like 4-5 hours)

However status and ping errors really have seemed to increase:
image

and looking at the debug log there seems to be some issue

image

@g11tech
Copy link
Contributor

g11tech commented Oct 17, 2022

also w.r.t. `no valid addresses error"... it seems that there is a recurring dial out of a single peer which is having this bad data, shouldn't libp2p ban this peer automatically?

image

also node doesnt seem to have any data for this particular peer
image

@g11tech
Copy link
Contributor

g11tech commented Oct 17, 2022

looking at the dial errors (which could also be reason behind status and ping errors) on feat1 lg1, just a single peer seems to be the reason of all the trouble (added peerId console log before throwing)

image

@twoeths
Copy link
Contributor Author

twoeths commented Oct 18, 2022

yeah this is the common issue since we upgrade libp2p, same to v1.2.0-rc.1 #4660 (comment)

@philknows philknows added prio-critical Drop everything to resolve this immediately. Examples: consensus bug, un-usable CLI scope-networking All issues related to networking, gossip, and libp2p. labels Oct 18, 2022
@philknows philknows added this to the v1.3.0 milestone Nov 9, 2022
@twoeths
Copy link
Contributor Author

twoeths commented Nov 11, 2022

this still happens in v1.2.0

Screen Shot 2022-11-11 at 09 38 48

I think the reason is we received many goodbye requests, we also sent out many goodbye requests

Screen Shot 2022-11-11 at 09 39 44

which lead to 20% Dial error rate

Screen Shot 2022-11-11 at 09 40 16

  • v1.1.0
grep -e "REQUEST_ERROR_DIAL_ERROR" -rn beacon-2022-11-08.log | wc -l
2304
  • v1.2.0
grep -e "REQUEST_ERROR_DIAL_ERROR" -rn beacon-2022-11-10.log | wc -l
36534

sample log

Nov-10 00:00:01.871[network]       ^[[36mverbose^[[39m: Req  error method=ping, encoding=ssz_snappy, client=Teku, peer=16...iK3cxw, requestId=33174 code=REQUEST_ERROR_DIAL_ERROR, error=All promises were rejected
AggregateError: All promises were rejected
Error: REQUEST_ERROR_DIAL_ERROR
    at file:///usr/app/packages/beacon-node/src/network/reqresp/request/index.ts:102:15
    at sendRequest (file:///usr/app/packages/beacon-node/src/network/reqresp/request/index.ts:88:20)
    at ReqResp.sendRequest (file:///usr/app/packages/beacon-node/src/network/reqresp/reqResp.ts:198:22)
    at ReqResp.ping (file:///usr/app/packages/beacon-node/src/network/reqresp/reqResp.ts:104:12)
    at PeerManager.requestPing (file:///usr/app/packages/beacon-node/src/network/peers/peerManager.ts:348:25)
Nov-10 00:00:01.871[network]       ^[[36mverbose^[[39m: Req  error method=status, encoding=ssz_snappy, client=Teku, peer=16...iK3cxw, requestId=33176 code=REQUEST_ERROR_DIAL_ERROR, error=All promises were rejected
AggregateError: All promises were rejected
Error: REQUEST_ERROR_DIAL_ERROR
    at file:///usr/app/packages/beacon-node/src/network/reqresp/request/index.ts:102:15
    at sendRequest (file:///usr/app/packages/beacon-node/src/network/reqresp/request/index.ts:88:20)
    at ReqResp.sendRequest (file:///usr/app/packages/beacon-node/src/network/reqresp/reqResp.ts:198:22)
    at ReqResp.status (file:///usr/app/packages/beacon-node/src/network/reqresp/reqResp.ts:96:12)
    at PeerManager.requestStatus (file:///usr/app/packages/beacon-node/src/network/peers/peerManager.ts:360:27)
    at async Promise.all (index 0)
    at PeerManager.requestStatusMany (file:///usr/app/packages/beacon-node/src/network/peers/peerManager.ts:369:7)
Nov-10 00:00:02.191[network]       ^[[36mverbose^[[39m: Req  done method=status, encoding=ssz_snappy, client=Prysm, peer=16...5TozdR, requestId=33177, numResponse=1

@twoeths
Copy link
Contributor Author

twoeths commented Nov 27, 2022

This happens with v1.2.1 in mainnet nodes. It happens to a single peer really have no multiaddr somehow, every 10 seconds we got 2 "REQUEST_ERROR_DIAL_ERROR" errors with status request and ping request, the metrics say that's low tolerance error so we should disconnect the peer due to low score but somehow we did not.

@twoeths twoeths added prio-medium Resolve this some time soon (tm). and removed prio-critical Drop everything to resolve this immediately. Examples: consensus bug, un-usable CLI labels Nov 27, 2022
@philknows philknows modified the milestones: v1.3.0, v1.4.0 Dec 9, 2022
@philknows
Copy link
Member

As of v1.4.0 some samples from mainnet nodes:

cip-canary-bn
Screenshot 2023-01-31 at 9 35 31 AM

beta-mainnet-hzax41
Screenshot 2023-01-31 at 9 34 47 AM

Moving to v1.5.0 milestone if still required.

@philknows philknows modified the milestones: v1.4.0, v1.5.0 Jan 31, 2023
@twoeths
Copy link
Contributor Author

twoeths commented Feb 15, 2023

this issue was original caused when we keeps dialing to a single peer, does not happen anymore in v1.4.2 (there are a lot of libp2p/network upgrade since we found the issue). Caught some metrics from some nodes:

  • Node 7

Screen Shot 2023-02-15 at 11 17 01

  • Node 8

Screen Shot 2023-02-15 at 11 17 28

  • Node 9

Screen Shot 2023-02-15 at 11 17 49

@twoeths twoeths closed this as completed Feb 15, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
prio-medium Resolve this some time soon (tm). scope-networking All issues related to networking, gossip, and libp2p.
Projects
None yet
Development

No branches or pull requests

3 participants