-
Notifications
You must be signed in to change notification settings - Fork 448
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(libp2p): add autodial retry threshold config option #1943
Conversation
When auto-dialing peers, if we have failed to dial them recently there's little point in redialing them. Adds a `autoDialPeerRetryThreshold` which is a value in ms. If we have attempted to dial a peer and that dial attempt failed but we are under our min connection count, do not auto dial the peer within the retry threshold. Defaults to 10 minutes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks promising, and should address some of the concerns of #1899. I have a few comments.
expect(connectionManager.openConnection.callCount).to.equal(1) | ||
expect(connectionManager.openConnection.calledWith(matchPeerId(peerWithAddress.id))).to.be.true() | ||
expect(connectionManager.openConnection.calledWith(matchPeerId(undialablePeer.id))).to.be.false() | ||
}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should add another test to ensure that the undialable peer is eventually dialled once the threshold has expired.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some comments / questions. Action items i would love to see addressed prior to merging:
- lower the default value
- don't set LAST_DIAL_FAILED if not an auto-dial dial
/** | ||
* When we've failed to dial a peer, do not autodial them again within this | ||
* number of ms. (default: 10 minutes) | ||
*/ | ||
autoDialPeerRetryThreshold?: number |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
honestly, even 1 minute would significantly help from my testing experience.
Once we add in a badPeer/bad multiaddr, it will free up time to check others instead of constantly spamming the same addresses.
const when = parseInt(uint8ArrayToString(lastDialFailure)) | ||
|
||
if (isNaN(when)) { | ||
return true | ||
} | ||
|
||
// only dial if the time since the last failure is above the retry threshold | ||
return Date.now() - when > this.autoDialPeerRetryThresholdMs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
const when = parseInt(uint8ArrayToString(lastDialFailure)) | |
if (isNaN(when)) { | |
return true | |
} | |
// only dial if the time since the last failure is above the retry threshold | |
return Date.now() - when > this.autoDialPeerRetryThresholdMs | |
const lastDialFailuretimestamp = parseInt(uint8ArrayToString(lastDialFailure)) | |
if (isNaN(lastDialFailuretimestamp)) { | |
return true | |
} | |
// only dial if the time since the last failure is above the retry threshold | |
return Date.now() - lastDialFailuretimestamp > this.autoDialPeerRetryThresholdMs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a more intentional variable name here would be nice
/** | ||
* @see https://libp2p.github.io/js-libp2p/interfaces/libp2p.index.unknown.ConnectionManagerInit.html#autoDialPeerRetryThreshold | ||
*/ | ||
export const AUTO_DIAL_PEER_RETRY_THRESHOLD = 1000 * 60 * 10 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we could safely lower this and still achieve the goals of not spamming likely-to-fail multiaddrs, while also not blocking users with a 10 minute wait (or forcing a refresh in the browser)
1-2 minutes would be a significant improvement and minor inconvenience in the case of internet outages.
A good approach could be determining how long it takes(D) to dial typical size of peers(N) and setting the value to some ratio of that number. If we expect users to typically find 100-200 peers per session (in the browser that's likely), and it takes 5 minutes to dial them all, then 5 minutes should be default.. but I doubt it needs to be that high.
|
||
/** | ||
* Store as part of the peer store metadata for a given peer, the value for this | ||
* key is a stringified number representing the timestamp of the last time a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* key is a stringified number representing the timestamp of the last time a | |
* key is a timestamp representing the last time a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the existing comment is wordier than needs to be
@@ -230,9 +232,22 @@ export class DialQueue { | |||
// clean up abort signals/controllers | |||
signal.clear() | |||
}) | |||
.catch(err => { | |||
.catch(async err => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if i'm not mistaken, this dial-queue is also used for direct dials, should we discern between an auto-dial triggered dial vs a manually/user requested dial?
it('should not re-dial peers we have recently failed to dial', async () => { | ||
const peerWithAddress: Peer = { | ||
id: await createEd25519PeerId(), | ||
protocols: [], | ||
addresses: [{ | ||
multiaddr: multiaddr('/ip4/127.0.0.1/tcp/4001'), | ||
isCertified: true | ||
}], | ||
metadata: new Map(), | ||
tags: new Map() | ||
} | ||
const undialablePeer: Peer = { | ||
id: await createEd25519PeerId(), | ||
protocols: [], | ||
addresses: [{ | ||
multiaddr: multiaddr('/ip4/127.0.0.1/tcp/4002'), | ||
isCertified: true | ||
}], | ||
// we failed to dial them recently | ||
metadata: new Map([[LAST_DIAL_FAILURE_KEY, uint8ArrayFromString(`${Date.now() - 10}`)]]), | ||
tags: new Map() | ||
} | ||
|
||
await peerStore.save(peerWithAddress.id, peerWithAddress) | ||
await peerStore.save(undialablePeer.id, undialablePeer) | ||
|
||
const connectionManager = stubInterface<ConnectionManager>({ | ||
getConnectionsMap: new PeerMap(), | ||
getDialQueue: [] | ||
}) | ||
|
||
autoDialler = new AutoDial({ | ||
peerStore, | ||
connectionManager, | ||
events | ||
}, { | ||
minConnections: 10, | ||
autoDialInterval: 10000 | ||
}) | ||
autoDialler.start() | ||
void autoDialler.autoDial() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See my other comment as well, but do we need all the test scaffolding for auto-dialer or can we just attempt to dial the peer directly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
regardless of the answer here, we may want another test regarding direct-dials covering the expected cases (dial them again even if failed auto dial previously, because it's a direct dial request)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The last-failed-dial key is only used by the auto-dialer to exclude those peers from autodialing, it's not used by explicit dials.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see that now, thanks
log.error('dial failed to %s', pendingDial.multiaddrs.map(ma => ma.toString()).join(', '), err) | ||
|
||
if (peerId != null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (peerId != null) { | |
if (peerId != null && priority < someVal) { |
to address manual dials? or do we want a specific dial option to flag whether we will update the LAST_DIAL_FAILURE_KEY for a peer that we only use with auto-dialer?
Any reason why? The value is only used by the auto-dialer to exclude peers from the list to dial, direct dials don't use this. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, I would add another test to ensure that the undiallable peer is eventually dialled once the threshold has expired, but my opinion is strong on that.
I added this check into the existing test |
Ah, I see now, great. |
|
||
/** | ||
* Store as part of the peer store metadata for a given peer, the value for this | ||
* key is a stringified number representing the timestamp of the last time a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the existing comment is wordier than needs to be
const when = parseInt(uint8ArrayToString(lastDialFailure)) | ||
|
||
if (isNaN(when)) { | ||
return true | ||
} | ||
|
||
// only dial if the time since the last failure is above the retry threshold | ||
return Date.now() - when > this.autoDialPeerRetryThresholdMs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a more intentional variable name here would be nice
When auto-dialing peers, if we have failed to dial them recently there's little point in redialing them as it's probably just going to fail again which consumes a slot in the dial queue and other resources.
Adds a
autoDialPeerRetryThreshold
config key to the connection manager which is a value in ms.If we have attempted to dial a peer and that dial attempt failed but we are under our min connection count, do not auto dial the peer within the retry threshold.
Defaults to 1 minute.
Closes #1899