I observe this when running the test I added in #1117. The setup is

host1 <-> host2(no-ipam) <-> host3

established like this:
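Something like the following, assuming weave's plain weave launch <peer> CLI (the exact commands are an assumption here; host2 stays out of IPAM simply because nothing on it ever requests an address):

host1:~$ weave launch
host2:~$ weave launch host1
host3:~$ weave launch host2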
And then we start a container requiring ipam on host3:

host3:~$ weave run -ti gliderlabs/alpine /bin/sh
This takes ~30 seconds to complete.
Enabling debug logging on host3 shows
DEBU: 2015/07/11 15:13:43.544408 [allocator 46:db:ad:5c:8c:89] Paxos proposing
DEBU: 2015/07/11 15:13:48.543790 [allocator 46:db:ad:5c:8c:89] Paxos proposing
DEBU: 2015/07/11 15:13:53.543764 [allocator 46:db:ad:5c:8c:89] Paxos proposing
DEBU: 2015/07/11 15:13:58.543680 [allocator 46:db:ad:5c:8c:89] Paxos proposing
DEBU: 2015/07/11 15:14:02.973963 [allocator 46:db:ad:5c:8c:89]: Allocator.OnGossip: 567 bytes
DEBU: 2015/07/11 15:14:02.977816 [allocator 46:db:ad:5c:8c:89]: Decided to ask peer f6:0a:27:c5:a9:98 for space in range [10.32.0.1-10.47.255.255)
DEBU: 2015/07/11 15:14:02.978697 [allocator 46:db:ad:5c:8c:89]: OnGossipUnicast from f6:0a:27:c5:a9:98 : 607 bytes
DEBU: 2015/07/11 15:14:02.979114 [allocator 46:db:ad:5c:8c:89]: Allocated 10.40.0.0 for d60e20ae5373d901af9a5995102c0a0ca3827cc68d5751b42e0f0bd8c62c0dac in [10.32.0.1-10.47.255.255)
So it looks like we only establish consensus when the periodic ipam gossip takes place.
Note that the commands as shown above don't actually reproduce the problem for me. Instead I have to run the full test from #1117, which first launches the three routers with normal discovery, starts two non-ipam containers (on host1 and host3), and then stops all routers. I reckon the difference is probably just down to timing and possibly PRNG seeding.
Thinking about it and looking at the ipam paxos code, I believe what is happening here is that, due to the partially connected topology with a non-ipam node in the middle, when peer3 starts it does not receive any IPAM gossip, since it connects only to peer2, which doesn't run IPAM. And the paxos code on peer1 only broadcasts gossip in some very narrowly defined circumstances, which probably do not hold here. In particular, peer1 has a quorum of one, so it can just create the ring, at which point it no longer broadcasts the ring state when receiving a paxos message (i.e. from peer3). Hence peer3 only finds out about the ring when the periodic gossip reaches it.
Perhaps the conditions under which ipam paxos broadcasts the ring need to be relaxed a bit.
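For concreteness, here is a minimal Go sketch of one such relaxation; all types and names below are hypothetical stand-ins, not weave's actual internals:

package ipam

// Hypothetical stand-ins for the allocator's internals; the real weave
// types and signatures differ.
type PeerName string
type PaxosMessage struct{} // proposal fields elided
type Ring struct{}
type Paxos struct{}

func (p *Paxos) Handle(msg PaxosMessage) {
	// normal propose/accept logic would live here
}

type Allocator struct {
	ring  *Ring  // nil until consensus has produced a ring
	paxos *Paxos // consensus state machine, only relevant pre-ring
}

func (alloc *Allocator) gossipRingTo(peer PeerName) {
	// unicast this peer's ring state to the given peer
}

// Sketch of the relaxation: once this peer has a ring, any incoming
// paxos message is evidence that the sender missed the outcome, so reply
// with the ring state immediately instead of staying silent.
func (alloc *Allocator) OnPaxosMessage(sender PeerName, msg PaxosMessage) {
	if alloc.ring != nil {
		alloc.gossipRingTo(sender)
		return
	}
	alloc.paxos.Handle(msg)
}

With something like this in place, peer3's repeated "Paxos proposing" messages above would be answered by peer1 straight away, rather than only when the next periodic gossip happens to arrive.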