-
Notifications
You must be signed in to change notification settings - Fork 880
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add missing locks in agent and service code #1570
Conversation
@aboch could you also take care of other missing locks for moby/moby#28712 ? I think we need a sweep of the swarm related changes in libnetwork and look for missing locks and other concurrency issues. cc @sanimej |
@mavenugo Sure thing, I was not aware of that issue. |
@@ -44,6 +44,8 @@ func (l *logWriter) Write(p []byte) (int, error) { | |||
|
|||
// SetKey adds a new key to the key ring | |||
func (nDB *NetworkDB) SetKey(key []byte) { | |||
nDB.Lock() | |||
defer nDB.Unlock() | |||
logrus.Debugf("Adding key %s", hex.EncodeToString(key)[0:5]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Debug logging should be outside of the lock. Same comment for SetPrimaryKey
and RemoveKey
as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I intentionally put it inside the lock so that it gets printed when this function effectively does the job
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is not the purpose of lock. After printing the debug if the lock can't be acquired the go routine is going to be stuck. If its a deadlock the stack decode will capture the state when the user dumps it.
@@ -136,6 +144,7 @@ func (c *controller) handleKeyChange(keys []*types.EncryptionKey) error { | |||
} | |||
} | |||
} | |||
c.Unlock() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Calling a.networkDB.SetKey(key.Key)
inside the controller lock doesn't look correct; its not necessary. Also,SetKey
takes nDB.Lock which can potentially lead to deadlocs.. We should save the new key and call a.networkDB.SetKey(key.Key)
outside of the controller lock.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How can it deadlock ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two concurrent go routines, one takes c.Lock()
and tries nDB.Lock()
; second go routine takes nDB.Lock()
and tries c.Lock()
. We had such deadlocks before, between network and endpoint locks IIRC.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Understand, but the nDB object does not have any reference to the agent or controller. It's a leaf object. From what I see, networkDB methods are properly coded so that they are the only one to exercise the nDB lock over the nDB data.
I don't think the situation you are describing can happen.
But I can take care of moving the function call outside of the controller lock, if that makes it less confusing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I took care of it. PTAL
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, networkdb is currently a separate package without any dependency on libnetwork core. But its safer to avoid nested locks unless its really required. In this case there is no need to call networkDB.SetKey
under the controller lock.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And yes. I agree, being conservative when it comes to holding locks is preferred. Especially when we have locks around a function call (with closely related code), it is much better to avoid it.
@@ -339,11 +363,13 @@ func (c *controller) agentClose() { | |||
return | |||
} | |||
|
|||
agent.Lock() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we get the slice of agent.driverCancelFuncs
under the lock and do the actual invocation outside of the lock ? Only the access to agent.driverCancelFuncs
is racy here. But we are locking around the execution of those functions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will take care of it. Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I took care of it. PTAL
Signed-off-by: Alessandro Boch <[email protected]>
Thanks @aboch. LGTM |
@@ -127,7 +136,7 @@ func (c *controller) handleKeyChange(keys []*types.EncryptionKey) error { | |||
if !same { | |||
c.keys = append(c.keys, key) | |||
if key.Subsystem == subsysGossip { | |||
a.networkDB.SetKey(key.Key) | |||
added = key.Key |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
am not too confident on this code-path. So consider this as a question... It looks like SetKey can be potentially called multiple times based on how many unique keys of type "subsysGossip" is present ?
With this change, it seems to be assumed that it will be only once and it is cached in added
variable which is later used to set Key again. Is that the expectation ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, it follows the existing logic for the deleted key (look for the deleted
variable).
This is the code path for the key rotation, where we know there will only be one deleted and one added key. But @sanimej can correct us here, he initially wrote this logic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mavenugo What aboch said is correct.The key rotation logic adds one new key to the set and removes one. So before this change also only one key would change on a rotation.
return n.ctrlr.agent.networkDB.Peers(n.id) | ||
} | ||
return []networkdb.PeerInfo{} | ||
return agent.networkDB.Peers(n.ID()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it safe to assume that networkDB
will never be nil ?
The current code might behave this way. But why should we have such assumptions built in (which might cause panics in the future) ? I think it is safe to add the nil check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes it is safe, because the networkDB
field is instantiated at agent creation, and never reset.
This is why there is no check over agent.networkDB == nil
throughout the agent code.
LGTM |
@mavenugo : Q: Does this locking changes effective on global/swarm scope or just local scope? |
@bklau it should be effective mostly for global/swarm scoped networks. |
@mavenugo : Is my understanding correct that if I have a globally-scoped overlay network based on Consul, say, (not 1.12 Swarm mode); then if I issued two network overlay to create/delete the SAME network "my_net" concurrently, a global lock would be contested and held first before doing any create/delete for "my_net"? |
Related to moby/moby#28697, moby/moby#28845 and moby/moby#28712,
Signed-off-by: Alessandro Boch [email protected]