-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix and work around timeout handler data races #1308
Conversation
The function `GetLogMessages` isn't used anywhere. I had tried to remove it to see if it was causing the data race in tektoncd#1124 - it _isnt_ but still it's not being used anywhere so why not remove :)
Logging in the timeout handler was added as part of tektoncd#731 cuz it helped us debug when the timeout handler didn't work as expected. Unfortunately it looks like the logger we're using can't be used in multiple go routines (uber-go/zap#99 may be related). Removing this logging to fix tektoncd#1124, hopefully can find a safe way to add logging back in tektoncd#1307.
GetRunKey is accessed in goroutines via the timeout_handler; accessing attributes of an object in a goroutine is not threadsafe. This is used as a key in a map, so for now replacing this with a value that should be unique but also threadsafe to fix tektoncd#1124
The following is the coverage report on pkg/.
|
I cannot reproduce the races with this code, nice! |
@@ -246,7 +246,8 @@ func (pr *PipelineRun) IsCancelled() bool { | |||
|
|||
// GetRunKey return the pipelinerun key for timeout handler map | |||
func (pr *PipelineRun) GetRunKey() string { | |||
return fmt.Sprintf("%s/%s/%s", pipelineRunControllerName, pr.Namespace, pr.Name) | |||
// The address of the pointer is a threadsafe unique identifier for the pipelinerun | |||
return fmt.Sprintf("%s/%p", pipelineRunControllerName, pr) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice!
@@ -86,7 +76,6 @@ type Informers struct { | |||
// TestAssets holds references to the controller, logs, clients, and informers. | |||
type TestAssets struct { | |||
Controller *controller.Impl | |||
Logs *observer.ObservedLogs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is strange - I remember using Logs to debug a test in the past, before the controller refactor.
I might be nice to have GetLogMessages populate the Logs field here, which would give tests access to logs and would enable testing on seeing a certain log message.
But I agree since it's not used now we can remove it and add it back if needed in future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this work!
/lgtm
@@ -170,7 +170,6 @@ func backoffDuration(count uint, jf jitterFunc) time.Duration { | |||
func (t *TimeoutSet) checkPipelineRunTimeouts(namespace string, pipelineclientset clientset.Interface) { | |||
pipelineRuns, err := pipelineclientset.TektonV1alpha1().PipelineRuns(namespace).List(metav1.ListOptions{}) | |||
if err != nil { | |||
t.logger.Errorf("Can't get pipelinerun list in namespace %s: %s", namespace, err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not super-happy to remove logging just ahead of a release - it may make troubleshooting harder for folks. That said, removing non-thread safe behavior is probably more important, so I'm ok with this - and we still have a chance to bring logging back before the release anyways.
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: afrittoli, bobcatfish, dibyom The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Changes
This mostly fixes #1124, will still need to update the
gcr.io/tekton-releases/tests/test-runner
to use a newer version of go which can catch this race.Will re-add logging to the timeout handler in #1307
Remove logging from timeout handler ✏️
Logging in the timeout handler was added as part of #731 cuz it helped
us debug when the timeout handler didn't work as expected. Unfortunately
it looks like the logger we're using can't be used in multiple go
routines (uber-go/zap#99 may be related).
Removing this logging to fix #1124, hopefully can find a safe way to add
logging back in #1307.
Make GetRunKey threadsafe 🔒
GetRunKey is accessed in goroutines via the timeout_handler; accessing
attributes of an object in a goroutine is not threadsafe. This is used
as a key in a map, so for now replacing this with a value that should be
unique but also threadsafe to fix #1124
Remove unused function 👻
The function
GetLogMessages
isn't used anywhere. I had tried to removeit to see if it was causing the data race in #1124 - it isnt but still
it's not being used anywhere so why not remove :)
Submitter Checklist
These are the criteria that every PR should meet, please check them off as you
review them:
See the contribution guide for more details.
Double check this list of stuff that's easy to miss:
cmd
dir, please updatethe release Task and TaskRun to build and release this image
Reviewer Notes
If API changes
are included, additive changes
must be approved by at least two OWNERS
and backwards incompatible changes
must be approved by more than 50% of the OWNERS,
and they must first be added
in a backwards compatible way.
Release Notes