-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix the scraper/discover manager coordination on the Prometheus receiver #2089
Conversation
Codecov Report
@@ Coverage Diff @@
## master #2089 +/- ##
==========================================
- Coverage 92.16% 92.13% -0.04%
==========================================
Files 279 279
Lines 16980 16971 -9
==========================================
- Hits 15650 15636 -14
- Misses 910 915 +5
Partials 420 420
Continue to review full report at Codecov.
|
discoveryCtx := context.Background() | ||
discoveryCtx, cancel := context.WithCancel(ctx) | ||
r.cancelFunc = cancel |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Line 60 can be removed, the discoveryCtx
is always re-written correct? (but see the other comment about why the discoverCtx should not inherit from the context passed to start)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is very interesting pattern that I observed in a lot of places (maybe you can help me understand better how can we avoid this "possible" mistake).
Extending the incoming context from the Start func I think is wrong, when the context is saved for "async" work. In this case you pass discoveryCtx
to the discovery manager, and save a reference to it, then check if the context is cancelled to understand when to stop.
Here is why I think it is wrong:
- The caller of the "Start" will possibly add a "deadline" to the context to ensure that the components start within a certain amount of time. Or can define its own cancelation to the context (for whatever reasons).
- In case of extending from that context, you inherit the cancellation. And from the deadline example the extended context will be cancelled (the
WithTimeout
suggest to call the cancellation func when done), the caller is consider done when you return fromStart
, so the extended contextdiscoveryCtx
will be closed almost immediately after the Start ends.
I usually try to tell users that when you schedule "async" work you cannot inherit from "sync" context because the caller may cancel the context when you return (or you need to do a trick like I did here
type noCancellationContext struct { |
This is a long comment, hope you understand the problem, we had this #1813 where exactly this happened during the execution of an gRPC request we were scheduling async work in the queue_retry and return, the gRPC when we return will call done on the timeout return func, and the saved context will be "cancelled" so all the export calls will fail because we will pass a done context.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not familiar with the lifecycle of the components, so I looked at the the Host interface and it recommended me to start a new background context:
// If the component needs to perform a long-running starting operation then it is recommended
// that Start() returns quickly and the long-running operation is performed in background.
// In that case make sure that the long-running operation does not use the context passed
// to Start() function since that context will be cancelled soon and can abort the long-running operation.
// Create a new context from the context.Background() for long-running operations.
Start(ctx context.Context, host Host) error
I'm not familiar with the lifecycle of the components. Do we expect components to handle the cancellation/timeout signals in Start? If yes, we should mentioned it in the interface. If you clarify what the components should do with the context, I can try to make modifications. Unfortunately the Prometheus APIs doesn't allow us to pass different contexts to discovery.NewManager and discoveryManager.Run and that's the real problem here.
In an ideal situation, this code should be written like the following but the Prometheus APIs doesn't allow us to do it:
func (r *pReceiver) Start(ctx context.Context, host component.Host) error {
discoveryManager := discovery.NewManager(ctx, logger)
// ...
discoveryRunCtx := context.Background()
discoveryRunCtx, cancel := context.WithCancel(ctx)
discoveryManager.Run(discoveryRunCtx);
Splitting the cancellation and timeout sounds like a hack to get around the limitations of the APIs. Am I right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Btw, I removed the discoveryCtx := context.Background()
because it seems like the only way ctx is cancelled when things are timed out or something wrong happened in the starting process. It's ok to cancel the discovery manager in both of those cases. Let me know if this looks ok.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rakyll I think code will be simpler:
// In the service try to start all components in maximum 30 seconds
func (srv *Service) startAllComponents() {
ctx, cancelFunc := context.WithTimeout(context.Background(), 30*time.Second)
defer cancelFunc()
for i := range srv.components {
srv.components[i].Start(ctx, srv.host)
}
}
func (r *pReceiver) Start(ctx context.Context, host component.Host) error {
discoveryCtx, cancel := context.WithCancel(ctx)
r.cancelFunc = cancel
// Manager will run until discoveryCtx is cancelled.
discoveryManager := discovery.NewManager(discoveryCtx, logger)
...
}
As you can see the discoveryCtx
will be cancelled when the startAllComponents
ends, way before the service shutting down.
If you want, you can write a test that does exactly what I wrote, and you will see that the manager will be cancelled before you expect in the "Shutdown".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In an ideal situation, this code should be written like the following but the Prometheus APIs doesn't allow us to do it:
func (r *pReceiver) Start(ctx context.Context, host component.Host) error {
discoveryManager := discovery.NewManager(ctx, logger)
// ...
discoveryRunCtx := context.Background()
discoveryRunCtx, cancel := context.WithCancel(ctx)
discoveryManager.Run(discoveryRunCtx);
}
Yes, the context pass to run should be decoupled from the context pass to Start. You are right.
But because you cannot pass a different context to Run, you have to do discoveryCtx, cancel := context.WithCancel(context.Background())
otherwise you have the problem that I am mentioning.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Switched back to the background context.
The receiver contains various unnecessary sections. Rewriting the receiver's Start for better maintainability. Related to #1909.
The conflict is fixed. |
@rakyll Please fix the lint error:
|
Done. |
Fix open-telemetry#2089 Prepare release 1.9.0
The receiver contains various unnecessary sections. Rewriting the
receiver's Start for better maintainability.
Related to #1909.