Log when dropping metrics due to missing process_start_time_seconds #1921
Conversation
Codecov Report
@@            Coverage Diff             @@
##           master    #1921      +/-   ##
==========================================
- Coverage   91.36%   91.35%   -0.02%
==========================================
  Files         280      280
  Lines       16640    16641       +1
==========================================
- Hits        15203    15202       -1
- Misses       1006     1007       +1
- Partials      431      432       +1
Continue to review full report at Codecov.
@@ -159,6 +159,12 @@ func (tr *transaction) Commit() error {
	if tr.useStartTimeMetric {
		// AdjustStartTime - startTime has to be non-zero in this case.
		if tr.metricBuilder.startTime == 0.0 {
			// Unable to adjust start time because of missing start time metric
			tr.logger.Info(
Should this be a warning?
I considered that initially, but since it's not an issue with the collector or its config, but rather with the target applications, Info seems more appropriate?
i.e. the collector itself performs correctly; it just informs us that some applications are misconfigured.
I would consider it a warning, as it informs about degraded behavior and there is a user action that can be taken to resolve it.
We should not log on bad input, see https://github.com/open-telemetry/opentelemetry-collector/blob/master/CONTRIBUTING.md#logging
We should not log on bad input
@tigrannajaryan could you suggest an alternative solution then?
As described in #1921, we currently don't have any visibility into these metrics being dropped or why they are being dropped. Different people at Google ran into this issue in the last few months and spent hours or days debugging it.
Having a log message would help a lot.
As @serathius pointed out above, in most cases this does require human intervention: either fixing the application or changing the collector config.
FYI the error returned here is handled by Prometheus code, which will log it (but won't include Prometheus job and instance): https://github.com/prometheus/prometheus/blob/3240cf83f08e448e0b96a4a1f96c0e8b2d51cf61/scrape/scrape.go#L1074-L1077
Actually the logger does contain the Prometheus target: https://github.com/prometheus/prometheus/blob/3240cf83f08e448e0b96a4a1f96c0e8b2d51cf61/scrape/scrape.go#L259
so the extra log message is redundant; I will remove it from here.
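For illustration, here is a minimal, self-contained sketch of the approach this thread converges on: return an error from Commit() instead of logging, and let the Prometheus scrape loop (which already carries a target-scoped logger) report it. The type, field, and error names below are simplified stand-ins, not the receiver's actual code.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical sentinel error; the real receiver may use a different name and message.
var errNoStartTimeMetrics = errors.New(
	"process_start_time_seconds metric is missing, cannot adjust start time")

// transaction is a stripped-down stand-in for the Prometheus receiver's transaction type.
type transaction struct {
	useStartTimeMetric bool
	startTime          float64
}

// Commit returns an error rather than logging it; the Prometheus scrape loop
// that invokes Commit logs append failures together with the target labels.
func (tr *transaction) Commit() error {
	if tr.useStartTimeMetric && tr.startTime == 0.0 {
		return errNoStartTimeMetrics
	}
	// ... remaining commit logic elided ...
	return nil
}

func main() {
	tr := &transaction{useStartTimeMetric: true}
	if err := tr.Commit(); err != nil {
		fmt.Println("commit failed:", err)
	}
}
```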
Done.
LGTM
Sorry for the late reply, I was away for a few days.
IMO, the right approach is to record the failures in an internal metric. The guidelines mention it:
For such high-frequency events instead of logging consider adding an internal metric and increment it when the event happens.
I think obsreport.EndMetricsReceiveOp should do that.
If you want to also log the failure then I believe it is better to use logger.Debug() so that it is not enabled by default. Another alternative, if it must have more visibility, is to log an error once and clearly indicate in the error message that it will only be logged once. A third alternative is to use log rate limiting; the zap logger seems to support it (I haven't tried it).
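As a sketch of that third, rate-limiting alternative (not part of this PR), zap can wrap its core in a sampler so that repeated identical messages are dropped after the first few per interval; the tick and counts below are arbitrary example values.

```go
package main

import (
	"time"

	"go.uber.org/zap"
	"go.uber.org/zap/zapcore"
)

func main() {
	base, err := zap.NewProduction()
	if err != nil {
		panic(err)
	}

	// Within each one-second tick, keep the first 3 entries with the same
	// message and then only every 100th one after that.
	sampled := zap.New(zapcore.NewSamplerWithOptions(base.Core(), time.Second, 3, 100))
	defer sampled.Sync()

	for i := 0; i < 10000; i++ {
		// Only a handful of these actually reach the output.
		sampled.Warn("failed to adjust start time: process_start_time_seconds metric is missing")
	}
}
```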
@nilebox I merged the PR, please feel free to submit a follow-up PR if you want to introduce debug or rate-limited logging.
Description:
- Log a message when metrics are dropped by the Prometheus receiver due to a missing process_start_time_seconds metric.
- Record the failure via obsreport.EndMetricsReceiveOp and return an error from the transaction Commit().

Link to tracking Issue: Fixes #969