Add new alerts to the tempo-mixin #1292

Merged 2 commits on Feb 21, 2022
3 changes: 2 additions & 1 deletion CHANGELOG.md
@@ -5,7 +5,8 @@
distributors. Also, during this period, the ingesters will use considerably more resources and as such should be scaled up (or incoming traffic should be
heavily throttled). Once all distributors and ingesters have rolled performance will return to normal. Internally we have observed ~1.5x CPU load on the
ingesters during the rollout. [#1227](https://github.com/grafana/tempo/pull/1227) (@joe-elliott)
-* [ENHACEMENT] Enterprise jsonnet: add config to create tokengen job explicitly [#1256](https://github.com/grafana/tempo/pull/1256) (@kvrhdn)
+* [ENHANCEMENT] Enterprise jsonnet: add config to create tokengen job explicitly [#1256](https://github.com/grafana/tempo/pull/1256) (@kvrhdn)
+* [ENHANCEMENT] Add new scaling alerts to the tempo-mixin [#1292](https://github.com/grafana/tempo/pull/1292) (@mapno)
* [BUGFIX]: Remove unnecessary PersistentVolumeClaim [#1245](https://github.com/grafana/tempo/issues/1245)
* [BUGFIX] Fixed issue when query-frontend doesn't log request details when request is cancelled [#1136](https://github.com/grafana/tempo/issues/1136) (@adityapwr)

59 changes: 59 additions & 0 deletions operations/tempo-mixin/alerts.libsonnet
@@ -155,6 +155,65 @@
              runbook_url: 'https://github.com/grafana/tempo/tree/main/operations/tempo-mixin/runbook.md#TempoTenantIndexTooOld',
            },
          },
          {
            alert: 'TempoBadOverrides',
            expr: |||
              sum(tempo_runtime_config_last_reload_successful{namespace=~"%s"} == 0) by (cluster, namespace, job)
Review comment (Contributor): Does this metric increase even if the overrides config does not change?

Reply (Member Author): Yes. dskit doesn't even seem to check if the config has changed or not: https://github.com/grafana/dskit/blob/main/runtimeconfig/manager.go#L145-L170

            ||| % $._config.namespace,
            'for': '15m',
            labels: {
              severity: 'warning',
            },
            annotations: {
              message: '{{ $labels.job }} failed to reload overrides.',
              runbook_url: 'https://github.com/grafana/tempo/tree/main/operations/tempo-mixin/runbook.md#TempoBadOverrides',
            },
          },
          // ingesters
          {
            alert: 'TempoProvisioningTooManyWrites',
            // 30MB/s written to the WAL per ingester max
            expr: |||
              avg by (cluster, namespace) (rate(tempo_ingester_bytes_received_total{job=~".+/ingester"}[1m])) / 1024 / 1024 > 30
            |||,
            'for': '15m',
            labels: {
              severity: 'warning',
            },
            annotations: {
              message: 'Ingesters in {{ $labels.cluster }}/{{ $labels.namespace }} are receiving more data/second than desired, add more ingesters.',
              runbook_url: 'https://github.com/grafana/tempo/tree/main/operations/tempo-mixin/runbook.md#TempoProvisioningTooManyWrites',
            },
          },
          // compactors
          {
            alert: 'TempoCompactorsTooManyOutstandingBlocks',
            expr: |||
              sum by (cluster, namespace, tenant) (tempodb_compaction_outstanding_blocks{container="compactor", namespace=~"%s"}) / ignoring(tenant) group_left count(tempo_build_info{container="compactor", namespace=~"%s"}) by (cluster, namespace) > %d
            ||| % [$._config.namespace, $._config.namespace, $._config.alerts.outstanding_blocks_warning],
            'for': '6h',
            labels: {
              severity: 'warning',
            },
            annotations: {
              message: "There are too many outstanding compaction blocks in {{ $labels.cluster }}/{{ $labels.namespace }} for tenant {{ $labels.tenant }}, increase compactor's CPU or add more compactors.",
              runbook_url: 'https://github.com/grafana/tempo/tree/main/operations/tempo-mixin/runbook.md#TempoCompactorsTooManyOutstandingBlocks',
            },
          },
          {
            alert: 'TempoCompactorsTooManyOutstandingBlocks',
            expr: |||
              sum by (cluster, namespace, tenant) (tempodb_compaction_outstanding_blocks{container="compactor", namespace=~"%s"}) / ignoring(tenant) group_left count(tempo_build_info{container="compactor", namespace=~"%s"}) by (cluster, namespace) > %d
            ||| % [$._config.namespace, $._config.namespace, $._config.alerts.outstanding_blocks_critical],
            'for': '24h',
            labels: {
              severity: 'critical',
            },
            annotations: {
              message: "There are too many outstanding compaction blocks in {{ $labels.cluster }}/{{ $labels.namespace }} for tenant {{ $labels.tenant }}, increase compactor's CPU or add more compactors.",
              runbook_url: 'https://github.com/grafana/tempo/tree/main/operations/tempo-mixin/runbook.md#TempoCompactorsTooManyOutstandingBlocks',
            },
          },
        ],
      },
    ],
2 changes: 2 additions & 0 deletions operations/tempo-mixin/config.libsonnet
@@ -20,6 +20,8 @@
      max_tenant_index_age_seconds: 600,
      p99_request_threshold_seconds: 3,
      p99_request_exclude_regex: 'metrics|/frontend.Frontend/Process|debug_pprof',
      outstanding_blocks_warning: 100,
      outstanding_blocks_critical: 250,
    },

    // Groups labels to uniquely identify and group by {jobs, clusters, tenants}
37 changes: 36 additions & 1 deletion operations/tempo-mixin/runbook.md
@@ -184,4 +184,39 @@ to pull is to simply delete stale tenant indexes as all components will fallback

```
/<tenant>/index.json.gz
```

## TempoBadOverrides

This alert fires when a component has failed to reload its overrides for 15 minutes.

Fix the overrides! Overrides are loaded by the distributors, so check the
distributor logs for the reason the reload failed.

## TempoProvisioningTooManyWrites

This alert fires if the average rate of data received per ingester is above our target (30MB/s in the current alert expression).

How to fix:

1. Scale up ingesters
   - To compute the number of ingesters needed to handle the current write
     rate, run the following query, replacing <namespace> with the namespace
     to analyse and <target> with the target number of bytes/sec per ingester
     (check the alert threshold for the current target; the threshold is
     expressed in MB/s, so multiply it by 1024 * 1024 to get bytes/sec):
     ```
     sum(rate(tempo_ingester_bytes_received_total{namespace="<namespace>"}[$__rate_interval])) / (<target> * 0.9)
     ```
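
The arithmetic behind the query above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `desired_ingesters` and the example traffic figure are hypothetical, the 30MB/s default is taken from the alert expression, and the 0.9 factor is the one used in the query (it sizes the fleet so each ingester runs at about 90% of the target).

```python
import math

MIB = 1024 * 1024  # the alert expression divides bytes/sec by 1024 / 1024


def desired_ingesters(total_bytes_per_sec: float,
                      target_mib_per_sec: float = 30.0,
                      utilisation: float = 0.9) -> int:
    """Return how many ingesters keep the per-ingester rate under the target.

    total_bytes_per_sec: summed receive rate across all ingesters
                         (what the numerator of the query returns).
    target_mib_per_sec:  per-ingester target, 30 in the alert expression.
    utilisation:         fraction of the target each ingester should run at.
    """
    target_bytes_per_sec = target_mib_per_sec * MIB
    needed = total_bytes_per_sec / (target_bytes_per_sec * utilisation)
    # Round up: a fractional ingester cannot be deployed, and never go below 1.
    return max(1, math.ceil(needed))


if __name__ == "__main__":
    # Hypothetical cluster receiving 300 MiB/s in total.
    print(desired_ingesters(300 * MIB))
```

The query itself returns a fractional value; rounding up, as the sketch does, is the conservative choice when picking a replica count.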

## TempoCompactorsTooManyOutstandingBlocks

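This alert fires when the number of outstanding blocks per compactor for a tenant stays above the
configured threshold for a long period of time (`outstanding_blocks_warning` for 6h, `outstanding_blocks_critical` for 24h).
The alert does not require immediate action, but it is a symptom that compaction is underscaled,
which could affect the read path in particular.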

How to fix:

Compaction's bottleneck is most commonly CPU time, so adding more compactors is the most effective measure.

After compaction has been scaled out, it will take some time for the compactors to catch
up with their outstanding blocks.
Take a look at `tempodb_compaction_outstanding_blocks` and check whether the number of
outstanding blocks starts going down. If not, further scaling may be necessary.
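
To make the alert condition concrete, here is a minimal Python sketch of the calculation the alert expression performs: per-tenant outstanding blocks divided by the number of compactors, compared against the two thresholds from `config.libsonnet`. The function names and the sample numbers are hypothetical, and the sketch ignores the `for` durations (6h and 24h) that the real alerts also require.

```python
from typing import Dict, Optional

WARNING_THRESHOLD = 100   # outstanding_blocks_warning in config.libsonnet
CRITICAL_THRESHOLD = 250  # outstanding_blocks_critical in config.libsonnet


def blocks_per_compactor(outstanding_by_tenant: Dict[str, int],
                         compactors: int) -> Dict[str, float]:
    """Divide each tenant's outstanding blocks by the number of compactors."""
    if compactors <= 0:
        raise ValueError("need at least one compactor")
    return {tenant: blocks / compactors
            for tenant, blocks in outstanding_by_tenant.items()}


def severity(per_compactor: float) -> Optional[str]:
    """Map a per-compactor value to the severity whose threshold it exceeds."""
    if per_compactor > CRITICAL_THRESHOLD:
        return "critical"
    if per_compactor > WARNING_THRESHOLD:
        return "warning"
    return None


if __name__ == "__main__":
    # Hypothetical snapshot: three tenants, four compactors.
    ratios = blocks_per_compactor({"a": 1200, "b": 500, "c": 300}, compactors=4)
    for tenant, ratio in ratios.items():
        print(tenant, ratio, severity(ratio))
```

Because the value is normalised by the compactor count, adding compactors lowers it immediately, even before the backlog itself shrinks.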
36 changes: 36 additions & 0 deletions operations/tempo-mixin/yamls/alerts.yaml
@@ -99,3 +99,39 @@
    "for": "5m"
    "labels":
      "severity": "critical"
  - "alert": "TempoBadOverrides"
    "annotations":
      "message": "{{ $labels.job }} failed to reload overrides."
      "runbook_url": "https://github.com/grafana/tempo/tree/main/operations/tempo-mixin/runbook.md#TempoBadOverrides"
    "expr": |
      sum(tempo_runtime_config_last_reload_successful{namespace=~".*"} == 0) by (cluster, namespace, job)
    "for": "15m"
    "labels":
      "severity": "warning"
  - "alert": "TempoProvisioningTooManyWrites"
    "annotations":
      "message": "Ingesters in {{ $labels.cluster }}/{{ $labels.namespace }} are receiving more data/second than desired, add more ingesters."
      "runbook_url": "https://github.com/grafana/tempo/tree/main/operations/tempo-mixin/runbook.md#TempoProvisioningTooManyWrites"
    "expr": |
      avg by (cluster, namespace) (rate(tempo_ingester_bytes_received_total{job=~".+/ingester"}[1m])) / 1024 / 1024 > 30
    "for": "15m"
    "labels":
      "severity": "warning"
  - "alert": "TempoCompactorsTooManyOutstandingBlocks"
    "annotations":
      "message": "There are too many outstanding compaction blocks in {{ $labels.cluster }}/{{ $labels.namespace }} for tenant {{ $labels.tenant }}, increase compactor's CPU or add more compactors."
      "runbook_url": "https://github.com/grafana/tempo/tree/main/operations/tempo-mixin/runbook.md#TempoCompactorsTooManyOutstandingBlocks"
    "expr": |
      sum by (cluster, namespace, tenant) (tempodb_compaction_outstanding_blocks{container="compactor", namespace=~".*"}) / ignoring(tenant) group_left count(tempo_build_info{container="compactor", namespace=~".*"}) by (cluster, namespace) > 100
    "for": "6h"
    "labels":
      "severity": "warning"
  - "alert": "TempoCompactorsTooManyOutstandingBlocks"
    "annotations":
      "message": "There are too many outstanding compaction blocks in {{ $labels.cluster }}/{{ $labels.namespace }} for tenant {{ $labels.tenant }}, increase compactor's CPU or add more compactors."
      "runbook_url": "https://github.com/grafana/tempo/tree/main/operations/tempo-mixin/runbook.md#TempoCompactorsTooManyOutstandingBlocks"
    "expr": |
      sum by (cluster, namespace, tenant) (tempodb_compaction_outstanding_blocks{container="compactor", namespace=~".*"}) / ignoring(tenant) group_left count(tempo_build_info{container="compactor", namespace=~".*"}) by (cluster, namespace) > 250
    "for": "24h"
    "labels":
      "severity": "critical"