[DOC] Update metrics generator doc and add best practices (#2563)
* Update metrics generator doc

* Updates from doc validator

* Fix typos and links

* Fix typo in admonition

* Fix validator issues, part 2

* Apply suggestions from code review

Co-authored-by: Heds Simons <[email protected]>

* Move content and fix admonitions

* Added best practices

* Update docs/sources/tempo/metrics-generator/span_metrics.md

* Apply suggestions from code review

* Apply suggestions from code review

Co-authored-by: Eve Meelan <[email protected]>

* Fix links, admonitions

---------

Co-authored-by: Heds Simons <[email protected]>
Co-authored-by: Eve Meelan <[email protected]>
3 people authored Jun 16, 2023
1 parent 0b60c74 commit 9499cb9
Showing 10 changed files with 334 additions and 52 deletions.
@@ -61,4 +61,5 @@ metrics:

The same service graph metrics can also be generated by Tempo.
This is more efficient and recommended for larger installations.
For additional information about viewing service graph metrics in Grafana and calculating cardinality, check out the [server side documentation]({{< relref "../../metrics-generator/service_graphs#grafana" >}}).

For additional information about viewing service graph metrics in Grafana and calculating cardinality, refer to the [server side documentation]({{< relref "../../metrics-generator/service_graphs#enable-service-graphs-in-Grafana" >}}).
14 changes: 8 additions & 6 deletions docs/sources/tempo/metrics-generator/_index.md
@@ -13,22 +13,24 @@ Metrics-generator is an optional Tempo component that derives metrics from inges
If present, the distributor will write received spans to both the ingester and the metrics-generator.
The metrics-generator processes spans and writes metrics to a Prometheus data source using the Prometheus remote write protocol.

>**Note**: Enabling metrics generation and remote writing them to Grafana Cloud Metrics produces extra active series that could impact your billing. For more information on billing, refer to [Billing and usage](/docs/grafana-cloud/billing-and-usage/).
{{% admonition type="note" %}}
Enabling metrics generation and remote writing them to Grafana Cloud Metrics produces extra active series that could impact your billing. For more information on billing, refer to [Billing and usage](/docs/grafana-cloud/billing-and-usage/).
{{% /admonition %}}

## Overview

Metrics-generator leverages the data available in Tempo's ingest path to provide additional value by generating metrics from traces.

The metrics-generator internally runs a set of **processors**.
Each processor ingests spans and produces metrics.
Every processor derives different metrics. Currently the following processors are available:
Every processor derives different metrics. Currently, the following processors are available:

- Service graphs
- Span metrics

<p align="center"><img src="server-side-metrics-arch-overview.png" alt="Service metrics architecture"></p>

### Service graphs
## Service graphs

Service graphs are the representations of the relationships between services within a distributed system.

@@ -38,7 +40,7 @@ The amount of request and their duration are recorded as metrics, which are used

To learn more about this processor, read the [documentation]({{< relref "./service_graphs" >}}).

### Span metrics
## Span metrics

The span metrics processor derives RED (Request, Error and Duration) metrics from spans.

@@ -48,11 +50,11 @@ The more dimensions are enabled, the higher the cardinality of the generated met

To learn more about this processor, read the [documentation]({{< relref "./span_metrics" >}}).

### Remote writing metrics
## Remote writing metrics

The metrics-generator runs a Prometheus Agent that periodically sends metrics to a `remote_write` endpoint.
The `remote_write` endpoint is configurable and can be any [Prometheus-compatible endpoint](https://prometheus.io/docs/prometheus/latest/configuration/configuration/#remote_write).
To learn more about the endpoint configuration, refer to the [Metrics-generator]({{< relref "../configuration#metrics-generator" >}}) section of the Tempo Configuration documentation.
Writing interval can be controlled via `metrics_generator.registry.collection_interval`.

When multi-tenancy is enabled, the metrics-generator forwards the `X-Scope-OrgID` header of the original request to the remote_write endpoint.
When multi-tenancy is enabled, the metrics-generator forwards the `X-Scope-OrgID` header of the original request to the `remote_write` endpoint.
124 changes: 124 additions & 0 deletions docs/sources/tempo/metrics-generator/active-series.md
@@ -0,0 +1,124 @@
---
aliases:
- /docs/tempo/latest/metrics-generator/active-series
title: Active series
menuTitle: Active series
description: Learn about active series and how they are calculated.
weight: 100
---

# Active series

An active series is a time series that receives new data points or samples. Shortly after you stop writing new data points to a time series, it is no longer considered active.

Metrics generated by Tempo's metrics generator can provide both RED (Rate/Error/Duration) metrics and interdependency graphs between services in a trace (the Service Graph functionality in Grafana).
These capabilities rely on a set of generated span metrics and service metrics.

Any span that is ingested by Tempo could potentially create up to 13 metrics. However, this doesn't mean that a new active series is created every time a span is ingested.

The number of active series generated depends on the label pairs generated from span data that are associated with the metrics, similar to other Prometheus-formatted data.

For additional information, refer to the [Active series and DPM documentation](/docs/grafana-cloud/billing-and-usage/active-series-and-dpm/#active-series).

## Active series calculation

Active series for a metric increase when a new value for a label key is introduced. For example, the `span_kind` label has a total of five possible values, and the `status_code` label has a total of three possible values.

At first glance, you might assume that this means at least 15 (5*3) active series will be generated for each span. But this isn't the case.

Let's consider a span that's emitted from some piece of code in a service:

![Single span visualization](/static/img/docs/tempo/SingleSpan.jpeg)

Here's a single service with a single span.
If the code inside the span never leaves the service, then the `span_kind` label generated by the metrics generator will be `SPAN_KIND_INTERNAL` and never deviate. It'll never be one of the other four possible values.

Similarly, if the code inside the span never errors, it'll only have the `STATUS_CODE_OK` value for the `status_code` label.
This means that the metrics generator will only generate a single active series, where the service name will be _Service 1_ and the span name will be _span1_.
If we looked at the Prometheus data for the `traces_spanmetrics_call_total` metric, we'd see a single active series:

| service | span_name | span_kind | status_code | Metric value |
| --------- | --------- | ------------------ | -------------- | ------------ |
| Service 1 | span1 | SPAN_KIND_INTERNAL | STATUS_CODE_OK | 1 |

It doesn't matter how many times that span occurs in a trace either. For example, a span might be generated within a loop.
Whether the code runs once, 10 times, 100 times, or 1000 times, only a single active series is produced, and its counter is increased 1, 10, 100, or 1000 times:

![Single span with loop](/static/img/docs/tempo/SingleSpanLoop.jpeg)

If you looked at the Prometheus data, you'd see an instant value for `traces_spanmetrics_call_total` similar to the table. Again, one active series for the metric:

| service | span_name | span_kind | status_code | Metric value |
| --------- | --------- | ------------------ | -------------- | ------------ |
| Service 1 | span1 | SPAN_KIND_INTERNAL | STATUS_CODE_OK | 120 |


However, let's now assume that it does loop and there are occasionally errors.

![Single span with loop and errors](/static/img/docs/tempo/SinglespanLoopError.jpeg)

There are now two potential outcomes for a span when the code loops: one where everything successfully completes and one where there is an error.
This means that when the span completes `status_code` is now either `STATUS_CODE_OK` or `STATUS_CODE_ERROR`.
Because of that, the label values can be one of two values on a metric, and we now have two active series being generated based on the `status_code`, one for the `OK` status and one for the error.

Again, we could loop once, 10 times, 100, or more times, but there will only ever be two active series.

If we now looked at Prometheus instant values for `traces_spanmetrics_call_total`, we'd now see the following table:

| service | span_name | span_kind | status_code | Metric value |
| --------- | --------- | ------------------ | ----------------- | ------------ |
| Service 1 | span1 | SPAN_KIND_INTERNAL | STATUS_CODE_OK | 96 |
| Service 1 | span1 | SPAN_KIND_INTERNAL | STATUS_CODE_ERROR | 24 |

What happens if you call out to another service though? Let's add an option where, based on some arbitrary data, we sometimes make a downstream call to another service, but otherwise continue to run loops in our own service:

![Multiple spans with loops and errors](/static/img/docs/tempo/SingleSpanLoopErrorAnotherService.jpeg)

In this scenario, `span1`'s `span_kind` label would now be one of either `SPAN_KIND_INTERNAL` or `SPAN_KIND_CLIENT` (as it has acted as a client calling a downstream server).
If a call to the downstream service could also potentially fail, then for `SPAN_KIND_CLIENT`, the `status_code` could be either `STATUS_CODE_ERROR` or `STATUS_CODE_OK`.

At this point, `traces_spanmetrics_call_total` would have four different variations in labels:

| service | span_name | span_kind | status_code | Metric value |
| --------- | --------- | ------------------ | ----------------- | ------------ |
| Service 1 | span1 | SPAN_KIND_INTERNAL | STATUS_CODE_OK | 34 |
| Service 1 | span1 | SPAN_KIND_INTERNAL | STATUS_CODE_ERROR | 6 |
| Service 1 | span1 | SPAN_KIND_CLIENT | STATUS_CODE_OK | 23 |
| Service 1 | span1 | SPAN_KIND_CLIENT | STATUS_CODE_ERROR | 3 |

Because of the variation in values, we now have four active series for our metric instead of one. But, as far as Service 1 is concerned, there are still only four active series, because there isn't any other variation of the values for labels. You can run 1 trace, 10 traces, or 100 traces (each with however many loops of spans there are) and only four active series will ever be produced.

We've actually only told half the story in our last diagram. _Service 1_ called a second service, _Service 2_, which continues the trace by adding a new span, `span2`.
If there was a loop inside Service 2 with a single span that was generated from an upstream call from Service 1, and then a number of spans that were driven internally, which could also error, we'd end up with the possible values in the metric for `traces_spanmetrics_call_total` below:

| service | span_name | span_kind | status_code | Metric value |
| --------- | --------- | ------------------ | ----------------- | ------------ |
| Service 1 | span1 | SPAN_KIND_INTERNAL | STATUS_CODE_OK | 89 |
| Service 1 | span1 | SPAN_KIND_INTERNAL | STATUS_CODE_ERROR | 13 |
| Service 1 | span1 | SPAN_KIND_CLIENT | STATUS_CODE_OK | 44 |
| Service 1 | span1 | SPAN_KIND_CLIENT | STATUS_CODE_ERROR | 9 |
| Service 2 | span2 | SPAN_KIND_SERVER | STATUS_CODE_OK | 30 |
| Service 2 | span2 | SPAN_KIND_SERVER | STATUS_CODE_ERROR | 14 |
| Service 2 | span2 | SPAN_KIND_INTERNAL | STATUS_CODE_OK | 99 |
| Service 2 | span2 | SPAN_KIND_INTERNAL | STATUS_CODE_ERROR | 23 |

At this point, all our traces are composed of two potential span names, each of which produces two separate values of `span_kind` and two separate values of `status_code`. So we have eight (2 * 2 * 2) active series for the metric.

The number of active series that Tempo produces when ingesting spans for a trace is determined by the variability of label values across the possible span conditions, not by the number of traces or spans that are seen.
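
The counting rule above can be sketched in a few lines of Python. This is only an illustration and not how Tempo is implemented: the `make_span` and `count_series` helpers are hypothetical, and the spans are reduced to the four labels shown in the tables. It models the looped span with occasional errors (96 successes, 24 failures) and shows that 120 spans still yield only two active series.

```python
from collections import Counter

# The four labels used in the tables above.
LABELS = ("service", "span_name", "span_kind", "status_code")


def make_span(kind, status):
    """Build a hypothetical span, reduced to its metric labels."""
    return {
        "service": "Service 1",
        "span_name": "span1",
        "span_kind": kind,
        "status_code": status,
    }


def count_series(spans):
    """Group spans by label set; each distinct label set is one active series."""
    return Counter(tuple(span[label] for label in LABELS) for span in spans)


# 96 successful and 24 failed iterations of the same looped span.
spans = (
    [make_span("SPAN_KIND_INTERNAL", "STATUS_CODE_OK")] * 96
    + [make_span("SPAN_KIND_INTERNAL", "STATUS_CODE_ERROR")] * 24
)

series = count_series(spans)
print(f"{len(spans)} spans, {len(series)} active series")
for labels, value in series.items():
    print(labels, value)
```

Each key of the `Counter` corresponds to one row in the tables, and each value corresponds to the counter in the "Metric value" column.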

## Custom span attributes

There's another consideration for active series: extra label key/value pairs that can be added onto metrics from a span's attributes.
The Tempo metrics generator lets you turn arbitrary span attributes into label pairs on metrics.
When considering the number of active series generated, you also need to determine how many possible values there are for the span attribute being turned into a label.

For example, if you added an `http.method` span attribute as a metric label pair, there are five possible values in this example (one for each of these commonly used HTTP methods):

- `HEAD`
- `GET`
- `POST`
- `PUT`
- `DELETE`

If this label pair is added to every span metric, each existing series can split into up to five, so the *potential* number of active series for each metric is multiplied by five (this is very much a worst-case scenario, because very few spans use all five methods).
Instead of 8 active series in the last table above, we'd have 40 (8 * 5).
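
As a rough check of that arithmetic, the following Python sketch enumerates the label combinations from the eight-row table and then crosses them with the five method values. It is a worst-case upper bound under the assumption that every series sees every method; the data structures are made up for this illustration.

```python
from itertools import product

# Service name, span name, and the span kinds seen for that span (from the table above).
spans = [
    ("Service 1", "span1", ["SPAN_KIND_INTERNAL", "SPAN_KIND_CLIENT"]),
    ("Service 2", "span2", ["SPAN_KIND_SERVER", "SPAN_KIND_INTERNAL"]),
]
statuses = ["STATUS_CODE_OK", "STATUS_CODE_ERROR"]
methods = ["HEAD", "GET", "POST", "PUT", "DELETE"]

# One entry per existing active series: (service, span_name, span_kind, status_code).
base_series = [
    (service, name, kind, status)
    for service, name, kinds in spans
    for kind, status in product(kinds, statuses)
]

# Worst case: every existing series is seen with every method value.
with_method = [labels + (method,) for labels, method in product(base_series, methods)]

print(len(base_series), "series without the custom label")
print(len(with_method), "potential series with http.method added")
```

In practice the real number is usually lower, because only label combinations that actually occur in ingested spans become active series.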
50 changes: 50 additions & 0 deletions docs/sources/tempo/metrics-generator/cardinality.md
@@ -0,0 +1,50 @@
---
aliases:
- /docs/tempo/latest/metrics-generator/cardinality
title: Cardinality
menuTitle: Cardinality
description: Learn what cardinality is and how metrics generation impacts it.
weight: 100
---

# Cardinality

Cardinality refers to the total combination of key/value pairs, such as labels and label values for a given metric series or log stream, and how many unique combinations they generate.
For more information on cardinality, see the [What are cardinality spikes and why do they matter?](/blog/2022/02/15/what-are-cardinality-spikes-and-why-do-they-matter/) blog post.

Because writes to a time-series database (TSDB) are in series, high cardinality does not make a big difference to performance at ingest.
However, cardinality can have a major impact on querying: the higher the cardinality, the more items a query has to iterate over.

## Traces collection and metrics

Tempo’s server-side metrics generation adds functionality to the collection of traces by creating Prometheus-based metrics that track a variety of measurements such as:

- Total span call counts
- Span latency histograms
- Total span size count

The metrics-generator creates metrics which define the relationship between services via edges and nodes.
Each of these metrics is queryable using a set of Prometheus labels (key/value pairs).

Each new value for a label increases the number of active series associated with a metric. (To learn more about active series, read the [Trace active series]({{< relref "./active-series" >}}) documentation.)

This is also known as an increase in cardinality. The number of active series generated for a metric depends on the number of labels that exist for that metric and on the number of values each label takes.

In a non-modified instance of the metrics generator, a small number of labels are added automatically.
Because labels like `span_kind` and `status_code` only have a few valid values, the largest variable for the number of active series produced for each metric depends on the number of service names and span names associated with trace spans.

The metrics-generator can also be configured to add extra labels to metrics, using span attribute key/value pairs that are mapped directly to these labels. For details, see the [custom span attribute documentation]({{< relref "../configuration#metrics-generator" >}}).

Be careful when configuring custom attributes: the greater the number of values seen in a specific attribute, the greater the number of active series produced. For more information about active series, refer to the [active series documentation]({{< relref "./active-series" >}}).

Let's say that you are adding a custom attribute that includes unique customer IDs as a metrics label. If you have 100 customers, this could potentially multiply the number of active series generated by up to 100 (for example, going from 25,000 active series to 2.5M).
Always consider which attributes will actually be useful as labels for querying metrics, as well as the cardinality that they will increase metrics by.
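
One way to weigh candidate attributes before enabling them is to compare their worst-case impact side by side. The sketch below is a back-of-the-envelope estimate only: the `worst_case_series` helper and the `customer.id` attribute name are hypothetical, and the value counts are the ones used as examples in this documentation.

```python
def worst_case_series(base_series: int, distinct_values: int) -> int:
    """Upper bound on active series after adding a label with `distinct_values` values."""
    return base_series * distinct_values


base = 25_000  # active series before adding any custom attribute

# Candidate attributes and the number of distinct values each is expected to take.
candidates = {
    "http.method": 5,
    "customer.id": 100,  # hypothetical attribute name for unique customer IDs
}

for attribute, distinct_values in candidates.items():
    estimate = worst_case_series(base, distinct_values)
    print(f"{attribute}: up to {estimate:,} active series")
```

The estimate is an upper bound because it assumes every existing series is seen with every value of the new attribute.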

## Dry-running the metrics-generator

Often, the most reliable way to estimate the number of active series is to run the metrics-generator in dry-run mode.
In dry-run mode, the metrics-generator generates metrics but doesn't collect them, so nothing is written to metrics storage.
The override `metrics_generator_disable_collection` is defined for this use-case.

To get an estimate, run the metrics-generator normally and set the override to `true`.
Then, check `tempo_metrics_generator_registry_active_series` to get an estimation of the active series for that set-up.
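
If you have the metrics-generator's own metrics available as Prometheus text-format output, a small script can total the gauge across label sets. The sketch below is an assumption-laden illustration: the sample text, the `tenant` label values, and the `total_active_series` helper are made up for this example, and the parser assumes sample lines carry no trailing timestamp.

```python
METRIC = "tempo_metrics_generator_registry_active_series"

# Illustrative sample only; real output will differ.
SAMPLE = """\
# TYPE tempo_metrics_generator_registry_active_series gauge
tempo_metrics_generator_registry_active_series{tenant="team-a"} 1200
tempo_metrics_generator_registry_active_series{tenant="team-b"} 340
some_other_metric 7
"""


def total_active_series(exposition_text: str, metric: str = METRIC) -> int:
    """Sum every sample of `metric` in Prometheus text-format output."""
    total = 0.0
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # Assumes "name{labels} value" with no timestamp after the value.
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        if name == metric:
            total += float(value)
    return int(total)


print(total_active_series(SAMPLE))
```

Summing across label sets gives a single estimate for the whole set-up; keeping the per-label values instead shows which tenant contributes the most series.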
2 changes: 1 addition & 1 deletion docs/sources/tempo/metrics-generator/service-graph-view.md
@@ -4,7 +4,7 @@ menuTitle: Service graph view
description: Grafana's service graph view utilizes metrics generated by the metrics-generator (or Grafana Agent) to display span request rates, error rates, and durations, as well as service graphs.
aliases:
- /docs/tempo/latest/metrics-generator/app-performance-mgmt
weight: 200
weight: 400
---

# Service graph view