[DOC] Update metrics generator doc and add best practices (#2563)

* Update metrics generator doc
* Updates from doc validator
* Fix typos and links
* Fix typo in admonition
* Fix validator issues, part 2
* Apply suggestions from code review
  Co-authored-by: Heds Simons <[email protected]>
* Move content and fix admonitions
* Added best practices
* Update docs/sources/tempo/metrics-generator/span_metrics.md
* Apply suggestions from code review
* Apply suggestions from code review
  Co-authored-by: Eve Meelan <[email protected]>
* Fix links, admonitions

---------

Co-authored-by: Heds Simons <[email protected]>
Co-authored-by: Eve Meelan <[email protected]>
grafana · Jun 16, 2023 · 9499cb9 · 9499cb9
1 parent 0b60c74
commit 9499cb9
Showing 10 changed files with 334 additions and 52 deletions.
diff --git a/docs/sources/tempo/configuration/grafana-agent/service-graphs.md b/docs/sources/tempo/configuration/grafana-agent/service-graphs.md
@@ -61,4 +61,5 @@ metrics:
 
 The same service graph metrics can also be generated by Tempo.
 This is more efficient and recommended for larger installations.
-For additional information about viewing service graph metrics in Grafana and calculating cardinality, check out the [server side documentation]({{< relref "../../metrics-generator/service_graphs#grafana" >}}).
+
+For additional information about viewing service graph metrics in Grafana and calculating cardinality, refer to the [server side documentation]({{< relref "../../metrics-generator/service_graphs#enable-service-graphs-in-Grafana" >}}).
diff --git a/docs/sources/tempo/metrics-generator/_index.md b/docs/sources/tempo/metrics-generator/_index.md
@@ -13,22 +13,24 @@ Metrics-generator is an optional Tempo component that derives metrics from inges
 If present, the distributor will write received spans to both the ingester and the metrics-generator.
 The metrics-generator processes spans and writes metrics to a Prometheus data source using the Prometheus remote write protocol.
 
->**Note**: Enabling metrics generation and remote writing them to Grafana Cloud Metrics produces extra active series that could impact your billing. For more information on billing, refer to [Billing and usage](/docs/grafana-cloud/billing-and-usage/).
+{{% admonition type="note" %}}
+Enabling metrics generation and remote writing them to Grafana Cloud Metrics produces extra active series that could impact your billing. For more information on billing, refer to [Billing and usage](/docs/grafana-cloud/billing-and-usage/).
+{{% /admonition %}}
 
 ## Overview
 
 Metrics-generator leverages the data available in Tempo's ingest path to provide additional value by generating metrics from traces.
 
 The metrics-generator internally runs a set of **processors**.
 Each processor ingests spans and produces metrics.
-Every processor derives different metrics. Currently the following processors are available:
+Every processor derives different metrics. Currently, the following processors are available:
 
 - Service graphs
 - Span metrics
 
 <p align="center"><img src="server-side-metrics-arch-overview.png" alt="Service metrics architecture"></p>
 
-### Service graphs
+## Service graphs
 
 Service graphs are the representations of the relationships between services within a distributed system.
 
@@ -38,7 +40,7 @@ The amount of request and their duration are recorded as metrics, which are used
 
 To learn more about this processor, read the [documentation]({{< relref "./service_graphs" >}}).
 
-### Span metrics
+## Span metrics
 
 The span metrics processor derives RED (Request, Error and Duration) metrics from spans.
 
@@ -48,11 +50,11 @@ The more dimensions are enabled, the higher the cardinality of the generated met
 
 To learn more about this processor, read the [documentation]({{< relref "./span_metrics" >}}).
 
-### Remote writing metrics
+## Remote writing metrics
 
 The metrics-generator runs a Prometheus Agent that periodically sends metrics to a `remote_write` endpoint.
 The `remote_write` endpoint is configurable and can be any [Prometheus-compatible endpoint](https://prometheus.io/docs/prometheus/latest/configuration/configuration/#remote_write).
 To learn more about the endpoint configuration, refer to the [Metrics-generator]({{< relref "../configuration#metrics-generator" >}}) section of the Tempo Configuration documentation.
 Writing interval can be controlled via `metrics_generator.registry.collection_interval`.
 
-When multi-tenancy is enabled, the metrics-generator forwards the `X-Scope-OrgID` header of the original request to the remote_write endpoint.
+When multi-tenancy is enabled, the metrics-generator forwards the `X-Scope-OrgID` header of the original request to the `remote_write` endpoint.
diff --git a/docs/sources/tempo/metrics-generator/active-series.md b/docs/sources/tempo/metrics-generator/active-series.md
@@ -0,0 +1,124 @@
+---
+aliases:
+- /docs/tempo/latest/metrics-generator/active-series
+title: Active series
+menuTitle: Active series
+description: Learn about active series and how they are calculated.
+weight: 100
+---
+
+# Active series
+
+An active series is a time series that receives new data points or samples. Shortly after you stop writing new data points to a time series, it is no longer considered active.
+
+Metrics generated by Tempo's metrics generator can provide both RED (Rate/Error/Duration) metrics and interdependency graphs between services in a trace (the Service Graph functionality in Grafana).
+These capabilities rely on a set of generated span metrics and service metrics.
+
+Any span that is ingested by Tempo could potentially create up to 13 metrics. However, this doesn't mean that a new active series is created every time a span is ingested.
+
+The number of active series generated depends on the label pairs generated from span data that are associated with the metrics, similar to other Prometheus-formatted data.
+
+For additional information, refer to the [Active series and DPM documentation](/docs/grafana-cloud/billing-and-usage/active-series-and-dpm/#active-series).
+
+## Active series calculation
+
+Active series for a metric increase when a new value for a label key is introduced. For example, the `span_kind` label has a total of five possible values, and the `status_code` label has a total of three possible values.
+
+At first glance, you might assume that at least 15 (5*3) active series will be generated for each span. But this isn't the case.
+
+Let's consider a span that's emitted from some piece of code in a service:
+
+![Single span visualization](/static/img/docs/tempo/SingleSpan.jpeg)
+
+Here's a single service with a single span.
+If the code inside the span never leaves the service, then the `span_kind` label generated by the metrics generator will be `SPAN_KIND_INTERNAL` and never deviate. It'll never be one of the other four possible values.
+
+Similarly, if the code inside the span never errors, it'll only have the `STATUS_CODE_OK` state for the `status_code` label.
+This means that the metrics generator will only generate a single active series, where the service name will be _Service 1_ and the span name will be _span1_.
+If we looked at the Prometheus data for the `traces_spanmetrics_call_total` metric, we'd see a single active series:
+
+| service   | span_name | span_kind          | status_code    | Metric value |
+| --------- | --------- | ------------------ | -------------- | ------------ |
+| Service 1 | span1     | SPAN_KIND_INTERNAL | STATUS_CODE_OK | 1            |
+
+It doesn't matter how many times that span occurs in a trace either; for example, a span might be generated within a loop.
+Whether the code runs once, 10 times, 100 times, or 1000 times, only a single active series is produced; its counter simply increases by 1, 10, 100, or 1000:
+
+![Single span with loop](/static/img/docs/tempo/SingleSpanLoop.jpeg)
+
+If you looked at the Prometheus data, you'd see an instant value for `traces_spanmetrics_call_total` similar to the table. Again, one active series for the metric:
+
+| service   | span_name | span_kind          | status_code    | Metric value |
+| --------- | --------- | ------------------ | -------------- | ------------ |
+| Service 1 | span1     | SPAN_KIND_INTERNAL | STATUS_CODE_OK | 120          |
+
+
+However, let's now assume that it does loop and there are occasionally errors.
+
+![Single span with loop and errors](/static/img/docs/tempo/SinglespanLoopError.jpeg)
+
+There are now two potential outcomes for a span when the code loops: one where everything successfully completes and one where there is an error.
+This means that when the span completes, `status_code` is now either `STATUS_CODE_OK` or `STATUS_CODE_ERROR`.
+Because the label can take one of two values, we now have two active series for the metric: one for the `OK` status and one for the error.
+
+Again, we could loop once, 10 times, 100, or more times, but there will only ever be two active series.
+
+If we looked at the Prometheus instant values for `traces_spanmetrics_call_total`, we'd now see the following table:
+
+| service   | span_name | span_kind          | status_code       | Metric value |
+| --------- | --------- | ------------------ | ----------------- | ------------ |
+| Service 1 | span1     | SPAN_KIND_INTERNAL | STATUS_CODE_OK    | 96           |
+| Service 1 | span1     | SPAN_KIND_INTERNAL | STATUS_CODE_ERROR | 24           |
+
+What happens if you call out to another service though? Let's add an option where, based on some arbitrary data, we sometimes make a downstream call to another service, but otherwise continue to run loops in our own service:
+
+![Multiple spans with loops and errors](/static/img/docs/tempo/SingleSpanLoopErrorAnotherService.jpeg)
+
+In this scenario, `span1`'s `span_kind` label would now be one of either `SPAN_KIND_INTERNAL` or `SPAN_KIND_CLIENT` (as it has acted as a client calling a downstream server).
+If a call to the downstream service could also potentially fail, then for `SPAN_KIND_CLIENT`, the `status_code` could be either `STATUS_CODE_ERROR` or `STATUS_CODE_OK`.
+
+At this point, `traces_spanmetrics_call_total` would have four different variations in labels:
+
+| service   | span_name | span_kind          | status_code       | Metric value |
+| --------- | --------- | ------------------ | ----------------- | ------------ |
+| Service 1 | span1     | SPAN_KIND_INTERNAL | STATUS_CODE_OK    | 34           |
+| Service 1 | span1     | SPAN_KIND_INTERNAL | STATUS_CODE_ERROR | 6            |
+| Service 1 | span1     | SPAN_KIND_CLIENT   | STATUS_CODE_OK    | 23           |
+| Service 1 | span1     | SPAN_KIND_CLIENT   | STATUS_CODE_ERROR | 3            |
+
+Because of the variation in values, we now have four active series for our metric instead of one. However, as far as Service 1 is concerned, there are only ever four active series, because there isn't any other variation of the values for labels. You can run 1 trace, 10 traces, or 100 traces (each with however many loops of spans there are) and only four active series will ever be produced.
+
+We've actually only told half the story in our last diagram. _Service 1_ called a second service, _Service 2_, which continues the trace by adding a new span, `span2`.
+If there was a loop inside Service 2 with a single span that was generated from an upstream call from Service 1, and then a number of spans that were driven internally, which could also error, we'd end up with the possible values in the metric for `traces_spanmetrics_call_total` below:
+
+| service   | span_name | span_kind          | status_code       | Metric value |
+| --------- | --------- | ------------------ | ----------------- | ------------ |
+| Service 1 | span1     | SPAN_KIND_INTERNAL | STATUS_CODE_OK    | 89           |
+| Service 1 | span1     | SPAN_KIND_INTERNAL | STATUS_CODE_ERROR | 13           |
+| Service 1 | span1     | SPAN_KIND_CLIENT   | STATUS_CODE_OK    | 44           |
+| Service 1 | span1     | SPAN_KIND_CLIENT   | STATUS_CODE_ERROR | 9            |
+| Service 2 | span2     | SPAN_KIND_SERVER   | STATUS_CODE_OK    | 30           |
+| Service 2 | span2     | SPAN_KIND_SERVER   | STATUS_CODE_ERROR | 14           |
+| Service 2 | span2     | SPAN_KIND_INTERNAL | STATUS_CODE_OK    | 99           |
+| Service 2 | span2     | SPAN_KIND_INTERNAL | STATUS_CODE_ERROR | 23           |
+
+At this point, all our traces will be composed of two potential span names, each of which produces two separate types of `span_kind` and two separate types of `status_code`. So we have eight active series for a metric.
+
+The variability of values for each potential span condition determines the number of active series produced by Tempo when ingesting spans for a trace, not the number of traces or spans that are seen.
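
As a minimal sketch of the point made in the tables above, the following Python snippet counts active series as the number of distinct label sets, regardless of how many spans share each set. The span data is hypothetical and mirrors the two-row table with 96 successful and 24 failed spans; this is an illustration, not how the metrics-generator is implemented.

```python
from collections import Counter

# Hypothetical spans, reduced to the four labels used in the tables above:
# (service, span_name, span_kind, status_code)
OK = ("Service 1", "span1", "SPAN_KIND_INTERNAL", "STATUS_CODE_OK")
ERROR = ("Service 1", "span1", "SPAN_KIND_INTERNAL", "STATUS_CODE_ERROR")
spans = [OK] * 96 + [ERROR] * 24


def count_series(spans):
    """Group spans by label set; each distinct key is one active series."""
    return Counter(spans)


series = count_series(spans)
print(len(series))           # number of active series
print(sum(series.values()))  # total number of spans counted
```

Running the loop more often only changes the counter values; the number of keys stays the same until a new label value appears.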
+
+## Custom span attributes
+
+There's another consideration for active series: extra label key/value pairs that can be added onto metrics from a span's attributes.
+The Tempo metrics generator allows you to turn arbitrary span attributes into label pairs on metrics.
+When considering the number of active series generated, you also need to determine how many possible values there are for the span attribute being turned into a label.
+
+For example, if you added an `http.method` span attribute into a metric label pair, there are five possible values (because there are five possible REST methods):
+
+- `HEAD`
+- `GET`
+- `POST`
+- `PUT`
+- `DELETE`
+
+If this label pair is added to every span metric, each existing active series can *potentially* split into five (this is very much a worst-case scenario, because very few spans will use all five REST methods).
+Instead of 8 active series in the last table above, we'd have 40 (8 * 5).
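
The worst-case arithmetic above can be sketched in a few lines of Python. The dictionary keys are illustrative names, not Tempo configuration, and the model is deliberately simplified: it assumes every span name sees the same number of `span_kind` and `status_code` values, as in the last table.

```python
from math import prod


def worst_case_series(value_counts):
    """Upper bound on active series: the product of possible values per label."""
    return prod(value_counts.values())


# Two span names, each with two span kinds and two status codes (last table).
base = {"span_name": 2, "span_kind_per_span": 2, "status_code": 2}

# Adding http.method with five possible values multiplies the bound by five.
with_method = {**base, "http.method": 5}

print(worst_case_series(base))
print(worst_case_series(with_method))
```

This is only an upper bound; the real number of active series depends on which label combinations actually occur in the ingested spans.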
diff --git a/docs/sources/tempo/metrics-generator/cardinality.md b/docs/sources/tempo/metrics-generator/cardinality.md
@@ -0,0 +1,50 @@
+---
+aliases:
+- /docs/tempo/latest/metrics-generator/cardinality
+title: Cardinality
+menuTitle: Cardinality
+description: What is cardinality and how is it impacted by metrics generation?
+weight: 100
+---
+
+# Cardinality
+
+Cardinality refers to the total combination of key/value pairs, such as labels and label values for a given metric series or log stream, and how many unique combinations they generate.
+For more information on cardinality, see the [What are cardinality spikes and why do they matter?](/blog/2022/02/15/what-are-cardinality-spikes-and-why-do-they-matter/) blog post.
+
+Because writes to a time-series database (TSDB) are in series, high cardinality does not make a big difference to performance at ingest.
+However, cardinality can have a major impact on querying: the higher the cardinality, the more items must be iterated over.
+
+## Traces collection and metrics
+
+Tempo’s server-side metrics generation adds functionality to the collection of traces by creating Prometheus-based metrics that track a variety of values such as:
+
+- Total span call counts
+- Span latency histograms
+- Total span size count
+
+The metrics-generator creates metrics which define the relationship between services via edges and nodes.
+Each of these metrics is queryable using a set of Prometheus labels (key/value pairs).
+
+Each new value for a label increases the number of active series associated with a metric. (To learn more about active series, read the [Trace active series]({{< relref "./active-series" >}}) documentation.)
+
+This is also known as an increase in cardinality. The number of active series generated for a metric grows with the number of labels that exist for that metric and with the number of values each label has.
+
+In a non-modified instance of the metrics generator, a small number of labels are added automatically.
+Because labels like `span_kind` and `status_code` only have a few valid values, the largest variable for the number of active series produced for each metric depends on the number of service names and span names associated with trace spans.
+
+The metrics-generator can also be configured to add extra labels to metrics, using span attribute key/value pairs that are mapped directly to these labels. For details, see the [custom span attribute documentation]({{< relref "../configuration#metrics-generator" >}}).
+
+Be careful when configuring custom attributes: the greater the number of values seen in a specific attribute, the greater the number of active series that will be produced. For more information about active series, refer to the [active series documentation]({{< relref "./active-series" >}}).
+
+Let's say that you are adding a custom attribute that includes unique customer IDs as a metrics label. If you have 100 customers, this could potentially multiply the number of active series generated by up to 100 (for example, going from 25,000 active series to 2.5M).
+Always consider which attributes will actually be useful as labels for querying metrics, as well as how much they will increase the cardinality of your metrics.
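
One way to weigh a candidate attribute before turning it into a label is to count how many distinct values it takes in a sample of spans. The sketch below assumes the spans have already been exported as plain dictionaries of attributes; the attribute names and values are hypothetical.

```python
from collections import defaultdict


def attribute_cardinality(spans):
    """Return the number of distinct values seen for each span attribute."""
    values = defaultdict(set)
    for attrs in spans:
        for key, value in attrs.items():
            values[key].add(value)
    return {key: len(seen) for key, seen in values.items()}


# Hypothetical sample of span attributes.
sample = [
    {"http.method": "GET", "customer.id": "c-001"},
    {"http.method": "GET", "customer.id": "c-002"},
    {"http.method": "POST", "customer.id": "c-003"},
]

cardinality = attribute_cardinality(sample)
print(cardinality)
```

An attribute whose count keeps growing with the sample size, such as a customer ID, is a likely source of high cardinality.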
+
+## Dry-running the metrics-generator
+
+Often, the most reliable way to estimate cardinality is to run the metrics-generator in dry-run mode.
+In dry-run mode, the metrics-generator generates metrics but does not collect them, so nothing is written to metrics storage.
+The override `metrics_generator_disable_collection` is defined for this use case.
+
+To get an estimate, run the metrics-generator normally and set the override to `true`.
+Then, check `tempo_metrics_generator_registry_active_series` to get an estimate of the active series for that setup.
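
As a rough sketch of the last step, the snippet below sums the values of `tempo_metrics_generator_registry_active_series` from metrics in the Prometheus text exposition format. The sample text, the `tenant` label, and the second metric name are assumptions made for illustration, and the parser assumes sample lines carry no trailing timestamp.

```python
# Hypothetical scrape output; the label names and values are illustrative only.
SAMPLE = """\
# TYPE tempo_metrics_generator_registry_active_series gauge
tempo_metrics_generator_registry_active_series{tenant="team-a"} 1200
tempo_metrics_generator_registry_active_series{tenant="team-b"} 300
some_other_metric{tenant="team-a"} 7
"""


def sum_metric(exposition_text, metric_name):
    """Sum every sample of metric_name in Prometheus text format."""
    total = 0.0
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The value is the last space-separated field (no timestamps assumed).
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        if name == metric_name:
            total += float(value)
    return total


total = sum_metric(SAMPLE, "tempo_metrics_generator_registry_active_series")
print(total)
```

In practice, the same number can be read by querying the metric directly in Prometheus or Grafana; the parser is only useful when you have the raw scrape text.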
diff --git a/docs/sources/tempo/metrics-generator/service-graph-view.md b/docs/sources/tempo/metrics-generator/service-graph-view.md
@@ -4,7 +4,7 @@ menuTitle: Service graph view
 description: Grafana's service graph view utilizes metrics generated by the metrics-generator (or Grafana Agent) to display span request rates, error rates, and durations, as well as service graphs.
 aliases:
 - /docs/tempo/latest/metrics-generator/app-performance-mgmt
-weight: 200
+weight: 400
 ---
 
 # Service graph view