diff --git a/teps/000X-tekton-metrics.md b/teps/000X-tekton-metrics.md
new file mode 100644
index 000000000..acfa0a43a
--- /dev/null
+++ b/teps/000X-tekton-metrics.md
@@ -0,0 +1,176 @@
+---
+title: tekton-metrics
+authors:
+  - "@NavidZ"
+creation-date: 2020-07-13
+last-updated: 2020-07-13
+status: proposed
+---
+
+# TEP-NNNN: More granular metrics
+
+- [Summary](#summary)
+- [Motivation](#motivation)
+  - [Goals](#goals)
+  - [Non-Goals](#non-goals)
+- [Requirements](#requirements)
+- [Test Plan](#test-plan)
+
+## Summary
+
+Add a set of metrics and tracing for monitoring and measuring the
+performance of Tekton pipeline runs. These metrics target the time
+spent in different parts of a pipeline run, including overall
+execution, reconciling logic, fetching resources, pulling images, and
+running containers.
+
+## Motivation
+
+Currently there is only one metric, which captures the end-to-end time
+of pipeline runs. More granular metrics are needed to investigate
+regressions caused by Tekton changes and to find the causes of slow
+Tekton pipelines in production. They would help narrow down regressions
+and help Tekton developers and users find the root cause faster.
+
+### Goals
+
+- Allow currently supported third-party metric backends to get a more
+granular view of different parts of a pipeline run.
+
+- Add a handful of (sub-)metrics that are believed to be useful for the
+current implementation, while leaving the door open to add more in the
+future if needed.
+
+### Non-Goals
+
+- Add support for more metric backends.
+
+- Migrate the current way of reporting metrics (which is
+[OpenCensus](https://opencensus.io/) via Knative libraries) to the new
+[OpenTelemetry](https://opentelemetry.io/).
+
+## Requirements
+
+- Implement and document the new (sub-)metrics.
+
+- Add telemetry tests based on the current values of the metrics.
+
+## Test Plan
+
+The new metrics will have unit tests verifying that they are recorded,
+similar to the tests for the existing end-to-end metric.
+
+To prevent regressions in the metrics caused by changes in Tekton, there
+will be e2e tests that measure the metrics and compare them against
+expected values. One challenge is the inherent flakiness of metric
+values when running the tests. To overcome it, the telemetry tests would
+need to run multiple times, and the median or 95th percentile would be
+compared against the expected value within a tolerance range.