Add TEP for new (sub-)metrics
This is a bare-bones TEP for adding new metrics, as
suggested by the TEP template. At this point it
contains only the summary, motivation, and a test plan,
to get agreement on the general path.

I will iterate in subsequent PRs to add the proposed
sub-metrics, what they measure, and where in the code
they should be measured.
NavidZ committed Jul 13, 2020
1 parent 309a61c commit 2199a87
Showing 1 changed file with 176 additions and 0 deletions.
teps/000X-tekton-metrics.md
---
title: tekton-metrics
authors:
- "@NavidZ"
creation-date: 2020-07-13
last-updated: 2020-07-13
status: proposed
---
<!--
**Note:** When your TEP is complete, all of these comment blocks should be removed.
To get started with this template:
- [ ] **Fill out this file as best you can.**
At minimum, you should fill in the "Summary", and "Motivation" sections.
These should be easy if you've preflighted the idea of the TEP with the
appropriate Working Group.
- [ ] **Create a PR for this TEP.**
Assign it to people in the SIG that are sponsoring this process.
- [ ] **Merge early and iterate.**
Avoid getting hung up on specific details and instead aim to get the goals of
the TEP clarified and merged quickly. The best way to do this is to just
start with the high-level sections and fill out details incrementally in
subsequent PRs.
Just because a TEP is merged does not mean it is complete or approved. Any TEP
marked as `proposed` is a working document and subject to change. You can
denote sections that are under active debate as follows:
```
<<[UNRESOLVED optional short context or usernames ]>>
Stuff that is being argued.
<<[/UNRESOLVED]>>
```
When editing TEPs, aim for tightly-scoped, single-topic PRs to keep discussions
focused. If you disagree with what is already in a document, open a new PR
with suggested changes.
If there are new details that belong in the TEP, edit the TEP. Once a
feature has become "implemented", major changes should get new TEPs.
The canonical place for the latest set of instructions (and the likely source
of this file) is [here](/teps/NNNN-TEP-template/README.md).
-->

# TEP-NNNN: More granular metrics

<!--
This is the title of your TEP. Keep it short, simple, and descriptive. A good
title can help communicate what the TEP is and should be considered as part of
any review.
-->

<!--
A table of contents is helpful for quickly jumping to sections of a TEP and for
highlighting any additional information provided beyond the standard TEP
template.
Ensure the TOC is wrapped with
<code>&lt;!-- toc --&gt;&lt;!-- /toc --&gt;</code>
tags, and then generate with `hack/update-toc.sh`.
-->

<!-- toc -->
- [Summary](#summary)
- [Motivation](#motivation)
- [Goals](#goals)
- [Non-Goals](#non-goals)
- [Requirements](#requirements)
- [Test Plan](#test-plan)

<!-- /toc -->

## Summary

<!--
This section is incredibly important for producing high quality user-focused
documentation such as release notes or a development roadmap. It should be
possible to collect this information before implementation begins in order to
avoid requiring implementors to split their attention between writing release
notes and implementing the feature itself.
A good summary is probably at least a paragraph in length.
Both in this section and below, follow the guidelines of the [documentation
style guide]. In particular, wrap lines to a reasonable length, to make it
easier for reviewers to cite specific portions, and to minimize diff churn on
updates.
[documentation style guide]: https://github.com/kubernetes/community/blob/master/contributors/guide/style-guide.md
-->

Add a set of metrics and tracing for monitoring and measuring the
performance of Tekton pipeline runs. These metrics target the time
spent in different parts of a pipeline run, including overall
execution, reconciling logic, fetching resources, pulling images, and
running containers.
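
For concreteness, below is a minimal sketch of how one such sub-metric
could be registered and recorded with the OpenCensus Go API (the
reporting path the project already uses, per the Non-Goals below). The
metric name, tag key, bucket bounds, and helper function are
illustrative assumptions, not the proposed set:

```go
package metrics

import (
	"context"
	"time"

	"go.opencensus.io/stats"
	"go.opencensus.io/stats/view"
	"go.opencensus.io/tag"
)

var (
	// Hypothetical sub-metric: time spent fetching resources for a TaskRun.
	resourceFetchLatency = stats.Float64(
		"taskrun_resource_fetch_latency",
		"Time spent fetching resources for a TaskRun",
		stats.UnitMilliseconds)

	// Hypothetical tag to slice the metric by namespace.
	namespaceKey = tag.MustNewKey("namespace")
)

func init() {
	// A distribution aggregation lets backends compute percentiles.
	// The bucket bounds here are placeholders.
	if err := view.Register(&view.View{
		Name:        resourceFetchLatency.Name(),
		Description: resourceFetchLatency.Description(),
		Measure:     resourceFetchLatency,
		TagKeys:     []tag.Key{namespaceKey},
		Aggregation: view.Distribution(10, 50, 100, 500, 1000, 5000),
	}); err != nil {
		panic(err)
	}
}

// RecordResourceFetch records the elapsed time of a resource fetch.
func RecordResourceFetch(ctx context.Context, ns string, start time.Time) {
	ctx, err := tag.New(ctx, tag.Insert(namespaceKey, ns))
	if err != nil {
		return
	}
	stats.Record(ctx,
		resourceFetchLatency.M(float64(time.Since(start).Milliseconds())))
}
```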

## Motivation

<!--
This section is for explicitly listing the motivation, goals and non-goals of
this TEP. Describe why the change is important and the benefits to users. The
motivation section can optionally provide links to [experience reports][] to
demonstrate the interest in a TEP within the wider Tekton community.
[experience reports]: https://github.com/golang/go/wiki/ExperienceReports
-->

Currently there is only one metric capturing the end-to-end time of
pipeline runs. To investigate possible regressions caused by Tekton
changes, or the possible causes of slow Tekton pipelines in
production, more granular metrics are needed. These would help narrow
down regressions and help Tekton developers and users find the root
cause faster.

### Goals

<!--
List the specific goals of the TEP. What is it trying to achieve? How will we
know that this has succeeded?
-->

- Allow currently supported third-party metric backends to get a more
granular view of the different parts of a pipeline run.

- Add a handful of (sub-)metrics that are believed to be useful for the
current implementation, while leaving the door open to add more in the
future if needed.

### Non-Goals

- Add support for more metric backends.

- Migrate the current way of reporting metrics (which is
[OpenCensus](https://opencensus.io/) via Knative libraries) to the new
[OpenTelemetry](https://opentelemetry.io/).

<!--
What is out of scope for this TEP? Listing non-goals helps to focus discussion
and make progress.
-->

## Requirements

- Implement and document the new (sub-)metrics.

- Add telemetry tests based on the current value of the metrics.

## Test Plan

<!--
**Note:** *Not required until targeted at a release.*
Consider the following in developing a test plan for this enhancement:
- Will there be e2e and integration tests, in addition to unit tests?
- How will it be tested in isolation vs with other components?
No need to outline all of the test cases, just the general strategy. Anything
that would count as tricky in the implementation and anything particularly
challenging to test should be called out.
All code is expected to have adequate tests (eventually with coverage
expectations).
-->
The new metrics will have unit tests verifying that they are recorded
correctly, similar to the existing end-to-end metric.
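
A minimal sketch of such a unit test, assuming the hypothetical
`RecordResourceFetch` helper and view from the earlier sketch, could
verify that recording produces a row in the registered distribution
view:

```go
package metrics

import (
	"context"
	"testing"
	"time"

	"go.opencensus.io/stats/view"
)

func TestRecordResourceFetch(t *testing.T) {
	// Simulate a fetch that started 42ms ago.
	start := time.Now().Add(-42 * time.Millisecond)
	RecordResourceFetch(context.Background(), "default", start)

	// Read back the aggregated data for the registered view.
	rows, err := view.RetrieveData("taskrun_resource_fetch_latency")
	if err != nil {
		t.Fatalf("RetrieveData: %v", err)
	}
	if len(rows) == 0 {
		t.Fatal("expected at least one recorded row")
	}
	dist, ok := rows[0].Data.(*view.DistributionData)
	if !ok {
		t.Fatalf("expected DistributionData, got %T", rows[0].Data)
	}
	if dist.Count != 1 {
		t.Errorf("expected 1 recorded value, got %d", dist.Count)
	}
}
```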

To prevent regressions in the metrics due to changes in Tekton, there
will be some e2e tests that measure the metrics and assert expected
values. One challenge is the inherent flakiness of metric values
across test runs. To overcome that, we would run the telemetry tests
multiple times and compare the median or 95th percentile against a
tolerance range.
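
Below is a minimal sketch of that tolerance check. The
`runPipelineAndMeasure` helper, the baseline value, and the tolerance
are hypothetical stand-ins; only the repeat-and-compare aggregation
logic is the point:

```go
package telemetry

import (
	"sort"
	"testing"
)

// percentile returns the p-th percentile (0-100) of samples using
// nearest-rank on a sorted copy.
func percentile(samples []float64, p float64) float64 {
	s := append([]float64(nil), samples...)
	sort.Float64s(s)
	idx := int(float64(len(s)-1) * p / 100.0)
	return s[idx]
}

// runPipelineAndMeasure stands in for a real e2e run that triggers a
// PipelineRun and reads the latency metric back; it is a stub here.
func runPipelineAndMeasure(t *testing.T) float64 {
	t.Helper()
	return 1200.0 // placeholder sample, in milliseconds
}

func TestReconcileLatencyP95(t *testing.T) {
	const (
		runs      = 20     // repeat to smooth out flaky individual runs
		baseline  = 1500.0 // assumed expected p95, in milliseconds
		tolerance = 0.20   // accept up to 20% above the baseline
	)
	samples := make([]float64, 0, runs)
	for i := 0; i < runs; i++ {
		samples = append(samples, runPipelineAndMeasure(t))
	}
	if p95 := percentile(samples, 95); p95 > baseline*(1+tolerance) {
		t.Errorf("p95 latency %.0fms exceeds baseline %.0fms by more than %.0f%%",
			p95, baseline, tolerance*100)
	}
}
```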
