This repository has been archived by the owner on Apr 25, 2023. It is now read-only.

feat: add custom kubefed metrics #1196

Merged
3 changes: 3 additions & 0 deletions cmd/controller-manager/app/controller-manager.go
@@ -57,6 +57,7 @@ import (
"sigs.k8s.io/kubefed/pkg/controller/servicedns"
"sigs.k8s.io/kubefed/pkg/controller/util"
"sigs.k8s.io/kubefed/pkg/features"
kubefedmetrics "sigs.k8s.io/kubefed/pkg/metrics"
"sigs.k8s.io/kubefed/pkg/version"
)

@@ -114,6 +115,8 @@ func Run(opts *options.Options, stopChan <-chan struct{}) error {

go serveHealthz(healthzAddr)
go serveMetrics(metricsAddr, stopChan)
// Register kubefed custom metrics
kubefedmetrics.RegisterAll()

var err error
opts.Config.KubeConfig, err = clientcmd.BuildConfigFromFlags(masterURL, kubeconfig)
148 changes: 148 additions & 0 deletions docs/keps/20200302-kubefed-metrics.md
@@ -0,0 +1,148 @@
---
kep-number: 0
short-desc: Kubefed Custom Metrics
title: Kubefed Custom Metrics
authors:
- "@hectorj2f"
reviewers:
- "@jimmidyson"
- "@pmorie"
- "@xunpan"
approvers:
- "@jimmidyson"
- "@pmorie"
- "@xunpan"
editor: TBD
creation-date: 2020-03-02
last-updated: 2020-03-02
status: provisional
---

# Kubefed Custom Metrics

## Table of Contents

* [Kubefed Custom Metrics](#kubefed-custom-metrics)
* [Table of Contents](#table-of-contents)
* [Summary](#summary)
* [Motivation](#motivation)
* [Goals](#goals)
* [Non\-Goals](#non-goals)
* [Proposals](#proposals)
* [Metrics](#metrics)
* [Risks and Mitigations](#risks-and-mitigations)
* [Graduation Criteria](#graduation-criteria)
* [Implementation History](#implementation-history)
* [Drawbacks](#drawbacks)
* [Infrastructure Needed](#infrastructure-needed)

## Summary

This document describes the metrics and other valuable data that Kubefed could expose
so that users can build dashboards and better understand how this engine behaves.

## Motivation

We aim to define a generic strategy for identifying, exposing and consuming
custom Kubefed metrics.


### Goals

* Identify which metrics should be exposed from Kubefed if possible.
* Define a set of Kubefed metrics that could be consumed by Prometheus tools.
* Specify the type of each metric (e.g. histogram, gauge, counter, summary).
* Use these metrics to create Grafana dashboards.

### Non-Goals

* Technical details about the Grafana dashboards.

## Proposals

Kubefed already exposes a small set of metrics. These are some of the default metrics provided by
the [controller-runtime](https://github.com/kubernetes-sigs/controller-runtime/tree/master/pkg/metrics). In particular, Kubefed only exposes the client-only metrics; the remaining controller-runtime metrics are not available because Kubefed was not implemented
using the `controller-runtime` utils.

The metrics
are exposed on a `/metrics` route in a [Prometheus-friendly format](https://github.com/prometheus/docs/blob/master/content/docs/instrumenting/exposition_formats.md).
A service monitor should be created to instruct Prometheus to scrape
the metrics from the Kubefed `metrics` service endpoint.
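
As a rough illustration of the text exposition format mentioned above, the following sketch renders one gauge family with a `state` label using only the Go standard library. The helper `renderGauge`, the help string and the sample values are hypothetical; a real implementation would normally rely on a Prometheus client library instead of hand-written formatting.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// renderGauge formats one gauge family in the Prometheus text exposition
// format. It is a hypothetical helper for illustration only.
func renderGauge(name, help, label string, values map[string]float64) string {
	var b strings.Builder
	fmt.Fprintf(&b, "# HELP %s %s\n", name, help)
	fmt.Fprintf(&b, "# TYPE %s gauge\n", name)

	// Sort the label values so the output is deterministic.
	keys := make([]string, 0, len(values))
	for k := range values {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	for _, k := range keys {
		fmt.Fprintf(&b, "%s{%s=%q} %g\n", name, label, k, values[k])
	}
	return b.String()
}

func main() {
	// Sample values are made up for the example.
	fmt.Print(renderGauge(
		"kubefedcluster_total",
		"Number of Kubefed clusters per state.",
		"state",
		map[string]float64{"ready": 2, "notready": 1, "offline": 0},
	))
}
```

Each sample line carries the metric name, the label set in braces and the current value, which is what a scraper reads from the `/metrics` route.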

However, the client-only metrics are not enough: custom Kubefed metrics have to
be identified and exposed to better understand this engine and its scalability challenges.


### Metrics

The relevant metrics are listed below.

The state of a Kubefed cluster reflects the status of that cluster and is checked periodically.


The following metric registers the total number of Kubefed clusters in the `ready`, `notready` and `offline` states:

* `kubefedcluster_total`: a gauge metric that holds the number of Kubefed clusters in each of the three possible states.
To identify the state, we add a label `state` to this metric with the value of the state.
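
One possible way to derive such a gauge is sketched below, assuming the last observed state of each cluster is kept in memory. The type `clusterGauge` and its methods are hypothetical and are not taken from the Kubefed code base; the sketch is also not safe for concurrent use.

```go
package main

import "fmt"

// Hypothetical state names mirroring the three states described in the text.
const (
	stateReady    = "ready"
	stateNotReady = "notready"
	stateOffline  = "offline"
)

// clusterGauge tracks the last observed state of each cluster and derives
// the per-state totals from it.
type clusterGauge struct {
	lastState map[string]string // cluster name -> state
}

func newClusterGauge() *clusterGauge {
	return &clusterGauge{lastState: make(map[string]string)}
}

// record stores the latest health-check result for a cluster, replacing any
// earlier state so that a cluster is never counted twice.
func (g *clusterGauge) record(cluster, state string) {
	g.lastState[cluster] = state
}

// totals returns the gauge value for every state, including zeros.
func (g *clusterGauge) totals() map[string]int {
	out := map[string]int{stateReady: 0, stateNotReady: 0, stateOffline: 0}
	for _, s := range g.lastState {
		out[s]++
	}
	return out
}

func main() {
	g := newClusterGauge()
	g.record("cluster-a", stateReady)
	g.record("cluster-b", stateOffline)
	g.record("cluster-b", stateReady) // cluster-b recovered
	fmt.Println(g.totals())
}
```

Keying by cluster name means a state change moves a cluster from one total to another, which is the behaviour expected of a gauge.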

In addition to this metric, we should also record how long the health check takes:

* `cluster_health_status_duration_seconds`: this `histogram` metric holds the duration in seconds of the action that checks
the health status of a Kubefed cluster.
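
To make the `histogram` type more concrete, here is a minimal sketch of cumulative buckets, where each bucket counts the observations less than or equal to its upper bound. The `histogram` type and the bucket bounds are illustrative assumptions, not the ones used by Kubefed or by any Prometheus client library.

```go
package main

import (
	"fmt"
	"sort"
)

// histogram is a minimal stand-in for a Prometheus-style histogram:
// each bucket counts observations less than or equal to its upper bound.
type histogram struct {
	bounds []float64 // sorted upper bounds, in seconds
	counts []uint64  // cumulative count per bound
	count  uint64    // total observations (the implicit +Inf bucket)
	sum    float64   // sum of all observed values
}

func newHistogram(bounds ...float64) *histogram {
	sorted := append([]float64(nil), bounds...)
	sort.Float64s(sorted)
	return &histogram{bounds: sorted, counts: make([]uint64, len(sorted))}
}

// observe records one duration, expressed in seconds.
func (h *histogram) observe(seconds float64) {
	for i, ub := range h.bounds {
		if seconds <= ub {
			h.counts[i]++
		}
	}
	h.count++
	h.sum += seconds
}

func main() {
	// Bucket bounds are illustrative only.
	h := newHistogram(0.1, 0.5, 1, 5)
	for _, d := range []float64{0.05, 0.3, 0.7, 12} {
		h.observe(d)
	}
	for i, ub := range h.bounds {
		fmt.Printf("le=%g count=%d\n", ub, h.counts[i])
	}
	fmt.Printf("le=+Inf count=%d\n", h.count)
}
```

Because the buckets are cumulative, a dashboard can estimate quantiles of the health check duration from the bucket counts alone.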

Kubefed needs to connect to the remote clusters to validate/create/delete all the federated resources
in the target clusters. With many clusters, the time spent connecting
to remote clusters may be significant:

* `cluster_client_connection_duration_seconds`: this `histogram` metric holds the duration in seconds of the creation
of a Kubernetes client to a remote cluster. This operation normally involves connecting to
the remote server to get certain metadata.

Kubefed federates resources on target clusters, and one of its controllers triggers
a periodic reconciliation of all target federated resources.

* `reconcile_federated_resources_duration_seconds`: this `histogram` metric holds the duration in seconds of the action that
reconciles federated resources in the target clusters.

Another operation worth recording is the creation/update/deletion of
the propagated resources. This action is handled by the so-called dispatchers in Kubefed.

For this metric, we could choose a single metric that will include additional labels
to distinguish the different operations:

* `dispatch_operation_duration_seconds`: this `histogram` metric holds the duration in seconds of the creation/update/deletion
of the different propagated resources. The label `action` will hold the `create`, `update` and `delete` operations.
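
The single-metric-with-labels approach can be sketched as follows, keeping only the count and sum that a histogram exposes per label value. The type `dispatchDurations` and its methods are hypothetical names introduced for this example.

```go
package main

import (
	"fmt"
	"time"
)

// opStats accumulates what a histogram exposes as its _count and _sum series.
type opStats struct {
	count uint64
	sum   time.Duration
}

// dispatchDurations keeps one opStats per value of the `action` label.
type dispatchDurations struct {
	byAction map[string]*opStats
}

func newDispatchDurations() *dispatchDurations {
	d := &dispatchDurations{byAction: make(map[string]*opStats)}
	// Pre-create the three known label values so each one is always reported.
	for _, a := range []string{"create", "update", "delete"} {
		d.byAction[a] = &opStats{}
	}
	return d
}

// observe records one dispatch; unknown actions are rejected to keep the
// label set bounded.
func (d *dispatchDurations) observe(action string, took time.Duration) error {
	s, ok := d.byAction[action]
	if !ok {
		return fmt.Errorf("unknown action %q", action)
	}
	s.count++
	s.sum += took
	return nil
}

// mean returns the average duration for an action, or zero if none was seen.
func (d *dispatchDurations) mean(action string) time.Duration {
	s, ok := d.byAction[action]
	if !ok || s.count == 0 {
		return 0
	}
	return s.sum / time.Duration(s.count)
}

func main() {
	d := newDispatchDurations()
	_ = d.observe("create", 120*time.Millisecond)
	_ = d.observe("create", 80*time.Millisecond)
	_ = d.observe("update", 40*time.Millisecond)
	fmt.Println(d.mean("create"), d.mean("update"), d.mean("delete"))
}
```

Restricting the label to a fixed set of values keeps the number of time series small, which matters once the metric is multiplied by the number of buckets.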

For cluster join/unjoin operations, the following metrics are also worth recording:

* `joined_cluster_total`: a gauge metric that holds the number of joined clusters.

* `join_cluster_duration_seconds`: this `histogram` metric holds the duration in seconds of the join cluster action.

* `unjoin_cluster_duration_seconds`: this `histogram` metric holds the duration in seconds of the unjoin cluster action.

To keep track of the remaining controllers and their reconciliation times, we will use a generic metric:

* `controller_runtime_reconcile_duration_seconds`: a `histogram` that keeps track of the duration
of reconciliations for other Kubefed controllers. A label `controller` distinguishes
the different controllers.
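
The controllers in this change time their reconciliations with a deferred call that receives `time.Now()`. The sketch below illustrates why that works, with a hypothetical `recordSince` standing in for a helper such as `UpdateControllerReconcileDurationFromStart`; the in-memory map is an assumption made for the example.

```go
package main

import (
	"fmt"
	"time"
)

// observed collects durations per controller name; a real implementation
// would feed a histogram instead of a slice.
var observed = map[string][]time.Duration{}

// recordSince is a hypothetical stand-in for a helper such as
// UpdateControllerReconcileDurationFromStart.
func recordSince(controller string, start time.Time) {
	observed[controller] = append(observed[controller], time.Since(start))
}

// reconcile shows the pattern: the arguments of a deferred call are evaluated
// when the defer statement runs, so time.Now() captures the start time,
// while recordSince itself runs when reconcile returns.
func reconcile(controller string, work time.Duration) {
	defer recordSince(controller, time.Now())
	time.Sleep(work) // placeholder for the real reconciliation
}

func main() {
	reconcile("statuscontroller", 20*time.Millisecond)
	reconcile("servicednscontroller", 5*time.Millisecond)
	for name, ds := range observed {
		fmt.Printf("%s: %d reconcile(s), last took %v\n", name, len(ds), ds[len(ds)-1])
	}
}
```

Placing the `defer` on the first line of the function means early returns, such as a not-synced check, are measured as well.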

In addition to these metrics, we could add counters to register common error types.
This approach would make it easy to track their rate on a dashboard.
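
A minimal sketch of such counters, and of the rate a dashboard would derive from them, could look like this. The error type names and the helper `ratePerSecond` are hypothetical, and treating a counter reset as a zero rate is a simplification made for this example.

```go
package main

import "fmt"

// errorCounters holds one monotonically increasing counter per error type.
type errorCounters struct {
	counts map[string]uint64
}

func newErrorCounters() *errorCounters {
	return &errorCounters{counts: make(map[string]uint64)}
}

// inc adds one occurrence of the given error type.
func (c *errorCounters) inc(errType string) { c.counts[errType]++ }

// ratePerSecond computes the increase between two samples of a counter
// divided by the elapsed seconds.
func ratePerSecond(earlier, later uint64, elapsedSeconds float64) float64 {
	if elapsedSeconds <= 0 || later < earlier {
		return 0 // no elapsed time, or the counter was reset (simplified)
	}
	return float64(later-earlier) / elapsedSeconds
}

func main() {
	c := newErrorCounters()
	before := c.counts["timeout"]
	for i := 0; i < 30; i++ {
		c.inc("timeout")
	}
	c.inc("conflict")
	// Pretend the two samples were taken 60 seconds apart.
	fmt.Println(ratePerSecond(before, c.counts["timeout"], 60)) // prints 0.5
}
```

Counters only ever increase, so the rate over a window is a more useful dashboard signal than the raw value.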


#### Alternatives

### Implementation Details/Notes/Constraints

All the metrics identified in this document might be added to Kubefed incrementally.

### Risks and Mitigations

## Graduation Criteria

## Implementation History

## Drawbacks

## Infrastructure Needed
3 changes: 3 additions & 0 deletions pkg/controller/federatedtypeconfig/controller.go
@@ -19,6 +19,7 @@ package federatedtypeconfig
import (
"context"
"sync"
"time"

"github.com/pkg/errors"

@@ -36,6 +37,7 @@ import (
statuscontroller "sigs.k8s.io/kubefed/pkg/controller/status"
synccontroller "sigs.k8s.io/kubefed/pkg/controller/sync"
"sigs.k8s.io/kubefed/pkg/controller/util"
"sigs.k8s.io/kubefed/pkg/metrics"
)

const finalizer string = "core.kubefed.io/federated-type-config"
@@ -128,6 +130,7 @@ func (c *Controller) Run(stopChan <-chan struct{}) {

func (c *Controller) reconcile(qualifiedName util.QualifiedName) util.ReconciliationStatus {
key := qualifiedName.String()
defer metrics.UpdateControllerReconcileDurationFromStart("federatedtypeconfigcontroller", time.Now())

klog.V(3).Infof("Running reconcile FederatedTypeConfig for %q", key)

3 changes: 3 additions & 0 deletions pkg/controller/ingressdns/controller.go
@@ -38,6 +38,7 @@ import (
dnsv1a1 "sigs.k8s.io/kubefed/pkg/apis/multiclusterdns/v1alpha1"
genericclient "sigs.k8s.io/kubefed/pkg/client/generic"
"sigs.k8s.io/kubefed/pkg/controller/util"
"sigs.k8s.io/kubefed/pkg/metrics"
)

const (
@@ -201,6 +202,8 @@ func (c *Controller) reconcileOnClusterChange() {
}

func (c *Controller) reconcile(qualifiedName util.QualifiedName) util.ReconciliationStatus {
defer metrics.UpdateControllerReconcileDurationFromStart("ingressdnscontroller", time.Now())

if !c.isSynced() {
return util.StatusNotSynced
}
4 changes: 4 additions & 0 deletions pkg/controller/kubefedcluster/clusterclient.go
@@ -34,6 +34,7 @@ import (
fedv1b1 "sigs.k8s.io/kubefed/pkg/apis/core/v1beta1"
"sigs.k8s.io/kubefed/pkg/client/generic"
"sigs.k8s.io/kubefed/pkg/controller/util"
"sigs.k8s.io/kubefed/pkg/metrics"
)

const (
@@ -128,10 +129,13 @@ func (self *ClusterClient) GetClusterHealthStatus() (*fedv1b1.KubeFedClusterStat
if err != nil {
runtime.HandleError(errors.Wrapf(err, "Failed to do cluster health check for cluster %q", self.clusterName))
clusterStatus.Conditions = append(clusterStatus.Conditions, newClusterOfflineCondition)
metrics.RegisterKubefedClusterTotal(metrics.ClusterOffline, self.clusterName)
} else {
if !strings.EqualFold(string(body), "ok") {
metrics.RegisterKubefedClusterTotal(metrics.ClusterNotReady, self.clusterName)
clusterStatus.Conditions = append(clusterStatus.Conditions, newClusterNotReadyCondition, newClusterNotOfflineCondition)
} else {
metrics.RegisterKubefedClusterTotal(metrics.ClusterReady, self.clusterName)
clusterStatus.Conditions = append(clusterStatus.Conditions, newClusterReadyCondition)
}
}
5 changes: 5 additions & 0 deletions pkg/controller/kubefedcluster/controller.go
@@ -20,6 +20,7 @@ import (
"context"
"fmt"
"sync"
"time"

"github.com/pkg/errors"
corev1 "k8s.io/api/core/v1"
@@ -41,6 +42,7 @@ import (
genscheme "sigs.k8s.io/kubefed/pkg/client/generic/scheme"
"sigs.k8s.io/kubefed/pkg/controller/util"
"sigs.k8s.io/kubefed/pkg/features"
"sigs.k8s.io/kubefed/pkg/metrics"
)

// ClusterData stores cluster client and previous health check probe results of individual cluster.
@@ -239,6 +241,8 @@ func (cc *ClusterController) updateClusterStatus() error {

func (cc *ClusterController) updateIndividualClusterStatus(cluster *fedv1b1.KubeFedCluster,
storedData *ClusterData, wg *sync.WaitGroup) {
defer metrics.ClusterHealthStatusDurationFromStart(time.Now())

clusterClient := storedData.clusterKubeClient

currentClusterStatus, err := clusterClient.GetClusterHealthStatus()
@@ -257,6 +261,7 @@ func (cc *ClusterController) updateIndividualClusterStatus(cluster *fedv1b1.Kube
if err := cc.client.UpdateStatus(context.TODO(), cluster); err != nil {
klog.Warningf("Failed to update the status of cluster %q: %v", cluster.Name, err)
}

wg.Done()
}

5 changes: 5 additions & 0 deletions pkg/controller/schedulingmanager/controller.go
@@ -17,6 +17,8 @@ limitations under the License.
package schedulingmanager

import (
"time"

"github.com/pkg/errors"

"k8s.io/apimachinery/pkg/util/runtime"
@@ -27,6 +29,7 @@ import (
corev1b1 "sigs.k8s.io/kubefed/pkg/apis/core/v1beta1"
"sigs.k8s.io/kubefed/pkg/controller/schedulingpreference"
"sigs.k8s.io/kubefed/pkg/controller/util"
"sigs.k8s.io/kubefed/pkg/metrics"
"sigs.k8s.io/kubefed/pkg/schedulingtypes"
)

@@ -141,6 +144,8 @@ func (c *SchedulingManager) shutdown() {
}

func (c *SchedulingManager) reconcile(qualifiedName util.QualifiedName) util.ReconciliationStatus {
defer metrics.UpdateControllerReconcileDurationFromStart("schedulingmanagercontroller", time.Now())

key := qualifiedName.String()

klog.V(3).Infof("Running reconcile FederatedTypeConfig %q in scheduling manager", key)
3 changes: 3 additions & 0 deletions pkg/controller/schedulingpreference/controller.go
@@ -35,6 +35,7 @@ import (

fedv1b1 "sigs.k8s.io/kubefed/pkg/apis/core/v1beta1"
"sigs.k8s.io/kubefed/pkg/controller/util"
"sigs.k8s.io/kubefed/pkg/metrics"
"sigs.k8s.io/kubefed/pkg/schedulingtypes"
)

@@ -191,6 +192,8 @@ func (s *SchedulingPreferenceController) reconcileOnClusterChange() {
}

func (s *SchedulingPreferenceController) reconcile(qualifiedName util.QualifiedName) util.ReconciliationStatus {
defer metrics.UpdateControllerReconcileDurationFromStart("schedulingpreferencecontroller", time.Now())

if !s.isSynced() {
return util.StatusNotSynced
}
3 changes: 3 additions & 0 deletions pkg/controller/servicedns/controller.go
@@ -37,6 +37,7 @@ import (
dnsv1a1 "sigs.k8s.io/kubefed/pkg/apis/multiclusterdns/v1alpha1"
genericclient "sigs.k8s.io/kubefed/pkg/client/generic"
"sigs.k8s.io/kubefed/pkg/controller/util"
"sigs.k8s.io/kubefed/pkg/metrics"
)

const (
@@ -257,6 +258,8 @@ func (c *Controller) reconcileOnClusterChange() {
}

func (c *Controller) reconcile(qualifiedName util.QualifiedName) util.ReconciliationStatus {
defer metrics.UpdateControllerReconcileDurationFromStart("servicednscontroller", time.Now())

if !c.isSynced() {
return util.StatusNotSynced
}
3 changes: 3 additions & 0 deletions pkg/controller/status/controller.go
@@ -37,6 +37,7 @@ import (
fedv1b1 "sigs.k8s.io/kubefed/pkg/apis/core/v1beta1"
genericclient "sigs.k8s.io/kubefed/pkg/client/generic"
"sigs.k8s.io/kubefed/pkg/controller/util"
"sigs.k8s.io/kubefed/pkg/metrics"
)

const (
@@ -230,6 +231,8 @@ func (s *KubeFedStatusController) reconcileOnClusterChange() {
}

func (s *KubeFedStatusController) reconcile(qualifiedName util.QualifiedName) util.ReconciliationStatus {
defer metrics.UpdateControllerReconcileDurationFromStart("statuscontroller", time.Now())

if !s.isSynced() {
return util.StatusNotSynced
}
2 changes: 2 additions & 0 deletions pkg/controller/sync/controller.go
@@ -46,6 +46,7 @@ import (
"sigs.k8s.io/kubefed/pkg/controller/sync/status"
"sigs.k8s.io/kubefed/pkg/controller/util"
finalizersutil "sigs.k8s.io/kubefed/pkg/controller/util/finalizers"
"sigs.k8s.io/kubefed/pkg/metrics"
)

const (
@@ -266,6 +267,7 @@ func (s *KubeFedSyncController) reconcile(qualifiedName util.QualifiedName) util
startTime := time.Now()
defer func() {
klog.V(4).Infof("Finished reconciling %s %q (duration: %v)", kind, key, time.Since(startTime))
metrics.ReconcileFederatedResourcesDurationFromStart(startTime)
}()

if fedResource.Object().GetDeletionTimestamp() != nil {
6 changes: 5 additions & 1 deletion pkg/controller/sync/dispatch/managed.go
@@ -21,6 +21,7 @@ import (
"fmt"
"strings"
"sync"
"time"

"github.com/pkg/errors"

@@ -33,6 +34,7 @@ import (
"sigs.k8s.io/kubefed/pkg/client/generic"
"sigs.k8s.io/kubefed/pkg/controller/sync/status"
"sigs.k8s.io/kubefed/pkg/controller/util"
"sigs.k8s.io/kubefed/pkg/metrics"
)

// FederatedResourceForDispatch is the subset of the FederatedResource
@@ -130,7 +132,7 @@ func (d *managedDispatcherImpl) Create(clusterName string) {
// operation timed out. The timeout status will be cleared by
// Wait() if a timeout does not occur.
d.RecordStatus(clusterName, status.CreationTimedOut)

start := time.Now()
d.dispatcher.incrementOperationsInitiated()
const op = "create"
go d.dispatcher.clusterOperation(clusterName, op, func(client generic.Client) util.ReconciliationStatus {
@@ -150,6 +152,7 @@ func (d *managedDispatcherImpl) Create(clusterName string) {
if err == nil {
version := util.ObjectVersion(obj)
d.recordVersion(clusterName, version)
metrics.DispatchOperationDurationFromStart("create", start)
return util.StatusAllOK
}

@@ -175,6 +178,7 @@ func (d *managedDispatcherImpl) Create(clusterName string) {

d.recordError(clusterName, op, errors.Errorf("An update will be attempted instead of a creation due to an existing resource"))
d.Update(clusterName, obj)
metrics.DispatchOperationDurationFromStart("update", start)
return util.StatusAllOK
})
}