Duplicate targets can mangle HTTP probe results. #436
Thanks for reporting this bug @JohnWillker. It is indeed a very strange bug. It seems like a case of "success" and "resp-code" being over-counted, or "total" being under-counted. Given that the interval is 10s, there should have been just one probe in that interval. Also, I am curious why there are two data lines. Do you have only one Kubernetes service, and is it named payout-http? I'll continue looking at the code to see what may cause this.
It does look like something fishy is going on with targets here. There should not be two identical data lines coming from a probe. Since the Kubernetes service discovery code is newer (added between v0.10.7 and v0.10.8), it's most likely the culprit here. Nevertheless, HTTP probes keep results keyed by target name, and the names are the same here. I'll continue looking, but it would be great if you could share what your services look like.
@JohnWillker do you have the same service (payout-http) in two namespaces by any chance? I think it's possible to trigger a bug in HTTP results accounting if we somehow have two or more targets with the same name. In Kubernetes we can get two targets with the same name if they are in different namespaces. Relevant code: cloudprober/probes/http/http.go, line 341, at dd0bb3e.

I think we need to fix a few things here.
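To make the failure mode concrete, here is a minimal, hypothetical Go sketch (not the actual http.go code) of what keying results by target name alone does when two targets share a name: both probe loops feed the same map entry, so counters advance twice per interval, and because those updates are not coordinated with metric export, a snapshot can even end up internally inconsistent (e.g. success momentarily ahead of total).

```go
package main

import (
	"fmt"
	"log"
)

// probeResult is a simplified stand-in for the per-target counters an
// HTTP probe accumulates between metric exports.
type probeResult struct {
	total, success int
}

func main() {
	// Results keyed by target name only -- the pattern at issue.
	results := make(map[string]*probeResult)

	// Two Kubernetes services named "payout-http", in namespaces
	// example-prod and example-dev, both show up under the same name.
	targets := []string{"payout-http", "payout-http"}

	for _, name := range targets {
		if _, ok := results[name]; ok {
			// Ideally this never happens; logging makes the duplicate
			// visible instead of silently sharing one counters entry.
			log.Printf("duplicate target %q; its results will share one entry", name)
			continue
		}
		results[name] = &probeResult{}
	}

	// One probe cycle: every duplicate increments the *shared* entry,
	// so counters advance twice per 10s interval instead of once.
	for range targets {
		r := results["payout-http"]
		r.total++
		r.success++
	}

	fmt.Printf("payout-http: total=%d success=%d\n",
		results["payout-http"].total, results["payout-http"].success)
}
```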
Hi @manugarg, I have 2 services with the same name in 2 different namespaces. This is one of the services:

```yaml
apiVersion: v1
kind: Service
metadata:
  labels:
    app: payout-http
    probetype: http
  name: payout-http
  namespace: example-prod
spec:
  ports:
  - name: http
    port: 8800
    protocol: TCP
    targetPort: 8800
  selector:
    app: payout-http
  sessionAffinity: None
  type: ClusterIP
```

This is one example, but we have a bunch of services following this model across two namespaces, and the filter only matches in the prod namespace (the dev namespace doesn't have the matching label). Is there some way to separate these in the metrics by returning the full service name in FQDN format, e.g. `payout-http.example-prod.svc.cluster.local`? Or is there another way to work with this model? Maybe using …
Thanks once again. So, I've found that there is a problem in how we cache Kubernetes resources: we were using only names as keys, while names can be the same across namespaces (#437). I'll soon (within a day) send a fix for this problem. After that fix, filtering by label will work for you. For now, if you want, you can work around this problem by filtering by namespace in the rds_server stanza, along these lines:
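The exact stanza was lost from this thread; as a rough sketch, assuming the RDS Kubernetes provider's `kubernetes_config` and `namespace` fields (verify the field names against your cloudprober version's config proto):

```
rds_server {
  provider {
    kubernetes_config {
      # Only discover resources from this namespace.
      namespace: "example-prod"
      services {}
    }
  }
}
```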
You can use a target's labels to identify a specific target. In your probe config you can add a stanza like the one below.
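The stanza itself did not survive in this thread; here is a sketch of the idea using cloudprober's `additional_label` mechanism (the `namespace` label key is an assumption here; whether your targets actually carry that label depends on the discovery provider and version):

```
probe {
  name: "payout-http"
  type: HTTP
  # ... targets, interval, http_probe, etc. as before ...

  # Attach the value of the target's "namespace" label (if it has one)
  # to every metric this probe exports for that target.
  additional_label {
    key: "namespace"
    value: "@target.label.namespace@"
  }
}
```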
If the target has a label with that key, its value is attached to the metrics exported for that target, so metrics from the two namespaces can be told apart.
@manugarg
We anyway keep the target information in a map keyed by name, i.e. we lose the information for duplicate targets anyway. Also, log a warning if we get a duplicate target, as it should ideally never happen. Duplicate targets can lead to spurious behavior from probes: #436

PiperOrigin-RevId: 325081938
…st names. In Kubernetes, resources may have the same name across namespaces. Currently, this can lead to pretty bizarre behavior. See #436 (comment) for background.

PiperOrigin-RevId: 325143549
I am closing this now, as it should no longer be possible to generate duplicate targets (after the last couple of changes to the RDS module), except by deliberately specifying duplicate targets statically. Also, this problem is not specific to HTTP probes; duplicate targets will cause problems for all kinds of probes. I've filed a separate bug to consider changing the targets module's API to always return unique targets from ListEndpoints(): #445
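As a rough illustration of what unique-by-name ListEndpoints() results could mean (a hypothetical sketch, not cloudprober's actual targets API): duplicates get dropped at that boundary, with a logged warning, so probes never share result entries.

```go
package targets

import "log"

// Endpoint is a simplified stand-in for the endpoint type the targets
// module returns; the real type carries more fields (port, labels, ...).
type Endpoint struct {
	Name string
}

// uniqueEndpoints keeps the first occurrence of each name and logs the
// rest, so callers (probes) never see two targets with the same name.
func uniqueEndpoints(eps []Endpoint) []Endpoint {
	seen := make(map[string]bool, len(eps))
	out := eps[:0]
	for _, ep := range eps {
		if seen[ep.Name] {
			log.Printf("dropping duplicate target %q", ep.Name)
			continue
		}
		seen[ep.Name] = true
		out = append(out, ep)
	}
	return out
}
```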
There are some metrics that don't make sense. How is it possible for `success` to be bigger than `total`?

Environment:
My deploy:

My configmap:

Up to here everything is OK, 178/178:

But here the problem begins, 179/180: