This repository has been archived by the owner on Nov 5, 2021. It is now read-only.

Duplicate targets can mangle HTTP probe results. #436

Closed
jwillker opened this issue Aug 4, 2020 · 7 comments


jwillker commented Aug 4, 2020

There are some metrics that don't make sense. How is it possible for success to be bigger than total?

Environment:

  • GKE: 1.14
  • Istio: 1.5.6

My deployment:

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: cloudprober
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cloudprober
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: cloudprober
    spec:
      containers:
      - args:
        - --config_file
        - /cfg/cloudprober.cfg
        - --logtostderr
        - --cloud_metadata=none
        - --disable_cloud_logging
        - -debug_log
        command:
        - /cloudprober
        image: cloudprober/cloudprober:latest
        imagePullPolicy: Always
        name: cloudprober
        ports:
        - containerPort: 9313
          name: http
          protocol: TCP
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /cfg
          name: cloudprober-config
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
      volumes:
      - configMap:
          defaultMode: 420
          name: cloudprober-config
        name: cloudprober-config

My configmap:

apiVersion: v1
data:
  cloudprober.cfg: |
    probe {
      name: "http-internal-probe"
      type: HTTP
      targets {
        rds_targets {
          resource_path: "k8s://services"
          filter {
            key: "labels.probetype"
            value: "http"
          }
        }
      }
      http_probe {
        resolve_first: true
        relative_url: "/healthz"
      }
      interval_msec: 10000
      timeout_msec: 5000
    }

    rds_server {
      provider {
        kubernetes_config {
          services {}
        }
      }
    }

Up to here it's OK, 178/178:

cloudprober 1596571800161288155 1596572816 labels=ptype=http,probe=http-internal-probe,dst=payout-http total=178 success=178 latency=758171.735 timeouts=0 resp-code=map:code,503:178 resp-body=map:resp
cloudprober 1596571800161288156 1596572816 labels=ptype=http,probe=http-internal-probe,dst=payout-http total=178 success=178 latency=758171.735 timeouts=0 resp-code=map:code,503:178 resp-body=map:resp

But here the problem begins, 179/180:

cloudprober 1596571800161288163 1596572826 labels=ptype=http,probe=http-internal-probe,dst=payout-http total=179 success=180 latency=786269.728 timeouts=0 resp-code=map:code,503:180 resp-body=map:resp
cloudprober 1596571800161288164 1596572826 labels=ptype=http,probe=http-internal-probe,dst=payout-http total=179 success=180 latency=786269.728 timeouts=0 resp-code=map:code,503:180 resp-body=map:resp

manugarg commented Aug 5, 2020

Thanks for reporting this bug, @JohnWillker. It is indeed a very strange bug. It seems like a case of "success" and "resp-code" being over-counted, or total being under-counted. Given that the interval is 10s, there should have been just one probe in that interval.

Also, I am curious why there are two data lines. Do you have only one Kubernetes service, and is it named payout-http?

I'll continue looking at the code to see what may cause this.


manugarg commented Aug 5, 2020

It does look like something fishy is going on with targets here. There should not be two identical data lines coming from a probe.

Since the Kubernetes services discovery code is newer (added between v0.10.7 and v0.10.8), it's most likely the culprit here. Nevertheless, HTTP probes keep results keyed by target name, and the names are the same here. I'll continue looking, but it would be great if you could share what your services look like.


manugarg commented Aug 5, 2020

@JohnWillker do you have the same service (payout-http) in two namespaces by any chance?

I think it's possible to trigger a bug in the HTTP results accounting if we somehow have two or more targets with the same name. In Kubernetes we can get two targets with the same name if they are in different namespaces.

Relevant code:

req, result := p.httpRequests[target.Name], p.results[target.Name]
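
To illustrate, here is a minimal, hypothetical Go sketch (not cloudprober's actual code) of what keying results by target name alone does when two targets share a name, such as the same service in two namespaces: both end up updating a single shared counter set, so the exported numbers no longer describe one target.

package main

import "fmt"

// result is a hypothetical stand-in for per-target probe counters.
type result struct {
	total, success int64
}

func probeOnce(targets []string, results map[string]*result) {
	for _, name := range targets {
		r, ok := results[name]
		if !ok {
			r = &result{}
			results[name] = r
		}
		// Both same-named targets increment the same entry.
		r.total++
		r.success++
	}
}

func main() {
	results := map[string]*result{}
	// Two distinct services that happen to share a name (different namespaces).
	probeOnce([]string{"payout-http", "payout-http"}, results)
	fmt.Println(results["payout-http"]) // &{2 2} after a single probe cycle
}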

I think we need to fix a few things:

  • In Kubernetes RDS, either automatically filter by namespace (use the "default" namespace if none is given) or somehow attach the namespace to the target name.
  • In the targets module, make sure that we never return two targets with the same name. If such a situation exists, log a warning and return only one of the targets.
  • Build defense in depth into the HTTP probe to avoid surprising behavior regardless of what is passed to it.

@manugarg manugarg self-assigned this Aug 5, 2020

jwillker commented Aug 5, 2020

Hi, @manugarg

I have 2 services with the same name in 2 different namespaces, something like example-prod and example-develop.

This is one of the services:

apiVersion: v1
kind: Service
metadata: 
  labels:
    app: payout-http
    probetype: http
  name: payout-http
  namespace: example-prod
spec:
  ports:
  - name: http
    port: 8800
    protocol: TCP
    targetPort: 8800
  selector:
    app: payout-http
  sessionAffinity: None
  type: ClusterIP

This is one example, but we have a bunch of services following this model, across two namespaces. The filter only matches in the prod namespace (the dev namespace doesn't have probetype: http).

Is there some way to separate these in the metrics by returning the entire service name in FQDN format, e.g. dst=payout-http.example-prod.svc.cluster.local?

Or is there another solution that works with this model? Maybe using kubernetes_config { endpoints {} }?


manugarg commented Aug 5, 2020

Thanks once again. So, I've found that there is a problem in how we cache Kubernetes resources. We were using only names as keys, while names can be the same across namespaces (#437). I'll send a fix for this problem soon (within a day). After that fix, filtering by label will work for you.
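
Conceptually, the fix amounts to keying the cache by namespace as well as name. A hypothetical Go sketch of such a composite key (not necessarily the actual change in #437):

package main

import "fmt"

// resourceKey is a hypothetical composite cache key: namespace plus name.
type resourceKey struct {
	namespace, name string
}

func main() {
	cache := map[resourceKey]string{}
	cache[resourceKey{"example-prod", "payout-http"}] = "10.0.0.1"
	cache[resourceKey{"example-develop", "payout-http"}] = "10.0.0.2" // no longer collides
	fmt.Println(len(cache)) // 2: both services are kept
}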

For now, if you want, you can work around this problem by filtering by namespace in the rds_server stanza. You can say:

  rds_server {
    provider {
      namespace: "example-prod"
      kubernetes_config {
        services {}
      }
    }
  }

Is there some way to separate these in the metrics by returning the entire service name in FQDN format, e.g. dst=payout-http.example-prod.svc.cluster.local?

You can use the target's labels to identify a specific target. In your probe config you can add the following stanza:

additional_label {
  key: "env"
  value: "@target.label.appenv@"
}

If the target has a label appenv=prod, this will add a label env=prod to the probe results. I am planning to add more documentation on this, but in the meantime, here is an example of additional_label:
https://github.com/google/cloudprober/blob/master/examples/additional_label/cloudprober.cfg
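
For reference, the appenv label in that example would be a hypothetical addition to the Service's metadata (the Kubernetes RDS provider exposes resource labels as target labels, the same mechanism the labels.probetype filter relies on), e.g.:

apiVersion: v1
kind: Service
metadata:
  labels:
    app: payout-http
    probetype: http
    appenv: prod   # hypothetical label consumed by @target.label.appenv@
  name: payout-http
  namespace: example-prod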

@manugarg manugarg added the bug label Aug 5, 2020
@manugarg manugarg changed the title Wrong HTTP metrics, more success than total Duplicate targets can mangle HTTP probe results. Aug 5, 2020

jwillker commented Aug 5, 2020

@manugarg
Thank you for the fast response. I will wait for the fix and use labels, as you suggested, to separate the environments.

manugarg added a commit that referenced this issue Aug 5, 2020
We keep the target information in a map keyed by name, so we lose the information for duplicate targets anyway.

Also, log a warning if we get a duplicate target as it should ideally never happen.

Duplicate targets can lead to spurious behavior from probes:
#436

PiperOrigin-RevId: 325081938
manugarg added a commit that referenced this issue Aug 6, 2020
…st names.

In Kubernetes, resources may have the same name across namespaces. Currently, this can lead to pretty bizarre behavior. See #436 (comment) for background.

PiperOrigin-RevId: 325143549

manugarg commented Aug 7, 2020

I am closing this now, as it should no longer be possible to generate duplicate targets (after the last couple of changes to the RDS module), except by deliberately specifying duplicate targets statically. Also, this problem is not specific to HTTP probes; duplicate targets will cause problems for all kinds of probes. I've filed a separate bug to consider changing the targets module's API to always return unique targets from ListEndpoints(): #445
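
For reference, a rough Go sketch of the kind of de-duplication #445 is about, keeping the first target seen per name and logging the rest (hypothetical, not the actual targets API):

package main

import (
	"fmt"
	"log"
)

// endpoint is a hypothetical stand-in for the targets module's endpoint type.
type endpoint struct{ Name string }

// uniqueEndpoints keeps the first endpoint seen for each name and logs a
// warning for duplicates, so probes never receive two targets with one name.
func uniqueEndpoints(eps []endpoint) []endpoint {
	seen := make(map[string]bool)
	out := make([]endpoint, 0, len(eps))
	for _, ep := range eps {
		if seen[ep.Name] {
			log.Printf("dropping duplicate target: %s", ep.Name)
			continue
		}
		seen[ep.Name] = true
		out = append(out, ep)
	}
	return out
}

func main() {
	eps := uniqueEndpoints([]endpoint{{"payout-http"}, {"payout-http"}})
	fmt.Println(len(eps)) // 1
}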

@manugarg manugarg closed this as completed Aug 7, 2020