Duplicate targets can mangle HTTP probe results. #436
Thanks for reporting this bug @JohnWillker. It is indeed a very strange bug. It seems like a case of "success" and "resp-code" being over-counted, or "total" being under-counted. Given that the interval is 10s, there should have been just one probe in that interval. Also, I am curious why there are two data lines. Do you have only one Kubernetes service, and is it named payout-http? I'll continue looking at the code to see what may cause this.
It does look like something fishy is going on with targets here. There should not be two identical data lines coming from a probe. Since the Kubernetes service discovery code is newer (added between v0.10.7 and v0.10.8), it's most likely the culprit here. Nevertheless, HTTP probes keep results keyed by target name, and the names are the same here. I'll continue looking, but it would be great if you could share what your services look like.
@JohnWillker do you have the same service (payout-http) in two namespaces by any chance? I think it's possible to trigger a bug in HTTP results accounting if we somehow have two or more targets with the same name. In Kubernetes we can get two targets with the same name if they are in different namespaces. Relevant code: cloudprober/probes/http/http.go, line 341, at dd0bb3e.

I think we need to fix a few things here.
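To make the failure mode concrete, here is a minimal, hypothetical Go sketch (not the actual http.go code) of what keying results by target name alone does when two targets share a name: both probe loops feed the same map entry, so counters advance twice per interval, and because those updates are not coordinated with metric export, a snapshot can even end up internally inconsistent (e.g. success momentarily ahead of total).

```go
package main

import (
	"fmt"
	"log"
)

// probeResult is a simplified stand-in for the per-target counters an
// HTTP probe accumulates between metric exports.
type probeResult struct {
	total, success int
}

func main() {
	// Results keyed by target name only -- the pattern at issue.
	results := make(map[string]*probeResult)

	// Two Kubernetes services named "payout-http", in namespaces
	// example-prod and example-dev, both show up under the same name.
	targets := []string{"payout-http", "payout-http"}

	for _, name := range targets {
		if _, ok := results[name]; ok {
			// Ideally this never happens; logging makes the duplicate
			// visible instead of silently sharing one counters entry.
			log.Printf("duplicate target %q; its results will share one entry", name)
			continue
		}
		results[name] = &probeResult{}
	}

	// One probe cycle: every duplicate increments the *shared* entry,
	// so counters advance twice per 10s interval instead of once.
	for range targets {
		r := results["payout-http"]
		r.total++
		r.success++
	}

	fmt.Printf("payout-http: total=%d success=%d\n",
		results["payout-http"].total, results["payout-http"].success)
}
```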
Hi @manugarg, I have 2 services with the same name in 2 different namespaces. This is one of the services:

```yaml
apiVersion: v1
kind: Service
metadata:
  labels:
    app: payout-http
    probetype: http
  name: payout-http
  namespace: example-prod
spec:
  ports:
  - name: http
    port: 8800
    protocol: TCP
    targetPort: 8800
  selector:
    app: payout-http
  sessionAffinity: None
  type: ClusterIP
```

This is one example, but we have a bunch of services following this model across two namespaces, and the filter only matches in the prod namespace (the dev namespace doesn't have the matching label). Is there some way to separate these in the metrics by returning the full service name in FQDN format, e.g. `payout-http.example-prod.svc.cluster.local`? Or is there another way to work with this model? Maybe using …
Thanks once again. So, I've found that there is a problem in how we cache Kubernetes resources: we were using only names as keys, while names can be the same across namespaces (#437). I'll soon (within a day) send a fix for this problem. After that fix, filtering by label will work for you. For now, if you want, you can work around this problem by filtering by namespace in the rds_server stanza, along these lines:
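The exact stanza was lost from this thread; as a rough sketch, assuming the RDS Kubernetes provider's `kubernetes_config` and `namespace` fields (verify the field names against your cloudprober version's config proto):

```
rds_server {
  provider {
    kubernetes_config {
      # Only discover resources from this namespace.
      namespace: "example-prod"
      services {}
    }
  }
}
```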
You can use a target's labels to identify a specific target. In your probe config you can add a stanza like the one below.
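The stanza itself did not survive in this thread; here is a sketch of the idea using cloudprober's `additional_label` mechanism (the `namespace` label key is an assumption here; whether your targets actually carry that label depends on the discovery provider and version):

```
probe {
  name: "payout-http"
  type: HTTP
  # ... targets, interval, http_probe, etc. as before ...

  # Attach the value of the target's "namespace" label (if it has one)
  # to every metric this probe exports for that target.
  additional_label {
    key: "namespace"
    value: "@target.label.namespace@"
  }
}
```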
If the target has a label with that key, its value is attached to the metrics exported for that target, so metrics from the two namespaces can be told apart.
@manugarg
We anyway keep the target information in a map keyed by name, i.e. we lose the information for duplicate targets anyway. Also, log a warning if we get a duplicate target, as it should ideally never happen. Duplicate targets can lead to spurious behavior from probes: #436

PiperOrigin-RevId: 325081938
…st names. In Kubernetes, resources may have the same name across namespaces. Currently, this can lead to pretty bizarre behavior. See #436 (comment) for background.

PiperOrigin-RevId: 325143549
I am closing this now, as it should no longer be possible to generate duplicate targets (after the last couple of changes to the RDS module), except by deliberately specifying duplicate targets statically. Also, this problem is not specific to HTTP probes; duplicate targets will cause problems for all kinds of probes. I've filed a separate bug to consider changing the targets module's API to always return unique targets from ListEndpoints(): #445
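As a rough illustration of what unique-by-name ListEndpoints() results could mean (a hypothetical sketch, not cloudprober's actual targets API): duplicates get dropped at that boundary, with a logged warning, so probes never share result entries.

```go
package targets

import "log"

// Endpoint is a simplified stand-in for the endpoint type the targets
// module returns; the real type carries more fields (port, labels, ...).
type Endpoint struct {
	Name string
}

// uniqueEndpoints keeps the first occurrence of each name and logs the
// rest, so callers (probes) never see two targets with the same name.
func uniqueEndpoints(eps []Endpoint) []Endpoint {
	seen := make(map[string]bool, len(eps))
	out := eps[:0]
	for _, ep := range eps {
		if seen[ep.Name] {
			log.Printf("dropping duplicate target %q", ep.Name)
			continue
		}
		seen[ep.Name] = true
		out = append(out, ep)
	}
	return out
}
```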
There are some metrics that don't make sense. How is it possible for `success` to be bigger than `total`?

Environment:
My deploy:

My configmap:

Up to here everything is OK, 178/178:

But here the problem begins, 179/180: