ServiceMonitor is dropped by prometheus #1114

Closed
sergeyshaykhullin opened this issue Jul 3, 2020 · 15 comments
@sergeyshaykhullin

sergeyshaykhullin commented Jul 3, 2020

I installed Jaeger using the jaeger-operator Helm chart, then created a Jaeger instance through the CRDs.

ServiceMonitor:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  creationTimestamp: "2020-07-05T16:35:14Z"
  generation: 1
  labels:
    name: jaeger-jaeger-operator
  name: jaeger-jaeger-operator-metrics
  namespace: jaeger
  ownerReferences:
  - apiVersion: v1
    blockOwnerDeletion: true
    controller: true
    kind: Service
    name: jaeger-jaeger-operator-metrics
    uid: 8249f0e3-8553-4f8d-91c2-c6b1e406bd3a
  resourceVersion: "2769"
  selfLink: /apis/monitoring.coreos.com/v1/namespaces/jaeger/servicemonitors/jaeger-jaeger-operator-metrics
  uid: 170bbefb-e578-4a4b-a4c5-3c59e890bc2e
spec:
  endpoints:
  - bearerTokenSecret:
      key: ""
    port: http-metrics
  - bearerTokenSecret:
      key: ""
    port: cr-metrics
  namespaceSelector: {}
  selector:
    matchLabels:
      name: jaeger-jaeger-operator

Jaeger metrics service:

apiVersion: v1
kind: Service
metadata:
  creationTimestamp: "2020-07-05T16:34:23Z"
  labels:
    name: jaeger-jaeger-operator
  name: jaeger-jaeger-operator-metrics
  namespace: jaeger
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: Deployment
    name: jaeger-jaeger-operator
    uid: 2476e7c9-a430-4e9b-aea8-2c5f7615542a
  resourceVersion: "2755"
  selfLink: /api/v1/namespaces/jaeger/services/jaeger-jaeger-operator-metrics
  uid: 8249f0e3-8553-4f8d-91c2-c6b1e406bd3a
spec:
  clusterIP: 10.110.81.52
  ports:
  - name: http-metrics
    port: 8383
    protocol: TCP
    targetPort: 8383
  - name: cr-metrics
    port: 8686
    protocol: TCP
    targetPort: 8686
  selector:
    name: jaeger-jaeger-operator
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer: {}

Jaeger instance created by the operator:

apiVersion: v1
kind: Pod
metadata:
  annotations:
    cni.projectcalico.org/podIP: 10.244.135.13/32
    cni.projectcalico.org/podIPs: 10.244.135.13/32
    linkerd.io/inject: disabled
    prometheus.io/port: "14269"
    prometheus.io/scrape: "true"
    sidecar.istio.io/inject: "false"
  creationTimestamp: "2020-07-05T16:35:14Z"
  generateName: jaeger-jaeger-operator-jaeger-59d748f87f-
  labels:
    app: jaeger
    app.kubernetes.io/component: all-in-one
    app.kubernetes.io/instance: jaeger-jaeger-operator-jaeger
    app.kubernetes.io/managed-by: jaeger-operator
    app.kubernetes.io/name: jaeger-jaeger-operator-jaeger
    app.kubernetes.io/part-of: jaeger
    pod-template-hash: 59d748f87f
  name: jaeger-jaeger-operator-jaeger-59d748f87f-cwrcr
  namespace: jaeger
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: ReplicaSet
    name: jaeger-jaeger-operator-jaeger-59d748f87f
    uid: 6393927c-a01b-4cec-bf68-eb1e93eae413
  resourceVersion: "2950"
  selfLink: /api/v1/namespaces/jaeger/pods/jaeger-jaeger-operator-jaeger-59d748f87f-cwrcr
  uid: a4249b59-f623-40ce-b49c-b7a908fd9d02
spec:
  containers:
  - args:
    - --badger.directory-key=/badger/key
    - --badger.directory-value=/badger/data
    - --badger.ephemeral=false
    - --query.ui-config=/etc/config/ui.json
    - --sampling.strategies-file=/etc/jaeger/sampling/sampling.json
    env:
    - name: SPAN_STORAGE_TYPE
      value: badger
    - name: COLLECTOR_ZIPKIN_HTTP_PORT
      value: "9411"
    image: jaegertracing/all-in-one:1.18.1
    imagePullPolicy: IfNotPresent
    livenessProbe:
      failureThreshold: 5
      httpGet:
        path: /
        port: 14269
        scheme: HTTP
      initialDelaySeconds: 5
      periodSeconds: 15
      successThreshold: 1
      timeoutSeconds: 1
    name: jaeger
    ports:
    - containerPort: 5775
      name: zk-compact-trft
      protocol: UDP
    - containerPort: 5778
      name: config-rest
      protocol: TCP
    - containerPort: 6831
      name: jg-compact-trft
      protocol: UDP
    - containerPort: 6832
      name: jg-binary-trft
      protocol: UDP
    - containerPort: 9411
      name: zipkin
      protocol: TCP
    - containerPort: 14267
      name: c-tchan-trft
      protocol: TCP
    - containerPort: 14268
      name: c-binary-trft
      protocol: TCP
    - containerPort: 16686
      name: query
      protocol: TCP
    - containerPort: 14269
      name: admin-http
      protocol: TCP
    - containerPort: 14250
      name: grpc
      protocol: TCP
    readinessProbe:
      failureThreshold: 3
      httpGet:
        path: /
        port: 14269
        scheme: HTTP
      initialDelaySeconds: 1
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 1
    resources: {}
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /badger
      name: jaeger
    - mountPath: /etc/config
      name: jaeger-jaeger-operator-jaeger-ui-configuration-volume
      readOnly: true
    - mountPath: /etc/jaeger/sampling
      name: jaeger-jaeger-operator-jaeger-sampling-configuration-volume
      readOnly: true
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: jaeger-jaeger-operator-jaeger-token-bw54r
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  nodeName: node3
  priority: 0
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: jaeger-jaeger-operator-jaeger
  serviceAccountName: jaeger-jaeger-operator-jaeger
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - name: jaeger
    persistentVolumeClaim:
      claimName: jaeger-pvc
  - configMap:
      defaultMode: 420
      items:
      - key: ui
        path: ui.json
      name: jaeger-jaeger-operator-jaeger-ui-configuration
    name: jaeger-jaeger-operator-jaeger-ui-configuration-volume
  - configMap:
      defaultMode: 420
      items:
      - key: sampling
        path: sampling.json
      name: jaeger-jaeger-operator-jaeger-sampling-configuration
    name: jaeger-jaeger-operator-jaeger-sampling-configuration-volume
  - name: jaeger-jaeger-operator-jaeger-token-bw54r
    secret:
      defaultMode: 420
      secretName: jaeger-jaeger-operator-jaeger-token-bw54r

But Prometheus dropped the metrics:
image

image

image

@ghost ghost added the needs-triage New issues, in need of classification label Jul 3, 2020
@jpkrohling
Contributor

Are you able to determine the root cause?

@sergeyshaykhullin
Author

@jpkrohling I tried to curl the metrics service, but it does not respond. I checked the selector and the endpoints, and they look fine. The ports are also defined in the pod definition, but metrics are not being collected.

@jpkrohling
Contributor

@sergeyshaykhullin are you able to get the YAMLs again, properly formatted? It's very hard to understand them with the current formatting. I'm wondering why it shows the target port as 16686 for the Operator Metrics. From what I remember, we create a service monitor only for the ports from the operator itself (not for the operands), and the ports to get the metrics from should be 8383/8686.

@sergeyshaykhullin
Author

@jpkrohling Sorry, I fixed the YAML formatting.

@jpkrohling
Contributor

Far better, thanks! I did some further cleaning, to remove the managed fields. I'll add this to my queue, but I need a few days to try it out. If you do have an idea on what's going on and what the fix might be, let me know, as it would help expedite a solution ;-)

@sergeyshaykhullin
Author

sergeyshaykhullin commented Jul 9, 2020

I've found that the ServiceMonitor points not to Jaeger but to the Jaeger operator, because the Jaeger pod has no metrics ports, while the Jaeger operator pod does:

apiVersion: v1
kind: Pod
metadata:
  annotations:
    cni.projectcalico.org/podIP: 10.244.3.71/32
    cni.projectcalico.org/podIPs: 10.244.3.71/32
  creationTimestamp: "2020-07-05T16:34:23Z"
  generateName: jaeger-jaeger-operator-6d797c86f-
  labels:
    app.kubernetes.io/name: jaeger-operator
    pod-template-hash: 6d797c86f
  name: jaeger-jaeger-operator-6d797c86f-tn5hs
  namespace: jaeger
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: ReplicaSet
    name: jaeger-jaeger-operator-6d797c86f
    uid: 8fcaceb9-c4e1-43f3-abaf-405150143522
  resourceVersion: "2715"
  selfLink: /api/v1/namespaces/jaeger/pods/jaeger-jaeger-operator-6d797c86f-tn5hs
  uid: 22d60c2e-df1d-47bc-86dc-2da0bfc965b5
spec:
  containers:
  - args:
    - start
    env:
    - name: WATCH_NAMESPACE
    - name: POD_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.name
    - name: POD_NAMESPACE
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.namespace
    - name: OPERATOR_NAME
      value: jaeger-jaeger-operator
    image: jaegertracing/jaeger-operator:master
    imagePullPolicy: Always
    name: jaeger-jaeger-operator
    ports:
    - containerPort: 8383
      name: metrics
      protocol: TCP
    resources:
      limits:
        cpu: 200m
        memory: 200M
      requests:
        cpu: 100m
        memory: 100M
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: jaeger-jaeger-operator-token-54kqc
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  nodeName: node4
  priority: 0
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: jaeger-jaeger-operator
  serviceAccountName: jaeger-jaeger-operator
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - name: jaeger-jaeger-operator-token-54kqc
    secret:
      defaultMode: 420
      secretName: jaeger-jaeger-operator-token-54kqc
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2020-07-05T16:34:23Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2020-07-05T16:35:10Z"
    status: "True"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2020-07-05T16:35:10Z"
    status: "True"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2020-07-05T16:34:23Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: docker://fd468ec8cf2d2dbf54c8255360433a64173df2d58d33e4544766a5f9f8bd4e5a
    image: jaegertracing/jaeger-operator:master
    imageID: docker-pullable://jaegertracing/jaeger-operator@sha256:10c5ec958adba5013b63fdc0b954a78d4dafc5d9c2fe007daa73811d6f4ba75d
    lastState: {}
    name: jaeger-jaeger-operator
    ready: true
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2020-07-05T16:35:10Z"
  hostIP: 37.46.128.123
  phase: Running
  podIP: 10.244.3.71
  podIPs:
  - ip: 10.244.3.71
  qosClass: Burstable
  startTime: "2020-07-05T16:34:23Z"

Inside the jaeger-operator pod, port 8383 is exposed, but 8686 is not.

There is a mismatch in the generated labels.
Service selector:

selector:
  name: jaeger-jaeger-operator

But the labels on the Jaeger pod created by the operator are:

labels:
  app: jaeger
  app.kubernetes.io/component: all-in-one
  app.kubernetes.io/instance: jaeger-jaeger-operator-jaeger
  app.kubernetes.io/managed-by: jaeger-operator
  app.kubernetes.io/name: jaeger-jaeger-operator-jaeger
  app.kubernetes.io/part-of: jaeger
  pod-template-hash: 59d748f87f

I used this Helm chart: https://github.com/jaegertracing/helm-charts/tree/master/charts/jaeger-operator, but this is the template, and its labels are OK: https://github.com/jaegertracing/helm-charts/blob/master/charts/jaeger-operator/templates/service.yaml
Does the Jaeger operator override the service monitor created by Helm?
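
To make the mismatch concrete, here is a minimal Go sketch of equality-based label matching, the rule a non-empty Service selector or ServiceMonitor matchLabels follows: every key/value pair in the selector must be present on the target. The helper name selectorMatches is hypothetical, and this is not code from the operator or the Helm chart; the label values are copied from the YAML in this thread.

```go
package main

import "fmt"

// selectorMatches reports whether every key/value pair in selector is
// present in labels (equality-based matching, as used by matchLabels).
// Hypothetical helper for illustration only.
func selectorMatches(selector, labels map[string]string) bool {
	for k, want := range selector {
		if got, ok := labels[k]; !ok || got != want {
			return false
		}
	}
	return true
}

func main() {
	selector := map[string]string{"name": "jaeger-jaeger-operator"}

	// Labels of the metrics Service from this thread.
	serviceLabels := map[string]string{"name": "jaeger-jaeger-operator"}

	// Labels of the Jaeger instance pod from this thread (abridged).
	jaegerPodLabels := map[string]string{
		"app":                          "jaeger",
		"app.kubernetes.io/component":  "all-in-one",
		"app.kubernetes.io/managed-by": "jaeger-operator",
	}

	fmt.Println("matches metrics Service:", selectorMatches(selector, serviceLabels))
	fmt.Println("matches Jaeger pod:", selectorMatches(selector, jaegerPodLabels))
}
```

With these inputs the selector matches the metrics Service but not the Jaeger instance pod, which is the mismatch described above.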

@jpkrohling
Contributor

Could you please check which labels the service monitor has right after Helm provisions it? If it contains the label app.kubernetes.io/managed-by: jaeger-operator, then the Jaeger Operator will attempt to manage it. Otherwise, the Jaeger Operator should keep its hands off this service monitor.
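
The check described here can be sketched in a few lines of Go. This is only an illustration of the rule as stated in this comment, not the operator's actual code; isManagedByOperator and the constant names are hypothetical.

```go
package main

import "fmt"

// Label key and value quoted in the comment above.
const (
	managedByKey   = "app.kubernetes.io/managed-by"
	managedByValue = "jaeger-operator"
)

// isManagedByOperator is a hypothetical helper: it reports whether an
// object's labels carry the managed-by marker described above.
func isManagedByOperator(labels map[string]string) bool {
	return labels[managedByKey] == managedByValue
}

func main() {
	// Labels of the ServiceMonitor posted at the top of this issue.
	serviceMonitorLabels := map[string]string{"name": "jaeger-jaeger-operator"}

	// Prints "false": the marker label is absent from that object.
	fmt.Println(isManagedByOperator(serviceMonitorLabels))
}
```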

@sergeyshaykhullin
Author

@jpkrohling This is strange: the required label does not exist. But I found a manager field:
image

@jpkrohling
Contributor

That's interesting, I would expect this code to only create a service monitor if none exists, not to update an existing one:

func createServiceMonitor(ctx context.Context, cfg *rest.Config, namespace string, service *corev1.Service) {
	tracer := global.TraceProvider().GetTracer(v1.BootstrapTracer)
	ctx, span := tracer.Start(ctx, "createServiceMonitor")
	defer span.End()

	// CreateServiceMonitors will automatically create the prometheus-operator ServiceMonitor resources
	// necessary to configure Prometheus to scrape metrics from this operator.
	services := []*corev1.Service{service}
	_, err := metrics.CreateServiceMonitors(cfg, namespace, services)
	if err != nil {
		if err == metrics.ErrServiceMonitorNotPresent {
			log.WithError(err).Info("Install prometheus-operator in your cluster to create ServiceMonitor objects")
		} else {
			span.SetStatus(codes.Internal)
			span.SetAttribute(key.String("error", err.Error()))
			log.WithError(err).Warn("could not create ServiceMonitor object")
		}
	}
}

@sergeyshaykhullin
Author

@jpkrohling Any updates?

@jpkrohling
Contributor

Not yet, sorry. I'll try to get a couple of hours this week to try to reproduce/fix this one.

@jpkrohling jpkrohling self-assigned this Jul 20, 2020
@sergeyshaykhullin
Author

;c

@jpkrohling
Contributor

It's currently in my queue; I should be able to look into it during the next couple of weeks.

@jpkrohling
Contributor

I couldn't reproduce your situation. I did find a couple of road bumps and a small bug, but they don't seem related to your report. Your report seems to be a duplicate of #1067, which also has the Helm chart in the mix.

Basically, let's differentiate between the three possible targets:

  1. Operator metrics -- metrics about the Jaeger Operator runtime (port 8383)
  2. CR metrics -- currently almost empty, containing custom stats related to the CRs (port 8686)
  3. Jaeger instance (operand) metrics -- the Jaeger Operator does not currently handle those

It looks like the Helm charts are able to create the service monitor objects for the Jaeger instances, but that's not relevant to instances created via the operator. The Helm charts don't currently have a service monitor for the Jaeger Operator, nor should they, since the service monitor is provisioned automatically by the Jaeger Operator.

That said, here's how to test it:

  1. Deploy the Prometheus Operator (like via make deploy-prometheus-operator in the linked PR)
  2. Deploy the Jaeger Operator (this will result in the service monitor being created)
  3. Create the RBAC objects for the Prometheus instance (see snippets below)
  4. Create a Prometheus instance (also in the snippets below)
  5. Deploy the Jaeger Operator
  6. Two targets should be discovered (see the screenshot below)

In the linked PR, you should see the two targets as active. In the latest release, you should see only one active (8383, which is the one that actually has metrics).

I realize that you probably care more about the Jaeger instance metrics. You can refer to an article I wrote some time ago for a more complete scenario, but here's a list of steps to achieve a simple scenario (snippets below):

  1. Create a new service exposing the admin-http port (14269) for the Jaeger instance (operand)
  2. Create a service monitor that selects this new service
  3. Create a Prometheus instance that discovers based on this service monitor
  4. One target should be discovered (see second screenshot)
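
One detail worth checking in steps 1 and 2: a ServiceMonitor endpoint's port field refers to a Service port by name, so the two names have to agree. The Go sketch below illustrates that check under stated assumptions: missingPorts is a hypothetical helper, and "admin-port" is just an example name for the Service port on 14269.

```go
package main

import "fmt"

// missingPorts returns the endpoint port names that do not correspond to
// any named port on the Service. Hypothetical helper for illustration.
func missingPorts(endpointPorts, servicePorts []string) []string {
	known := make(map[string]bool, len(servicePorts))
	for _, p := range servicePorts {
		known[p] = true
	}
	var missing []string
	for _, p := range endpointPorts {
		if !known[p] {
			missing = append(missing, p)
		}
	}
	return missing
}

func main() {
	// Example name given to the Service port that exposes 14269.
	servicePorts := []string{"admin-port"}

	// Endpoint uses the Service port name: nothing is missing, prints [].
	fmt.Println(missingPorts([]string{"admin-port"}, servicePorts))

	// Endpoint uses the container port name instead: prints [admin-http].
	fmt.Println(missingPorts([]string{"admin-http"}, servicePorts))
}
```

An endpoint whose port name is not found on the selected Service yields no scrape target, so a non-empty result here is a likely reason for a target not being discovered.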

Snippets 1 - Jaeger Operator:

---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus
rules:
- apiGroups: [""]
  resources:
  - nodes
  - nodes/metrics
  - services
  - endpoints
  - pods
  verbs: ["get", "list", "watch"]
- apiGroups: [""]
  resources:
  - configmaps
  verbs: ["get"]
- nonResourceURLs: ["/metrics"]
  verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
- kind: ServiceAccount
  name: prometheus
  namespace: default
---
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus-for-jaeger-operator
  namespace: default
spec: 
  serviceAccountName: prometheus
  serviceMonitorSelector:
    matchLabels:
      name: jaeger-operator

Results in:
image

Snippets 2 - Jaeger instance (operand):

---
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: simplest
spec:
  labels:
    name: jaeger
---
apiVersion: v1
kind: Service
metadata:
  labels:
    name: jaeger
  name: simplest-admin
  namespace: default
spec:
  ports:
  - name: admin-port
    port: 14269
    protocol: TCP
  selector:
    name: jaeger
  type: ClusterIP
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    name: jaeger
  name: jaeger-metrics
  namespace: default
spec:
  endpoints:
  - port: admin-port
  selector:
    matchLabels:
      name: jaeger
---
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus-for-jaeger
  namespace: default
spec: 
  serviceAccountName: prometheus
  serviceMonitorSelector:
    matchLabels:
      name: jaeger

Results in:
image

In the end, I think the lesson is that the Jaeger Operator should be creating service monitors for the operands by default (created #1156).

@jpkrohling
Contributor

@sergeyshaykhullin I'm closing this as I don't think it's a bug in the operator, but let me know if there's any clarification needed. I opened an issue with the Helm Charts repo.
