VMAuth getting "cannot authorize request with auth tokens" after vmuser and vmuser-credential secret are all ready #1220
Comments
Hello, could you please check the following:
Hi @f41gh7,
For the past 7 hours, I could only see the not-found error.
In the VMAuth CR, I can now see that it is in a failed state; we use config-reloader:0.51.3 as one container in the vmauth deployment.
It's 60, and empty most of the time.
All of these regressions happened after we upgraded vm-operator from 0.49.1 to 0.51.3 and the VM components (vmauth, vmstorage, vmalert, vminsert, etc.) from 1.106.1 to 1.108.1.
I think that's probably the cause of this issue. During the configuration build, the operator first updates the VMUser statuses and only afterwards updates the configuration secret with the generated config. This logic was introduced in operator v0.50.0. I expect it to be fixed in today's upcoming operator release: the new configuration should be updated first, and only after that should the operator update the statuses of the VMUsers. Also, it looks strange that the operator cannot find …
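For illustration, here is a minimal Go sketch of the corrected ordering described above. All types and helper names are hypothetical stand-ins, not the operator's actual code:

```go
package main

import (
	"context"
	"fmt"
)

// Hypothetical stand-ins for the operator's real types.
type ConfigSecret struct{ Data map[string][]byte }
type VMUserObj struct{ Name string }

func writeSecret(ctx context.Context, s *ConfigSecret) error { return nil } // would PUT the Secret
func writeStatus(ctx context.Context, u *VMUserObj) error    { return nil } // would PATCH the status subresource

// reconcile shows the fixed order: deliver the generated config first,
// then record per-object status conditions, so config delivery is not
// blocked behind N potentially throttled status updates.
func reconcile(ctx context.Context, cfg *ConfigSecret, users []*VMUserObj) error {
	if err := writeSecret(ctx, cfg); err != nil {
		return fmt.Errorf("update config secret: %w", err)
	}
	for _, u := range users {
		if err := writeStatus(ctx, u); err != nil {
			return fmt.Errorf("update status of %s: %w", u.Name, err)
		}
	}
	return nil
}

func main() {
	users := []*VMUserObj{{Name: "team-a"}, {Name: "team-b"}}
	if err := reconcile(context.Background(), &ConfigSecret{}, users); err != nil {
		fmt.Println("reconcile failed:", err)
	}
}
```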
Thanks a lot. Taking a quick glance at the …, it really seems that with the operator upgrade from 0.49.1 to 0.51.3, many status subresources are constantly being updated, which means a lot more K8s API server operations (and network traffic), and also client-side throttling. FYI, we are using …
Indeed, the …
Thanks for reporting. Yes, the operator updates …
Many thanks @f41gh7 for following up on this! ❤
Previously, for `VMAuth`, `VMAlert` and `VMAlertmanager`, the configuration secret was updated only after the `status` fields of all matched `child` objects had been updated. This could lead to delays, since the Kubernetes API server may throttle requests. This commit updates the `Secret` with the configuration first and then updates the related child objects, which greatly decreases change delivery time. It also adds a fast path for single-resource updates, updating the status field only for the corresponding object. Related issue: #1220
The client's default rate limit of 5 requests per second is too restrictive and prevents the operator from scaling, since the operator needs to update objects' `status.condition` fields, which may require 10-20 requests per second at large scale. This commit raises the default limit to 50 and exports a metric with the configured limit. Related issue: #1220 Signed-off-by: f41gh7 <[email protected]>
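As a rough sketch of what such a limit looks like on the client side, assuming a plain client-go setup (the operator's actual wiring and flag handling may differ):

```go
package main

import (
	"fmt"

	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load a kubeconfig-based client config; in-cluster setups would
	// use rest.InClusterConfig() instead.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	// client-go defaults to QPS=5 and Burst=10 when these are unset.
	// Every status.condition PATCH counts against this budget, so a
	// large number of objects quickly hits client-side throttling.
	cfg.QPS = 50
	cfg.Burst = 100
	fmt.Printf("client rate limit: QPS=%v Burst=%v\n", cfg.QPS, cfg.Burst)
}
```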
It controls the expiration time of status.condition lastUpdateTime, which is needed to track stale parent objects. Increasing the value of this flag reduces load on the Kubernetes cluster, but it also increases the time needed to detect stale objects. For instance, if there are 2 VMAlert objects matching some VMRule, both VMAlerts will be registered in VMRule.status.conditions[].type with their names. If one of the VMAlert objects is deleted, it will be removed from VMRule.status.conditions only after 3*controller.statusLastUpdateTimeTTL, which can take up to 3 hours with the default values. Related issue: #1220 Signed-off-by: f41gh7 <[email protected]>
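A minimal sketch of that expiry rule, with illustrative names only and assuming the 1-hour default TTL implied by the 3-hour figure above:

```go
package main

import (
	"fmt"
	"time"
)

type condition struct {
	Type           string
	LastUpdateTime time.Time
}

// pruneStale drops conditions whose lastUpdateTime is older than
// 3*ttl; this is how a parent object that is no longer reconciled
// (e.g. a deleted VMAlert) eventually disappears from
// VMRule.status.conditions.
func pruneStale(conds []condition, ttl time.Duration, now time.Time) []condition {
	var kept []condition
	for _, c := range conds {
		if now.Sub(c.LastUpdateTime) <= 3*ttl {
			kept = append(kept, c)
		}
	}
	return kept
}

func main() {
	now := time.Now()
	conds := []condition{
		// Still reconciled: refreshed 30 minutes ago, so it is kept.
		{Type: "vmalert-a.default.vmalert.victoriametrics.com/Applied", LastUpdateTime: now.Add(-30 * time.Minute)},
		// VMAlert deleted: no refresh for 4 hours, so it is pruned.
		{Type: "vmalert-b.default.vmalert.victoriametrics.com/Applied", LastUpdateTime: now.Add(-4 * time.Hour)},
	}
	for _, c := range pruneStale(conds, time.Hour, now) {
		fmt.Println("kept:", c.Type)
	}
}
```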
The upcoming release of the operator changes the logic of configuration updates: the operator will now update the secret with the configuration first and perform the status condition updates for the objects after that. It also changes the default Kubernetes client configuration. Previously the operator had a 5 QPS limit, which is not enough to handle more than 200 objects; the new default limit is 50 QPS, and it's also possible to increase it with a flag. The following alerting query could be used to check whether that is needed:
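The query itself was lost in extraction. As a rough illustration of the exported-limit metric mentioned in the commit message above, here is a sketch using the standard Prometheus Go client; the metric name is hypothetical, not the operator's real one:

```go
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	// Hypothetical metric name; the operator's real metric differs.
	qpsLimit := prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "operator_client_qps_limit",
		Help: "Configured client-side QPS limit.",
	})
	prometheus.MustRegister(qpsLimit)
	qpsLimit.Set(50) // mirror the configured client QPS

	fmt.Println("metric registered; expose it via promhttp.Handler()")
}
```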
It will also be added to the Grafana dashboard and alerting rules for the operator. Mostly, the operator performs TTL updates on object statuses:
```yaml
conditions:
- lastTransitionTime: "2025-01-18T15:13:25Z"
  lastUpdateTime: "2025-01-20T13:54:59Z"
  observedGeneration: 1
  reason: ConfigParsedAndApplied
  status: "True"
  type: stack-victoria-metrics-k8s-stack.default.vmalert.victoriametrics.com/Applied
- lastTransitionTime: "2025-01-19T15:03:44Z"
  lastUpdateTime: "2025-01-20T14:27:52Z"
  observedGeneration: 1
  reason: ConfigParsedAndApplied
  status: "True"
  type: stack-victoria-metrics-k8s-stack-v2.default.vmalert.victoriametrics.com/Applied
- lastTransitionTime: "2025-01-20T10:22:49Z"
  lastUpdateTime: "2025-01-20T14:52:28Z"
  observedGeneration: 1
  reason: ConfigParsedAndApplied
  status: "True"
  type: stack-victoria-metrics-k8s-stack-v3.bench-1.vmalert.victoriametrics.com/Applied
observedGeneration: 1
updateStatus: operational
```

It indicates that … The default TTL is also changed from …
The issue was fixed in the v0.52.0 release.
We are now facing a regression issue where the scraped metrics can't be pushed due to `cannot authorize request with auth tokens` reported in vmauth and vmagent, even after the VMUser and the Secret `vmuser-credential` are both in place and ready.

In vmagent pod logs:

In vmauth pod logs:

The `username` and `password` are the same in the VMUser object, the Secret `vmuser-credential`, and the env in the vmagent pod. The creationTimestamp of the VMUser object, the Secret `vmuser-credential`, and the vmagent pod are all the same. There is a big delay before vmauth gets and syncs the credentials: the auth errors keep being reported in the vmauth and vmagent pods for almost 30 minutes after the creation of the VMUser and the Secret `vmuser-credential`, and metrics can't be received even though everything is ready. Then, after about 30 minutes, the authorize errors are gone and the metrics are visualized on the VMUI. The metrics were cached until the auth errors were gone and then pushed to VM. This is happening on many of our clusters.
Take one cluster as an example:

- creation time: 2025-01-17T01:57:06Z
- `vmuser-credential` creation time: 2025-01-17T01:57:07Z
- last auth error timestamp: 2025-01-17T02:32:27.273Z
- last auth error timestamp: 2025-01-17T02:32:27.180Z

Before 2025-01-17T02:32 there were no metrics received or visualized on the VMUI; after 2025-01-17T02:32 the metrics were visualized. Metrics on the UI started from 2025-01-17 01:59:15. The metrics were cached and pushed to vmstorage all together after the authorize errors were gone. From 2025-01-17T01:57 to 2025-01-17T02:32 is about 35 minutes.

victoria-metrics-operator version: 0.51.3
vmauth version: 1.108.1
vmagent version: 1.108.1