
VMAuth getting "cannot authorize request with auth tokens" even after the VMUser and the vmuser-credential Secret are ready #1220

Closed
vvxxvvxx opened this issue Jan 17, 2025 · 9 comments
Labels
bug Something isn't working

Comments

@vvxxvvxx

vvxxvvxx commented Jan 17, 2025

We are facing a regression: scraped metrics can't be pushed due to cannot authorize request with auth tokens reported by vmauth and vmagent, even after both the VMUser and the Secret vmuser-credential are in place and ready.

In vmagent pod logs:


{"ts":"2025-01-17T02:32:14.381Z","level":"error","caller":"VictoriaMetrics/app/vmagent/remotewrite/client.go:475","msg":"unexpected status code received after sending a block with size 174 bytes to \"1:secret-url\" during retry #38: 401; response body=\"remoteAddr: \\\"127.0.0.6:41313, X-Forwarded-For: *.*.*.*\\\"; requestURI: /api/v1/write; cannot authorize request with auth tokens [\\\"http_auth:Basic ***\\"]\\n\"; re-sending the block in 60.888 seconds"}
{"ts":"2025-01-17T02:32:21.186Z","level":"error","caller":"VictoriaMetrics/app/vmagent/remotewrite/client.go:475","msg":"unexpected status code received after sending a block with size 399 bytes to \"1:secret-url\" during retry #37: 401; response body=\"remoteAddr: \\\"127.0.0.6:32867, X-Forwarded-For: *.*.*.*\\\"; requestURI: /api/v1/write; cannot authorize request with auth tokens [\\\"http_auth:Basic ***\\"]\\n\"; re-sending the block in 62.120 seconds"}
{"ts":"2025-01-17T02:32:27.273Z","level":"error","caller":"VictoriaMetrics/app/vmagent/remotewrite/client.go:475","msg":"unexpected status code received after sending a block with size 456 bytes to \"1:secret-url\" during retry #37: 401; response body=\"remoteAddr: \\\"127.0.0.6:51665, X-Forwarded-For: *.*.*.*\\\"; requestURI: /api/v1/write; cannot authorize request with auth tokens [\\\"http_auth:Basic ***\\\"]\\n\"; re-sending the block in 63.354 seconds"}

In vmauth pod logs:

...
2025-01-17T02:25:05.599Z        warn    VictoriaMetrics/app/vmauth/main.go:152  remoteAddr: "127.0.0.6:53633, X-Forwarded-For: *.*.*.*"; requestURI: /api/v1/write; cannot authorize request with auth tokens ["http_auth:Basic ***"]
2025-01-17T02:25:56.621Z        warn    VictoriaMetrics/app/vmauth/main.go:152  remoteAddr: "127.0.0.6:44933, X-Forwarded-For: *.*.*.*"; requestURI: /api/v1/write; cannot authorize request with auth tokens ["http_auth:Basic ***"]
2025-01-17T02:26:08.466Z        warn    VictoriaMetrics/app/vmauth/main.go:152  remoteAddr: "127.0.0.6:44933, X-Forwarded-For: *.*.*.*"; requestURI: /api/v1/write; cannot authorize request with auth tokens ["http_auth:Basic ***"]

The username and password are the same in the VMUser object, the Secret vmuser-credential, and the env of the vmagent pod.
The creationTimestamp of the VMUser object, the Secret vmuser-credential, and the vmagent pod are all the same.
There is a long delay before vmauth picks up and syncs the credentials.
The auth errors keep being reported in the vmauth and vmagent pods for almost 30 minutes after the creation of the VMUser and the Secret vmuser-credential, and no metrics can be received even though everything is ready. After about 30 minutes the authorization errors disappear and the metrics become visible in VMUI. The metrics were cached until the auth errors were gone and then pushed to VictoriaMetrics.

This is happening on many of our clusters.
Take one for example:

  1. vmuser creation time: 2025-01-17T01:57:06Z
~# k get vmuser runtime1 -ojson |jq '.metadata.creationTimestamp'
"2025-01-17T01:57:06Z"
  2. secret vmuser-credential creation time: 2025-01-17T01:57:07Z
~# k get secret vmuser-credential -ojson |jq '.metadata.creationTimestamp'
"2025-01-17T01:57:07Z"
  3. auth error keeps being reported in the vmagent pod logs until 2025-01-17T02:32:27.273Z.
    last auth error timestamp:
{"ts":"2025-01-17T02:32:27.273Z","level":"error","caller":"VictoriaMetrics/app/vmagent/remotewrite/client.go:475","msg":"unexpected status code received after sending a block with size 456 bytes to \"1:secret-url\" during retry #37: 401; response body=\"remoteAddr: \\\"127.0.0.6:51665, X-Forwarded-For: *.*.*.*\\\"; requestURI: /api/v1/write; cannot authorize request with auth tokens [\\\"http_auth:Basic ***\\\"]\\n\"; re-sending the block in 63.354 seconds"}
  4. auth error keeps being reported in the vmauth pod logs until 2025-01-17T02:32:27.180Z.
    last auth error timestamp:
2025-01-17T02:32:27.180Z warn VictoriaMetrics/app/vmauth/main.go:152 remoteAddr: "127.0.0.6:51665, X-Forwarded-For: *.*.*.*"; requestURI: /api/v1/write; cannot authorize request with auth tokens ["http_auth:Basic ***"]
  5. Before 2025-01-17T02:32, no metrics were received or visualized in VMUI; after 2025-01-17T02:32, the metrics appeared. Metrics on the UI start from 2025-01-17 01:59:15. The metrics were cached and pushed to vmstorage all together after the authorization errors were gone.

[screenshot: VMUI graph showing the metrics gap]

  6. The gap is between 2025-01-17T01:57 and 2025-01-17T02:32, about 35 minutes.

victoria-metrics-operator version: 0.51.3
vmauth version: 1.108.1
vmagent version: 1.108.1

@f41gh7
Collaborator

f41gh7 commented Jan 17, 2025

Hello,

Could you please check the following:

  1. Were there any errors in the vm-operator logs?
  2. Is VMAuth configured with the spec.useVMConfigReloader: true option? It greatly increases the speed of config synchronization between the Kubernetes Secret and the pod's secret mount (see the sketch after this list).
  3. What is the reconcile latency for the VMAuth object? It can be checked in the operator metrics with the query expression: histogram_quantile(0.99,sum(rate(controller_runtime_reconcile_time_seconds_bucket{controller="vmauth"})) by(le) )
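
A minimal sketch of enabling the option on a VMAuth resource (the name and namespace below are placeholders, not taken from this thread):

apiVersion: operator.victoriametrics.com/v1beta1
kind: VMAuth
metadata:
  name: example-vmauth       # placeholder name
  namespace: monitoring      # placeholder namespace
spec:
  useVMConfigReloader: true  # use the VM config-reloader for faster secret-to-pod config sync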

@f41gh7 f41gh7 added the question Further information is requested label Jan 17, 2025
@vvxxvvxx
Author

vvxxvvxx commented Jan 17, 2025

Hello,

Could you please check the following:

  1. Were there any errors in the vm-operator logs?
  2. Is VMAuth configured with the spec.useVMConfigReloader: true option? It greatly increases the speed of config synchronization between the Kubernetes Secret and the pod's secret mount.
  3. What is the reconcile latency for the VMAuth object? It can be checked in the operator metrics with the query expression: histogram_quantile(0.99,sum(rate(controller_runtime_reconcile_time_seconds_bucket{controller="vmauth"})) by(le) )

Hi @f41gh7,

  1. Were there any errors in the vm-operator logs?

For the past 7 hours, I only saw the not-found error "error":"cannot create or update vmauth deploy: cannot update statuses for vmusers: VMUser.operator.victoriametrics.com \"vmuser-name1\" not found" several times in the vm-operator logs. They happened for some deleted clusters, so I think it's not relevant to this issue.

  2. Is VMAuth configured with the spec.useVMConfigReloader: true option? It greatly increases the speed of config synchronization between the Kubernetes Secret and the pod's secret mount.

In the VMAuth CR, I can see "useVMConfigReloader":true in metadata.annotations.operator.victoriametrics/last-applied-spec, but I'm not sure why this parameter is not in the VMAuth spec.
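
For reference, a sketch of where that value shows up on the CR (the annotation key is quoted from above; the JSON body is abbreviated):

metadata:
  annotations:
    operator.victoriametrics/last-applied-spec: '{"useVMConfigReloader":true, ...}'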

Right now the VMAuth CR is in a failed state with reason: 'cannot create or update vmauth deploy: cannot update statuses for vmusers: VMUser.operator.victoriametrics.com "vmuser-name3" not found'. This "vmuser-name3" was for a deleted cluster, so I think this error reason is not relevant to this issue either. I'm not sure whether the failed state has any impact on vmauth performance.

We use config-reloader:0.51.3 as one of the containers in the vmauth deployment.

  3. What is the reconcile latency for the VMAuth object? It can be checked in the operator metrics with the query expression: histogram_quantile(0.99,sum(rate(controller_runtime_reconcile_time_seconds_bucket{controller="vmauth"})) by(le) )

It's 60, and empty most of the time.
In the past hour, all the requests fall into the le="+Inf" bucket.

[screenshots: reconcile latency graphs from the operator metrics]

@vvxxvvxx
Author

vvxxvvxx commented Jan 17, 2025

All of this regression happened after we upgraded vm-operator from 0.49.1 to 0.51.3 and the VM components (vmauth, vmstorage, vmalert, vminsert, etc.) from 1.106.1 to 1.108.1.

@f41gh7
Collaborator

f41gh7 commented Jan 17, 2025

For the past 7 hours, I only saw the not-found error "error":"cannot create or update vmauth deploy: cannot update statuses for vmusers: VMUser.operator.victoriametrics.com "vmuser-name1" not found" several times in the vm-operator logs. They happened for some deleted clusters, so I think it's not relevant to this issue.

I think this is probably the reason for this issue. During the configuration build, the operator first updates the VMUser statuses and only after that updates the configuration secret with the generated config. This logic was introduced in operator v0.50.0.

I expect this logic to be fixed today in the upcoming operator release: the new configuration should be written first, and only afterwards should the operator update the statuses of the VMUsers.

Also, it looks strange that the operator cannot find vmuser-name1. It should skip deleted objects.

@f41gh7 f41gh7 added bug Something isn't working and removed question Further information is requested labels Jan 17, 2025
@ebensom
Contributor

ebensom commented Jan 17, 2025

I think this is probably the reason for this issue. During the configuration build, the operator first updates the VMUser statuses and only after that updates the configuration secret with the generated config. This logic was introduced in operator v0.50.0.

I expect this logic to be fixed today in the upcoming operator release: the new configuration should be written first, and only afterwards should the operator update the statuses of the VMUsers.

Thanks a lot. Taking a quick glance at the buildVMAuthConfig() code, this really seems to be the issue. Please note that we have VMUsers and VMRules both at the scale of ~2K (around 1500 each), and we set controller.maxConcurrentReconciles=10 so that all vm-operator controllers use 10 workers. VMUsers and VMRules are created, changed, or deleted quite frequently (multiple mutations per minute).
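
For context, a hedged sketch of how such a flag can be passed via the operator Deployment's container args (the Deployment layout and container name are assumptions):

spec:
  template:
    spec:
      containers:
      - name: vm-operator                          # placeholder container name
        args:
        - -controller.maxConcurrentReconciles=10   # 10 workers per controller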

It really seems that with the operator upgrade from 0.49.1 to 0.51.3, many status subresources are constantly being updated, which means a lot more K8s API server operations (and network traffic) and also client-side throttling:

[screenshots: K8s API server request rate and client-side throttling graphs]

FYI we are using victoriametrics/operator:config-reloader-0.51.3 for the reloader sidecar of VMAuth.

Also, it looks strange that the operator cannot find vmuser-name1. It should skip deleted objects.

Indeed, the buildVMAuthConfig() code here should skip such users and continue processing instead of bailing out with an error.

@f41gh7
Collaborator

f41gh7 commented Jan 17, 2025

It really seems that with the operator upgrade from 0.49.1 to 0.51.3, many status subresources are constantly being updated, which means a lot more K8s API server operations (and network traffic) and also client-side throttling.

Thanks for reporting. Yes, the operator updates the status subresource at a constant interval of 7 minutes plus jitter, which may produce a lot of updates in a setup with many objects. It will also be addressed in the upcoming release. We're going to run a performance test and improve the status update logic.

@ebensom
Contributor

ebensom commented Jan 17, 2025

Many thanks @f41gh7 for following up on this! ❤

f41gh7 added a commit that referenced this issue Jan 20, 2025
Previously, for `VMAuth`, `VMAlert` and `VMAlertmanager` the configuration secret was updated only after the `status` fields of all matched `child` objects were updated. This could lead to delays, since the Kubernetes API server may throttle requests.

This commit updates the `Secret` with the generated configuration first and then updates the related child objects. It greatly decreases change delivery time. It also adds a fast path for a single resource update, which updates the status field only for the corresponding object.

Related issue:
#1220
f41gh7 added a commit that referenced this issue Jan 20, 2025
The client's default rate limiter of 5 requests per second is too restrictive and prevents the operator from scaling: the operator needs to update objects' `status.condition` fields, which may require 10-20 requests per second at large scale.

This commit raises the default limit to 50 and exports a metric with the configured limit.

Related issue:
#1220

Signed-off-by: f41gh7 <[email protected]>
f41gh7 added a commit that referenced this issue Jan 20, 2025
It controls the expiration time of the status.condition lastUpdateTime field, which is needed to track stale parent objects. Increasing the value of this flag reduces load on the Kubernetes cluster, but it also increases the time needed to detect stale objects.

For instance, if there are 2 VMAlert objects that both match some VMRule, both VMAlerts will be registered in VMRule.status.conditions[].type under their names. If one of the VMAlert objects is deleted, its record will be removed from VMRule.status.conditions only after 3*controller.statusLastUpdateTimeTTL, which can take up to 3 hours with default values.

Related issue:
#1220

Signed-off-by: f41gh7 <[email protected]>
@f41gh7 f41gh7 added the waiting for release The change was merged to upstream, but wasn't released yet. label Jan 20, 2025
@f41gh7
Collaborator

f41gh7 commented Jan 20, 2025

The upcoming operator release changes the logic of configuration updates. Now the operator writes the secret with the configuration first and only after that performs status condition updates for the objects.

It also changes the default Kubernetes client configuration. Previously the operator had a 5 QPS limit, which is not enough to handle more than 200 objects. The new default limit is 50 QPS, and it's possible to increase it further with the -client.qps flag.
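
A hedged sketch of raising the limit via the operator's container args (the flag name is from this thread; the value and Deployment layout are assumptions):

args:
- -client.qps=100   # example value; the new default is 50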

The following alerting query can be used to check whether that's needed:

sum(rate(rest_client_requests_total{job="victoria-metrics-operator"}[10m]))/operator_rest_client_qps_limit{job="victoria-metrics-operator"} > 0.9

It will also be added to the Grafana dashboard and alerting rules for the operator.

Mostly, the operator performs TTL updates of the objects' status.condition.lastUpdateTime field. For example, a VMRule could have the following status conditions:

status:
  conditions:
  - lastTransitionTime: "2025-01-18T15:13:25Z"
    lastUpdateTime: "2025-01-20T13:54:59Z"
    observedGeneration: 1
    reason: ConfigParsedAndApplied
    status: "True"
    type: stack-victoria-metrics-k8s-stack.default.vmalert.victoriametrics.com/Applied
  - lastTransitionTime: "2025-01-19T15:03:44Z"
    lastUpdateTime: "2025-01-20T14:27:52Z"
    observedGeneration: 1
    reason: ConfigParsedAndApplied
    status: "True"
    type: stack-victoria-metrics-k8s-stack-v2.default.vmalert.victoriametrics.com/Applied
  - lastTransitionTime: "2025-01-20T10:22:49Z"
    lastUpdateTime: "2025-01-20T14:52:28Z"
    observedGeneration: 1
    reason: ConfigParsedAndApplied
    status: "True"
    type: stack-victoria-metrics-k8s-stack-v3.bench-1.vmalert.victoriametrics.com/Applied
  observedGeneration: 1
  updateStatus: operational

It indicates that the VMRule object was processed by 3 VMAlert objects: stack-victoria-metrics-k8s-stack, stack-victoria-metrics-k8s-stack-v2 and stack-victoria-metrics-k8s-stack-v3. lastUpdateTime is used as a TTL: if it expires, the operator deletes the corresponding record from the VMRule status.

The default TTL is also changed from 5 minutes to 1 hour, and it's configurable with the -controller.statusLastUpdateTimeTTL flag.
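
A hedged sketch of tuning it via the operator's container args (the exact duration syntax accepted by the flag is an assumption):

args:
- -controller.statusLastUpdateTimeTTL=2h   # stale status records are removed after 3*TTL (~6h here)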

@f41gh7
Collaborator

f41gh7 commented Jan 22, 2025

The issue was fixed in the v0.52.0 release.

@f41gh7 f41gh7 closed this as completed Jan 22, 2025
@f41gh7 f41gh7 removed the waiting for release The change was merged to upstream, but wasn't released yet. label Jan 22, 2025