
VMAuth getting "cannot authorize request with auth tokens" even after the VMUser and the vmuser-credential Secret are ready #1220

Closed
vvxxvvxx opened this issue Jan 17, 2025 · 9 comments
Labels
bug Something isn't working

Comments

@vvxxvvxx

vvxxvvxx commented Jan 17, 2025

We are facing a regression: scraped metrics can't be pushed due to cannot authorize request with auth tokens reported by vmauth and vmagent, even after both the VMUser and the Secret vmuser-credential are in place and ready.

In vmagent pod logs:


{"ts":"2025-01-17T02:32:14.381Z","level":"error","caller":"VictoriaMetrics/app/vmagent/remotewrite/client.go:475","msg":"unexpected status code received after sending a block with size 174 bytes to \"1:secret-url\" during retry #38: 401; response body=\"remoteAddr: \\\"127.0.0.6:41313, X-Forwarded-For: *.*.*.*\\\"; requestURI: /api/v1/write; cannot authorize request with auth tokens [\\\"http_auth:Basic ***\\"]\\n\"; re-sending the block in 60.888 seconds"}
{"ts":"2025-01-17T02:32:21.186Z","level":"error","caller":"VictoriaMetrics/app/vmagent/remotewrite/client.go:475","msg":"unexpected status code received after sending a block with size 399 bytes to \"1:secret-url\" during retry #37: 401; response body=\"remoteAddr: \\\"127.0.0.6:32867, X-Forwarded-For: *.*.*.*\\\"; requestURI: /api/v1/write; cannot authorize request with auth tokens [\\\"http_auth:Basic ***\\"]\\n\"; re-sending the block in 62.120 seconds"}
{"ts":"2025-01-17T02:32:27.273Z","level":"error","caller":"VictoriaMetrics/app/vmagent/remotewrite/client.go:475","msg":"unexpected status code received after sending a block with size 456 bytes to \"1:secret-url\" during retry #37: 401; response body=\"remoteAddr: \\\"127.0.0.6:51665, X-Forwarded-For: *.*.*.*\\\"; requestURI: /api/v1/write; cannot authorize request with auth tokens [\\\"http_auth:Basic ***\\\"]\\n\"; re-sending the block in 63.354 seconds"}

In vmauth pod logs:

...
2025-01-17T02:25:05.599Z        warn    VictoriaMetrics/app/vmauth/main.go:152  remoteAddr: "127.0.0.6:53633, X-Forwarded-For: *.*.*.*"; requestURI: /api/v1/write; cannot authorize request with auth tokens ["http_auth:Basic ***"]
2025-01-17T02:25:56.621Z        warn    VictoriaMetrics/app/vmauth/main.go:152  remoteAddr: "127.0.0.6:44933, X-Forwarded-For: *.*.*.*"; requestURI: /api/v1/write; cannot authorize request with auth tokens ["http_auth:Basic ***"]
2025-01-17T02:26:08.466Z        warn    VictoriaMetrics/app/vmauth/main.go:152  remoteAddr: "127.0.0.6:44933, X-Forwarded-For: *.*.*.*"; requestURI: /api/v1/write; cannot authorize request with auth tokens ["http_auth:Basic ***"]

The username and password are the same in the VMUser object, the Secret vmuser-credential, and the env of the vmagent pod.
The creationTimestamp of the VMUser object, the Secret vmuser-credential, and the vmagent pod are all the same.
There is a long delay before vmauth picks up and syncs the credentials.
The auth errors keep being reported in the vmauth and vmagent pods for almost 30 minutes after the creation of the VMUser and the Secret vmuser-credential, and no metrics can be received even though everything is ready. After about 30 minutes the authorization errors disappear and the metrics become visible in VMUI. The metrics were cached until the auth errors were gone and then pushed to VictoriaMetrics.

This is happening on many of our clusters.
Take one for example:

  1. vmuser creation time: 2025-01-17T01:57:06Z
~# k get vmuser runtime1 -ojson |jq '.metadata.creationTimestamp'
"2025-01-17T01:57:06Z"
  2. secret vmuser-credential creation time: 2025-01-17T01:57:07Z
~# k get secret vmuser-credential -ojson |jq '.metadata.creationTimestamp'
"2025-01-17T01:57:07Z"
  3. auth error keeps being reported in the vmagent pod logs until 2025-01-17T02:32:27.273Z.
    last auth error timestamp:
{"ts":"2025-01-17T02:32:27.273Z","level":"error","caller":"VictoriaMetrics/app/vmagent/remotewrite/client.go:475","msg":"unexpected status code received after sending a block with size 456 bytes to \"1:secret-url\" during retry #37: 401; response body=\"remoteAddr: \\\"127.0.0.6:51665, X-Forwarded-For: *.*.*.*\\\"; requestURI: /api/v1/write; cannot authorize request with auth tokens [\\\"http_auth:Basic ***\\\"]\\n\"; re-sending the block in 63.354 seconds"}
  4. auth error keeps being reported in the vmauth pod logs until 2025-01-17T02:32:27.180Z.
    last auth error timestamp:
2025-01-17T02:32:27.180Z warn VictoriaMetrics/app/vmauth/main.go:152 remoteAddr: "127.0.0.6:51665, X-Forwarded-For: *.*.*.*"; requestURI: /api/v1/write; cannot authorize request with auth tokens ["http_auth:Basic ***"]
  5. Before 2025-01-17T02:32, no metrics were received or visualized in VMUI; after 2025-01-17T02:32, the metrics appeared. Metrics on the UI start from 2025-01-17 01:59:15. The metrics were cached and pushed to vmstorage all together after the authorization errors were gone.

[screenshot: VMUI graph showing the metrics gap]

  6. The gap is between 2025-01-17T01:57 and 2025-01-17T02:32, about 35 minutes.

victoria-metrics-operator version: 0.51.3
vmauth version: 1.108.1
vmagent version: 1.108.1

@f41gh7
Collaborator

f41gh7 commented Jan 17, 2025

Hello,

Could you please check the following:

  1. Were there any errors in the vm-operator logs?
  2. Is VMAuth configured with the spec.useVMConfigReloader: true option? It greatly increases the speed of config synchronization between the Kubernetes Secret and the pod's secret mount (see the sketch after this list).
  3. What is the reconcile latency for the VMAuth object? It can be checked in the operator metrics with the query expression: histogram_quantile(0.99,sum(rate(controller_runtime_reconcile_time_seconds_bucket{controller="vmauth"})) by(le) )
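
A minimal sketch of enabling the option on a VMAuth resource (the name and namespace below are placeholders, not taken from this thread):

apiVersion: operator.victoriametrics.com/v1beta1
kind: VMAuth
metadata:
  name: example-vmauth       # placeholder name
  namespace: monitoring      # placeholder namespace
spec:
  useVMConfigReloader: true  # use the VM config-reloader for faster secret-to-pod config sync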

@f41gh7 f41gh7 added the question Further information is requested label Jan 17, 2025
@vvxxvvxx
Author

vvxxvvxx commented Jan 17, 2025

Hello,

Could you please check the following:

  1. Were there any errors in the vm-operator logs?
  2. Is VMAuth configured with the spec.useVMConfigReloader: true option? It greatly increases the speed of config synchronization between the Kubernetes Secret and the pod's secret mount.
  3. What is the reconcile latency for the VMAuth object? It can be checked in the operator metrics with the query expression: histogram_quantile(0.99,sum(rate(controller_runtime_reconcile_time_seconds_bucket{controller="vmauth"})) by(le) )

Hi @f41gh7,

  1. Were there any errors in the vm-operator logs?

For the past 7 hours, I only saw the not-found error "error":"cannot create or update vmauth deploy: cannot update statuses for vmusers: VMUser.operator.victoriametrics.com \"vmuser-name1\" not found" several times in the vm-operator logs. They happened for some deleted clusters, so I think it's not relevant to this issue.

  2. Is VMAuth configured with the spec.useVMConfigReloader: true option? It greatly increases the speed of config synchronization between the Kubernetes Secret and the pod's secret mount.

In the VMAuth CR, I can see "useVMConfigReloader":true in metadata.annotations.operator.victoriametrics/last-applied-spec, but I'm not sure why this parameter is not in the VMAuth spec.
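
For reference, a sketch of where that value shows up on the CR (the annotation key is quoted from above; the JSON body is abbreviated):

metadata:
  annotations:
    operator.victoriametrics/last-applied-spec: '{"useVMConfigReloader":true, ...}'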

Right now the VMAuth CR is in a failed state with reason: 'cannot create or update vmauth deploy: cannot update statuses for vmusers: VMUser.operator.victoriametrics.com "vmuser-name3" not found'. This "vmuser-name3" was for a deleted cluster, so I think this error reason is not relevant to this issue either. I'm not sure whether the failed state has any impact on vmauth performance.

We use config-reloader:0.51.3 as one of the containers in the vmauth deployment.

  3. What is the reconcile latency for the VMAuth object? It can be checked in the operator metrics with the query expression: histogram_quantile(0.99,sum(rate(controller_runtime_reconcile_time_seconds_bucket{controller="vmauth"})) by(le) )

It's 60, and empty most of the time.
In the past hour, all the requests fall into the le="+Inf" bucket.

[screenshots: reconcile latency graphs from the operator metrics]

@vvxxvvxx
Author

vvxxvvxx commented Jan 17, 2025

All of this regression happened after we upgraded vm-operator from 0.49.1 to 0.51.3 and the VM components (vmauth, vmstorage, vmalert, vminsert, etc.) from 1.106.1 to 1.108.1.

@f41gh7
Collaborator

f41gh7 commented Jan 17, 2025

For the past 7 hours, I only saw the not-found error "error":"cannot create or update vmauth deploy: cannot update statuses for vmusers: VMUser.operator.victoriametrics.com "vmuser-name1" not found" several times in the vm-operator logs. They happened for some deleted clusters, so I think it's not relevant to this issue.

I think this is probably the reason for this issue. During the configuration build, the operator first updates the VMUser statuses and only after that updates the configuration secret with the generated config. This logic was introduced in operator v0.50.0.

I expect this logic to be fixed today in the upcoming operator release: the new configuration should be written first, and only afterwards should the operator update the statuses of the VMUsers.

Also, it looks strange that the operator cannot find vmuser-name1. It should skip deleted objects.

@f41gh7 f41gh7 added bug Something isn't working and removed question Further information is requested labels Jan 17, 2025
@ebensom
Contributor

ebensom commented Jan 17, 2025

I think this is probably the reason for this issue. During the configuration build, the operator first updates the VMUser statuses and only after that updates the configuration secret with the generated config. This logic was introduced in operator v0.50.0.

I expect this logic to be fixed today in the upcoming operator release: the new configuration should be written first, and only afterwards should the operator update the statuses of the VMUsers.

Thanks a lot. Taking a quick glance at the buildVMAuthConfig() code, this really seems to be the issue. Please note that we have VMUsers and VMRules both at the scale of ~2K (around 1500 each), and we set controller.maxConcurrentReconciles=10 so that all vm-operator controllers use 10 workers. VMUsers and VMRules are created, changed, or deleted quite frequently (multiple mutations per minute).
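
For context, a hedged sketch of how such a flag can be passed via the operator Deployment's container args (the Deployment layout and container name are assumptions):

spec:
  template:
    spec:
      containers:
      - name: vm-operator                          # placeholder container name
        args:
        - -controller.maxConcurrentReconciles=10   # 10 workers per controller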

It really seems that with the operator upgrade from 0.49.1 to 0.51.3, many status subresources are constantly being updated, which means a lot more K8s API server operations (and network traffic) and also client-side throttling:

[screenshots: K8s API server request rate and client-side throttling graphs]

FYI we are using victoriametrics/operator:config-reloader-0.51.3 for the reloader sidecar of VMAuth.

Also, it looks strange that the operator cannot find vmuser-name1. It should skip deleted objects.

Indeed, the buildVMAuthConfig() code here should skip such users and continue processing instead of bailing out with an error.

@f41gh7
Collaborator

f41gh7 commented Jan 17, 2025

It really seems that with the operator upgrade from 0.49.1 to 0.51.3, many status subresources are constantly being updated, which means a lot more K8s API server operations (and network traffic) and also client-side throttling.

Thanks for reporting. Yes, the operator updates the status subresource at a constant interval of 7 minutes plus jitter, which may produce a lot of updates in a setup with many objects. It will also be addressed in the upcoming release. We're going to run a performance test and improve the status update logic.

@ebensom
Contributor

ebensom commented Jan 17, 2025

Many thanks @f41gh7 for following up on this! ❤

f41gh7 added a commit that referenced this issue Jan 20, 2025
Previously, for `VMAuth`, `VMAlert` and `VMAlertmanager` the configuration secret was updated only after the `status` fields of all matched `child` objects were updated. This could lead to delays, since the Kubernetes API server may throttle requests.

This commit updates the `Secret` with the generated configuration first and then updates the related child objects. It greatly decreases change delivery time. It also adds a fast path for a single resource update, which updates the status field only for the corresponding object.

Related issue:
#1220
f41gh7 added a commit that referenced this issue Jan 20, 2025
The client's default rate limiter of 5 requests per second is too restrictive and prevents the operator from scaling: the operator needs to update objects' `status.condition` fields, which may require 10-20 requests per second at large scale.

This commit raises the default limit to 50 and exports a metric with the configured limit.

Related issue:
#1220

Signed-off-by: f41gh7 <[email protected]>
f41gh7 added a commit that referenced this issue Jan 20, 2025
It controls the expiration time of the status.condition lastUpdateTime field, which is needed to track stale parent objects. Increasing the value of this flag reduces load on the Kubernetes cluster, but it also increases the time needed to detect stale objects.

For instance, if there are 2 VMAlert objects that both match some VMRule, both VMAlerts will be registered in VMRule.status.conditions[].type under their names. If one of the VMAlert objects is deleted, its record will be removed from VMRule.status.conditions only after 3*controller.statusLastUpdateTimeTTL, which can take up to 3 hours with default values.

Related issue:
#1220

Signed-off-by: f41gh7 <[email protected]>
@f41gh7 f41gh7 added the waiting for release The change was merged to upstream, but wasn't released yet. label Jan 20, 2025
@f41gh7
Collaborator

f41gh7 commented Jan 20, 2025

The upcoming operator release changes the logic of configuration updates. Now the operator writes the secret with the configuration first and only after that performs status condition updates for the objects.

It also changes the default Kubernetes client configuration. Previously the operator had a 5 QPS limit, which is not enough to handle more than 200 objects. The new default limit is 50 QPS, and it's possible to increase it further with the -client.qps flag.
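
A hedged sketch of raising the limit via the operator's container args (the flag name is from this thread; the value and Deployment layout are assumptions):

args:
- -client.qps=100   # example value; the new default is 50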

The following alerting query can be used to check whether that's needed:

sum(rate(rest_client_requests_total{job="victoria-metrics-operator"}[10m]))/operator_rest_client_qps_limit{job="victoria-metrics-operator"} > 0.9

It will also be added to the Grafana dashboard and alerting rules for the operator.

Mostly, the operator performs TTL updates of the objects' status.condition.lastUpdateTime field. For example, a VMRule could have the following status conditions:

status:
  conditions:
  - lastTransitionTime: "2025-01-18T15:13:25Z"
    lastUpdateTime: "2025-01-20T13:54:59Z"
    observedGeneration: 1
    reason: ConfigParsedAndApplied
    status: "True"
    type: stack-victoria-metrics-k8s-stack.default.vmalert.victoriametrics.com/Applied
  - lastTransitionTime: "2025-01-19T15:03:44Z"
    lastUpdateTime: "2025-01-20T14:27:52Z"
    observedGeneration: 1
    reason: ConfigParsedAndApplied
    status: "True"
    type: stack-victoria-metrics-k8s-stack-v2.default.vmalert.victoriametrics.com/Applied
  - lastTransitionTime: "2025-01-20T10:22:49Z"
    lastUpdateTime: "2025-01-20T14:52:28Z"
    observedGeneration: 1
    reason: ConfigParsedAndApplied
    status: "True"
    type: stack-victoria-metrics-k8s-stack-v3.bench-1.vmalert.victoriametrics.com/Applied
  observedGeneration: 1
  updateStatus: operational

It indicates that the VMRule object was processed by 3 VMAlert objects: stack-victoria-metrics-k8s-stack, stack-victoria-metrics-k8s-stack-v2 and stack-victoria-metrics-k8s-stack-v3. lastUpdateTime is used as a TTL: if it expires, the operator deletes the corresponding record from the VMRule status.

The default TTL is also changed from 5 minutes to 1 hour, and it's configurable with the -controller.statusLastUpdateTimeTTL flag.
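
A hedged sketch of tuning it via the operator's container args (the exact duration syntax accepted by the flag is an assumption):

args:
- -controller.statusLastUpdateTimeTTL=2h   # stale status records are removed after 3*TTL (~6h here)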

@f41gh7
Collaborator

f41gh7 commented Jan 22, 2025

The issue was fixed in the v0.52.0 release.

@f41gh7 f41gh7 closed this as completed Jan 22, 2025
@f41gh7 f41gh7 removed the waiting for release The change was merged to upstream, but wasn't released yet. label Jan 22, 2025