
Scrape targets can be reassigned even when collector count doesn't change #2280

Closed
swiatekm opened this issue Oct 26, 2023 · 5 comments · Fixed by #2290
Assignees: swiatekm
Labels: bug Something isn't working
swiatekm (Contributor) commented Oct 26, 2023

Component(s)

target allocator

Description

Running the target allocator in reasonably sized clusters, I noticed that targets were being reassigned to different collector Pods even when the Pod count didn't change. We don't have metrics to measure this directly, but I noticed a second-order effect: Prometheus staleness markers being emitted, which normally happens when a scrape target legitimately disappears.

This shouldn't really happen, and looking at the allocation strategy code, there doesn't appear to be a way for it to happen, assuming the hash doesn't change for an existing target. However, we do calculate the hash based on all the target labels, some of which are mutable. Even putting aside whether this is the cause of the problem I've experienced, there's no reason for us to include all the labels in the hash - the target address should suffice, and this is what prometheus-operator does for sharding as well.
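For illustration only (this is not the operator's actual allocation code), here is a minimal Go sketch contrasting the two keying strategies. Hashing the full label set means any mutable label, such as a pod label copied onto the target, changes the key; hashing only `__address__` keeps it stable.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// target is a stand-in for a discovered scrape target; the real allocator
// uses Prometheus label types, this is just for illustration.
type target struct {
	labels map[string]string // includes the special "__address__" label
}

// hashAllLabels keys the target by its entire label set (sorted for determinism).
func hashAllLabels(t target) uint64 {
	keys := make([]string, 0, len(t.labels))
	for k := range t.labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	h := fnv.New64a()
	for _, k := range keys {
		h.Write([]byte(k))
		h.Write([]byte(t.labels[k]))
	}
	return h.Sum64()
}

// hashAddressOnly keys the target by its address alone.
func hashAddressOnly(t target) uint64 {
	h := fnv.New64a()
	h.Write([]byte(t.labels["__address__"]))
	return h.Sum64()
}

func main() {
	before := target{labels: map[string]string{"__address__": "10.0.0.1:8080", "pod_ready": "false"}}
	after := target{labels: map[string]string{"__address__": "10.0.0.1:8080", "pod_ready": "true"}}

	fmt.Println(hashAllLabels(before) == hashAllLabels(after))     // false: the key moves when a mutable label flips
	fmt.Println(hashAddressOnly(before) == hashAddressOnly(after)) // true: the key stays put
}
```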

Steps to Reproduce

I don't have a clean reproduction yet, but demonstrating the behaviour of targets moving when one of the labels changes is easy in unit tests.

Expected Result

Scrape targets should never be moved between collectors if the collector count doesn't change.

Kubernetes Version

1.24, latest EKS

Operator version

0.82.0

Collector version

0.82.0

Environment information

Environment

OS: (e.g., "Ubuntu 20.04")
Compiler (if manually compiled): (e.g., "go 14.2")

Log output

No response

Additional context

No response

swiatekm added the bug and needs triage labels on Oct 26, 2023
swiatekm self-assigned this on Oct 26, 2023
jaronoff97 (Contributor) commented:

Yeah, I think this would happen if the labels on a collector are patched; the update could cause the assignment to move based on the hash ring.
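To make that mechanism concrete, here is a rough consistent-hash-ring sketch (the operator's real hashring implementation differs in detail): collectors sit at hashed positions on a ring, and a target goes to the first collector clockwise from the hash of its key, so a change in the key alone can move the target even though the ring itself never changed.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

func hash(s string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(s))
	return h.Sum64()
}

// ring is a toy consistent hash ring: one point per collector, no virtual nodes.
type ring struct {
	points []uint64
	byPos  map[uint64]string
}

func newRing(collectors []string) *ring {
	r := &ring{byPos: map[uint64]string{}}
	for _, c := range collectors {
		p := hash(c)
		r.points = append(r.points, p)
		r.byPos[p] = c
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

// assign returns the first collector at or after the key's position, wrapping around.
func (r *ring) assign(targetKey string) string {
	k := hash(targetKey)
	i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= k })
	if i == len(r.points) {
		i = 0
	}
	return r.byPos[r.points[i]]
}

func main() {
	r := newRing([]string{"collector-0", "collector-1", "collector-2"})
	// The collector set is identical, but the target key differs because a
	// label baked into it was patched, so the assignment may differ too.
	fmt.Println(r.assign("10.0.0.1:8080|ready=false"))
	fmt.Println(r.assign("10.0.0.1:8080|ready=true"))
}
```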

jaronoff97 (Contributor) commented:

The reason we included all of the labels is that sometimes a target will appear more than once, and if we didn't use the entire label set it would cause a single target to be assigned to multiple collectors, IIRC. Let me find the PR where I changed that.

jaronoff97 (Contributor) commented:

Yeah here was that PR

swiatekm (Contributor, Author) commented Oct 26, 2023

Wouldn't it be easier to just deduplicate the targets in that case? Plus, that sounds like a bug in Prometheus Service Discovery, so maybe it was fixed in the meantime?

Or is the problem that you can legitimately have multiple targets with the same address in a job? I don't think that's a problem. I don't want to change how target identity is determined, just how the allocation decisions are made.

jaronoff97 (Contributor) commented Oct 26, 2023

It's a bit different, actually: the scrape config's service discovery may generate multiple entries for the same target URL but with different labels (think of an application that serves metrics on two paths, not two ports). If you don't use the entire label set, you end up with a situation where you sometimes scrape the first path and sometimes the second because of a race in discovery. That PR resolved it by treating it the way Prometheus does (as two targets).

Prometheus treats a target group as a list of targets, where a target is identified by its entire label set. We could probably do the same, though I would think a hash may be more efficient than a label set as a key.

Regardless, I think it's totally fine to mess with how the decision is made!
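As a rough illustration of the hash-as-key idea (assuming the prometheus/prometheus model/labels package; this is not how the allocator currently stores targets), the two-paths case above still yields two distinct entries when a map is keyed by the label set's hash:

```go
package main

import (
	"fmt"

	"github.com/prometheus/prometheus/model/labels"
)

func main() {
	// Same __address__, different __metrics_path__: two targets in Prometheus terms.
	a := labels.FromStrings("__address__", "10.0.0.1:8080", "__metrics_path__", "/metrics")
	b := labels.FromStrings("__address__", "10.0.0.1:8080", "__metrics_path__", "/probe")

	// Keying a map by the label set's hash keeps both entries while using a
	// cheap uint64 key instead of a stringified label set.
	targets := map[uint64]labels.Labels{
		a.Hash(): a,
		b.Hash(): b,
	}
	fmt.Println(len(targets)) // 2
}
```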
