
Scrape targets can be reassigned even when collector count doesn't change #2280

Closed
swiatekm opened this issue Oct 26, 2023 · 5 comments · Fixed by #2290
Assignees: swiatekm
Labels: bug Something isn't working
swiatekm (Contributor) commented Oct 26, 2023

Component(s)

target allocator

Description

Running the target allocator in reasonably sized clusters, I noticed that targets were being reassigned to different collector Pods even when the Pod count didn't change. We don't have metrics to measure this directly, but I noticed a second-order effect: Prometheus staleness markers being emitted, which normally happens when a scrape target legitimately disappears.

This shouldn't really happen, and looking at the allocation strategy code, there doesn't appear to be a way for it to happen, assuming the hash doesn't change for an existing target. However, we do calculate the hash based on all the target labels, some of which are mutable. Even putting aside whether this is the cause of the problem I've experienced, there's no reason for us to include all the labels in the hash - the target address should suffice, and this is what prometheus-operator does for sharding as well.
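For illustration only (this is not the operator's actual allocation code), here is a minimal Go sketch contrasting the two keying strategies. Hashing the full label set means any mutable label, such as a pod label copied onto the target, changes the key; hashing only `__address__` keeps it stable.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// target is a stand-in for a discovered scrape target; the real allocator
// uses Prometheus label types, this is just for illustration.
type target struct {
	labels map[string]string // includes the special "__address__" label
}

// hashAllLabels keys the target by its entire label set (sorted for determinism).
func hashAllLabels(t target) uint64 {
	keys := make([]string, 0, len(t.labels))
	for k := range t.labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	h := fnv.New64a()
	for _, k := range keys {
		h.Write([]byte(k))
		h.Write([]byte(t.labels[k]))
	}
	return h.Sum64()
}

// hashAddressOnly keys the target by its address alone.
func hashAddressOnly(t target) uint64 {
	h := fnv.New64a()
	h.Write([]byte(t.labels["__address__"]))
	return h.Sum64()
}

func main() {
	before := target{labels: map[string]string{"__address__": "10.0.0.1:8080", "pod_ready": "false"}}
	after := target{labels: map[string]string{"__address__": "10.0.0.1:8080", "pod_ready": "true"}}

	fmt.Println(hashAllLabels(before) == hashAllLabels(after))     // false: the key moves when a mutable label flips
	fmt.Println(hashAddressOnly(before) == hashAddressOnly(after)) // true: the key stays put
}
```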

Steps to Reproduce

I don't have a clean reproduction yet, but demonstrating the behaviour of targets moving when one of the labels changes is easy in unit tests.

Expected Result

Scrape targets should never be moved between collectors if the collector count doesn't change.

Kubernetes Version

1.24, latest EKS

Operator version

0.82.0

Collector version

0.82.0

Environment information

Environment

OS: (e.g., "Ubuntu 20.04")
Compiler (if manually compiled): (e.g., "go 14.2")

Log output

No response

Additional context

No response

swiatekm added the bug and needs triage labels on Oct 26, 2023
swiatekm self-assigned this on Oct 26, 2023
jaronoff97 (Contributor) commented:

Yeah, I think this would happen if the labels on a collector are patched; the update could cause the assignment to move based on the hash ring.
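To make that mechanism concrete, here is a rough consistent-hash-ring sketch (the operator's real hashring implementation differs in detail): collectors sit at hashed positions on a ring, and a target goes to the first collector clockwise from the hash of its key, so a change in the key alone can move the target even though the ring itself never changed.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

func hash(s string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(s))
	return h.Sum64()
}

// ring is a toy consistent hash ring: one point per collector, no virtual nodes.
type ring struct {
	points []uint64
	byPos  map[uint64]string
}

func newRing(collectors []string) *ring {
	r := &ring{byPos: map[uint64]string{}}
	for _, c := range collectors {
		p := hash(c)
		r.points = append(r.points, p)
		r.byPos[p] = c
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

// assign returns the first collector at or after the key's position, wrapping around.
func (r *ring) assign(targetKey string) string {
	k := hash(targetKey)
	i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= k })
	if i == len(r.points) {
		i = 0
	}
	return r.byPos[r.points[i]]
}

func main() {
	r := newRing([]string{"collector-0", "collector-1", "collector-2"})
	// The collector set is identical, but the target key differs because a
	// label baked into it was patched, so the assignment may differ too.
	fmt.Println(r.assign("10.0.0.1:8080|ready=false"))
	fmt.Println(r.assign("10.0.0.1:8080|ready=true"))
}
```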

jaronoff97 (Contributor) commented:

The reason we included all of the labels is that sometimes a target will appear more than once, and if we didn't use the entire label set it would cause a single target to be assigned to multiple collectors, IIRC. Let me find the PR where I changed that.

jaronoff97 (Contributor) commented:

Yeah here was that PR

swiatekm (Contributor, Author) commented Oct 26, 2023

Wouldn't it be easier to just deduplicate the targets in that case? Plus, that sounds like a bug in Prometheus Service Discovery, so maybe it was fixed in the meantime?

Or is the problem that you can legitimately have multiple targets with the same address in a job? I don't think that's a problem. I don't want to change how target identity is determined, just how the allocation decisions are made.

jaronoff97 (Contributor) commented Oct 26, 2023

It's a bit different, actually: the scrape config's service discovery may generate multiple entries for the same target URL but with different labels (think of an application that serves metrics on two paths, not two ports). If you don't use the entire label set, you end up with a situation where you sometimes scrape the first path and sometimes the second because of a race in discovery. That PR resolved it by treating it the way Prometheus does (as two targets).

Prometheus treats a target group as a list of targets, where a target is identified by its entire label set. We could probably do the same, though I would think a hash may be more efficient than a label set as a key.

Regardless, I think it's totally fine to mess with how the decision is made!
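As a rough illustration of the hash-as-key idea (assuming the prometheus/prometheus model/labels package; this is not how the allocator currently stores targets), the two-paths case above still yields two distinct entries when a map is keyed by the label set's hash:

```go
package main

import (
	"fmt"

	"github.com/prometheus/prometheus/model/labels"
)

func main() {
	// Same __address__, different __metrics_path__: two targets in Prometheus terms.
	a := labels.FromStrings("__address__", "10.0.0.1:8080", "__metrics_path__", "/metrics")
	b := labels.FromStrings("__address__", "10.0.0.1:8080", "__metrics_path__", "/probe")

	// Keying a map by the label set's hash keeps both entries while using a
	// cheap uint64 key instead of a stringified label set.
	targets := map[uint64]labels.Labels{
		a.Hash(): a,
		b.Hash(): b,
	}
	fmt.Println(len(targets)) // 2
}
```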
