
DFS: proctitle naming scheme for Puma and others #235

Closed
wants to merge 1 commit

Conversation


@lzap lzap commented Nov 30, 2021

I solved a long-term problem that has been bugging us for years: when the app server (Puma in our case) performs any kind of rolling restart or recycles a dead worker, a new data file is created, causing the temporary directory used by DirectFileStore to grow and the /metrics endpoint to become slower and slower.

I figured out how to solve this for Puma, and for other web servers that set process titles in a consistent way. Initially, I had a similar implementation based on Puma.stats (https://github.com/puma/puma/blob/master/docs/stats.md); however, that did not work because the call returns relevant data only when made from the master process. So the implementation is based on process titles instead, which is also more generic and quite fast.

The patch includes a README explanation; start with that to understand what is going on.

Edit: Fixed tests.
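As a rough sketch of the idea described above (this is illustrative, not the patch itself; the regex and the Puma title format shown are assumptions):

```ruby
# Sketch of the proctitle naming idea (illustrative, not the actual patch).
# Puma cluster workers set process titles such as
# "puma: cluster worker 3: 12345 [app]", so a user-supplied regex can
# extract an identity that stays stable across worker restarts.

WORKER_REGEX = /cluster worker (\d+)/ # assumed pattern, configured by the user

def stable_suffix(title, regex)
  match = regex.match(title)
  # Fall back to the pid when the title does not match, preserving the
  # old one-file-per-process behaviour.
  match ? match[1] : Process.pid.to_s
end

stable_suffix("puma: cluster worker 3: 12345 [app]", WORKER_REGEX) # => "3"
```

With a stable suffix, a recycled worker reuses its predecessor's metric file instead of creating a new one, so the DirectFileStore directory stops growing.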

@dmagliola
Collaborator

Just want to leave a quick comment to say I saw your PR. I've been pretty sick lately and haven't been able to look at it in detail, but based on the description alone, I'm super interested in this.
Will try to look at it soon!


lzap commented Dec 14, 2021

Oh, get better! Take your time. Looking forward to your opinion on this. Cheers and stay safe.

@@ -205,7 +244,7 @@ def aggregate_values(values)
  if @values_aggregation_mode == SUM
    values.inject { |sum, element| sum + element }
  elsif @values_aggregation_mode == MAX
    values.max
    values.max
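Since the duplicated `values.max` in this hunk is discussed below as accidental, here is a minimal reconstruction of just the two modes visible in the diff (a sketch; the real method likely handles additional aggregation modes):

```ruby
# Minimal reconstruction of the aggregation helper from the hunk above
# (a sketch; only the SUM and MAX modes shown in the diff are included).
class Aggregator
  SUM = :sum
  MAX = :max

  def initialize(mode)
    @values_aggregation_mode = mode
  end

  def aggregate_values(values)
    if @values_aggregation_mode == SUM
      values.inject { |sum, element| sum + element }
    elsif @values_aggregation_mode == MAX
      values.max # single occurrence; the duplicate line was accidental
    end
  end
end

Aggregator.new(Aggregator::SUM).aggregate_values([1, 2, 3]) # => 6
Aggregator.new(Aggregator::MAX).aggregate_values([1, 2, 3]) # => 3
```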
Author

This was probably an accident, will resolve once we have an agreement this is a valid solution.

@dmagliola (Collaborator) left a comment

Sorry for the pretty late response! 😅

I really like this idea!

My thinking on the "too many files" problem has always been to not have one file per metric, as that's really problematic when you have too many metrics (and you'll probably have more metrics than refreshed processes), and to instead have one file per process.

This probably won't solve the issues with too many metrics, but it should certainly help, and both approaches can be combined.

The main question I have is: why do we need the regex?
Could we always just take process_title? (which already returns the pid if empty)

My other question is whether you know if Unicorn also names its processes in this way, or if this would only work with Puma.

def process_title
  cmdline = "/proc/#{process_id}/cmdline"
  if File.exist?(cmdline)
    @process_title ||= File.read(cmdline).unpack("Z*").first
Collaborator

Should this memoization be higher up, so it caches the whole thing?

Author

Oh right, absolutely, that also gets rid of the file existence check.
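The hoisted memoization could look roughly like this (a sketch; `Process.pid` stands in for the library's `process_id` helper):

```ruby
# Sketch of the suggested change: memoize the whole lookup so the
# File.exist? check and the file read happen at most once per process.
class TitleReader
  def process_title
    @process_title ||= begin
      cmdline = "/proc/#{Process.pid}/cmdline"
      # /proc cmdline is NUL-separated; "Z*" unpacks NUL-terminated strings
      File.read(cmdline).unpack("Z*").first if File.exist?(cmdline)
    end
  end
end
```

One caveat: `||=` re-evaluates the block whenever it yielded nil (for example on systems without /proc), so the existence check is only truly skipped on the happy path.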

end

def filename_suffix
  if Thread.current[:prometheus_suffix]
Collaborator

How often do we call this method? Do we need this Thread.current cache?

Author

If you can confirm it is safe to rely on memoization alone, then yeah, that would be cleaner. I am worried about this method being called on every measurement, causing the regexp to be recompiled all the time.
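For what it's worth, a Ruby regexp object is compiled once when it is created, so storing it at configuration time and plainly memoizing the computed suffix avoids both recompilation and the Thread.current cache. A sketch with assumed names:

```ruby
# Sketch: the regexp is compiled once by whoever creates it; matching it
# repeatedly does not recompile it. Plain memoization of the result then
# makes the per-call cost a single instance-variable read.
class SuffixResolver
  def initialize(title, regex) # regex is compiled once, at configuration time
    @title = title
    @regex = regex
  end

  def filename_suffix
    @filename_suffix ||= (m = @regex.match(@title)) ? m[1] : Process.pid.to_s
  end
end

resolver = SuffixResolver.new("puma: cluster worker 7: 99 [app]",
                              /cluster worker (\d+)/)
resolver.filename_suffix # => "7"
```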

@dmagliola
Collaborator

Oh, and one more... Have you run this code in production for a while? :D


lzap commented Jan 11, 2022

> This probably won't solve the issues with too many metrics, but it should certainly help, and both approaches can be combined.

Yeah, we have way too many, and over the past years we have limited them to about two hundred. But due to the dynamic nature of our Prometheus implementation (metrics appear as more and more users hit various actions, controllers or model classes), reloading of processes is not the culprit.

> Oh, and one more... Have you run this code in production for a while? :D

Nah, labs only :) Luckily this is opt-in; it will have no effect until you set the regexp attribute.

> Could we always just take process_title? (which already returns the pid if empty)

Well, I don't think this would be safe as the default: what if a web server (or anything that uses the library) simply names multiple processes with the same title? I think I have seen this in the past, something like "XYZ worker" without a unique numbering scheme. Then we would have a conflict and a nasty bug, so I think this needs to be opt-in, and users must think it through when setting it up.

> My other question is whether you know if Unicorn also names its processes in this way, or if this would only work with Puma.

I haven't found a setproctitle call in the Unicorn codebase, so my assumption is no.

@lzap force-pushed the dfs-proctitle-naming branch from f69454b to 4ea5b2a on January 11, 2022 at 10:24

lzap commented Jan 11, 2022

I have fixed both memoizations.


Sinjo commented Apr 24, 2022

Hi @lzap,

First off, sorry we left you hanging for so long on this one. @dmagliola and I had a couple of chats about this PR in the meantime, and we think there's real value in giving consistent identity to multi-process workers that periodically get recycled/respawned. The reduction in how many metric files we generate could be pretty huge for some users.

What we want to make sure of is getting the API right. In particular, we're keen to generalise it beyond worker systems that set process titles. While Puma exposes process identity that way, other multi-process worker systems may not, and we don't want to tie our API here to what Puma has chosen to do.

We were thinking that it may be better to pass a lambda in rather than a regex. That way, different implementations for retrieving the current process's identity could be supplied. The Puma one could look at the process title, but others could source that information elsewhere.

We would aim to ship lambdas for any popular framework as constants that users could pass in when configuring DirectFileStore.
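The lambda-based API could take a shape like the following (all names here, including the `worker_id:` keyword and the `WorkerIdentity` constants, are hypothetical; nothing like them exists in the library):

```ruby
# Hypothetical sketch of the proposed lambda-based configuration.
# These constants and the worker_id: keyword are invented for illustration.
module WorkerIdentity
  # Each lambda returns a stable identity string for the current process.
  PUMA_PROCTITLE = -> {
    title = (File.read("/proc/#{Process.pid}/cmdline").unpack("Z*").first rescue "") || ""
    title[/cluster worker (\d+)/, 1] || Process.pid.to_s
  }
  PID = -> { Process.pid.to_s } # old behaviour: one file per process
end

# Hypothetical usage when configuring the store:
# Prometheus::Client.config.data_store =
#   Prometheus::Client::DataStores::DirectFileStore.new(
#     dir: "/tmp/prometheus_metrics",
#     worker_id: WorkerIdentity::PUMA_PROCTITLE # assumed keyword argument
#   )
```

Decoupling identity retrieval from the store this way would let Unicorn or Sidekiq Enterprise supply their own lambdas without the store knowing anything about process titles.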

Before shipping this feature, we'd like to try and make it work with at least Puma and Unicorn (assuming the latter has some way to get a consistent identity for a worker process across respawns) as they're a couple of the most popular frameworks using multi-process workers.

Outside of web servers, there's also multi-process support in Sidekiq Enterprise, which would be nice to target, but would likely involve a conversation with Mike Perham as it's only available in the most expensive paid edition of Sidekiq (we'd need to figure something out for testing).


lzap commented May 4, 2022

Hey, I want to say that we haven't deployed this in production, and I am not very confident that it actually solves our problem, which is that we have too many metrics (multiple metrics per Rails controller/action, and we have many). One big drawback is that Puma does not recycle processes on its own; there is the third-party https://github.com/zombocom/puma_worker_killer gem, but without it I think this would not be useful.

Therefore I am not proposing this for the mainline branch. I also don't have time to work on this extensively. Feel free to pick it up if you find it useful. Thanks!

@lzap lzap closed this May 4, 2022