
DFS: proctitle naming scheme for Puma and others #235

Closed
wants to merge 1 commit

Conversation


@lzap lzap commented Nov 30, 2021

I solved a long-term problem that has been bugging us for years: when the app server (Puma in our case) performs any kind of rolling restart or recycles a dead worker, a new data file is created, causing the temporary directory used by DirectFileStore to grow and the /metrics endpoint to become slower and slower.

I figured out how to solve this for Puma, and for other web servers that set process titles in a consistent way. Initially, I had a similar implementation based on Puma.stats (https://github.com/puma/puma/blob/master/docs/stats.md); however, that did not work because the call returns relevant data only when made from the master process. So the implementation is based on process titles instead, which is also more generic and quite fast.

The patch includes a README explanation; start with that to understand what is going on.

Edit: Fixed tests.
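As a rough sketch of the idea described above (this is illustrative, not the patch itself; the regex and the Puma title format shown are assumptions):

```ruby
# Sketch of the proctitle naming idea (illustrative, not the actual patch).
# Puma cluster workers set process titles such as
# "puma: cluster worker 3: 12345 [app]", so a user-supplied regex can
# extract an identity that stays stable across worker restarts.

WORKER_REGEX = /cluster worker (\d+)/ # assumed pattern, configured by the user

def stable_suffix(title, regex)
  match = regex.match(title)
  # Fall back to the pid when the title does not match, preserving the
  # old one-file-per-process behaviour.
  match ? match[1] : Process.pid.to_s
end

stable_suffix("puma: cluster worker 3: 12345 [app]", WORKER_REGEX) # => "3"
```

With a stable suffix, a recycled worker reuses its predecessor's metric file instead of creating a new one, so the DirectFileStore directory stops growing.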

@dmagliola
Collaborator

Just want to leave a quick comment to say I saw your PR. I've been pretty sick lately and haven't been able to look at it in detail, but based on the description alone, I'm super interested in this.
Will try to look at it soon!


lzap commented Dec 14, 2021

Oh, get better! Take your time. Looking forward to your opinion on this. Cheers and stay safe.

@@ -205,7 +244,7 @@ def aggregate_values(values)
  if @values_aggregation_mode == SUM
    values.inject { |sum, element| sum + element }
  elsif @values_aggregation_mode == MAX
    values.max
    values.max
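Since the duplicated `values.max` in this hunk is discussed below as accidental, here is a minimal reconstruction of just the two modes visible in the diff (a sketch; the real method likely handles additional aggregation modes):

```ruby
# Minimal reconstruction of the aggregation helper from the hunk above
# (a sketch; only the SUM and MAX modes shown in the diff are included).
class Aggregator
  SUM = :sum
  MAX = :max

  def initialize(mode)
    @values_aggregation_mode = mode
  end

  def aggregate_values(values)
    if @values_aggregation_mode == SUM
      values.inject { |sum, element| sum + element }
    elsif @values_aggregation_mode == MAX
      values.max # single occurrence; the duplicate line was accidental
    end
  end
end

Aggregator.new(Aggregator::SUM).aggregate_values([1, 2, 3]) # => 6
Aggregator.new(Aggregator::MAX).aggregate_values([1, 2, 3]) # => 3
```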
Author

This was probably an accident, will resolve once we have an agreement this is a valid solution.

@dmagliola (Collaborator) left a comment

Sorry for the pretty late response! 😅

I really like this idea!

My thinking on the "too many files" problem has always been to not have one file per metric, as that's really problematic when you have too many metrics (and you'll probably have more metrics than refreshed processes), and to instead have one file per process.

This probably won't solve the issues with too many metrics, but it should certainly help, and both approaches can be combined.

The main question I have is: why do we need the regex?
Could we always just take process_title? (which already returns the pid if empty)

My other question is whether you know if Unicorn also names its processes in this way, or if this would only work with Puma.

def process_title
  cmdline = "/proc/#{process_id}/cmdline"
  if File.exist?(cmdline)
    @process_title ||= File.read(cmdline).unpack("Z*").first
Collaborator

Should this memoization be higher up, so it caches the whole thing?

Author

Oh right, absolutely, that also gets rid of the file existence check.
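The hoisted memoization could look roughly like this (a sketch; `Process.pid` stands in for the library's `process_id` helper):

```ruby
# Sketch of the suggested change: memoize the whole lookup so the
# File.exist? check and the file read happen at most once per process.
class TitleReader
  def process_title
    @process_title ||= begin
      cmdline = "/proc/#{Process.pid}/cmdline"
      # /proc cmdline is NUL-separated; "Z*" unpacks NUL-terminated strings
      File.read(cmdline).unpack("Z*").first if File.exist?(cmdline)
    end
  end
end
```

One caveat: `||=` re-evaluates the block whenever it yielded nil (for example on systems without /proc), so the existence check is only truly skipped on the happy path.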

end

def filename_suffix
  if Thread.current[:prometheus_suffix]
Collaborator

How often do we call this method? Do we need this Thread.current cache?

Author

If you can confirm it is safe to rely on memoization alone, then yeah, that would be cleaner. I am worried about this method being called on every measurement, causing the regexp to be recompiled all the time.
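For what it's worth, a Ruby regexp object is compiled once when it is created, so storing it at configuration time and plainly memoizing the computed suffix avoids both recompilation and the Thread.current cache. A sketch with assumed names:

```ruby
# Sketch: the regexp is compiled once by whoever creates it; matching it
# repeatedly does not recompile it. Plain memoization of the result then
# makes the per-call cost a single instance-variable read.
class SuffixResolver
  def initialize(title, regex) # regex is compiled once, at configuration time
    @title = title
    @regex = regex
  end

  def filename_suffix
    @filename_suffix ||= (m = @regex.match(@title)) ? m[1] : Process.pid.to_s
  end
end

resolver = SuffixResolver.new("puma: cluster worker 7: 99 [app]",
                              /cluster worker (\d+)/)
resolver.filename_suffix # => "7"
```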

@dmagliola
Collaborator

Oh, and one more... Have you run this code in production for a while? :D


lzap commented Jan 11, 2022

> This probably won't solve the issues with too many metrics, but it should certainly help, and both approaches can be combined.

Yeah, we have way too many, and over the past years we have limited them to about two hundred. But due to the dynamic nature of our Prometheus implementation (metrics appear as more and more users hit various actions, controllers or model classes), reloading of processes is not the culprit.

> Oh, and one more... Have you run this code in production for a while? :D

Nah, labs only :) Luckily this is opt-in; it will have no effect until you set the regexp attribute.

> Could we always just take process_title? (which already returns the pid if empty)

Well, I don't think this would be safe as the default: what if a web server (or anything that uses the library) simply names multiple processes with the same title? I think I have seen this in the past, something like "XYZ worker" without a unique numbering scheme. Then we would have a conflict and a nasty bug, so I think this needs to be opt-in, and users must think it through when setting it up.

> My other question is whether you know if Unicorn also names its processes in this way, or if this would only work with Puma.

I haven't found a setproctitle call in the Unicorn codebase, so my assumption is no.

@lzap force-pushed the dfs-proctitle-naming branch from f69454b to 4ea5b2a on January 11, 2022 at 10:24

lzap commented Jan 11, 2022

I have fixed both memoizations.


Sinjo commented Apr 24, 2022

Hi @lzap,

First off, sorry we left you hanging for so long on this one. @dmagliola and I had a couple of chats about this PR in the meantime, and we think there's real value in giving consistent identity to multi-process workers that periodically get recycled/respawned. The reduction in how many metric files we generate could be pretty huge for some users.

What we want to make sure of is getting the API right. In particular, we're keen to generalise it beyond worker systems that set process titles. While Puma exposes process identity that way, other multi-process worker systems may not, and we don't want to tie our API here to what Puma has chosen to do.

We were thinking that it may be better to pass a lambda in rather than a regex. That way, different implementations for retrieving the current process's identity could be supplied. The Puma one could look at the process title, but others could source that information elsewhere.

We would aim to ship lambdas for any popular framework as constants that users could pass in when configuring DirectFileStore.
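The lambda-based API could take a shape like the following (all names here, including the `worker_id:` keyword and the `WorkerIdentity` constants, are hypothetical; nothing like them exists in the library):

```ruby
# Hypothetical sketch of the proposed lambda-based configuration.
# These constants and the worker_id: keyword are invented for illustration.
module WorkerIdentity
  # Each lambda returns a stable identity string for the current process.
  PUMA_PROCTITLE = -> {
    title = (File.read("/proc/#{Process.pid}/cmdline").unpack("Z*").first rescue "") || ""
    title[/cluster worker (\d+)/, 1] || Process.pid.to_s
  }
  PID = -> { Process.pid.to_s } # old behaviour: one file per process
end

# Hypothetical usage when configuring the store:
# Prometheus::Client.config.data_store =
#   Prometheus::Client::DataStores::DirectFileStore.new(
#     dir: "/tmp/prometheus_metrics",
#     worker_id: WorkerIdentity::PUMA_PROCTITLE # assumed keyword argument
#   )
```

Decoupling identity retrieval from the store this way would let Unicorn or Sidekiq Enterprise supply their own lambdas without the store knowing anything about process titles.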

Before shipping this feature, we'd like to try and make it work with at least Puma and Unicorn (assuming the latter has some way to get a consistent identity for a worker process across respawns) as they're a couple of the most popular frameworks using multi-process workers.

Outside of web servers, there's also multi-process support in Sidekiq Enterprise, which would be nice to target, but would likely involve a conversation with Mike Perham as it's only available in the most expensive paid edition of Sidekiq (we'd need to figure something out for testing).


lzap commented May 4, 2022

Hey, I want to say that we haven't deployed this in production, and I am not very confident that it actually solves our problem, which is that we have too many metrics (multiple metrics per Rails controller/action, and we have many). One big drawback is that Puma does not recycle processes on its own; there is the third-party https://github.com/zombocom/puma_worker_killer gem, but without it I think this would not be useful.

Therefore I am not proposing this for the mainline branch. I also don't have time to work on this extensively. Feel free to pick it up if you find it useful. Thanks!

@lzap lzap closed this May 4, 2022