Gunicorn/multiprocess with restarting workers, .db files are not cleaned up #275
This is the expected behaviour; these files can only be deleted safely when the whole application restarts. I'd suggest increasing max_requests.
We are using max_requests to deal with some memory leaks introduced by Gunicorn's gevent workers. Not ideal at all. But even if we didn't have that problem, workers can restart for other reasons. Is it also expected that memory usage would increase like this? If so, it seems that over the lifetime of the service, every deployment using this pattern would eventually run out of memory.
Memory usage for workers should be constant. |
We have the same issue in our uWSGI deployment. We also restart workers every N requests, and they can restart by themselves for other reasons as well. Is it at all possible that the prometheus client could support this use case? Or should we just look for another solution?
There's not much we can do here if you have high churn I'm afraid. |
We disabled worker restarts, but found that with Gunicorn and Flask we still get occasional unplanned worker restarts. So we were able to reduce the large memory leak to a very slow one that grows over the course of months.
Couldn't …
That'd require locking across processes, which we want to avoid, and cause issues if that pid was reused. |
I think locking across processes in … I guess reusing pids is another issue: if it mixes samples into the same pid file, when there should really be a truly unique file per process, that could cause loss of data, such as the min value for a Gauge.
Hey folks. We currently have a workaround for this which uses gunicorn's child_exit hook to merge the dead worker's counter/histogram metrics into archive files and then delete the pid files. This has kept our file count, and thus our scrape response times, stable. The attached image shows our scrape times (yellow line) when we deployed our workaround.

Our thinking was that using child_exit avoids the need to lock, as it runs only in the master. We still call the current mark_process_dead() function too, so gauge files are removed (although we are not heavy users of gauges).

Regarding pid recycling, I think it is only an issue for gauges, which are still deleted anyway in our experiment. Also, given that filenames with the pid in them should be deleted very soon after the worker's death, the chance of a recycled-pid clash seems very remote. I may have missed something, though.

It seems to be working for us, but I would appreciate any feedback on anything we may have missed. Our experiment involves a slightly altered version of the MultiProcessCollector.collect() function, which you can see here: https://github.com/canonical-ols/talisker/blob/master/talisker/metrics.py#L193 The changes are documented there, but for convenience are copied here:
We'd like to create a PR that upstreams some of this work, if you think the approach is sensible? Specifically, we'd like to slightly refactor the collect() method to support preserving label order, allow for not accumulating histograms, and not hardcode '*.db'. This would allow us to reuse it rather than reimplement it. Also, if this approach is deemed worthwhile, we'd be happy to work on a PR that expands it to include this aggregation as appropriate. Thanks for reading! EDIT: typo
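For readers who want the shape of this workaround without digging through talisker, here is a minimal sketch of a gunicorn config using child_exit. merge_worker_files() is a hypothetical placeholder for the aggregation described above, not part of prometheus_client, and the merge itself is elided:

```python
# gunicorn.conf.py -- minimal sketch only; merge_worker_files() is a
# hypothetical placeholder for the talisker-style aggregation linked above,
# not part of prometheus_client.
import glob
import os

from prometheus_client import multiprocess


def merge_worker_files(pid):
    """Placeholder: fold the dead worker's counter/histogram samples into
    shared archive files, then remove its per-pid files so the number of
    files the collector must read stays bounded."""
    prom_dir = os.environ['prometheus_multiproc_dir']
    for path in glob.glob(os.path.join(prom_dir, '*_{0}.db'.format(pid))):
        # ... read the samples and add them into archive files here
        # (see talisker/metrics.py above); only then remove the file:
        os.remove(path)


def child_exit(server, worker):
    # Runs only in the gunicorn master, which is why no cross-worker
    # locking is needed for this particular hook.
    multiprocess.mark_process_dead(worker.pid)  # removes the live/all gauge files
    merge_worker_files(worker.pid)
```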
That's not a good assumption. For example if another worker is started up and gets the old pid, that'll break things. We also need this to work on things that aren't Gunicorn. For this to work, you'd need a lock of some form.
Does it not already? That's usually something we have.
That violates the Prometheus text format, histograms must be cumulative. |
Isn't the current approach already broken if the new worker gets the same PID? How does it currently keep the values for the old and new processes if the PID is the same? Won't the new process start overwriting the old process's values, since it uses the same file? Just trying to understand this better to figure out potential workarounds.
No, it's designed to handle it. It reads in the previous values at startup. |
Thanks for the prompt response! :)
Right, I think I see the issue. Because the aggregation is not atomic, there is a small window where a new worker using an old pid can lose data it has written to the pre-existing pid file after that file was read by collect(). Strawman naive solution: add the process creation timestamp to the filename? I think that would ensure files are never reused, regardless of pid recycling. While clever, I suspect the current pid-recycling approach of reusing the file is likely to complicate this problem.
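As an illustration of that strawman (not something prometheus_client does), the per-process filename would simply gain a start-time component:

```python
import os
import time

# Illustration only -- prometheus_client names its files by pid alone.
# Including the process start time would keep filenames unique even when
# the OS recycles a pid.
filename = 'counter_{0}_{1}.db'.format(os.getpid(), int(time.time() * 1000))
```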
OK. But general-purpose multi-process serialization is hard, as you have stated. Using SIGCHLD in the master process is one way, and that is what child_exit boils down to. Would it perhaps be beneficial to offer an opt-in, documented solution for gunicorn, rather than no solution at all?
So, if you're talking about a lock in the master, we could add that. But for gunicorn/child_exit (and possibly Python in general) I think it's not strictly necessary, as child_exit is fired from a SIGCHLD signal handler, and Python's execution of signal handlers is effectively serialized by the interpreter, AIUI.
Sadly not - wrapping them in dict() is what loses the order: they are ordered all the way up to that final step. My implementation simply used OrderedDict here, but I suspect a cleaner approach would be to avoid the dict() altogether and just return a list of tuples? If that alone could be changed, our solution would be improved, as we could use collect() directly, just copying files into a different dir beforehand to get around the '*.db' hardcoding, and un-accumulating histograms manually afterwards (this is the approach we took at first, until we hit the ordering issue).
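To make the ordering point concrete, a small illustration (on the Python versions in use at the time, plain dict() did not guarantee insertion order):

```python
from collections import OrderedDict

label_pairs = [('method', 'GET'), ('status', '200')]  # order as stored in the mmap key

# dict() on Python < 3.7 makes no ordering promise, so this step can
# scramble the label order:
maybe_unordered = dict(label_pairs)

# Either of these keeps the order, which writing samples back to an mmap
# file requires:
ordered = OrderedDict(label_pairs)
as_tuples = list(label_pairs)
```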
Right, sorry, I was a bit vague there in hindsight. The collect() function would retain its current output, with accumulated histograms. But we'd refactor to allow it not to be accumulated: probably propose a new function that does not accumulate, which collect() would call and then accumulate on top of. We could then just use the new function rather than collect(). The changes are not big, and I think it would help discussion, so I might take some time today to knock up a PoC PR.
That could make the number of files worse. Not everyone can mark things dead.
We need to work on Windows, and we shouldn't interfere with the signal handling of the process we're instrumenting.
Ah yes. That can't really be changed, as that's part of our API. Why do you want to change this? |
It should only increase the number of files if a recycled pid happens to belong to another worker. And our aggregation would delete them anyway...
Re Windows, I suspect there is no single solution that would work for both anyway? Regarding interference, it would be up to the application to ensure the cleanup is called at the right time, just as it is currently?
The JSON key used for the mmap file format includes the labels in order. To write the collect()ed data back to the aggregated archive files, I need the same order, which sadly the dict() loses.
I don't know offhand.
Yes. But there's no concurrency restrictions around that, so for example two different processes could be performing it at once.
The file format can be changed. |
I'm also facing rising scrape times with a gunicorn application that has restarting workers. I have a PoC that I worked up before finding this discussion. I can share that if it would be helpful, but I don't want to step on any toes. I've spent some time thinking about this, and I'd like to see it solved, so I hope you don't mind if I add my thoughts to the discussion.

Regarding timestamped filenames making the problem worse: this would only be a problem when pid reuse actually occurs. I think the users likely to be affected would already have thousands of db files, and scrape times of ten minutes or longer, before that happened. I'm not saying it's impossible, but it must be vanishingly rare.

If the multiprocess db file can be locked by the writer (when it is opened), then mark_process_dead can test whether the file is still active by trying to obtain the lock. This could partially work around the issue, but it leaves open the possibility of the livesum/liveall gauge files inheriting values from a dead process.
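A minimal, POSIX-only sketch of that lock-probing idea, assuming advisory fcntl locks; prometheus_client does not do this today, and both function names are hypothetical:

```python
import fcntl
import os

# POSIX-only illustration of the lock-probing idea above; prometheus_client
# does not do this today, and both function names are hypothetical.

def hold_lock_for_lifetime(path):
    """Called by the writer when it opens its .db file; the advisory lock is
    released automatically when the process exits."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    fcntl.flock(fd, fcntl.LOCK_EX)
    return fd  # keep this fd open for the lifetime of the worker


def writer_is_dead(path):
    """Cleanup code (e.g. mark_process_dead) could call this to check whether
    the owning process still holds the lock."""
    fd = os.open(path, os.O_RDWR)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        return False  # still locked, so the writer is alive
    else:
        fcntl.flock(fd, fcntl.LOCK_UN)
        return True
    finally:
        os.close(fd)
```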
Looking around, multiprocessing.Lock() might provide a cross-platform solution. I've added it to fix an issue in our implementation; it seems to work, but it's not yet in production, so it's unproven.
Right, I wonder if documenting what the synchronization requirements are might be a good start? Then applications can use the right approach given their concurrency model?
Right, fair enough. But I don't think it needs to, we just need to avoid losing the order of labels, which seems fairly simple? |
Hmm, how does that work behind the scenes?
I think this is something that should Just Work in any environment. I'd rather not have a system that requires extensive documentation and reasoning to work, as users will get it wrong.
From the source, it looks like it's using the WaitForSingleObjectEx syscall on Windows, and a POSIX semaphore on Unix.
Fair enough, it's just a lot more work to get there :). If multiprocessing.Lock is suitable, then I think mark_process_dead could maybe be expanded to aggregate into a single file on worker death by default, using that lock. Rough stab at a general solution (sketched below): register an atexit function that calls mark_process_dead. That should work for any multiprocess scenario, with the caveat that processes that are forcibly killed will not run the atexit handler, and will thus leak mmapped files. But that's a common enough issue with killing processes that way, and the leak would be limited to rare error scenarios rather than every process recycle.
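A rough sketch of that atexit idea, assuming the lock is created in a Python master process before workers fork; install_exit_cleanup() and cleanup_lock are hypothetical names, not prometheus_client API:

```python
import atexit
import multiprocessing
import os

from prometheus_client import multiprocess

# This lock must be created in the parent before workers fork (e.g. in the
# gunicorn master) so that every child inherits the same semaphore -- see the
# caveats discussed below. Both names here are hypothetical.
cleanup_lock = multiprocessing.Lock()


def install_exit_cleanup():
    """Call once in each worker, after fork. Workers that are killed with
    SIGKILL never run atexit handlers, so their files can still leak."""
    pid = os.getpid()

    def _cleanup():
        with cleanup_lock:
            # Today this only removes the live/all gauge files; any merging of
            # counter/histogram files would have to be added here as well.
            multiprocess.mark_process_dead(pid)

    atexit.register(_cleanup)
```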
How is the handle/semaphore id shared between the processes?
That only works if you know that no new processes will be created with that pid while you're doing that. |
The lock has to be inherited, so it has to be created in the parent process, and thus only works for a fork()ing model.
Right, that could probably be addressed by locking the mmap file opening with the same lock, or as discussed earlier, adding a timestamp or similar to the mmap filename (if we have full worker death cleanup, then extra files are no longer a problem, anyway). |
More particularly it only works if the parent process loads the Prometheus client library, which certain configurations of Gunicorn will not. |
Right. You would probably need to document adding a gunicorn on_starting hook or something that did that. |
Gunicorn isn't the only thing in use though, what if it's say mod_python in Apache (which I admit to having never used, and haven't even read the docs) in which a C process is what spawns the children. |
Made a PR with the small changes to MultiProcessCollector.collect() I mentioned above. This doesn't solve the problem, but it makes it easier for others to roll their own solution, as we have done (we had to duplicate the whole of collect() in our code; this would allow us to just call the new merge() function directly).
Factors out a merge() method from the previous collect() method, which is parameterized, and thus can be used for arbitrary merging of samples. For motivation, see discussion in issue prometheus#275 around merging dead workers data into a single mmaped file. This basically allows us to parameterize the files to be merged, and also whether to accumulate histograms or not. Accumulation is on by default, as that is what the prometheus format expects. But it can now be disabled, which allows merged values to be correctly written back to an mmaped file. For the same reason, the order of labels is preserved via OrderedDict. Signed-off-by: Simon Davy <[email protected]>
Refactor MultiprocessCollector.collect to allow for arbitrary merging. (#302) Factors out a merge() method from the previous collect() method, which is parameterized, and thus can be used for arbitrary merging of samples. For motivation, see discussion in issue #275 around merging dead workers data into a single mmaped file. This basically allows us to parameterize the files to be merged, and also whether to accumulate histograms or not. Accumulation is on by default, as that is what the prometheus format expects. But it can now be disabled, which allows merged values to be correctly written back to an mmaped file. Signed-off-by: Simon Davy <[email protected]>
Did anybody manage to solve this? I'm using uwsgi and it's not obvious how I can call mark_process_dead |
We managed to solve this for Gunicorn; the bulk of our (rather complex) solution can be seen here: https://github.com/canonical-ols/talisker/blob/master/talisker/prometheus.py It's not pretty :(

The core of the solution is this: the master process runs a cleanup function for each worker that dies. This function aggregates the dead worker's files into some central files, and then deletes the old files. This write-and-delete is a critical section that needs protecting with a multiprocess lock of some kind, or else workers reading the files could double-read. The Gunicorn specifics involve a SIGCHLD handler on the master process, plus other messy details, but it's been working well for us after a few initial bumps.

However, that solution is very specific to Gunicorn's multiprocess model. uWSGI is a whole other kettle of fish. AFAICS, it doesn't support hooking into SIGCHLD handling on the master, which is in C and so not easily extensible. One option might be to define a custom worker as documented here: https://uwsgi-docs.readthedocs.io/en/latest/WorkerOverride.html You could maybe do the work in the destroy() function, which I assume runs in the worker process, but you'd really need multiprocess locking there. We used the stdlib's multiprocessing.Lock(), but as discussed in this thread, it requires a Python master process to initialise it. Under the hood, though, it's just a semaphore based on a shared memory file path; some form of that might be usable if you can figure out how to ensure each worker uses the same path.

Are you aware of https://github.com/Lispython/pyprometheus ? That is an alternative Prometheus client library with specific support for using uWSGI's shared memory, as opposed to mmap files in a shared directory. I have not used it, and it's not the official Prometheus client, but it might help here - you may be able to use its approach with prometheus_client. HTH

Edit: did some more research, and sharedarea in uWSGI is just a locking wrapper around a shared mmap, but it does the multiprocess locking for you, which is nice.
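For illustration, a stripped-down sketch of that locking discipline, using only public prometheus_client calls; the actual merge logic is elided, and on_worker_death()/render_metrics() are hypothetical names:

```python
import multiprocessing

from prometheus_client import CollectorRegistry, generate_latest, multiprocess

# The merge itself is elided (see talisker/prometheus.py above); the point is
# that aggregation/deletion and scrape-time reads must not interleave, or a
# scrape can double-count or miss samples. The lock must be created in the
# master before workers fork so that every process shares the same semaphore.
metrics_lock = multiprocessing.Lock()


def on_worker_death(pid):
    """Run in the master, e.g. from a SIGCHLD/child_exit hook."""
    with metrics_lock:
        # ... merge the worker's counter/histogram files into archive files
        # ... and delete the per-pid files, then:
        multiprocess.mark_process_dead(pid)


def render_metrics():
    """Run in whichever worker serves /metrics."""
    with metrics_lock:
        registry = CollectorRegistry()
        multiprocess.MultiProcessCollector(registry)
        return generate_latest(registry)
```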
It should be noted that you only need the code in client_python/prometheus_client/multiprocess.py, lines 120 to 127 (at 4aa1936).
Have you found any solution for this issue?
Hello @bloodearnest, I have attempted to use your solution to tackle the db file number problem. I used the solution on three metric types: Counter (…
Thank you for your time. Edit: Below is the image describing the 1st question. My gunicorn is configured to run with 3 workers: |
I'm using … which makes it use the worker id rather than the pid; the worker id is reused when a worker restarts. This looks like it works in dev. @brian-brazil, would there be problems with this setup? The only thing that looks suspicious to me is not having a …
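Since the actual configuration snippet is elided above, the following is only a guess at its general shape. values.MultiProcessValue(process_identifier=...) is the hook exposed by recent prometheus_client releases (older releases spelled it differently, so check your version), and get_worker_id() is a stand-in for whatever stable, reused worker id the server provides (for example uwsgi.worker_id()):

```python
import os

from prometheus_client import values


def get_worker_id():
    # Stand-in: should return a small, stable id that the server reuses when
    # it replaces a worker (e.g. uwsgi.worker_id()).
    return os.environ.get('WORKER_ID', str(os.getpid()))


# Must run before any metrics are created, and only when the multiprocess
# directory environment variable is set, so files are named by worker id
# instead of pid.
values.ValueClass = values.MultiProcessValue(process_identifier=get_worker_id)
```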
This issue has gotten a bit long covering many related issues. The initial report was not a bug, so let's open new issues for any new discussion around improvements here. |
Hi, sorry, I know this issue is now closed, but I wanted to reply to @xuansontrinh, as I'd missed the question addressed to me.
No - the reason is that we don't use summaries! I expect you'll need to extend the code to do so. If the summaries are a different format from counters/histograms, then you may need to write code to merge them.
I don't think this should be an issue. The file size should depend on the number of metric/label combinations your app has, not the length of time it has been around, as the merge should collate everything into one entry per metric/label combination. Also, we avoid this by using a new tmpdir for prometheus_multiproc_dir when the gunicorn service restarts, to meet the documented requirement of clearing it between service starts. And as each deploy is a new service, it will only grow for the lifetime of a deployment, which is typically only a few weeks at most.
The gunicorn master does do more work. But it spends most of its time idle anyway, as it handles no traffic. I guess if you have many workers there might be an issue, especially if you SIGHUP gunicorn, as all the current workers will die at once. FTR, we use 64 workers and regularly SIGHUP, without any issue that we've detected.
Hi everyone! Sorry for updating this after two years, but I still don't understand what I should do about the prometheus multiproc dir being flooded with .db files. It slows down /metrics collection. Right now we just clean up the dir on each deploy, but if there are no deploys for a few days it leads to timeouts on the /metrics request. What is the best practice here? A cron cleaning job?
Hi all. I am also facing the same issue. Memory reaches the limit and the application restarts each time. I tried all the worker types but still had no success. I tried running a cron job to delete the files, but after each deletion I get a pod error from Prometheus. Also, if a file was being used by some worker, the metrics endpoint would raise an exception and again an error would be registered via Prometheus. The suggestion above to restart the server is not suitable for me, as I need it to run without any restarts and would not want the application to depend on metrics. Any help is appreciated. Thanks!
We are using Gunicorn with multiple workers, with the Gunicorn max_requests option. We found that memory usage in our containers increases with every worker restart. By reducing the max_requests value so that workers restart very frequently, the problem became apparent. We are using the method described in the readme (sketched below).

While running a performance test, I see the following in the container: … Watching that directory, new files are created with every new worker and old ones are not cleaned up. Deleting the files in that directory reduces the memory usage on that container, which then starts building again.

What is the proper way of dealing with these database files? Is this mark_process_dead's responsibility?

And a general question about Prometheus: do we need to keep metrics around after they are collected? Could we wipe our metrics after the collector hits our metrics endpoint? If so, can this be done via the prometheus_client?
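For context, the multiprocess setup the readme described at the time looks roughly like this (a sketch, not the exact readme text); prometheus_multiproc_dir must point at a directory that is emptied between application restarts:

```python
# app.py -- expose metrics aggregated across all worker processes
from prometheus_client import (CONTENT_TYPE_LATEST, CollectorRegistry,
                               generate_latest, multiprocess)


def metrics_app(environ, start_response):
    registry = CollectorRegistry()
    multiprocess.MultiProcessCollector(registry)
    data = generate_latest(registry)
    start_response('200 OK', [('Content-Type', CONTENT_TYPE_LATEST),
                              ('Content-Length', str(len(data)))])
    return [data]


# gunicorn.conf.py -- clean up the live/all gauge files when a worker dies
from prometheus_client import multiprocess


def child_exit(server, worker):
    multiprocess.mark_process_dead(worker.pid)
```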