I am using timeexecution with a ThreadedBackend on ElasticsearchBackend in a project at CERN. It is great software! For reference, this is where timeexecution gets configured at the app's boot up:
https://github.com/inspirehep/inspire-next/blob/899dabc588159dd9d45e7202692c675f073f0fe0/inspirehep/utils/ext.py#L76
The entire project runs in a Celery instance (which might live for days before getting restarted). Occasionally, it stops sending metrics to ES.
When this happens, the logs show that all metrics are discarded (log entry like: [2019-01-10 12:24:18,816: WARNING/ForkPoolWorker-16] Discard metric inspirehep.modules.records.receivers.push_to_orcid) because the queue is full and it does not get consumed: https://github.com/kpn-digital/py-timeexecution/blob/d1b005a337fb6705aa804da6c26b7d9d477b62fc/time_execution/backends/threaded.py#L43-L44
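For context, the pattern at those lines is roughly this (a paraphrased sketch of the producer side, not a verbatim copy of the linked code):

```python
import logging
from queue import Full

logger = logging.getLogger(__name__)

# Inside ThreadedBackend (paraphrased):
def write(self, name, **data):
    # Producer side: push the metric onto the bounded, thread-safe queue
    # and simply drop it when the queue is full.
    try:
        self._queue.put_nowait((name, data))
    except Full:
        logger.warning("Discard metric %s", name)
```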
My guess is that the cause is that the consumer thread started by the start_worker method has died (or hung).
Note that the same project also runs in a Gunicorn instance, where the issue never happens.
The problem is solved when Celery is restarted.
INVESTIGATION
Celery has a MainProcess (main) and a number (2 in our case) of ForkPoolWorker-N (worker or child) processes. A worker process might be killed by the main process when it uses too much memory (log entry like: [2019-01-10 11:47:20,010: ERROR/ForkPoolWorker-1] child process exiting after exceeding memory limit (2760384KiB / 1370112KiB)).
When a worker process dies, it is replaced by a new one. I noticed that every time the issue happens, it is right after a worker process has been killed for using too much memory. I was able to reproduce the issue only a few times and not in a deterministic way (even when setting a very low memory threshold, thus triggering the kill very frequently).
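For reference, this kill is driven by Celery's per-child memory limit. A typical way to configure it (the app name is illustrative; the value matches the limit in the log above):

```python
from celery import Celery

app = Celery("inspirehep")  # app name is illustrative

# Kill and replace a worker child once its resident memory exceeds this many KiB;
# exceeding it produces the "child process exiting after exceeding memory limit" log.
# Equivalent CLI flag: --max-memory-per-child=1370112
app.conf.worker_max_memory_per_child = 1370112
```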
To complicate things: the consumer thread is owned by the MainProcess and the write method is executed in a ForkPoolWorker-N process (I added explicit logs to prove this).
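By explicit logs I mean nothing more than recording the pid, process name and thread name at the interesting points; a minimal sketch (the helper name is made up):

```python
import logging
import multiprocessing
import os
import threading

logger = logging.getLogger(__name__)

def log_execution_context(label):
    # Hypothetical helper: called inside write() and inside the consumer loop
    # to record which process (MainProcess vs ForkPoolWorker-N) and which
    # thread actually execute each piece of code.
    logger.warning(
        "%s: pid=%s process=%s thread=%s",
        label,
        os.getpid(),
        multiprocessing.current_process().name,
        threading.current_thread().name,
    )
```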
POSSIBLE SOLUTION
Restart the consumer thread from write() when the queue turns out to be full:

```python
except Full:
    if not self.thread:  # and maybe: or not self.thread.is_alive()
        self.start_worker()
    logger.warning("Discard metric %s", name)
```
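(The commented is_alive() check is relevant because a Thread object can outlive its OS thread: self.thread may still be set even after the consumer thread has died.)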
I am not going to open such a PR yet, as I was not able to deterministically reproduce the issue, but I want to keep track of it here.
UPDATE 15/01/2019
The possible solution mentioned above did not work, but I am now able to reproduce the issue systematically.
Celery has a main process and a number (depending on the configuration) of worker processes.
This is our case:
I added more logging statements and confirmed that:
- the consumer thread started here lives in the main Celery process. That is because the ThreadedBackend class is instantiated at the app's boot up (thread with pid=32450 in the htop screenshot);
- the thread-safe queue is also created in __init__ (I guess it is the thread with pid=32447 in the htop screenshot);
- the producer code lives in the Celery worker processes (processes with pids 1017 and 966 in the htop screenshot).
When a Celery worker process exceeds the memory limit set in the configuration, what happens is:
- the Celery worker process is killed and replaced by a new process;
- the producer code in the new Celery worker process keeps queueing metrics until the queue gets full (a few seconds later);
- the consumer thread in the main Celery process keeps seeing an empty queue, so no metrics are sent (see the note and small demo right after this list).
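This is consistent with queue.Queue being thread-safe but not process-safe: whatever the producer puts on the queue inside a worker process stays in that process's copy of the queue and is invisible to the consumer thread in the main process. A tiny standalone demo of that behaviour (unrelated to the py-timeexecution code):

```python
# Standalone demo: a queue.Queue does not cross process boundaries.
# Items put by a child process stay in the child's copy of the queue,
# so a consumer in the parent process never sees them.
import multiprocessing
import queue

q = queue.Queue(maxsize=10)  # thread-safe, but NOT process-safe

def producer():
    # Runs in the child process: fills the child's own copy of `q`.
    for i in range(3):
        q.put_nowait(i)
    print("child sees", q.qsize(), "items")    # 3

if __name__ == "__main__":
    p = multiprocessing.Process(target=producer)
    p.start()
    p.join()
    print("parent sees", q.qsize(), "items")   # 0
```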
I worked out a solution where the consumer thread, the queue and the producer code all live in the Celery worker processes. I have been testing it manually and on one canary production machine for a couple of days and it works well (while all the other, non-canary machines are still affected by the issue). The solution is backward compatible. PR coming soon.
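To give an idea of the direction, here is a rough, simplified sketch of the approach (not the actual PR: no batching, no flush interval, and the class and argument names are made up): create the queue and the consumer thread lazily in whichever process actually calls write(), and re-create them whenever the current pid changes or the thread has died.

```python
import logging
import os
import threading
from queue import Queue, Full

logger = logging.getLogger(__name__)

class LazyThreadedBackend:
    """Sketch only: queue and consumer thread are created lazily, per process."""

    def __init__(self, delegate, queue_maxsize=1000):
        self.delegate = delegate      # e.g. an Elasticsearch-backed metrics backend
        self.queue_maxsize = queue_maxsize
        self._queue = None
        self._thread = None
        self._pid = None              # pid of the process that owns queue/thread

    def _ensure_worker(self):
        # (Re)create the queue and consumer thread if we are in a freshly
        # forked/replaced process, or if the previous consumer thread died.
        pid = os.getpid()
        if self._pid != pid or self._thread is None or not self._thread.is_alive():
            self._queue = Queue(maxsize=self.queue_maxsize)
            self._thread = threading.Thread(target=self._consume, daemon=True)
            self._thread.start()
            self._pid = pid

    def _consume(self):
        # Consumer side: drain the queue of the current process and forward
        # each metric to the real backend.
        q = self._queue
        while True:
            name, data = q.get()
            self.delegate.write(name, **data)

    def write(self, name, **data):
        self._ensure_worker()
        try:
            self._queue.put_nowait((name, data))
        except Full:
            logger.warning("Discard metric %s", name)
```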
puntonim changed the title from "Queue full and no metrics sent to backend occasionally w/ ThreadedBackend" to "Queue full and stop sending metrics to backend occasionally w/ ThreadedBackend" on Jan 11, 2019.
puntonim changed the title to "Queue full and stop sending metrics to backend w/ ThreadedBackend" on Jan 15, 2019.
puntonim changed the title to "Queue full and stop sending metrics to backend w/ ThreadedBackend in a Celery project" on Jan 15, 2019.
puntonim added a commit to puntonim/py-timeexecution that referenced this issue on Jan 15, 2019, with the message:

The lazy_init kwarg can be used to avoid a bug that leads to no metrics being sent by the ThreadedBackend when used in a Celery project with the --max-memory-per-child param.
See: kpn#37
Sorry for the late feedback. Thanks for filing the issue and for the PR. We are evaluating the solution and will update/merge soon, since there are some conflicts with the original PR.