I am using timeexecution with a ThreadedBackend on ElasticsearchBackend in a project at CERN. It is great software! For reference, this is where timeexecution gets configured at the app's boot up:
https://github.com/inspirehep/inspire-next/blob/899dabc588159dd9d45e7202692c675f073f0fe0/inspirehep/utils/ext.py#L76
The entire project runs in a Celery instance (which might live for days before getting restarted). Occasionally, it stops sending metrics to ES.
When this happens, the logs show that all metrics are discarded (log entry like: [2019-01-10 12:24:18,816: WARNING/ForkPoolWorker-16] Discard metric inspirehep.modules.records.receivers.push_to_orcid) because the queue is full and it does not get consumed: https://github.com/kpn-digital/py-timeexecution/blob/d1b005a337fb6705aa804da6c26b7d9d477b62fc/time_execution/backends/threaded.py#L43-L44
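For context, the pattern at those lines is roughly this (a paraphrased sketch of the producer side, not a verbatim copy of the linked code):

```python
import logging
from queue import Full

logger = logging.getLogger(__name__)

# Inside ThreadedBackend (paraphrased):
def write(self, name, **data):
    # Producer side: push the metric onto the bounded, thread-safe queue
    # and simply drop it when the queue is full.
    try:
        self._queue.put_nowait((name, data))
    except Full:
        logger.warning("Discard metric %s", name)
```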
My guess is that the cause is that the consumer thread started by the start_worker method has died (or hung).
Note that the same project also runs in a Gunicorn instance, where the issue never happens.
The problem is solved when Celery is restarted.
INVESTIGATION
Celery has a MainProcess (main) and a number (2 in our case) of ForkPoolWorker-N (worker or child) processes. A worker process might be killed by the main process when it uses too much memory (log entry like: [2019-01-10 11:47:20,010: ERROR/ForkPoolWorker-1] child process exiting after exceeding memory limit (2760384KiB / 1370112KiB)).
When a worker process dies, it is replaced by a new one. I noticed that every time the issue happens, it is right after a worker process has been killed for using too much memory. I was able to reproduce the issue only a few times and not in a deterministic way (even when setting a very low memory threshold, thus triggering the kill very frequently).
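For reference, this kill is driven by Celery's per-child memory limit. A typical way to configure it (the app name is illustrative; the value matches the limit in the log above):

```python
from celery import Celery

app = Celery("inspirehep")  # app name is illustrative

# Kill and replace a worker child once its resident memory exceeds this many KiB;
# exceeding it produces the "child process exiting after exceeding memory limit" log.
# Equivalent CLI flag: --max-memory-per-child=1370112
app.conf.worker_max_memory_per_child = 1370112
```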
To complicate things: the consumer thread is owned by the MainProcess and the write method is executed in a ForkPoolWorker-N process (I added explicit logs to prove this).
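By explicit logs I mean nothing more than recording the pid, process name and thread name at the interesting points; a minimal sketch (the helper name is made up):

```python
import logging
import multiprocessing
import os
import threading

logger = logging.getLogger(__name__)

def log_execution_context(label):
    # Hypothetical helper: called inside write() and inside the consumer loop
    # to record which process (MainProcess vs ForkPoolWorker-N) and which
    # thread actually execute each piece of code.
    logger.warning(
        "%s: pid=%s process=%s thread=%s",
        label,
        os.getpid(),
        multiprocessing.current_process().name,
        threading.current_thread().name,
    )
```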
POSSIBLE SOLUTION
Restart the consumer thread from write() when the queue turns out to be full:

```python
except Full:
    if not self.thread:  # and maybe: or not self.thread.is_alive()
        self.start_worker()
    logger.warning("Discard metric %s", name)
```
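(The commented is_alive() check is relevant because a Thread object can outlive its OS thread: self.thread may still be set even after the consumer thread has died.)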
I am not going to open such a PR yet, as I was not able to deterministically reproduce the issue, but I want to keep track of it here.
UPDATE 15/01/2019
The possible solution mentioned above did not work, but I am now able to reproduce the issue systematically.
Celery has a main process and a number (depending on the configuration) of worker processes.
This is our case:
I added more logging statements and confirmed that:
- the consumer thread started here lives in the main Celery process. That is because the ThreadedBackend class is instantiated at the app's boot up (thread with pid=32450 in the htop screenshot);
- the thread-safe queue is also created in __init__ (I guess it is the thread with pid=32447 in the htop screenshot);
- the producer code lives in the Celery worker processes (processes with pids 1017 and 966 in the htop screenshot).
When a Celery worker process exceeds the memory limit set in the configuration, what happens is:
- the Celery worker process is killed and replaced by a new process;
- the producer code in the new Celery worker process keeps queueing metrics until the queue gets full (a few seconds later);
- the consumer thread in the main Celery process keeps seeing an empty queue, so no metrics are sent (see the note and small demo right after this list).
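This is consistent with queue.Queue being thread-safe but not process-safe: whatever the producer puts on the queue inside a worker process stays in that process's copy of the queue and is invisible to the consumer thread in the main process. A tiny standalone demo of that behaviour (unrelated to the py-timeexecution code):

```python
# Standalone demo: a queue.Queue does not cross process boundaries.
# Items put by a child process stay in the child's copy of the queue,
# so a consumer in the parent process never sees them.
import multiprocessing
import queue

q = queue.Queue(maxsize=10)  # thread-safe, but NOT process-safe

def producer():
    # Runs in the child process: fills the child's own copy of `q`.
    for i in range(3):
        q.put_nowait(i)
    print("child sees", q.qsize(), "items")    # 3

if __name__ == "__main__":
    p = multiprocessing.Process(target=producer)
    p.start()
    p.join()
    print("parent sees", q.qsize(), "items")   # 0
```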
I worked out a solution where the consumer thread, the queue and the producer code all live in the Celery worker processes. I have been testing it manually and on one canary production machine for a couple of days and it works well (while all the other, non-canary machines are still affected by the issue). The solution is backward compatible. PR coming soon.
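To give an idea of the direction, here is a rough, simplified sketch of the approach (not the actual PR: no batching, no flush interval, and the class and argument names are made up): create the queue and the consumer thread lazily in whichever process actually calls write(), and re-create them whenever the current pid changes or the thread has died.

```python
import logging
import os
import threading
from queue import Queue, Full

logger = logging.getLogger(__name__)

class LazyThreadedBackend:
    """Sketch only: queue and consumer thread are created lazily, per process."""

    def __init__(self, delegate, queue_maxsize=1000):
        self.delegate = delegate      # e.g. an Elasticsearch-backed metrics backend
        self.queue_maxsize = queue_maxsize
        self._queue = None
        self._thread = None
        self._pid = None              # pid of the process that owns queue/thread

    def _ensure_worker(self):
        # (Re)create the queue and consumer thread if we are in a freshly
        # forked/replaced process, or if the previous consumer thread died.
        pid = os.getpid()
        if self._pid != pid or self._thread is None or not self._thread.is_alive():
            self._queue = Queue(maxsize=self.queue_maxsize)
            self._thread = threading.Thread(target=self._consume, daemon=True)
            self._thread.start()
            self._pid = pid

    def _consume(self):
        # Consumer side: drain the queue of the current process and forward
        # each metric to the real backend.
        q = self._queue
        while True:
            name, data = q.get()
            self.delegate.write(name, **data)

    def write(self, name, **data):
        self._ensure_worker()
        try:
            self._queue.put_nowait((name, data))
        except Full:
            logger.warning("Discard metric %s", name)
```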
puntonim changed the title from "Queue full and no metrics sent to backend occasionally w/ ThreadedBackend" to "Queue full and stop sending metrics to backend occasionally w/ ThreadedBackend" on Jan 11, 2019.
puntonim changed the title to "Queue full and stop sending metrics to backend w/ ThreadedBackend" on Jan 15, 2019.
puntonim changed the title to "Queue full and stop sending metrics to backend w/ ThreadedBackend in a Celery project" on Jan 15, 2019.
puntonim added a commit to puntonim/py-timeexecution that referenced this issue on Jan 15, 2019, with the message:

The lazy_init kwarg can be used to avoid a bug that leads to no metrics being sent by the ThreadedBackend when used in a Celery project with the --max-memory-per-child param.
See: kpn#37
Sorry for the late feedback. Thanks for filing the issue and for the PR. We are evaluating the solution and will update/merge soon, since there are some conflicts with the original PR.