
Scalability Issue: outages / timeouts / slow responses in the recrawler service may lead to message queue buildups #43

Open
jayaddison opened this issue Jan 27, 2021 · 3 comments
Labels
bug Something isn't working


@jayaddison
Member

Describe the bug
The recrawler service has been switched off since early January due to a lack of query results; that problem will be opened and tracked as a separate issue for that service.

If no recrawler pods are available, requests to that service fail with connection errors -- after a considerable timeout -- as visible here in the backend-worker deployment logs:

[2021-01-27 18:28:19,290: WARNING/ForkPoolWorker-2] Recrawling failed due to "ConnectionError" exception
[2021-01-27 18:28:19,291: WARNING/ForkPoolWorker-3] Recrawling failed due to "ConnectionError" exception
[2021-01-27 18:30:30,362: WARNING/ForkPoolWorker-1] Recrawling failed due to "ConnectionError" exception
[2021-01-27 18:30:30,366: WARNING/ForkPoolWorker-3] Recrawling failed due to "ConnectionError" exception

This causes the throughput of the backend-worker instances to drop dramatically, since most of each task worker's time is spent waiting for connection attempts to time out.

It may be useful to consider both a short-term and a longer-term fix here. Since we are not currently receiving results from the recrawler service, a short-term patch would be to re-deploy that service to respond with empty results (effectively a no-op). Longer-term, we likely want to isolate the queue workers that handle event logs, and perhaps add circuit breakers and/or adjust the connection timeouts they use.
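One possible shape for the longer-term fix: fail fast on recrawler calls and skip them entirely while the service is known to be down. This is a minimal sketch only; the URL, thresholds, and function name are illustrative assumptions rather than the actual backend code.

```python
# Sketch: short connect timeout plus a simple circuit breaker, so a recrawler
# outage cannot stall the worker pool. Names/values here are hypothetical.
import time
import requests

RECRAWLER_URL = "http://recrawler-service/recrawl"  # hypothetical service URL
FAILURE_THRESHOLD = 5     # consecutive failures before opening the circuit
COOLDOWN_SECONDS = 60     # how long to skip calls once the circuit is open

_failures = 0
_opened_at = 0.0


def recrawl(payload):
    global _failures, _opened_at

    # Circuit open: skip the call entirely until the cooldown has elapsed.
    if _failures >= FAILURE_THRESHOLD and time.monotonic() - _opened_at < COOLDOWN_SECONDS:
        return []

    try:
        # Fail fast: (connect timeout, read timeout) in seconds.
        response = requests.post(RECRAWLER_URL, json=payload, timeout=(1, 5))
        response.raise_for_status()
    except requests.RequestException:
        _failures += 1
        if _failures >= FAILURE_THRESHOLD:
            _opened_at = time.monotonic()
        return []

    _failures = 0
    return response.json()
```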

Expected behavior
Throughput for the majority of the RecipeRadar message queues should not be adversely affected by outages in a minor service.

@jayaddison
Member Author

Temporary mitigation deployed: openculinary/recrawler@2229817
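For reference, a minimal sketch of what the no-op behaviour could look like, assuming a Flask-style recrawler endpoint; the route and handler names here are illustrative, and the actual change is in the commit linked above.

```python
# Sketch of the temporary mitigation: the recrawler answers every request
# with an empty result set, so callers return immediately.
# Assumes a Flask app; route/handler names are hypothetical.
from flask import Flask, jsonify

app = Flask(__name__)


@app.route("/recrawl", methods=["POST"])
def recrawl():
    # No-op: return empty results instead of performing any recrawl work.
    return jsonify([])
```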

@jayaddison
Member Author

Isn't the solution for this to add worker processes to the recrawler service? We shouldn't permit one service to get backlogged as a result of taking on the work requested for another service to perform.

@jayaddison
Member Author

> Isn't the solution for this to add worker processes to the recrawler service? We shouldn't permit one service to get backlogged as a result of taking on the work requested for another service to perform.

I think that separating the worker queues is likely a better idea here. Recrawling shouldn't be in the capacity path of crawling/reindexing, for example.
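A minimal sketch of how that separation could be configured with Celery task routing; the task names, queue names, and broker URL are illustrative assumptions, not the project's actual configuration.

```python
# Sketch: route recrawl tasks onto their own queue so a recrawler backlog
# cannot consume capacity needed for crawling/reindexing tasks.
from celery import Celery

app = Celery("backend", broker="amqp://guest@localhost//")

app.conf.task_routes = {
    "backend.tasks.recrawl": {"queue": "recrawl"},
    "backend.tasks.crawl": {"queue": "crawl"},
    "backend.tasks.reindex": {"queue": "index"},
}

# Dedicated workers would then be started per queue, for example:
#   celery -A backend worker -Q recrawl --concurrency=1
#   celery -A backend worker -Q crawl,index --concurrency=4
```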
