
Scalability Issue: outages / timeouts / slow responses in the recrawler service may lead to message queue buildups #43

Open
jayaddison opened this issue Jan 27, 2021 · 3 comments
Labels
bug Something isn't working


@jayaddison
Member

Describe the bug
The recrawler service has been switched off since early January due to a lack of query results; that problem will be opened and tracked as a separate issue for that service.

If no recrawler pods are available, requests to that service fail with connection errors -- after a considerable timeout -- as visible here in the backend-worker deployment logs:

[2021-01-27 18:28:19,290: WARNING/ForkPoolWorker-2] Recrawling failed due to "ConnectionError" exception
[2021-01-27 18:28:19,291: WARNING/ForkPoolWorker-3] Recrawling failed due to "ConnectionError" exception
[2021-01-27 18:30:30,362: WARNING/ForkPoolWorker-1] Recrawling failed due to "ConnectionError" exception
[2021-01-27 18:30:30,366: WARNING/ForkPoolWorker-3] Recrawling failed due to "ConnectionError" exception

This causes the throughput of the backend-worker instances to drop dramatically, since most of each task worker's time is spent waiting for connection attempts to time out.

It may be useful to consider both a short-term and a longer-term fix here. Since we are not currently receiving results from the recrawler service, a short-term patch would be to re-deploy that service to respond with empty results (effectively a no-op). Longer-term, we likely want to isolate the queue workers that handle event logs, and perhaps add circuit breakers and/or adjust the connection timeouts they use.
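One possible shape for the longer-term fix: fail fast on recrawler calls and skip them entirely while the service is known to be down. This is a minimal sketch only; the URL, thresholds, and function name are illustrative assumptions rather than the actual backend code.

```python
# Sketch: short connect timeout plus a simple circuit breaker, so a recrawler
# outage cannot stall the worker pool. Names/values here are hypothetical.
import time
import requests

RECRAWLER_URL = "http://recrawler-service/recrawl"  # hypothetical service URL
FAILURE_THRESHOLD = 5     # consecutive failures before opening the circuit
COOLDOWN_SECONDS = 60     # how long to skip calls once the circuit is open

_failures = 0
_opened_at = 0.0


def recrawl(payload):
    global _failures, _opened_at

    # Circuit open: skip the call entirely until the cooldown has elapsed.
    if _failures >= FAILURE_THRESHOLD and time.monotonic() - _opened_at < COOLDOWN_SECONDS:
        return []

    try:
        # Fail fast: (connect timeout, read timeout) in seconds.
        response = requests.post(RECRAWLER_URL, json=payload, timeout=(1, 5))
        response.raise_for_status()
    except requests.RequestException:
        _failures += 1
        if _failures >= FAILURE_THRESHOLD:
            _opened_at = time.monotonic()
        return []

    _failures = 0
    return response.json()
```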

Expected behavior
Throughput for the majority of the RecipeRadar message queues should not be adversely affected by outages in a minor service.

@jayaddison
Member Author

Temporary mitigation deployed: openculinary/recrawler@2229817
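For reference, a minimal sketch of what the no-op behaviour could look like, assuming a Flask-style recrawler endpoint; the route and handler names here are illustrative, and the actual change is in the commit linked above.

```python
# Sketch of the temporary mitigation: the recrawler answers every request
# with an empty result set, so callers return immediately.
# Assumes a Flask app; route/handler names are hypothetical.
from flask import Flask, jsonify

app = Flask(__name__)


@app.route("/recrawl", methods=["POST"])
def recrawl():
    # No-op: return empty results instead of performing any recrawl work.
    return jsonify([])
```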

@jayaddison
Member Author

Isn't the solution for this to add worker processes to the recrawler service? We shouldn't permit one service to get backlogged as a result of taking on the work requested for another service to perform.

@jayaddison
Member Author

> Isn't the solution for this to add worker processes to the recrawler service? We shouldn't permit one service to get backlogged as a result of taking on the work requested for another service to perform.

I think that separating the worker queues is likely a better idea here. Recrawling shouldn't be in the capacity path of crawling/reindexing, for example.
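A minimal sketch of how that separation could be configured with Celery task routing; the task names, queue names, and broker URL are illustrative assumptions, not the project's actual configuration.

```python
# Sketch: route recrawl tasks onto their own queue so a recrawler backlog
# cannot consume capacity needed for crawling/reindexing tasks.
from celery import Celery

app = Celery("backend", broker="amqp://guest@localhost//")

app.conf.task_routes = {
    "backend.tasks.recrawl": {"queue": "recrawl"},
    "backend.tasks.crawl": {"queue": "crawl"},
    "backend.tasks.reindex": {"queue": "index"},
}

# Dedicated workers would then be started per queue, for example:
#   celery -A backend worker -Q recrawl --concurrency=1
#   celery -A backend worker -Q crawl,index --concurrency=4
```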
