Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[9.x] Prevent redis crash when large number of jobs are scheduled for a specific time #43310

Merged

Conversation

AbiriAmir
Copy link
Contributor

Currently, scheduling a large number of jobs for a specific time causes Redis to halt since migrate script is a heavy script.
In this pull request, support for limiting the number of migrated jobs from the delayed queue to the main queue is added.

Steps to reproduce the problem:

  • Schedule >10000 delayed jobs for a specific time.
  • Then run php artisan queue:work

When the time reaches, Redis will be busy moving these jobs for minutes.

image

By limiting the number of jobs to migrate, the problem would be easily fixed.
It is also fully backward compatible since the default value passed as the limit is -1 which means no limit.

… in redis queue in order to prevent redis crash when large number of jobs are scheduled for a specific time
… in redis queue in order to prevent redis crash when large number of jobs are scheduled for a specific time
@rezakeramati
Copy link

Really needed this feature.
My Redis stucks when lots of jobs are scheduled for a specific time, and results downtime on the whole website.

@taylorotwell
Copy link
Member

Need much more explanation about how this works, what every change does, etc.

@taylorotwell
Copy link
Member

Please mark as ready for review when that is provided.

@taylorotwell taylorotwell marked this pull request as draft July 20, 2022 13:45
@AbiriAmir
Copy link
Contributor Author

AbiriAmir commented Jul 20, 2022

In Laravel Redis Queue, delayed jobs and ready-to-do jobs are stored in different data structures. Let's assume that we have a queue named default. In this case, ready-to-do jobs are stored in a Redis list named default, and delayed jobs are stored in a Redis zset named default:delayed with the key equal to the timestamp of expected running time and value equal to the job body itself.

Then, before popping every job from the main queue, it is checked if there is some overdue delayed job or not and if it exists, it is migrated to the main queue:
image

Now, imagine you are scheduling thousands of jobs for a specific time. For example, in our use case, we calculate user balances every night and schedule to notify them at 8:00 in the morning.
By the time of 8:00, suddenly this script runs:
image

image

This heavy script, blocks all other redis operations and makes your entire application down. According to the documentation:

Redis guarantees the script's atomic execution. While executing the script, all server activities are blocked during its entire runtime. These semantics mean that all of the script's effects either have yet to happen or had already happened.

Hence, doing a bunch of works in a single eval causes Redis to block all other operations including all of the queues, cache (if it is used in the same redis server), etc and can bring the whole application down.

Now, the solution is quite simple. Chunking the expired (and ready-to-migrate) jobs can easily fix this problem.
Migrate function is using zrangebyscore command of Redis. It gets an optional limit option just like SQL. Passing the limit option is enough to solve this problem.

Hence:

-- Get all of the jobs with an expired "score"...
local val = redis.call('zrangebyscore', KEYS[1], '-inf', ARGV[1])

is changed to:

-- Get all of the jobs with an expired "score"...
local val = redis.call('zrangebyscore', KEYS[1], '-inf', ARGV[1], 'limit', 0, ARGV[2])

(Limit option is passed to zrangebyscore)

And a migrationBatchSize option from redis config is passed to the function.
image

The default value is set to -1 to protect backward compatibility when no config is set.

The optional LIMIT argument can be used to only get a range of the matching elements (similar to SELECT LIMIT offset, count in SQL). A negative count returns all elements from the offset. Keep in mind that if offset is large, the sorted set needs to be traversed for offset elements before getting to the elements to return, which can add up to O(N) time complexity.

image

And the config can be set just like any other Redis queue configuration:
image

@AbiriAmir AbiriAmir marked this pull request as ready for review July 20, 2022 14:42
@nunomaduro nunomaduro changed the title Prevent redis crash when large number of jobs are scheduled for a specific time [9.x] Prevent redis crash when large number of jobs are scheduled for a specific time Jul 26, 2022
@taylorotwell taylorotwell merged commit 5582614 into laravel:9.x Jul 29, 2022
Ken-vdE pushed a commit to Ken-vdE/framework that referenced this pull request Aug 9, 2022
… a specific time (laravel#43310)

* Added support for limitting the number of jobs to migrate in each job in redis queue in order to prevent redis crash when large number of jobs are scheduled for a specific time

* Added support for limitting the number of jobs to migrate in each job in redis queue in order to prevent redis crash when large number of jobs are scheduled for a specific time

* formatting

Co-authored-by: amir <[email protected]>
Co-authored-by: Taylor Otwell <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants