-
Notifications
You must be signed in to change notification settings - Fork 91
Home
sched_ext is a Linux kernel feature that implements kernel thread schedulers in BPF and dynamically loads them. This wiki contains helpful information on this technology and schedulers implemented with it. Feel free to come to the #docs
channel in slack to share your own.
Be aware that sched-ext is a rapidly evolving ecosystem, so the resources listed here might be slightly outdated.
The ground truth is in the code and the kernel.org documentation.
In the following a short tutorial to get you started:
In the following a short tutorial for creating a minimal scheduler written with sched_ext in C. This scheduler uses a global scheduling queue from which every CPU gets its tasks to run for a time slice. The scheduler order is First-In-First-Out. So it essentially implements a round-robin scheduler:
This short tutorial covers the basics; to learn more, visit the resources from the scx wiki.
We need a 6.12 kernel or a patched 6.11 kernel to build a custom scheduler. You can get a kernel patched with the scheduler extensions on Ubuntu 24.10 from here, or you can use CachyOS and install a patched kernel from there.
Furthermore, you also need
- a recent
clang
for compilation -
bpftool
for attaching the scheduler
On Ubuntu, for example, you can run: apt install clang linux-tools-common linux-tools-$(uname -r)
.
Nothing more is needed to run it, and you can find the code of this tutorial in the minimal-scheduler repository. I would advise to just cloning it:
git clone https://github.com/parttimenerd/minimal-scheduler
cd minimal-scheduler
The scheduler lives in the sched_ext.bpf.c
file, but before we take a look at it,
I want to show you how you can use this scheduler:
In short, we only need two steps:
# build the scheduler binary
./build.sh
# start the scheduler
sudo ./start.sh
# do something ...
# stop the scheduler
sudo ./stop.sh
I'll show you later what's in these scripts, but first, let's get to the scheduler code:
We assume that you have some experience writing eBPF programs. If not, Liz Rice's book Learning eBPF is a good starting point.
The scheduler code only depends on the Linux bpf kernel headers and sched-ext. So here is the code from sched_ext.bpf.c
:
// This header is autogenerated, as explained later
#include <vmlinux.h>
// The following two headers come from the Linux headers library
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
// Define a shared Dispatch Queue (DSQ) ID
// We use this as our global scheduling queue
#define SHARED_DSQ_ID 0
// Two macros that make the later code more readable
// and place the functions in the correct sections
// of the binary file
#define BPF_STRUCT_OPS(name, args...) \
SEC("struct_ops/"#name) BPF_PROG(name, ##args)
#define BPF_STRUCT_OPS_SLEEPABLE(name, args...) \
SEC("struct_ops.s/"#name) \
BPF_PROG(name, ##args)
// Initialize the scheduler by creating a shared dispatch queue (DSQ)
s32 BPF_STRUCT_OPS_SLEEPABLE(sched_init) {
// All scx_ functions come from vmlinux.h
return scx_bpf_create_dsq(SHARED_DSQ_ID, -1);
}
// Enqueue a task to the shared DSQ that wants to run,
// dispatching it with a time slice
int BPF_STRUCT_OPS(sched_enqueue, struct task_struct *p, u64 enq_flags) {
// Calculate the time slice for the task based on the number of tasks in the queue
// This makes the system slightly more responsive than a basic round-robin
// scheduler, which assigns every task the same time slice all the time
// The base time slice is 5_000_000ns or 5ms
u64 slice = 5000000u / scx_bpf_dsq_nr_queued(SHARED_DSQ_ID);
scx_bpf_dispatch(p, SHARED_DSQ_ID, slice, enq_flags);
return 0;
}
// Dispatch a task from the shared DSQ to a CPU,
// whenever a CPU needs something to run, usually after it is finished
// running the previous task for the allotted time slice
int BPF_STRUCT_OPS(sched_dispatch, s32 cpu, struct task_struct *prev) {
scx_bpf_consume(SHARED_DSQ_ID);
return 0;
}
// Define the main scheduler operations structure (sched_ops)
SEC(".struct_ops.link")
struct sched_ext_ops sched_ops = {
.enqueue = (void *)sched_enqueue,
.dispatch = (void *)sched_dispatch,
.init = (void *)sched_init,
// There are more functions available, but we'll focus
// on the important ones for a minimal scheduler
.flags = SCX_OPS_ENQ_LAST | SCX_OPS_KEEP_BUILTIN_IDLE,
// A name that will appear in
// /sys/kernel/sched_ext/root/ops
// after we attached the scheduler
// The name has to be a valid C identifier
.name = "minimal_scheduler"
};
// All schedulers have to be GPLv2 licensed
char _license[] SEC("license") = "GPL";
We can visualize the interaction of all functions in the scheduler with the following diagram:
Now, after you've seen the code,
run the build.sh
script to generate the vmlinux.h
BPF header
and then compile the scheduler code to BPF bytecode:
bpftool btf dump file /sys/kernel/btf/vmlinux format c > vmlinux.h
clang -target bpf -g -O2 -c sched_ext.bpf.c -o sched_ext.bpf.o -I.
Then run the start.sh
script as a root user to attach the scheduler using the bpftool
:
bpftool struct_ops register sched_ext.bpf.o /sys/fs/bpf/sched_ext
The custom scheduler is now the scheduler of this system. You can check this
by accessing the /sys/kernel/sched_ext/root/ops
file:
> cat /sys/kernel/sched_ext/root/ops
minimal_scheduler
And by checking dmesg | tail
:
> sudo dmesg | tail
# ...
[32490.366637] sched_ext: BPF scheduler "minimal_scheduler" enabled
Play around with your system and see how it behaves.
If you're done, you can detach the scheduler by running the stop.sh
script
using root privileges. This removes the /sys/fs/bpf/sched_ext/sched_ops
file.
Now that you know what a basic scheduler looks like, you can start modifying it. Here are a few suggestions:
How does your system behave when you increase or decrease the time slice?
For example, try a time slice of 1s. Do you see any difference in how your cursor moves? Or try a small time slice of 100us and run a program like that that does some computation, do you see a difference in its performance?
How does your system behave when you change the scheduler to use the same time slice, ignoring the number of enqueued processes?
Try, for example, to create load on your system and see how it behaves.
How does your system behave if the scheduler only schedules to specific CPUs?
Try, for example, to make your system effectively single-core by only consuming tasks
on CPU 0 in sched_dispatch
(Hint: the cpu
parameter is the CPU id).
How does your system behave with multiple scheduling queues for different CPUs and processes?
Try, for example, to create two scheduling queues, with one scheduling queue only
for a process with a specific id (Hint: task_struct#tgid
gives you the process id)
which is scheduled on half of your CPUs.
Look into the linux/sched.h header to learn more about task_struct
.
If you already know how to write basic eBPF programs,
use bpf_trace_printk
and the running
and stopping
hooks.
The running
hook is called whenever a task starts running on a CPU; get the current CPU id via smp_processor_id()
:
int BPF_STRUCT_OPS(sched_running, struct task_struct *p) {
// ...
return 0; // there are no void functions in eBPF
}
The stopping
hook is called whenever a task stops running:
int BPF_STRUCT_OPS(sched_stopping, struct task_struct *p, bool runnable) {
// ...
return 0;
}
You can use this to create visualizations of the scheduling order.
To do even more, you can look at the collected resources further down, especially the well-documented sched-ext code in the kernel.
If you're interested in how to use it in Rust, take a look at scx and scx_rust_scheduler, and for Java at hello-ebpf.
-
Andrea Righi's blog is full of interesting texts on sched-ext, including
- Getting started with sched-ext development (2024) that covers the basic dev setup for scx development
- Writing a scheduler for Linux in Rust that runs in user-space (2024) and part 2 (2024)
-
Changwoo Min's two-part intro into sched-ext, schedulers, and how to write them in C
-
Jonathan Corbet's short introduction and What's scheduled for sched_ext by Daroc Alden, both at LWN
-
Johannes Bechberger's blog post Hello eBPF: Writing a Linux scheduler in Java with eBPF (15)
There are a few talk recordings (and one podcast episode) on sched-ext. You can find them all in the Sched-Ext YouTube playlist. The following is a selection of the most notable videos:
David Vernet gives a lot of talks on sched-ext, this is a collection:
- Scheduling with superpowers: Using sched_ext to get big perf gains (2024) at Kernel Recipes 2024
- sched ext - David Vernet (2023) at LSFMM+BPF Summit 2023
- sched_ext: pluggable scheduling in the Linux kernel (2023) at Kernel Recipes 2023
- Tech Over Tea podcast episode Linux Kernel Scheduler Developer | David Vernet
The recent Linux Storage, Filesystem, MM & BPF Summit also had two talks on sched-ext, giving an update to the previous talk by David Vernet:
- More features and use-cases for sched_ext
- BPF struct_ops & sched_ext by Kui-Feng Lee
The Linux Plumbers Conference 2024 had a micro-conference related to sched-ext, Sched-Ext: The BPF extensible scheduler class MC which covered major topics of the sched-ext community:
- The current status and future potential of sched_ext by David Vernet
- Design a user-space framework to implement sched_ext schedulers by Andrea Righi
- Using sched_ext to improve frame rates on the SteamDeck by Changwoo Min
- Optimizing Google Search and beyond with pluggable scheduling by Barret Rhoden and Josh Don
- A case for using para-virtualized scheduling information with sched_ext schedulers by Himadri Chhaya-Shailesh
- "Hey, psst, try this." The underground culture around custom CPU schedulers. by Giovanni Gherdovich et al.
- Deploying and managing sched_ext schedulers in CachyOS by Peter Jung and Piotr Górski
- Shipping sched-ext: Linux distributions roundtable by Giovanni Gherdovich
Additionally, there was Andrea Righi's talk Crafting a Linux kernel scheduler that runs in user-space using Rust in the LPC Refereed Track.
The whole sched-ext micro-conference has been summed up quite nicely by Jonathan Corbet in his LWN article.