Slurm Docker Cluster is a multi-container Slurm cluster designed for rapid deployment using Docker Compose. This repository simplifies the process of setting up a robust Slurm environment for development, testing, or lightweight usage.
To get up and running with Slurm in Docker, make sure Docker and Docker Compose are installed; the commands below rely on both.
Clone the repository:
git clone https://github.com/giovtorres/slurm-docker-cluster.git
cd slurm-docker-cluster
This setup consists of the following containers:
- mysql: Stores job and cluster data.
- slurmdbd: Manages the Slurm database.
- slurmctld: The Slurm controller responsible for job and resource management.
- c1, c2: Compute nodes (running slurmd).
The following volumes are used:
- etc_munge: Mounted to /etc/munge
- etc_slurm: Mounted to /etc/slurm
- slurm_jobdir: Mounted to /data
- var_lib_mysql: Mounted to /var/lib/mysql
- var_log_slurm: Mounted to /var/log/slurm
Selecting the Slurm version and building the Docker image are simplified by a .env file, which Docker Compose picks up automatically.
Update SLURM_TAG and IMAGE_TAG in the .env file, then build the image:
docker compose build
Alternatively, you can build the Slurm Docker image locally by passing SLURM_TAG as a build argument and tagging the image with a version (IMAGE_TAG):
docker build --build-arg SLURM_TAG="slurm-21-08-6-1" -t slurm-docker-cluster:21.08.6 .
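The example above pairs the SLURM_TAG slurm-21-08-6-1 with the image version 21.08.6. To keep the two in step, a small helper can derive one from the other. This is only a sketch: it assumes tags follow the slurm-MAJOR-MINOR-PATCH-RELEASE pattern seen in that example, and the function name is hypothetical rather than something the repository provides.

```shell
# Hypothetical helper: derive an image version such as 21.08.6 from a
# SLURM_TAG such as slurm-21-08-6-1. Assumes the tag has the form
# slurm-MAJOR-MINOR-PATCH-RELEASE, as in the example above.
image_tag_from_slurm_tag() {
    local tag="${1#slurm-}"      # drop the "slurm-" prefix
    local major minor patch _release
    # Split the remainder on "-" into its four parts.
    IFS=- read -r major minor patch _release <<EOF
$tag
EOF
    printf '%s.%s.%s\n' "$major" "$minor" "$patch"
}

# Usage (not run here):
#   SLURM_TAG=slurm-21-08-6-1
#   docker build --build-arg SLURM_TAG="$SLURM_TAG" \
#       -t "slurm-docker-cluster:$(image_tag_from_slurm_tag "$SLURM_TAG")" .
```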
Once the image is built, deploy the cluster with the default version of Slurm using Docker Compose:
docker compose up -d
To use a specific version and override what is configured in .env, set IMAGE_TAG:
IMAGE_TAG=21.08.6 docker compose up -d
This will start up all containers in detached mode. You can monitor their status using:
docker compose ps
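If you would rather script this check than read the table by eye, something like the following could flag services that are not running. It is a sketch under two assumptions: that the Compose service names match the container names listed earlier (mysql, slurmdbd, slurmctld, c1, c2), and that your Compose v2 release supports the --services and --status options of docker compose ps. The function name is hypothetical.

```shell
# Expected services, taken from the container list in this README.
# Assumption: Compose service names match these container names.
EXPECTED_SERVICES="mysql slurmdbd slurmctld c1 c2"

# Reads running service names on stdin (one per line) and prints every
# expected service that is absent. Returns non-zero if any are missing.
missing_services() {
    local running svc status=0
    running="$(cat)"
    for svc in $EXPECTED_SERVICES; do
        if ! printf '%s\n' "$running" | grep -qxF "$svc"; then
            echo "$svc"
            status=1
        fi
    done
    return "$status"
}

# Usage (requires the cluster to be up):
#   docker compose ps --services --status running | missing_services
```

The function reads from stdin instead of calling Docker itself, so it can be tried against canned input without a running cluster.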
After the containers are up and running, register the cluster with SlurmDBD:
./register_cluster.sh
Tip: Wait a few seconds for the daemons to initialize before running the registration script to avoid connection errors such as:
sacctmgr: error: Problem talking to the database: Connection refused
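Instead of guessing how long to wait, the registration can be retried until it succeeds. The helper below is a generic sketch, not part of this repository, and it assumes register_cluster.sh exits with a non-zero status when the database connection is refused.

```shell
# Generic retry helper (hypothetical, not part of this repo).
# Usage: retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
retry() {
    local max="$1" delay="$2" attempt=1
    shift 2
    # Re-run the command until it succeeds or attempts run out.
    until "$@"; do
        if [ "$attempt" -ge "$max" ]; then
            echo "giving up after $attempt attempts: $*" >&2
            return 1
        fi
        attempt=$((attempt + 1))
        sleep "$delay"
    done
}

# Usage: try the registration up to 10 times, 3 seconds apart.
#   retry 10 3 ./register_cluster.sh
```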
For real-time cluster logs, use:
docker compose logs -f
To interact with the Slurm controller, open a shell inside the slurmctld container:
docker exec -it slurmctld bash
Now you can run any Slurm command from inside the container:
[root@slurmctld /]# sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
normal* up 5-00:00:00 2 idle c[1-2]
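For scripted checks, the sinfo output above can be reduced to a single number. The following is a minimal sketch that assumes the default column order shown (PARTITION AVAIL TIMELIMIT NODES STATE NODELIST); the function name is hypothetical.

```shell
# Sums the NODES column for rows whose STATE is "idle" in default
# sinfo output read from stdin. Assumes the default column order:
# PARTITION AVAIL TIMELIMIT NODES STATE NODELIST.
count_idle_nodes() {
    # NR > 1 skips the header row; "+ 0" prints 0 when nothing matches.
    awk 'NR > 1 && $5 == "idle" { total += $4 } END { print total + 0 }'
}

# Usage from the host (requires a running cluster):
#   docker exec slurmctld sinfo | count_idle_nodes
```

With the sample output above as input, the function prints 2.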
The cluster mounts the slurm_jobdir volume across all nodes, making job files accessible from the /data directory. To submit a job:
[root@slurmctld /]# cd /data/
[root@slurmctld data]# sbatch --wrap="hostname"
Submitted batch job 2
Check the output of the job:
[root@slurmctld data]# cat slurm-2.out
c1
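Beyond --wrap, jobs are usually submitted as batch scripts. The sketch below writes a minimal one; the #SBATCH options are standard sbatch options, while the helper name, job name, and file names are illustrative assumptions and not part of this repository.

```shell
# Hypothetical helper: writes a minimal batch script to the given path.
# The job name and output file name are illustrative only.
write_hello_job() {
    cat > "$1" <<'EOF'
#!/bin/bash
#SBATCH --job-name=hello
#SBATCH --nodes=2
#SBATCH --output=hello-%j.out

# Run hostname once on each allocated node.
srun hostname
EOF
}

# Usage inside the slurmctld container:
#   write_hello_job /data/hello.sh && cd /data && sbatch hello.sh
```

Here %j expands to the job ID, so the output file would be expected to hold one hostname per allocated node.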
Stop the cluster without removing the containers:
docker compose stop
Restart it later:
docker compose start
To completely remove the containers and associated volumes:
docker compose down -v
You can modify the Slurm configuration files (slurm.conf, slurmdbd.conf) on the fly without rebuilding the containers. Just run:
./update_slurmfiles.sh slurm.conf slurmdbd.conf
docker compose restart
This makes it easy to add/remove nodes or test new configuration settings dynamically.
Contributions from the community are welcome! If you want to add features, fix bugs, or improve documentation:
- Fork this repo.
- Create a new branch: git checkout -b feature/your-feature
- Submit a pull request.
This project is licensed under the MIT License.