SLURM survival guide

This post is written from the perspective of someone learning to run SLURM jobs. There might be some inaccuracies, but the idea is to get you up and running fast.

Unfortunately, the only material on SLURM I have been able to find was written by MLOps folks for MLOps folks and was very hard to understand.

Turns out SLURM is very simple from a user perspective once you get through the shroud of the language being used. I hope this blog post can be of help to you!

What is SLURM?

Imagine you have 100 researchers working at your company and each of them needs to run a job on 8 A100 GPUs for about 2 hours once a week.

Do you give each a node with 8 GPUs or do you pool the resources together and allow your users to access hardware when they need it and in the configuration they need?

Maybe one researcher needs 32 GPUs, but only for 10 minutes. Another would like to access a single GPU, but for a week straight.

This is where SLURM comes in.

Slurm is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters. Slurm requires no kernel modifications for its operation and is relatively self-contained.

Essentially, an OS for supercomputers.

👉 Two very interesting technologies that SLURM clusters typically rely on to make things run extra fast are RDMA and NVLink.

How do you use SLURM?

You ssh into a login node that is essentially just a Linux box shared by many users.

You get your home directory and can likely install things that can be useful to you (vim, your dot files, etc).

But your user account comes with special powers. Chances are you have been assigned a role that allows you to use some portion of the available resources and your requests are treated with a predefined priority.

For instance, you can do something as follows:

srun --pty --ntasks-per-node=1 --job-name=test_job --partition=<partition_name> --time=2 --gres=gpu:0 --mem=8G --account=<account_name> --cpus-per-task=2 /bin/bash

This command will launch a VM of sorts for you, technically an interactive job running on a compute node, within a specified partition (the SLURM cluster can be divided into blocks referred to as partitions).

It will be launched with 0 GPUs, 8GB of RAM, and the command that will be executed is /bin/bash.

If you look at your terminal prompt, you will see that it has changed! You have been magically transported into the new VM.
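By the way, if you are curious which partitions your cluster has and how busy they are, sinfo gives you an overview (the exact columns vary from site to site):

# list all partitions, their time limits, and the state of their nodes
sinfo

# or narrow it down to a single partition
sinfo --partition=<partition_name>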

Of note is the time argument. It specifies the maximum time your job is allowed to run for.

In this particular instance, the VM will be shut down automatically after 2 minutes.

For additional ways of specifying the lifetime of your instance, check out man srun and search for the --time argument (type /time and press enter inside the man page).
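To save you a trip, here are a few of the formats it accepts (there are more, all listed in the man page):

--time=2             # 2 minutes
--time=02:00:00      # 2 hours
--time=1-12:00:00    # 1 day and 12 hours
--time=7-00:00:00    # a full week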

That is all there is to it! Given how basic the srun command is, the people maintaining the cluster might provide scripts you can execute to achieve more complex outcomes.

Maybe you'd like to run a specific docker image or have something run (say, a jupyter notebook) when the job launches instead of the /bin/bash above?

Either the functionality will be provided to you or you can code it up yourself!

My understanding is that for these more complex scenarios, it is best to use sbatch. It comes with a bunch of features that make it simpler to execute scripts (see man sbatch for more).
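To give you a feel for what that looks like, here is a minimal sketch of an sbatch submission script. The resource numbers and the python train.py at the bottom are just placeholders, adapt them to whatever your cluster and your workload actually need:

#!/bin/bash
#SBATCH --job-name=my_experiment
#SBATCH --partition=<partition_name>
#SBATCH --account=<account_name>
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G
#SBATCH --time=04:00:00
#SBATCH --output=slurm-%j.out    # %j gets replaced with the job id

# whatever you would normally run by hand goes here
python train.py

You submit it with sbatch <script_name>.sh, get a job id back immediately, and the script runs once the requested resources become available.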

You ran your first job... what's next?

Maybe you'd like to shut down the VM before the --time elapses and unlock the resources for others to use?

No problem!

scancel <job_id> will do this for you!

Don't know the job id you'd like to kill?

squeue -u <user_name> is your friend!
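Putting the two together, a typical clean-up moment looks something like this (the job id is made up):

# list your running and pending jobs (the job id is in the first column)
squeue -u <user_name>

# cancel a specific job
scancel 3788

# or cancel everything you have queued, use with care!
scancel -u <user_name>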

But maybe you'd like to see a brief summary of the jobs you have executed?

sacct -b has got you covered!
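And if the default columns are not enough, you can ask sacct for exactly the fields you care about (these are standard field names, man sacct lists the rest):

sacct --format=JobID,JobName,Partition,State,Elapsed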

A particular job piqued your interest? You can get its details using the following command:

scontrol show job 3788

Other things useful to know

You probably use your ssh key to authenticate with GitHub and the like (and if you don't, you should! getting this configured is an unbelievable time saver).

And if you do, you probably want to take your ssh credentials with you but don't necessarily want to copy your ssh key onto the remote server.

You can achieve all this using ssh-agent forwarding, which you trigger as follows:

eval `ssh-agent`                 # start the agent if it is not running already
ssh-add                          # load your key into the agent
ssh -A <username>@<server_ip>    # -A forwards the agent to the remote machine
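Once you land on the remote machine, you can double check that the forwarding worked. ssh-add -l lists the keys the agent currently knows about, and thanks to -A it should now show the key from your local machine:

ssh-add -l    # run this on the remote machine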

Okay, so this is cool!

But how do you access a port on your VM inside your SLURM cluster? (for instance, maybe you'd like to connect to the jupyter notebook running inside)
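The answer is ssh local port forwarding: the -L flag maps a port on your machine to a host and port reachable from the server you ssh into. In the simplest case (assuming you can ssh straight to the machine running jupyter, hostnames and ports here are placeholders), it looks like this:

# local port 8888 now points at port 8888 on the remote machine
ssh -L 8888:localhost:8888 <username>@<server_ip>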

And the awesome thing is that ssh allows you to chain your tunnels!

So if you have to ssh into a jump box to gain access to your cluster and inside the cluster you have a jupyter server running at the IP address of 10.50.0.1 on port 8888, this is the command you'd need to run to be able to connect to it from the browser on your local computer:

ssh -A -L 2222:10.50.0.1:8888 <jump_box_ip_or_hostname>

All you have to do then is type localhost:2222 in your browser and you are there!

And that is really all there is to SLURM from the perspective of the user!

Sure, you can run arbitrarily complex srun or sbatch commands, but the above should give you everything you need to get up and running quickly.

Thanks for reading and see you around!