Andrew Chang [Postdoctoral Fellow, Department of Psychology, New York University]
Please email me ([email protected]) if you have any suggestions and advice, or spot any errors!
This is a tutorial for computer muggles who want to use NYU's HPC, "Greene," to analyze their data or fit a machine learning model. It also serves as a note for myself and my colleagues. Feel free to distribute it.
Even though we all know that HPC can speed up our research, I found that most people don't want to use it because they worry that they will spend more time learning HPC than waiting for the computation to be completed on their laptops. This tutorial is designed to give you a quick start. It will help you set up a reliable and replicable HPC environment within one afternoon, even if you are not a computer geek.
This tutorial is a lean and lay version of NYU HPC's official website, integrated with my own recommendations. This tutorial covers topics related to data science or data analysis workflows. If you are a computer wizard or witch 🧙♀️🧙, this is not for you. If you are interested in knowing all the commands and details like Hermione Granger 📚, this is not for you. Some functions and approaches may be outdated when you read it, so make sure that you check out the NYU HPC's official website if you encounter any issues. Also, as different HPC systems may have different OS, this tutorial may not apply to other HPCs. Despite my approaches possibly not being the most efficient or correct way to do things, I hope this tutorial can give you a good start in using HPC and boost your productivity!
- You have to request a NYU HPC account. See instructions here.
- Familiar with basic command line commands, such as
cd
,ls
,pwd
,mv
,rm
,cat
,mkdir
. Command line will be the primary way you interact with HPC (see here for a tutorial). - Familiar with
Conda
commands. You will uses them to manage your environment (check out here for more instructions). - You don't have to be familiar with
Git/GitHub
orPython
, but it will be helpful if you have some experience with them.
HPC is a system where you can request computational resources (CPUs, GPUs, RAM, nodes) for each computational job, and you can request many jobs in parallel.
While you can cook all kinds of cuisines (scripts) in your kitchen at home (laptop), you only have access to a few stoves (CPUs), and you can only cook a small portion (RAM) at a time. You don't have access to some specialized equipment (GPU), and you are the only cook (compute node). HPC is a restaurant with many cooks and equipment which can cook all kinds of cuisines. You can assemble multiple cooks (compute nodes) and reserve multiple stoves (CPUs) and specialized equipment (GPUs) to cook multiple dishes at the same time. However, a downside of a restaurant is that it can be too busy to serve you if the cooks are already busy with other customers.
Source: overcooked (Perhaps this is the real appearance of HPC at work?)
- You need more CPUs or RAM to do your job.
- You need to run a time-consuming loop that can be executed independently in parallel.
- You need to get access to a GPU.
- Speeding up a time-consuming script that cannot be parallelized. It will be equally slow on HPC.
I once needed to analyze 200k audio files, and the process on each file would take approximately 15 seconds. In total, it would take 34.7 days to run through all the files if my poor laptop doesn’t burn 🔥! So, I ended up requesting 2000 jobs, each job processing 100 files. As a result, it took less than 2 hours to complete all the jobs, including the queue time ⚡️!
Open your terminal application Mac default, PC default or iTerm for Mac (recommended), connect to the NYU VPN, and log into Greene by executing this line (key in and then press enter):
ssh <NetID>@gw.hpc.nyu.edu ## you can skip this step if you are on the NYU network or using the NYU VPN
ssh <NetID>@greene.hpc.nyu.edu
Replace
<NetID>
of this tutorial with your own, such asab1234
Then enter your password. It should be the same as your NetID password.
If everything is correct, the login node will look something like this:
[<NetID>@log-3 ~]$
log-3
is the login node. Please do NOT run CPU-heavy jobs on login nodes, as it serves other users as well.
Your first step is to request a compute node dedicated to serving you. Execute this line:
srun --cpus-per-task=1 --mem=10GB --time=04:00:00 --pty /bin/bash
It means that you request the node to have 1 CPU and 10 GB of RAM and serve you for 4 hours. You can change these parameters to whatever you want. However, the more resources you request, the longer the queue time (depending on how many other jobs and resources were requested by other users).
After a short queue, you will see the terminal displaying this:
[<NetID>@log-3 ~]$ srun --cpus-per-task=1 --mem=10GB --time=04:00:00 --pty /bin/bash
srun: job 48347520 queued and waiting for resources
srun: job 48347520 has been allocated resources
[<NetID>@cm015 ~]$
Now you have a compute node cm015
ready to serve you.
You can start Python by executing this line:
python
# Output:
# Python 3.9.16 (main, Jan 4 2024, 00:00:00)
# [GCC 11.3.1 20221121 (Red Hat 11.3.1-4)] on linux
# Type "help", "copyright", "credits" or "license" for more information.
# >>>
Now you can run any Python commands following >>>
on HPC the same way as you do on your local computer.
Congratulations! Now you know the core workflow of using HPC. The following sections cover how to streamline the process to make it easier, automatic, replicable, and parallel.
Note that since Greene is Linux-based, the commands you use will need to be Linux-based as well.
There are several root directories on Greene, each with different specifications tailored for various purposes.
Storage | Disk Space / Number of Files | Backed Up / Flushed | Recommendation |
---|---|---|---|
/home | 50 GB / 30 K | YES / NO | Store your Singularity and other frequently used keys or important results. |
/archive | 2 TB / 20 K | YES / NO | Long-term storage. Archive your projects as .tar or .zip or unused Singularity files. Only for infrequent access. |
/scratch | 5 TB / 1 M | NO / Files not accessed for 60 days | Put working projects here, especially those with a small number of large files (e.g., neuroimages). |
/vast | 2 TB / 5 M | NO / Files not accessed for 60 days | Put working projects here, especially those involving many small files with high I/O workflows (e.g., audio or image files). |
You can execute the myquota
command to see the current usage of your storage.
myquota
# Output:
# Hostname: log-3 at Thu Jul 11 07:39:37 AM EDT 2024
#
# Filesystem Environment Backed up? Allocation Current Usage
# Space Variable /Flushed? Space / Files Space(%) / Files(%)
#
# /home $HOME Yes/No 50.0GB/30.0K 46.75GB(93.50%)/13008(43.36%)
# /scratch $SCRATCH No/Yes 5.0TB/1.0M 0.97GB(0.02%)/11764(1.18%)
# /archive $ARCHIVE Yes/No 2.0TB/20.0K 56.33GB(2.75%)/233(1.16%)
# /vast $VAST NO/YES 2TB/5.0M 1.38TB(69.0%)/3454787(69%)
Typically, you want to put your project folder under /scratch
or /vast
, and your Conda environment, Singularity (will explain later), and other personal login files under /home
. Since data not accessed for 60 days under /scratch
and /vast
will be wiped out, I recommend compressing the project folder and saving it under /archive
periodically.
Run this line to compress the folder and save it under /archive
:
tar -czvf /archive/<NetID>/myProject_20240711.tgz /scratch/<NetID>/myProjectFolder
Run this line to uncompress the .tgz file and put it back to /scratch
:
tar -xvf /archive/<NetID>/myProject_20240711.tgz /scratch/<NetID>/
You will see the /scratch/<NetID>/myProjectFolder
folder is back.
Here is an instruction on how to tar archieve a folder on Mac/Linux.
According to NYU HPC's website, there are a number of ways to transfer data to HPC. Here, I will introduce two methods that I recommend.
Globus has a browser-based user interface for file transfers, which is very intuitive and features automatic error monitoring. See here for instructions.
However, Globus can be slow when transferring a large quantity of files. In such cases, you can simply compress the entire folder on your local machine into a single tar file, upload that tar file using Globus, and then uncompressed it on HPC.
Manually synchronizing script files using Globus can be error-prone, as it's easy to lose track, especially when dealing with a large number of small files. Therefore, I recommend synchronizing your scripts (and perhaps a small number of small-sized data files) using GitHub, similar to working on the same project across multiple computers.
Cloning a GitHub repository to HPC is the same as on your local machine (instructions). Just remember to .gitignore
your data folder and anything else you don't want to share.
Tip: I strongly recommend organizing your project files following the cookiecutter-data-science template. You can do this manually if you prefer not to run the script for it. Future you will thank you!
One of the more complex steps in using HPC is setting up the environment and installing all necessary packages. Singularity and Miniconda can simplify this process significantly. While it may take a couple of hours to complete all the steps for the first time, it will become faster and easier.
Conda is a powerful tool for package and environment management. It allows you to create multiple environments with different versions of packages and dependencies that won't interfere with each other. Whenever you install a new package within an environment, Conda checks and updates its dependencies with other existing packages. This is particularly useful when working on multiple projects using different sets of tools, or when you want to install new packages without risking conflicts in the existing environment.
To keep this tutorial focused on HPC, I assume that you are already familiar with Conda and actively using it on your local computer. If not, start using Conda today! See here for reasons why you should use it, and check out here for more instructions.
Singularity is a container platform specifically designed for HPC and has become widely adopted across many HPC systems as a standard.
Without Singularity, finding a suitable place to store your Conda environment file on HPC can be challenging. The /home
space typically has limits on the number of files it can contain, which may not be sufficient to host all required packages. While /scratch
and /vast
can accommodate many files, these locations are regularly wiped clean. You wouldn't want to repeatedly recreate your Conda environment!
Singularity acts like a container. It appears as a single file to the HPC file system but can contain numerous files within. This allows you to store your Singularity container under /home
without concerns about hitting file limits.
You can even share a copy of your Singularity .ext3
file with others to ensure reproducibility! Your colleagues won't need to reinstall the packages you used, which may no longer be available in the future. However, note that the .ext3
file can be quite large, depending on the initial size you requested.
This section is primarily adapted from the NYU HPC website.
mkdir /scratch/<NetID>/pytorch-example # make directory
cd /scratch/<NetID>/pytorch-example # change the current directory to this place
You can browse available images using ls
to see the options available:
ls /scratch/work/public/overlay-fs-ext3
In this example, we'll use overlay-15GB-500K.ext3.gz
as it provides sufficient storage for most Conda environments. It offers 15GB of free space and can hold up to 500K files. You can choose a different size if needed, but remember that the overlay image cannot be easily modified later. It's recommended to select one slightly larger than your current needs.
cp -rp /scratch/work/public/overlay-fs-ext3/overlay-15GB-500K.ext3.gz .
gunzip overlay-15GB-500K.ext3.gz
Choosing a Singularity image is akin to selecting an operating system to run your code. You can change this each time you access the overlay file.
For this example, we will use the following image:
/scratch/work/public/singularity/cuda11.6.124-cudnn8.4.0.27-devel-ubuntu20.04.4.sif
To view available Singularity images on NYU HPC Greene, you can check the singularity images folder:
ls /scratch/work/public/singularity/
For the most recent supported versions, refer to the TensorFlow website.
singularity exec --overlay overlay-15GB-500K.ext3:rw /scratch/work/public/singularity/cuda11.6.124-cudnn8.4.0.27-devel-ubuntu20.04.4.sif /bin/bash
The above command starts a bash shell inside the specified Singularity container, overlaid with the 15GB, 500K file system you set up earlier. This setup provides the illusion of having a writable filesystem inside what is typically a read-only Singularity container.
Run these commands:
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b -p /ext3/miniconda3
You can then remove the installation file to save space:
rm Miniconda3-latest-Linux-x86_64.sh
Use nano to create and open the file:
nano /ext3/env.sh
Inside nano, enter the following content for the wrapper script. This script will activate your Conda environment where you can install your packages and dependencies:
#!/bin/bash
unset -f which
source /ext3/miniconda3/etc/profile.d/conda.sh
export PATH=/ext3/miniconda3/bin:$PATH
export PYTHONPATH=/ext3/miniconda3/bin:$PATH
To save the file, press Ctrl+O
, then Enter
to save the file, and Ctrl+X
to exit.
Run this to activate the environment:
source /ext3/env.sh
Now that your environment is activated, you can update and install packages
conda update -n base conda -y
conda clean --all --yes
conda install pip -y
conda install ipykernel -y # Note: ipykernel is required to run as a kernel in the Open OnDemand Jupyter Notebooks
To confirm that your environment is appropriately referencing your Miniconda installation, try out the following:
unset -f which
which conda
# output: /ext3/miniconda3/bin/conda
which python
# output: /ext3/miniconda3/bin/python
python --version
# output: Python 3.8.5
which pip
# output: /ext3/miniconda3/bin/pip
exit
# exit Singularity
You may now install packages into the environment with either the pip install or conda install commands.
First, start an interactive job with adequate compute and memory resources to install packages. The login nodes restrict memory to 2GB per user, which may cause some large packages to crash.
srun --cpus-per-task=2 --mem=10GB --time=04:00:00 --pty /bin/bash # request a compute node
# wait to be assigned a node
singularity exec --overlay overlay-15GB-500K.ext3:rw /scratch/work/public/singularity/cuda11.6.124-cudnn8.4.0.27-devel-ubuntu20.04.4.sif /bin/bash
source /ext3/env.sh # activate the environment
Option 1: pip install (it works, but not recommended)
You can install PyTorch using pip
as an example:
pip3 install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu116
pip3 install jupyter jupyterhub pandas matplotlib scipy scikit-learn scikit-image Pillow
Option 2: create a new conda env with specified packages (recommended)
The following sections will be based on this.
I recommend using conda to ensure compatibility among packages. Execute this command to create a Conda environment named pytorch-ac8888
with several essential packages installed:
conda create -n pytorch-ac8888 pytorch jupyter jupyterhub pandas matplotlib scipy scikit-learn scikit-image Pillow
After the environment is created, activate it using:
conda activate pytorch-ac8888
You might wonder why we create a Conda environment within a .ext3
file, which is already dedicated to this project. The reason is that you may need multiple environments for different analyses due to package incompatibilities. Also, having separate environments allows you to duplicate them as backups in case upgrading some packages disrupts the entire environment. This approach ensures flexibility and reproducibility in your HPC workflows.
Refer to the Conda documentation for more details on managing Conda environments.
Option 3: recreate a Conda env from .yaml
file (may not alway work)
You can export a .yaml
file listing all the packages and versions you use on your local computer, upload it to HPC, and use it to recreate a Conda environment. This method replicates the exact same environment on HPC.
-
Use Globus to upload the
.yaml
file to/scratch/<NetID>/pytorch-example/
. -
Run the following command on HPC:
conda env create -f /scratch/<NetID>/pytorch-example/pytorch-ac8888_20240711.yaml
However, note that this approach may not always work perfectly. Some packages installed on your local machine might be incompatible on HPC. In such cases, you may need to relax constraints on package versions or exclude unnecessary packages from the .yaml
file.
You can check the available space left on your image using the following command:
find /ext3 | wc -l
# output: should be something like 222076
du -sh /ext3
# output should be something like 6.0G /ext3
Now, exit the Singularity container and then rename the overlay image. Typing exit
and hitting enter will exit the Singularity container if you are currently inside it. You can tell if you're in a Singularity container because your prompt will be different, such as showing the prompt Singularity>
exit
mv overlay-15GB-500K.ext3 my_pytorch.ext3
Test your PyTorch Singularity Image
singularity exec --overlay /scratch/<NetID>/pytorch-example/my_pytorch.ext3:ro /scratch/work/public/singularity/cuda11.6.124-cudnn8.4.0.27-devel-ubuntu20.04.4.sif /bin/bash -c 'source /ext3/env.sh; conda activate pytorch-ac8888; python -c "import torch; print(torch.__file__); print(torch.__version__)"'
#output: /ext3/miniconda3/envs/pytorch-ac8888/lib/python3.12/site-packages/torch/__init__.py
#output: 2.3.1.post100
Note that by default, you are accessing the image with the :ro
(read-only) flag, which prevents accidental modifications to the packages in your environment. This is recommended when executing your scripts to maintain consistency.
If you need to make further modifications to the environment, you will need to change the flag to :rw
(read-write). This allows you to write changes to the Singularity container, such as installing new packages or updating existing ones.
To simplify the steps required to access the Conda environment in the image, you can create a .bash
file on your computer named run-pytorch-ac8888.bash
and upload it to the directory /scratch/<NetID>/pytorch-example/
. The file should resemble this example.
#!/bin/bash
args=''
for i in "$@"; do
i="${i//\\/\\\\}"
args="${args} \"${i//\"/\\\"}\""
done
if [ "${args}" == "" ]; then args="/bin/bash"; fi
if [[ -e /dev/nvidia0 ]]; then nv="--nv"; fi
singularity \
exec \
--nv --overlay /scratch/<NetID>/pytorch-example/my_pytorch.ext3:ro \
/scratch/work/public/singularity/cuda11.6.124-cudnn8.4.0.27-devel-ubuntu20.04.4.sif \
/bin/bash -c "
unset -f which
source /opt/apps/lmod/lmod/init/sh
source /ext3/env.sh
conda activate pytorch-ac8888
${args}
"
To make this script executable, run this command:
chmod 755 /scratch/<NetID>/pytorch-example/run-pytorch-ac8888.bash
Now you can access to the Conda environment simply by running:
/scratch/<NetID>/pytorch-example/run-pytorch-ac8888.bash
Check which environment is currently activated:
conda env list
## Output:
# # conda environments:
# #
# base /ext3/miniconda3
# pytorch-ac8888 * /ext3/miniconda3/envs/pytorch-ac8888
The *
sign indicates the currently activated environment.
Here is a simple Python script print_odd_even.py
for demonstration purposes, which can be downloaded here:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import sys
def print_odd_number(n):
print(str(n)+" is an odd number")
def print_even_number(n):
print(str(n)+" is an even number")
if __name__ == "__main__":
# Check if the correct number of arguments are provided
if len(sys.argv) != 2:
print("Usage: python script.py arg1")
sys.exit(1)
# Extract command-line arguments
n = int(sys.argv[1])
if n%2==0:
print_even_number(n)
else:
print_odd_number(n)
sys.exit(0)
Now you can upload it to HPC /scratch/<NetID>/pytorch-example/
and then run the following command:
/scratch/<NetID>/pytorch-example/run-pytorch-ac8888.bash python /scratch/<NetID>/pytorch-example/print_odd_even.py 23
# Output: 23 is an odd number
There are 4 chunks in this line. To break it down:
/scratch/<NetID>/pytorch-example/run-pytorch-ac8888.bash
calls the.bash
script file.python
calls the Python interpreter./scratch/<NetID>/pytorch-example/print_odd_even.py
is the Python script you are executing. It corresponds tosys.argv[0]
in the Python script.23
is the numerical input (n
) for the function. In this example, you can replace23
with any other integer. It corresponds tosys.argv[1]
in the Python script.
Note that all sys.argv[_]
inputs are automatically read as strings, so numerical inputs need to be converted to integers using int()
in your .py
script (or float using float()
) .
One of the biggest advantages of using HPC is that you can run the same script on multiple nodes in parallel. Using job array you may submit many similar jobs with almost identical job requirement. This can be easily achieved using SLURM batch jobs:
#!/bin/bash
#SBATCH --job-name=testrun # The name of the job
#SBATCH --nodes=1 # Request 1 compute node per job instance
#SBATCH --cpus-per-task=1 # Request 1 CPU per job instance
#SBATCH --mem=2GB # Request 2GB of RAM per job instance
#SBATCH --time=00:10:00 # Request 10 mins per job instance
#SBATCH --output=/scratch/<NetID>/pytorch-example/slurm_output/out_%A_%a.out # The output will be saved here. %A will be replaced by the slurm job ID, and %a will be replaced by the SLURM_ARRAY_TASK_ID
#SBATCH --mail-user=<NetID>@nyu.edu # Email address
#SBATCH --mail-type=END # Send an email when all the instances of this job are completed
module purge # unload all currently loaded modules in the environment
/scratch/<NetID>/pytorch-example/run-pytorch-ac8888.bash python /scratch/<NetID>/pytorch-example/print_odd_even.py $SLURM_ARRAY_TASK_ID
Save it as sbatch_pytorch-ac8888.s
and upload it to /scratch/<NetID>/pytorch-example/
.
You can modify the requested resources as needed. However, requesting more resources will increase the queue time. There are also limits on the resources you can request; please refer to this link.
Since the script will save output files under /scratch/<NetID>/pytorch-example/slurm_output/
, ensure you create that folder beforehand by running mkdir /scratch/<NetID>/pytorch-example/slurm_output
.
Now you can execute the script by running this command:
sbatch --array=0-99 /scratch/<NetID>/pytorch-example/sbatch_pytorch-ac8888.s
# Output: Submitted batch job 48368654
The --array=0-99
option means that there will be 100 instances of the script executed, with inputs ranging from 0 to 99. Each instance will be assigned a number stored in $SLURM_ARRAY_TASK_ID
, and then passed onto the print_odd_even.py
script as input variable sys.argv[1]
or n
.
The job ID is 48368654
.
You can check the status of all your jobs using this command:
squeue -u <NetID>
# Output:
# JOBID PARTITION NAME USER ST TIME NODES
# 48368654_[0-99] short,cs, testrun <NetID> PD 0:00 1
Here are some other useful SLURM commands that can help you get information about running and pending jobs, which can guide you in requesting just enough resources next time.
# detailed information for a job:
scontrol show jobid -dd <JobID>
# show status of a currently running job
# (see 'man sstat' for other available JOB STATUS FIELDS)
sstat --format=TresUsageInMax%80,TresUsageInMaxNode%80 -j <JobID> --allsteps
# get stats for completed jobs
# (see 'man sacct' for other JOB ACCOUNTING FIELDS)
sacct --units=G --format=User,JobID%24,Jobname,state,time,start,end,elapsed,TotalCPU,ReqMem,MaxRss,MaxVMSize,nnodes,ncpus,nodelist -j <JobID>
# the same information for all jobs of a user:
sacct --units=G --format=User,JobID%24,Jobname,state,time,start,end,elapsed,TotalCPU,ReqMem,MaxRss,MaxVMSize,nnodes,ncpus,nodelist -u <NetID>
When the all the jobs are done, you will receive an email titled something like this Slurm Array Summary Job_id=48368654_* (48368654) Name=testrun Ended, COMPLETED, ExitCode [0-0]
, where ExitCode 0
means there's no error! ExitCode 1
means an error. ExitCode [0-1]
means some instances have no error while some have.
You can list all the output files in the /scratch/<NetID>/pytorch-example/slurm_output/
directory using the ls
command:
ls /scratch/<NetID>/pytorch-example/slurm_output/
# out_48368654_0.out out_48368654_25.out out_48368654_40.out out_48368654_56.out out_48368654_71.out out_48368654_87.out
# out_48368654_10.out out_48368654_26.out out_48368654_41.out out_48368654_57.out out_48368654_72.out out_48368654_88.out
# out_48368654_11.out out_48368654_27.out out_48368654_42.out out_48368654_58.out out_48368654_73.out out_48368654_89.out
# out_48368654_12.out out_48368654_28.out out_48368654_43.out out_48368654_59.out out_48368654_74.out out_48368654_8.out
# ...
# (There should be a total of 100 files, each following the naming scheme out_[slurm job ID]_[SLURM_ARRAY_TASK_ID].out.)
You can view the content of a file using the cat
command to check the output of our Python script:
cat /scratch/<NetID>/pytorch-example/slurm_output/out_48368654_0.out
# Output: 0 is an even number
cat /scratch/<NetID>/pytorch-example/slurm_output/out_48368654_23.out
# Output: 23 is an odd number
Before submitting a GPU job, make sure that your package is GPU-compatible, as many Python packages have different versions for GPU and CPU-only usage. For instance, in PyTorch, you need to install:
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
See more here.
As of Jan 25, 2025, NYU HPC currently only support up to CUDA 12.1
You can add this line into your .s
sbatch file:
#SBATCH --gres=gpu:2 # requesting 2 GPUs
However, unless your job specifically requires GPU usage and you are highly experienced in utilizing it, I do not recommend requesting GPUs for the following reasons:
-
Long Queue Times: Queue times for GPU jobs are typically very long because GPUs are in high demand. Even with the faster computational speed of GPUs, the extended queue times often negate their benefits.
-
Termination for Low GPU Usage: NYU HPC may terminate jobs that do not effectively utilize GPUs. This could require trial-and-error to optimize GPU usage, resulting in prolonged queue times for your jobs.
When the GPU job is running, use this command to access the job's node:
srun --jobid=<JOB_ID> --pty bash
Then use this command to monitor GPU performance in real time.
watch -n 1 nvidia-smi
You will see something like this, where the memory usage is 235 MiB / 46080 MiB, and the GPU utilization rate is 20%:
Every 1.0s: nvidia-smi
Sat Jan 25 08:51:54 2025
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.154.05 Driver Version: 535.154.05 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 Quadro RTX 8000 On | 00000000:D8:00.0 Off | 0 |
| N/A 39C P0 61W / 250W | 235MiB / 46080MiB | 20% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 1020293 C python 232MiB |
+---------------------------------------------------------------------------------------+
Your GPU job will be cancelled if the utility rate remains too low for an extended period (e.g., 2 hours). In pytorch (and most of the neural network models), you can often improve the utility rate by:
- Increasing the batch size,
- Increasing model complexity, and
- Ensuring that data loading is not a bottleneck (see this guide).
However, optimizing the utility rate can be challenging, requiring extensive trial-and-error, engineering effort, and waiting in the queue. For this reason, I sometimes recommend running jobs on multi-core CPUs as an alternative.
Another hacking strategy is to design your job as a "relay race" so that a cancelled job can be resumed. Most ML toolkits support resuming training from a previous checkpoint. After submitting your first job, use the --dependency=afternotok:<JobID>
command (see this guide) to schedule a follow-up job that runs only if the previous one gets cancelled.
Additionally, requesting shorter job durations (e.g., 2 hours) can significantly reduce GPU queue times. With this relay strategy, the shorter duration becomes less critical since subsequent jobs will pick up where the cancelled one left off.
Let me know if you'd like further refinements!
HPC systems offer a wide range of capabilities. Here, I'll cover the ones I've worked with before.
HPC systems typically use a module system to load most software into a user’s environment. This approach is particularly useful for tasks involving audio and/or video file processing.
You can use module avail
to check the available modules on an HPC. On Greene, for example, there are hundreds of modules available.
module avail
# Output:
#-------------------------------------------------------- /share/apps/modulefiles ---------------------------------------------------------
# abyss/intel/2.3.0 google-chrome/87.0.4280.88 nwchem/openmpi/intel/7.2.0
# admixtools/intel/7.0.2 google-cloud-sdk/357.0.0 octopus/openmpi/intel/20240311
# admixture/1.3.0 google-cloud-sdk/379.0.0 octopus/openmpi/intel/20240323
# advanpix-mct/4.9.3.15018 googletest/1.10.0 onetbb/intel/2020.3
#...
For example, if your Python packages related to audio processing require libsndfile
, you can load it using module load libsndfile/intel/1.0.31
. Insert this command after module purge in your .sbatch
file like this:
module purge
module load libsndfile/intel/1.0.31
Yes, you can. Setting up MATLAB on HPC is usually easier than Python, as it typically does not involve Singularity and Conda. You only need to use the module command to load MATLAB.
Make a few modifications in your .sbatch
file right below the #SBATCH
section:
module purge
module load matlab/2023b # There are many other versions available on Greene you can choose from
matlab -nodisplay -r "your_matlab_function($SLURM_ARRAY_TASK_ID); exit;"
Note that executing a MATLAB script will always return "COMPLETED, ExitCode [0]", regardless of whether it crashed or not. Therefore, make sure to check the SLURM output files for accurate status.
Yes, you can use it via Open OnDemand! Here are the instructions.
I recommend using this GUI to experiment directly with your scripts while utilizing HPC's resources. It can be more effective than testing on your local computer with limited resources or running debugging cycles each time you execute a .py
file. Once you have a working pipeline, you can reorganize it into a .py
file for scaling up.
If you have a job that needs to be executed in 100k instances of the print_odd_even.py
script, ranging from 0 to 99,999, and each instance only takes 0.1 seconds to run, it may not be a good idea to request --array=0-99999
as there will be 100k output files! You want to find a balance among quantity, speed, and complexity.
You may want to tweak the script as follows and request sbatch
job with --array=0-99
:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import sys
def print_odd_number(n):
print(str(n)+" is an odd number")
def print_even_number(n):
print(str(n)+" is an even number")
if __name__ == "__main__":
# Check if the correct number of arguments are provided
if len(sys.argv) != 2:
print("Usage: python script.py arg1")
sys.exit(1)
# Extract command-line arguments
n = int(sys.argv[1])
# This section has been modified
for i in range(n*1000, (n+1)*1000):
if i%2==0:
print_even_number(i)
else:
print_odd_number(i)
sys.exit(0)
This new script makes each job instance loop through 1000 iterations per n
. When n=0
, this instance will loop i
from 0 to 999. When n=1
, this instance will loop i
from 1000 to 1999, and so on. Therefore, it is equivalent to running 100k brief instances in parallel, but it is much more manageable with only 100 job instances.
Although the execution time per instance is 1000 times longer, the difference is just 0.1 s vs. 100 s. Also, the total queue time and required resources of 100k jobs is almost certainly much longer and larger than 100 jobs!
To start a new job after the completion of a previous job, you can use the job dependencies
feature in SLURM. This allows you to set up a dependency between jobs, ensuring the subsequent job only starts after the previous one completes successfully.
General Syntax:
sbatch --dependency=<type:job_id[:job_id][,type:job_id[:job_id]]> ...
Example worksflow
[<NetID>@log-3 ~]$ sbatch job1.sh
Submitted job 3345678
[<NetID>@log-3 ~]$ sbatch --dependency=afterok:3345678 job2.sh
In this example:
- The first job (
job1.sh
) is submitted and assigned job ID3345678
. - The second job (
job2.sh
) is submitted with a dependency on the successful completion (afterok
) of job3345678
.
For more details on job dependency types, see the official documentation.
If you need to submit a job to begin at a specific time (note: this specifies the submission time, not the guaranteed execution time), you can use the --begin
option.
[<NetID>@log-3 ~]$ sbatch --begin=9:42:00 job1.sh
In this example, the job job1.sh
is scheduled to start at 9:42 AM.
Keep in mind that the execution depends on resource availability in the queue system, so actual job start times may vary.
Yes, you can. You can use the subprocess
function to execute your Git commands.
import subprocess
subprocess.run(['git', 'add', 'your/path/to/be/committed'], check=True)
subprocess.run(['git', 'commit', '-m', 'your commit message'], check=True)
subprocess.run(['git', 'push'], check=True)
The check
parameter is a boolean value that indicates whether to check the return code of the subprocess. If check
is set to True
and the return code is non-zero, a CalledProcessError
will be raised.
Note that when running an array job with Git commands targeting the same directory, it may cause conflicts, as multiple jobs may execute the Git commands simultaneously. Make sure to add different paths to avoid this.
As NYU HPC regularly removes files that have not been accessed for a while, it is important to correctly preserve your data. For projects that have been completed, you should tar the files and deposit them in /archive
or on your local computer.
You can use this command to check which files have not been accessed, modified, or created in over 60 days.
find $SCRATCH -atime +60 -mtime +60 -ctime +60 -type f
(You can replace $SCRATCH
with other roots, and/or replace +60
with other days.)
However, for projects you are still working on, tarring all the files and then untarring them may not be the best approach. Instead, I wrote a python
script that automatically accesses all the files in a path, which will reset the access time for each file. Note that this approach will only change the files' access time, not the modification time. Additionally, here is a sbatch
script to run the python
script.
submitit
is a Python package designed for submitting your HPC job from your local computer via Python. Note that as this package was developed by Facebook (Meta) primarily for their internal use, it may not be applicable to other HPC systems. I have never tried it myself, but let me know if it works on the NYU HPC. Thank Jean-Rémi King for suggesting this package!singuconda
is a package that makes setting upSingularity
overlays withminiconda
much easier and faster. Thank Bea Steers for making this package and Iran R. Roman for the recommendation!
I recommend this paper: Alnasir, J. J. (2021). Fifteen quick tips for success with HPC, ie, responsibly BASHing that Linux cluster. PLOS Computational Biology, 17(8), e1009207..
Also check out the NYU HPC's website on this topic here.
Shout out to NYU HPC's technical staff. They are knowledgeable, friendly, patient, and very willing to teach me. A huge thanks to them!!