NYU provides the Greene high-performance computing (HPC) cluster, a powerful system that supports research across the university. The Greene HPC cluster is a collection of compute nodes with an aggregate performance (CPUs and GPUs) exceeding four petaflops. (That means a lot when the computing is heavy!)

Several types of jobs can be run on the Greene cluster. The traditional HPC workload is the batch job: we write an SBATCH script specifying how a particular job should run within the cluster, and submit it with a command like sbatch script.sh, which sends the job to the scheduling system (Slurm, which Greene uses).

The job scheduler then takes care of queuing the job, allocating resources, and executing the job on a suitable node in the cluster without further user interaction. (So we could chill and focus on other tasks! 😎)
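
For example, a typical interaction with the scheduler from the Greene login node looks roughly like this (a minimal sketch; script.sh and the job ID are placeholders):

sbatch script.sh      ## submit the batch script; Slurm replies with "Submitted batch job <job_id>"
squeue -u $USER       ## list your pending and running jobs
scancel <job_id>      ## cancel a job if something went wrong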

This article summarizes my experience moving high-compute-demand jobs from our local server to the HPC cluster to save time and energy. The main pieces involved are SBATCH job scripts (.sh) that call MATLAB (.m), Python (.py), or R (.R) scripts.

You can find examples in /scratch/work/public/examples after SSHing into Greene.


Accessing HPC Cluster

SSH using terminal

ssh <NetID>@gw.hpc.nyu.edu ## you can skip if on the NYU Network or NYU VPN
ssh <NetID>@greene.hpc.nyu.edu
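
If you connect often, one convenient option is an entry in ~/.ssh/config on your local machine so that a single command handles both hops (a sketch, assuming a reasonably recent OpenSSH with ProxyJump support; replace <NetID> with your own):

Host greene
    HostName greene.hpc.nyu.edu
    User <NetID>
    ProxyJump <NetID>@gw.hpc.nyu.edu

After that, ssh greene connects through the gateway automatically (and the ProxyJump line can be dropped when you are on the NYU network or VPN).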

Elements of SBATCH Job Script

Features - Allocating Resources

Nodes with distinct resource configurations carry feature tags describing GPU capability, GPU model and GPU memory, hybrid memory, processor name, processor generation, and so on.

Node features are usually requested with the --constraint flag in Slurm sbatch files such as script.sh below:

## This tells the shell how to execute the script
#!/bin/bash

#SBATCH --job-name=MatlabJob

## we ask for a single node, one task on that node, and 20 CPUs for that task
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=20

## --time sets a limit on the total run time of the job allocation (our estimate of how long the job needs).
## If the requested time limit exceeds the partition's time limit, the job will be left in a PENDING state (possibly indefinitely).
#SBATCH --time=48:00:00
#SBATCH --mem=200GB

## These lines manage mail alerts for when the job ends and who the email should be sent to. 
#SBATCH --mail-type=END
#SBATCH --mail-user=<NetID>@nyu.edu ## change it to your own email

## To request one GPU card
#SBATCH --gres=gpu:1

## This places the standard output and error into the same slurm_<job_id>.out
## #SBATCH --output=slurm_%j.out

## Or we could save the output and error logs in a specified folder
#SBATCH --output=output/output_jobname_%A.out # change the jobname
#SBATCH --error=error/error_jobname_%A.err

# make the error and output folders if they do not exist yet
# (note: the folders used by --output/--error above must already exist when the job starts,
# so it is safest to create them once before submitting)
error=/scratch/qy775/matlab_greene/error
output=/scratch/qy775/matlab_greene/output

mkdir -p "$error"
mkdir -p "$output"

Note that the available --constraint features may vary across HPC clusters, but the ideas are universal.
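
For instance, to pin a GPU job to particular card types, a constraint can sit next to the GPU request (the feature tags below are illustrative; check the cluster documentation or sinfo output for the tags actually defined on your cluster):

#SBATCH --gres=gpu:1
#SBATCH --constraint=v100|a100    ## accept a node whose GPU is either a V100 or an A100 (illustrative tags)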