#acl +All:read Default

= Basics =

The Grid HTC cluster is designed as a multipurpose computing resource for scientific research; however, it is optimised for high throughput computing (hence HTC). In practice this means we have a large data storage capacity (>5PB), a large number of job slots (>4000), and a large network bandwidth connecting the two. If your work requires large data storage or many single-processor job slots, we provide a good solution. If your work requires inter-job communication (e.g. MPI) or has high memory requirements (>4GB per HT core), then our system is not well suited to you; there may be other solutions available (such as the HPC cluster).

Our batch system is SLURM (Simple Linux Utility for Resource Management). Details about SLURM can be found [[http://slurm.schedmd.com/slurm.html | here]]. A summary of useful SLURM commands can be found [[http://slurm.schedmd.com/pdfs/summary.pdf | here]]. The commands differ from the Gridengine commands; the differences are summarised [[http://slurm.schedmd.com/rosetta.pdf | here]].

== SLURM Setup ==

=== SLURM Queues ===

The default queues (partitions in SLURM language) to submit jobs to are:

|| '''Partition Name''' || '''Default Memory Per CPU (MB)''' || '''Maximum Memory Per CPU (MB)''' || '''Max Wall Clock Time''' ||
|| debug || 1024 || 2048 || 1 hour ||
|| prod || 1024 || 4096 || 4 hours ||

=== Submitting a job ===

To submit a job to the "production" partition (aka queue), using 1 task per node (one job slot):

{{{
sbatch -p prod test_slurm
}}}

{{{
[$] cat test_slurm
#!/bin/sh
hostname
uptime
}}}

Several additional options are available when submitting a job. Clearly stating your job limits will make the job more likely to succeed and may improve turnaround time.

{{{
# Time limit for the job
--time=01:00:00
# Memory in MB; the default limit is 1024MB per core
--mem-per-cpu=1024
# Number of cores per job per node
--ntasks-per-node=1
# Number of compute nodes for the job
--nodes=1
# Name of the job. Default is the JobID
--job-name="hello_test"
# Name of the file for stdout. Default is the JobID
--output=test.out
}}}

Alternatively, the job options can be put in the job script, e.g.

{{{
#!/bin/sh
#SBATCH --partition=prod --qos=general-compute
#SBATCH --time=00:15:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --mem-per-cpu=1024
#SBATCH --job-name="hello_test"
#SBATCH --output=test-srun.out

hostname
uptime

# some SLURM variables
echo "SLURM_JOBID="$SLURM_JOBID
echo "SLURM_JOB_NODELIST="$SLURM_JOB_NODELIST
echo "SLURM_NNODES="$SLURM_NNODES
echo "SLURMTMPDIR="$SLURMTMPDIR
echo "working directory = "$SLURM_SUBMIT_DIR
}}}

Note:
 * Clearly stating your job limits will make the job more likely to succeed and may improve turnaround time.
 * Requesting more than 1 node or more than 1 core will lead to much longer turnaround times. The Grid cluster is optimised for single-core, single-node high throughput computing (HTC). If you have many independent single-core tasks, see the job array sketch below.

#=== MPI ===
#to run MPI jobs see
#https://www.open-mpi.org/faq/?category=slurm
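Since the cluster is optimised for many independent single-core jobs, a SLURM job array can be a convenient way to submit them. The following is a minimal sketch, assuming job arrays are enabled on this SLURM installation; the program name ({{{my_analysis}}}) and input file naming are purely illustrative.

{{{
#!/bin/sh
#SBATCH --partition=prod
#SBATCH --time=01:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --mem-per-cpu=1024
#SBATCH --job-name="array_test"
#SBATCH --output=array_test-%A_%a.out
# Run 10 independent single-core tasks, numbered 0 to 9
#SBATCH --array=0-9

# Each array task gets its own value of SLURM_ARRAY_TASK_ID,
# which can be used to select a different input file per task.
# my_analysis and its input files are hypothetical placeholders.
echo "Array task " $SLURM_ARRAY_TASK_ID " running on " $(hostname)
./my_analysis input_${SLURM_ARRAY_TASK_ID}.dat
}}}

Submit the script once with {{{sbatch}}}; each array task is then scheduled as a separate single-core job.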
== Get information from SLURM ==

A queued or running job can be cancelled:

{{{
scancel <jobid>
}}}

scontrol - used to view the SLURM configuration and state:

{{{
scontrol show job <jobid>
}}}

squeue - view information about jobs located in the SLURM scheduling queue:

{{{
squeue -j <jobid>
squeue -u <username>
}}}

sacct - displays accounting data for all jobs and job steps in the SLURM job accounting log or SLURM database:

{{{
sacct -j <jobid>
}}}

== More complex examples ==

=== Using a GPU ===

Submit a job that needs a GPU (gpu:1 = 1 GPU, maximum 4):

{{{
sbatch -p centos7_gpu --gres=gpu:1 test_gpu.sh
}}}

{{{
[$] cat test_gpu.sh
#!/bin/sh
# The GPUs allocated to the job are listed in CUDA_VISIBLE_DEVICES
echo $CUDA_VISIBLE_DEVICES
nvidia-smi
}}}

=== Example script ===

This is a more complex example showing how various options can be used when submitting a job.

{{{
#!/bin/sh
#SBATCH --job-name="hello_test"

# Create a working directory on the scratch area
WDIR=/data/scratch/tmp/$USER/$SLURM_JOBID
mkdir -p $WDIR
if [ ! -d $WDIR ]
then
  echo $WDIR not created
  exit
fi
cd $WDIR

# Copy data and config files
cp $HOME/Data/FrogProject/FrogFile .

# Put your science-related commands here
/share/apps/runsforever FrogFile

# Copy results back to the home directory
RDIR=$HOME/FrogProject/Results/$SLURM_JOBID
mkdir -p $RDIR
cp NobelPrizeWinningResults $RDIR

# Cleanup
rm -rf $WDIR
}}}

=== Parameter sweep ===

A more complicated example. This will create multiple directories named LJ.$t, where $t is a number, each containing a CONTROL file with a different temperature $t. The commented-out section submits a job from each directory in turn, which should use that directory's control file.

{{{
MINTEMP=0
MAXTEMP=3000
TEMPSTEP=30
SAMPLE="LJ"

for ((t=$MINTEMP;t<=$MAXTEMP;t=t+$TEMPSTEP))
do
  echo "Creating directory for temperature: " $t
  DIR=${SAMPLE}.${t}
  mkdir $DIR
  CONTROLFILE=${DIR}/CONTROL
  # If more complicated, could copy a default control file or something
  echo "Control stuff" >$CONTROLFILE
  echo "more control stuff" >>$CONTROLFILE
  echo "temperature " $t >>$CONTROLFILE
done

#for ((t=$MINTEMP;t<=$MAXTEMP;t=t+$TEMPSTEP)); do
#  cd ${SAMPLE}.${t}
#  sbatch -p prod myjob.sh
#  cd ..
#done
}}}
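The myjob.sh referred to above is not shown here. A minimal sketch might look like the following, assuming the application (a hypothetical {{{lj_simulation}}} binary) reads the CONTROL file from the directory the job was submitted from; adjust the partition, time and memory limits to your own case.

{{{
#!/bin/sh
#SBATCH --partition=prod
#SBATCH --time=04:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --mem-per-cpu=1024
#SBATCH --job-name="LJ_sweep"

# Jobs start in the directory they were submitted from
# ($SLURM_SUBMIT_DIR), so each job picks up its own CONTROL file.
cd $SLURM_SUBMIT_DIR
# lj_simulation is an illustrative placeholder for your own program
./lj_simulation CONTROL
}}}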