HPC

On selected high-performance computing (HPC) systems, WarpX has documented or even pre-built installation routines. Follow the guide here instead of the generic installation routines for optimal stability and best performance.

warpx.profile

Use a warpx.profile file to set up your software environment without colliding with other software. Ideally, store that file directly in your $HOME/ and source it after connecting to the machine:

source $HOME/warpx.profile

We list example warpx.profile files below, which can be used to set up WarpX on various HPC systems.
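
A minimal sketch of such a file is shown below; the module names and paths are placeholders and have to be replaced by the ones documented for your machine:

    # warpx.profile: example software environment (module names are placeholders)
    module load gcc       # compiler
    module load openmpi   # MPI library
    module load cuda      # GPU toolkit, only on machines with NVIDIA GPUs
    module load cmake     # build system

    # optional: make self-installed dependencies visible to the CMake build
    export CMAKE_PREFIX_PATH=$HOME/sw/adios2:$CMAKE_PREFIX_PATH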

HPC Machines

This section documents quick-start guides for a selection of supercomputers that WarpX users are active on.

Tip

Your HPC system is not in the list? Open an issue and together we can document it!

Batch Systems

HPC systems use a scheduling (“batch”) system for time sharing of computing resources. The batch system is used to request, queue, schedule and execute compute jobs asynchronously. The guides for the individual HPC machines above include example job submission scripts, which serve as templates for your own modifications.

In this section, we provide a quick reference guide (or cheat sheet) for interacting with the various batch systems that you might encounter on different machines.

Slurm

Slurm is a modern and very popular batch system. It is used at NERSC and OLCF Frontier, among others.

Job Submission

  • sbatch your_job_script.sbatch
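
A minimal Slurm job script, as a hedged sketch only (account, partition, resource counts and executable path are placeholders; prefer the machine-specific templates above where available):

    #!/bin/bash
    #SBATCH --job-name=warpx_sim         # job name shown in squeue
    #SBATCH --account=<your_project>     # project/allocation to charge
    #SBATCH --partition=<queueName>      # partition (queue) to submit to
    #SBATCH --time=02:00:00              # walltime limit
    #SBATCH --nodes=2                    # number of nodes
    #SBATCH --ntasks-per-node=4          # MPI ranks per node
    #SBATCH --cpus-per-task=8            # threads per rank
    #SBATCH --output=%x_%j.out           # stdout file: <job name>_<job id>.out

    # load the software environment prepared in warpx.profile
    source $HOME/warpx.profile

    # srun is Slurm's parallel launcher; the executable name is a placeholder
    srun <path/to/warpx_executable> inputs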

Job Control

  • interactive job:

    • salloc --time=1:00:00 --nodes=1 --ntasks-per-node=4 --cpus-per-task=8

      • e.g. srun "hostname"

    • GPU allocation on most machines requires additional flags, e.g. --gpus-per-task=1 or --gres=...

  • details for my jobs:

    • scontrol -d show job 12345 all details for job with <job id> 12345

    • squeue -u $(whoami) -l all jobs under my user name

  • details for queues:

    • squeue -p queueName -l list full queue

    • squeue -p queueName --start (show start times for pending jobs)

    • squeue -p queueName -l -t R (only show running jobs in queue)

    • sinfo -p queueName (show online/offline nodes in queue)

    • sview (alternative on taurus: module load llview and llview)

    • scontrol show partition queueName

  • communicate with job:

    • scancel <job id> abort job

    • scancel -s <signal number> <job id> send signal or signal name to job

    • scontrol update timelimit=4:00:00 jobid=12345 change the walltime of a job

    • scontrol update jobid=12345 dependency=afterany:54321 only start job 12345 after job with id 54321 has finished

    • scontrol hold <job id> prevent the job from starting

    • scontrol release <job id> release the job to be eligible for run (after it was set on hold)

LSF

LSF (for Load Sharing Facility) is an IBM batch system. It is used at OLCF Summit, LLNL Lassen, and other IBM systems.

Job Submission

  • bsub your_job_script.bsub
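
A minimal LSF job script, as a sketch only (project, resource counts and executable path are placeholders; on Summit-like systems jsrun is the parallel launcher, other LSF machines may use mpirun):

    #!/bin/bash
    #BSUB -J warpx_sim            # job name
    #BSUB -P <your_project>       # project/allocation to charge
    #BSUB -W 2:00                 # walltime limit (hh:mm)
    #BSUB -nnodes 2               # number of nodes
    #BSUB -o warpx_sim.%J.out     # stdout file (%J is the job id)
    #BSUB -e warpx_sim.%J.err     # stderr file

    # load the software environment prepared in warpx.profile
    source $HOME/warpx.profile

    # resource sets (ranks, cores, GPUs) must match the machine's node layout
    jsrun -n 12 -a 1 -c 7 -g 1 <path/to/warpx_executable> inputs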

Job Control

  • interactive job:

    • bsub -P $proj -W 2:00 -nnodes 1 -Is /bin/bash

  • details for my jobs:

    • bjobs 12345 all details for job with <job id> 12345

    • bjobs [-l] all jobs under my user name

    • jobstat -u $(whoami) job eligibility

    • bjdepinfo 12345 job dependencies on other jobs

  • details for queues:

    • bqueues list queues

  • communicate with job:

    • bkill <job id> abort job

    • bpeek [-f] <job id> peek into stdout/stderr of a job

    • bkill -s <signal number> <job id> send signal or signal name to job

    • bchkpnt and brestart checkpoint and restart job (untested/unimplemented)

    • bmod -W 1:30 12345 change the walltime of a job (currently not allowed)

    • bstop <job id> prevent the job from starting

    • bresume <job id> release the job to be eligible for run (after it was set on hold)

PBS

PBS (for Portable Batch System) is a popular HPC batch system. The OpenPBS project is related to PBS, PBS Pro and TORQUE.

Job Submission

  • qsub your_job_script.qsub
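
A minimal PBS job script, as a sketch only (project, queue, resource selection and executable path are placeholders; the select syntax below is PBS Pro style, TORQUE uses -l nodes=... instead):

    #!/bin/bash
    #PBS -N warpx_sim                      # job name
    #PBS -A <your_project>                 # project/allocation to charge
    #PBS -q <queueName>                    # queue to submit to
    #PBS -l walltime=02:00:00              # walltime limit
    #PBS -l select=2:ncpus=32:mpiprocs=4   # 2 nodes, 32 cores and 4 MPI ranks each
    #PBS -j oe                             # join stdout and stderr into one file

    # PBS starts the job in $HOME; change to the submission directory first
    cd $PBS_O_WORKDIR
    source $HOME/warpx.profile

    mpiexec -n 8 <path/to/warpx_executable> inputs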

Job Control

  • interactive job:

    • qsub -I

  • details for my jobs:

    • qstat -f 12345 all details for job with <job id> 12345

    • qstat -u $(whoami) all jobs under my user name

  • details for queues:

    • qstat -a queueName show all jobs in a queue

    • pbs_free -l compact view on free and busy nodes

    • pbsnodes list all nodes and their detailed state (free, busy/job-exclusive, offline)

  • communicate with job:

    • qdel <job id> abort job

    • qsig -s <signal number> <job id> send signal or signal name to job

    • qalter -lwalltime=12:00:00 <job id> change the walltime of a job

    • qalter -Wdepend=afterany:54321 12345 only start job 12345 after job with id 54321 has finished

    • qhold <job id> prevent the job from starting

    • qrls <job id> release the job to be eligible for run (after it was set on hold)

PJM

PJM (probably for Parallel Job Manager?) is a Fujitsu batch system. It is used at RIKEN Fugaku and on other Fujitsu systems.

Note

This section is a stub and improvements to complete the (TODO) sections are welcome.

Job Submission

  • pjsub your_job_script.pjsub
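
A rough sketch of a PJM job script only (resource group, node count and executable path are placeholders; check the machine documentation, e.g. for Fugaku, for the exact directives):

    #!/bin/bash
    #PJM -L "rscgrp=<resource_group>"   # resource group (queue) to submit to
    #PJM -L "node=2"                    # number of nodes
    #PJM -L "elapse=02:00:00"           # walltime limit
    #PJM --mpi "max-proc-per-node=4"    # MPI ranks per node
    #PJM -s                             # output job statistics

    # load the software environment prepared in warpx.profile
    source $HOME/warpx.profile

    mpiexec <path/to/warpx_executable> inputs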

Job Control

  • interactive job:

    • pjsub --interact

  • details for my jobs:

    • pjstat status of all jobs

    • (TODO) all details for job with <job id> 12345

    • (TODO) all jobs under my user name

  • details for queues:

    • (TODO) show all jobs in a queue

    • (TODO) compact view on free and busy nodes

    • (TODO) list all nodes and their detailed state (free, busy/job-exclusive, offline)

  • communicate with job:

    • pjdel <job id> abort job

    • (TODO) send signal or signal name to job

    • (TODO) change the walltime of a job

    • (TODO) only start job 12345 after job with id 54321 has finished

    • pjhold <job id> prevent the job from starting

    • pjrls <job id> release the job to be eligible for run (after it was set on hold)
