HPC
On selected high-performance computing (HPC) systems, WarpX has documented or even pre-built installation routines. Follow the guides here instead of the generic installation routines for optimal stability and best performance.
warpx.profile
Use a warpx.profile file to set up your software environment without colliding with other software. Ideally, store that file directly in your $HOME/ and source it after connecting to the machine:

source $HOME/warpx.profile

We list example warpx.profile files below, which can be used to set up WarpX on various HPC systems.
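As a rough orientation, such a profile typically just loads modules and exports compiler variables. The sketch below is purely illustrative: every module name and version is a placeholder, since the actual names depend entirely on the machine.

```shell
# Hypothetical example warpx.profile -- all module names/versions are
# placeholders; consult your HPC center's documentation for real ones.
module purge
module load gcc/12.2.0       # C/C++ compiler
module load cmake/3.24.3     # build system
module load openmpi/4.1.4    # MPI library
module load cuda/12.0        # GPU toolkit (GPU systems only)

# make the chosen compilers visible to CMake
export CC=$(which gcc)
export CXX=$(which g++)
```

Sourcing this file in each new shell (or from your job scripts) ensures that WarpX is built and run with a consistent environment.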
HPC Machines
This section documents quick-start guides for a selection of supercomputers that WarpX users are active on.
Tip
Is your HPC system not in the list? Open an issue and together we can document it!
Batch Systems
HPC systems use a scheduling (“batch”) system for time sharing of computing resources. The batch system is used to request, queue, schedule and execute compute jobs asynchronously. The individual HPC machines above document job submission example scripts, as templates for your modifications.
In this section, we document a quick reference guide (or cheat sheet) to interact in more detail with the various batch systems that you might encounter on different systems.
Slurm
Slurm is a modern and very popular batch system. It is used at NERSC, on OLCF Frontier, and on many other systems.
Job Submission
sbatch your_job_script.sbatch
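For orientation, a generic Slurm job script could look like the sketch below. The account, partition, resource counts, and executable name are all placeholders; the machine-specific pages above document the real values.

```shell
#!/bin/bash
#SBATCH --job-name=warpx_sim
#SBATCH --account=myproject       # placeholder: your allocation/project
#SBATCH --partition=queueName     # placeholder: machine-specific queue
#SBATCH --time=02:00:00           # walltime (HH:MM:SS)
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=8
#SBATCH --gpus-per-task=1         # GPU flags are machine-specific

# load the same environment that was used to build WarpX
source $HOME/warpx.profile

srun ./warpx inputs               # placeholder executable and input file
```

Submit it with sbatch and monitor it with the commands listed under Job Control below.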
Job Control
interactive job:
  salloc --time=1:00:00 --nodes=1 --ntasks-per-node=4 --cpus-per-task=8
    then run commands on the allocated node(s), e.g. srun "hostname"
    GPU allocation on most machines requires additional flags, e.g. --gpus-per-task=1 or --gres=...
details for my jobs:
  scontrol -d show job 12345
    all details for job with <job id> 12345
  squeue -u $(whoami) -l
    all jobs under my user name
details for queues:
  squeue -p queueName -l
    list full queue
  squeue -p queueName --start
    show start times for pending jobs
  squeue -p queueName -l -t R
    only show running jobs in queue
  sinfo -p queueName
    show online/offline nodes in queue
  sview
    graphical overview (alternative on taurus: module load llview and llview)
  scontrol show partition queueName
    show configuration details of a queue (partition)
communicate with job:
  scancel <job id>
    abort job
  scancel -s <signal number> <job id>
    send signal or signal name to job
  scontrol update timelimit=4:00:00 jobid=12345
    change the walltime of a job
  scontrol update jobid=12345 dependency=afterany:54321
    only start job 12345 after job with id 54321 has finished
  scontrol hold <job id>
    prevent the job from starting
  scontrol release <job id>
    release the job to be eligible for run (after it was set on hold)
LSF
LSF (for Load Sharing Facility) is an IBM batch system. It is used at OLCF Summit, LLNL Lassen, and other IBM systems.
Job Submission
bsub your_job_script.bsub
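A generic LSF job script might look like the following sketch. The project name, node count, launcher arguments, and executable are placeholders; on Summit-like systems the launcher is jsrun, and its resource-set values are machine-specific.

```shell
#!/bin/bash
#BSUB -P myproject               # placeholder: your allocation/project
#BSUB -J warpx_sim               # job name
#BSUB -W 2:00                    # walltime (HH:MM)
#BSUB -nnodes 2                  # Summit-style node request

# load the same environment that was used to build WarpX
source $HOME/warpx.profile

# jsrun resource sets (ranks, cores, GPUs) below are placeholders
jsrun -n 12 -a 1 -c 7 -g 1 ./warpx inputs
```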
Job Control
interactive job:
bsub -P $proj -W 2:00 -nnodes 1 -Is /bin/bash
details for my jobs:
  bjobs 12345
    all details for job with <job id> 12345
  bjobs [-l]
    all jobs under my user name
  jobstat -u $(whoami)
    job eligibility
  bjdepinfo 12345
    job dependencies on other jobs
details for queues:
  bqueues
    list queues
communicate with job:
  bkill <job id>
    abort job
  bpeek [-f] <job id>
    peek into stdout/stderr of a job
  bkill -s <signal number> <job id>
    send signal or signal name to job
  bchkpnt and brestart
    checkpoint and restart job (untested/unimplemented)
  bmod -W 1:30 12345
    change the walltime of a job (currently not allowed)
  bstop <job id>
    prevent the job from starting
  bresume <job id>
    release the job to be eligible for run (after it was set on hold)
PBS
PBS (for Portable Batch System) is a popular HPC batch system. The OpenPBS project is related to PBS, PBS Pro and TORQUE.
Job Submission
qsub your_job_script.qsub
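A generic PBS job script could look like the sketch below. The project, queue, select-statement syntax, and executable are placeholders; PBS resource syntax in particular varies between sites.

```shell
#!/bin/bash
#PBS -N warpx_sim
#PBS -A myproject                      # placeholder: your allocation
#PBS -q queueName                      # placeholder: machine-specific queue
#PBS -l walltime=02:00:00
#PBS -l select=2:ncpus=32:mpiprocs=4   # placeholder; syntax is site-specific

# PBS starts jobs in $HOME; change to the submission directory
cd $PBS_O_WORKDIR

# load the same environment that was used to build WarpX
source $HOME/warpx.profile

mpiexec ./warpx inputs                 # placeholder executable and input file
```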
Job Control
interactive job:
qsub -I
details for my jobs:
  qstat -f 12345
    all details for job with <job id> 12345
  qstat -u $(whoami)
    all jobs under my user name
details for queues:
  qstat -a queueName
    show all jobs in a queue
  pbs_free -l
    compact view on free and busy nodes
  pbsnodes
    list all nodes and their detailed state (free, busy/job-exclusive, offline)
communicate with job:
  qdel <job id>
    abort job
  qsig -s <signal number> <job id>
    send signal or signal name to job
  qalter -lwalltime=12:00:00 <job id>
    change the walltime of a job
  qalter -Wdepend=afterany:54321 12345
    only start job 12345 after job with id 54321 has finished
  qhold <job id>
    prevent the job from starting
  qrls <job id>
    release the job to be eligible for run (after it was set on hold)
PJM
PJM (probably for Parallel Job Manager?) is a Fujitsu batch system. It is used at RIKEN Fugaku and on other Fujitsu systems.
Note
This section is a stub, and improvements to complete the (TODO) entries are welcome.
Job Submission
pjsub your_job_script.pjsub
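Since this section is a stub, the PJM script below is only a rough sketch: the resource-list syntax, node and rank counts, and executable name are placeholders and should be checked against the documentation of your Fujitsu system.

```shell
#!/bin/bash
#PJM -N warpx_sim
#PJM -L "node=2"                 # placeholder: node count
#PJM -L "elapse=02:00:00"        # walltime (HH:MM:SS)
#PJM --mpi "proc=8"              # placeholder: total MPI ranks

# load the same environment that was used to build WarpX
source $HOME/warpx.profile

mpiexec ./warpx inputs           # placeholder executable and input file
```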
Job Control
interactive job:
  pjsub --interact
details for my jobs:
  pjstat
    status of all jobs
  (TODO) all details for job with <job id> 12345
  (TODO) all jobs under my user name
details for queues:
  (TODO) show all jobs in a queue
  (TODO) compact view on free and busy nodes
  (TODO) list all nodes and their detailed state (free, busy/job-exclusive, offline)
communicate with job:
  pjdel <job id>
    abort job
  (TODO) send signal or signal name to job
  (TODO) change the walltime of a job
  (TODO) only start job 12345 after job with id 54321 has finished
  pjhold <job id>
    prevent the job from starting
  pjrls <job id>
    release the job to be eligible for run (after it was set on hold)