Running on specific platforms

Running on Cori KNL at NERSC

The batch script below can be used to run a WarpX simulation on 2 KNL nodes on the supercomputer Cori at NERSC. Replace the descriptions between chevrons <> with relevant values; for instance, <job name> could be laserWakefield.

#!/bin/bash -l

# Copyright 2019 Maxence Thevenet
#
# This file is part of WarpX.
#
# License: BSD-3-Clause-LBNL


#SBATCH -N 2
#SBATCH -t 01:00:00
#SBATCH -q regular
#SBATCH -C knl
#SBATCH -S 4
#SBATCH -J <job name>
#SBATCH -A <allocation ID>
#SBATCH -e error.txt
#SBATCH -o output.txt

export OMP_PLACES=threads
export OMP_PROC_BIND=spread

# KNLs have 4 hyperthreads max
export CORI_MAX_HYPERTHREAD_LEVEL=4
# We use 64 cores out of the 68 available on Cori KNL,
# and leave 4 to the system (see "#SBATCH -S 4" above).
export CORI_NCORES_PER_NODE=64

# Typically use 8 MPI ranks per node without hyperthreading,
# i.e., OMP_NUM_THREADS=8
export WARPX_NMPI_PER_NODE=8
export WARPX_HYPERTHREAD_LEVEL=1

# Compute OMP_NUM_THREADS and the thread count (-c option)
export CORI_NHYPERTHREADS_MAX=$(( ${CORI_MAX_HYPERTHREAD_LEVEL} * ${CORI_NCORES_PER_NODE} ))
export WARPX_NTHREADS_PER_NODE=$(( ${WARPX_HYPERTHREAD_LEVEL} * ${CORI_NCORES_PER_NODE} ))
export OMP_NUM_THREADS=$(( ${WARPX_NTHREADS_PER_NODE} / ${WARPX_NMPI_PER_NODE} ))
export WARPX_THREAD_COUNT=$(( ${CORI_NHYPERTHREADS_MAX} / ${WARPX_NMPI_PER_NODE} ))
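# With the values above (64 cores per node, hyperthread level 1, 8 MPI ranks
# per node), this evaluates to OMP_NUM_THREADS = 64/8 = 8 threads per rank and
# a reservation of -c 256/8 = 32 logical CPUs per rank for the srun call below.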

# for async_io support: (optional)
export MPICH_MAX_THREAD_SAFETY=multiple

srun --cpu_bind=cores -n $(( ${SLURM_JOB_NUM_NODES} * ${WARPX_NMPI_PER_NODE} )) -c ${WARPX_THREAD_COUNT} <path/to/executable> <input file>

To run a simulation, copy the lines above to a file batch_cori.sh and run

sbatch batch_cori.sh

to submit the job.

For a 3D simulation with a few (1-4) particles per cell using the FDTD Maxwell solver on Cori KNL for a well load-balanced problem (in our case, a laser wakefield acceleration simulation in a boosted frame in the quasi-linear regime), the following set of parameters provided good performance (a sketch of matching input parameters follows the list):

  • amr.max_grid_size=64 and amr.blocking_factor=64 so that the size of each grid is fixed to 64**3 (we are not using load-balancing here).

  • 8 MPI ranks per KNL node, with OMP_NUM_THREADS=8 (that is 64 threads per KNL node, i.e. 1 thread per physical core, and 4 cores left to the system).

  • 2 grids per MPI rank, i.e., 16 grids per KNL node.
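
As a concrete illustration of this decomposition (the domain size below is hypothetical, chosen only so that the numbers work out, and is not taken from an actual input file): with amr.max_grid_size = amr.blocking_factor = 64, every grid is exactly 64**3 cells, so a domain of 256x256x128 cells splits into 4x4x2 = 32 grids, i.e., 2 grids for each of the 16 MPI ranks (8 ranks per node on 2 nodes) and 16 grids per node.

amr.n_cell          = 256 256 128   # hypothetical domain size
amr.max_grid_size   = 64
amr.blocking_factor = 64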

Running on Cori Haswell at NERSC

The batch script below can be used to run a WarpX simulation on 1 Haswell node on the supercomputer Cori at NERSC.

#!/bin/bash -l

# Just increase this number if you need more nodes.
#SBATCH -N 1
#SBATCH -t 03:00:00
#SBATCH -q regular
#SBATCH -C haswell
#SBATCH -J <job name>
#SBATCH -A <allocation ID>
#SBATCH -e error.txt
#SBATCH -o output.txt
# one MPI rank per half-socket (see below)
#SBATCH --tasks-per-node=4
# request all logical (virtual) cores per half-socket
#SBATCH --cpus-per-task=16


# each Cori Haswell node has 2 sockets of Intel Xeon E5-2698 v3
# each Xeon CPU is divided into 2 bus rings that each have direct L3 access
export WARPX_NMPI_PER_NODE=4

# each MPI rank per half-socket has 8 physical cores
#   or 16 logical (virtual) cores
# over-subscribing each physical core with 2x
#   hyperthreading leads to a slight (3.5%) speedup
# the settings below make sure threads are close to the
#   controlling MPI rank (process) per half socket and
#   distribute equally over close-by physical cores and,
#   for N>8, also equally over close-by logical cores
export OMP_PROC_BIND=spread
export OMP_PLACES=threads
export OMP_NUM_THREADS=16
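# 4 MPI ranks x 16 threads = 64 logical CPUs, i.e., the full node
#   (2 sockets x 16 physical cores x 2 hyperthreads)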

# for async_io support: (optional)
export MPICH_MAX_THREAD_SAFETY=multiple

EXE="<path/to/executable>"

srun --cpu_bind=cores -n $(( ${SLURM_JOB_NUM_NODES} * ${WARPX_NMPI_PER_NODE} )) ${EXE} <input file>

To run a simulation, copy the lines above to a file batch_cori_haswell.sh and run

sbatch batch_cori_haswell.sh

to submit the job.

For a 3D simulation with a few (1-4) particles per cell using the FDTD Maxwell solver on Cori Haswell for a well load-balanced problem (in our case, a laser wakefield acceleration simulation in a boosted frame in the quasi-linear regime), the following set of parameters provided good performance:

Running on Summit at OLCF

V100 GPUs

The batch script below can be used to run a WarpX simulation on 2 nodes on the supercomputer Summit at OLCF. Replace the descriptions between chevrons <> with relevant values; for instance, <input file> could be plasma_mirror_inputs. Note that the only option so far is to run with one MPI rank per GPU.

#!/bin/bash

# Copyright 2019-2020 Maxence Thevenet, Axel Huebl
#
# This file is part of WarpX.
#
# License: BSD-3-Clause-LBNL
#
# Refs.:
#   https://jsrunvisualizer.olcf.ornl.gov/?s4f0o11n6c7g1r11d1b1l0=
#   https://docs.olcf.ornl.gov/systems/summit_user_guide.html#cuda-aware-mpi

#BSUB -P <allocation ID>
#BSUB -W 00:10
#BSUB -nnodes 2
#BSUB -alloc_flags smt4
#BSUB -J WarpX
#BSUB -o WarpXo.%J
#BSUB -e WarpXe.%J

module load gcc
module load cuda

export OMP_NUM_THREADS=1
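# Layout: one resource set per GPU (Summit nodes have 6 V100 GPUs), i.e.,
#   -r 6 resource sets per node, each with -a 1 MPI rank, -g 1 GPU and
#   -c 7 CPU cores; -b rs binds each rank to its resource set and
#   --smpiargs="-gpu" enables CUDA-aware Spectrum MPI (see the references above).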
jsrun -r 6 -a 1 -g 1 -c 7 -l GPU-CPU -d packed -b rs --smpiargs="-gpu" <path/to/executable> <input file> > output.txt

To run a simulation, copy the lines above to a file batch_summit.sh and run

bsub batch_summit.sh

to submit the job.

For a 3D simulation with a few (1-4) particles per cell using the FDTD Maxwell solver on Summit for a well load-balanced problem (in our case, a laser wakefield acceleration simulation in a boosted frame in the quasi-linear regime), the following set of parameters provided good performance:

  • amr.max_grid_size=256 and amr.blocking_factor=128.

  • One MPI rank per GPU (e.g., 6 MPI ranks for the 6 GPUs on each Summit node)

  • Two `128x128x128` grids per GPU, or one `128x128x256` grid per GPU.

A batch script with more options for profiling on Summit can be found at Summit batch script.

Power9 CPUs

Similar to the above, the batch script below can be used to run a WarpX simulation on 1 node of the supercomputer Summit at OLCF using only the Power9 CPUs (i.e., the GPUs are ignored).

#!/bin/bash

# Copyright 2019-2020 Maxence Thevenet, Axel Huebl, Michael Rowan
#
# This file is part of WarpX.
#
# License: BSD-3-Clause-LBNL
#
# Refs.:
#   https://jsrunvisualizer.olcf.ornl.gov/?s1f0o121n2c21g0r11d1b1l0=

#BSUB -P <allocation ID>
#BSUB -W 00:10
#BSUB -nnodes 1
#BSUB -alloc_flags "smt1"
#BSUB -J WarpX
#BSUB -o WarpXo.%J
#BSUB -e WarpXe.%J


export OMP_NUM_THREADS=21
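# Layout: one resource set per Power9 socket, i.e., -n 2 resource sets in
#   total (-r 2 per node), each with -a 1 MPI rank and -c 21 physical cores,
#   matching the 21 OpenMP threads requested above.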
jsrun -n 2 -a 1 -c 21 -r 2 -l CPU-CPU -d packed -b rs <path/to/executable> <input file> > output.txt

For a 3D simulation with a few (1-4) particles per cell using the FDTD Maxwell solver on Summit for a well load-balanced problem, the following set of parameters provided good performance:

  • amr.max_grid_size=64 and amr.blocking_factor=64

  • Two MPI ranks per node (i.e. 2 resource sets per node; equivalently, 1 resource set per socket)

  • 21 physical CPU cores per MPI rank

  • 21 OpenMP threads per MPI rank (i.e. 1 OpenMP thread per physical core)

  • SMT 1 (Simultaneous Multithreading level 1)

  • Sixteen `64x64x64` grids per MPI rank (with default tiling in WarpX, this results in ~49 tiles per OpenMP thread)
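
For reference, the ~49 tiles per thread can be recovered as follows, assuming WarpX's default (CPU) tile size of 8x8 cells in the two transverse directions: each 64**3 grid then holds 8x8 = 64 tiles, so sixteen grids give 16 x 64 = 1024 tiles shared among 21 OpenMP threads, i.e., roughly 49 tiles per thread.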

Running on Lassen at LLNL

V100 GPUs

The batch script below can be used to run a WarpX simulation on 2 nodes on the supercomputer Lassen at LLNL. Replace the descriptions between chevrons <> with relevant values; for instance, <input file> could be plasma_mirror_inputs. Note that the only option so far is to run with one MPI rank per GPU.

#!/bin/bash

# Copyright 2020 Axel Huebl
#
# This file is part of WarpX.
#
# License: BSD-3-Clause-LBNL
#
# Refs.:
#   https://jsrunvisualizer.olcf.ornl.gov/?s4f0o11n6c7g1r11d1b1l0=
#   https://hpc.llnl.gov/training/tutorials/using-lcs-sierra-system#quick16

#BSUB -G <allocation ID>
#BSUB -W 00:10
#BSUB -nnodes 2
#BSUB -alloc_flags smt4
#BSUB -J WarpX
#BSUB -o WarpXo.%J
#BSUB -e WarpXe.%J

export OMP_NUM_THREADS=1
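# Layout: one resource set per GPU (Lassen nodes have 4 V100 GPUs, hence -r 4),
#   each with -a 1 MPI rank, -g 1 GPU and -c 7 CPU cores;
#   -M "-gpu" passes the flag that enables GPU-aware Spectrum MPI.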
jsrun -r 4 -a 1 -g 1 -c 7 -l GPU-CPU -d packed -b rs -M "-gpu" <path/to/executable> <input file> > output.txt

To run a simulation, copy the lines above to a file batch_lassen.sh and run

bsub batch_lassen.sh

to submit the job.

For a 3D simulation with a few (1-4) particles per cell using the FDTD Maxwell solver on V100 GPUs for a well load-balanced problem (in our case, a laser wakefield acceleration simulation in a boosted frame in the quasi-linear regime), the following set of parameters provided good performance:

  • amr.max_grid_size=256 and amr.blocking_factor=128.

  • One MPI rank per GPU (e.g., 4 MPI ranks for the 4 GPUs on each Lassen node)

  • Two `128x128x128` grids per GPU, or one `128x128x256` grid per GPU.

Running on Quartz at LLNL

Intel Xeon E5-2695 v4 CPUs

The batch script below can be used to run a WarpX simulation on 2 nodes on the supercomputer Quartz at LLNL. Replace the descriptions between chevrons <> with relevant values; for instance, <input file> could be plasma_mirror_inputs.

#!/bin/bash -l

# Just increase this number if you need more nodes.
#SBATCH -N 2
#SBATCH -t 24:00:00
#SBATCH -A <allocation ID>

#SBATCH -J WarpX
#SBATCH -q pbatch
#SBATCH --qos=normal
#SBATCH --license=lustre1,lustre2
#SBATCH --export=ALL
#SBATCH -e error.txt
#SBATCH -o output.txt
# one MPI rank per half-socket (see below)
#SBATCH --tasks-per-node=2
# request all logical (virtual) cores per half-socket
#SBATCH --cpus-per-task=18


# each Quartz node has 1 socket of Intel Xeon E5-2695 v4
# each Xeon CPU is divided into 2 bus rings that each have direct L3 access
export WARPX_NMPI_PER_NODE=2

# each MPI rank per half-socket has 9 physical cores
#   or 18 logical (virtual) cores
# over-subscribing each physical core with 2x
#   hyperthreading led to a slight (3.5%) speedup on Cori's Intel Xeon E5-2698 v3,
#   so we do the same here
# the settings below make sure threads are close to the
#   controlling MPI rank (process) per half socket and
#   distribute equally over close-by physical cores and,
#   for N>9, also equally over close-by logical cores
export OMP_PROC_BIND=spread
export OMP_PLACES=threads
export OMP_NUM_THREADS=18
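# 2 MPI ranks x 18 OpenMP threads = 36 CPUs per node, matching the
#   --tasks-per-node=2 and --cpus-per-task=18 requests above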

EXE="<path/to/executable>"  # e.g. ./warpx

srun --cpu_bind=cores -n $(( ${SLURM_JOB_NUM_NODES} * ${WARPX_NMPI_PER_NODE} )) ${EXE} <input file>

To run a simulation, copy the lines above to a file batch_quartz.sh and run

sbatch batch_quartz.sh

to submit the job.