HPC Systems
On selected high-performance computing (HPC) systems, WarpX has documented or even pre-built installation routines. Follow the guide here instead of the generic installation routines for optimal stability and best performance.
warpx.profile
Use a warpx.profile file to set up your software environment without colliding with other software.
Ideally, store that file directly in your $HOME/ and source it after connecting to the machine:
source $HOME/warpx.profile
We list example warpx.profile files below, which can be used to set up WarpX on various HPC systems.
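As a rough sketch, a warpx.profile typically exports a project account and loads the machine's software modules; the module names and versions below are placeholders only, so use the exact list from your machine's page further down.

# warpx.profile: sketch of a software environment file
# (module names/versions are placeholders; see your machine's page)

# project account to charge jobs to (placeholder)
export proj=<yourProjectID>

# load the WarpX dependencies provided by the machine (machine specific!)
module load cmake
module load gcc
module load cuda
module load openmpi

# compiler environment hints for CMake
export CC=$(which gcc)
export CXX=$(which g++)
export CUDACXX=$(which nvcc)
export CUDAHOSTCXX=$(which g++)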
HPC Machines
This section documents quick-start guides for a selection of supercomputers that WarpX users are active on.
- Adastra (CINES)
- Aurora (ALCF)
- Crusher (OLCF)
- Frontier (OLCF)
- Fugaku (Riken)
- Great Lakes (UMich)
- HPC3 (UCI)
- Juwels (JSC)
- Karolina (IT4I)
- Lassen (LLNL)
- Lawrencium (LBNL)
- Leonardo (CINECA)
- Lonestar6 (TACC)
- LUMI (CSC)
- LXPLUS (CERN)
- Ookami (Stony Brook)
- Perlmutter (NERSC)
- Pitzer (OSC)
- Polaris (ALCF)
- Dane (LLNL)
- Summit (OLCF)
- Taurus (ZIH)
- Tuolumne (LLNL)
Tip
Your HPC system is not in the list? Open an issue and together we can document it!
Batch Systems
HPC systems use a scheduling (“batch”) system for time sharing of computing resources. The batch system is used to request, queue, schedule and execute compute jobs asynchronously. The individual HPC machines above document job submission example scripts, as templates for your modifications.
This section provides a quick reference guide (cheat sheet) for interacting in more detail with the various batch systems that you might encounter on different machines.
Slurm
Slurm is a modern and very popular batch system. It is used at NERSC and on OLCF Frontier, among others.
Job Submission
sbatch your_job_script.sbatch
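A minimal your_job_script.sbatch could look like the following sketch; the account, partition, GPU flags, and executable name are placeholders that you should adapt from your machine's template.

#!/bin/bash
# example Slurm batch script (sketch only; values below are placeholders)
#SBATCH --job-name=warpx
#SBATCH --account=<yourProjectID>
#SBATCH --partition=<queueName>
#SBATCH --time=01:00:00
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=8
#SBATCH --gpus-per-task=1

# launch WarpX (executable and inputs file are placeholders)
srun ./warpx.exe inputs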
Job Control
interactive job (see the sketch after this list for a complete example):
- salloc --time=1:00:00 --nodes=1 --ntasks-per-node=4 --cpus-per-task=8
- then launch commands inside the allocation, e.g. srun "hostname"
- GPU allocation on most machines requires additional flags, e.g. --gpus-per-task=1 or --gres=...
details for my jobs:
- scontrol -d show job 12345 (all details for job with <job id> 12345)
- squeue -u $(whoami) -l (all jobs under my user name)
details for queues:
- squeue -p queueName -l (list full queue)
- squeue -p queueName --start (show start times for pending jobs)
- squeue -p queueName -l -t R (only show running jobs in queue)
- sinfo -p queueName (show online/offline nodes in queue)
- sview (alternative on taurus: module load llview and llview)
- scontrol show partition queueName
communicate with job:
- scancel <job id> (abort job)
- scancel -s <signal number> <job id> (send signal or signal name to job)
- scontrol update timelimit=4:00:00 jobid=12345 (change the walltime of a job)
- scontrol update jobid=12345 dependency=afterany:54321 (only start job 12345 after job with id 54321 has finished)
- scontrol hold <job id> (prevent the job from starting)
- scontrol release <job id> (release the job to be eligible for run after it was set on hold)
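The interactive-job sketch referenced above might look like this; the account, partition, and GPU flags are placeholders that vary by machine.

# request one node for one hour, with GPU flags as needed (placeholder values)
salloc --time=1:00:00 --nodes=1 --ntasks-per-node=4 --cpus-per-task=8 \
       --gpus-per-task=1 --account=<yourProjectID> --partition=<queueName>

# then launch commands inside the allocation with srun
srun ./warpx.exe inputs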
Flux
Flux is a modern batch system and resource-manager framework. It is used at LLNL's Livermore Computing (LC) center, among others.
Job Submission
flux batch your_job_script.flux
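A minimal your_job_script.flux might look like the following sketch, using Flux's batch directive lines that start with # flux: ; the queue name and executable are placeholders.

#!/bin/bash
# example Flux batch script (sketch only; queue/executable are placeholders)
# flux: --queue=<queueName>
# flux: --time-limit=1h
# flux: --nodes=2
# flux: --tasks-per-node=4
# flux: --cores-per-task=8
# flux: --gpus-per-task=1

# launch WarpX inside the allocation
flux run ./warpx.exe inputs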
Job Control
submit a job:
- flux submit --time-limit=1:00:00 --nodes=1 --tasks-per-node=4 --cores-per-task=8
- e.g. flux submit "hostname"
- GPU allocation requires additional flags, e.g. --gpus-per-task=1
details for my jobs:
- flux jobs (all jobs under my user name)
- flux job info abc123 jobspec (all details for job with <job id> abc123)
- flux job info 12345 eventlog (history of events for job with <job id> 12345)
details for queues:
- flux queue list (list all queues)
- flux queue status (show status of queues)
- unclear/TODO (show start times for pending jobs)
- unclear/TODO (show online/offline nodes in queue)
communicate with job:
- flux cancel <job id> (abort job)
- flux job kill --signal=<signal number> <job id> (send signal or signal name to job)
- unclear/TODO (change the walltime of a job)
- unclear/TODO (only start job 12345 after job with id 54321 has finished)
- flux job urgency <job id> hold (prevent the job from starting)
- flux job urgency <job id> default (release the job to be eligible for run after it was set on hold)
LSF
LSF (for Load Sharing Facility) is an IBM batch system. It is used at OLCF Summit, LLNL Lassen, and other IBM systems.
Job Submission
bsub your_job_script.bsub
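A minimal your_job_script.bsub might look like the following sketch; the project name, walltime, node count, and jsrun resource sets are placeholders to adapt from your machine's template (Summit and Lassen launch jobs with jsrun).

#!/bin/bash
# example LSF batch script (sketch only; values below are placeholders)
#BSUB -P <yourProjectID>
#BSUB -W 2:00
#BSUB -nnodes 2
#BSUB -J warpx
#BSUB -o warpx.%J.out
#BSUB -e warpx.%J.err

# launch WarpX; jsrun resource-set flags are placeholders, see the machine page
jsrun -n 12 -a 1 -g 1 -c 7 ./warpx.exe inputs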
Job Control
interactive job:
bsub -P $proj -W 2:00 -nnodes 1 -Is /bin/bash
details for my jobs:
- bjobs 12345 (all details for job with <job id> 12345)
- bjobs [-l] (all jobs under my user name)
- jobstat -u $(whoami) (job eligibility)
- bjdepinfo 12345 (job dependencies on other jobs)
details for queues:
- bqueues (list queues)
communicate with job:
- bkill <job id> (abort job)
- bpeek [-f] <job id> (peek into stdout/stderr of a job)
- bkill -s <signal number> <job id> (send signal or signal name to job)
- bchkpnt and brestart (checkpoint and restart job; untested/unimplemented)
- bmod -W 1:30 12345 (change the walltime of a job; currently not allowed)
- bstop <job id> (prevent the job from starting)
- bresume <job id> (release the job to be eligible for run after it was set on hold)
PBS
PBS (for Portable Batch System) is a popular HPC batch system. The OpenPBS project is related to PBS, PBS Pro and TORQUE.
Job Submission
qsub your_job_script.qsub
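A minimal your_job_script.qsub might look like the following sketch; the account, queue, and resource selection are placeholders, and the exact resource syntax differs between PBS Pro and TORQUE.

#!/bin/bash
# example PBS batch script (sketch only; values below are placeholders)
#PBS -N warpx
#PBS -A <yourProjectID>
#PBS -q <queueName>
#PBS -l walltime=01:00:00
#PBS -l select=2:ncpus=32:mpiprocs=4

# jobs start in $HOME; change to the directory the job was submitted from
cd $PBS_O_WORKDIR

# launcher and executable are placeholders, see the machine page
mpiexec -n 8 ./warpx.exe inputs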
Job Control
interactive job:
qsub -I
details for my jobs:
- qstat -f 12345 (all details for job with <job id> 12345)
- qstat -u $(whoami) (all jobs under my user name)
details for queues:
- qstat -a queueName (show all jobs in a queue)
- pbs_free -l (compact view on free and busy nodes)
- pbsnodes (list all nodes and their detailed state: free, busy/job-exclusive, offline)
communicate with job:
- qdel <job id> (abort job)
- qsig -s <signal number> <job id> (send signal or signal name to job)
- qalter -lwalltime=12:00:00 <job id> (change the walltime of a job)
- qalter -Wdepend=afterany:54321 12345 (only start job 12345 after job with id 54321 has finished)
- qhold <job id> (prevent the job from starting)
- qrls <job id> (release the job to be eligible for run after it was set on hold)
PJM
PJM (probably short for Parallel Job Manager) is a Fujitsu batch system. It is used at RIKEN Fugaku and on other Fujitsu systems.
Note
This section is a stub and improvements to complete the (TODO) sections are welcome.
Job Submission
pjsub your_job_script.pjsub
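A minimal your_job_script.pjsub might look like the following Fugaku-style sketch; the resource group, group/project ID, node count, and launcher are placeholders.

#!/bin/bash
# example PJM batch script (sketch only; values below are placeholders)
#PJM -L "rscgrp=<resourceGroup>"
#PJM -L "node=2"
#PJM -L "elapse=01:00:00"
#PJM --mpi "max-proc-per-node=4"
#PJM -g <yourGroupID>
#PJM -j

# launcher and executable are placeholders, see the machine page
mpiexec -n 8 ./warpx.exe inputs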
Job Control
interactive job:
pjsub --interact
details for my jobs:
- pjstat (status of all jobs)
- (TODO) all details for job with <job id> 12345
- (TODO) all jobs under my user name
details for queues:
- (TODO) show all jobs in a queue
- (TODO) compact view on free and busy nodes
- (TODO) list all nodes and their detailed state (free, busy/job-exclusive, offline)
communicate with job:
- pjdel <job id> (abort job)
- (TODO) send signal or signal name to job
- (TODO) change the walltime of a job
- (TODO) only start job 12345 after job with id 54321 has finished
- pjhold <job id> (prevent the job from starting)
- pjrls <job id> (release the job to be eligible for run after it was set on hold)