The High Performance Computing (HPC) cluster is a central resource available at MAX IV for users and staff. It is a small cluster compared to what one would find at dedicated supercomputing centers.
There are currently two sub-clusters, nicknamed the “online” and “offline” clusters. The names refer to their intended usage:
- The online cluster is dedicated to data analysis during beamtime. There are two front end machines, clu0-fe-2 and clu0-fe-3. The online cluster is only accessible within MAX IV, including access from the maxiv_guest wifi network and from beamline workstations.
- The offline cluster is a small cluster that can be used outside beamtime by both staff and users. The front end is offline-fe1. The offline cluster is not accessible from the beamline computers, but it is possible to log in to it remotely using the VPN.
The HPC is maintained in cooperation with LUNARC (Lund University Computing Center) and so has a very similar architecture.
Anyone with a MAX IV account (including DUO accounts) can use the clusters. It is not necessary to apply for any special account, though access for non-staff may be limited to active proposal periods. For access problems or need for additional access, contact Thomas Eriksson.
Starting information for “Dummies”
If you are happy with a Linux prompt, just need to do simple things, do not want to read the long information below, and do not want to bother other users by taking limited resources on the frontends, then this is all you need to read. See also access via Remote Desktop.
The online cluster frontends are clu0-fe-2 and clu0-fe-3. The offline frontend is offline-fe1 (actually an alias for clu1-fe-1).
# login using ssh (use MAX IV login-name)
ssh -X usrnam@clu0-fe-2      # step 1
# You are now at the computing cluster frontend. This machine has around
# 20 cores and ~60 GB of RAM, so it can comfortably serve several users
# simultaneously. You can do here whatever you are used to doing on your
# laptop. But if you are planning to do something larger, e.g. use
# software that can occupy all CPUs or take a large amount of memory (> 20 GB)
# (watch out! it is quite easy with Matlab), it is strongly advised to hop onto
# one of the computing nodes. This will give you more resources without
# affecting other users. To do so, start an "interactive" session.
interactive -c 2 -t 06:00:00      # step 2a (6 hours, single core, i.e. 2 hyperthreads)
# you can work now (!), you may find your data in
cd /data/visitors/(beamline)/(proposal)/(visit)
# where you use your beamline name, proposal and visit number.
# if you want more CPUs use the -n option (useful e.g. for Matlab)
# if you want more RAM use the --mem option (you get around 1.5 GB per logical CPU, i.e. hyperthread)
interactive -n 4 -c 2 --mem 20GB -t 06:00:00      # step 2b (4 cores, i.e. 8 hyperthreads, 20 GB RAM)
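Once the interactive session has started, it can be useful to confirm what SLURM actually allocated to you. A minimal sketch using standard SLURM environment variables and commands (the exact values depend on your request):

# check where you landed and what was allocated
hostname                   # should be a compute node (cn...), not the frontend
echo $SLURM_JOB_ID         # ID of the interactive job
echo $SLURM_CPUS_ON_NODE   # number of logical CPUs allocated on this node
squeue -u $USER            # list all of your running and pending jobs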
You may notice that there does not seem to be much software available, or that the software and libraries are old. In that case you need to understand the basics of the modular software installation. See:
- LUNARC User Documentation is the best reference
- Some basic module system commands:
- module list # list loaded modules
- module avail # show available modules that can be loaded directly
- module spider modulename # look for an installed module
- module add modulename # load module
- module load modulename # load module (same as add)
- module spider exact-modulename # get info about module
- module remove modulename # unload module
- module purge # unload all loaded modules
- Note: There are also frontends with a Linux virtual desktop that may better fit your needs: clu0-fe-2, clu0-fe-3, offline-fe1
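As an illustration, a typical way to combine the module commands above when looking for a package (the module and version names below are only examples; check what module spider actually reports on the cluster):

module purge                 # start from a clean environment
module spider Python         # find out which Python modules exist and what they require
module load GCC/7.3.0-2.30   # example prerequisite reported by "module spider" (assumed name)
module load Python/3.6.6     # then the module itself (assumed version)
module list                  # verify what is now loaded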
Home directories
Each user has their own cluster-dedicated home directory (~), shared between the frontends and the nodes (this is permanent storage with backup). In addition, the users' mxn-home/visitors directories are mounted on the frontends and nodes for convenience.
# compare
ls ~
ls /mxn/visitors/username
Node local storage
$TMPDIR=/local/slurmtmp.$SLURM_JOB_ID
Note: the variable is set only in sbatch scripts, not in interactive sessions; the directory itself exists in both cases.
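A sketch of how $TMPDIR is typically used inside an sbatch script: stage the input to the fast node-local disk, work there, and copy the results back before the job ends. The paths and the processing command below are placeholders, and the assumption is that the directory is cleaned up when the job finishes.

#!/bin/bash
#SBATCH -t 01:00:00
#SBATCH -J tmpdir_example

# copy the input to the node-local scratch area (placeholder paths)
cp /data/visitors/(beamline)/(proposal)/(visit)/raw/input.h5 $TMPDIR/

# work on the local copy (placeholder command)
cd $TMPDIR
my_processing_tool input.h5 -o result.h5

# copy the result back to permanent storage before the job ends,
# since $TMPDIR is assumed to be cleaned up when the job finishes
cp result.h5 /data/visitors/(beamline)/(proposal)/(visit)/process/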
Storage
Beamline/scientific data storage is mounted in /data/visitors/(beamline)/(proposal)/(visit)
ls /data/visitors/biomax/prn0001/20160622
Using software at MAX IV cluster
The software installation at the MAX IV HPC cluster is identical to LUNARC Aurora. A hierarchical environment module scheme is used to provide a rich and unified software environment for scientific applications. We refer to the LUNARC User Documentation for useful and precise information.
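Because the module tree is hierarchical, some application modules only become visible after a compiler or toolchain module has been loaded. A small sketch (foss/2018a and h5py are taken from the batch example further down this page; the exact lists shown by module avail will differ):

module avail           # toolchain-dependent application modules may not be listed yet
module load foss/2018a # load a compiler toolchain
module avail           # the list now also includes modules built on top of foss/2018a
module spider h5py     # spider searches the whole hierarchy regardless of what is loaded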
Getting information
The MAX IV cluster uses SLURM (Simple Linux Utility for Resource Management).
# view information about nodes and partitions
sinfo
all*    up   7-00:00:00   1   drain   cn17
all*    up   7-00:00:00   7   idle    cn[20-26]
gpu     up   7-00:00:00   1   idle    gn0
# view information about jobs located in the scheduling queue
squeue
JOBID   PARTITION   NAME   USER   ST   TIME   NODES   NODELIST(REASON)
...
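A few commonly useful variations of these commands (all are standard SLURM options):

sinfo -N -l               # per-node listing with state, CPUs and memory
squeue -u $USER           # show only your own jobs
squeue -j JOBID           # show a specific job
scontrol show job JOBID   # full details of a job (requested resources, node list, ...)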
Useful commands
See How to use a job submission system at LUNARC
# start a job from the batch file 'j_CDImap.sh' - see Lunarc documentation or an example here
sbatch j_CDImap.sh
# or to run an interactive bash in node cn26
interactive --nodelist=cn26
# add the "-p v100" option to indicate if you request a V100 gpu node
interactive -p v100
# According to LUNARC documentation it is strongly recommended to "purge"
# all modules after entering the interactive session
module purge
# Deprecated method:
srun --nodelist=cn26 --pty bash
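If you requested a V100 GPU node, you can quickly confirm that a GPU is actually visible in the session; nvidia-smi is the standard NVIDIA tool, and its availability on the GPU nodes is an assumption here:

# inside an interactive session started with "interactive -p v100"
nvidia-smi   # lists the GPUs visible to the session and their current utilisation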
# cancel/stop job
scancel JOBID

# reserve a CPU node
salloc -N 1
# Note: after this cmd you are logged into the first allocated node
# reserve a whole GPU node
salloc -p v100 --exclusive
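Within an allocation created by salloc, srun launches commands on the allocated node(s). A minimal sketch using standard SLURM behaviour (hostname is just a placeholder command):

salloc -N 1         # reserve one node and get a shell inside the allocation
srun hostname       # runs on the allocated node
srun -n 4 hostname  # run 4 tasks within the allocation
exit                # leave the shell to release the allocation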
Preparing a batch script
See a detailed tutorial within LUNARC documentation.
Below is just a quick-and-dirty example that asks exclusively for nodes cn28 and cn29. There is a maximum of 48 tasks per node.
j_CDImap.sh
#!/bin/bash
#
# job time, change for what your job requires
#SBATCH -t 00:10:00
#
# job name
#SBATCH -J j_CDImap
#
#SBATCH --exclusive
#SBATCH -N 2
#SBATCH --tasks-per-node=48
#SBATCH --nodelist=cn28,cn29

# filenames stdout and stderr - customise, include %j
#SBATCH -o process_%j.out
#SBATCH -e process_%j.err

# write this script to stdout-file - useful for scripting errors
cat $0

# load the modules required for your program - customise for your program
module purge
module add foss/2018a h5py/2.7.1-Python-2.7.14

# run the program
# customise for your program name and add arguments if required
mpirun -n 96 python /mxn/nanomax/sw/CDIsuite/XRFCDImapping.py --path=/data/nanomax/prn20161125/ --file=GIA_sxw.h5 --scan=12 --scratch=$TMPDIR
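After submitting the script you can follow the job like this; the output file name matches the #SBATCH -o pattern above, and the job ID shown is only an example:

sbatch j_CDImap.sh          # prints e.g. "Submitted batch job 123456"
squeue -u $USER             # check whether the job is pending or running
tail -f process_123456.out  # follow the standard output once the job has started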
Get statistics on completed jobs
Once your job has completed, you can get additional information that was not available during the run, such as the run time and the memory used. See below for two examples.
To get statistics on completed jobs by jobID
sacct -j jobid --format=JobID,JobName,MaxRSS,Elapsed
# To view the same information for all jobs of a user
sacct -u usrnam --format=JobID,JobName,MaxRSS,Elapsed
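sacct accepts many more format fields; for example (the field names below are standard sacct fields, and the start date is just an example):

# all of a user's jobs since a given date, with state, memory and node information
sacct -u usrnam --starttime=2021-01-01 --format=JobID,JobName,State,Elapsed,MaxRSS,ReqMem,NodeList
# list all available format fields
sacct --helpformat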