----+ HOTCAT USER MANUAL +----

(version 1.0)

Authors: giuliano.taffoni@inaf.it
         gianmarco.maggio@inaf.it

Support Email: for any support requests please contact help.hotcat@inaf.it

*******************************************************************************
IMPORTANT: If you use the HOTCAT computing infrastructure, you must cite the
following papers in your publications:

 [1] Taffoni, G., Becciani, U., Garilli, B., Maggio, G., Pasian, F., Umana, G.,
     Smareglia, R. and Vitello, F.
     "CHIPP: INAF pilot project for HTC, HPC and HPDA",
     arXiv:2002.01283 (2020)
 [2] Bertocco, S., Goz, D., Tornatore, L., Ragagnin, A., Maggio, G., Gasparo, F.,
     Vuerli, C., Taffoni, G. and Molinaro, M.
     "INAF Trieste Astronomical Observatory Information Technology Framework",
     arXiv: Instrumentation and Methods for Astrophysics (2019)
*******************************************************************************

0. Index
    0.1 Conventions
1. Cluster overview and characteristics (CPU, RAM, etc.)
    1.1 The SKADC partition
    1.2 Storage and Quotas
2. Access the cluster
3. The queue system: SLURM
    3.1 List of Useful Commands
    3.2 Submit jobs on the cluster with SLURM by examples
4. Software and modules
    4.1 Environment Modules
    4.2 Containers
      4.2.1 Executing the SKADC container
      4.2.2 Executing your own container
      4.2.3 Submit a container job
    4.3 Compiling Software
    4.4 Python packages and virtual environment

0.1 Conventions

Hardware:

 - login node: node used to access the cluster
 - compute node: node where the code is actually running
 - storage node: node where the data is stored
 - home storage: the file system directory containing files for a given user on
                the cluster
 - scratch storage: large capacity and high performance temporary storage to be
                   used while applications are running on the cluster.

The examples:

- examples are enclosed in '++++' sections;
- lines starting with '$' or '%' are commands typed by the user;
- lines starting with '## ' are comments.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
$ cat examples.sh

#!/bin/bash
## This is an example
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++


1. Cluster overview and characteristics (CPU, RAM, etc.)
--------------------------------------------------------
The HOTCAT cluster is composed of three partitions:
a) base;
b) CHIPP;
c) skadc.

The three partitions share the same login node "amonra.oats.inaf.it".

1.1 The SKADC partition
The partition is composed of 20 compute nodes, 800 cores in total, with about
6 GB of RAM per core. In detail, each compute node has:

 - 4 x 10-core (40 cores) Haswell E5-4627 v3 @ 2.60 GHz
 - 256 GB DDR3 1333 MHz RAM
 - Network: InfiniBand ConnectX 56 Gb/s and 1 Gb/s Ethernet.

To reserve some resources for the operating system and the cluster services,
each user can use 38 cores out of 40 and 240 GB of RAM per node.

These nodes are referred to as GEN9 in the rest of this document.

1.2 Storage and Quotas
All users have access to a high capacity primary storage.
This system currently provides 50 TB (terabytes) of storage.
Data integrity is protected by a RAID 6 system, but no backups are taken, so
copy any data you must keep safe to a more secure storage.

HOTCAT provides each user a home directory on the primary storage that is
accessible from all HOTCAT nodes:

/u/username

Your use of this space is limited by a storage quota that applies to your
account's usage as a whole.

HOTCAT also provides large capacity (600 TB) and high performance (2 GB/s I/O)
storage to be used while applications are running on the cluster.
This is the scratch parallel filesystem, based on BeeGFS and 4 storage nodes.
The scratch space is a set-aside area of the storage, and a scratch area
dedicated to the SKA Data Challenge is available at:

/beegfs/skadc

The space is organised as follows:

skadc -+
       |
       +- data
       +- doc
       +- singularity
       +- skadc04 
       +- skadc05
       +- skadc06
       +- skadc07
       +- software


The 'data' directory is where you will find the SKADC data.
NOTE: copy the data you need into your own work directory in /beegfs.

The 'doc' directory contains the documentation and some examples showing how to
submit jobs on the cluster and how to run Singularity.

The 'singularity' directory contains the shared Singularity images.

The 'skadc0X' directories are the space reserved to each group for computing,
private data and software.

The 'software' directory holds the shared software used through the containers.
Some software is too big to include in a docker/singularity container (the
image becomes too large), so we install it locally.

The /beegfs filesystem is a high performance parallel filesystem that must be
used while applications are running, but:
THERE IS NO BACKUP, IT IS INTRINSICALLY SUBJECT TO FAILURES, AND IT IS HIGHLY
RISKY TO LEAVE DATA THERE FOR A LONG TIME. EACH YEAR WE WILL CLEAN IT.

SUGGESTION: always use the /beegfs partition for running programs, then copy
back to your /u/username disk only the data you need to preserve.
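
For example, to copy a results directory from the scratch area back to your
home space you could use rsync; the paths below are only illustrative, adapt
them to your group directory and user name:

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
$ rsync -av /beegfs/skadc/skadc04/my_results/ /u/username/my_results/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++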



2.0 Access the Cluster
----------------------
To access the cluster the user must ssh to the login node:

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
$ ssh username@amonra.oats.inaf.it
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

The login is done with an ssh key pair (NO PASSWORD IS REQUIRED): the user must
provide the public key to the administrators at registration.
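
If you do not have a key pair yet, a possible way to create one on your own
machine is sketched below; the key type and file name are just an example, and
only the public part (the .pub file) must be sent to the administrators:

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
## On your local machine
$ ssh-keygen -t ed25519 -f ~/.ssh/id_hotcat
## Send ~/.ssh/id_hotcat.pub (the PUBLIC key) to the administrators, then:
$ ssh -i ~/.ssh/id_hotcat username@amonra.oats.inaf.it
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++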

The user is assigned to the skadc partition and to a set of resources according
to her entitlements (project, funding, etc.).

The user must use the compute nodes to execute her programs; running programs
on the login node is not allowed for any reason. Programs running on the login
node will be killed without notice!

Compute nodes can also be used to compile software or for interactive post
processing, using interactive jobs (see below).

Software is distributed either through Singularity containers or through
environment modules (see below).

3.0 The queue system: SLURM
---------------------------
The Simple Linux Utility for Resource Management (SLURM) is an open source,
fault-tolerant, and highly scalable cluster management and job scheduling system
for large and small Linux clusters.

3.1 List of Useful Commands
Man pages exist for all SLURM daemons, commands, and API functions.
Here an on-line manual: https://slurm.schedmd.com/

*******************************************************************************
NOTE: For most users it is sufficient to learn the following commands:
      srun, sbatch, squeue and scancel.
*******************************************************************************


The command option --help also provides a brief summary of options.
Note that the command options are all case sensitive.

+------------------------------------------------------------------------------+
| sbatch  | used to submit a job script for later execution. The script will   |
|         | typically contain one or more commands to launch parallel tasks.   |
+------------------------------------------------------------------------------+
| squeue  | reports the state of jobs or job steps. It has a wide variety of   |
|         | filtering, sorting, and formatting options. By default, it reports |
|         | the running jobs in priority order and then the pending jobs in    |
|         | priority order.                                                    |
+------------------------------------------------------------------------------+
| srun    | used to submit a job for execution or initiate job steps in real   |
|         | time. srun has a wide variety of options to specify resource       |
|         | requirements, including: minimum and maximum node count, processor |
|         | count, specific nodes to use or not use, and specific node         |
|         | characteristics (so much memory, disk space, certain required      |
|         | features, etc.). A job can contain multiple job steps executing    |
|         | sequentially or in parallel on independent or shared nodes within  |
|         | the job's node allocation.                                         |
+------------------------------------------------------------------------------+
| scancel | cancels a pending or running job (see the example below).          |
+------------------------------------------------------------------------------+
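
A typical monitoring sequence might look like the following sketch; the job ID
12345 is only illustrative, use the one printed by sbatch or shown by squeue:

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
## List your own jobs
$ squeue -u username
## Show a specific job
$ squeue -j 12345
## Cancel it
$ scancel 12345
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++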


3.2 Submit jobs on the cluster with SLURM by examples

Users can submit jobs using the sbatch command.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
$ sbatch job_script.sh
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

In the job script, the sbatch parameters are defined with #SBATCH directives.
Users can also start jobs directly with the srun command, but the best way to
submit a job is to use sbatch to allocate the required resources with the
desired walltime and then call mpirun or srun inside the script.

Here is a simple example where we execute two system commands, sleep and
hostname, inside the script.

This job is named TestJob, runs on the skadc partition, allocates 1 compute
node and 128 GB of RAM, defines the output and error files, and requests
8 hours of walltime.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
#!/bin/bash
#SBATCH -J TestJob
#SBATCH -p skadc
#SBATCH -N 1
#SBATCH --mem=128G
#SBATCH -o TestJob-%j.out
#SBATCH -e TestJob-%j.err
#SBATCH --time=08:00:00
sleep 5
hostname
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

We could do the same using the srun command directly (it accepts only one
executable as argument):

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
$ srun -N1 --time=8:00:00 --mem=128G  -p skadc hostname
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
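
For a parallel (MPI) program, the pattern described above (allocate with sbatch,
then launch the tasks with srun or mpirun inside the script) could look like the
following sketch; my_mpi_program is a placeholder for your own executable:

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
#!/bin/bash
#SBATCH -J MpiTestJob
#SBATCH -p skadc
#SBATCH -N 2
#SBATCH --ntasks-per-node=38
#SBATCH -o MpiTestJob-%j.out
#SBATCH -e MpiTestJob-%j.err
#SBATCH --time=08:00:00

module load default_gnu
## Launch one MPI task per allocated core (2 x 38 tasks in this example)
srun ./my_mpi_program
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++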


To run interactive jobs, users can call srun with some specific arguments.

For example:
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
$ srun -N2 --time=01:20:00 -p skadc --pty -u bash -i -l 
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

This command returns a console on the allocated compute nodes.

Every command called there will be executed on all allocated compute nodes.


++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
## On the login node (amonra.oats.inaf.it)

[amonra]$ hostname
 amonra
[amonra]$ srun --mem=4096 --nodes=1 --ntasks-per-node=4 --time=01:00:00 -p skadc \
          --pty /bin/bash
[gen10-09]$ hostname
gen10-09
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

In this example we request an interactive console on 1 node, using 4 CPUs
(cores) and a total of 4 GB of RAM for 1 hour.
Another way to start an interactive job is to call salloc, as in the sketch
below. Please choose the way you prefer.
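
A possible salloc workflow (resources are only illustrative) is to request the
allocation first and then launch commands inside it with srun:

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
## Request an allocation; the prompt stays on the login node
$ salloc --nodes=1 --ntasks-per-node=4 --mem=4096 --time=01:00:00 -p skadc
## Run commands inside the allocation
$ srun hostname
## Release the allocation
$ exit
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++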


4.0 Software and modules
------------------------

To access software on Linux systems, use the module command to load the software
into your environment.

We recommend using the default_gnu module unless you know exactly what you are
doing. At login and in any job script type:
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
% module load default_gnu
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

NOTE: do not put this command in your .bashrc or .bash_profile, as it could
conflict with your SLURM jobs.

4.1 Environment Modules
You can find the other modules available on HOTCAT using the module command:

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
$ module avail
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

For example, if you want to use version 9.3.0 of the gcc compiler, you would
type in your job script or at the command line:

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
% module load gnu/9.3.0
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
To find out which module versions are available, use the module avail command
to search by name.

For example, to find out which versions of the GNU compiler are available:

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
% module avail gnu

-------------------- /opt/cluster/Modules/3.2.10/compilers ---------------------
gnu/4.8.5 gnu/9.3.0

% module avail fftw

-------------------- /opt/cluster/Modules/3.2.10/libraries ---------------------
fftw/2.1.5/openmpi/3.1.6/gnu/4.8.5 fftw/2.1.5/openmpi/4.0.3/gnu/9.3.0
fftw/2.1.5/openmpi/3.1.6/gnu/9.3.0 fftw/3.3.8/openmpi/3.1.6/gnu/4.8.5
fftw/2.1.5/openmpi/3.1.6/pgi/19.10 fftw/3.3.8/openmpi/3.1.6/gnu/9.3.0
fftw/2.1.5/openmpi/4.0.3/gnu/4.8.5
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
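
To load one of the listed versions, give the full module name exactly as
printed by module avail, for example:

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
% module load fftw/3.3.8/openmpi/3.1.6/gnu/9.3.0
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++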

To clean your module environment use the purge command:
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
% module purge
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

To check the loaded modules use the list command:
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
% module list
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

4.2 Containers

We support the use of Singularity containers, which can be either created by
the user or chosen among those already available on the cluster.

The "official" SKADC container is

skadc_software_0.0.5.sif

It could be updated or integrated with other containers during the challenge,
if requested or needed. It is an Ubuntu 18.04 image with the KERN suite
software, python3, astropy, CASA, CARTA, SoFiA-2, etc.
The list of available software is in /beegfs/skadc/doc/SOFTWARE_LIST.

4.2.1 Executing the SKADC container

The container is fully isolated; to run it properly you can use the example
script available in /beegfs/skadc/doc/:

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
% cat /beegfs/skadc/doc/run_singularity_local.sh

#!/bin/bash
BASE_SINGULARITY_DIR=/beegfs/skadc/singularity/
CONTAINER_NAME=skadc_software
CONTAINER_VERSION=0.0.5
if [ 'XXX'$1 = 'XXX' ]; then
    COMMAND=bash
else
    COMMAND=$1
fi
HOMEDIR=`mktemp -d -t singularity_XXXXXXX`
singularity run --pid --no-home --home=/home/skauser --workdir ${HOMEDIR}/tmp \
    -B ${HOMEDIR}:/home/ -B /beegfs:/beegfs --containall --cleanenv \
    ${BASE_SINGULARITY_DIR}${CONTAINER_NAME}_${CONTAINER_VERSION}.sif ${COMMAND}
rm -fr ${HOMEDIR}

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

You can customise this script as you like. Remember that the only persistent
directory is /beegfs; for security reasons you cannot share your home
directory.

The best way to execute your code through a container is to prepare a script
(or a Python program), for example:

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
% sh run_singularity_local.sh /beegfs/skadc/doc/example.sh
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Where example.sh is:

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
% cat /beegfs/skadc/doc/example.sh

#!/bin/bash
cd /beegfs/skadc/doc
echo "Just an example"
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

or 

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
% sh run_singularity_local.sh /beegfs/skadc/doc/example.py
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++


SoFiA-2 (https://github.com/SoFiA-Admin/SoFiA-2) and CARTA can be executed
inside the container:

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
% sh run_singularity_local.sh
singularity> /beegfs/skadc/software/SoFiA-2-2.2.0/sofia
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

or

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
% sh run_singularity_local.sh /beegfs/skadc/doc/Sofia.sh
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++


4.2.2 Executing your own container
You can execute your own container on the cluster if it is available on Docker
Hub or in any other Docker registry.

You can use /beegfs/skadc/doc/run_singularity.sh as an example.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
% cat run_singularity.sh

#!/bin/bash
export CONTAINER_NAME=morgan1971/skadc_software
export CONTAINER_VERSION=0.0.5
export BASE_PORT=
if [ 'XXX'$1 = 'XXX' ]; then
    COMMAND=bash
else
    COMMAND=$1
fi
HOMEDIR=`mktemp -d -t singularity_XXXXXXX`
mkdir $HOMEDIR/tmp
mkdir $HOMEDIR/home
singularity run --pid --no-home --home=/home/skauser --workdir ${HOMEDIR}/tmp \
    -B ${HOMEDIR}:/home/ -B /beegfs:/beegfs --containall --cleanenv \
    docker://${CONTAINER_NAME}:${CONTAINER_VERSION} $COMMAND
rm -fr ${HOMEDIR}
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
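
If you prefer not to download the image from the registry at every run, a
possible alternative is to pull it once into a .sif file in your group
directory and then run that file with the same options used in
run_singularity_local.sh; the image name and tag below are only placeholders:

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
$ cd /beegfs/skadc/skadc04/
$ singularity pull my_container_0.0.1.sif docker://youruser/yourimage:0.0.1
$ singularity run --pid --no-home --containall --cleanenv \
      -B /beegfs:/beegfs my_container_0.0.1.sif
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++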


4.2.3 Submit a container job

To execute your code in the cluster you must submit a job to the queue system.

An example submission script is available at /beegfs/skadc/doc/submit_example.slurm.
To run on the cluster nodes you must submit the job using the sbatch command:

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
% sbatch submit_example.slurm
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
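
The content of the actual submit_example.slurm may differ; a minimal sketch of
a container job script could look like the following, where the resources and
the script name my_analysis.sh are only illustrative:

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
#!/bin/bash
#SBATCH -J ContainerJob
#SBATCH -p skadc
#SBATCH -N 1
#SBATCH --ntasks-per-node=4
#SBATCH --mem=16G
#SBATCH -o ContainerJob-%j.out
#SBATCH -e ContainerJob-%j.err
#SBATCH --time=08:00:00

## Run your script inside the shared SKADC container
sh /beegfs/skadc/doc/run_singularity_local.sh /beegfs/skadc/skadc04/my_analysis.sh
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++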


Check the status of your job with the squeue command:

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
% squeue -u username
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++



4.3 Compiling Software
If you need to compile software, use an interactive job: start an interactive
session and run your compilation there.
Do NOT run compilations on the login node.
We suggest requesting 4 slots for your interactive job; here is an example with
the skadc partition (change it according to your needs):

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
% srun --mem=4096 --nodes=1 --ntasks-per-node=4 -p skadc --pty /bin/bash
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

If you want to compile software inside a container, you need to execute the
container and then work in your /beegfs directory. For example:

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
% srun --mem=4096 --nodes=1 --ntasks-per-node=4 -p skadc --pty /bin/bash

gen09-10% cd /beegfs/skadc/skadc04/
gen09-10% sh /beegfs/skadc/doc/run_singularity_local.sh
singularity> cd /beegfs/skadc/skadc04/your_software_dir
singularity> ./configure --prefix=/beegfs/skadc/skadc04/wherever_you_like
singularity> make; make install
singularity> exit
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Then you can execute the software from inside the container as:

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
% /beegfs/skadc/doc/run_singularity_local.sh   \
                  /beegfs/skadc/skadc04/wherever_you_like/mysoftware
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Or using a script.



4.4 Python packages and virtual environment
The HOTCAT cluster provides two Python versions:
 - Python 2.7 (deprecated)
 - Python 3.6

We install some of the most common Python packages, but not all of them.
A user can customise her Python environment using the venv package, which
allows any user to install additional Python modules.
A Python virtual environment is a self-contained directory tree that contains a
Python installation for a particular version of Python, plus a number of
additional packages.

Different applications can then use different virtual environments.

The module used to create and manage virtual environments is called venv.
If you have multiple versions of Python on your system, you can select a
specific Python version by running python3 or whichever version you want.

To create a virtual environment, decide upon a directory where you want to place
it, and run the venv module as a script with the directory path:

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
$ python3 -m venv tutorial-env
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

This will create the tutorial-env directory if it doesn’t exist, and also create
directories inside it containing a copy of the Python interpreter, the standard
library, and various supporting files.

Once you’ve created a virtual environment, you may activate it.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
$ source tutorial-env/bin/activate
(tutorial-env) $
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

You can install, upgrade, and remove packages using a program called pip.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
(tutorial-env) $ pip search astronomy
astronomy (0.0.1)             - Astronomy!
catastropy (0.0dev)           - (cat)astronomy
gastropy (0.0dev)             - (g)astronomy
pykepler (1.0.1)              - Algorithms for positional astronomy
...
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

You can install the latest version of a package by specifying a package’s name:

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
(tutorial-env) $ pip install astronomy
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

If you do not need the environment anymore, you can simply remove the
tutorial-env directory.
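
For example (assuming the environment is currently active):

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
(tutorial-env) $ deactivate
$ rm -rf tutorial-env
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++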

To submit a batch job that uses the environment you can use:

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
#!/bin/bash
#SBATCH -J TestJob
#SBATCH -p skadc
#SBATCH -N 1
#SBATCH --ntasks-per-node=36
#SBATCH --mem-per-cpu=10000
#SBATCH -o TestJob-%j.out
#SBATCH -e TestJob-%j.err
#SBATCH --time=30
source tutorial-env/bin/activate

python3 your_python_code.py
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++


More information is available in the official Python documentation:
https://docs.python.org/3/tutorial/venv.html