**Table of contents**
[[_TOC_]]
# Preliminary remark
The cluster part of this wiki is still under construction. However, we have gathered here some
information to help you start using it.
Do not hesitate to contact the [Bioinformatics team](contacts) for any question, or if you need help
resolving a problem you encounter.
# The GIGA high-performance computing system
The GIGA provides its members with a high-performance computing system (hereafter called the cluster)
composed of (1) a **mass storage** where large datasets can be stored, (2) several **compute nodes** to perform analyses,
(3) a central server (the **master**) which connects the different components and manages the analyses (jobs)
sent to the compute nodes, and (4) a **scratch disk** where you can temporarily keep intermediate results.
This infrastructure is used by almost 300 people. Therefore, there are several important rules to follow
in order not to endanger the analyses of other users.
## 1. The mass storage
Here are a few links to specific pages of our wiki:
- [Full description](mass-storage/mass-storage-home)
- [Quickstart](mass-storage/quickstart-mass-storage)
- [Connection instructions](mass-storage/mass-storage-connection)
- [VPN instructions](vpn-connection) (if you want to connect from outside the university network)
- [Frequently asked questions](faq/FAQ)
- Video: https://youtu.be/VppEcHAvSoU
Once connected to the mass storage, you will have access to different spaces.
1. The first one, called **home**, where you can store up to 100 GB of files/scripts that you do not want to share
with the members of your laboratory. It is your entry point on the storage and on the cluster,
whatever connection mode you use.
2. The second space is a folder associated with your **project**
(accessible from your home via `_SHARE_/Research/<UnitAbreviation>/<LabAbreviation>/<TeamName>/<ProjectName>`;
a path example follows this list).
This folder is accessible to all members of the team who work on that project.
For you to have access to this folder, the PI of your team needs to make a request to the [UDI GIGA-MED IT specialist](contacts),
specifying the name of the laboratory, the "Team" folder and the "project" folder(s), as well as your username.
If it is a new project, the UDI GIGA-MED IT specialist can create the folder.
3. There is also a **"Resources"** space on the storage accessible via the link of the same name
in the `_SHARE_` folder, and in which you'll find some reference genomes as well as some analysis tools.
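For illustration, here is what reaching a project folder could look like from your home; the placeholders in the path are the ones described above and must be replaced with your own unit, lab, team and project names:
```
ls ~/_SHARE_/     # contains, among others, the Research and Resources spaces
# replace the placeholders below with your own unit/lab/team/project names
cd ~/_SHARE_/Research/<UnitAbreviation>/<LabAbreviation>/<TeamName>/<ProjectName>
```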
[Here are the instructions to upload your data and scripts to the mass storage.](faq/file_transfer)
## 2. The cluster
### 2.1 Connection
If you haven't done so yet, please first connect to the mass storage using the SAMBA protocol
explained [here](mass-storage/mass-storage-connection).
Once that is done, you can connect directly to the cluster by typing `ssh u123456@cluster.calc.priv`
in your terminal (replacing u123456 with your university ID).
You will then be in your [home](faq/home-info), from which you will have access to
the mass storage and to the compute nodes.
### 2.2 Very important points:
When you connect to the cluster, you land on the master.
**It is FORBIDDEN to run your calculations directly on the master.**
You must use `slurm` to send your analyses (jobs) to the compute nodes.
Slurm is a resource management system:
it allocates to each job the resources it needs and launches it as soon
as these resources are available.
To use slurm, you can write a bash script that contains information about the resources needed
as well as the command(s) to run, or you can use an interactive slurm session in which you
run your commands directly.
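Here is a minimal sketch of such a script; the job name, partition, resource values, module and command are placeholders to adapt to your own analysis:
```
#!/bin/bash
#SBATCH --job-name=my_analysis       # placeholder job name
#SBATCH --partition=all_24hrs        # one of the partitions you have access to
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4            # adapt to what your program can actually use
#SBATCH --mem-per-cpu=2000           # memory per CPU, in MB
#SBATCH --time=02:00:00              # estimated walltime (hh:mm:ss)
#SBATCH --output=my_analysis-%j.log  # slurm log file (%j = job ID)

module load samtools                 # hypothetical module name (see section 2.3)
samtools sort input.bam -o sorted.bam   # placeholder command
```
You then submit the script from the master with `sbatch my_script.sh` (after `module load slurm`).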
You can find more information about slurm on [this wiki page](cluster/slurm/slurm_home).
The CECI also has a really well explained [tutorial](https://support.ceci-hpc.be/doc/_contents/QuickStart/SubmittingJobs/SlurmTutorial.html)
and [FAQ](https://support.ceci-hpc.be/doc/_contents/SubmittingJobs/SlurmFAQ.html).
In addition to this, there are some very important considerations related to the type of jobs you want to run.
- As the compute nodes are available to everyone, it is important not to launch several million jobs
at the same time, so as not to use up all the available resources. To limit the number of jobs running in parallel,
you can use arrays (see below).
- In addition, you should avoid launching a large number of very short jobs one after the other.
Indeed, if the overhead time slurm spends managing each job is greater than the job itself and
a lot of these jobs are sent at the same time, slurm is going to crash.
It is recommended that each individual job takes at least 20 minutes.
If you have lots of small jobs, please combine several of them into one job executing them one after the other
(for example with a for loop, as in the sketch after this list) so that the actual job managed by slurm lasts about 20 minutes or more
(and send several of these combined jobs in parallel using the array method explained below).
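For illustration, here is a minimal sketch of grouping many short tasks into one job; the input files and the short task itself are hypothetical:
```
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=1000

# Run many short tasks one after the other inside a single slurm job,
# so that the job managed by slurm lasts well over 20 minutes.
for sample in my_fastq_folder/*.fastq.gz; do   # hypothetical input files
    gzip -t "$sample"                          # hypothetical short task (an integrity check)
done
```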
Also note that if you are launching a large number of jobs, you should avoid asking to receive an email
each time a job starts or ends. In the past, our server has sometimes been blacklisted because
it tried to send more than 10,000 emails in 1 hour, and when this happens, no one receives emails anymore.
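If you do use slurm's mail notifications, consider restricting them in your script headers, for example (the address below is a placeholder):
```
##SBATCH --mail-type=ALL                 # avoid: this would send an email for every job start/end
#SBATCH --mail-type=FAIL                 # only be notified when a job fails
#SBATCH --mail-user=u123456@uliege.be    # placeholder: your own address
```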
In addition, the mass storage is not optimized to store a very large number of small files,
so if the outputs of your commands are small files, you should ideally generate them on the scratch disk (see below)
and then either concatenate them, gather the information that interests you in a single file,
or [group them in an archive](mass-storage/mass-storage-compression) before transferring it to the mass storage.
The same goes for slurm logs: if you have several thousands of jobs and want to keep all the logs,
we recommend combining them into a single archive.
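For example, assuming your logs use slurm's default `slurm-<jobid>.out` naming, they can be bundled before transfer:
```
# Combine all slurm logs into a single compressed archive, then remove the originals
tar -czf slurm_logs.tar.gz slurm-*.out && rm slurm-*.out
```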
### 2.3 Operating system, programs and compute nodes:
The operating system of the cluster is CentOS, which is a Linux distribution.
The main programming languages are available on the cluster.
There is also a series of programs installed as `modules`.
These modules can be used by following the instructions on [this wiki page](cluster/software/cluster-module).
To use a module in an analysis, you must load the module in your script, as in the sketch below.
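A quick sketch (the module name is hypothetical; run `module avail` to see what is actually installed on the cluster):
```
module avail           # list the modules installed on the cluster
module load samtools   # hypothetical module name: load it before calling the program
samtools --version     # the program is now available in your session
```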
There are several compute nodes available on the cluster.
These nodes have different resources (number of CPUs and amount of RAM available) and are grouped into "partitions".
By default, any GIGA member has access to the compute nodes that are in the partitions all_5hrs, all_24hrs and kosmos.
There is no time limit for jobs sent to the kosmos partition, but jobs sent to the 2 other partitions will be killed by slurm if they don't complete in the indicated time (5h and 24h respectively).
You can see the nodes present in the partitions to which you have access by typing
```
module load slurm # load the slurm module (only needed once per session)
sinfo
```
And see the resources available on each node with the 2 following commands:
```
sinfo -lN
grep ^Node /etc/slurm/slurm.conf
```
If your lab bought some compute nodes, they are probably in a separate partition, and the PI of your lab
needs to make a request to the [UDIMED/UDIGIGA](contacts) to add you to the list of people having access to it.
### 2.4 Interactive sessions
For an interactive session, you can use the command `srun`. An example of this command is the following:
```
srun --partition=kosmos --ntasks=1 --cpus-per-task=1 --mem-per-cpu=1000 --pty bash
# change the name of the partition, number of tasks, CPUs and memory per CPU accordingly to your needs
```
When running this type of command, slurm will allocate you resources on one of the compute nodes and
log you in to that node in your terminal. Your prompt will thus change to include the name of the node
(for example `u123456@genetic.ptfgen005 ~ $`). You are now on the node and you can test all your commands.
Of note, while you are connected to the node:
- You shouldn't use more resources than what you asked for.
- The resources that have been allocated to you cannot be used by others,
so when you don't need them anymore, please exit the node (type `exit`).
- If you lose your internet connection (or turn off your computer), your job will be interrupted,
so interactive sessions are only meant to test or try out a few things or for debugging; once you want to run your analysis on big files,
it's always better to run it with `sbatch` and a script.
You can find more information about this command here: https://slurm.schedmd.com/archive/slurm-14.11.11/srun.html
### 2.5 Job arrays
Arrays allow you to launch a set of jobs progressively, with only a limited number running in parallel,
so as not to overload the computing cluster. To use arrays in a slurm job,
add a `#SBATCH --array` option to the headers of the script.
Here is an example of the use of arrays:
```
#!/bin/bash
#SBATCH --array=1-16%4 # in this example, we want to launch 16 jobs in total, with only 4 running in parallel

i=$((SLURM_ARRAY_TASK_ID - 1)) # the array task identifier (here, a number ranging from 1 to 16), shifted to a 0-based index
cd ${Directory_with_fastq_files}
AllData=(*.fastq.gz) # we get the list of 16 files that interest us for the rest of the script
Data=${AllData[$i]} # we retrieve the file corresponding to this array task; this "Data" variable will then be used with the command launched later in the script
```
Slurm will then launch the job for the first 4 files (as soon as the requested resources are available)
and will wait for one of these jobs to finish before sending the next one.
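Assuming the script above is saved as `array_job.sh` (a hypothetical name), it can be submitted and monitored like this:
```
sbatch array_job.sh   # submit the whole array as a single job
squeue -u u123456     # check the state of your jobs (replace u123456 with your ID)
```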
You can find more explanations on arrays [here](https://help.rc.ufl.edu/doc/SLURM_Job_Arrays)
and [here](https://rcc.uchicago.edu/docs/running-jobs/array/index.html)
### 2.6 Scratch disk and temporary folders/files:
Writing your temporary and intermediate files to the mass storage will considerably slow down your analyses
(and everybody else's analyses). To store these files for the duration of your job, you have 2 options:
- the `/local` folder on the node on which you are running your analysis.
There should be 2 TB of space available there (if other users have properly deleted the temporary files they generated).
Everything written there is available only from that node.
As it's on the node itself, reading and writing files there is really fast.
- the `/home/gallia/scratch` folder, which is on a separate disk accessible from all nodes and currently has
17 TB of free space. Writing there will be slower than writing on `/local` but still a lot faster than
writing directly on the mass storage.
In both cases, you should create a subfolder with your ULiège identifier as folder name.
Inside that folder, you can create a subfolder specific to each job.
Most importantly, at the end of your job, you should transfer all the files you need to keep to
the mass storage, as the tmp and scratch folders are not backed up.
Moreover, files written in the `/local` folder should be deleted at the end of your job,
and files on `/home/gallia/scratch` should be deleted as soon as you no longer need them,
since these common spaces are limited.
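Here is a minimal sketch of that workflow inside a job script, using the standard `$USER` and `$SLURM_JOB_ID` environment variables; the result file and destination path are hypothetical:
```
# Create a job-specific temporary folder on the node-local disk
TMPDIR=/local/$USER/$SLURM_JOB_ID   # one subfolder per user, then one per job
mkdir -p "$TMPDIR"

# ... run your analysis here, writing intermediate files into $TMPDIR ...

# Keep only what you need: copy the final results back to the mass storage
cp "$TMPDIR"/final_results.txt ~/my_project_folder/   # hypothetical file and destination
rm -rf "$TMPDIR"   # clean up the shared local space at the end of the job
```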
### 2.7 Final remark:
Our cluster is optimized for jobs that require a lot of memory.
If you need a lot of CPUs but relatively little RAM,
you can use [the CECI clusters](http://www.ceci-hpc.be/clusters.html)
(accessible to all members of the University of Liège).
CECI has very good tutorials [here](https://support.ceci-hpc.be/doc/).