**Table of contents**
[[_TOC_]]
# Preliminary remark
The cluster part of this wiki is still under construction. However, we have gathered here some
information to help you start using it.
Do not hesitate to contact the [Bioinformatics team](contacts) for any question, or if you need help
resolving a problem you encounter.
# The GIGA high-performance computing system
The GIGA provides its members with a high-performance computing system (hereafter called the cluster)
composed of (1) a **mass storage** where large datasets can be stored, (2) several **compute nodes** to perform analyses,
(3) a central server (the **master**) which connects the different components and manages the analyses (jobs)
sent to the compute nodes, and (4) a **scratch disk** where you can temporarily keep intermediate results.
This infrastructure is used by almost 300 people. Therefore, there are several important rules to follow
in order not to endanger the analyses of other users.
## 1. The mass storage
Here are a few links to specific pages of our wiki:
- [Full description](mass-storage/mass-storage-home)
- [Quickstart](mass-storage/quickstart-mass-storage)
- [Connection instructions](mass-storage/mass-storage-connection)
- [VPN instructions](vpn-connection) (if you want to connect from outside the university network)
- [Frequently asked questions](faq/FAQ)
- Video: https://youtu.be/VppEcHAvSoU
Once connected to the mass storage, you will have access to different spaces.
1. The first one, called **home**, where you can store up to 100 GB of files/scripts that you do not want to share
with the members of your laboratory. It is your entry point on the storage and on the cluster,
whatever connection mode you use.
2. The second space is a folder associated with your **project**
(accessible from your home via `_SHARE_/Research/<UnitAbreviation>/<LabAbreviation>/<TeamName>/<ProjectName>`;
a path example follows this list).
This folder is accessible to all members of the team who work on that project.
For you to have access to this folder, the PI of your team needs to make a request to the [UDI GIGA-MED IT specialist](contacts),
specifying the name of the laboratory, the "Team" folder and the "project" folder(s), as well as your username.
If it is a new project, the UDI GIGA-MED IT specialist can create the folder.
3. There is also a **"Resources"** space on the storage accessible via the link of the same name
in the `_SHARE_` folder, and in which you'll find some reference genomes as well as some analysis tools.
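For illustration, here is what reaching a project folder could look like from your home; the placeholders in the path are the ones described above and must be replaced with your own unit, lab, team and project names:
```
ls ~/_SHARE_/     # contains, among others, the Research and Resources spaces
# replace the placeholders below with your own unit/lab/team/project names
cd ~/_SHARE_/Research/<UnitAbreviation>/<LabAbreviation>/<TeamName>/<ProjectName>
```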
[Here are the instructions to upload your data and scripts to the mass storage.](faq/file_transfer)
## 2. The cluster
### 2.1 Connection
If you haven't done so yet, please first connect to the mass storage using the SAMBA protocol
explained [here](mass-storage/mass-storage-connection).
Once that is done, you can connect directly to the cluster by typing `ssh u123456@cluster.calc.priv`
in your terminal (replacing u123456 with your university ID).
You will then be in your [home](faq/home-info), from which you will have access to
the mass storage and to the compute nodes.
### 2.2 Very important points:
When you connect to the cluster, you land on the master.
**It is FORBIDDEN to run your calculations directly on the master.**
You must use `slurm` to send your analyses (jobs) to the compute nodes.
Slurm is a resource management system:
it allocates to each job the resources it needs and launches it as soon
as these resources are available.
To use slurm, you can write a bash script that contains information about the resources needed
as well as the command(s) to run, or you can use an interactive slurm session in which you
run your commands directly.
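Here is a minimal sketch of such a script; the job name, partition, resource values, module and command are placeholders to adapt to your own analysis:
```
#!/bin/bash
#SBATCH --job-name=my_analysis       # placeholder job name
#SBATCH --partition=all_24hrs        # one of the partitions you have access to
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4            # adapt to what your program can actually use
#SBATCH --mem-per-cpu=2000           # memory per CPU, in MB
#SBATCH --time=02:00:00              # estimated walltime (hh:mm:ss)
#SBATCH --output=my_analysis-%j.log  # slurm log file (%j = job ID)

module load samtools                 # hypothetical module name (see section 2.3)
samtools sort input.bam -o sorted.bam   # placeholder command
```
You then submit the script from the master with `sbatch my_script.sh` (after `module load slurm`).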
You can find more information about slurm on [this wiki page](cluster/slurm/slurm_home).
The CECI also has a really well explained [tutorial](https://support.ceci-hpc.be/doc/_contents/QuickStart/SubmittingJobs/SlurmTutorial.html)
and [FAQ](https://support.ceci-hpc.be/doc/_contents/SubmittingJobs/SlurmFAQ.html).
In addition to this, there are some very important considerations related to the type of jobs you want to run.
- As the compute nodes are available to everyone, it is important not to launch several million jobs
at the same time, so as not to use up all the available resources. To limit the number of jobs running in parallel,
you can use arrays (see below).
- In addition, you should avoid launching a large number of very short jobs one after the other.
Indeed, if the overhead time slurm spends managing each job is greater than the job itself and
a lot of these jobs are sent at the same time, slurm is going to crash.
It is recommended that each individual job takes at least 20 minutes.
If you have lots of small jobs, please combine several of them into one job executing them one after the other
(for example with a for loop, as in the sketch after this list) so that the actual job managed by slurm lasts about 20 minutes or more
(and send several of these combined jobs in parallel using the array method explained below).
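For illustration, here is a minimal sketch of grouping many short tasks into one job; the input files and the short task itself are hypothetical:
```
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=1000

# Run many short tasks one after the other inside a single slurm job,
# so that the job managed by slurm lasts well over 20 minutes.
for sample in my_fastq_folder/*.fastq.gz; do   # hypothetical input files
    gzip -t "$sample"                          # hypothetical short task (an integrity check)
done
```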
Also note that if you are launching a large number of jobs, you should avoid asking to receive an email
each time a job starts or ends. In the past, our server has sometimes been blacklisted because
it tried to send more than 10,000 emails in 1 hour, and when this happens, no one receives emails anymore.
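If you do use slurm's mail notifications, consider restricting them in your script headers, for example (the address below is a placeholder):
```
##SBATCH --mail-type=ALL                 # avoid: this would send an email for every job start/end
#SBATCH --mail-type=FAIL                 # only be notified when a job fails
#SBATCH --mail-user=u123456@uliege.be    # placeholder: your own address
```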
In addition, the mass storage is not optimized to store a very large number of small files,
so if the outputs of your commands are small files, you should ideally generate them on the scratch disk (see below)
and then either concatenate them, gather the information that interests you in a single file,
or [group them in an archive](mass-storage/mass-storage-compression) before transferring it to the mass storage.
The same goes for slurm logs: if you have several thousands of jobs and want to keep all the logs,
we recommend combining them into a single archive.
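For example, assuming your logs use slurm's default `slurm-<jobid>.out` naming, they can be bundled before transfer:
```
# Combine all slurm logs into a single compressed archive, then remove the originals
tar -czf slurm_logs.tar.gz slurm-*.out && rm slurm-*.out
```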
### 2.3 Operating system, programs and compute nodes:
The operating system of the cluster is CentOS, which is a Linux distribution.
The main programming languages are available on the cluster.
There is also a series of programs installed as `modules`.
These modules can be used by following the instructions on [this wiki page](cluster/software/cluster-module).
To use a module in an analysis, you must load the module in your script, as in the sketch below.
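A quick sketch (the module name is hypothetical; run `module avail` to see what is actually installed on the cluster):
```
module avail           # list the modules installed on the cluster
module load samtools   # hypothetical module name: load it before calling the program
samtools --version     # the program is now available in your session
```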
There are several compute nodes available on the cluster.
These nodes have different resources (number of CPUs and amount of RAM available) and are grouped into "partitions".
By default, any GIGA member has access to the compute nodes that are in the partitions all_5hrs, all_24hrs and kosmos.
There is no time limit for jobs sent to the kosmos partition, but jobs sent to the 2 other partitions will be killed by slurm if they don't complete in the indicated time (5h and 24h respectively).
You can see the nodes present in the partitions to which you have access by typing
```
module load slurm # load the slurm module (only needed once per session)
sinfo
```
And see the resources available on each node with the 2 following commands:
```
sinfo -lN
grep ^Node /etc/slurm/slurm.conf
```
If your lab bought some compute nodes, they are probably in a separate partition, and the PI of your lab
needs to make a request to the [UDIMED/UDIGIGA](contacts) to add you to the list of people having access to it.
### 2.4 Interactive sessions
For an interactive session, you can use the command `srun`. An example of this command is the following:
```
srun --partition=kosmos --ntasks=1 --cpus-per-task=1 --mem-per-cpu=1000 --pty bash
# change the name of the partition, number of tasks, CPUs and memory per CPU accordingly to your needs
```
When running this type of command, slurm will allocate you resources on one of the compute nodes and
log you in to that node in your terminal. Your prompt will thus change to include the name of the node
(for example `u123456@genetic.ptfgen005 ~ $`). You are now on the node and you can test all your commands.
Of note, while you are connected to the node:
- You shouldn't use more resources than what you asked for.
- The resources that have been allocated to you cannot be used by others,
so when you don't need them anymore, please exit the node (type `exit`).
- If you lose your internet connection (or turn off your computer), your job will be interrupted,
so interactive sessions are only meant to test or try out a few things or for debugging; once you want to run your analysis on big files,
it's always better to run it with `sbatch` and a script.
You can find more information about this command here: https://slurm.schedmd.com/archive/slurm-14.11.11/srun.html
### 2.5 Job arrays
Arrays allow you to launch a set of jobs progressively, with only a limited number running in parallel,
so as not to overload the computing cluster. To use arrays in a slurm job,
add a `#SBATCH --array` option to the headers of the script.
Here is an example of the use of arrays:
```
#!/bin/bash
#SBATCH --array=1-16%4 # in this example, we want to launch 16 jobs in total, with only 4 running in parallel

i=$((SLURM_ARRAY_TASK_ID - 1)) # the array task identifier (here, a number ranging from 1 to 16), shifted to a 0-based index
cd ${Directory_with_fastq_files}
AllData=(*.fastq.gz) # we get the list of 16 files that interest us for the rest of the script
Data=${AllData[$i]} # we retrieve the file corresponding to this array task; this "Data" variable will then be used with the command launched later in the script
```
Slurm will then launch the job for the first 4 files (as soon as the requested resources are available)
and will wait for one of these jobs to finish before sending the next one.
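Assuming the script above is saved as `array_job.sh` (a hypothetical name), it can be submitted and monitored like this:
```
sbatch array_job.sh   # submit the whole array as a single job
squeue -u u123456     # check the state of your jobs (replace u123456 with your ID)
```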
You can find more explanations on arrays [here](https://help.rc.ufl.edu/doc/SLURM_Job_Arrays)
and [here](https://rcc.uchicago.edu/docs/running-jobs/array/index.html)
### 2.6 Scratch disk and temporary folders/files:
Writing your temporary and intermediate files to the mass storage will considerably slow down your analyses
(and everybody else's analyses). To store these files for the duration of your job, you have 2 options:
- the `/local` folder on the node on which you are running your analysis.
There should be 2 TB of space available there (if other users have properly deleted the temporary files they generated).
Everything written there is available only from that node.
As it's on the node itself, reading and writing files there is really fast.
- the `/home/gallia/scratch` folder, which is on a separate disk accessible from all nodes and currently has
17 TB of free space. Writing there will be slower than writing on `/local` but still a lot faster than
writing directly on the mass storage.
In both cases, you should create a subfolder with your ULiège identifier as folder name.
Inside that folder, you can create a subfolder specific to each job.
Most importantly, at the end of your job, you should transfer all the files you need to keep to
the mass storage, as the tmp and scratch folders are not backed up.
Moreover, files written in the `/local` folder should be deleted at the end of your job,
and files on `/home/gallia/scratch` should be deleted as soon as you no longer need them,
since these common spaces are limited.
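Here is a minimal sketch of that workflow inside a job script, using the standard `$USER` and `$SLURM_JOB_ID` environment variables; the result file and destination path are hypothetical:
```
# Create a job-specific temporary folder on the node-local disk
TMPDIR=/local/$USER/$SLURM_JOB_ID   # one subfolder per user, then one per job
mkdir -p "$TMPDIR"

# ... run your analysis here, writing intermediate files into $TMPDIR ...

# Keep only what you need: copy the final results back to the mass storage
cp "$TMPDIR"/final_results.txt ~/my_project_folder/   # hypothetical file and destination
rm -rf "$TMPDIR"   # clean up the shared local space at the end of the job
```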
### 2.7 Final remark:
Our cluster is optimized for jobs that require a lot of memory.
If you need a lot of CPUs but relatively little RAM,
you can use [the CECI clusters](http://www.ceci-hpc.be/clusters.html)
(accessible to all members of the University of Liège).
CECI has very good tutorials [here](https://support.ceci-hpc.be/doc/).