**Table of contents**

[[_TOC_]]
# Preliminary remark

The cluster part of this wiki is still under construction. However, we have gathered here some pieces of information to help you start using it.

Do not hesitate to contact the [Bioinformatics team](contacts) with any question, or if you need help resolving a problem you encounter.
# The GIGA high-performance computing system

The GIGA provides its members with a high-performance computing system (hereafter called the cluster) composed of (1) a **mass storage** system where large datasets are kept, (2) several **compute nodes** that perform the analyses, (3) a central server (the **master**) that connects the different components and manages the analyses (jobs) sent to the compute nodes, and (4) a **scratch disk** where you can temporarily keep intermediate results.

This system is used by almost 300 people. Therefore, there are several important rules to follow so as not to endanger the analyses of other users.
## 1. The mass storage

Here are a few links to specific pages of our wiki:

- [Full description](mass-storage/mass-storage-home)
- [Quickstart](mass-storage/quickstart-mass-storage)
- [Connection instructions](mass-storage/mass-storage-connection)
- [VPN instructions](vpn-connection) (if you want to connect from outside the university network)
- [Frequently asked questions](faq/FAQ)
- Video: https://youtu.be/VppEcHAvSoU

Once connected to the mass storage, you will have access to different spaces.

1. The first one, called **home**, is where you can store up to 100 GB of files/scripts that you do not want to share with the members of your laboratory. It is your entry point on the storage and on the cluster, whatever mode of connection you use.
2. The second space is a folder associated with your **project** (accessible from your home via `_SHARE_/Research/<UnitAbreviation>/<LabAbreviation>/<TeamName>/<ProjectName>`). This folder is accessible to all members of the team who work on that project. For you to have access to it, the PI of your team needs to make a request to the [UDI GIGA-MED IT specialist](contacts), specifying the name of the laboratory, the "Team" folder and the "project" folder(s), as well as your username. If it is a new project, the UDI GIGA-MED IT specialist can create the folder.
3. There is also a **"Resources"** space on the storage, accessible via the link of the same name in the `_SHARE_` folder, in which you'll find some reference genomes as well as some analysis tools.

[Here are the instructions to upload your data and scripts to the mass storage.](faq/file_transfer)
## 2. The cluster

### 2.1 Connection

If you haven't yet, please first connect to the mass storage using the SAMBA protocol as explained [here](mass-storage/mass-storage-connection). Once that is done, you can connect directly to the cluster by typing `ssh u123456@cluster.calc.priv` in your terminal (replacing u123456 with your university ID). You will then be in your [home](faq/home-info), from which you have access to the mass storage and to the compute nodes.
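If you connect often, you can optionally add a shortcut to your OpenSSH configuration; the alias `giga` below is just an example name, and `u123456` is again your own university ID:

```
# example entry in ~/.ssh/config (the alias name is arbitrary)
Host giga
    HostName cluster.calc.priv
    User u123456
```

You can then simply type `ssh giga` to connect.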
### 2.2 Very important points

When you connect to the cluster, you are on the master. **It is FORBIDDEN to run your calculations directly on the master.** You must use `slurm` to send your analyses (jobs) to the compute nodes. Slurm is a resource management system: it allocates to each job the resources it requests and launches the job as soon as those resources are available.

To use slurm, you can either write a bash script that contains information about the resources needed as well as the command(s) to run, or use an interactive slurm session in which you run your commands directly.
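As an illustration, here is a minimal submission script; the job name, resource values and the final command are placeholders to adapt to your own analysis:

```
#!/bin/bash
#SBATCH --job-name=my_analysis        # placeholder name
#SBATCH --partition=kosmos            # one of the partitions you have access to
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=1000            # memory per CPU, in MB
#SBATCH --output=my_analysis_%j.log   # %j is replaced by the job ID

# your actual command(s) go here; this placeholder just reports where the job ran
echo "running on host: $(hostname)"
```

Save it as, for example, `my_analysis.sh` and submit it with `sbatch my_analysis.sh`.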
You can find more information about slurm on [this wiki page](cluster/slurm/slurm_home). The CECI also has a really well explained [tutorial](https://support.ceci-hpc.be/doc/_contents/QuickStart/SubmittingJobs/SlurmTutorial.html) and [FAQ](https://support.ceci-hpc.be/doc/_contents/SubmittingJobs/SlurmFAQ.html).

In addition, there are some very important considerations related to the type of jobs you want to run.

- As the compute nodes are available to everyone, it is important not to launch several million jobs at the same time, so as not to use up all the available resources. To limit the number of jobs running in parallel, you can use arrays (see below).
- You should also avoid launching a large number of very short jobs one after the other. If the overhead time slurm needs to manage each job is larger than the job itself, and a lot of these jobs are sent at the same time, slurm is going to crash. It is recommended that each individual job lasts at least 20 minutes. If you have lots of small jobs, please combine several of them into a single job that executes them one after the other (for example with a for loop), so that the actual job managed by slurm lasts about 20 minutes or more (and send several of these combined jobs in parallel using the array method explained below).
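Such batching can be sketched as follows; the sample names and the `echo` stand in for your real files and your real (short) command:

```
#!/bin/bash
# one job that processes a whole batch of inputs sequentially,
# instead of one very short job per input
for sample in sample_A sample_B sample_C; do
    echo "processing ${sample}"   # replace with your real command
done
```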
Also note that if you are launching a large number of jobs, you should avoid asking to receive an email each time a job starts or ends. In the past, our server has sometimes been blacklisted because it tried to send more than 10,000 emails in one hour, and when this happens, no one receives emails anymore.

In addition, the mass storage is not optimized to store a very large number of small files. If the outputs of your commands are small files, you should ideally generate them on the scratch disk (see below) and then either concatenate them, gather the information that interests you into a single file, or [group them in an archive](mass-storage/mass-storage-compression) before transferring it to the mass storage. The same goes for slurm logs: if you have several thousand jobs and want to keep all the logs, we recommend combining them into a single archive.
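For example, assuming your logs follow slurm's default `slurm-<jobid>.out` naming, they can be bundled like this (demonstrated here in a throwaway directory with dummy files):

```
# demonstration in a temporary directory with three dummy log files
workdir=$(mktemp -d)
cd "$workdir"
touch slurm-1001.out slurm-1002.out slurm-1003.out

# bundle all the logs into one compressed archive
tar czf slurm_logs.tar.gz slurm-*.out

# check that the archive is readable before deleting the individual logs
tar tzf slurm_logs.tar.gz > /dev/null && rm slurm-*.out
```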
### 2.3 Operating system, programs and compute nodes

The operating system of the cluster is CentOS, a Linux distribution. The main programming languages are available on the cluster.

There is also a series of programs installed as `modules`. These modules can be used by following the instructions on [this wiki page](cluster/software/cluster-module). To use a module in an analysis, you must load it in your script.
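In practice this looks like the following; `samtools` here is only a hypothetical module name, so use `module avail` to see what is actually installed on the cluster:

```
module avail            # list the modules installed on the cluster
module load samtools    # hypothetical example: load a module before using the tool
samtools --version      # the tool is now available in your session
```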
There are several compute nodes available on the cluster. These nodes have different resources (number of CPUs and amount of RAM) and are grouped into "partitions". By default, any GIGA member has access to the compute nodes in the all_5hrs, all_24hrs and kosmos partitions. There is no time limit for jobs sent to the kosmos partition, but jobs sent to the two other partitions will be killed by slurm if they do not complete within the indicated time (5 h and 24 h respectively).

You can see the nodes present in the partitions to which you have access by typing

```
module load slurm # load the slurm module - no longer required once the module is loaded
sinfo
```
And you can see the resources available on each node with the two following commands:

```
sinfo -lN
grep ^Node /etc/slurm/slurm.conf
```

If your lab bought some compute nodes, they are probably in a separate partition, and the PI of your lab needs to make a request to the [UDIMED/UDIGIGA](contacts) to add you to the list of people having access to it.
### 2.4 Interactive sessions

For an interactive session, you can use the `srun` command. An example of this command is the following:

```
srun --partition=kosmos --ntasks=1 --cpus-per-task=1 --mem-per-cpu=1000 --pty bash
# adapt the name of the partition and the number of tasks, CPUs and memory per CPU to your needs
```

When running this type of command, slurm will allocate you resources on one of the compute nodes and log you in to that node in your terminal. Your prompt will thus change to include the name of the node (for example `u123456@genetic.ptfgen005 ~ $`). You are now on the node and can test all your commands.

Of note, while you are connected to the node:

- You shouldn't use more resources than what you asked for.
- The resources that have been allocated to you cannot be used by others, so when you no longer need them, please exit the node.
- If you lose your internet connection (or turn off your computer), your job will be interrupted. Interactive sessions are therefore only meant for testing a few commands or for debugging; once you want to run your analysis on big files, it is always better to run it with `sbatch` and a script.

You can find more information about this command here: https://slurm.schedmd.com/archive/slurm-14.11.11/srun.html
### 2.5 Job arrays

Arrays allow you to launch a certain number of jobs in parallel, progressively and in limited numbers, so as not to overload the computing cluster. To use arrays in a slurm job, add a `#SBATCH --array` option to the headers of the script.

Here is an example of the use of arrays:

```
#SBATCH --array=1-16%4 # in this example, we want to launch 16 jobs in total, with only 4 running in parallel

i=$((SLURM_ARRAY_TASK_ID - 1)) # turn the array identifier (here, a number ranging from 1 to 16) into a 0-based index
cd ${Directory_with_fastq_files}
AllData=(*.fastq.gz) # get the list of the 16 files we are interested in for the rest of the script
Data=${AllData[$i]} # retrieve the file corresponding to this task; this "Data" variable is then used by the command launched later in the script
```
Slurm will then launch the job for the first 4 files (as soon as the requested resources are available) and will wait for one of these jobs to finish before sending the next one.

You can find more explanations on arrays [here](https://help.rc.ufl.edu/doc/SLURM_Job_Arrays) and [here](https://rcc.uchicago.edu/docs/running-jobs/array/index.html).
### 2.6 Scratch disk and temporary folders/files

Writing your temporary and intermediate files to the mass storage will considerably slow down your analyses (and everybody else's). To store these files for the duration of your job, you have 2 options:

- the `/local` folder on the node where your analysis is running. There should be 2 TB of space available there (provided other users have properly deleted the temporary files they generated). Everything written there is available only from that node. As it is on the node itself, reading and writing files there is really fast.
- the `/home/gallia/scratch` folder, which is on a separate disk accessible from all nodes and currently has 17 TB of free space. Writing there will be slower than writing to `/local`, but still a lot faster than writing directly to the mass storage.

In both cases, you should create a subfolder with your ULiege identifier as the folder name. Inside that folder, you can create a subfolder specific to each job. Most importantly, at the end of your job, you should transfer all the files you need to keep to the mass storage, as the tmp and scratch folders are not backed up. Moreover, files written in the local folder should be deleted at the end of your job, and files on /home/gallia/scratch should be deleted as soon as you no longer need them, since these common spaces are limited.
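Inside a job script, that workflow might look like the following sketch; the folder layout under `/local` and the destination path are illustrative:

```
# create a per-user, per-job folder on the node's local disk
JOBDIR=/local/u123456/${SLURM_JOB_ID}
mkdir -p "$JOBDIR"

# ... run your analysis, writing temporary files into $JOBDIR ...

# at the end of the job, copy the results you want to keep to the mass storage
cp "$JOBDIR"/my_results.txt /path/to/your/project/folder/   # illustrative destination
# and clean up the local folder
rm -rf "$JOBDIR"
```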
### 2.7 Final remark

Our cluster is optimized for jobs that require a lot of memory. If you need many CPUs but relatively little RAM, you can use [the CECI clusters](http://www.ceci-hpc.be/clusters.html), which are accessible to all members of the University of Liège. CECI has very good tutorials [here](https://support.ceci-hpc.be/doc/).