update so that VPN and DoX links now points to SEGI english description + remove images associated to old explanations
authored by Mayer Alice
@@ -4,44 +4,44 @@
# Preliminary remark
The cluster part of this wiki is still under construction. However, we have gathered here some
pieces of information to help you start using it.
Do not hesitate to contact the [Bioinformatics team](contacts) if you have any questions or need help
resolving a problem.
# The GIGA high-performance computing system
The GIGA provides its members with a high-performance computing system (hereafter called the cluster)
composed of (1) a **mass storage** for large datasets, (2) several **compute nodes** to perform analyses,
(3) a central server (**master**) which connects the different components and manages the analyses (jobs)
sent to the compute nodes and (4) a **scratch disk** where you can temporarily keep intermediate results.
This infrastructure is used by almost 300 people. Therefore, there are several important rules to follow
so as not to endanger the analyses of other users.
## 1. The mass storage
Here are a few links to specific pages of our wiki:
- [Full description](mass-storage/mass-storage-home)
- [Quickstart](mass-storage/quickstart-mass-storage)
- [Connection instructions](mass-storage/mass-storage-connection)
- [VPN instructions](https://my.segi.uliege.be/cms/c_11650735/en/mysegi-vpn-f5-big-ip) (if you want to connect from outside the university network)
- [Frequently asked questions](faq/FAQ)
- Video: https://youtu.be/VppEcHAvSoU
Once connected to the mass storage, you will have access to different spaces.
1. The first one, called **home**, is where you can store up to 100 GB of files/scripts that you do not want to share
   with the members of your laboratory. It is your entry point on the storage and on the cluster,
   whatever mode of connection you use.
2. The second space is a folder associated with your **project**
   (accessible from your home, via `_SHARE_/Research/<UnitAbreviation>/<LabAbreviation>/<TeamName>/<ProjectName>`).
   This folder is accessible to all members of the team who work on that project.
   For you to have access to this folder, the PI of your team needs to make a request to the [UDI GIGA-MED IT specialist](contacts)
   specifying the name of the laboratory, the "Team" folder and the "project" folder(s), as well as your username.
   If it is a new project, the UDI GIGA-MED IT specialist can create the folder.
3. There is also a **"Resources"** space on the storage, accessible via the link of the same name
   in the `_SHARE_` folder, in which you'll find some reference genomes as well as some analysis tools.

[Here are the instructions to upload your data and scripts to the mass storage.](faq/file_transfer)
@@ -50,65 +50,65 @@ in the `_SHARE_` folder, and in which you'll find some reference genomes as well
### 2.1 Connection
If you haven't yet, please first connect to the mass storage using the SAMBA protocol, as
explained [here](mass-storage/mass-storage-connection).
Once that is done, you can connect directly to the cluster by typing `ssh u123456@cluster.calc.priv`
in your terminal (replacing u123456 with your university ID).
You will then be in your [home](faq/home-info), from which you have access to
the mass storage and to the compute nodes.
### 2.2 Very important points:
When you connect to the cluster, you are on the master.
**It is FORBIDDEN to run your calculations directly on the master.**
You must use `slurm` to send your analyses (jobs) to the compute nodes.
Slurm is a resource management system.
It lets cluster users request the resources each job needs and launches the job as soon
as these resources are available.
To use slurm, you can write a bash script that contains information about the resources needed
as well as the command(s) to run. Or you can use an interactive slurm session in which you can
run your commands directly.
You can find more information about slurm on [this wiki page](cluster/slurm/slurm_home).
The CECI also has a very clear [tutorial](https://support.ceci-hpc.be/doc/_contents/QuickStart/SubmittingJobs/SlurmTutorial.html)
and [FAQ](https://support.ceci-hpc.be/doc/_contents/SubmittingJobs/SlurmFAQ.html).
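
As a minimal sketch of the script-based approach (the job name, the resource values and the log file pattern are placeholders chosen for illustration, not site recommendations), a batch script could look like this:

```shell
#!/bin/bash
#SBATCH --job-name=my_analysis      # hypothetical job name
#SBATCH --partition=all_5hrs        # one of the default partitions (see section 2.3)
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=1000          # memory per CPU, in MB
#SBATCH --output=slurm-%j.log       # %j is replaced by the job ID

# The #SBATCH lines are plain comments for bash, so the script can also be
# run outside slurm for a quick check.
node=${HOSTNAME:-unknown}
echo "Job ${SLURM_JOB_ID:-none} running on $node"

# Replace the line below with the command(s) of your analysis.
echo "analysis done"
```

Saved as, say, `my_job.sh` (a hypothetical file name), it would be submitted from the master with `sbatch my_job.sh`.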
In addition to this, there are some very important considerations related to the type of jobs you want to run.
- As the compute nodes are shared by everyone, it is important not to launch several million jobs
  at the same time, so as not to use all the available resources. To limit the number of jobs running in parallel,
  you can use arrays (see below).
- In addition, you should avoid launching a large number of very short jobs one after the other.
  Indeed, if the overhead time slurm needs to manage each job exceeds the run time of the job itself and
  many such jobs are sent at the same time, slurm is going to crash.
  It is recommended that each individual job take at least 20 minutes.
  If you have lots of small jobs, please combine several of them into one job that executes them one after the other
  (for example with a for loop) so that the actual job managed by slurm lasts about 20 minutes or more
  (and send several of these combined jobs in parallel using the array method explained below).
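
The combination of short tasks into longer jobs could be sketched as follows. The numbers (20 tasks, 5 per job) and the variable names are hypothetical; the `echo` stands in for your real short command.

```shell
#!/bin/bash
#SBATCH --array=0-3%2               # 4 combined jobs, at most 2 running at once
#SBATCH --ntasks=1

# Hypothetical example: 20 short tasks, numbered 0..19, grouped 5 per slurm job.
n_tasks=20
chunk=5

# Outside slurm the array index is unset; default to 0 so the script can be tested.
task_id=${SLURM_ARRAY_TASK_ID:-0}
first=$(( task_id * chunk ))
last=$(( first + chunk - 1 ))
(( last >= n_tasks )) && last=$(( n_tasks - 1 ))

processed=0
for (( t = first; t <= last; t++ )); do
    # Replace this echo with the real short command for task number $t.
    echo "running short task $t"
    processed=$(( processed + 1 ))
done
echo "array index $task_id processed $processed tasks"
```

The chunk size should be chosen so that one combined job lasts roughly 20 minutes or more.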

Also note that if you are launching a large number of jobs, you should avoid asking to receive an email
each time a job starts or ends. In the past, our server has sometimes been blacklisted because
it tried to send more than 10,000 emails in 1 hour, and when this happens, no one receives emails anymore.
In addition, the mass storage is not optimized to store a very large number of small files,
so if the outputs of your commands are small files, you should ideally generate them on the scratch disk (see below)
and then either concatenate them, or gather the information that interests you in a single file,
or [group them in an archive](mass-storage/mass-storage-compression) before transferring it to the mass storage.
The same applies to slurm logs: if you have several thousand jobs and want to keep all the logs,
we recommend combining them into a single archive.
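
As an illustration of the two options (concatenating or archiving), here is a small sketch. The directory layout and file names are invented for the example, and a temporary directory stands in for your scratch subfolder.

```shell
#!/bin/bash
# Hypothetical layout: many small result files in a work directory.
workdir=$(mktemp -d)                 # stands in for your scratch subfolder
mkdir -p "$workdir/results"
for i in 1 2 3; do
    echo "sample_$i 42" > "$workdir/results/sample_$i.txt"
done

# Option 1: concatenate the small files into a single file.
cat "$workdir"/results/sample_*.txt > "$workdir/all_samples.txt"

# Option 2: group them into one compressed archive.
tar -czf "$workdir/results.tar.gz" -C "$workdir" results

# Only the single file or the archive would then be copied to the mass storage.
n_lines=$(( $(wc -l < "$workdir/all_samples.txt") ))
n_archived=$(tar -tzf "$workdir/results.tar.gz" | grep -c '\.txt$')
echo "$n_lines lines concatenated, $n_archived files archived"
```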
### 2.3. Operating system, programs and compute nodes:
The operating system of the cluster is CentOS, which is a Linux distribution.
The main programming languages are available on the cluster.
There is also a series of programs installed as `modules`.
These modules can be used by following the instructions on [this wiki page](cluster/software/cluster-module).
To use a module in an analysis, you must load the module in your script.
There are several compute nodes available on the cluster.
These nodes have different resources (number of CPUs and RAM available) and are grouped into "partitions".
By default, any GIGA member has access to the compute nodes that are in the partitions `all_5hrs`, `all_24hrs` and `kosmos`. There is no time limit for jobs sent to the `kosmos` partition, but jobs sent to the other 2 partitions will be killed by slurm if they don't complete in the indicated time (5h and 24h respectively).
@@ -118,13 +118,13 @@ module load slurm # to load the slurm module - no longer required once the modul
```
sinfo
```
And see the resources available on each node with the following 2 commands:
```
sinfo -lN
grep '^Node' /etc/slurm/slurm.conf
```
If your lab bought some compute nodes, they are probably in a separate partition and the PI of your lab
needs to make a request to the [UDIMED/UDIGIGA](contacts) to add you to the list of people having access to it.
### 2.4 Interactive sessions
@@ -134,25 +134,25 @@ srun --partition=kosmos --ntasks=1 --cpus-per-task=1 --mem-per-cpu=1000 --pty ba
```
# change the name of the partition, number of tasks, CPUs and memory per CPU according to your needs
```
When running this type of command, slurm will allocate you resources on one of the compute nodes and
will log you in to that node in your terminal. Your prompt will thus change to include the name of the node
(for example `u123456@genetic.ptfgen005 ~ $`). You are now on the node and you can test all your commands.
Of note, while you are connected to the node:
- You shouldn't use more resources than what you asked for.
- The resources that have been allocated to you cannot be used by others,
  so when you don't need them anymore, please exit the node.
- If you lose your internet connection (or turn off your computer), your job will be interrupted,
  so interactive sessions are only meant for testing, trying out a few things or debugging. Once you want to run your analysis on big files,
  it's always better to run it with sbatch and a script.

You can find more information about this command here: https://slurm.schedmd.com/archive/slurm-14.11.11/srun.html
### 2.5 Job arrays
Arrays allow you to launch a set of jobs progressively, with a limit on how many run in parallel,
so as not to overload the computing cluster. To use arrays in a slurm job,
add a `#SBATCH --array` option to the header of the script.
Here is an example of the use of arrays:
@@ -165,37 +165,37 @@ AllData=(* .fastq.gz) # we get the list of 16 files that interests us for the re
```
Data=${AllData[$i]} # we retrieve the file at index $i in the list; this "Data" variable is then used by the command launched later in the script
```
Slurm will then launch the job for the first 4 files (as soon as the requested resources are available)
and will wait for one of the jobs to finish before sending the next one.
You can find more explanations on arrays [here](https://help.rc.ufl.edu/doc/SLURM_Job_Arrays)
and [here](https://rcc.uchicago.edu/docs/running-jobs/array/index.html).
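
For reference, here is a self-contained sketch of an array script in the same spirit. It is an illustration under assumptions, not the script from this page: the names `AllData`-style list, `Data` and `i` follow the excerpt above, but taking `i` from `SLURM_ARRAY_TASK_ID`, the helper `select_input` and the `DATA_DIR` variable are our own hypothetical choices.

```shell
#!/bin/bash
#SBATCH --array=0-15%4    # 16 array tasks (indices 0..15), at most 4 running at the same time
#SBATCH --ntasks=1

# select_input DIR INDEX: print the INDEX-th *.fastq.gz file of DIR (sorted by name).
select_input() {
    local dir=$1 index=$2
    local files=( "$dir"/*.fastq.gz )
    echo "${files[$index]}"
}

# Assumption: the index comes from SLURM_ARRAY_TASK_ID (default 0 outside slurm).
i=${SLURM_ARRAY_TASK_ID:-0}
Data=$(select_input "${DATA_DIR:-.}" "$i")
echo "array task $i would process: $Data"
```

The `%4` suffix is what limits the number of array tasks running simultaneously.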
### 2.6 Scratch disk and temporary folders/files:
Writing your temporary and intermediate files to the mass storage will considerably slow down your analyses
(and everybody else's). To store these files for the duration of your job, you have 2 options:
- the `/local` folder on the node where your analysis is running.
  There should be 2 TB of space available there (if other users have properly deleted the temporary files they generated).
  Everything written there is available only from that node.
  As it's on the node itself, it's really fast to write and read files there.
- the `/home/gallia/scratch` folder, which is on a separate disk accessible from all nodes and currently has
  17 TB of free space. Writing there will be slower than writing to `/local` but still a lot faster than
  writing directly to the mass storage.

In both cases, you should create a subfolder with your ULiege identifier as folder name.
Inside that folder, you can create a subfolder specific to each job.
More importantly, at the end of your job, you should transfer all the files you need to keep to
the mass storage, as the `/local` and scratch folders are not backed up.
Moreover, files written in the `/local` folder should be deleted at the end of your job,
and files on `/home/gallia/scratch` should be deleted as soon as you no longer need them,
since these common spaces are limited.
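
The workflow described in this section could be sketched as follows. `SCRATCH_ROOT` and `DEST` are hypothetical variables introduced so that the script can be tried anywhere; on the cluster, `SCRATCH_ROOT` would be `/local` or `/home/gallia/scratch` and `DEST` a folder on the mass storage.

```shell
#!/bin/bash
#SBATCH --ntasks=1

# Hypothetical variables: temporary directories are used when they are not set.
SCRATCH_ROOT=${SCRATCH_ROOT:-$(mktemp -d)}
DEST=${DEST:-$(mktemp -d)}

# One subfolder per user, then one per job.
jobdir="$SCRATCH_ROOT/${USER:-u123456}/job_${SLURM_JOB_ID:-test}"
mkdir -p "$jobdir"

# Delete the job folder when the script exits, even after an error.
trap 'rm -rf "$jobdir"' EXIT

# Write intermediate files on the scratch space (placeholder commands).
echo "intermediate" > "$jobdir/tmp.txt"
tr 'a-z' 'A-Z' < "$jobdir/tmp.txt" > "$jobdir/final.txt"

# Keep only what is needed: copy the final result to the mass storage.
cp "$jobdir/final.txt" "$DEST/"
echo "result copied to $DEST/final.txt"
```

Using `trap ... EXIT` is one way to make sure the cleanup also happens when a command fails partway through the job.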
### 2.7 Final remark:
Our cluster is optimized for jobs that require a lot of memory.
If you need a lot of CPUs and relatively little RAM,
you can use [the CECI clusters](http://www.ceci-hpc.be/clusters.html)
(accessible to all members of the University of Liège).
CECI has very good tutorials [here](https://support.ceci-hpc.be/doc/).