Changes

Mayer Alice · 649028db
--- a/mass-storage/mass-storage-backup-archive.md
+++ b/mass-storage/mass-storage-backup-archive.md
@@ -4,23 +4,23 @@

 # Backup procedure

-Personal spaces (home) and team's folders are backed on a regular basis on a tape 
-library situated in a different location (mass storage disks are located in the 
-GIGA, B35, while the tape library is situated at the SEGI, B26, and in February 2021, 
-a second tape library will be added at the CHU datacenter, B35). 
-
-It’s an automatized procedure in which any file that has been modified then left 
-unchanged for at least 2h will enter a “backing-up queue” and will be backed up 
-as soon as technically possible. In most cases, this means that newly modified 
-files will be backed up after 2h of inactivity, but the delay could be longer if 
-large files are currently being backed up, if a large number of files have been 
-modified recently or if the system is momentarily down for maintenance. In all 
+Personal spaces (home) and team folders are backed up on a regular basis to a tape
+library situated in a different location (mass storage disks are located in the
+GIGA, B35, while the tape library is situated at the SEGI, B26, and in February 2021,
+a second tape library will be added at the CHU datacenter, B35).
+
+It’s an automated procedure in which any file that has been modified and then left
+unchanged for at least 2h will enter a “backing-up queue” and will be backed up
+as soon as technically possible. In most cases, this means that newly modified
+files will be backed up after 2h of inactivity, but the delay could be longer if
+large files are currently being backed up, if a large number of files have been
+modified recently or if the system is momentarily down for maintenance. In all
 these exceptional cases, it could take several hours before a file is actually backed up.

-The system will keep a maximum of 25 versions of each file for a maximum of 28 days. 
-These will be the 25 last versions, so if a file is backed up 12 times a day, 
-the oldest recoverable version will be only 2 days old. Previous versions of a 
-file that have been saved more than 28 days ago will be deleted from the system. 
+The system will keep a maximum of 25 versions of each file for a maximum of 28 days.
+These will be the 25 most recent versions, so if a file is backed up 12 times a day,
+the oldest recoverable version will be only about 2 days old. Previous versions of a
+file that were saved more than 28 days ago will be deleted from the system.
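The retention arithmetic above can be sketched as a small shell function. This is only an illustration: `oldest_recoverable_days` is a hypothetical helper, not part of the backup system, and it assumes the file is backed up at a regular interval given in whole hours.

```shell
# Hypothetical helper: approximate age, in whole days, of the oldest
# recoverable version of a file that is backed up every "$1" hours.
oldest_recoverable_days() {
    interval_hours=$1
    max_versions=25   # versions kept per file
    max_days=28       # versions older than this are deleted
    # Age of the 25th most recent version, rounded down to whole days.
    days=$(( max_versions * interval_hours / 24 ))
    # Whatever the backup rate, nothing older than 28 days is kept.
    if [ "$days" -gt "$max_days" ]; then
        days=$max_days
    fi
    echo "$days"
}
```

With an interval of 2 hours (12 backups a day), `oldest_recoverable_days 2` prints 2. With an interval of 48 hours, the 28-day limit applies first and the function prints 28.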

 In other words, it is possible to recover any previous version of a file if
 - that version has been backed up (i.e. stayed inactive for at least 2h after having been modified and saved)
@@ -31,155 +31,154 @@ If a file is deleted from the disk, the last backed-up version will be kept on t

 Users have to ask the UDI GIGA-MED IT specialist to recover their last backed-up file, or, when possible, one of the previous versions.

-**IMPORTANT**  
-As the tape system can only handle a limited amount of data per hour, any action 
-impacting 1 Tb of data or more, for example copying a large dataset or downloading 
-large files, must be reported **before** being performed to the UDI GIGA-MED IT 
-department so that these large changes do not affect the proper functioning of 
-the operations. This rule also applied when large number of small files are created 
+**IMPORTANT**
+As the tape system can only handle a limited amount of data per hour, any action
+impacting 1 TB of data or more, for example copying a large dataset or downloading
+large files, must be reported to the UDI GIGA-MED IT department **before** being
+performed, so that these large changes do not affect the proper functioning of
+the operations. This rule also applies when a large number of small files are created
 or modified (e.g. several thousand files of less than 1 MB).

 ### nobackup folders

-For files that do not require to be backed up, such as temporary files, a specific 
-folder can be created. This folder must be called "nobackup". It must be written 
-exactly like this, without space and in lowercase. Otherwise it will still be backed up. 
+For files that do not need to be backed up, such as temporary files, a specific
+folder can be created. This folder must be called "nobackup". It must be written
+exactly like this, without spaces and in lowercase. Otherwise, it will still be backed up.
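The naming rule can be checked with a short shell function before relying on it. This is a sketch only: `is_excluded_from_backup` is a hypothetical helper, and it assumes that everything located below a folder named exactly `nobackup` is skipped by the backup.

```shell
# Hypothetical check: succeeds if one folder in the path "$1" is named
# exactly "nobackup" (lowercase, no space), fails otherwise.
is_excluded_from_backup() {
    case "/$1/" in
        */nobackup/*) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example, `is_excluded_from_backup project/nobackup/tmp.txt` succeeds, while `project/NoBackup/tmp.txt` and `project/no backup/tmp.txt` both fail the check.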

 # Offline data archiving

-**Why is it useful to send data to the offline archiving system?**  
-Since the disk space on the mass storage is finite and expensive and the 
+**Why is it useful to send data to the offline archiving system?**
+Since the disk space on the mass storage is finite and expensive and the
 amount of data we are producing grows exponentially, users are encouraged to send files
-they need to keep but don't need to access anymore to our offline archiving 
-system in order to release some space on disk.   
+they need to keep but no longer need to access to our offline archiving
+system in order to free up some space on disk.

-**How does it work?**  
+**How does it work?**
 As explained in the backup section of this page, each file present on disk is
-also copied on 2 tapes libraries. Once a file is sent to the offline archiving system, 
-the copy on disk is truncated, but there is still 2 copies on the 2 tape libraries. 
+also copied to 2 tape libraries. Once a file is sent to the offline archiving system,
+the copy on disk is truncated, but there are still 2 copies on the 2 tape libraries.
 Archived files are not directly accessible to users, but can be restored on disk if needed.
 Of note, the restoration of an archive is obviously possible only if there is enough space on disk to store it.

-**Where will the archived data be stored?**  
-Currently, both tape libraries are located in the same robot at the SEGI. However, 
-in February 2021, we'll add a new robot at the CHU datacenter, and one of the tape 
-library will be moved there, so that the 2 copies will be in different locations. 
-
-**Which type of files can be sent to the offline archiving system?**  
-Tape libraries are designed for long term storage of data. They are more stable 
-and a lot cheaper than disk space. However, they are also slower in term of writing/reading 
-capacity, so retrieving archived data can take several days. 
-Therefore, files sent to the offline archiving system should be files that the user 
-needs to keep (for legal reasons for example, or data that have been fully analyzed but not 
-published yet, in case a reviewer is asking to redo part of the analysis with a 
+**Where will the archived data be stored?**
+Currently, both tape libraries are located in the same robot at the SEGI. However,
+in February 2021, we'll add a new robot at the CHU datacenter, and one of the tape
+libraries will be moved there, so that the 2 copies will be in different locations.
+
+**Which type of files can be sent to the offline archiving system?**
+Tape libraries are designed for long-term storage of data. They are more stable
+and a lot cheaper than disk space. However, they are also slower in terms of read/write
+speed, so retrieving archived data can take several days.
+Therefore, files sent to the offline archiving system should be files that the user
+needs to keep (for legal reasons for example, or data that have been fully analyzed but not
+published yet, in case a reviewer asks to redo part of the analysis with
 new software or different options) but no longer needs to use/access quickly.

 **Is there a size limit for sending data to archive?**
   - There is no upper limit to the size of an archive. But if you want to send several Terabytes of data,
-please organize the main folder into subfolders containing data that are likely to be retrieved 
+please organize the main folder into subfolders containing data that are likely to be retrieved
 together and keep a record of the tree structure, so that we don't need to retrieve
-the whole archive if you need only some of the files.  
-   - The minimum size you can send to archive is 500Gb. If your experiments typically generate less 
-than 500Gb of data, you can wait to have several experiments (eventually in separated subfolder) 
-before to archive them. 
+the whole archive if you need only some of the files.
+   - The minimum size you can send to archive is 500 GB. If your experiments typically generate less
+than 500 GB of data, you can wait until you have several experiments (possibly in separate subfolders)
+before archiving them.

-**Warning about hardlinks**  
+**Warning about hardlinks**
 NB1: If you don't know what a hardlink is, you probably don't have any (it's actually quite rare to have some in data).
 NB2: If you made links using `ln -s` command, you made a softlink and not a hardlink.
-If you have some hardlinks in your archive folder and if other occurrences of the same file are in your project folder, 
-be aware that once the file will be truncated, it will be so in all locations 
-(everywhere where you have a hardlink pointing to that file). 
-The side effect of this is that if you open the copy in your project folder, 
-the file will be restored on disk, which means that 
+If you have some hardlinks in your archive folder and if other occurrences of the same file are in your project folder,
+be aware that once the file is truncated, it will be truncated in all locations
+(everywhere you have a hardlink pointing to that file).
+The side effect of this is that if you open the copy in your project folder,
+the file will be restored on disk, which means that
 1. you need to have enough space in the folder to store it (or the retrieval will fail)
 2. opening it the first time will be very slow

-Don't hesitate to ask the [Bioinformatic teams](contacts) if you have any question or want to 
+Don't hesitate to ask the [Bioinformatics team](contacts) if you have any questions or want to
 discuss your specific use of hardlinks.
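If you are not sure whether a folder contains hardlinks, one way to look for them is sketched below. `list_hardlinked_files` is a hypothetical helper name. It relies on the standard `find` test `-links +1`, which matches files that have more than one name on the same filesystem.

```shell
# Hypothetical helper: list regular files under "$1" that have more than
# one hard link. Softlinks made with `ln -s` are not reported.
list_hardlinked_files() {
    find "$1" -type f -links +1
}
```

An empty result means that no file in the folder has a second hardlink. Note that the other name of a hardlinked file may be outside the folder you searched, so a reported file is worth checking before you send the folder for archiving.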

-**NB:** In some circumstances, some files may be sent offline even if the user didn't ask for it. 
+**NB:** In some circumstances, some files may be sent offline even if the user didn't ask for it.
 See the [automatic archiving](mass-storage/mass-storage-backup-archive#automatic-archiving) section for more information.

 ## On demand archiving

 ### Procedure to send data for archiving

-Files/folders should be properly organized before being sent for archiving.  
-Don't hesitate to contact the [Bioinformatics team](contacts) 
-if you need help for any of these steps.  
+Files/folders should be properly organized before being sent for archiving.
+Don't hesitate to contact the [Bioinformatics team](contacts)
+if you need help for any of these steps.

 The procedure to send files for archiving is:
-1. If not already done, ask the UDIMED/UDIGIGA (https://sam.med.uliege.be/) to create an "ARCHIVES" folder in 
+1. If not already done, ask the UDIMED/UDIGIGA (https://sam.med.uliege.be/) to create an "ARCHIVES" folder in
 your team folder on the mass storage. This needs to be done only the very first time.
 2. Determine which files/folders you want to send for archiving and organize them so that
    - All related files that are likely to be retrieved together (if a retrieval is ever needed) are in the same (sub)folder with a meaningful name
    - Big files (typically 200 MB and more) are compressed as much as possible as explained [here](mass-storage/mass-storage-compression)
    - Numerous small files (typically several thousands of files smaller than 4 MB) are grouped into archive files as explained [here](mass-storage/mass-storage-compression)
-3. Create a subfolder in the "ARCHIVES" folder with a meaningful name (for example the name of the project, the date and any specific information).   
+3. Create a subfolder in the "ARCHIVES" folder with a meaningful name (for example the name of the project, the date and any specific information).
 **WARNING: don't use any space or special character in the folder names !!!!**
-4. Move in that folder the data you want to send for archiving (organised as described above)  
-**WARNING:** if your data are on the mass storage it's very important to move them (using `mv` and not rsync or cp)!!!   
-This move should be done directly on the mass storage (see important considerations below) and not from the cluster.  
-If you want to archive data that are currently on another disk (for example gallia or CECI cluster), you need to 
+4. Move the data you want to send for archiving into that folder (organised as described above).
+**WARNING:** if your data are on the mass storage it's very important to move them (using `mv` and not rsync or cp)!!!
+This move should be done directly on the mass storage (see important considerations below) and not from the cluster.
+If you want to archive data that are currently on another disk (for example gallia or CECI cluster), you need to
 transfer them using rsync or cp/scp.
-5. Wait until you have at least 500Gb of data to send for archiving 
-(eventually grouping separated project in separated sub-folder). 
-6. Keep a record of what you have sent for archiving, for example in a text file explaining what's in each folder. 
+5. Wait until you have at least 500 GB of data to send for archiving
+(possibly grouping separate projects in separate subfolders).
+6. Keep a record of what you have sent for archiving, for example in a text file explaining what's in each folder.
 You can also make a list of the files using the Linux `tree` command (with the help of the bioinformatic platforms if needed).
-Of note, we strongly recommend to have both a tree and file with a description of what's in the archive, 
+Of note, we strongly recommend having both a tree and a file describing what's in the archive,
 as a list of file names might not be enough for you to know exactly what is in each folder.
-7. Contact the UDIMED/UDIGIGA by filling a form at https://sam.med.uliege.be/  
+7. Contact the UDIMED/UDIGIGA by filling in a form at https://sam.med.uliege.be/
    The form must contain the following pieces of information:
    - path to the folder to archive (or at least the name of the team and the name of the folder to archive)
    - the number of years that the data must be kept (default is 5 years)
    - name of the PI in charge
-8. The UDIMED/UDIGIGA IT specialist will send your data to the archiving system. 
+8. The UDIMED/UDIGIGA IT specialist will send your data to the archiving system.
 Once that's done, you won't be able to enter the archived folder or see your data anymore.
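Steps 3, 4 and 6 of the procedure can be sketched as shell functions. This is a minimal illustration under stated assumptions, not an official tool: the function names are hypothetical, "special character" is interpreted here as anything other than letters, digits, `_`, `-` and `.`, and the file listing is produced with `find` so that the sketch does not depend on `tree` being installed.

```shell
# Hypothetical check: accept only names made of letters, digits, "_", "-"
# and "." (assumed meaning of "no space or special character").
is_valid_archive_name() {
    case "$1" in
        ''|*[!A-Za-z0-9_.-]*) return 1 ;;
        *) return 0 ;;
    esac
}

# Write a sorted listing of everything under "$1" into the file "$2",
# as a record of what is being sent for archiving.
write_manifest() {
    ( cd "$1" && find . | sort ) > "$2"
}

# Move "$1" into a new subfolder named "$3" of the existing ARCHIVES
# folder "$2". Uses mv, as the procedure requires for data that are
# already on the mass storage.
move_to_archive() {
    src=$1
    archives=$2
    name=$3
    if ! is_valid_archive_name "$name"; then
        echo "invalid folder name: $name" >&2
        return 1
    fi
    if [ ! -d "$archives" ]; then
        echo "ARCHIVES folder not found: $archives" >&2
        return 1
    fi
    mkdir -p "$archives/$name" && mv "$src" "$archives/$name/"
}
```

Because a large move can take a long time, it is safer to run it inside a `screen` session, for example one started with `screen -S archive_move`. You can detach from the session with `Ctrl-a d` and come back to it later with `screen -r archive_move`.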

 **IMPORTANT CONSIDERATIONS**
- To move your data to the "ARCHIVES" folder, it's recommended to log into the mass storage with 
-`ssh u123456@massstorage.giga.priv` (replace u123456 by your university userID). 
-Then you should run a `screen` session to prevent any interruption of the transfer 
+- To move your data to the "ARCHIVES" folder, it's recommended to log into the mass storage with
+`ssh u123456@massstorage.giga.priv` (replace u123456 with your university user ID).
+Then you should run a `screen` session to prevent any interruption of the transfer
 if you lose your connection to the mass storage.
-If you don't know how to run a screen session or move file from a terminal or if you are not sure 
+If you don't know how to run a screen session or move files from a terminal, or if you are not sure
 of the method you should use, please contact the [Bioinformatic team](contacts).
- Moving data to the "ARCHIVES" folder is not enough for them to be truncated. 
-You have to ask the UDIMED/UDIGIGA IT specialist to archive them. 
-Data in the "ARCHIVES" folder that haven't been truncated still takes up space on disk and are therefore 
-still taken into account for the billing of your disk usage.
- Each truncated file is occupying 4kb of space on disk. So, if you archive 1 million of individual files, 
-the remaining volume on disk will be 4Gb (ish). However, if you have hundreds of thousands of small files, 
+- **Moving data to the "ARCHIVES" folder is not enough for them to be truncated.**
+You have to ask the UDIMED/UDIGIGA IT specialist to archive them.
+Data in the "ARCHIVES" folder that haven't been truncated still take up space on disk!!!
+- Each truncated file occupies 4 kB of space on disk. So, if you archive 1 million individual files,
+the remaining volume on disk will be about 4 GB. However, if you have hundreds of thousands of small files,
 you'll save more space by grouping them into one file as explained [here](mass-storage/mass-storage-compression)
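To get an idea of the space that will remain occupied once a folder has been truncated, the number of files can be multiplied by the 4 kB mentioned above. The function below is a rough sketch: `truncated_footprint_kb` is a hypothetical helper, and the count is only reliable if no file name contains a line break.

```shell
# Hypothetical helper: estimate, in kB, the disk space still used by the
# files under "$1" after truncation, assuming 4 kB per truncated file.
truncated_footprint_kb() {
    n=$(find "$1" -type f | wc -l | tr -d ' ')
    echo $(( n * 4 ))
}
```

For 1 million files, the result is 4000000 kB, which is the 4 GB given above. Grouping small files into one archive file before sending them brings the count, and therefore the remaining footprint, down accordingly.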

 ### Procedure to retrieve data from archiving

 The procedure to retrieve archived files/folders is:

-1. Contact the UDIMED/UDIGIGA by filling a form at  https://sam.med.uliege.be/  
+1. Contact the UDIMED/UDIGIGA by filling in a form at https://sam.med.uliege.be/
   The form must contain the following pieces of information:
   - the path to the folder you want to retrieve (or at least the name of the team and the name of the folder)
   - if you don't want to retrieve the whole folder but only some part of it, don't forget to mention which subfolder
   - the name of the PI in charge
-2. The UDIMED/UDIGIGA will check there is enough space left on disk to store your data. 
-If this is the case, they will retrieve your data and give you access to them. 
+2. The UDIMED/UDIGIGA will check there is enough space left on disk to store your data.
+If this is the case, they will retrieve your data and give you access to them.

-**Note**: This operation may take several days. This means that you need to **anticipate** the need of those data. 
+**Note**: This operation may take several days. This means that you need to **anticipate** your need for those data.

 ## Automatic archiving

-Once the storage will reach 80% of its maximum storage capacity, oldest data will 
-automatically be truncated from the disk in order to save space. Once that happens, 
-there will still be 2 copies of the file on tape but only the beginning of the file 
-and its metadata will stay on disk. Therefore, the file name will still be visible 
-in the tree view but the file itself will be on tape. 
+Once the storage reaches 80% of its maximum capacity, the oldest data will
+automatically be truncated from the disk in order to save space. Once that happens,
+there will still be 2 copies of the file on tape but only the beginning of the file
+and its metadata will stay on disk. Therefore, the file name will still be visible
+in the tree view but the file itself will be on tape.

-This procedure will affect only files of more than 4Mb and will start with files 
-that haven’t been modified or open for at least 270 days. If that’s not 
+This procedure will affect only files of more than 4 MB and will start with files
+that haven’t been modified or opened for at least 270 days. If that’s not
 enough, “younger” files may be affected too.

-It means that opening data that haven’t been used for a long time will become a 
-slow process, as these data will first need to migrate from tape to disk before 
-to be accessible again. The time required will depend of the ongoing backing up 
+It means that opening data that haven’t been used for a long time will become a
+slow process, as these data will first need to migrate from tape to disk before
+being accessible again. The time required will depend on the ongoing backup
 of other files. In optimal conditions, it shouldn't take more than 1h to access a 1 TB file.
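To see which of your files would be the first candidates for automatic truncation, the two criteria above (size and age) can be approximated with `find`. This is a sketch under assumptions: `list_truncation_candidates` is a hypothetical helper, "4 MB" is taken to mean 4194304 bytes, and the real selection made by the storage system may differ, for instance because "younger" files can also be affected.

```shell
# Hypothetical helper: list files under "$1" that are larger than 4 MB
# (taken here as 4194304 bytes) and that have been neither modified nor
# read for at least 270 days.
list_truncation_candidates() {
    # "+269" matches ages of 270 whole days or more.
    find "$1" -type f -size +4194304c -mtime +269 -atime +269
}
```

Files listed by this function are the ones most likely to be slow to open if the storage has reached its 80% threshold, so it can help to check the list before planning an analysis on old data.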

-# [Contacts](contacts)
\ No newline at end of file
+# [Contacts](contacts)