Changes

Bouquieaux Marie-Catherine · a8f852dc
--- a/mass-storage/mass-storage-backup-archive.md
+++ b/mass-storage/mass-storage-backup-archive.md
@@ -4,14 +4,23 @@

 # Backup procedure

-Personal spaces (home) and team's folders are backed on a regular basis on a tape library situated in a different location (at the SEGI, B36). It’s an automatised procedure in which 
-any file that has been modified then left unchanged for at least 2h will enter a “backing-up queue” and will be backed up as soon as technically possible. In most cases, this 
-means that newly 
-modified files will be backed up after 2h of inactivity, but the delay could be longer if large files are currently being backed up, if a large number of files have been modified 
-recently or if the system is momentarily down for maintenance. In all these cases, it could take several hours before a file is actually backed up.
-
-The system will keep a maximum of 25 versions of each file for a maximum of 28 days. These will be the 25 last versions, so if a file is backed up 12 times a day, the oldest 
-recoverable version will be only 2 days old. Previous versions of a file that have been saved more than 28 days ago will be deleted from the system. 
+Personal spaces (home) and team folders are backed up on a regular basis to a tape 
+library situated in a different location (mass storage disks are located in the 
+GIGA, B35, while the tape library is situated at the SEGI, B26, and in February 2021, 
+a second tape library will be added at the CHU datacenter, B35). 
+
+It’s an automated procedure in which any file that has been modified and then left 
+unchanged for at least 2h will enter a “backing-up queue” and will be backed up 
+as soon as technically possible. In most cases, this means that newly modified 
+files will be backed up after 2h of inactivity, but the delay could be longer if 
+large files are currently being backed up, if a large number of files have been 
+modified recently or if the system is momentarily down for maintenance. In all 
+these exceptional cases, it could take several hours before a file is actually backed up.
+
+The system will keep a maximum of 25 versions of each file for a maximum of 28 days. 
+These will be the 25 most recent versions, so if a file is backed up 12 times a day, 
+the oldest recoverable version will be only about 2 days old. Previous versions of a 
+file that were saved more than 28 days ago will be deleted from the system. 

 In other words, it is possible to recover any previous version of a file if
 - that version has been backed up (i.e. stayed inactive for at least 2h after having been modified and saved)
@@ -23,37 +32,140 @@ If a file is deleted from the disk, the last backed-up version will be kept on t
 Users have to ask the UDI GIGA-MED IT specialist to recover their last backed-up file, or, when possible, one of the previous versions.

 **IMPORTANT**  
-As the tape system can only handle a limited amount of data per hour, any action impacting 1 Tb of data or more, for example copying a large dataset or downloading large files, 
-must be reported **before** being performed to the UDI GIGA-MED IT department so that these large changes do not affect the proper functioning of the operations. This rule also applied 
-when large number of small files are created or modified (e.g. several thousands of files of less than 1 Mb).
-
-# nobackup folders
-
-For files that do not require to be backed up, such as temporary files, a specific folder can be created. This folder must be called "nobackup". It must be written exactly like this, without space and in lowercase. Otherwise it will still be backed up. 
-
-# Data archiving
+As the tape system can only handle a limited amount of data per hour, any action 
+impacting 1 TB of data or more, for example copying a large dataset or downloading 
+large files, must be reported to the UDI GIGA-MED IT department **before** being 
+performed so that these large changes do not affect the proper functioning of 
+the operations. This rule also applies when a large number of small files are created 
+or modified (e.g. several thousand files of less than 1 MB each).
+
+### nobackup folders
+
+For files that do not need to be backed up, such as temporary files, a specific 
+folder can be created. This folder must be called "nobackup". It must be written 
+exactly like this, without spaces and in lowercase. Otherwise it will still be backed up. 
+
+# Offline data archiving
+
+**Why is it useful to send data to the offline archiving system?**  
+Since the disk space on the mass storage is finite and expensive and the 
+amount of data we are producing grows exponentially, users are encouraged to send files
+they need to keep but don't need to access anymore to our offline archiving 
+system in order to release some space on disk.   
+
+**How does it work?**  
+As explained in the backup section of this page, each file present on disk is
+also copied to 2 tape libraries. Once a file is sent to the offline archiving system, 
+the copy on disk is truncated, but there are still 2 copies on the 2 tape libraries. 
+Archived files are not directly accessible to users, but can be restored on disk if needed.
+Note that an archive can be restored only if there is enough space on disk to store it.
+
+**Where will the archived data be stored?**  
+Currently, both tape libraries are located in the same robot at the SEGI. However, 
+in February 2021, we'll add a new robot at the CHU datacenter, and one of the tape 
+libraries will be moved there, so that the 2 copies will be in different locations. 
+
+**Which type of files can be sent to the offline archiving system?**  
+Tape libraries are designed for long-term storage of data. They are more stable 
+and a lot cheaper than disk space. However, they are also slower in terms of writing/reading 
+capacity, so retrieving archived data can take up to a week. 
+Therefore, files sent to the offline archiving system should be files that the user 
+needs to keep (for legal reasons for example, or data that have been fully analyzed but not 
+published yet, in case a reviewer asks to redo part of the analysis with 
+new software or different options) but no longer needs to use/access quickly.
+
+
+**Is there a size limit for sending data to archive?**
+   - There is no upper limit to the size of an archive. But if you want to send several terabytes of data,
+please organize the main folder into subfolders containing data that are likely to be retrieved 
+together and keep a record of the tree structure, so that we don't need to retrieve
+the whole archive if you need only some of the files.  
+   - Given that the process to archive and restore data is quite laborious, the minimum 
+size you can send to archive is 500 GB. If your experiments typically generate less 
+than 500 GB of data, you can wait until you have several experiments (possibly in separate subfolders) 
+before archiving them. 
+
+**NB:** In some circumstances, some files may be sent offline even if the user didn't ask for it. 
+See the [automatic archiving](mass-storage/mass-storage-backup-archive#automatic-archiving) section for more information.

-Since the space on the mass storage is finite and expensive and that we are producing more and more data, a user can now send files to an archiving (tape) library to release some space on the mass storage. Moreover, as explained in the [quota](mass-storage/mass-storage-quota) and [billing](mass-storage/mass-storage-billing) pages, a part of the space of the mass storage will be charged. Moving data to the archiving (tape) library could help reduce the bill. The files sent to this archiving (tape) library should be files that the user needs to keep (for legal reasons for example) but do not need to use/access quickly anymore.

-Currently, data are present in several copies (one on the mass storage located at the GIGA and two copies on tape at the SEGI). In January 2021, there will be an addition of a new tape library at the CHU datacenter. Two new copies of the data will thus be created at this location.
- In total, the data will be present in two copies at each locations (SEGI and CHU datacenter).
+## On demand archiving

-NB: Tape libraries are for long term storage of data (more stable but slower in term of writing/reading capacity).
+### Procedure to send data for archiving
+
+Files/folders should be properly organized before being sent for archiving.  
+Don't hesitate to contact the [Bioinformatic team](mass-storage/mass-storage-contacts) 
+if you need help with any of these steps.  
+
+The procedure to send files for archiving is:
+1. If not already done, ask the UDIMED/UDIGIGA (https://sam.med.uliege.be/) to create an "ARCHIVES" folder in 
+your team folder on the mass storage. This needs to be done only the very first time.
+2. Determine which files/folders you want to send for archiving and organize them so that
+    - All related files that are likely to be retrieved together (if a retrieval is ever needed) are in the same (sub)folder with a meaningful name
+    - Big files (typically 200 MB and more) are compressed as much as possible as explained [here](mass-storage/mass-storage-compression)
+    - Numerous small files (typically thousands of files smaller than 4 MB) are grouped into archive files as explained [here](mass-storage/mass-storage-compression)
+3. Create a subfolder in the "ARCHIVE" folder with a meaningful name (for example the name of the project, the date and any specific information).   
+**WARNING: don't use any spaces or special characters in the folder names!**
+4. Move the data you want to send for archiving (organized as described above) into that folder. 
+5. Wait until you have at least 500 GB of data to send for archiving 
+(possibly grouping separate projects in separate subfolders). 
+6. Keep a record of what you have sent for archiving, for example in a text file explaining what's in each folder. 
+You can also make a list of the files using the Linux `tree` command (with the help of the bioinformatic platforms if needed).
+Note that we strongly recommend having both a tree listing and a file describing what's in the archive, 
+as a list of file names might not be enough for you to know exactly what is in each folder.
+7. Contact the UDIMED/UDIGIGA by filling in a form at https://sam.med.uliege.be/  
+    The form must contain the following pieces of information:
+    - path to the data to archive (or the name of the team and the name of the folder to archive)
+    - the number of years that the data must be kept (default is 5 years)
+    - name of the PI in charge
+8. The UDIMED/UDIGIGA will send your data to the archiving system. 
+Once that's done, you will no longer be able to enter the archived folder or see your data.
+
+
+**IMPORTANT CONSIDERATIONS**
+- Moving data to the "ARCHIVE" folder is not enough for them to be truncated. 
+You have to ask the UDIMED/UDIGIGA to archive them. 
+Data in the "ARCHIVE" folder that haven't been truncated still take up space on disk and are therefore 
+still taken into account for the billing of your disk usage.
+- Each truncated file occupies 4 kB of space on disk. So, if you archive 1 million individual files, 
+the remaining volume on disk will be about 4 GB. However, if you have hundreds of millions of small files, 
+you'll save more space by grouping them into one file as explained [here](mass-storage/mass-storage-compression)
+
+### Procedure to retrieve data from archiving
+
+The procedure to retrieve archived files/folders is:
+
+1. Contact the UDIMED/UDIGIGA by filling in a form at https://sam.med.uliege.be/  
+   The form must contain the following pieces of information:
+   - the path to the folder you want to retrieve (or the name of the team and the name of the folder)
+   - if you don't want to retrieve the whole folder but only some part of it, don't forget to mention which subfolder
+   - the name of the PI in charge
+2. The UDIMED/UDIGIGA will check there is enough space left on disk to store your data. 
+If this is the case, they will retrieve your data and give you access to them. 
+
+
+**Note**: This operation may take several days. This means that you need to **anticipate** the need for those data. 

-**IMPORTANT**: Due the finite space on the mass storage, some files may be sent to archiving even if the user does not ask for it. See the [automatic archiving](mass-storage/mass-storage-backup-archive#automatic-archiving).


+## Automatic archiving

-## On demand archiving
+Once the storage reaches 80% of its maximum capacity, the oldest data will 
+automatically be truncated from the disk in order to save space. Once that happens, 
+there will still be 2 copies of the file on tape but only the beginning of the file 
+and its metadata will stay on disk. Therefore, the file name will still be visible 
+in the tree view but the file itself will be on tape. 

-The final system is not yet implemented (it will be available during August). However, a page describing how to properly organize your data is already available [here](mass-storage/mass-storage-archive-creation).
+This procedure will affect only files of more than 4 MB and will start with files 
+that haven’t been modified or opened for at least 270 days. If that’s not 
+enough, “younger” files may be affected too.

-## Automatic archiving
+It means that opening data that haven’t been used for a long time will become a 
+slow process, as these data will first need to migrate from tape to disk before 
+being accessible again. The time required will depend on the ongoing backing up 
+of other files. In optimal conditions, it shouldn't take more than 1h to access a 1 TB file.

-Once the server will reach 80% of its maximum storage capacity, oldest data will automatically be truncated from the disk in order to save space. Once that happens, there will still be 2 copies of the file on tape but only the beginning of the file and its metadata will stay on disk. Therefore, the file name will still be visible in the tree view but the file itself will be on tape. 

-This procedure will affect only files of more than 4Mb and will start with files that haven’t been modified, read or open for at least 270 days. If that’s not enough, “younger” files may be affected too.

-It means that opening data that haven’t been used for a long time will become a slow process, as these data will first need to migrate from tape to drive before to be accessible again. The time required will depend of the ongoing backing up of other files. In optimal conditions, it shouldn't take more than 1h to access a 1 TB file.

 # [Contacts](mass-storage/mass-storage-contacts)
\ No newline at end of file