**Table of contents**

[[_TOC_]]

# Data organisation

Files should be named and organised in a way that indicates their content and their relationship to other files. A file name must describe, at a glance, what the document is about, so that files can be browsed effectively and efficiently.

The directory tree must be clear, with explicit (meaningful) and unique folder names. Numbers can help organise the tree, but they should be padded with leading zeros so that files and folders are listed in numerical order (examples: 01-Folder_x, 02-Folder_y).
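
The effect of leading zeros on listing order can be illustrated with a minimal Python sketch. The folder names are hypothetical, and plain string sorting is used here as a stand-in for how many tools list entries; individual file browsers may behave differently.

```python
# Hypothetical folder names, numbered without and with leading zeros.
unpadded = [f"{i}-Folder" for i in (1, 2, 10, 11)]
padded = [f"{i:02d}-Folder" for i in (1, 2, 10, 11)]

# Plain string sorting compares character by character,
# so "10" and "11" are listed before "2" when numbers are not padded.
print(sorted(unpadded))  # ['1-Folder', '10-Folder', '11-Folder', '2-Folder']

# With two-digit numbers, string order and numerical order agree.
print(sorted(padded))    # ['01-Folder', '02-Folder', '10-Folder', '11-Folder']
```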

Folders holding precious data needed for publication and folders holding temporary files should be clearly distinguished, to facilitate data management and avoid losing important files.

All data must be properly organised and annotated so that they are accessible to current and future members working on the corresponding project.

For this purpose, it is useful to provide in each project or experiment folder an additional file describing the organisation and/or content of the data files. For these metadata, it is recommended to use text-only files rather than binary files such as Word, Excel or PDF. For example, if you use `.csv`/`.tsv` files (comma/tab-separated values) for tables instead of an Excel file, you can still edit them with Excel as usual and bioinformatics tools can read them too, whereas Excel files are readable only by a few programs such as Microsoft Office and OpenOffice. For free text, you can use `.txt` or `.md` (markdown) files, and `.json` files for structured data and key/value pairs.
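
As an illustration, the sketch below writes a small sample table as a `.tsv` file and a project description as a `.json` file, then reads the table back. All file names, column names and values (`samples.tsv`, `README.json`, `sample_id`, and so on) are hypothetical, chosen only for the example; it uses the Python standard library and a temporary folder.

```python
import csv
import json
import tempfile
from pathlib import Path


def write_metadata(folder: Path) -> None:
    """Write a sample table (TSV) and a key/value description (JSON)."""
    samples = [
        {"sample_id": "S01", "condition": "control", "file": "S01_reads.fastq"},
        {"sample_id": "S02", "condition": "treated", "file": "S02_reads.fastq"},
    ]
    with open(folder / "samples.tsv", "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(
            handle, fieldnames=["sample_id", "condition", "file"], delimiter="\t"
        )
        writer.writeheader()
        writer.writerows(samples)

    # Hypothetical project description stored as key/value pairs.
    description = {"project": "example_project", "created": "20190625"}
    (folder / "README.json").write_text(
        json.dumps(description, indent=2), encoding="utf-8"
    )


def read_samples(folder: Path) -> list:
    """Read the TSV table back as a list of dictionaries (one per row)."""
    with open(folder / "samples.tsv", newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    write_metadata(folder)
    rows = read_samples(folder)
    description = json.loads((folder / "README.json").read_text(encoding="utf-8"))
    print(rows[0]["sample_id"], description["project"])
```

Both files remain readable in any text editor, which is the point of preferring them over binary formats.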

# File naming

Because the mass storage is a Linux-based infrastructure, users should use Linux-friendly names for files and directories. This means avoiding non-English characters such as accented letters and symbols, as well as spaces and tabs. This matters because some analysis tools do not accept non-English characters, and because it facilitates data management by users and system administrators. Files and folders whose names do not follow these rules may cause problems in system maintenance and backup procedures, leaving the file without a backed-up version and causing data loss if the main drive fails.

Note that all file names are case sensitive, so test.txt, Test.txt and TEST.txt are three different files.

NB: This also means that it is dangerous to rely on alphabetical order, because case ('a' vs 'A') may not be handled in the same way across systems and software. Files created on Windows may therefore not be listed in the same order on Linux.
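
The following minimal sketch shows how the same (hypothetical) file names come out in a different order depending on whether case is taken into account. Which of the two orders a given tool or file browser uses depends on its own settings, so this is only an illustration of the risk.

```python
names = ["b.txt", "A.txt", "a.txt", "B.txt"]

# Sorting by character code puts all uppercase letters before lowercase ones.
by_code_point = sorted(names)
print(by_code_point)  # ['A.txt', 'B.txt', 'a.txt', 'b.txt']

# Sorting while ignoring case interleaves them instead.
ignoring_case = sorted(names, key=str.casefold)
print(ignoring_case)  # ['A.txt', 'a.txt', 'b.txt', 'B.txt']
```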

## Mandatory rules

File and directory names can contain:

* digits 0 to 9
* lowercase letters a to z
* uppercase letters A to Z
* underscore and/or hyphen

They must **never** contain:

* spaces, tabs or punctuation other than hyphen (-) and underscore (_)
* Japanese, Chinese, Korean, Greek, Hebrew, Arabic or other non-English characters and ideograms
* diacritics such as accent, umlaut, tilde or cedilla (é, è, ê, ä, á, í, ñ, õ, ç, etc.)
* special characters such as `, ; . : ! ? / \ " # [ ] > < % * = & $ | ^ ( ) { }`
* system-reserved keywords such as CON, PRN, AUX, CLOCK$, NUL, COM1, COM2, COM3, COM4, COM5, COM6, COM7, COM8, COM9, LPT1, LPT2, LPT3, LPT4, LPT5, LPT6, LPT7, LPT8, and LPT9
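
These rules can be checked automatically. The sketch below is one possible checker, not an official tool: the names `is_valid_name`, `ALLOWED` and `RESERVED` are hypothetical. It makes two assumptions that the rules do not state explicitly: a single dot separating the name from its extension is tolerated (as in test.txt), and reserved keywords are compared without regard to case.

```python
import re

# Reserved keywords quoted in the rules, compared case-insensitively (assumption).
RESERVED = (
    {"CON", "PRN", "AUX", "CLOCK$", "NUL"}
    | {f"COM{i}" for i in range(1, 10)}
    | {f"LPT{i}" for i in range(1, 10)}
)

# Only digits, unaccented letters, underscore and hyphen are allowed.
ALLOWED = re.compile(r"[A-Za-z0-9_-]+")


def is_valid_name(name: str) -> bool:
    """Return True if a file or directory name follows the mandatory rules."""
    stem, dot, extension = name.rpartition(".")
    if not dot:
        # No dot at all: the whole name is the stem (typical for directories).
        stem, extension = name, ""
    elif not ALLOWED.fullmatch(extension):
        # Empty extension (trailing dot) or forbidden characters in it.
        return False
    if stem.upper() in RESERVED:
        return False
    # A second dot, a space or an accent left in the stem fails this match.
    return ALLOWED.fullmatch(stem) is not None


for candidate in ["01-Folder_x", "test.txt", "my file.txt", "résumé.txt", "NUL.txt"]:
    print(candidate, is_valid_name(candidate))
```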

## Recommendations

* Keep file names short, meaningful and easily understandable to others. Limit them to 25 characters if possible.
* Do not use "empty words" such as le, la, les, un, une, des, et, ou, the, a, an, and, or.
* Dates should always follow the ISO 8601 basic format YYYYMMDD (e.g. 20190625). Start the file name with the date if it is important to store or sort files in chronological order.
* Avoid unnecessary repetition and redundancy in file names and paths.
* Avoid obscure abbreviations and acronyms. Use agreed GIGA or common scientific abbreviations where relevant.
* For numbers 0-9, always use at least two digits to ensure correct numerical order (e.g. 01, 02, 03, etc.).
* The version number of a record should be indicated in its file name by 'V' followed by the version number (e.g. V01, V02, etc., and VD ("definitive version") for the validated version). This status must be the last element before the file extension (.pdf, .doc or .jpg, for example). Users versioning scripts for an analysis pipeline should use [git](work-in-progress) instead.
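
A file name following the date and version recommendations could be assembled as in the sketch below. The helper `build_name`, the label `qPCR` and the use of underscores between the parts are assumptions made for the example; only the YYYYMMDD date, the `V01`/`VD` status and its position before the extension come from the recommendations.

```python
from datetime import date


def build_name(day: date, label: str, version: int, extension: str,
               final: bool = False) -> str:
    """Build '<YYYYMMDD>_<label>_<status>.<extension>'."""
    # The status is 'VD' for the validated version, otherwise 'V' plus two digits.
    status = "VD" if final else f"V{version:02d}"
    return f"{day:%Y%m%d}_{label}_{status}.{extension}"


draft = build_name(date(2019, 6, 25), "qPCR", 1, "csv")
validated = build_name(date(2019, 6, 25), "qPCR", 2, "csv", final=True)
print(draft)      # 20190625_qPCR_V01.csv
print(validated)  # 20190625_qPCR_VD.csv
```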

# Data maintenance

To minimise their storage usage, users should:

- never duplicate files, especially when working with large datasets. Proper use of the mass storage infrastructure already ensures that files are securely stored and backed up.
- use hard or soft links when they need to access files from another location.
- avoid keeping intermediate files from analysis pipelines. During the analysis, these files should ideally be generated on the computer used for the analysis, or on the scratch disk if using the GIGA HPC cluster (see [cluster-scratch](work-in-progress)), and deleted once the analysis is finished. If, for any reason, these files need to be kept, users should make them easily distinguishable by storing them in dedicated directories for intermediate results, and delete them as soon as they are no longer needed.
- use compression tools whenever possible.
- group numerous small files into zip/rar/tar/dar archives.
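
Two of these points, linking instead of copying and grouping small files into a compressed archive, can be sketched with the Python standard library. The paths and file names are hypothetical and everything happens in a temporary folder; a soft link and a gzip-compressed tar archive are used here as one option among those listed.

```python
import tarfile
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    shared = root / "shared"
    project = root / "project"
    shared.mkdir()
    project.mkdir()

    # A file that lives in another location.
    reference = shared / "reference.fa"
    reference.write_text(">seq1\nACGT\n", encoding="utf-8")

    # Access it from the project through a soft link instead of a copy.
    link = project / "reference.fa"
    link.symlink_to(reference)
    link_ok = link.is_symlink() and link.read_text(encoding="utf-8").startswith(">seq1")

    # Numerous small files, grouped into one compressed archive.
    logs = project / "logs"
    logs.mkdir()
    for i in range(1, 4):
        (logs / f"run_{i:02d}.log").write_text(f"run {i}\n", encoding="utf-8")

    archive = project / "logs.tgz"
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(logs, arcname="logs")

    # List the archive content to check it before removing the originals.
    with tarfile.open(archive, "r:gz") as tar:
        members = sorted(tar.getnames())
    print(link_ok, members)
```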

Moreover, all users should go through their folders on a regular basis, at least quarterly, and remove what is no longer needed.

During these sorting and cleaning sessions, users should delete files that have become useless ("ROT": Redundant, Obsolete, Trivial) such as, among others:

- temporary files (for example, intermediate versions once the latest version has been validated)
- intermediate files (see above)
- files that can be generated easily from existing files
- bulky raw data that are not being used
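
To prepare such a cleaning session, a script can list files that have not been modified for some time. The sketch below is only a starting point under stated assumptions: the helper `stale_files` and the 90-day threshold are hypothetical, and the modification time is just a rough proxy for "no longer needed". It reports candidates for review and never deletes anything.

```python
import os
import tempfile
import time
from pathlib import Path
from typing import List, Optional


def stale_files(root: Path, max_age_days: float = 90,
                now: Optional[float] = None) -> List[Path]:
    """Return files under root last modified more than max_age_days ago."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # 86400 seconds per day
    return sorted(
        path for path in root.rglob("*")
        if path.is_file() and path.stat().st_mtime < cutoff
    )


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    old = root / "old_results.tsv"
    new = root / "new_results.tsv"
    old.write_text("x\n", encoding="utf-8")
    new.write_text("y\n", encoding="utf-8")

    # Pretend the first file was last modified 200 days ago.
    long_ago = time.time() - 200 * 86400
    os.utime(old, (long_ago, long_ago))

    found = [path.name for path in stale_files(root)]
    print(found)  # ['old_results.tsv']
```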

It is also recommended to sort and store in an "Archive" folder those documents that are no longer in use but must nevertheless be kept for a certain period of time. This folder should be structured in the same way as the current file tree and should be sorted and cleaned at least twice a year.

# User leaving the GIGA

Any user leaving the GIGA must transfer their home directory (after sorting and cleaning it) to the research group common area. All files must be directly identifiable without having to open them. The PI must be kept informed of the files present and of their importance. After the user's departure, the user's home directory will be deleted and all files not transferred to the group common area will be lost.