---
title: Configuring a standard experiment for NIC5
---

Thanks to GitLab CI/CD features and to [Jacamar CI](https://ecp-ci.gitlab.io/docs/admin/jacamar/introduction.html), you can configure a standard experiment for BAMHBI to run on a cluster, which you will then be able to kickstart from ULiège GitLab with a single click. For the time being, all CI/CD experiments of the MAST will run on the NIC5 cluster, notably to benefit from the many MAST resources already stored there.

This tutorial details the requirements you should meet beforehand, regarding both security and storage, and then how to write a suitable CI/CD configuration to create your experiment.

The [CÉCI (Consortium des Équipements de Calcul Intensif)](https://www.ceci-hpc.be/) expects us to make reasonable use of their cluster(s) through GitLab. That is, the experiments we run through CI/CD should not compromise their security, nor abuse their computing resources (just as with your regular jobs). This is why pre-configured experiments written as CI/CD jobs on GitLab will only be possible if the following conditions are met.

In what follows, the GitLab repository hosting the BAMHBI code you want to use is called the **target repository**.

1) The **target repository must be private**, i.e., restricted to selected users, such as MAST members.

**The main advantage of [Jacamar CI](https://ecp-ci.gitlab.io/docs/admin/jacamar/introduction.html) is being able to write your CI/CD jobs as if you were already logged in on the cluster and inside a job.** This means that you do not have to worry about any file transfer between GitLab and the cluster. In particular, your cluster job will in practice start by cloning the target repository locally, which means that your repository files are reachable as soon as the script of your CI/CD job begins.

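To illustrate, here is a minimal, hypothetical job sketch (the job name is a placeholder, and the ``id_tokens`` and scheduler-related keywords introduced below are omitted for brevity): the repository content is directly visible from the very first instruction of the script.

```
check_clone:
  tags:
    - nic5
    - compute
    - slurm
  script:
    # The target repository has already been cloned into the working directory,
    # so its files can be listed or copied right away.
    - ls test_cases
```
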
Before reading this tutorial, it's recommended (though not essential) to review [GitLab tutorials for CI/CD](https://docs.gitlab.com/ci/quick_start/) to get familiar with the syntax of a CI/CD configuration.

### Basis of a NIC5 job

```
...
```

5) Finally, you can **start writing the instructions of your job with the ``script:`` keyword**. Technically, you could write all your instructions there, as shown in the next example, where the sample instructions just prepare some directories.

```
my_nic5_job:
  stage: some-stage
  id_tokens:
    SITE_ID_TOKEN:
      aud: https://gitlab.uliege.be/especes/mast/nemo4.2.0-bamhbi
  tags:
    - nic5
    - compute
    - slurm
  variables:
    SCHEDULER_PARAMETERS: "--ntasks=114 --cpus-per-task=1 --mem-per-cpu=1024 --time=2:00:00"
  script:
    - rm -rf ${GLOBALSCRATCH}/bamhbi_cicd/LR
    - mkdir -p ${GLOBALSCRATCH}/bamhbi_cicd/LR
    - export CICD_HOME=${GLOBALSCRATCH}/bamhbi_cicd/LR
    - export ZENODO_MIRROR=/scratch/ulg/mast/mast/zenodo
    ...
```

6) However, **it's strongly recommended to write your job instructions in a _nested_ Shell script (``.sh``), i.e., one that will be called by your YAML script**. On the one hand, such a script will be pretty **similar to your typical NIC5 submission script**, with a few technical differences (see [Technical considerations](cicd-with-nic5#technical-considerations)). On the other hand, there are multiple good reasons for proceeding this way.

* The first reason is that YAML syntax is not suitable for writing complex sets of instructions, which may include loops, conditions, etc.

* The second reason is that specific characters, such as `:`, which you may use to start running NEMO (with BAMHBI) with MPI, are ambiguous with respect to YAML syntax.

* Finally, and perhaps most importantly for your colleagues, writing your job instructions in a nested script allows you to comment them to explain what you are doing. This is especially important if you are using a unique configuration.

Once you have written your nested script, store it on the GitLab repository where you are configuring your CI/CD job; then, at the start of your job, copy it into the current working directory (`.`), make it executable, and run it. This should look like the following example, where ``test_cases/lr_cluster/scripts`` is a path in the GitLab repository.

```
my_nic5_job:
  stage: some-stage
  id_tokens:
    SITE_ID_TOKEN:
      aud: https://gitlab.uliege.be/especes/mast/nemo4.2.0-bamhbi
  tags:
    - nic5
    - compute
    - slurm
  variables:
    SCHEDULER_PARAMETERS: "--ntasks=114 --cpus-per-task=1 --mem-per-cpu=1024 --time=2:00:00"
  script:
    - cp test_cases/lr_cluster/scripts/lr_nic5_run.sh .
    - chmod +x ./lr_nic5_run.sh
    - ./lr_nic5_run.sh
```

### Recommended additional YAML keywords

By the end of the previous section, the job is normally already good to run on NIC5. However, it's **strongly recommended** to add a few additional keywords to have better control over your job.

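For instance, the ``retry`` and ``artifacts`` keywords (both used in the complete example at the end of this page) let you automatically re-run a job that failed with a given exit code and send selected output files back to GitLab as artifacts. A minimal sketch of how they attach to a job (the surrounding keywords are the same as in the previous examples):

```
my_nic5_job:
  # ... same stage, id_tokens, tags, variables and script as in the previous examples ...
  retry:
    max: 2            # re-run the job up to two times...
    exit_codes: 1     # ...but only if it failed with exit code 1
  artifacts:
    paths:
      - ocean_output.tar.gz          # files sent back to GitLab at the end of the job
      - NIC5_CICD_outputs.tar.gz
    when: always                     # collect them even when the job fails
    expire_in: 1 week                # keep them on GitLab for one week
```
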
### Technical considerations

The script you will call within the YAML job can be virtually identical to what you would write as a script to submit on a cluster by yourself. However, there are a few technical considerations to take into account due to Jacamar CI, on top of paying attention to where you get your forcings from and where you will write your outputs (e.g., on the cluster directly or as job artifacts). Here is some advice.

* **Always start your job with ``module purge``.** This simple instruction mitigates the risk of messages being written to the standard error output when modules are loaded (e.g., when loading another ``releases/`` module), especially if you have ``module load`` instructions in your ``.bashrc`` file. Such messages, though normally harmless, may be wrongly interpreted by Jacamar CI as errors that stop the job entirely.

    * You may include ``module purge`` at other stages of your script prior to loading a different ``releases/`` module.

* **Create and use directories on your ``${GLOBALSCRATCH}`` where you will run your CI/CD experiment.** It's not recommended to use ``/scratch/ulg/mast/mast`` on NIC5: since it's a shared location, conflicts may arise when several people run NIC5 CI/CD pipelines at the same time, which may break them.

    * Before creating the directory for your current job, delete any previous instance of it (e.g., with ``rm -rf my_dir``) to prevent a job failure due to a pre-existing directory. This is important if you have to interrupt a CI/CD job and re-start it, because pre-existing files may break the re-started job.

* **Consider writing the more complex instructions in Bash scripts you will call from your CI/CD job** (i.e., nested scripts). In addition to not being suitable for long sequences of instructions, the YAML syntax (i.e., the syntax of the CI/CD configuration) has some limits. For instance, characters such as `|` or `:` have their own meaning in YAML, so you may struggle to run your typical ``mpirun`` command (among others). A simple workaround is to write such commands in a short script (which may also contain your ``module load`` instructions) called from your YAML script. [You can find an example of such a nested script here](https://gitlab.uliege.be/especes/mast/nemo4.2.0-bamhbi/-/blob/main/test_cases/lr_cluster/scripts/run_nemo_with_mpi.sh), and a hypothetical sketch is given after this list.

* **Copy resources from GitLab** (repository files, artifacts from a previous job, etc.) **into your ``${GLOBALSCRATCH}`` directory** (see above) **at the start of your script**. Keep in mind that, at the start of your NIC5 job, the GitLab repository has been cloned into the working directory where you start, so you can easily copy files from the repository into your ``${GLOBALSCRATCH}`` directory by using paths relative to the root of the GitLab repository. Likewise, artifacts are always located in the working directory where you start. You can find sample instructions below which apply this approach as well as the previous advice.

```
module purge # To prevent issues with individual .bashrc files on NIC5

# Prepares the relevant directories in user's scratch
rm -rf ${GLOBALSCRATCH}/bamhbi_cicd/LR
mkdir -p ${GLOBALSCRATCH}/bamhbi_cicd/LR
export CICD_HOME=${GLOBALSCRATCH}/bamhbi_cicd/LR
export ZENODO_MIRROR=/scratch/ulg/mast/mast/zenodo

# Gets resources from GitLab
mv LR-runnable.tar.gz ${CICD_HOME}                   # Artifact from previous job
cp -r test_cases/lr_cluster/exp_files ${CICD_HOME}   # Files from the repository
```

* **Before calling ``mpirun``, use ``ulimit -l unlimited``.** The exact reason why this instruction is needed is unclear at the moment, as it's not needed when you submit a script by yourself on NIC5. This may be fixed by the CÉCI later.

* **Do not hesitate to store small side files (e.g., namelist files, nested scripts) on the cluster or, even better, on the repository.** In particular, with the latter possibility, you let other users know how you have configured your experiment, which will help them reproduce your setup or results. [An example of this practice can be found on the Nemo4.2.0-Bamhbi repository.](https://gitlab.uliege.be/especes/mast/nemo4.2.0-bamhbi/-/tree/main/test_cases/lr_cluster)

* **Store large files (e.g., atmospheric forcings) in specific directories on the cluster.** GitLab repositories are not suitable for storing large files. [Zenodo.org](https://zenodo.org/) is a more suitable option, though the limited network capabilities (small bandwidth, no DNS look-up) of NIC5 compute nodes prevent downloading archives from zenodo.org. This is why it's recommended, for now, to store forcings at a known location, then copy (or symlink) them to where your job runs.

    * A _mirror_ of Zenodo archives has been set up on NIC5 at ``/scratch/ulg/mast/mast/zenodo`` to make up for the inability to download straight from [Zenodo.org](https://zenodo.org). You can find [an example use case on the Nemo4.2.0-Bamhbi repository](https://gitlab.uliege.be/especes/mast/nemo4.2.0-bamhbi/-/tree/main/test_cases/lr_cluster).

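Putting several of the points above together, here is a hypothetical sketch of what such a nested run script could look like. The module name, archive name, executable names and MPI rank counts are placeholders to adapt to your configuration; only the general structure (``module purge``, copying from the Zenodo mirror, ``ulimit`` before ``mpirun``) reflects the advice above.

```
#!/bin/bash
# Hypothetical nested run script called from the YAML job (all names below are placeholders).

module purge                  # avoid interference from modules loaded in your .bashrc
module load releases/2021b    # hypothetical release; load the toolchain used to compile the model

cd ${CICD_HOME}

# Copy a (hypothetical) forcing archive from the Zenodo mirror available on NIC5
cp ${ZENODO_MIRROR}/some_forcings.tar.gz .
tar -xzf some_forcings.tar.gz

# Required before mpirun in Jacamar CI jobs on NIC5
ulimit -l unlimited

# Launch the model with MPI; the ':' syntax (here splitting ranks between NEMO and
# a hypothetical XIOS server) is one reason to keep this command outside the YAML file
mpirun -np 110 ./nemo : -np 4 ./xios_server.exe
```
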
### More elaborate pipelines

As with regular GitLab CI/CD jobs, **you can split your standard experiment into several stages**. You can for instance run a first job that will compile the model, then perform the actual simulation as a subsequent job. For this purpose, you should **use the ``needs`` keyword**, which instructs GitLab to only run a job once a former job (or multiple jobs) has successfully completed, cf. the example below (from the [Nemo4.2.0-Bamhbi CI/CD configuration](https://gitlab.uliege.be/especes/mast/nemo4.2.0-bamhbi/-/blob/main/.gitlab-ci.yml)), which also gives a complete example of a YAML configuration for a NIC5 job.

```
...

lr_nic5_run:
  stage: run
  id_tokens:
    ...
  needs:
    - lr_nic5_compile
  script:
    - cp test_cases/lr_cluster/scripts/lr_nic5_run.sh .
    - chmod +x ./lr_nic5_run.sh
    - ./lr_nic5_run.sh
  retry:
    max: 2
    exit_codes: 1
  artifacts:
    paths:
      - ocean_output.tar.gz
      - NIC5_CICD_outputs.tar.gz
    when: always
    expire_in: 1 week
```

Finally, it's worth noting that you can absolutely mix NIC5 jobs with jobs running on GitLab. Again, you can find an example in the [Nemo4.2.0-Bamhbi CI/CD configuration](https://gitlab.uliege.be/especes/mast/nemo4.2.0-bamhbi/-/blob/main/.gitlab-ci.yml), where the outputs from the NIC5 job running the simulation are sent to a containerized job running Python code to produce various figures to assess the model. The obvious constraint of this approach is that the inputs of the GitLab job must be entirely contained in the artifacts of the NIC5 job, as the GitLab job will not have access to the NIC5 environment.

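To give an idea of what this could look like, here is a hypothetical sketch of such a containerized GitLab job. The stage name, Docker image, package list and figure-producing script are placeholders (the actual job in the Nemo4.2.0-Bamhbi configuration may differ); the important parts are the ``image`` keyword, which makes the job run in a container on GitLab instead of NIC5, and the ``needs`` keyword, which makes the artifacts of the NIC5 job available in the working directory.

```
assess_results:
  stage: assess                  # hypothetical stage name
  image: python:3.11             # runs in a container on a GitLab runner, not on NIC5
  needs:
    - lr_nic5_run                # downloads this job's artifacts (e.g., NIC5_CICD_outputs.tar.gz)
  script:
    - tar -xzf NIC5_CICD_outputs.tar.gz
    - pip install matplotlib netCDF4                        # hypothetical dependencies
    - python test_cases/lr_cluster/scripts/make_figures.py  # hypothetical script from the repository
```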