HPC-Rocket
Scientific research often relies on complex software to analyze data and perform simulations. Such software may require specific hardware and computational resources to run efficiently, making it impractical to execute on regular workstations. High-performance computing (HPC) clusters provide the necessary computational power to run such simulations, but users of these clusters may not have the permissions required to install the software needed to integrate the cluster with a continuous integration (CI) service.
To address this challenge, we developed HPC-Rocket, a command-line application written in Python. HPC-Rocket aims to bridge the gap between CI services and HPC clusters, allowing users to execute large-scale simulations on remote clusters without the need for extensive permissions. Because it is a simple command-line application, HPC-Rocket can be used with any continuous integration service, which makes it more portable than solutions that integrate directly with the CI platform (e.g. Jacamar CI).
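In practice, this means a pipeline only needs to install a Python package and run a single command. The following minimal sketch shows such an invocation; the configuration file name rocket.yml is a placeholder for illustration (the GitLab CI job below uses rocket-mpich-bind.yml):
# Install HPC-Rocket from the Python Package Index
pip install hpc-rocket==0.4.0
# Copy the configured files to the cluster, submit the job script via Slurm,
# and collect the results once the job has finished
hpc-rocket launch rocket.yml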
Using HPC-Rocket, users can connect to a remote cluster via SSH and copy the files necessary for the software execution. HPC-Rocket then submits a specified job script to the Slurm scheduling system, which is widely used on HPC clusters. The results produced on the cluster can afterwards be copied back to the machine executing the CI pipeline for further inspection. The following figure shows an activity diagram of the HPC-Rocket workflow.
HPC-Rocket is configured in the YAML file format that is also commonly used to define CI pipelines. The configuration file contains the address of the target machine and the credentials in the form of an SSH key or password. Moreover, HPC-Rocket supports environment variables, which allows easy integration with the secret stores provided by the respective CI services. If the remote cluster is only reachable from a specific network, HPC-Rocket can also tunnel its SSH commands through multiple proxy jumps. Additional sections describe file copying, collection, and cleaning instructions; the proxy jump and cleaning sections are sketched after the listing. The final setting specifies which file is passed to the Slurm scheduling system. A configuration that works with the image produced by the GitLab CI job in section Singularity is presented in the following listing:
host: $REMOTE_HOST
user: $REMOTE_USER
password: $REMOTE_PASSWORD
copy:
  - from: laplace-mpich-bind.job
    to: laplace2d-mpich-bind/laplace.job
  - from: rockylinux9-mpich-bind.sif
    to: laplace2d-mpich-bind/rockylinux9-mpich-bind.sif
collect:
  - from: laplace2d-mpich-bind/results/*
    to: results
  - from: laplace2d-mpich-bind/laplace.out
    to: results
sbatch: laplace2d-mpich-bind/laplace.job
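The listing above uses password authentication and assumes the cluster is reachable directly from the CI runner. The proxy jump and cleaning features mentioned earlier are configured in additional sections. The following fragment sketches their general shape; the key names (private_keyfile, proxyjumps, clean) and all values are assumptions for illustration and should be checked against the HPC-Rocket documentation:
# SSH key authentication instead of a password (key name assumed)
private_keyfile: ~/.ssh/id_ed25519
# Tunnel SSH commands through one or more gateway hosts (section name assumed)
proxyjumps:
  - host: $PROXY_HOST
    user: $PROXY_USER
    private_keyfile: ~/.ssh/id_ed25519
# Remove files from the cluster after the results have been collected (section name assumed)
clean:
  - laplace2d-mpich-bind/*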
GitLab CI job
To run the HPC-Rocket configuration provided in the previous section, we define a new CI job using a Docker image with Python installed. Since the job needs the previously built Singularity image, it depends on the CI job defined in section Singularity, as declared in its needs section. In the before_script section, HPC-Rocket is installed with Python’s package manager pip. The script section then runs HPC-Rocket with the given configuration. Finally, the results produced by the simulation are uploaded as an artifact to be verified by Fieldcompare, as described in the next section.
run-hpc-cluster-mpich-bind:
  image: python:3.10
  stage: simulation
  needs: ["build-singularity-mpich-bind"]
  before_script:
    - pip install hpc-rocket==0.4.0
  script:
    - hpc-rocket launch rocket-mpich-bind.yml
  artifacts:
    paths: ["results/"]