This lesson is in the early stages of development (Alpha version)

Parallel Processing 2: Using GNU `parallel`

Overview

Teaching: 20 min
Exercises: 15 min
Questions
  • How do we launch many computations on a modern HPC system?

Objectives
  • Use GNU parallel to launch many independent computations on an HPC system

About GNU parallel

In the previous episode, we learned how to launch processes in the background. Although this helps run programs in parallel, it is not the most efficient approach. Imagine that you have more than one compute node to work with. If you only run programs with &, you can use just one of the nodes, not all of them. How would you run multiple processes on multiple nodes? Several tools can help launch many processes in parallel, such as xargs and GNU parallel; of these, parallel can also distribute work across multiple nodes. In this episode we will learn how to use parallel.

What is GNU parallel

GNU parallel is a shell tool that can execute jobs in parallel using one or more nodes of a cluster computer. It was written in Perl by Ole Tange and is available on Linux and other UNIX-like systems. GNU parallel (parallel from now on) features a plethora of options to fit the execution to our needs. At the most basic level, parallel takes a set of inputs and a program to run, then manages the execution of one process of that program per input.

Setting up environment for using GNU Parallel

To use a specialized program on a cluster, you must first tell the system where to find it. Most cluster computing platforms provide a convenient way to do this, known as the module environment system. It offers a set of commands to query, load, and unload modules; to interact with it, you use the module command. In this section we need the GNU parallel program. To check for its availability, we type:

$ module avail parallel

We get this output:

---------------------------------------- /cm/shared/mls/Tools ----------------------------------------------
parallel/20161222    parallel/20190322 (D)

Where:
D:  Default Module

Use "module spider" to find all possible modules.
Use "module keyword key1 key2 ..." to search for all possible modules matching any of the "keys".

We see that there are two versions of the module available on Turing. To load the default module, we type the following:

$ module load parallel

To clear all loaded modules, try the following:

$ module purge

To list all loaded modules:

$ module list

As seen earlier, you can check if a module is available with module avail. Let’s check what versions of python are available to us:

$ module avail python
-------------------------------------- /cm/shared/mls/Languages/python/3.7 --------------------------------------
biopython/1.6    biopython/1.7 (D)    ipython/5.3    ipython/5.8    ipython/6.5    ipython/7.3    ipython/7.4 (D)

-------------------------------------------- /cm/shared/mls/Core ------------------------------------------------
python/2.7    python/3.6 (D)    python/3.7 (L)

Where:
L:  Module is loaded
D:  Default Module

Use "module spider" to find all possible modules.
Use "module keyword key1 key2 ..." to search for all possible modules matching any of the "keys".

In this workshop series, you will need different modules depending on the topic covered. For each workshop, a convenient list of modules to load will be made available to you in the workshop materials, under the Setup page.

Simple example

Let's say we want to analyze emails from February and March 1999 from our email set, in parallel. All we have to do is type:

$ parallel ./spam_analysis.py ::: /scratch-lustre/DeapSECURE/module01/spams/untroubled/1999/02 /scratch-lustre/DeapSECURE/module01/spams/untroubled/1999/03

In this listing, we call the parallel command with spam_analysis.py as the program to run. The ::: separator means that what follows is the list of inputs to run spam_analysis.py with.

Similarly, we can run a full year, say 1999:

$ parallel ./spam_analysis.py ::: $(ls -1 -d /scratch-lustre/DeapSECURE/module01/spams/untroubled/1999/[0-9][0-9])

parallel can also take its arguments from stdin. The previous listing is equivalent to:

$ ls -1 -d /scratch-lustre/DeapSECURE/module01/spams/untroubled/1999/[0-9][0-9] | parallel ./spam_analysis.py

Specifying the number of processes

To specify the number of processes to launch, use the --jobs option:

$ parallel --jobs 12 ./spam_analysis.py ::: $(ls -1 -d /scratch-lustre/DeapSECURE/module01/spams/untroubled/1999/[0-9][0-9])

The above listing will start 12 processes of spam_analysis.py and give each of them a month from 1999 to work on. Note that you need 12 processor cores available for this to run efficiently; otherwise, the available cores will be shared among the 12 processes.

Running on more than a single node

So far, we have only used a single node to run our program in parallel. As mentioned before, it is possible to use parallel to run a program on many compute nodes. To do so, we first need to request the nodes in a Slurm script. From within the script, we need to obtain the list of nodes allocated to our job. Slurm stores this list in the $SLURM_JOB_NODELIST variable in a compact form; the Slurm command scontrol can expand it to one hostname per line:

$ scontrol show hostnames $SLURM_JOB_NODELIST

This displays the list of nodes in the allocation, but we need it in a file so that we can provide it to parallel:

$ scontrol show hostnames $SLURM_JOB_NODELIST > job_node_list

Now the file job_node_list will contain the list of all nodes in the allocation, one name per line. We can use parallel's --sshloginfile option to feed this file to parallel. At this point we have all we need to start our parallel command:

$ parallel --jobs 6 --sshloginfile job_node_list ./spam_analysis.py ::: $(ls -1 -d /scratch-lustre/DeapSECURE/module01/spams/untroubled/1999/[0-9][0-9])

The above listing will launch 6 processes of spam_analysis.py on each of the nodes in job_node_list.

One last thing is needed for our command to be complete. We need to specify the working directory that parallel will use on the remote nodes. The --workdir option takes care of this for us:

$ parallel --jobs 6 --sshloginfile job_node_list --workdir $PWD ./spam_analysis.py ::: $(ls -1 -d /scratch-lustre/DeapSECURE/module01/spams/untroubled/1999/[0-9][0-9])

$PWD expands to the path of the current directory. With --workdir $PWD, the processes that parallel starts on the remote nodes run in this same directory, so the files they create appear under that path (this assumes the directory lives on a filesystem shared by all nodes, as is typical for cluster scratch space).

GNU Parallel Cheat Sheet

For a quick reference on GNU parallel please check: GNU parallel cheat sheet.

Key Points

  • GNU parallel is a useful tool to launch many independent computations simultaneously on an HPC system