Running Spam Analyzer on a High-Performance Computer System


  • How do we run computational jobs on a modern HPC system?

  • Explain the standard way of submitting and running jobs on an HPC system.

  • Use SLURM parameters to specify resource requirements for a job.

Why a Job Scheduler?

A job scheduler in computing can be likened to a restaurant’s host managing customers. In this analogy, jobs are the customers, each with varying durations, sizes, and resource requirements. The goal of the job scheduler, like the host, is to ensure that all jobs are executed with the appropriate resources, providing a fair share among users while maximizing resource utilization. This ensures efficient and effective handling of multiple jobs, much like a host optimizes seating arrangements and service quality in a busy restaurant.

SLURM: A Job Scheduler on HPC

SLURM is an open-source job scheduler designed for HPC. It’s particularly popular in academic and research institutions, where users require efficient resource management for running compute-intensive tasks.

SLURM operates on a simple principle: users submit job scripts containing the necessary instructions for their computations, and SLURM takes care of scheduling, resource allocation, and job execution. Here’s a step-by-step breakdown:

  1. Creating a Job Script: Users write a job script, which is essentially a UNIX shell script with additional directives at the top. These directives are lines beginning with #SBATCH that specify parameters for SLURM.
  2. Submitting the Job Script: Once the job script is ready, users submit it to SLURM using the sbatch command.
  3. Queuing the Job: SLURM places the job in a queue, awaiting available resources.
  4. Resource Allocation: SLURM continuously monitors the cluster’s resources and identifies suitable nodes based on the job’s requirements (e.g., CPU, memory, GPU).
  5. Launching the Job: Once appropriate resources are found, SLURM reserves them and launches the job on the selected nodes.
  6. Execution: The job executes according to the instructions in the script, performing the desired computations.
  7. Input and Output Handling: Input data required for the computation is typically read from files specified in the job script, while output is directed to files as well. This ensures a structured approach to data handling and management.
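
To make the first step concrete, here is a minimal sketch of what a job script can look like. The job name hello and the five-minute time limit are illustrative choices, not part of this lesson's scripts. Because bash treats #SBATCH lines as ordinary comments, the same file also runs unchanged in a plain shell, which is handy for a quick check before submitting.

```shell
#!/bin/bash
# Illustrative job name and time limit (hours:minutes:seconds); both are assumptions.
#SBATCH --job-name hello
#SBATCH --time 0:05:00

# Everything below the directives is an ordinary shell script.
node=$(hostname)        # name of the machine running the job
result=$((6 * 7))       # stand-in for the real computation
echo "Running on $node, result is $result"
```

When submitted with sbatch, the echo output would end up in the job's output file rather than on the terminal.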

Running Spam Analyzer on HPC

To process data on the HPC system using SLURM, we first need to create a job script tailored for the task. Let’s name it month03.slurm, since it processes the data for March 1998.

In this script, we specify essential parameters for SLURM using directives.

For instance:

#!/bin/bash
#SBATCH --job-name month03
module load DeapSECURE
./ 1998/03

This script, month03.slurm, sets the job name to “month03” and loads the DeapSECURE module. It then runs the analyzer script with the argument 1998/03, which selects the March 1998 dataset to process.

Once the month03.slurm script is prepared, it can be submitted to SLURM for execution using the sbatch command:

sbatch month03.slurm

Upon submission, SLURM assigns a job ID to the job and prints it as confirmation:

Submitted batch job 70592
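
In scripts it is often useful to keep the job ID in a variable for later squeue or scancel calls. The sketch below extracts it from a confirmation line. The line is hard-coded here so that the snippet runs without a cluster, and the variable names are our own; on a real system, the output of sbatch would take its place.

```shell
# Sample confirmation line, hard-coded for illustration.
# On a cluster this could be: confirmation=$(sbatch month03.slurm)
confirmation='Submitted batch job 70592'

# The job ID is the last whitespace-separated word of the line:
# strip everything up to and including the final space.
job_id=${confirmation##* }

echo "Job ID: $job_id"
```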

Now, to track the progress of our job, we can use the squeue command with our user ID:

squeue -u USER_ID

This command provides a status update on all our submitted jobs, including their current state within the queue.

Additionally, SLURM automatically directs the output of our job to a designated file, typically named slurm-JOB_ID.out. For instance, in this case, the output would be directed to slurm-70592.out.

Lastly, if we need to cancel the job for any reason, we can use the scancel command followed by the job ID:

scancel 70592

This command effectively terminates the specified job, freeing up resources for other tasks in the queue. Overall, this workflow provides a structured and efficient approach to managing computations on HPC clusters using SLURM.


Create a second script to process April 1998.

For evaluating performance and measuring runtime, we can incorporate timing into our job script. Let’s name this script runtime_month03.slurm. Here’s how it looks:

#!/bin/bash
#SBATCH --job-name month03
module load DeapSECURE
d1=$(date +%s)        # measure begin time (in sec)
./ 1998/03
d2=$(date +%s)        # measure end time (in sec)
echo 'Total time to run is' $(($d2 - $d1)) 'seconds'

When comparing the runtime of processing different datasets, such as the March 1998 and April 1998 collections, consider these questions:

  1. Does the runtime differ when processing the March 1998 and April 1998 email collections, and if so, what could be the underlying reasons for this difference?
  2. Could the variance in runtime be attributed to the number of spam emails present in each directory?
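
One way to start on the second question is to count the files in each month's directory. The sketch below assumes that each email is stored as a separate file under directories named 1998/03 and 1998/04 relative to the current directory; adjust the paths if the data is laid out differently. The helper name count_files is our own.

```shell
# Count the regular files below a directory (hypothetical helper).
count_files() {
    find "$1" -type f | wc -l
}

# Assumed dataset locations, relative to the current directory.
for dir in 1998/03 1998/04; do
    if [ -d "$dir" ]; then
        echo "$dir: $(count_files "$dir") files"
    else
        echo "$dir: directory not found"
    fi
done
```

Comparing these counts with the measured runtimes gives a first indication of whether runtime scales with the number of emails.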

Common Job Parameters

SLURM provides a wide range of job parameters that can be specified in the job script using the #SBATCH directive. These parameters allow users to customize the behavior of their jobs and specify resource requirements. Two commonly used parameters appear in this lesson's scripts: --job-name, which sets the name of the job, and --time, which sets the time limit.
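
As a sketch of how such parameters look in practice, the header below combines several standard sbatch options. All values (one hour, one node, one core, 4 GB of memory, the output file name) are illustrative assumptions, not requirements of the spam analyzer; suitable values depend on the job and on the cluster's policies.

```shell
#!/bin/bash
# Name shown in the queue listing.
#SBATCH --job-name month03
# File for the job's output; %j is replaced by the job ID.
#SBATCH --output month03-%j.out
# Wall-clock time limit (hours:minutes:seconds).
#SBATCH --time 1:00:00
# Number of nodes, tasks, and CPU cores per task.
#SBATCH --nodes 1
#SBATCH --ntasks 1
#SBATCH --cpus-per-task 1
# Memory per node.
#SBATCH --mem 4G

# SLURM sets SLURM_JOB_ID inside a running job; fall back to a
# placeholder so the script also runs outside the scheduler.
job_label=${SLURM_JOB_ID:-not-under-slurm}
echo "Job label: $job_label"
```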

Monitoring and Managing SLURM Jobs

Once you have submitted your job to SLURM, you may want to monitor its progress and manage it as needed. SLURM provides several commands for this purpose, including squeue to check the status of queued and running jobs and scancel to cancel a job.

These commands provide you with the ability to monitor the progress of your jobs, check their resource usage, and cancel them if necessary. By effectively monitoring and managing your SLURM jobs, you can ensure efficient utilization of resources and timely completion of your computations.
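
The following sketch collects a few monitoring commands in one place. It reuses the job ID 70592 from the earlier example and checks whether the SLURM tools are installed, so it can be tried safely on a machine without a scheduler. The variable names and the selection of sacct fields are our own choices.

```shell
job_id=70592                 # job ID from the earlier example
user_name=$(id -un)          # current login name

if command -v squeue >/dev/null 2>&1; then
    squeue -u "$user_name"               # all of our jobs in the queue
    squeue -j "$job_id"                  # status of one specific job
    scontrol show job "$job_id"          # detailed job information
    # Accounting summary, available once accounting is enabled.
    sacct -j "$job_id" --format=JobID,JobName,State,Elapsed
    slurm_available=yes
else
    echo "SLURM commands not found; run this on the cluster."
    slurm_available=no
fi
```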

Processing Multiple Months’ Data

To streamline processing of the entire 1998 dataset, we can create a custom script that automates running the analyzer for each month.


Copy the following commands into the file:


./ 1998/03
./ 1998/04
./ 1998/05
./ 1998/06
./ 1998/07
./ 1998/08
./ 1998/09
./ 1998/10
./ 1998/11
./ 1998/12

Instead of running individual commands for each month, which can be tedious and time-consuming, we can consolidate all commands into a single script. This script will iterate over each available month (March through December), invoking the analyzer with the appropriate argument for the corresponding dataset.

Here’s how we can achieve this:

  1. Copy the Template Script

Start by copying the templates/year1998.template script to create a new script for the whole year. The template script should include the general structure and commands needed to run the analyzer.

cp templates/year1998.template ./spam_ip_1998.slurm

  2. Adjust the Script for SLURM Compatibility

Modify the copied script to include SLURM directives. These directives specify job parameters such as job name, resource requirements, and output settings.

nano spam_ip_1998.slurm

#!/bin/bash
#SBATCH --job-name year1998
#SBATCH --time 2:00:00

module load DeapSECURE
# Hint: We have months: March through December (03 .. 12)
d1=$(date +%s)
for MONTH in ##EDIT...
d2=$(date +%s)
echo 'Total time to run is' $(($d2 - $d1)) 'seconds'
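
For the loop in the template, one possible way to generate the zero-padded month numbers is seq -w, shown in the sketch below. The loop body only prints the dataset path; in the real script the analyzer command would go there instead. The months array is our own addition, used only to check the result.

```shell
months=()
# seq -w pads with leading zeros to equal width: 03 04 ... 12
for MONTH in $(seq -w 3 12); do
    months+=("$MONTH")
    echo "would process 1998/$MONTH"
done
echo "Number of months: ${#months[@]}"
```

Writing the list out by hand (03 04 05 and so on) works equally well and may be easier to read for only ten items.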

Once the script is prepared, we can submit it to SLURM with sbatch. SLURM will handle the scheduling and execution of the job, which processes the entire year’s data in a single run.

To estimate the time needed to compute the entire 1998 dataset, we can initially calculate a rough estimate using simple multiplication: the time taken to process a single month (for example, March 1998) multiplied by the number of months in the dataset (10 months, March through December).
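
As a worked example of this estimate, the sketch below uses a hypothetical runtime of 90 seconds for March; substitute the value measured by runtime_month03.slurm. It also compares the estimate against a two-hour time limit, which is the value used in the year-long job script in this lesson.

```shell
march_seconds=90       # hypothetical measured runtime for March 1998
n_months=10            # March through December

estimate=$((march_seconds * n_months))
limit=$((2 * 3600))    # a 2:00:00 time limit expressed in seconds

echo "Estimated total: $estimate seconds (about $((estimate / 60)) minutes)"
if [ "$estimate" -lt "$limit" ]; then
    echo "The estimate fits within the time limit."
else
    echo "The estimate exceeds the time limit; request more time."
fi
```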

After making this initial estimate, we can run the spam_ip_1998.slurm script and measure the actual runtime. If the actual runtime deviates significantly from the estimate, the estimate was off. Possible causes include variations in dataset size between months, differences in computational complexity, and system resource availability.


Try the same processing for 1999 and 2000.

