This lesson is in the early stages of development (Alpha version)

My First HPC Computation: Spam Mail Analysis

Overview

Teaching: 0 min
Exercises: 15 min
Questions
  • How do we run a serial computation on a modern HPC system?

Objectives
  • Users will be able to create a job script

Motivating Background: Analyzing Email Origin

In this section, we present the background of our first computation on HPC. Our task is to estimate the origin of spam emails. Every email has a header block. Most people are familiar with the From:, Date:, To:, Cc:, and Subject: fields of an email header. However, the header block stores more information than that, and mail programs usually hide it from the average user because it is overwhelmingly technical. These hidden fields contain information that we can use to track down the origin of an email.

Here is an example of a complete email header:

Delivered-To: bruce@untroubled.org
Received: (fqmail 15388 invoked from network); 02 Jan 2018 09:06:29 -0000
Received: from mx06.futurequest.net (mx06.futurequest.net [69.5.6.177])
  by 10.170.1.183 ([10.170.1.183])
  with FQDP via TCP; 02 Jan 2018 09:06:29 -0000
Received: (qmail 4668 invoked from network); 2 Jan 2018 09:06:29 -0000
Received: from quebec.terocrif.bid (go1-longer.lgol.net [173.232.229.177])
  by mx06.futurequest.net ([69.5.6.177])
  with ESMTP via TCP; 02 Jan 2018 09:05:18 -0000
DKIM-Signature: v=1; a=rsa-sha1; c=relaxed/relaxed; s=k1; d=terocrif.bid;
 h=Mime-Version:Content-Type:Date:From:Reply-To:Subject:To:Message-ID; i=Numerologist@terocrif.bid;
 bh=f2/wUk81V4CNBaRgz/K4Mi1frMo=;
 b=SrNgEhZ3gAM4U2TipyThZh4O2aJ6VJtQUKqWF/5hDk4DkIoDqZ4phbuoqXYqHf2qrfWcReKxpcsc
   a1uJs7ZOGOPGsOn8vAnRQXRJ1UAtM0QiJ0zrJPT6fyw1wBb+NI78CYDk9Nb/4uACo+q0NZ/ESxI8
   EMwvKW08UA9TqB9rXjk=
DomainKey-Signature: a=rsa-sha1; c=nofws; q=dns; s=k1; d=terocrif.bid;
 b=BbAGr/SMsCdn9nKUPOTde4e4JqiJ1MnAVgJZc+NRbeBZ3ZMllL7fXo53mK+tBW6OtnQYOA6yD32G
   pT+rDZPC2AvFWuCcWyKr8M/nb45inyD/rZFe09QXd/I84VRfwP21srIox48XRsq3PcSgUgdKcCWC
   vJyvNSEA9tPrLdtenMo=;
Mime-Version: 1.0
Content-Type: multipart/alternative; boundary="4c82fb7de83aaebbfb586f12b7c9b955"
Date: Tue, 2 Jan 2018 04:04:02 -0500
From: "I was just shocked!" <Numerologist@terocrif.bid>
Reply-To: "I was just shocked!" <Numerologist@terocrif.bid>
Subject: Science of Numerological Analysis
To: <bruce@untroubled.org>
Message-ID: <0.0.49472.yncgxxc140zpwvzt645835.0@terocrif.bid>
Content-Length: 7066

The header stops at the first blank line (in the original email, there is indeed a blank line right after the Content-Length: field). Of greatest interest are the Received: fields. An email sent from the sender’s computer is passed from one relay server to another until it finally arrives at the receiving server, which then stores that email in the recipient’s mailbox. In an honest world, a Received: field is prepended whenever an email passes from one server to another. This is why an email typically has multiple Received: fields.

In this exercise, we limit ourselves to Received: fields that look like this:

Received: from quebec.terocrif.bid (go1-longer.lgol.net [173.232.229.177])
  by mx06.futurequest.net ([69.5.6.177])
  with ESMTP via TCP; 02 Jan 2018 09:05:18 -0000

According to this field, the mail server mx06.futurequest.net noted that it received this email from a server named quebec.terocrif.bid, whose IP address is 173.232.229.177. Because Received: fields are prepended to the email header, the oldest field is located at the bottom of the header block. In the example header above, only two Received: fields have this form; the oldest one records the IP address 173.232.229.177. We take this to be the IP address of the machine that sent out this email.
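The selection rule above (take the oldest Received: field of the "from ... by ..." form) can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name origin_ip and the regular expression are hypothetical and are not taken from spam_analysis.py, and only IPv4 addresses written in square brackets are recognized.

```python
import re
from email import message_from_string

# Hypothetical pattern (an assumption of this sketch) for fields of the form
#   from <claimed name> (<reverse-lookup name> [<IPv4 address>])
# The reverse-lookup name is optional; the bracketed address is captured.
RECEIVED_FROM = re.compile(
    r"from\s+\S+\s+\((?:\S+\s+)?\[(\d{1,3}(?:\.\d{1,3}){3})\]\)"
)


def origin_ip(raw_header):
    """Return the IP recorded in the oldest matching Received: field, or None."""
    message = message_from_string(raw_header)
    received_fields = message.get_all("Received") or []
    # Fields are prepended, so the oldest one is last: scan from the bottom up.
    for value in reversed(received_fields):
        match = RECEIVED_FROM.search(value)
        if match:
            return match.group(1)
    return None


# A shortened copy of the example header shown earlier in this lesson.
SAMPLE_HEADER = """\
Delivered-To: bruce@untroubled.org
Received: (fqmail 15388 invoked from network); 02 Jan 2018 09:06:29 -0000
Received: from mx06.futurequest.net (mx06.futurequest.net [69.5.6.177])
  by 10.170.1.183 ([10.170.1.183])
  with FQDP via TCP; 02 Jan 2018 09:06:29 -0000
Received: (qmail 4668 invoked from network); 2 Jan 2018 09:06:29 -0000
Received: from quebec.terocrif.bid (go1-longer.lgol.net [173.232.229.177])
  by mx06.futurequest.net ([69.5.6.177])
  with ESMTP via TCP; 02 Jan 2018 09:05:18 -0000
Subject: Science of Numerological Analysis

"""

print(origin_ip(SAMPLE_HEADER))  # 173.232.229.177
```

The standard email module takes care of joining the continuation lines of each field, so the pattern only needs to describe a single field value.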

Next, our job is to figure out the country from which this email originated. An IP address is associated with a country and an organization. This is publicly known information, since address assignments are officially maintained by various registries at the regional, country, and international levels. There are internet services that map an IP address to a geographical location (“geolocation”) such as a country, a city, and even approximate geo-coordinates. We will use data from one such free service to map an IP address to its country. Because we need to perform a very large number of lookups, we will use a downloaded database to provide that mapping.

What Is an IP Address?

An IP address is usually expressed in terms of a quartet of integers, such as 128.82.112.29 (this is the IP address of www.odu.edu). Each integer is in the range of 0 through 255 (inclusive). This address is actually a 32-bit integer (consisting of four 8-bit integers). For the example IP address above, the 32-bit number can be computed in this way (where ** refers to the exponentiation operator):

128 * 256**3 + 82 * 256**2 + 112 * 256 + 29
    = 2,152,886,301

The scheme above refers to the “classic” IP address, also known as IPv4, which provides about 4.3 billion (2**32) unique addresses. In the newer protocol, IPv6, an IP address is a 128-bit number, so it can accommodate enormously more devices connected to the global network (2**128, or about 3.4 × 10**38, addresses).
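The IPv4 conversion above can be written as a short Python function. This is a minimal sketch; the name ipv4_to_int is hypothetical, and the last line cross-checks the result against Python's standard ipaddress module.

```python
import ipaddress


def ipv4_to_int(address):
    """Convert a dotted-quad IPv4 string to its 32-bit integer value."""
    parts = address.split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four fields in {address!r}")
    value = 0
    for part in parts:
        octet = int(part)
        if not 0 <= octet <= 255:
            raise ValueError(f"field out of range in {address!r}")
        # Shift the running value by 8 bits (multiply by 256), then add the octet.
        value = value * 256 + octet
    return value


print(ipv4_to_int("128.82.112.29"))  # 2152886301

# Cross-check against the standard library's own conversion.
assert ipv4_to_int("128.82.112.29") == int(ipaddress.IPv4Address("128.82.112.29"))
```

Multiplying the running value by 256 at each step is equivalent to the explicit powers of 256 in the formula above.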

Information Veracity: The Devil in the Detail

Astute readers will notice that the IP address shown above, 173.232.229.177, was associated with two host names: quebec.terocrif.bid and go1-longer.lgol.net. Why the apparently conflicting information, you may ask? go1-longer.lgol.net is the host name returned by the IP reverse lookup. (On a Linux or macOS terminal, you can do a reverse IP lookup by invoking either host 173.232.229.177 or nslookup 173.232.229.177.) The name quebec.terocrif.bid was claimed by the sending server itself. In this case, an nslookup call for either hostname would point to the same IP address, but sometimes a server can lie. The point is that, of all these pieces of information, the IP address is the most likely to be trustworthy. Even then, a spammer can add a phony Received: field, and we have observed that in our dataset. In our exercise, we will assume that all Received: fields are honest. A thorough analysis of email headers is outside the scope of this training; it would be too big a digression from the fun thing we are about to do: running computations on a supercomputer!

About the Datasets

There are two datasets we will be using in this exercise:

  1. A spam email collection created by Bruce Guenter at untroubled.org.

  2. An open, freely downloadable database of IP geolocation mapping as provided by IPInfoDB. We are using the DB1.LITE database from this website, which maps ranges of IP addresses to countries.

The Untroubled Spam Collection

The spam collection was downloaded in September 2018 from http://untroubled.org/spam/. It contains emails classified as spam from March 1998 through September 2018. There are over 85 million emails in total in this dataset, and the rate at which spam is received increases by the year. The total uncompressed size of the data is over 45 GB. For an average person, it is a formidable monster to process.

Exploring the spam collection

Let us explore the spam collection for a little bit so that you become familiar with its contents.

  • Go to directory /scratch/Workshops/DeapSECURE/datasets/untroubled-spam
  • List the contents of that directory: These are the years in which spam emails were captured.
  • Go into directory 1998 and list its contents: These spam mails were divided into months.
  • Go into one of the months (e.g. 03) and list its contents: what did you see?
  • View a couple of files using the less command to get a sense of the content of the spam emails.

WARNING: The spam collection contains a very large number of files.

The IP Geolocation Mapping

IP addresses are assigned to a certain organization (and, by extension, a certain country) in ranges of 32-bit numbers. The DB1.LITE database provides a very simple table that looks like this:

min_ip      max_ip      country_code  country
0           16777215    -             -
16777216    16777471    AU            Australia
16777472    16778239    CN            China
16778240    16779263    AU            Australia
16779264    16781311    CN            China
16781312    16785407    JP            Japan
3758095872  3758096127  SG            Singapore
3758096128  3758096383  AU            Australia
3758096384  4294967295  -             -

Here, country_code refers to the two-letter country code. We use this table to look up the country associated with a certain IP address. All IP addresses in the range of min_ip through max_ip (inclusive endpoints) belong to a certain country. You can see that a country is generally associated with multiple blocks of IP addresses. These address blocks are scattered around, and there is no apparent pattern.

Finding the Country of an IP Address

With other participants or with your friend, discuss the method, or algorithm, to map an IP address to the corresponding country. Think specifically about how this method is implemented in a computer program.

Solution

  • For every row in the table, check if the given IP satisfies min_ip <= IP <= max_ip.
  • When a row matches, return the country.
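The row-by-row check described in the solution can be sketched in Python as follows. The rows are copied from the excerpt of the table shown earlier; the function names are hypothetical. Because the rows are sorted by min_ip, the sketch also shows a binary-search variant using the standard bisect module, which is an optional refinement and not part of the solution above.

```python
import bisect

# Rows copied from the table excerpt: (min_ip, max_ip, country_code, country).
TABLE = [
    (0, 16777215, "-", "-"),
    (16777216, 16777471, "AU", "Australia"),
    (16777472, 16778239, "CN", "China"),
    (16778240, 16779263, "AU", "Australia"),
    (16779264, 16781311, "CN", "China"),
    (16781312, 16785407, "JP", "Japan"),
    (3758095872, 3758096127, "SG", "Singapore"),
    (3758096128, 3758096383, "AU", "Australia"),
    (3758096384, 4294967295, "-", "-"),
]


def find_country_linear(ip, table=TABLE):
    """Check every row in turn; return (code, country) or None if no row matches."""
    for min_ip, max_ip, code, country in table:
        if min_ip <= ip <= max_ip:
            return code, country
    return None


def find_country_bisect(ip, table=TABLE):
    """Same lookup, but binary-search the sorted min_ip column first."""
    starts = [row[0] for row in table]
    # Index of the last row whose min_ip is <= ip.
    index = bisect.bisect_right(starts, ip) - 1
    if index >= 0:
        min_ip, max_ip, code, country = table[index]
        if min_ip <= ip <= max_ip:
            return code, country
    return None


print(find_country_linear(16780000))    # ('CN', 'China')
print(find_country_bisect(3758096000))  # ('SG', 'Singapore')
```

The linear scan inspects every row in the worst case, whereas the binary search needs only a handful of comparisons per lookup. That difference matters when the same table is queried for a very large number of emails.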

Case 0: Running on Your Own Computer

File preparation

This section assumes that you have staged the exercise files in your home directory, as explained in the Basic Shell Interaction episode.

Suppose you want to run this analysis on your own computer. You would simply run the program with the appropriate argument. If Turing were your own laptop, you could do just this:

$ cd ~/CItraining/module1/Spam_analyser

Running on HPC: Sequential processing

When faced with massive amounts of data, such as the spam collection above, it is necessary to use the massive processing power of an HPC system to analyze the origin of these emails in a timely manner.

In this exercise, we will first introduce how we create and process a computation on an HPC system. We will use only a single CPU core on a single compute node in this first exercise. Later, we will leverage parallel processing to use more CPU cores and get the job done faster!

As mentioned in the introduction section, HPC systems are shared among many users, therefore users must follow

Preparing input list

As mentioned above, the spam emails are organized by year and by month. Each year is stored in a folder named YYYY (the four-digit year). Each year folder contains one folder per month. So, for example, /scratch-lustre/DeapSECURE/module01/spams/untroubled/1999/03 is the folder containing all spam emails from March 1999. Our folder /scratch-lustre/DeapSECURE/module01/spams/untroubled/ contains the entire email collection. The spam_analysis.py script is a program written in Python that processes all the emails in one month folder. To run it, pass the path of the month folder to the script:

$ ./spam_analysis.py /scratch-lustre/DeapSECURE/module01/spams/untroubled/1999/03

The command above runs spam_analysis.py on all emails received in March 1999.

Now that we know how to analyze a single folder, how can we analyze an entire year? This is simple. All we need is to call spam_analysis.py for each month in the year.

./spam_analysis.py /scratch-lustre/DeapSECURE/module01/spams/untroubled/1999/01
./spam_analysis.py /scratch-lustre/DeapSECURE/module01/spams/untroubled/1999/02
./spam_analysis.py /scratch-lustre/DeapSECURE/module01/spams/untroubled/1999/03
./spam_analysis.py /scratch-lustre/DeapSECURE/module01/spams/untroubled/1999/04
./spam_analysis.py /scratch-lustre/DeapSECURE/module01/spams/untroubled/1999/05
./spam_analysis.py /scratch-lustre/DeapSECURE/module01/spams/untroubled/1999/06
./spam_analysis.py /scratch-lustre/DeapSECURE/module01/spams/untroubled/1999/07
./spam_analysis.py /scratch-lustre/DeapSECURE/module01/spams/untroubled/1999/08
./spam_analysis.py /scratch-lustre/DeapSECURE/module01/spams/untroubled/1999/09
./spam_analysis.py /scratch-lustre/DeapSECURE/module01/spams/untroubled/1999/10
./spam_analysis.py /scratch-lustre/DeapSECURE/module01/spams/untroubled/1999/11
./spam_analysis.py /scratch-lustre/DeapSECURE/module01/spams/untroubled/1999/12

This works but is tedious. How can we automate this so that we only need to type a few commands? Remember the ls command? We are going to use it. If you look at the manual for ls, you will see that there is an option to list directories themselves rather than their contents. Try this:

$ ls -d ./*/

The -d option lists a directory's name and not its contents. The argument ./*/ means every entry in ./ (the current directory) that matches the pattern */; because of the trailing /, only directories match. Now try this:

$ ls -d /scratch-lustre/DeapSECURE/module01/spams/untroubled/*/

This lists all folders in /scratch-lustre/DeapSECURE/module01/spams/untroubled/. We want only the year directories, so we change our pattern to match YYYY:

$ ls -d /scratch-lustre/DeapSECURE/module01/spams/untroubled/[0-9][0-9][0-9][0-9]/

Now we have all the years in our email collection. The next step is to get the months. This is simple, because we know that month folders have two-digit names. All we need to do is append a pattern for the month folder to each year:

$ ls -d /scratch-lustre/DeapSECURE/module01/spams/untroubled/[0-9][0-9][0-9][0-9]/[0-9][0-9]

To feed this list to the spam_analysis.py script, we need one month folder per line. To do this, we add the -1 option (display one entry per line) to the ls command:

$ ls -1 -d /scratch-lustre/DeapSECURE/module01/spams/untroubled/[0-9][0-9][0-9][0-9]/[0-9][0-9]

To launch spam_analysis.py on these inputs one at a time, we use a for loop; otherwise it would be tedious to run them one by one:

for m in $(ls -1 -d /scratch-lustre/DeapSECURE/module01/spams/untroubled/[0-9][0-9][0-9][0-9]/[0-9][0-9]); do
     ./spam_analysis.py "$m"
done

This will take a while, as analyzing each file takes some time. You will want to run this as a batch job so that you do not have to stay logged in waiting for it to complete.
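The directory selection done above with ls and shell patterns can also be expressed in Python, which may be convenient if the list of month folders is needed inside a script. This is a minimal sketch; the function name month_dirs is hypothetical and is not part of the lesson's files. It relies on pathlib's glob method, which accepts the same [0-9] character classes as the shell.

```python
from pathlib import Path


def month_dirs(root):
    """Return the sorted YYYY/MM directories found under root."""
    # Same pattern as the shell version: four digits, then two digits.
    pattern = "[0-9][0-9][0-9][0-9]/[0-9][0-9]"
    return sorted(p for p in Path(root).glob(pattern) if p.is_dir())


# Print the commands that the shell loop above would run, one per month folder.
# On a machine without this directory, nothing is printed.
root = "/scratch-lustre/DeapSECURE/module01/spams/untroubled"
for month in month_dirs(root):
    print("./spam_analysis.py", month)
```

Filtering with is_dir() ensures that a regular file whose name happens to consist of two digits is not mistaken for a month folder.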

Key Points

  • A job script is used to launch a computation on an HPC system