This lesson is in the early stages of development (Alpha version)

DeapSECURE module 2: Dealing with Big Data: Analytics of Network Intrusion with Spark

Introduction

In this episode we will use the skills we learned in the previous episodes to work on a completely different dataset: a network intrusion dataset.

Our data comes from the “Intrusion Detection Evaluation Dataset (CICIDS2017)”, courtesy of the Canadian Institute for Cybersecurity. The description can be found here. In addition to the raw packet data collected using PCAP, the researchers have summarized about 80 flow features from the network traffic. These feature sets have been labeled as “benign” or “malicious” by the researchers. The data was collected from a network experiment conducted over five days in July of 2017. Two dedicated networks (one for the victims and one for the attackers) were created, consisting of computers with a variety of operating systems (Windows, Linux, Mac OS X), as well as network devices such as modems, firewalls, switches, and routers. During the experiment, all packets in the network were captured, then aggregated into network flows. The statistics of the network flows captured during the specified time were saved into CSV files (one or more CSV files per day). Each line in a CSV file contains the statistics of a particular flow; all individual packets have been aggregated according to their respective flows.

We will examine these network flow features using Spark. The data files are located on Turing at

/scratch-lustre/DeapSECURE/module02/CIC-IDS-2017/TrafficLabeling

Peeking into the dataset

Always take a look at your dataset before doing anything else! Use the shell commands from the previous lesson to do this. The ls and less commands will be particularly useful.

Try to answer the following questions:

  1. How many files are in this dataset? How big are they?
  2. What does the data look like?

Solution

There are eight CSV files in this dataset. The file names indicate the day of the week on which the data were collected and, in some instances, the network attacks mounted on that day.

$ cd /scratch-lustre/DeapSECURE/module02/CIC-IDS-2017/TrafficLabeling
$ ls -l
total 1174944
-rw-r--r-- 1 wpurwant DeapSECURE  96101069 Jun  7  2018 Friday-WorkingHours-Afternoon-DDos.pcap_ISCX.csv
-rw-r--r-- 1 wpurwant DeapSECURE 101874777 Aug 17  2017 Friday-WorkingHours-Afternoon-PortScan.pcap_ISCX.csv
-rw-r--r-- 1 wpurwant DeapSECURE  75386737 Aug 17  2017 Friday-WorkingHours-Morning.pcap_ISCX.csv
-rw-r--r-- 1 wpurwant DeapSECURE 268649908 Aug 10  2017 Monday-WorkingHours.pcap_ISCX.csv
-rw-r--r-- 1 wpurwant DeapSECURE 108723183 Aug 17  2017 Thursday-WorkingHours-Afternoon-Infilteration.pcap_ISCX.csv
-rw-r--r-- 1 wpurwant DeapSECURE  92030223 Aug 17  2017 Thursday-WorkingHours-Morning-WebAttacks.pcap_ISCX.csv
-rw-r--r-- 1 wpurwant DeapSECURE 174696560 Aug 17  2017 Tuesday-WorkingHours.pcap_ISCX.csv
-rw-r--r-- 1 wpurwant DeapSECURE 285642925 Aug 17  2017 Wednesday-workingHours.pcap_ISCX.csv

The file sizes range from 75 MB to 285 MB.

HINT: Use ls -l -h to show the file sizes in units that are easy to read (kB, MB, GB, etc.).

It is always a good idea to take an initial look at the data using the less command. The first few lines of Monday-WorkingHours.pcap_ISCX.csv look like:

Flow ID, Source IP, Source Port, Destination IP, Destination Port, Protocol, Timestamp, Flow Duration, Total Fwd Packets, Total Backward Packets,Total Length of Fwd Packets, Total Length of Bwd Packets, Fwd Packet Length Max, Fwd Packet Length Min, Fwd Packet Length Mean, Fwd Packet Length Std,Bwd Packet Length Max, Bwd Packet Length Min, Bwd Packet Length Mean, Bwd Packet Length Std,Flow Bytes/s, Flow Packets/s, Flow IAT Mean, Flow IAT Std, Flow IAT Max, Flow IAT Min,Fwd IAT Total, Fwd IAT Mean, Fwd IAT Std, Fwd IAT Max, Fwd IAT Min,Bwd IAT Total, Bwd IAT Mean, Bwd IAT Std, Bwd IAT Max, Bwd IAT Min,Fwd PSH Flags, Bwd PSH Flags, Fwd URG Flags, Bwd URG Flags, Fwd Header Length, Bwd Header Length,Fwd Packets/s, Bwd Packets/s, Min Packet Length, Max Packet Length, Packet Length Mean, Packet Length Std, Packet Length Variance,FIN Flag Count, SYN Flag Count, RST Flag Count, PSH Flag Count, ACK Flag Count, URG Flag Count, CWE Flag Count, ECE Flag Count, Down/Up Ratio, Average Packet Size, Avg Fwd Segment Size, Avg Bwd Segment Size, Fwd Header Length,Fwd Avg Bytes/Bulk, Fwd Avg Packets/Bulk, Fwd Avg Bulk Rate, Bwd Avg Bytes/Bulk, Bwd Avg Packets/Bulk,Bwd Avg Bulk Rate,Subflow Fwd Packets, Subflow Fwd Bytes, Subflow Bwd Packets, Subflow Bwd Bytes,Init_Win_bytes_forward, Init_Win_bytes_backward, act_data_pkt_fwd, min_seg_size_forward,Active Mean, Active Std, Active Max, Active Min,Idle Mean, Idle Std, Idle Max, Idle Min, Label
192.168.10.5-8.254.250.126-49188-80-6,8.254.250.126,80,192.168.10.5,49188,6,03/07/2017 08:55:58,4,2,0,12.0,0.0,6.0,6.0,6.0,0.0,0,0,0,0,3000000.0,500000.0,4.0,0.0,4.0,4.0,4.0,4.0,0.0,4.0,4.0,0,0,0,0,0,0,0,0,0,40,0,500000.0,0.0,6.0,6.0,6.0,0.0,0.0,0,0,0,0,1,1,0,0,0.0,9.0,6.0,0.0,40,0,0,0,0,0,0,2,12,0,0,329,-1,1,20,0,0,0,0,0,0,0,0,BENIGN
192.168.10.5-8.254.250.126-49188-80-6,8.254.250.126,80,192.168.10.5,49188,6,03/07/2017 08:55:58,1,2,0,12.0,0.0,6.0,6.0,6.0,0.0,0,0,0,0,1.2E7,2000000.0,1.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0,0,0,0,0,0,0,0,0,40,0,2000000.0,0.0,6.0,6.0,6.0,0.0,0.0,0,0,0,0,1,1,0,0,0.0,9.0,6.0,0.0,40,0,0,0,0,0,0,2,12,0,0,329,-1,1,20,0,0,0,0,0,0,0,0,BENIGN
192.168.10.5-8.254.250.126-49188-80-6,8.254.250.126,80,192.168.10.5,49188,6,03/07/2017 08:55:58,1,2,0,12.0,0.0,6.0,6.0,6.0,0.0,0,0,0,0,1.2E7,2000000.0,1.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0,0,0,0,0,0,0,0,0,40,0,2000000.0,0.0,6.0,6.0,6.0,0.0,0.0,0,0,0,0,1,1,0,0,0.0,9.0,6.0,0.0,40,0,0,0,0,0,0,2,12,0,0,329,-1,1,20,0,0,0,0,0,0,0,0,BENIGN
192.168.10.5-8.254.250.126-49188-80-6,8.254.250.126,80,192.168.10.5,49188,6,03/07/2017 08:55:58,1,2,0,12.0,0.0,6.0,6.0,6.0,0.0,0,0,0,0,1.2E7,2000000.0,1.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0,0,0,0,0,0,0,0,0,40,0,2000000.0,0.0,6.0,6.0,6.0,0.0,0.0,0,0,0,0,1,1,0,0,0.0,9.0,6.0,0.0,40,0,0,0,0,0,0,2,12,0,0,329,-1,1,20,0,0,0,0,0,0,0,0,BENIGN

Definitely not easy to look at! But you can see some objects you recognize, such as IP addresses, port numbers, and a timestamp.

HINT: Try less -S Monday-WorkingHours.pcap_ISCX.csv for a more natural look at the table; you can scroll right and left using the Left and Right arrow keys.

A primer on computer networking

Computer networking is a very complex subject, with multiple layers involved in delivering the networking experience that we know today. There are seven layers in the Open Systems Interconnection (OSI) networking model, from the physical layer all the way up to the application layer. Please check out this Network World article or the Wikipedia article on the OSI model for more details, as they are outside the scope of this training. The two most widely known layers in networking are the transport and application layers (layers 4 and 7, respectively).

There are two network transport protocols in very wide use today: TCP (short for Transmission Control Protocol) and UDP (User Datagram Protocol). These are often considered layer-4 protocols in the OSI model. Major services such as web, email, secure shell, etc. (on the application layer) are built on TCP because of the need for reliable, error-free communication. Each of these services has its own respective protocol, or “language”; therefore we speak of the HTTP (HyperText Transfer Protocol), HTTPS (HyperText Transfer Protocol Secure), SSH (Secure Shell) protocols, and so on.

What is a flow?

In computer networking, a flow is defined as a sequence of packets from a computer to a destination (usually another computer, but it can also be a broadcast domain, etc.). An analogy for a flow is a phone conversation:

  • There is a beginning and an end of the flow, from “Hello” to “Good bye”.

  • The source and destination remain the same throughout one flow.

  • The “conversation” is generally bi-directional, but one side is usually the client and the other is the server that provides a specific service to the client.

  • There is a particular “language” of the conversation. One flow may be web (HTTP) traffic; another may be encrypted (HTTPS) traffic; yet another may be FTP; and so on and so forth.

A source or a destination consists of a (host, port) pair. The host is identified by its IP address; a port is an integer. Therefore, a flow is always characterized by the following features (a small parsing sketch follows the list):

  • source host (an IP address)
  • source port (an integer ranging from 0 through 65535)
  • destination host
  • destination port
  • network protocol (6 for TCP, 17 for UDP, …; see IANA protocol number reference)
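
In this dataset, these five identifiers are also fused together in the FlowID string of each record. As a quick illustration, here is a minimal Python sketch that splits a FlowID back into its components; the field order is an assumption based on the sample rows shown earlier, and parse_flow_id is a hypothetical helper, not part of the dataset tooling:

def parse_flow_id(flow_id):
    """Split a FlowID such as '192.168.10.5-8.254.250.126-49188-80-6'
    into its five flow identifiers (field order assumed from the
    sample rows shown earlier in this episode)."""
    ip1, ip2, port1, port2, protocol = flow_id.split("-")
    return (ip1, ip2, int(port1), int(port2), int(protocol))

>>> parse_flow_id("192.168.10.5-8.254.250.126-49188-80-6")
('192.168.10.5', '8.254.250.126', 49188, 80, 6)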

Well-known TCP port numbers

A TCP port distinguishes the specific application that is communicating over the TCP network. Within a given host, there can be many applications using the network simultaneously (think of your browser having multiple tabs: the process behind each tab will have its own TCP port number to communicate with the remote web server). There are several well-known port numbers belonging to popular internet services, as tabulated below (with the name of the respective application-level protocol):

Protocol   Port number   Description
SSH        22            Secure shell (encrypted)
Telnet     23            Remote login (unencrypted)
SMTP       25            Email sending / forwarding
DNS        53            Domain Name System (mapping between host name and IP address)
HTTP       80            HyperText Transfer Protocol (for serving web pages)
POP3       110           Post Office Protocol (email retrieval)
RPC        135           Microsoft’s Remote Procedure Call
IMAP4      143           Internet Message Access Protocol 4 (email retrieval)
LDAP       389           Lightweight Directory Access Protocol (e.g. for network-based user authentication)
HTTPS      443           HyperText Transfer Protocol Secure (for serving encrypted web pages)

Please see this Webopedia article or the Meridian Outpost web page for more well-known port numbers.

One thing to be aware of is that these port numbers are merely conventions, and a system administrator need not follow them. For example, a computer owner may choose to run SSH on port 2222 in order to evade most port scanners, which look for open SSH ports to attack.
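
Such a port-to-service mapping can be handy when interpreting query results later in this episode. Here is a minimal sketch; the dictionary merely restates the table above, and WELL_KNOWN_PORTS and port_name are hypothetical helpers:

WELL_KNOWN_PORTS = {
    22: "SSH", 23: "Telnet", 25: "SMTP", 53: "DNS", 80: "HTTP",
    110: "POP3", 135: "RPC", 143: "IMAP4", 389: "LDAP", 443: "HTTPS",
}

def port_name(port):
    """Return the conventional service name for a port number, if known."""
    return WELL_KNOWN_PORTS.get(port, "(not a well-known port)")

>>> port_name(443)
'HTTPS'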

Initial Exploratory Questions

Earlier, we looked at the data from the shell (using ls and less). But Spark will allow far more information to be unearthed from the dataset. Here are several questions that we can ask from the beginning:

  1. What are the features found in the dataset?

  2. Can we find some differences in the normal (benign) and attack traffic?

  3. There were several kinds of attacks recorded in the dataset; can we find some striking characteristics of each attack?

These are higher-level questions than querying the count of records (the DF.count() method) or printing out a few records (the DF.take(n) method). We will not answer all of these questions in this episode; the unanswered questions are left for the participants to explore on their own.

Exploration 1: Monday Traffic (benign, no attack)

On Monday there was only ordinary network traffic, so this is a good dataset to start our exploration with.

First, let us fire up a new Spark session and load the dataset. This dataset has 85 columns, so bear with the length of the program in some places. You will want to copy and paste the first several lines to avoid errors. Here is a set of commands you will want to type into your pyspark session (shown with no >>> prompt to aid copying and pasting).

SCHEMA_NETFLOW = "FlowID STRING, src_ip STRING, src_port INTEGER, dest_ip STRING, dest_port INTEGER, protocol INTEGER, t_stamp TIMESTAMP, duration BIGINT, fw_npkt_tot BIGINT, bw_npkt_tot BIGINT, fw_plen_tot DOUBLE, bw_plen_tot DOUBLE, fw_plen_max DOUBLE, fw_plen_min DOUBLE, fw_plen_mean DOUBLE, fw_plen_std DOUBLE, bw_plen_max DOUBLE, bw_plen_min DOUBLE, bw_plen_mean DOUBLE, bw_plen_std DOUBLE, bd_Bps DOUBLE, bd_Pps DOUBLE, bd_IAT_mean DOUBLE, bd_IAT_std DOUBLE, bd_IAT_max DOUBLE, bd_IAT_min DOUBLE, fw_IAT_tot DOUBLE, fw_IAT_mean DOUBLE, fw_IAT_std DOUBLE, fw_IAT_max DOUBLE, fw_IAT_min DOUBLE, bw_IAT_tot DOUBLE, bw_IAT_mean DOUBLE, bw_IAT_std DOUBLE, bw_IAT_max DOUBLE, bw_IAT_min DOUBLE, fw_nflg_PSH INTEGER, bw_nflg_PSH INTEGER, fw_nflg_URG INTEGER, bw_nflg_URG INTEGER, fw_len_hdr INTEGER, bw_len_hdr INTEGER, fw_Pps DOUBLE, bw_Pps DOUBLE, plen_min DOUBLE, plen_max DOUBLE, plen_mean DOUBLE, plen_std DOUBLE, plen_var DOUBLE, nflg_FIN INTEGER, nflg_SYN INTEGER, nflg_RST INTEGER, nflg_PSH INTEGER, nflg_ACK INTEGER, nflg_URG INTEGER, nflg_CWE INTEGER, nflg_ECE INTEGER, down_up_ratio DOUBLE, pkt_size_avg DOUBLE, fw_seg_size_avg DOUBLE, bw_seg_size_avg DOUBLE, fw_len_hdr61 INTEGER, fw_Bpb_avg DOUBLE, fw_Ppb_avg DOUBLE, fw_bulkrate_avg DOUBLE, bw_Bpb_avg DOUBLE, bw_Ppb_avg DOUBLE, bw_bulkrate_avg DOUBLE, fw_npkt_subflow BIGINT, fw_nbytes_subflow BIGINT, bw_npkt_subflow BIGINT, bw_nbytes_subflow BIGINT, fw_init_win_bytes INTEGER, bw_init_win_bytes INTEGER, fw_act_data_pkt BIGINT, fw_min_seg_size BIGINT, active_mean DOUBLE, active_std DOUBLE, active_max DOUBLE, active_min DOUBLE, idle_mean DOUBLE, idle_std DOUBLE, idle_max DOUBLE, idle_min DOUBLE, label STRING"

# load the dataset
DF_MON = spark.read.csv("/scratch-lustre/DeapSECURE/module02/CIC-IDS-2017/TrafficLabeling/Monday-WorkingHours.pcap_ISCX.csv",
                        sep=",", ignoreLeadingWhiteSpace=True, ignoreTrailingWhiteSpace=True,
                        timestampFormat="dd/MM/yyyy HH:mm:ss",
                        header=True,
                        schema=SCHEMA_NETFLOW)

The SCHEMA_NETFLOW variable defines the names of the columns (sometimes called “features” later on) as well as the corresponding data types. The syntax of the schema string follows that of DDL (Data Definition Language), which basically boils down to: COLUMN_NAME1 TYPE1, COLUMN_NAME2 TYPE2, ....
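
As a tiny illustration of the DDL syntax, here is a hypothetical three-column schema (the file and column names are made up, not part of our dataset):

SCHEMA_EXAMPLE = "name STRING, age INTEGER, height DOUBLE"
df_people = spark.read.csv("people.csv", header=True, schema=SCHEMA_EXAMPLE)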

The second command reads the CSV file into a Spark DataFrame called DF_MON. The extra parameters specify mundane details such as the field separator (a comma character), what to do with extra whitespace characters, and the format of the time stamp. The header=True parameter indicates that the first line contains the column names. The schema parameter is key: it determines how each column is ingested into the DataFrame.

Python functions

You can shorten frequently used sequences of commands by defining functions. Most programming or scripting languages have this feature. In Python, functions are defined using the keyword def as follows:

def function_name([parameter_list]):
    # function body

As an example, you can wrap the previous loading of the CSV file into the following function:

def flowmeter_load_csv(filename):
    return spark.read.csv(filename, sep=",",
                          ignoreLeadingWhiteSpace=True, ignoreTrailingWhiteSpace=True,
                          timestampFormat="dd/MM/yyyy HH:mm:ss",
                          header=True, schema=SCHEMA_NETFLOW)

To load a data frame using your function, simply call flowmeter_load_csv with the file name as the parameter:

DF_MON = flowmeter_load_csv("/scratch-lustre/DeapSECURE/module02/CIC-IDS-2017/TrafficLabeling/Monday-WorkingHours.pcap_ISCX.csv")

Schema is important!

When reading data from a CSV file, specifying the correct schema is an important task. It boils down to:

  • Giving meaningful column names
  • Giving the correct data type for each column

Without a schema, everything will be read as strings, and it will be difficult to compare and sort numerical values (such as traffic duration, number of packets, etc.). You can verify this yourself, as shown in the sketch below.
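
A minimal sketch of what happens without a schema (re-reading the same Monday file, but omitting the schema parameter):

>>> DF_RAW = spark.read.csv("/scratch-lustre/DeapSECURE/module02/CIC-IDS-2017/TrafficLabeling/Monday-WorkingHours.pcap_ISCX.csv", header=True)
>>> DF_RAW.printSchema()   # every column now comes back as string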

Once the data is loaded into a dataframe, the schema can be queried back:

>>> DF_MON.schema
StructType(List(StructField(FlowID,StringType,true),StructField(src_ip,StringType,true),StructField(src_port,IntegerType,true),StructField(dest_ip,StringType,true),StructField(dest_port,IntegerType,true),StructField(protocol,IntegerType,true),StructField(t_stamp,TimestampType,true), ...

>>> DF_MON.printSchema()
root
 |-- FlowID: string (nullable = true)
 |-- src_ip: string (nullable = true)
 |-- src_port: integer (nullable = true)
 |-- dest_ip: string (nullable = true)
 |-- dest_port: integer (nullable = true)
 |-- protocol: integer (nullable = true)
 |-- t_stamp: timestamp (nullable = true)
 |-- duration: long (nullable = true)
 |-- fw_npkt_tot: long (nullable = true)
 |-- bw_npkt_tot: long (nullable = true)
 |-- fw_plen_tot: double (nullable = true)
 |-- bw_plen_tot: double (nullable = true)
 |-- fw_plen_max: double (nullable = true)
 |-- fw_plen_min: double (nullable = true)
 |-- fw_plen_mean: double (nullable = true)
 |-- fw_plen_std: double (nullable = true)
 |-- bw_plen_max: double (nullable = true)
 |-- bw_plen_min: double (nullable = true)
 |-- bw_plen_mean: double (nullable = true)
 |-- bw_plen_std: double (nullable = true)
 |-- bd_Bps: double (nullable = true)
 |-- bd_Pps: double (nullable = true)
 |-- bd_IAT_mean: double (nullable = true)
...
 |-- active_mean: double (nullable = true)
 |-- active_std: double (nullable = true)
 |-- active_max: double (nullable = true)
 |-- active_min: double (nullable = true)
 |-- idle_mean: double (nullable = true)
 |-- idle_std: double (nullable = true)
 |-- idle_max: double (nullable = true)
 |-- idle_min: double (nullable = true)
 |-- label: string (nullable = true)

The first way above is an attribute (thus no need for ()), but it returns something that is mostly for the computer to comprehend. (The output was truncated to avoid overwhelming the reader.) The second way prints the schema in a way that is understandable to us humans.

Dataset features

In machine learning (the topic of the next workshop), a column in the data frame is often called a feature. When encountering a new dataset, we will want to familiarize ourselves with its features. The column names are coded using the following shorthands:

shorthand   explanation
src         source computer
dest        destination computer
fw          forward direction (i.e. data flowing from src_ip to dest_ip)
bw          backward direction (i.e. data flowing from dest_ip to src_ip)
bd          bidirectional (either forward or backward)
npkt        number of packets
plen        packet length (in units of bytes)
Bps         bytes per second (data transfer rate)
Pps         packets per second
Bpb         bytes per bulk
Ppb         packets per bulk
hdr         header (referring to the TCP packet header)
nflg        number of encounters of a particular flag (in the TCP packet header)
IAT         inter-arrival time (the time between two successive packet arrivals in the same flow)

With these shorthands, you can make sense of the first few quantities:

column name   meaning
src_ip        source computer’s IP address
src_port      source computer’s port number
dest_ip       destination computer’s IP address
dest_port     destination computer’s port number
protocol      a number designating the transport protocol (for example: 6=TCP, 17=UDP)
t_stamp       time stamp, with one-second resolution
duration      duration of the flow, in units of microseconds
fw_npkt_tot   total number of packets flowing in the forward direction
bw_plen_tot   total size of packets flowing in the backward direction

Many of the quantities are statistical in nature (they record the minimum, maximum, mean, and standard deviation of a quantity).
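
For example, the forward packet length alone appears as four statistical columns, which we can inspect directly (a quick sketch using the DataFrame loaded above):

>>> DF_MON.select('fw_plen_min', 'fw_plen_max', 'fw_plen_mean', 'fw_plen_std').show(5)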

After understanding the features of the data, we are in a position to explore the data!

Finding most frequently used network ports

As cybersecurity researchers or officers, we want to make sure that our network only contains traffic that is authorized. Specifically, we want to detect the presence of network attacks. In machine learning parlance, this task is called anomaly detection.

Our first investigation is to check the network ports being used by the network activities captured in the DF_MON DataFrame.

Challenge

Use pySpark to find the most used source and destination ports. Hint: Use the groupBy() method, then aggregate using count(), then sort the result in descending order.

Solution

>>> DF_MON.groupBy('src_port').count().sort('count', ascending=False).take(20)
[Row(src_port=443, count=84887),
 Row(src_port=80, count=18518),
 Row(src_port=123, count=5180),
 Row(src_port=137, count=1510),
 Row(src_port=22, count=1023),
 Row(src_port=21, count=533),
 Row(src_port=0, count=317),
 Row(src_port=138, count=315),
 Row(src_port=53, count=162),
 Row(src_port=5353, count=156),
 Row(src_port=139, count=107),
 Row(src_port=389, count=105),
 Row(src_port=59398, count=105),
 Row(src_port=39440, count=99),
 Row(src_port=49666, count=99),
 Row(src_port=59546, count=72),
 Row(src_port=445, count=69),
 Row(src_port=135, count=69),
 Row(src_port=8080, count=64),
 Row(src_port=88, count=59)]

>>> DF_MON.groupBy('dest_port').count().sort('count', ascending=False).take(20)
[Row(dest_port=53, count=214674),
 Row(dest_port=443, count=140952),
 Row(dest_port=80, count=50834),
 Row(dest_port=123, count=5180),
 Row(dest_port=22, count=2132),
 Row(dest_port=137, count=1522),
 Row(dest_port=389, count=1215),
 Row(dest_port=88, count=1207),
 Row(dest_port=21, count=960),
 Row(dest_port=465, count=728),
 Row(dest_port=3268, count=499),
 Row(dest_port=139, count=449),
 Row(dest_port=0, count=317),
 Row(dest_port=138, count=315),
 Row(dest_port=445, count=255),
 Row(dest_port=5353, count=156),
 Row(dest_port=49666, count=149),
 Row(dest_port=135, count=148),
 Row(dest_port=1900, count=115),
 Row(dest_port=8080, count=84)]

Here are some of the most frequently used ports:

Port number   count by dest_port   count by src_port
53            214674               162
443           140952               84887
80            50834                18518
123           5180                 5180
22            2132                 1023

The first row of the table above means that for port number 53, there are over 214k traffic flows with dest_port equal to 53, but only 162 flows with src_port equal to 53. Port 53 is for the DNS service, and most traffic has port 53 as the “destination”. This makes sense, because client computers initiate the network flow in order to query the DNS server whenever they need to look up the IP address belonging to a host name (e.g. for a web site). However, there are also flows where port 53 was recorded as the source; most likely, this is due to a long delay beyond the flow recorder’s timeout. In fact, the sense of direction in this dataset is effectively meaningless. This is why the statistics for port 443 (HTTPS) and port 80 (HTTP) exhibit counts that are similar in order of magnitude.

One way to get around this problem is to note the convention that most network services are published on low port numbers: system ports range from 0 to 1023, and user-defined (registered) ports from 1024 to 49151. Some services run on ports 1024 and above, but rarely on port numbers beyond 10000. Hackers often target low-numbered ports to find valuable resources to attack or penetrate. We can therefore take the minimum of src_port and dest_port and consider that as the service port (serv_port). Spark provides an easy way to do this, using column functions.

Column functions

The pyspark.sql.functions package contains a number of functions which act on one or more DataFrame columns and compute a new quantity. Sometimes these are called “SQL functions” because they are the functions used in SQL. There are two kinds of functions (a brief example of each follows the list):

  • Elemental functions, which operate on every value in the input column in an element-by-element fashion. The dimension of the output is therefore the same as the dimension of the input data. Some examples include: abs (absolute value), log (logarithm), lower (converting to lower-case characters).

  • Aggregate functions, which perform aggregation (reduction) on all values in the input column, usually yielding a single output value. Some examples include: avg (average value), count (counting the number of elements), max or min (computing the maximum or minimum value in a given column).
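
A quick illustration of the two kinds, assuming DF_MON is loaded as above (a sketch; the output is omitted here):

>>> from pyspark.sql.functions import lower, avg
>>> DF_MON.select(lower(DF_MON.label)).show(3)    # elemental: one output row per input row
>>> DF_MON.select(avg(DF_MON.duration)).show()    # aggregate: a single summary value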

The least function in the pyspark.sql.functions module computes the lowest value across two or more columns. Being an elemental function, least performs the comparison in a row-by-row fashion.

>>> from pyspark.sql.functions import least

# Create a new DataFrame with an additional column called 'serv_port':
>>> DF_MON_a = DF_MON.withColumn('serv_port', least('src_port', 'dest_port'))

In the snippet above, the columns were specified as strings; they will be understood as the column names of the DF_MON dataset.

EXERCISE: Check that DF_MON_a contains the new column, and that the value is the lowest of the two ports.
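
One possible way to start checking (a sketch; compare the three columns by eye):

>>> DF_MON_a.select('src_port', 'dest_port', 'serv_port').show(10)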

Performing statistical analysis

There are a lot of quantities recorded in DF_MON. We want to perform some statistical analysis on a few of them, focusing on traffic involving the HTTP port (80).
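
For instance, one could restrict the data to HTTP flows and summarize a few columns. A minimal sketch, assuming DF_MON_a (with its serv_port column) from the previous section:

>>> DF_HTTP = DF_MON_a.filter(DF_MON_a.serv_port == 80)
>>> DF_HTTP.describe('duration', 'fw_npkt_tot', 'bw_npkt_tot').show()
# describe() reports count, mean, stddev, min, and max for each listed column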

Exploration 2: Friday Traffic with DDoS

DDoS stands for Distributed Denial of Service. It is a kind of distributed network attack designed to bring down the targeted network service(s). In a DDoS, multiple machines are involved in mounting the attack: the attackers send an enormous number of packets to the destination computer in order to overwhelm it and eventually bring it down. Therefore, this type of attack should have stark features, such as a high number of bytes sent or received.

Let us focus our attention on this data file: Friday-WorkingHours-Afternoon-DDos.pcap_ISCX.csv .

>>> in_FriDDOS = "/scratch-lustre/DeapSECURE/module02/CIC-IDS-2017/MachineLearningCVE/Friday-WorkingHours-Afternoon-DDos.pcap_ISCX.csv"
>>> df_FriDDOS = spark.read.csv(in_FriDDOS, sep=",",
        ignoreLeadingWhiteSpace=True, ignoreTrailingWhiteSpace=True,
        header=True)
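
To probe the “stark features” claim above, one could compare a traffic-volume statistic across the two labels. A minimal sketch; the column name 'Total Length of Fwd Packets' is an assumption based on the header shown earlier, and the explicit cast is needed because this DataFrame was loaded without a schema:

>>> from pyspark.sql.functions import avg, col
>>> df_FriDDOS.groupBy('Label').agg(avg(col('Total Length of Fwd Packets').cast('double'))).show()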

Getting to know the data

  1. Find out how many records are in this dataset.

  2. What are the “features” (columns) of this dataset? HINT: Use df_FriDDOS.columns (no () because it is an attribute, not a method).

    The description of these features can be found on the CICFlowMeter website.

    The authors claim to have collected more than 80 flow features, but our datasets do not seem to have that many. Some of the features are multidimensional to begin with (mean, stddev, min, max).

  3. Based on your knowledge of networking, which of these features would be of interest to study first?

Solutions (partial)

  1. 225745 records.

  2. There are 79 columns in the dataset; the last column, named Label, marks whether each flow is benign or malicious network traffic. (A sketch of how to obtain these numbers follows.)
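
One way to arrive at the numbers above (a sketch):

>>> df_FriDDOS.count()
225745
>>> len(df_FriDDOS.columns)
79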

Quantities of interest

A few quantities appear interesting right out of the box.

Anything else interesting to you?

The network flows have been given two kinds of labels, BENIGN and DDoS:

In [151]: df_FriDDOS.groupBy('Label').count().show()
+------+------+                                                                 
| Label| count|
+------+------+
|BENIGN| 97718|
|  DDoS|128027|
+------+------+

Let us look at the destination port:

In [152]: df_fri_ddos = df_FriDDOS.filter("Label = 'DDoS'").groupBy('Destination Port').count()

In [153]: df_fri_ddos.count()
Out[153]: 4

In [154]: df_fri_ddos.show()
+----------------+------+
|Destination Port| count|
+----------------+------+
|           64873|     1|
|           64869|     1|
|              80|128024|
|           27636|     1|
+----------------+------+

Clearly … port 80 (HTTP) is the target of this attack.

In [155]: df_fri_norm = df_FriDDOS.filter("Label = 'BENIGN'").groupBy('Destination Port').count()

In [156]: df_fri_norm.count()
Out[156]: 23949

In [158]: df_fri_norm.sort('count', ascending=False).show(30)
+----------------+-----+                                                        
|Destination Port|count|
+----------------+-----+
|              53|31950|
|             443|13485|
|              80| 8927|
|            8080|  510|
|             123|  362|
|              22|  342|
|             137|  274|
|             389|  261|
|              88|  173|
|              21|  167|
|             465|  147|
|             139|  100|
|            3268|   91|
|               0|   54|
|             138|   53|
|             445|   48|
|           49666|   23|
|             135|   23|
|            5353|   16|
|           49671|   12|
|           55200|    8|
|           53675|    8|
|           59871|    8|
|           57844|    8|
|           55202|    8|
|           56873|    7|
|           61012|    7|
|           59044|    7|
|           55706|    7|
|           58505|    7|
+----------------+-----+
only showing top 30 rows

Can you recognize some of these ports?

Exploration 3: Wednesday Traffic: The Big Attack Day

On Wednesday we have an interesting set of events. Please load the data from the file Wednesday-workingHours.pcap_ISCX.csv .

>>> in_Wed = "/scratch-lustre/DeapSECURE/module02/CIC-IDS-2017/MachineLearningCVE/Wednesday-workingHours.pcap_ISCX.csv"
>>> df_Wed = spark.read.csv(in_Wed, sep=",",
        ignoreLeadingWhiteSpace=True, ignoreTrailingWhiteSpace=True,
        header=True)

What labels are attached to the events on Wednesday?

In [161]: df_Wed.groupBy('Label').count().show()
+----------------+------+
|           Label| count|
+----------------+------+
|          BENIGN|440031|
|   DoS slowloris|  5796|
|        DoS Hulk|231073|
|DoS Slowhttptest|  5499|
|   DoS GoldenEye| 10293|
|      Heartbleed|    11|
+----------------+------+

There were several attacks happening on this day. Let us see which ports were targeted by each attack:

In [162]: df_Wed.groupBy('Label', 'Destination Port').count().sort('count', ascending=False).show(50)
+----------------+----------------+------+
|           Label|Destination Port| count|
+----------------+----------------+------+
|        DoS Hulk|              80|231073|
|          BENIGN|              53|195531|
|          BENIGN|             443|100229|
|          BENIGN|              80| 48761|
|   DoS GoldenEye|              80| 10293|
|   DoS slowloris|              80|  5796|
|DoS Slowhttptest|              80|  5499|
|          BENIGN|             123|  4293|
|          BENIGN|              22|  2169|
|          BENIGN|             137|  1480|
|          BENIGN|             389|  1136|
|          BENIGN|              21|  1112|
|          BENIGN|              88|   932|
|          BENIGN|             465|   691|
|          BENIGN|             139|   578|
|          BENIGN|            3268|   476|
|          BENIGN|               0|   330|
|          BENIGN|             138|   326|
|          BENIGN|             445|   259|
|          BENIGN|            5353|   148|
|          BENIGN|             135|   141|
|          BENIGN|           49666|   127|
|          BENIGN|           49671|    59|
|          BENIGN|            5355|    51|
|          BENIGN|            8080|    27|
|          BENIGN|           58122|    26|
|          BENIGN|           51994|    14|
|          BENIGN|           53236|    14|
|          BENIGN|            7973|    14|
|          BENIGN|            8612|    13|
|          BENIGN|              42|    13|
|          BENIGN|           36522|    13|
|          BENIGN|           53498|    13|
|          BENIGN|           52164|    13|
|          BENIGN|           53952|    12|
|          BENIGN|           50240|    12|
|          BENIGN|           34242|    12|
|          BENIGN|           53300|    12|
|          BENIGN|            3289|    12|
|          BENIGN|           53298|    12|
|          BENIGN|           53020|    12|
|          BENIGN|           52768|    12|
|          BENIGN|           52912|    12|
|          BENIGN|           49856|    12|
|          BENIGN|            1124|    12|
|          BENIGN|           54436|    12|
|          BENIGN|           53396|    12|
|          BENIGN|           52538|    12|
|      Heartbleed|             444|    11|
|          BENIGN|           55896|    11|
+----------------+----------------+------+
only showing top 50 rows