
Case Study 2: Drone RF Signal Classification

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • What steps do we need to perform on a given dataset before doing machine learning?

Objectives
  • Understanding the key preparation steps leading to machine learning.

Introducing the Problem

Drones are controlled wirelessly with a remote control. The controller and the drone communicate via radio frequency (RF) signals. By intercepting these signals over the open air, one can detect the presence of drones nearby. Furthermore, by building a machine learning model to learn the characteristics of the signals emitted by each of these drones, we can distinguish the different drones that are actively emitting RF signals. To reiterate, the goals of machine learning here are that:

1) we will be able to detect the presence of one or more drones;

2) we will be able to distinguish which drones are currently active (emitting RF).

This problem is analogous to asking a computer to recognize the different musical instruments playing in an orchestra recording: the violin, piano, clarinet, bass, drums… How can this be done? The computer must be trained to recognize the audio characteristics of the violin, piano, clarinet, and so on. Each training datum, which is a snippet of audio recording (say, 100 milliseconds), comes with an associated label, be it “violin”, “piano”, “clarinet”… However, in the drone case, instead of feeding a series of raw RF packets to the model, the researchers have preprocessed the RF packets into a set of features, which we will explain below.

What kind of machine learning?

  • Does the drone recognition problem fall under the category of supervised learning or unsupervised learning?

  • Under either category, what is the learning subtype of this problem? (I.e., regression, classification, clustering, or dimensionality reduction.)

Answer

This is a supervised learning problem, and the second goal of the learning, “distinguish which drones”, gives a clear hint that we are dealing with a classification problem.

In this episode, we will cover how one prepares data for machine learning. In the next episode, we will perform the actual machine learning on the data. The preparation steps include: data loading and exploration, data cleaning, formatting the data, and splitting the data into training, validation, and testing sets.

Data Loading and Exploration

The first time we obtain new data, it is always good practice to take a look at the data first. The data file is located at the following path on Turing:

/scratch-lustre/DeapSECURE/module03/Drones/data/machinelearningdata.csv

It is in the CSV format; here are the first few lines of the data:

$ cd /scratch-lustre/DeapSECURE/module03/Drones/data

$ head machinelearningdata.csv
Subcarrier-Spacing,Symbol-time,fft-length,cp-length,Signal-Power,Detection,class
97.65625,0.01024,1024,32,463.655823,1,mavric
390.625,0.00256,256,32,466.324219,1,mavric
97.65625,0.01024,1024,32,462.016113,1,mavric
195.3125,0.00512,512,32,470.765259,1,mavric
97.65625,0.01024,1024,32,470.765259,1,mavric
195.3125,0.00512,512,32,470.765259,1,mavric
195.3125,0.00512,512,32,467.561218,1,mavric
97.65625,0.01024,1024,32,464.200226,1,mavric
1562.5,0.00064,64,4,449.435547,1,mavric

$ tail machinelearningdata.csv
390.625,0.00256,256,16,442.858521,1,Phantom
1562.5,0.00064,64,32,455.123779,1,Phantom
1562.5,0.00064,64,4,447.67157,1,Phantom
390.625,0.00256,256,16,284.808777,1,Phantom
1562.5,0.00064,64,16,448.591064,1,Phantom
1562.5,0.00064,64,8,441.025574,1,Phantom
781.25,0.00128,128,32,451.479004,1,Phantom
781.25,0.00128,128,32,454.212402,1,Phantom
781.25,0.00128,128,32,27.003826,1,Phantom
3125,0.00032,32,32,1.909638,0,Phantom

UNIX head and tail commands

The UNIX head and tail commands show the first or last few lines of a text file, respectively. By default, 10 lines are shown. If you want to show more or fewer lines, use the -n N option, where N is the number of lines you want to show.

A better dataset?

There is an alternative dataset that has one extra feature (“standard deviation”). It is located in the same directory as above, under the name machinelearning_stdev.csv. Some participants are encouraged to take this alternative data file and compare the end results with those of the original dataset.

The first line contains the column names. From this we know that there are seven columns in the CSV file. The dataset is presented in a simple tabular format, one line per record.

Features and output

Earlier we talked about the model (the function f), the features as the model’s inputs (X), and the output label (y). We have a yet-to-be-determined model, but we can already identify the features and the labels in this dataset. Consider the first record in the dataset above (the second line of the head command output):

97.65625,0.01024,1024,32,463.655823,1,mavric

The CSV header clearly tells us that the last column (named class) is the label that identifies which drone the signal belongs to (in this case, it is the “mavric” drone). The first six columns are the input parameters that were produced by the RF signal capture and pre-processing stage. These are the features (inputs). One “data point” (datum) consists of a set of features (the six numbers above) plus the label.

Usually, the features are presented as a vector of values to the machine learning algorithm. Thus the X for this datum can be represented as a Python list (array):

[ 97.65625, 0.01024, 1024, 32, 463.655823, 1 ]

In practice, a machine learning algorithm typically takes many data points at once; in that case, we speak of a feature matrix as the input. For our drone dataset, the feature matrix will contain the contents of CSV columns 1–6:

      /                                             \
      |  97.65625, 0.01024, 1024, 32, 463.655823, 1 |
      | 390.625,   0.00256,  256, 32, 466.324219, 1 |
      |  97.65625, 0.01024, 1024, 32, 462.016113, 1 |
      | 195.3125,  0.00512,  512, 32, 470.765259, 1 |
      |  97.65625, 0.01024, 1024, 32, 470.765259, 1 |
{X} = | ...                                         |
      |1562.5,     0.00064,   64,  8, 441.025574, 1 |
      | 781.25,    0.00128,  128, 32, 451.479004, 1 |
      | 781.25,    0.00128,  128, 32, 454.212402, 1 |
      | 781.25,    0.00128,  128, 32,  27.003826, 1 |
      |3125,       0.00032,   32, 32,  1.909638,  0 |
      \                                             /

The labels will be laid in a column vector format like this:

      /         \
      | mavric  |
      | mavric  |
      | mavric  |
      | mavric  |
      | mavric  |
{y} = | ...     |
      | Phantom |
      | Phantom |
      | Phantom |
      | Phantom |
      | Phantom |
      \         /

Here are the meanings of the columns of the data. The first four features (Subcarrier-Spacing, Symbol-time, fft-length, cp-length) are related to the OFDM parameters, which can be thought of as the “characteristics” of the RF signal.

The next two columns (Signal-Power and Detection) are related to the power characteristics of the signal.

(A second data file, machinelearning_stdev.csv, is also provided, which has a third power-related feature, the standard deviation.)

About OFDM

OFDM (Orthogonal Frequency Division Multiplexing) is a nifty way to encode a digital signal into a radio signal. It allows a limited band of frequencies to transport a large amount of information at once. To learn more about OFDM, interested readers are referred to these articles:

Loading the Data into Python

We are using Pandas to read the CSV-formatted input data into a DataFrame variable called df_drones (here, DATA_HOME holds the path of the data directory shown earlier):

>>> import pandas
>>> import numpy
>>> import os
>>> DATA_HOME = '/scratch-lustre/DeapSECURE/module03/Drones/data'
>>> df_drones = pandas.read_csv(os.path.join(DATA_HOME, 'machinelearningdata.csv'))

Once loaded, we can print the contents of df_drones to the terminal. Try this:

>>> df_drones

What comes out? Can you make sense of the output? How does it compare to the head and tail outputs earlier? As you can see, Pandas intentionally limits the amount of data it prints to the terminal.
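
As a side note, the number of rows Pandas prints can be adjusted through its display options. For example (an optional tweak, not needed for this lesson):

>>> pandas.set_option('display.max_rows', 20)   # print at most 20 rows
>>> df_drones                                   # output now truncated to 20 rows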

In Python, it is easy to check the type of a variable using the type function:

>>> type(df_drones)
pandas.core.frame.DataFrame

Pandas DataFrame also has the head and tail methods, which by default only show the first or last five rows:

>>> df_drones.head(10)
   Subcarrier-Spacing  Symbol-time  fft-length  cp-length  Signal-Power  Detection   class
0            97.65625      0.01024        1024         32    463.655823          1  mavric
1           390.62500      0.00256         256         32    466.324219          1  mavric
2            97.65625      0.01024        1024         32    462.016113          1  mavric
3           195.31250      0.00512         512         32    470.765259          1  mavric
4            97.65625      0.01024        1024         32    470.765259          1  mavric
5           195.31250      0.00512         512         32    470.765259          1  mavric
6           195.31250      0.00512         512         32    467.561218          1  mavric
7            97.65625      0.01024        1024         32    464.200226          1  mavric
8          1562.50000      0.00064          64          4    449.435547          1  mavric
9            97.65625      0.01024        1024         32    466.614563          1  mavric

>>> df_drones.tail(10)
      Subcarrier-Spacing  Symbol-time  fft-length  cp-length  Signal-Power  Detection    class
2990             390.625      0.00256         256         16    442.858521          1  Phantom
2991            1562.500      0.00064          64         32    455.123779          1  Phantom
2992            1562.500      0.00064          64          4    447.671570          1  Phantom
2993             390.625      0.00256         256         16    284.808777          1  Phantom
2994            1562.500      0.00064          64         16    448.591064          1  Phantom
2995            1562.500      0.00064          64          8    441.025574          1  Phantom
2996             781.250      0.00128         128         32    451.479004          1  Phantom
2997             781.250      0.00128         128         32    454.212402          1  Phantom
2998             781.250      0.00128         128         32     27.003826          1  Phantom
2999            3125.000      0.00032          32         32      1.909638          0  Phantom

Using this method, you can verify whether Pandas has loaded our dataset correctly. Incidentally, the tail output also shows that there are 3000 records in the DataFrame. Each record in a Pandas DataFrame is “numbered” with an index, which by default consists of the integers 0, 1, 2, … (through the number of records minus one).
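
You can also confirm the dataset’s size directly, without scanning the printed output:

>>> len(df_drones)      # number of records (rows)
3000
>>> df_drones.shape     # (rows, columns)
(3000, 7)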

DataFrame’s head and tail methods return another DataFrame

Each of the head and tail methods returns a new DataFrame object. The contents of this new DataFrame are what was printed out in the head and tail calls above. Because we are using an interactive Python session, the contents of the new DataFrame are printed to the terminal.

Another way to make sense of the dataset is to gather statistics on the numerical data, using the DataFrame’s describe method. This will tell us the magnitude of the values in each column:

>>> df_drones.describe()
       Subcarrier-Spacing  Symbol-time   fft-length    cp-length  Signal-Power    Detection
count         3000.000000  3000.000000  3000.000000  3000.000000   3000.000000  3000.000000
mean           921.061198     0.004095   409.530667    27.376000    351.274618     0.948333
std           1398.364566     0.003463   346.253810     8.473167    140.984973     0.221390
min             97.656250     0.000160    16.000000     4.000000      0.926364     0.000000
25%            195.312500     0.001280   128.000000    32.000000    282.561363     1.000000
50%            390.625000     0.002560   256.000000    32.000000    420.977829     1.000000
75%            781.250000     0.005120   512.000000    32.000000    454.217132     1.000000
max           6250.000000     0.010240  1024.000000    32.000000    499.303192     1.000000

In this output, the class column is omitted, as it does not contain numerical values.
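
The class column can still be summarized separately; for example, the value_counts method tallies how many records carry each label (we omit the output here, since the exact counts were not shown above):

>>> df_drones['class'].value_counts()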

Datatypes

Do you notice that some of the columns contain real numbers while others contain integers? Pandas detects and assigns the datatype of each column automatically. Let’s look at the datatypes stored in the df_drones we just loaded:

>>> df_drones.dtypes
Subcarrier-Spacing    float64
Symbol-time           float64
fft-length              int64
cp-length               int64
Signal-Power          float64
Detection               int64
class                  object
dtype: object

The Subcarrier-Spacing, Symbol-time, and Signal-Power columns have the float64 datatype (the IEEE standard double-precision datatype), while fft-length, cp-length, and Detection have the 64-bit integer datatype. Under the hood, both float64 and int64 columns are stored efficiently in Pandas as NumPy arrays with the appropriate datatypes.

The class column is odd: it is assigned the generic object datatype. Why is this? This column contains data that are not numeric. When encountering such a column, Pandas reads the column contents literally and stores them as strings. However, due to (1) the way strings are represented in Python, and (2) the fact that a string can be arbitrarily long, the strings are stored in a NumPy array of generic objects, which in this case are actually Python string objects. It is worth noting that there are only two possible string values in the class column: mavric and Phantom. Later on, we will learn how to represent this kind of data efficiently using the category datatype.
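
As a quick preview, the conversion to the category datatype is a one-liner; the resulting column stores each label as a small integer code plus a shared list of categories (the exact printout depends on your Pandas version):

>>> df_drones['class'].astype('category').dtype
CategoricalDtype(categories=['Phantom', 'mavric'], ordered=False)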

Specifying custom schema

The definition of the DataFrame in terms of the column names and datatypes shown above is often called a schema in database terminology. In a later example, we will use a custom schema to load our dataset and assign a precise datatype to each column. Since pandas.read_csv expects the schema as a mapping of column names to datatypes, an appropriate schema for the df_drones dataset above would be:

dtype_schema = {'Subcarrier-Spacing': 'float64',
                'Symbol-time': 'float64',
                'fft-length': 'int32',
                'cp-length': 'int32',
                'Signal-Power': 'float64',
                'Detection': 'int32',
                'class': str}

In this example, we reduced the type of fft-length and cp-length to 32-bit integers, which can represent whole numbers in the range of approximately ±2 billion. (By looking at the range of the values in each column, shown in the output of df_drones.describe() above, it would even be possible to reduce them further to 16-bit integers [-32768 .. 32767] should the situation necessitate it, e.g. for extremely large datasets; but we will not do that here.) This schema would then be passed to the pandas.read_csv function call, like this:

df_drones = pandas.read_csv(os.path.join(DATA_HOME, 'machinelearningdata.csv'),
                            dtype=dtype_schema)
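
One motivation for the smaller integer types is memory. As a rough check, the DataFrame’s memory_usage method reports each column’s memory consumption in bytes (the exact figures will depend on your Pandas version and the schema used):

>>> df_drones.memory_usage(deep=True)         # bytes used per column
>>> df_drones.memory_usage(deep=True).sum()   # total bytes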

Data Cleaning

At this point, one typically has to check the dataset for possible defects, missing values, and anomalies. Following the example above, certain record(s) in the dataset may be missing the Signal-Power value(s) or the class label, in which case Pandas will mark them as NaN (not a number) or a None object. One then has to decide what to do with such defective data. If we have a lot of data and not too many defective records (rows), it may be reasonable to simply drop the defective records. However, we also have to guard against biasing the later analyses; this can happen, for example, when the dropped records cover a segment of the spectrum (or population) that is not represented elsewhere in the dataset because, for some reason, it was not collected perfectly.

For this “drone” case, we will not need to do any cleaning steps, as the dataset has been prepared carefully by the researchers.

Formatting the Data

The dataset was loaded into Python variable df_drones, which is a DataFrame. This df_drones contains both the features and labels from all the data points. Strictly speaking, a DataFrame is not a matrix, because each column can be of a different datatype. We need to reformat the data, so that the feature matrix is separated from the label vector.

Extracting the Feature Matrix

To extract the feature matrix, let’s perform the following steps:

df_copy = df_drones.copy()
del df_copy['class']
# Extract all the df_copy values as a double-precision (float64) array
all_FM = df_copy.astype('float64').values

You can check the type of all_FM, then view its contents:

>>> type(all_FM)
numpy.ndarray

>>> all_FM
array([[9.76562500e+01, 1.02400000e-02, 1.02400000e+03, 3.20000000e+01,
        4.63655823e+02, 1.00000000e+00],
       [3.90625000e+02, 2.56000000e-03, 2.56000000e+02, 3.20000000e+01,
        4.66324219e+02, 1.00000000e+00],
       [9.76562500e+01, 1.02400000e-02, 1.02400000e+03, 3.20000000e+01,
        4.62016113e+02, 1.00000000e+00],
       ...,
       [7.81250000e+02, 1.28000000e-03, 1.28000000e+02, 3.20000000e+01,
        4.54212402e+02, 1.00000000e+00],
       [7.81250000e+02, 1.28000000e-03, 1.28000000e+02, 3.20000000e+01,
        2.70038260e+01, 1.00000000e+00],
       [3.12500000e+03, 3.20000000e-04, 3.20000000e+01, 3.20000000e+01,
        1.90963800e+00, 0.00000000e+00]])

Variable all_FM has the numpy.ndarray datatype, which is the standard array datatype in NumPy. Please convince yourself that this array contains exactly the same data as the original CSV file, except for the labels.
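
As a side note, the same feature matrix can be obtained in a single expression with the DataFrame’s drop method, which returns a copy with the named column removed (this is equivalent to the copy-and-delete steps above, not an additional requirement):

all_FM = df_drones.drop(columns=['class']).astype('float64').values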

Extracting the Labels

What about the labels? They come as strings, but for machine learning, we need them to be in a numerical format. Here we provide a general recipe to convert an array of strings to an array of integers, where there is a one-to-one mapping between the strings and the integer values. Please copy and paste this function into your ipython session:

def categorical_to_numerics(a, cats=None):
    if cats is not None:
        # assume that cats is a valid list of categories
        pass
    # Otherwise, extract the categories; hopefully one of these
    # ways gets it:
    elif isinstance(a, pandas.Series):
        if isinstance(a.dtypes, pandas.api.types.CategoricalDtype):
            cats = a.dtypes.categories
        else:
            # general approach for array of strings
            cats = sorted(a.unique())
    elif isinstance(a, pandas.Categorical):
        cats = a.categories
    else:
        # general iterable case
        cats = sorted(pandas.Series(a).unique())

    # mapping: category -> numerics
    cat_map = dict((c, i) for (i,c) in enumerate(cats))
    # mapping: numerics -> category
    cat_revmap = list(cats)

    return (numpy.array([cat_map[c] for c in a]), cat_revmap)

Copying and Pasting to ipython

Sometimes ipython gives you trouble in copying and pasting a code snippet like this. There are several ways to get around this. In this lesson, we will use the text editor (nano) to paste the snippet into a text file. Let’s create a new file called functions.py to contain this function as well as other functions we will create later on. From the ipython prompt, type:

>>> !nano functions.py

Remember that ipython also doubles as a “UNIX” shell! This will open an empty file (unless functions.py already exists). Now paste the code snippet:

  • Shift+Ins on Windows (Putty and MobaXTerm)
  • Command+V on Mac
  • Ctrl+Shift+V on Linux (gnome-terminal or a similar terminal application)

Save the file (Ctrl-X, reply Y to the question).

After that, you will need to load the function into ipython. This is one way to do it:

>>> exec(open("functions.py").read())
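
Before applying the function to our dataset, it is a good idea to sanity-check it on a small, made-up list of labels (purely illustrative values):

>>> codes, cats = categorical_to_numerics(['b', 'a', 'b'])
>>> codes
array([1, 0, 1])
>>> cats
['a', 'b']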

Now we apply this function to the class column to convert the labels into a machine-learning friendly format:

all_L, labels_L = categorical_to_numerics(df_drones['class'])

The categorical_to_numerics function returns two values; therefore, there are two variables to receive these values on the left-hand side of the = operator. The all_L variable will contain the array of integers (0s and 1s) corresponding to the labels mavric and Phantom.

Checking out the conversion results

Print the values of all_L and labels_L to the terminal. What are they? Do you see how the 0 and 1 values of all_L relate to the labels stored in labels_L?

Solution

>>> all_L
array([1, 1, 1, ..., 0, 0, 0])

>>> labels_L
['Phantom', 'mavric']

# Compare against the original column values:

>>> df_drones['class']
0        mavric
1        mavric
2        mavric
3        mavric
4        mavric
         ...
2995    Phantom
2996    Phantom
2997    Phantom
2998    Phantom
2999    Phantom
Name: class, Length: 3000, dtype: object

Remember that arrays and lists have zero-based indices in Python; therefore:

>>> labels_L[0]
'Phantom'

>>> labels_L[1]
'mavric'

Comparing the indices here with the values in all_L confirms that we successfully converted the list of labels to an array of integers.

A keen reader may wonder why the labels begin with Phantom, although the function clearly sorts the labels in string order. The reason lies in the ASCII encoding of letters: A–Z appears earlier than a–z in the character code list; therefore, uppercase letters sort lexicographically before lowercase letters.
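You can verify this ordering directly in Python:

>>> sorted(['mavric', 'Phantom'])
['Phantom', 'mavric']
>>> ord('P'), ord('m')    # ASCII code points: 'P' comes before 'm'
(80, 109)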

Splitting the Data: Training, Validation, Testing Sets

After the dataset is cleaned, we will need to partition the data into two or three sets for training, cross-validation, and testing purposes. Recall that machine learning is an iterative process:

* MACHINE LEARNING LIFECYCLE *
|
|    (1)           (2)           accuracy                (3)
|  Training --> Validation --> good enough?  (YES) --> Testing --> Deployment
|     ^                           (NO)
|     |          adjust            |
|     +----- hyperparameters <-----+
|
*

Let us explain these sets now. The training set is used to fit the model’s parameters; the validation set (also called the “dev” set) is used to measure the accuracy of the model-in-progress and to guide the adjustment of the hyperparameters; the test set is held out until the very end to give an unbiased assessment of the final model.

There are some slight variations in practice: sometimes the test set is skipped altogether (as we will do below), and the validation set also goes by other names, such as the “dev” set.

Selecting the Right Proportion

A usual recommendation for the ratio of (training : validation : test) set sizes is 60% : 20% : 20%. Or without the test set, it can be (training : validation) = (80% : 20%) or (70% : 30%).

The following Coursera video by Andrew Ng is helpful for understanding good practice in choosing the ratio of training/validation/testing data: Train / Dev / Test sets. The rule of thumb is: the bigger the entire dataset, the larger the fraction of data you can dedicate to training.

Back to the Drones dataset…

We will use the Scikit-learn function named train_test_split to perform this data split. For simplicity we will skip the “test” set for now.

from sklearn.model_selection import train_test_split
train_FM, dev_FM, train_L, dev_L = train_test_split(all_FM, all_L, test_size=0.2)

We feed two arrays to the train_test_split function, which will split the data into two sets. The split is performed by randomly shuffling the rows of the original arrays, then taking a certain fraction for the first (train) set and the remaining fraction for the second (test) set. Note that the arrays are shuffled in a synchronized way so that the labels do not get mixed up. The fraction is determined by either the train_size or test_size argument. In this example, we set the test_size argument to 0.2, meaning that 20% of the data will go to the test set while 80% will go to the train set. Note the order of the output variables above; they must not be flipped! For the sake of consistency with our convention for the set names (train, validation/dev, test), we use the “train” and “dev” prefixes to mark the training and validation (dev) datasets.

It is always wise to check at least the sizes of the returned arrays:

>>> train_FM.shape
(2400, 6)

>>> train_L.shape
(2400,)

>>> dev_FM.shape
(600, 6)

>>> dev_L.shape
(600,)
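
One practical note: train_test_split shuffles randomly, so every run yields a different split. If you need a reproducible split, for example to compare experiments, pass a fixed random_state (the value 42 below is arbitrary):

train_FM, dev_FM, train_L, dev_L = train_test_split(all_FM, all_L, test_size=0.2,
                                                    random_state=42)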

Splitting the data three way

How would you split the data three-way in the 60:20:20 ratio?

Hint: Use the train_test_split twice.

Solution

td_FM, test_FM, td_L, test_L = train_test_split(all_FM, all_L, test_size=0.2)
train_FM, dev_FM, train_L, dev_L = train_test_split(td_FM, td_L, test_size=0.25)  # why 0.25?

Why 0.25? The first split reserves 20% of the data for the test set. The second split then takes 25% of the remaining 80%, and 0.25 × 80% = 20% of the original data, yielding the desired 60:20:20 ratio.

Key Points

  • Key steps preceding machine learning: Data loading, exploration, cleaning, and input preparation