Case Study 2: Drone RF Signal Classification
Overview
Teaching: 0 min
Exercises: 0 min
Questions
What steps must be taken on a given dataset before performing machine learning?
Objectives
Understanding the key preparation steps leading to machine learning.
Introducing the Problem
Drones are controlled wirelessly by a remote controller, which communicates with the drone via radio frequency (RF) signals. By intercepting these signals over the air, one can detect the presence of drones nearby. Furthermore, by building a machine learning model that learns the characteristics of the signals emitted by each of these drones, we can distinguish the different drones that are actively emitting RF signals. To reiterate, the goals of machine learning here are that:
1) we will be able to detect the presence of one or more drones;
2) we will be able to distinguish which drones are currently active (emitting RF).
An analogy to this problem is asking a computer to recognize the different musical instruments playing in an orchestra recording: the violin, piano, clarinet, bass, drums… How can this be done? The computer must be trained to recognize the audio characteristics of the violin, piano, clarinet, and so on. Each training datum, which is a snippet of audio recording (say, 100 milliseconds), comes with an associated label, be it “violin”, “piano”, “clarinet”… However, in the drone case, instead of feeding in a series of RF packets, the researchers have preprocessed the RF packets into a set of features, which we explain below.
What kind of machine learning?
Does the drone recognition problem fall under the category of supervised learning or unsupervised learning?
Under either category, what is the learning subtype of this problem? (I.e., regression, classification, clustering, or dimensional reduction.)
Answer
This is supervised learning. The second goal of the learning, “distinguish which drones”, gives a clear hint that we are dealing with a classification problem.
In this episode, we will cover how one prepares data for machine learning. In the next episode, we will perform the actual machine learning on the data. The preparation steps include:
- Loading the data into computer memory
- Data exploration
- Data cleaning
- Adapting the data to the correct format for machine learning input
- Splitting the data into training and test data
Data Loading and Exploration
When we first obtain new data, it is always good practice to take a look at the data before doing anything else. The data file is located at the following path on Turing:
/scratch-lustre/DeapSECURE/module03/Drones/data/machinelearningdata.csv
It is in the CSV format; here are the first few lines of the data:
$ cd /scratch-lustre/DeapSECURE/module03/Drones/data
$ head machinelearningdata.csv
Subcarrier-Spacing,Symbol-time,fft-length,cp-length,Signal-Power,Detection,class
97.65625,0.01024,1024,32,463.655823,1,mavric
390.625,0.00256,256,32,466.324219,1,mavric
97.65625,0.01024,1024,32,462.016113,1,mavric
195.3125,0.00512,512,32,470.765259,1,mavric
97.65625,0.01024,1024,32,470.765259,1,mavric
195.3125,0.00512,512,32,470.765259,1,mavric
195.3125,0.00512,512,32,467.561218,1,mavric
97.65625,0.01024,1024,32,464.200226,1,mavric
1562.5,0.00064,64,4,449.435547,1,mavric
$ tail machinelearningdata.csv
390.625,0.00256,256,16,442.858521,1,Phantom
1562.5,0.00064,64,32,455.123779,1,Phantom
1562.5,0.00064,64,4,447.67157,1,Phantom
390.625,0.00256,256,16,284.808777,1,Phantom
1562.5,0.00064,64,16,448.591064,1,Phantom
1562.5,0.00064,64,8,441.025574,1,Phantom
781.25,0.00128,128,32,451.479004,1,Phantom
781.25,0.00128,128,32,454.212402,1,Phantom
781.25,0.00128,128,32,27.003826,1,Phantom
3125,0.00032,32,32,1.909638,0,Phantom
UNIX head and tail commands
The UNIX head and tail commands show the first or last few lines of a text file, respectively. By default, 10 lines are shown. If you want to show more or fewer lines, use the -n N option, where N is the number of lines you want to show.
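For example, to display only the header line plus the first two records of our data file (output reproduced from the listing above):
$ head -n 3 machinelearningdata.csv
Subcarrier-Spacing,Symbol-time,fft-length,cp-length,Signal-Power,Detection,class
97.65625,0.01024,1024,32,463.655823,1,mavric
390.625,0.00256,256,32,466.324219,1,mavric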
A better dataset?
There is an alternative dataset that has one extra feature (“standard deviation”). It is located in the same directory as above, under the name machinelearning_stdev.csv. Some participants are encouraged to take this alternative data file and compare the end result with the original dataset.
The first line contains the column names. From this we know that there are seven columns in the CSV file. The dataset is presented in a simple tabular format, one line per record.
Features and output
Earlier we talked about the model (the function f), the features as the model’s inputs (X), and the output label (y). We have an as-yet-undetermined model, but we can already identify the features and the labels in this dataset.
Consider the first record in the dataset above (the second line of the head command output):
97.65625,0.01024,1024,32,463.655823,1,mavric
The CSV header clearly tells us that the last column (named class) is the label that identifies which drone the signal belongs to (in this one case, it is the “mavric” drone). The first six columns are the input parameters that were produced by the RF signal capture and pre-processing stage. These are the features (inputs). One “data point” (datum) consists of a set of features (the six numbers above) plus the label. Usually, the features will be presented as a vector of values to the machine learning algorithm. Thus the X for this datum can be represented as a Python list (array):
[ 97.65625, 0.01024, 1024, 32, 463.655823, 1 ]
In practice, a machine learning algorithm typically takes many data points at once; in that case, we speak of a feature matrix as the input. For our drone dataset, the feature matrix will contain the contents of CSV columns 1–6:
/ \
| 97.65625, 0.01024, 1024, 32, 463.655823, 1 |
| 390.625, 0.00256, 256, 32, 466.324219, 1 |
| 97.65625, 0.01024, 1024, 32, 462.016113, 1 |
| 195.3125, 0.00512, 512, 32, 470.765259, 1 |
| 97.65625, 0.01024, 1024, 32, 470.765259, 1 |
{X} = | ... |
|1562.5, 0.00064, 64, 8, 441.025574, 1 |
| 781.25, 0.00128, 128, 32, 451.479004, 1 |
| 781.25, 0.00128, 128, 32, 454.212402, 1 |
| 781.25, 0.00128, 128, 32, 27.003826, 1 |
|3125, 0.00032, 32, 32, 1.909638, 0 |
\ /
The labels will be laid in a column vector format like this:
/ \
| mavric |
| mavric |
| mavric |
| mavric |
| mavric |
{y} = | ... |
| Phantom |
| Phantom |
| Phantom |
| Phantom |
| Phantom |
\ /
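To make this concrete, here is a minimal sketch (the variable names X_small and y_small are ours, purely for illustration) that writes down the first three rows of the feature matrix and label vector by hand using NumPy; later in this episode we will build the full 3000-row versions directly from the DataFrame:
>>> import numpy
>>> X_small = numpy.array([
...     [97.65625, 0.01024, 1024, 32, 463.655823, 1],
...     [390.625,  0.00256,  256, 32, 466.324219, 1],
...     [97.65625, 0.01024, 1024, 32, 462.016113, 1],
... ])
>>> y_small = ['mavric', 'mavric', 'mavric']
>>> X_small.shape
(3, 6)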
Here are the meanings of the columns in the data. The first four features are related to the OFDM parameters (which can be thought of as the “characteristics” of the RF signal):
- Subcarrier-Spacing: frequency spacing of the subcarriers of the OFDM symbols
- Symbol-time: defined as the inverse of the subcarrier spacing, it gives the time duration of an OFDM symbol
- fft-length: the number of bins used for the Fast Fourier Transform (which converts the digital signal between the time and frequency domains)
- cp-length: the length of the cyclic prefix (used to eliminate inter-carrier and inter-symbol interference)
The next two parameters are related to the power characteristics of the signal:
- Signal-Power: the average magnitude of the input signal in the frequency domain
- Detection: indicator of signal availability; i.e., if the signal is above a certain threshold, the value 1 is indicated.
(A second data file, machinelearning_stdev.csv, is also provided, which has a third energy-related feature, the standard deviation.)
About OFDM
OFDM (Orthogonal Frequency Division Multiplexing) is a nifty way to encode a digital signal into a radio signal. It allows a limited band of frequencies to transport a large amount of information at once. Interested readers are referred to introductory articles on OFDM to learn more.
Loading the Data into Python
We are using Pandas to read the CSV-formatted input data into a DataFrame
variable called df_drones
:
>>> import pandas
>>> import numpy
>>> import os
>>> # DATA_HOME points to the data directory given earlier in this episode
>>> DATA_HOME = '/scratch-lustre/DeapSECURE/module03/Drones/data'
>>> df_drones = pandas.read_csv(os.path.join(DATA_HOME, 'machinelearningdata.csv'))
Once loaded, we can print the contents of df_drones to the terminal; try this:
>>> df_drones
What comes out? Can you make sense of the output? How does it compare to the head and tail outputs earlier?
As you can see, Pandas intentionally limits the amount of data
it prints to the terminal.
In Python, it is easy to check the type of a variable using the type function:
>>> type(df_drones)
pandas.core.frame.DataFrame
A Pandas DataFrame also has head and tail methods, which by default show only the first or last five rows:
>>> df_drones.head(10)
Subcarrier-Spacing Symbol-time fft-length cp-length Signal-Power Detection class
0 97.65625 0.01024 1024 32 463.655823 1 mavric
1 390.62500 0.00256 256 32 466.324219 1 mavric
2 97.65625 0.01024 1024 32 462.016113 1 mavric
3 195.31250 0.00512 512 32 470.765259 1 mavric
4 97.65625 0.01024 1024 32 470.765259 1 mavric
5 195.31250 0.00512 512 32 470.765259 1 mavric
6 195.31250 0.00512 512 32 467.561218 1 mavric
7 97.65625 0.01024 1024 32 464.200226 1 mavric
8 1562.50000 0.00064 64 4 449.435547 1 mavric
9 97.65625 0.01024 1024 32 466.614563 1 mavric
>>> df_drones.tail(10)
Subcarrier-Spacing Symbol-time fft-length cp-length Signal-Power Detection class
2990 390.625 0.00256 256 16 442.858521 1 Phantom
2991 1562.500 0.00064 64 32 455.123779 1 Phantom
2992 1562.500 0.00064 64 4 447.671570 1 Phantom
2993 390.625 0.00256 256 16 284.808777 1 Phantom
2994 1562.500 0.00064 64 16 448.591064 1 Phantom
2995 1562.500 0.00064 64 8 441.025574 1 Phantom
2996 781.250 0.00128 128 32 451.479004 1 Phantom
2997 781.250 0.00128 128 32 454.212402 1 Phantom
2998 781.250 0.00128 128 32 27.003826 1 Phantom
2999 3125.000 0.00032 32 32 1.909638 0 Phantom
Using these methods, you can verify whether Pandas has loaded our dataset correctly. Incidentally, the tail output also shows that there are 3000 records in the DataFrame. Each record in a Pandas DataFrame is “numbered” with an index, which by default runs over the integers 0, 1, 2, … (through the number of records minus one).
DataFrame’s head and tail methods return another DataFrame
Each of the head and tail methods returns a new DataFrame object. The contents of the new DataFrame are what was printed out by the head and tail calls above. Because we are using an interactive Python session, the contents of the new DataFrame are printed to the terminal.
Another way to make sense of the dataset is to gather statistics on the numerical data, using the DataFrame’s describe method. This will tell us the magnitude of the values in each column:
>>> df_drones.describe()
Subcarrier-Spacing Symbol-time fft-length cp-length Signal-Power Detection
count 3000.000000 3000.000000 3000.000000 3000.000000 3000.000000 3000.000000
mean 921.061198 0.004095 409.530667 27.376000 351.274618 0.948333
std 1398.364566 0.003463 346.253810 8.473167 140.984973 0.221390
min 97.656250 0.000160 16.000000 4.000000 0.926364 0.000000
25% 195.312500 0.001280 128.000000 32.000000 282.561363 1.000000
50% 390.625000 0.002560 256.000000 32.000000 420.977829 1.000000
75% 781.250000 0.005120 512.000000 32.000000 454.217132 1.000000
max 6250.000000 0.010240 1024.000000 32.000000 499.303192 1.000000
In this summary, the class column is dropped because it does not contain numerical values.
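Although describe skips the class column, we can still summarize it. A handy one-liner (a small addition to the lesson’s commands) counts how many records carry each label:
>>> df_drones['class'].value_counts()
This prints one line per distinct label (mavric and Phantom) together with the number of records in each class; the exact counts are left for you to discover.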
Datatypes
Do you notice that some of the columns contain real numbers and some contain integers? Pandas detects and assigns the datatypes of the columns automatically. Let’s look at the datatypes stored in the df_drones we just loaded:
>>> df_drones.dtypes
Subcarrier-Spacing float64
Symbol-time float64
fft-length int64
cp-length int64
Signal-Power float64
Detection int64
class object
dtype: object
The Subcarrier-Spacing, Symbol-time, and Signal-Power columns have the float64 datatype (the IEEE standard double-precision floating-point type), while fft-length, cp-length, and Detection have the 64-bit integer (int64) datatype.
Under the hood, both float64 and int64 columns are stored efficiently in Pandas as NumPy arrays with the appropriate datatypes. The class column is the odd one out: it is assigned the generic object datatype. Why is this? This column contains data that is not numeric. When encountering such a column, Pandas reads the column contents literally and stores them as strings. However, due to (1) the way strings are represented in Python, and (2) the fact that a string can be arbitrarily long, the strings are stored in a NumPy array of generic object references, where each object in this case is a Python string. It is worth noting that there are only two possible string values in the class column: mavric and Phantom. Later on we will learn how to represent this kind of data efficiently using the category datatype.
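As a quick preview (a minimal sketch, not part of the original recipe; we assign to a new variable so df_drones itself is left untouched), the conversion is a one-liner with astype:
>>> class_cat = df_drones['class'].astype('category')
>>> class_cat.cat.categories   # the two distinct labels, stored only once: ['Phantom', 'mavric']
>>> class_cat.cat.codes        # the per-record integer codes that refer to those labels
The category dtype stores each distinct string once and represents the column internally as small integer codes.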
Specifying a custom schema
The definition of the DataFrame in terms of the column names and datatypes shown above is often called a schema in database terminology. In a later example, we will use a custom schema to load our dataset and assign a precise datatype for each column. For example, an appropriate schema for the df_drones dataset above would be:

numpy.dtype([('Subcarrier-Spacing', 'float64'), ('Symbol-time', 'float64'), ('fft-length', 'int32'), ('cp-length', 'int32'), ('Signal-Power', 'float64'), ('Detection', 'int32'), ('class', numpy.str_)])

In this example, we reduced the types of fft-length and cp-length to 32-bit integers, which can represent whole numbers in the range of approximately +/- 2 billion. (By looking at the range of the values in each column, shown in the output of df_drones.describe() above, it would even be possible to reduce them to 16-bit integers [-32768 .. 32767] should the situation necessitate it, e.g. for extremely large datasets; but we will not do that here.) To apply the schema when reading the file, pass the column-to-datatype mapping to the dtype argument of the pandas.read_csv call, like this:

df_drones = pandas.read_csv(os.path.join(DATA_HOME, 'machinelearningdata.csv'), dtype={'Subcarrier-Spacing': 'float64', ...., 'class': str})
Data Cleaning
At this point, one typically has to check the dataset for possible defects,
missing values, and anomalies.
Following the example above, certain records in the dataset might be missing the Signal-Power value or the class label, in which case Pandas will mark them as NaN (not a number) or a None object.
One then has to decide what to do with defective data like this.
If we have a lot of data and not too many defective records (rows), it may be
reasonable to simply drop the defective records.
However, we also have to guard against biasing the later analyses; this can happen, for example, when the dropped records cover a segment of the spectrum (or population) that is not represented elsewhere in the dataset because, for some reason, it was not collected properly.
For this “drone” case, we will not need to do any cleaning step, as it has been prepared carefully by the researchers.
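Even so, it costs nothing to verify that claim. A minimal check, assuming df_drones as loaded above, counts missing values per column; a commonly used remedy is shown as a comment in case defects ever do appear:
>>> df_drones.isna().sum()          # per-column count of missing values; expected to be all zeros here
>>> # df_clean = df_drones.dropna() # one possible remedy: drop any rows that contain missing values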
Formatting the Data
The dataset was loaded into the Python variable df_drones, which is a DataFrame. This df_drones contains both the features and the labels from all the data points. Strictly speaking, a DataFrame is not a matrix, because each column can be of a different datatype. We need to reformat the data so that the feature matrix is separated from the label vector.
Extracting the Feature Matrix
To extract the feature matrix, let’s perform the following steps:
- Create a copy of the DataFrame and delete the class column
- Convert the remaining columns to a matrix (of real numbers)
df_copy = df_drones.copy()
del df_copy['class']
# Extract all the df_copy values as a double-precision (float64) array
all_FM = df_copy.astype('float64').values
You can check the type of all_FM, then view its contents:
>>> type(all_FM)
numpy.ndarray
>>> all_FM
array([[9.76562500e+01, 1.02400000e-02, 1.02400000e+03, 3.20000000e+01,
4.63655823e+02, 1.00000000e+00],
[3.90625000e+02, 2.56000000e-03, 2.56000000e+02, 3.20000000e+01,
4.66324219e+02, 1.00000000e+00],
[9.76562500e+01, 1.02400000e-02, 1.02400000e+03, 3.20000000e+01,
4.62016113e+02, 1.00000000e+00],
...,
[7.81250000e+02, 1.28000000e-03, 1.28000000e+02, 3.20000000e+01,
4.54212402e+02, 1.00000000e+00],
[7.81250000e+02, 1.28000000e-03, 1.28000000e+02, 3.20000000e+01,
2.70038260e+01, 1.00000000e+00],
[3.12500000e+03, 3.20000000e-04, 3.20000000e+01, 3.20000000e+01,
1.90963800e+00, 0.00000000e+00]])
The variable all_FM has the numpy.ndarray datatype, which is the standard array datatype in NumPy. Please convince yourself that this array contains exactly the same data as the original CSV file, except for the labels.
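As one more sanity check (a small addition beyond the lesson text), the shape and datatype of the array should match the 3000 records and six feature columns described earlier:
>>> all_FM.shape
(3000, 6)
>>> all_FM.dtype
dtype('float64')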
Extracting the Labels
What about the labels? They come as strings, but for machine learning, we need them to be in numerical format. Here we provide a general recipe to convert an array of strings to an array of integers, where there is a one-to-one mapping between the strings and the integer values. Please copy and paste this function into your ipython session:
def categorical_to_numerics(a, cats=None):
if cats is not None:
# assume that cats is a valid list of categories
pass
# Otherwise, extract the categories: hopefully one of these
# ways gets it:
elif isinstance(a, pandas.Series):
if isinstance(a.dtypes, pandas.api.types.CategoricalDtype):
cats = a.dtypes.categories
else:
# general approach for array of strings
cats = sorted(a.unique())
elif isinstance(a, pandas.Categorical):
cats = a.categories
else:
# general iterable case
cats = sorted(pandas.Series(a).unique())
# mapping: category -> numerics
cat_map = dict((c, i) for (i,c) in enumerate(cats))
# mapping: numerics -> category
cat_revmap = list(cats)
return (numpy.array([cat_map[c] for c in a]), cat_revmap)
Copying and pasting to ipython
Sometimes ipython gives you trouble when copying and pasting a code snippet like this. There are several ways to get around it. In this lesson, we will use the text editor (nano) to paste the snippet into a text file. Let’s create a new file called functions.py to contain this function as well as other functions we will create later on. From the ipython prompt, type:

>>> !nano functions.py

Remember that ipython also doubles as a “UNIX” shell! This will open an empty file (unless functions.py existed before). Now paste the code snippet:
- Shift+Ins on Windows (PuTTY and MobaXterm)
- Command+V on Mac
- Ctrl+Shift+V on Linux (gnome-terminal or a similar terminal application)

Save the file (Ctrl-X, reply Y to the question). After that, you will need to load the function into ipython. This is the way to do it:

>>> exec(open("functions.py").read())
Now we apply this function to the class column to convert the labels into a machine-learning-friendly format:
all_L, labels_L = categorical_to_numerics(df_drones['class'])
The categorical_to_numerics function returns two values; therefore there are two variables to receive these values on the left-hand side of the = operator. The all_L variable will contain the array of integers (0s and 1s) corresponding to the labels mavric and Phantom.
Checking out the conversion results
Print the values of all_L and labels_L to the terminal. What are they? Do you see how the 0 and 1 values of all_L relate to the labels stored in labels_L?
Solution

>>> all_L
array([1, 1, 1, ..., 0, 0, 0])
>>> labels_L
['Phantom', 'mavric']
# Compare against the original column values:
>>> df_drones['class']
0        mavric
1        mavric
2        mavric
3        mavric
4        mavric
         ...
2995    Phantom
2996    Phantom
2997    Phantom
2998    Phantom
2999    Phantom
Name: class, Length: 3000, dtype: object
Remember that arrays and lists have zero-based indices in Python, therefore:
>>> labels_L[0]
'Phantom'
>>> labels_L[1]
'mavric'
Comparing the indices here with the values in all_L confirms that we successfully converted the list of labels to an array of integers. A keen reader may wonder why the labels begin with Phantom, even though the function clearly sorts the labels in string order. The reason lies in the ASCII encoding of letters: A-Z appear earlier than a-z in the character code list, therefore uppercase letters sort lexicographically before lowercase letters.
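As an aside, and not part of the lesson’s own recipe: scikit-learn provides a LabelEncoder class that performs essentially the same string-to-integer conversion. A minimal sketch:
>>> from sklearn.preprocessing import LabelEncoder
>>> le = LabelEncoder()
>>> all_L_alt = le.fit_transform(df_drones['class'])   # array of integer codes, like all_L
>>> le.classes_                                         # the distinct labels, sorted
Because LabelEncoder also sorts the distinct labels, it yields the same Phantom = 0, mavric = 1 mapping as our categorical_to_numerics function.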
Splitting the Data: Training, Validation, Testing Sets
After the dataset is cleaned, we will need to partition the data into two or three sets for training, cross-validation, and testing purposes. Recall that machine learning is an iterative process:
* MACHINE LEARNING LIFECYCLE *
|
| (1) (2) accuracy (3)
| Training --> Validation --> good enough? (YES) --> Testing --> Deployment
| ^ (NO)
| | adjust |
| +----- hyperparameters <-----+
|
*
We will explain these groups now.
- Training set: This is the set of data that we will actually use to train the machine learning model. In the training process, we systematically adjust the model parameters so that its predicted outcomes, y_pred, match the expected y as accurately as possible, without overfitting. But how do we know that we don’t overfit? That is where the validation set comes in.
- Validation (“dev”) set: Once the model is trained using the training set, the model’s performance needs to be validated using a separate subset of the data. This guards against overfitting the model in the training phase. The goal of validation is to ensure that the model’s accuracy does not degrade when it deals with a new set of data that it has not seen before. The accuracy judged using the validation set is an unbiased measure of the model’s performance.

Usually, training and validation will need to be repeated until a satisfactory performance is measured (using the validation data). In this iterative cycle, we adjust the hyperparameters of the model and re-train the model with the new set of hyperparameters.

- Test set: This dataset is reserved to provide the final accuracy measure for the machine learning model that has passed through the training-validation cycle. Why do we need this last dataset and last accuracy check? Because the validation set has been used repeatedly to iterate over the search for the best hyperparameters, there is still a danger of the model overfitting (this time, against the validation set). Therefore, yet another unbiased accuracy estimate is needed to judge the accuracy of the final model.
There are some slight variations in practice:
- Some people only split the data into “training” and “validation” sets. This may be OK if they are not concerned with overfitting at the end. Models that are simple enough (i.e. having only very few hyperparameters) may be well optimized by the end of the training-validation cycle. In other words, there are not that many extra degrees of freedom (in the hyperparameter space) to cause a risk of overfitting.
- Some people do not adjust the hyperparameters at all, and therefore have no need for the “validation” set. Some machine learning tutorials take this approach, where you do not see the validation step and re-training. However, for real-world research and production machine learning, it is not recommended to skip hyperparameter optimization.
Selecting the Right Proportion
A usual recommendation for the ratio of (training : validation : test) set sizes is 60% : 20% : 20%. Or without the test set, it can be (training : validation) = (80% : 20%) or (70% : 30%).
The following Coursera video by Andrew Ng is helpful to understand good practice for choosing the best ratio for training/validation/testing data: Train / Dev / Test sets. The rule of thumb is: The bigger the size of the entire data set, the more data you can dedicate for the training purposes.
- For smallish data (e.g. ~1000 records or fewer), the 60:20:20 or 70:30 ratio (with or without a test set) is reasonable.
- For biggish data (e.g. ~1 million records), the ratio can change to something like 98:1:1.
- For extremely big data, the ratio can even be skewed to something like 99.5:0.4:0.1.
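For our 3000-record drone dataset, the 60:20:20 recommendation works out to 1800 training, 600 validation, and 600 test records. The quick arithmetic below is only for illustration (round avoids floating-point artifacts):
>>> n_records = 3000
>>> [round(n_records * frac) for frac in (0.6, 0.2, 0.2)]
[1800, 600, 600]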
Back to the Drones dataset…
We will use the Scikit-learn function named train_test_split to perform this data split. For simplicity we will skip the “test” set for now.
from sklearn.model_selection import train_test_split
train_FM, dev_FM, train_L, dev_L = train_test_split(all_FM, all_L, test_size=0.2)
We feed two arrays to the train_test_split function, which will split the data into two sets. The split is performed by randomly shuffling the rows of the original arrays, then taking a certain fraction for the first (train) set and the remaining fraction for the second (test) set. Note that the arrays are shuffled in a synchronized way so that the labels do not get mixed up. The fraction is determined by either the train_size or the test_size argument. In this example, we set the test_size argument to 0.2, meaning that 20% of the data will go to the test set while 80% will go to the train set. Note the order of the output variables above; they must not be flipped! For the sake of consistency with our convention for the set names (train, validation/dev, test), we use the “train” and “dev” prefixes to mark the training and validation (dev) datasets.
It is always wise to check at least the sizes of the returned arrays:
>>> train_FM.shape
(2400, 6)
>>> train_L.shape
(2400,)
>>> dev_FM.shape
(600, 6)
>>> dev_L.shape
(600,)
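One practical tip that goes beyond the lesson text: train_test_split shuffles the rows randomly, so each run produces a different split. Passing the optional random_state argument (a standard scikit-learn parameter) makes the split reproducible, which helps when comparing experiments:
>>> # The value 42 is arbitrary; any fixed integer gives a repeatable split
>>> train_FM, dev_FM, train_L, dev_L = train_test_split(all_FM, all_L, test_size=0.2, random_state=42)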
Splitting the data three ways
How would you split the data three ways in the 60:20:20 ratio?
Hint: Use the train_test_split function twice.
Solution
td_FM, test_FM, td_L, test_L = train_test_split(all_FM, all_L, test_size=0.2)
train_FM, dev_FM, train_L, dev_L = train_test_split(td_FM, td_L, test_size=0.25)  # why 0.25?
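Why 0.25 in the second call? The first split sets aside 20% of the 3000 records (600) for the test set, leaving td_FM with the remaining 80% (2400 records). Taking 25% of those 2400 records gives 600 records for the dev set, i.e. 20% of the original data, so the final proportions are 1800 : 600 : 600, or 60:20:20.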
Key Points
Key steps preceding machine learning: Data loading, exploration, cleaning, and input preparation