This lesson is still being designed and assembled (Pre-Alpha version)

Machine Learning for Drone Signal Classification

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • How would one use Scikit-Learn to perform machine learning?

Objectives
  • Understand the machine learning process in Scikit-Learn.

Using the Scikit-Learn Machine Learning API

In the previous episode, we did a lot of legwork to prepare the drone data for machine learning. Now we come to the exciting part: actually doing machine learning.

Scikit-Learn has a vast library of machine learning methods. These methods have been implemented and tested by many thousands of users (or more), so we do not need to program the “learning” algorithm ourselves, something that often takes much expertise to write and to optimize for high performance.

It can be quite overwhelming to choose the appropriate machine learning algorithm for your particular task. In this regard, Scikit-Learn’s user guide provides excellent guidance. For example, consider the following pages for the various machine learning algorithms below:

For each method, the user guide lists some strengths and weaknesses. These are helpful for matching the characteristics of your task to the method.

We will test the SVM and decision tree methods in this episode.

Support Vector Machine (SVM) Classifier

Let’s start with the SVM method. There are several variations of SVM: as a classifier, as a regressor, and even as a one-class SVM. In this lesson we will only focus on the SVM classifier.

The idea of a Support Vector Classifier (SVC) is to find the best boundaries to separate items belonging to different classes. SVC accomplishes this by constructing “hyperplanes” in a high- or infinite-dimensional space that separate the different classes. Nonlinearity is used to “bend” these hyperplanes:

SVC on the 'iris petal' dataset

Theoretical Overview

A brief look at the mathematics of the optimization problem we are trying to solve:

SVC optimization math

We identify the following:

The most popular nonlinear function (which we also use in this problem) is called “radial basis function” (rbf):

K(x[i], x[j]) = exp(-gamma * Norm2(x[i] - x[j])**2)

where Norm2 = 2-norm of a vector; gamma = adjustable positive constant.
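As a quick check, the following minimal sketch evaluates the rbf kernel by hand with NumPy and compares it with Scikit-Learn's rbf_kernel helper, which computes exp(-gamma * ||x - y||^2) using the squared 2-norm. The two sample vectors and the value of gamma are made up for illustration only.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

# Two made-up feature vectors and an arbitrary gamma (illustration only)
x_i = np.array([1.0, 2.0, 3.0])
x_j = np.array([2.0, 0.0, 1.0])
gamma = 0.1

# Manual evaluation: exp(-gamma * squared 2-norm of the difference)
k_manual = np.exp(-gamma * np.linalg.norm(x_i - x_j) ** 2)

# rbf_kernel expects 2-D arrays (one row per sample) and returns a matrix
k_sklearn = rbf_kernel(x_i.reshape(1, -1), x_j.reshape(1, -1), gamma=gamma)[0, 0]

print("manual :", k_manual)
print("sklearn:", k_sklearn)
```

A larger gamma makes the kernel value drop faster with distance, so each training point influences a smaller neighborhood.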

From these we recognize that the parameters of an SVM model are:

and the hyperparameters are:

Scikit-Learn Implementation of SVC

Scikit-Learn’s implementation notes:

The implementation is based on libsvm. The fit time complexity is more than quadratic with the number of samples which makes it hard to scale to dataset with more than a couple of 10000 samples.

LibSVM is the actual implementation of the SVM method used in Scikit-Learn. Its source code is available on GitHub (cjlin1/libsvm), and there is a comprehensive paper on LibSVM: [PDF link].

Here’s how we create a Python object called model_svc that represents an SVM classifier:

from sklearn.svm import SVC
model_svc = SVC(verbose=1)

Scikit-Learn has chosen default hyperparameters that are reasonable for many uses, but be aware that the defaults may not give the best prediction performance. Non-default hyperparameters can be specified explicitly, as in this example:

model_svc = SVC(verbose=1, kernel='rbf', C=1000.0, gamma=0.001)

Training phase: How do we train this model?

model_svc.fit(train_FM, train_L)

How easy is that! This is a very simple problem and the fit returns right away. Be advised that for large datasets the training phase can take quite a while!

The training will produce an output like this one (ipython prompts used to distinguish terminal printing and actual Python result):

In [68]: model_svc.fit(train_FM, train_L)
/cm/shared/applications/Python/share/3.6/scikit-learn/0.20.2/lib/python3.6/site-packages/sklearn/svm/base.py:196: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
..*...*
optimization finished, #iter = 5644
obj = -1199.150839, rho = -0.083055
nSV = 1936, nBSV = 1037
Total nSV = 1936
Out[68]: [LibSVM]
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
  kernel='rbf', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=1)

This training procedure passed through 5644 iterations; the objective function value is -1199.15 at the optimal solution. Here, rho is simply -b, and nSV is the number of support vectors of the solution. If you are curious, you can inspect the parameters via the model_svc object interface. Refer to the API documentation for more details.
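As a rough illustration of inspecting a fitted SVC, the sketch below trains on a small synthetic dataset (a stand-in for the drone features, not the real data) and prints a few attributes that Scikit-Learn exposes after fitting. The hyperparameter values are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic two-class data with 6 features (assumption: stands in for train_FM, train_L)
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

model = SVC(kernel='rbf', C=1.0, gamma=0.1)
model.fit(X, y)

# Attributes that become available after fit()
print("Support vectors per class :", model.n_support_)
print("Shape of support vectors  :", model.support_vectors_.shape)
print("Shape of dual coefficients:", model.dual_coef_.shape)
print("Intercept                 :", model.intercept_)
```

The total of n_support_ corresponds to the "Total nSV" figure in the LibSVM training output.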

Decision Tree Classifier

A decision tree classifier (DTC) is a learning method that builds a tree-like logical sequence, in which the outcome is determined by evaluating conditions at the tree nodes.

Manually created decision tree

In the back of our minds, we do this a lot, which makes this a very intuitive algorithm. With a decision tree, we can trace back why a certain feature set gets a certain label (e.g. why

X = [ 97.65625, 0.01024, 1024, 32, 463.655823, 1 ]

leads to y = mavric).

In machine learning, the tree is constructed and improved iteratively according to specific rules.

Scikit-Learn Implementation of Decision Tree Classifier

Here’s how we create a Python object called model_dtc that represents a DTC model:

from sklearn.tree import DecisionTreeClassifier
model_dtc = DecisionTreeClassifier()

The constructed tree object and the threshold values (e.g. the value 50 as in condition X[0] > 50) are the parameters of the tree.
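To make the idea of tree parameters concrete, here is a small sketch that fits a shallow tree on synthetic data (a stand-in for the drone features) and prints the condition learned at each node through the fitted model's tree_ attribute. Note that Scikit-Learn writes its conditions in the form X[k] <= threshold.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class data with 6 features (assumption: illustration only)
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

# Keep the tree shallow so the printout stays short
tree_model = DecisionTreeClassifier(max_depth=2, random_state=0)
tree_model.fit(X, y)

tree = tree_model.tree_
for node in range(tree.node_count):
    if tree.children_left[node] == -1:
        # A child index of -1 marks a leaf node (no further condition)
        print("node %d: leaf" % node)
    else:
        print("node %d: X[%d] <= %.4f"
              % (node, tree.feature[node], tree.threshold[node]))
```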

There are many hyperparameters that can be adjusted for this model. The following are only a subset of the hyperparameters—please refer to Scikit-Learn’s documentation page for more details:

As you can see, it quickly becomes a nontrivial process to determine the best set of hyperparameters.

Non-default hyperparameters can be specified explicitly like this example:

model_dtc = DecisionTreeClassifier(criterion='entropy',
                                   max_depth=6,
                                   min_samples_split=8)

Training phase: How do we train this model?

model_dtc.fit(train_FM, train_L)

It’s the same method call as in the SVC case. This is the convenience of using Scikit-Learn: the API is quite uniform, which makes it easy to swap models.
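One way to take advantage of the uniform API is to loop over several models with identical code. The sketch below does this on synthetic data; the dataset and the hyperparameter values are placeholders chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for (train_FM, train_L)
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

models = {
    "SVC": SVC(kernel='rbf', gamma=0.1),
    "DTC": DecisionTreeClassifier(max_depth=6, random_state=0),
}

predictions = {}
for name, model in models.items():
    # The same two calls work for every Scikit-Learn classifier
    model.fit(X, y)
    predictions[name] = model.predict(X)
    print(name, "-> first 10 predictions:", predictions[name][:10])
```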

Accuracy, Precision, Recall, Confusion Matrix

Validation phase: Now we use the trained model to make predictions. Our work is not complete until we validate the model. Let us see how model_svc performs. To make predictions, we simply feed the feature matrix to the predict method of model_svc or model_dtc (we use model_svc in this section for illustrative purposes):

# Predict L for the training set
>>> train_L_pred = model_svc.predict(train_FM)

Let’s see what happens:

# The correct labels (expected responses)
>>> train_L
array([1, 0, 0, ..., 0, 0, 1])

# The labels predicted by model_svc:
>>> train_L_pred
array([1, 0, 1, ..., 0, 0, 1])

Numpy printing gets in the way here. Let’s examine the first 20 of each:

>>> train_L[:20]
array([1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0])

>>> train_L_pred[:20]
array([1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0])

How many are mismatched? We have 2400 data points, so it would be dizzying to look at each one! We need a way to get a sense of how well we are doing.
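For the 20 values printed above, the mismatches can be counted with a NumPy comparison instead of by eye. This is a minimal sketch; the variable names true_20 and pred_20 are made up here to hold the two printed arrays.

```python
import numpy as np

# The first 20 true and predicted labels, copied from the output above
true_20 = np.array([1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0])
pred_20 = np.array([1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0])

# Elementwise comparison gives a boolean array; True marks a mismatch
mismatch = true_20 != pred_20
n_mismatch = int(mismatch.sum())
mismatch_idx = np.flatnonzero(mismatch)

print("Number of mismatches:", n_mismatch)      # 3
print("Mismatched positions:", mismatch_idx)    # [ 2  8 14]
```

The same comparison works on the full train_L and train_L_pred arrays.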

Evaluating Model Performance

Scikit-Learn has many functions to help quantify the performance of a trained model. These are stored in the sklearn.metrics module, which also comes with a complete user’s guide. Here are some of them:

Here is example output for the drone training data:

>>> from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix

>>> accuracy_score(train_L, train_L_pred)
0.8441666666666666

# Unnormalized (not the fraction), i.e. the actual number of correct L predictions
>>> accuracy_score(train_L, train_L_pred, normalize=False)
2026

>>> precision_score(train_L, train_L_pred)
0.8545918367346939

>>> recall_score(train_L, train_L_pred)
0.831953642384106

The confusion matrix is also easily obtained:

>>> confusion_matrix(train_L, train_L_pred)
array([[1021,  171],
       [ 203, 1005]])

The rows correspond to the true values of L (i.e. row 0 for actual Phantom label and row 1 for actual mavric label); the columns correspond to the predicted values of L (from model_svc.predict).

From the off-diagonal values of the matrix, we can see where the predictions are off, and by how many. In the example above, 203 samples that are actually mavric were labeled Phantom by model_svc.
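The accuracy, precision, and recall scores shown earlier can all be recovered from this confusion matrix. The sketch below does the arithmetic by hand, treating label 1 (mavric) as the positive class, which is the default for precision_score and recall_score with 0/1 labels.

```python
import numpy as np

# Confusion matrix from the output above: rows = true label, columns = predicted label
cm = np.array([[1021, 171],
               [203, 1005]])

# For a binary problem, ravel() yields TN, FP, FN, TP in this order
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()    # fraction of all predictions that are correct
precision = tp / (tp + fp)         # of those predicted as 1, how many are truly 1
recall = tp / (tp + fn)            # of those truly 1, how many were predicted as 1

print("accuracy  =", accuracy)
print("precision =", precision)
print("recall    =", recall)
```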

Performance scores for DecisionTreeClassifier model

Train a DTC model and evaluate the accuracy, precision, recall, and confusion matrix the same way as above. What do you conclude?

Solutions

>>> train_L_pred_dtc = model_dtc.predict(train_FM)

>>> accuracy_score(train_L, train_L_pred_dtc)
1.0

>>> precision_score(train_L, train_L_pred_dtc)
1.0

>>> confusion_matrix(train_L, train_L_pred_dtc)
array([[1192,    0],
       [   0, 1208]])

Would you conclude that we have a perfectly trained model to predict the type of the drones?

You can also predict for a different feature matrix; let’s do that for the dev set:

# Predict L for the dev set
>>> dev_L_pred = model_svc.predict(dev_FM)

Performance scores for dev sets

Using the dev set (dev_FM and dev_L), evaluate the prediction of L made by the SVC and DTC.

Solutions

A few sample solutions are provided; others are left for your exercise.

# For SVC model
>>> accuracy_score(dev_L, dev_L_pred)
0.72

>>> confusion_matrix(dev_L, dev_L_pred)
array([[229,  79],
       [ 89, 203]])

# For DTC model
>>> dev_L_pred_dtc = model_dtc.predict(dev_FM)

>>> accuracy_score(dev_L, dev_L_pred_dtc)
0.6733333333333333

>>> confusion_matrix(dev_L, dev_L_pred_dtc)
array([[206, 102],
       [ 94, 198]])

You may be surprised to find that:

1) The performance metrics of SVC on the “dev” set are not as good as those on the “training” set.

2) The same performance drop is observed for the DTC model, and it is even worse!

3) The DTC model is not as good as SVC on the “dev” set.

Points 1–2 show that, because the model is trained on the training data, a performance measurement that uses the same training data is biased. This is why we have to use another set of data to gauge the model performance!
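The gap between training and dev scores can be reproduced on synthetic data. In the sketch below (an illustration under assumed settings, not the drone data), some labels are deliberately randomized with flip_y so that an unrestricted decision tree memorizes the noise in the training set.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data; flip_y randomizes a fraction of the labels to mimic noise
X, y = make_classification(n_samples=1000, n_features=6, flip_y=0.2,
                           random_state=0)
X_train, X_dev, y_train, y_dev = train_test_split(X, y, test_size=0.2,
                                                  random_state=0)

# No depth limit: the tree keeps splitting until every training sample is fit
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

train_acc = accuracy_score(y_train, model.predict(X_train))
dev_acc = accuracy_score(y_dev, model.predict(X_dev))

print("Accuracy on training set:", train_acc)
print("Accuracy on dev set     :", dev_acc)
```

A perfect training score paired with a clearly lower dev score is the typical signature of overfitting.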

Scaling (Normalizing) the Features

Machine learning methods often work best when the values of the features are on the order of one (i.e. having values like 0.5, -1.214, 3.5 but not -100.72, 7.148e+6, and so on) and are centered on zero.

Without going into details, we can scale each feature so that the mean value of the feature is zero (or close to it) and the standard deviation of the feature values is 1. This is accomplished using the sklearn.preprocessing.StandardScaler class:

>>> from sklearn.preprocessing import StandardScaler

>>> scaler = StandardScaler()

>>> scaler.fit(train_FM)
StandardScaler(copy=True, with_mean=True, with_std=True)

>>> scaler.mean_
array([9.14713542e+02, 4.08306667e-03, 4.08306667e+02, 2.72866667e+01,
       3.50526761e+02, 9.49583333e-01])

>>> scaler.scale_
array([1.38713323e+03, 3.45074464e-03, 3.45074464e+02, 8.52981959e+00,
       1.41258864e+02, 2.18803168e-01])

Note: we don’t scale the labels, because scaling does not make sense for class labels!

In this exercise, we will use only the training set features to obtain the scaling parameters. One can use all_FM if so desired.

(Each feature will get its own mean-value removal and scaling.) At first you might think that this would corrupt the data and render it useless for machine learning. This is not the case, because the model’s parameters will adjust to the new data.
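To see what the scaler does to each feature, the following sketch applies the same transformation by hand, (x - mean) / standard deviation per column, and checks that it agrees with StandardScaler. The small feature matrix is made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# A made-up feature matrix whose columns have very different magnitudes
X = np.array([[100.0, 0.001, 3.0],
              [250.0, 0.004, 1.0],
              [400.0, 0.010, 2.0],
              [ 50.0, 0.005, 6.0]])

scaler = StandardScaler().fit(X)
X_scaled = scaler.transform(X)

# Manual equivalent: subtract the column mean, divide by the column std
# (np.std uses the population formula by default, as StandardScaler does)
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

print("Same result as StandardScaler:", np.allclose(X_scaled, X_manual))
print("Column means after scaling:", X_scaled.mean(axis=0))
print("Column stds after scaling :", X_scaled.std(axis=0))
```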

Other types of scalers can also be used; see the documentation page.

Now scale the train_FM and dev_FM:

>>> train_FM = scaler.transform(train_FM)

>>> dev_FM = scaler.transform(dev_FM)

# Inspect the scaled values
>>> train_FM
array([[-0.51862433,  0.30049553,  0.30049553,  0.55257128,  1.00958108,
         0.23042019],
       [ 3.84626822, -1.13687539, -1.13687539,  0.55257128, -1.46154134,
         0.23042019],
       [-0.37782134, -0.44137333, -0.44137333,  0.55257128, -1.12787606,
         0.23042019],
       ...,
       [ 0.46699657, -0.99777498, -0.99777498,  0.55257128, -1.18086033,
         0.23042019],
       [-0.37782134, -0.44137333, -0.44137333,  0.55257128, -2.45167095,
        -4.33989755],
       [-0.51862433,  0.30049553,  0.30049553,  0.55257128,  0.96804176,
         0.23042019]])

You can reuse the train_FM and dev_FM variable names, or create new names (train_FM2, dev_FM2) if you prefer; in the latter case, you will need to adjust the variable names accordingly later on.

Training models with normalized features

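Retrain the SVC model on the scaled training features, then evaluate the predictions of L on both the training set (train_FM and train_L) and the dev set (dev_FM and dev_L). Do the same for the DTC.

Solutions

# The model must be refit because the feature values have changed (fit output omitted)
>>> model_svc.fit(train_FM, train_L)

>>> train_L_pred = model_svc.predict(train_FM)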

>>> dev_L_pred = model_svc.predict(dev_FM)

>>> accuracy_score(train_L, train_L_pred)
0.6904166666666667

>>> accuracy_score(dev_L, dev_L_pred)
0.7116666666666667

Saving Your Fingers: Scripts and Functions

By this point you may have the feeling that machine learning involves a lot of trial and error. Scripts and functions are extremely helpful for saving labor.

Python script

A script is basically a text file that contains Python statements. A script (say, script.py) can be executed from within an interactive Python session using:

exec(open('script.py').read())

You can save and reuse frequently used sequences of statements by putting them into a script.

Script: Training and validation

Create a script called train_valid.py that will perform training and validation.

Solution

This is just one example. You can make a completely different script that fits your style/taste.

print("Training the SVC model...")
model_svc.fit(train_FM, train_L)
train_L_pred = model_svc.predict(train_FM)
train_accuracy = accuracy_score(train_L, train_L_pred)
print()
print("Accuracy score on training set = ", train_accuracy)
dev_L_pred = model_svc.predict(dev_FM)
dev_accuracy = accuracy_score(dev_L, dev_L_pred)
print("Accuracy score on dev set = ", dev_accuracy)
print()

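NOTE: This is not a complete script. It relies on objects that already exist in the session (model_svc, accuracy_score, and the data arrays) and is meant to be used only in the ipython environment while iterating through the model train–validation cycle.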

Python functions

You can shorten frequently used sequences of commands by using functions. Most programming or scripting languages have this feature. In Python, functions are defined using the keyword def as follows:

def function_name([parameter_list]):
    statement_1()
    statement_2()
    # and so on

As an example,

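import os
import pandas
from sklearn.model_selection import train_test_split

# categorical_to_numerics() is not defined in this episode; it must already
# exist in your session before prepare_data() is called.
def prepare_data(data_home, file_name, ratio = .8):
    # Read the CSV file; `ratio` is the fraction of rows used for training
    data = pandas.read_csv(os.path.join(data_home, file_name))
    # test_size is the dev fraction, i.e. the complement of the training ratio
    train_data, dev_data = train_test_split(data, test_size = 1 - ratio)
    # Separate the training labels ("class" column) from the feature matrix
    train_labels = train_data["class"]
    train_fm = train_data.copy()
    del train_fm["class"]
    train_fm = train_fm.astype("float64").values
    train_real_labels, labels = categorical_to_numerics(train_labels)
    # Do the same for the dev set, passing along the labels found in training
    dev_fm = dev_data.copy()
    del dev_fm["class"]
    dev_fm = dev_fm.astype("float64").values
    dev_labels = dev_data["class"]
    dev_real_labels, dev_l_cat = categorical_to_numerics(dev_labels, labels)

    return train_fm, train_real_labels, dev_fm, dev_real_labels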

To use your function:

train_FM, train_L, dev_FM, dev_L = prepare_data('/scratch-lustre/DeapSECURE/module03/Drones/data',
                                                'machinelearningdata.csv')

The values returned by the function must be captured when the function is called, as in the assignment above.

NOTES: Variables that are defined in the function are LOCAL variables; they are not visible outside the function. If you need them outside the function, you must declare them as global, say,

global data

at the start of the function.
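The difference between local and global variables can be seen in a tiny sketch. The function and variable names below are made up for illustration.

```python
def set_local():
    # counter exists only while set_local() is running
    counter = 10
    return counter

def set_global():
    # The global declaration makes the assignment visible outside the function
    global shared_counter
    shared_counter = 20

result = set_local()   # the local value is only available through the return value
set_global()

counter_visible = "counter" in globals()

print("Returned value         :", result)
print("Global shared_counter  :", shared_counter)
print("Is counter visible here?", counter_visible)
```

Returning values, as prepare_data does, is usually easier to follow than relying on global variables.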

Tuning Hyperparameters

Now let’s tune the hyperparameters. We take the SVC model as an example. Start by taking nontrivial values of C and gamma:

Hyperparameter search can be automated! This is a brute-force search and it is going to be expensive. We use GridSearchCV to perform this search. We will also try different kinds of SVM kernels here.

from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
# Set the parameters by cross-validation
tuned_parameters = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4],
                     'C': [1, 10, 100, 1000, 10000, 100000]},
                    {'kernel': ['linear'], 'C': [1, 10, 100, 1000]}]

scores = ['precision', 'recall']

for score in scores:
    print("# Tuning hyper-parameters for {}".format(score))
    print()

    clf = GridSearchCV(model_svc, tuned_parameters, cv=5,
                       scoring='%s_macro' % score)
    clf.fit(train_FM, train_L)

    print("Best parameters set found by cross-validation on the training set:")
    print()
    print(clf.best_params_)
    print()
    print("Grid scores from cross-validation on the training set:")
    print()
    means = clf.cv_results_['mean_test_score']
    stds = clf.cv_results_['std_test_score']
    for mean, std, params in zip(means, stds, clf.cv_results_['params']):
        print("%0.3f (+/-%0.03f) for %r"
              % (mean, std * 2, params))
    print()

    print("Detailed classification report:")
    print()
    print("The model is trained on the full training set.")
    print("The scores are computed on the full dev set.")
    print()
    y_true, y_pred = dev_L, clf.predict(dev_FM)
    print(classification_report(y_true, y_pred))
    print()
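For a quicker feel of what GridSearchCV returns, here is a much smaller, self-contained sketch on synthetic data. The grid values, the cv=3 setting, and the accuracy scoring are arbitrary choices for illustration, not recommendations for the drone data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for (train_FM, train_L)
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

# A tiny grid: 2 values of gamma x 2 values of C = 4 candidate models
param_grid = {'kernel': ['rbf'], 'gamma': [1e-2, 1e-1], 'C': [1, 10]}

search = GridSearchCV(SVC(), param_grid, cv=3, scoring='accuracy')
search.fit(X, y)

print("Best parameters :", search.best_params_)
print("Best CV accuracy:", search.best_score_)

# With the default refit=True, best_estimator_ is retrained on all of X, y
best_model = search.best_estimator_
print("Predictions for the first 5 samples:", best_model.predict(X[:5]))
```

Each candidate is trained cv times, so the cost grows with the product of the grid size and the number of folds.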

Parallel execution: Those who have reserved more than one core can now unleash the power of multi-core CPUs. Add the n_jobs=4 argument (assuming you reserved 4 cores) to your GridSearchCV constructor call:

clf = GridSearchCV(model_svc, tuned_parameters, cv=5,
                   scoring='%s_macro' % score, n_jobs=4)

Key Points