Machine Learning for Drone Signal Classification
Overview
Teaching: 0 min
Exercises: 0 min
Questions
How would one use Scikit-Learn to perform machine learning?
Objectives
Understanding the machine learning process in scikit-learn.
Using the Scikit-Learn Machine Learning API
In the previous episode, we did a lot of legwork to prepare the drone data for machine learning. Now we come to the exciting part: actually doing machine learning.
Scikit-Learn has a vast library of machine learning methods. These methods have been implemented and tested by many thousands of users, so we do not need to program the “learning” algorithm ourselves, something that often takes much expertise to write and to optimize for high performance.
It can be quite overwhelming to choose the appropriate machine learning algorithm for your particular task. In this regard, Scikit-Learn’s user guide provides excellent guidance. For example, consider the following pages for the various machine learning algorithms below:
For each method, the user guide lists some strengths and weaknesses. These are helpful for matching the characteristics of your task to the method.
In this episode we will test the SVM and the decision tree.
Support Vector Machine (SVM) Classifier
Let’s start with the SVM method. There are several variations of SVM: as a classifier, as a regressor, and even as a one-class SVM. In this lesson we will only focus on the SVM classifier.
The idea of a Support Vector Classifier (SVC) is to find the best boundaries to separate items belonging to different classes. SVC accomplishes this by constructing “hyperplanes” in a high- or infinite-dimensional space that separate the different classes. Nonlinearity is used to “bend” these hyperplanes.
Theoretical Overview
A brief look at the mathematics of the optimization we’re trying to solve:
We identify the following:
- (input) feature vector x[i] (for the i-th data point);
- regularization parameter C, which limits the magnitude of alpha[i];
- support vectors w;
- coefficients y[i]*alpha[i];
- constant b;
- kernel function K(x[i], x[j]);
- (nonlinear) function phi[i], which maps input features x to high dimension.
The most popular nonlinear kernel function (which we also use in this problem) is called the “radial basis function” (rbf):
K(x[i], x[j]) = exp(-gamma * Norm2(x[i] - x[j])**2)
where Norm2 = 2-norm of a vector (squared in the exponent); gamma = adjustable constant.
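To make the kernel formula concrete, here is a minimal sketch that evaluates the rbf kernel by hand and compares it with Scikit-Learn's `rbf_kernel` helper. The function name `rbf` and the two feature vectors are made up for illustration; they are not part of the drone data.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def rbf(x_i, x_j, gamma):
    """RBF kernel between two feature vectors (squared 2-norm in the exponent)."""
    diff = np.asarray(x_i, dtype=float) - np.asarray(x_j, dtype=float)
    return np.exp(-gamma * np.dot(diff, diff))

# Hypothetical feature vectors, chosen only for illustration
x_a = np.array([0.5, -1.2, 3.5])
x_b = np.array([0.0, -1.0, 3.0])
gamma = 0.1

k_manual = rbf(x_a, x_b, gamma)
# rbf_kernel expects 2-D arrays (one row per sample) and returns a matrix
k_sklearn = rbf_kernel(x_a.reshape(1, -1), x_b.reshape(1, -1), gamma=gamma)[0, 0]
print(k_manual, k_sklearn)
```

The kernel equals 1 for identical vectors and decays toward 0 as the vectors move apart; a larger gamma makes the decay faster.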
From these we recognize that the parameters of an SVM model are:
- y[i]*alpha[i]
- b
and the hyperparameters are:
- C
- choice of kernel function K
- (for rbf kernel) gamma
Scikit-Learn Implementation of SVC
Scikit-Learn’s implementation notes:
The implementation is based on libsvm. The fit time complexity is more than quadratic with the number of samples which makes it hard to scale to dataset with more than a couple of 10000 samples.
LibSVM is the actual implementation of the SVM method used in Scikit-Learn. Its source code is on GitHub (cjlin1/libsvm), and there is a comprehensive paper on LibSVM: [PDF link].
Here’s how we create a Python object called model_svc that represents an SVM classifier:
from sklearn.svm import SVC
model_svc = SVC(verbose=1)
Scikit-Learn has chosen default hyperparameter values that are reasonable enough for many uses. But we need to be aware that the default choices may not give the best prediction performance. Non-default hyperparameters can be specified explicitly, as in this example:
model_svc = SVC(verbose=1, kernel='rbf', C=1000.0, gamma=0.001)
Training phase: How do we train this model?
model_svc.fit(train_FM, train_L)
How easy is that! This is a very simple problem, so the fit returns right away. Be advised that for large datasets the training phase can take quite a while!
The training will produce an output like this one (ipython prompts used to distinguish terminal printing and actual Python result):
In [68]: model_svc.fit(train_FM, train_L)
/cm/shared/applications/Python/share/3.6/scikit-learn/0.20.2/lib/python3.6/site-packages/sklearn/svm/base.py:196: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
"avoid this warning.", FutureWarning)
..*...*
optimization finished, #iter = 5644
obj = -1199.150839, rho = -0.083055
nSV = 1936, nBSV = 1037
Total nSV = 1936
Out[68]: [LibSVM]
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
kernel='rbf', max_iter=-1, probability=False, random_state=None,
shrinking=True, tol=0.001, verbose=1)
This training procedure passed through 5644 iterations; the objective function value is -1199.15 at the optimal solution. Here, rho is simply -b, and nSV is the number of support vectors of the solution.
If you are curious, you can inspect the parameters via the model_svc object interface. Refer to the API documentation for more details.
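As a rough sketch of what inspecting the parameters can look like, the snippet below fits an SVC and prints a few of its fitted attributes. It uses a synthetic dataset from `make_classification` as a stand-in for the drone data (an assumption made so the example runs on its own), and the variable name `model` is arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic stand-in for the drone data (illustration only)
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

model = SVC(kernel='rbf', C=1.0, gamma=0.1)
model.fit(X, y)

# Fitted attributes end with an underscore in Scikit-Learn
print("support vectors per class:", model.n_support_)
print("shape of support vectors :", model.support_vectors_.shape)
print("shape of y[i]*alpha[i]   :", model.dual_coef_.shape)
print("constant b               :", model.intercept_)
```

The entries of `dual_coef_` are the products y[i]*alpha[i], so their magnitudes are bounded by the regularization parameter C.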
Decision Tree Classifier
A decision tree classifier (DTC) is a learning method that builds a tree-like sequence of logical tests; it determines the outcome by evaluating the conditions at the tree nodes.
In the back of our minds, we do this a lot, so this is a very intuitive algorithm. With a decision tree, we can trace back why a certain feature set gets a certain label (e.g. why X = [ 97.65625, 0.01024, 1024, 32, 463.655823, 1 ] leads to y = mavric).
In machine learning, the tree is constructed and improved iteratively according to specific rules.
Scikit-Learn Implementation of Decision Tree Classifier
Here’s how we create a Python object called model_dtc that represents a DTC model:
from sklearn.tree import DecisionTreeClassifier
model_dtc = DecisionTreeClassifier()
The constructed tree object and the threshold values (e.g. the value 50 as in condition X[0] > 50) are the parameters of the tree.
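The following sketch shows one way to look at the conditions and thresholds of a fitted tree. It trains a shallow tree on synthetic data (an assumption; the drone feature matrix is not used here) and prints the tree with `export_text`. The feature names `f0` to `f5` are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in for the drone data (illustration only)
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

tree_model = DecisionTreeClassifier(max_depth=2, random_state=0)
tree_model.fit(X, y)

# Text rendering of the condition tested at each node
print(export_text(tree_model, feature_names=["f%d" % i for i in range(6)]))

# Internal (split) nodes have a feature index >= 0; leaves are marked with -2
is_split = tree_model.tree_.feature >= 0
print("split features  :", tree_model.tree_.feature[is_split])
print("split thresholds:", tree_model.tree_.threshold[is_split])
```

Following the printed branches from the root to a leaf is exactly the "trace back" described above: each line is one condition that a sample satisfied on its way to the predicted label.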
There are many hyperparameters that can be adjusted for this model. The following are only a subset of them; please refer to Scikit-Learn’s documentation page for more details:
- criterion: Choice of function to measure the quality of a split, which can be 'gini' (minimizing the Gini index) or 'entropy' (maximizing information gain);
- max_depth: Maximum depth of the tree;
- min_samples_split: The minimum number of samples required to split an internal node, which by default is set to 2 in Scikit-Learn;
- max_features: The number of features to consider when looking for the best split;
- and many more!
As you can see, it quickly becomes a nontrivial process to determine the best set of hyperparameters.
Non-default hyperparameters can be specified explicitly like this example:
model_dtc = DecisionTreeClassifier(criterion='entropy',
max_depth=6,
min_samples_split=8)
Training phase: How do we train this model?
model_dtc.fit(train_FM, train_L)
It’s the same method call as in the SVC case. This is the convenience of using Scikit-Learn: the API is quite uniform which makes it easy to swap models.
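To illustrate the uniform API, here is a small sketch that trains both model types in a single loop. The data come from `make_classification` (a synthetic stand-in, assumed only for this example), and the dictionary names `models` and `predictions` are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the drone data (illustration only)
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

models = {
    "SVC": SVC(kernel='rbf', gamma='scale'),
    "DTC": DecisionTreeClassifier(max_depth=6, random_state=0),
}

predictions = {}
for name, model in models.items():
    model.fit(X, y)                       # same call for every model
    predictions[name] = model.predict(X)  # same call for every model
    print(name, "-> predicted", len(predictions[name]), "labels")
```

Because every classifier exposes `fit` and `predict`, swapping in another method only requires changing the entry in the dictionary.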
Accuracy, Precision, Recall, Confusion Matrix
Validation phase: Now we use the trained model to make predictions. Our work is not complete until we validate the model. Let us see how model_svc performs.
To make predictions, we simply feed the feature matrix to the predict method of model_svc or model_dtc (we use model_svc in this section for illustrative purposes):
# Predict L for the training set
>>> train_L_pred = model_svc.predict(train_FM)
Let’s see what happens:
# The correct labels (expected responses)
>>> train_L
array([1, 0, 0, ..., 0, 0, 1])
# The labels predicted by model_svc:
>>> train_L_pred
array([1, 0, 1, ..., 0, 0, 1])
NumPy abbreviates the printout of long arrays, which gets in the way here. Let’s examine the first 20 of each:
>>> train_L[:20]
array([1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0])
>>> train_L_pred[:20]
array([1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0])
How many are mismatched? We have 2400 data points, so it would be dizzying to look at each one! We need a way to get a sense of how well we are doing.
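For the 20 labels printed above, the mismatches can be counted with a short NumPy comparison rather than by eye. This is only a sketch; the array names `true_20` and `pred_20` are introduced here for illustration, and the values are copied from the printout above.

```python
import numpy as np

# First 20 expected labels and first 20 predicted labels, copied from above
true_20 = np.array([1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0])
pred_20 = np.array([1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0])

mismatch = true_20 != pred_20          # boolean array, True where labels differ
print("number of mismatches:", mismatch.sum())
print("positions           :", np.flatnonzero(mismatch))
```

The same comparison works on the full arrays, which is essentially what the accuracy score automates.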
Evaluating Model Performance
Scikit-Learn has many functions to help quantify the performance of a trained model. These are stored in the sklearn.metrics module, which also comes with a complete user’s guide. Here are some of them:
- accuracy_score: The accuracy score is the fraction of correct predictions out of the size of the evaluated dataset.
- precision_score and recall_score: Precision and recall scores. Remember that, intuitively speaking, precision is the ability of the classifier not to label as positive a sample that is negative, while recall is the ability of the classifier to find all the positive samples.
Important: If there are more than two categories in the classification, then precision and recall values have to be averaged in some fashion. Refer to each function’s documentation for more detail.
Here is example output for the drone training data:
>>> from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix
>>> accuracy_score(train_L, train_L_pred)
0.8441666666666666
# Unnormalized (not the fraction), i.e. the actual number of correct L predictions
>>> accuracy_score(train_L, train_L_pred, normalize=False)
2026
>>> precision_score(train_L, train_L_pred)
0.8545918367346939
>>> recall_score(train_L, train_L_pred)
0.831953642384106
The confusion matrix is also easily obtained:
>>> confusion_matrix(train_L, train_L_pred)
array([[1021, 171],
[ 203, 1005]])
The rows correspond to the true values of L (i.e. row 0 for the actual Phantom label and row 1 for the actual mavric label); the columns correspond to the predicted values of L (from model_svc.predict). From the off-diagonal values of the matrix, we can see where the predictions are off, and by how many. In the example above, 203 actual mavric data points were labeled Phantom by model_svc.
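The accuracy, precision, and recall reported earlier can all be recovered from this confusion matrix. The sketch below redoes the arithmetic by hand, treating label 1 (mavric) as the "positive" class, which is the default for binary problems in `precision_score` and `recall_score`. The variable names are illustrative.

```python
import numpy as np

# Confusion matrix from above: rows = true labels, columns = predicted labels
cm = np.array([[1021, 171],
               [203, 1005]])

# For a binary problem, ravel() yields (true neg, false pos, false neg, true pos)
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()    # correct predictions / all predictions
precision = tp / (tp + fp)         # of those predicted positive, how many are right
recall = tp / (tp + fn)            # of the actual positives, how many were found

print("accuracy :", accuracy)
print("precision:", precision)
print("recall   :", recall)
```

These reproduce the values 0.844, 0.855, and 0.832 returned by the Scikit-Learn functions above.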
Performance scores for DecisionTreeClassifier model
Train a DTC model and evaluate the accuracy, precision, recall, and confusion matrix the same way as above. What will you conclude?
Solutions
>>> train_L_pred_dtc = model_dtc.predict(train_FM)
>>> accuracy_score(train_L, train_L_pred_dtc)
1.0
>>> precision_score(train_L, train_L_pred_dtc)
1.0
>>> confusion_matrix(train_L, train_L_pred_dtc)
array([[1192,    0],
       [   0, 1208]])
Would you conclude that we have a perfectly trained model to predict the type of the drones?
You can also predict for a different feature matrix; let’s do that for the dev set:
# Predict L for the dev set
>>> dev_L_pred = model_svc.predict(dev_FM)
Performance scores for dev sets
Using the dev set (dev_FM and dev_L), evaluate the prediction of L made by the SVC and DTC.
Solutions
A few sample solutions are provided; others are left for your exercise.
# For SVC model
>>> accuracy_score(dev_L, dev_L_pred)
0.72
>>> confusion_matrix(dev_L, dev_L_pred)
array([[229,  79],
       [ 89, 203]])
# For DTC model
>>> accuracy_score(dev_L, dev_L_pred_dtc)
0.6733333333333333
>>> confusion_matrix(dev_L, dev_L_pred_dtc)
array([[206, 102],
       [ 94, 198]])
You may be surprised to find that:
1) The performance metrics of SVC on the “dev” set are not as good as those on the “training” set.
2) The same performance drop is observed for the DTC model, and it is even worse!
3) The DTC model is not as good as SVC on the “dev” set.
Points 1–2 show that, because the model is trained on the training data, measuring performance with the same training data gives a biased (overly optimistic) result. This is why we have to use another set of data to gauge the model performance!
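The gap between training and dev performance can be reproduced on a small synthetic problem. This is a sketch under stated assumptions: the data come from `make_classification` with some label noise (`flip_y`), not from the drone dataset, and the tree is left at its default unlimited depth so that it can memorize the training set.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 20% of the labels randomly reassigned (label noise)
X, y = make_classification(n_samples=1000, n_features=6, flip_y=0.2,
                           random_state=0)
X_train, X_dev, y_train, y_dev = train_test_split(X, y, test_size=0.2,
                                                  random_state=0)

tree = DecisionTreeClassifier(random_state=0)   # no depth limit
tree.fit(X_train, y_train)

train_acc = accuracy_score(y_train, tree.predict(X_train))
dev_acc = accuracy_score(y_dev, tree.predict(X_dev))
print("training accuracy:", train_acc)
print("dev accuracy     :", dev_acc)
```

The tree fits the training set perfectly, noise included, but it cannot do so on data it has never seen, so only the dev score says something about real predictive ability.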
Scaling (Normalizing) the Features
Machine learning methods often work best when the values of the features are on the order of one (i.e. having values like 0.5, -1.214, 3.5 but not -100.72, 7.148e+6, and so on) and are centered on zero.
Without going into details, we can scale each feature so that the mean value of the feature is zero (or close to it) and the standard deviation of the feature values is 1. This is accomplished using the sklearn.preprocessing.StandardScaler class:
>>> from sklearn.preprocessing import StandardScaler
>>> scaler = StandardScaler()
>>> scaler.fit(train_FM)
StandardScaler(copy=True, with_mean=True, with_std=True)
>>> scaler.mean_
array([9.14713542e+02, 4.08306667e-03, 4.08306667e+02, 2.72866667e+01,
3.50526761e+02, 9.49583333e-01])
>>> scaler.scale_
array([1.38713323e+03, 3.45074464e-03, 3.45074464e+02, 8.52981959e+00,
1.41258864e+02, 2.18803168e-01])
Note: we don’t scale the labels; that does not make sense in classification!
In this exercise, we will use only the training set features to obtain the scaling parameters. One can use all_FM if so desired. (Each feature gets its own mean-value removal and scaling.) At first you might think that this would corrupt the data and render it useless for machine learning. Actually it does not, because the model’s parameters will adjust to the new data.
There are also other types of scalers that can be used; see the documentation page.
Now scale train_FM and dev_FM:
>>> train_FM = scaler.transform(train_FM)
>>> dev_FM = scaler.transform(dev_FM)
# Inspect the scaled values
>>> train_FM
array([[-0.51862433, 0.30049553, 0.30049553, 0.55257128, 1.00958108,
0.23042019],
[ 3.84626822, -1.13687539, -1.13687539, 0.55257128, -1.46154134,
0.23042019],
[-0.37782134, -0.44137333, -0.44137333, 0.55257128, -1.12787606,
0.23042019],
...,
[ 0.46699657, -0.99777498, -0.99777498, 0.55257128, -1.18086033,
0.23042019],
[-0.37782134, -0.44137333, -0.44137333, 0.55257128, -2.45167095,
-4.33989755],
[-0.51862433, 0.30049553, 0.30049553, 0.55257128, 0.96804176,
0.23042019]])
You can reuse the train_FM and dev_FM variable names, or create new names (train_FM2, dev_FM2) if you choose to do so; in the latter case, you will need to adjust the variable names accordingly later on.
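To see what the scaler does under the hood, the sketch below checks that `transform` is the same as subtracting `mean_` and dividing by `scale_`, column by column. The arrays `train` and `dev` are random numbers generated for illustration only (their means and spreads are loosely inspired by the values printed above, but they are not the drone data).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up feature matrices with three columns of very different magnitudes
rng = np.random.RandomState(0)
centers = [900.0, 0.004, 350.0]
spreads = [1400.0, 0.003, 140.0]
train = rng.normal(loc=centers, scale=spreads, size=(500, 3))
dev = rng.normal(loc=centers, scale=spreads, size=(100, 3))

scaler = StandardScaler().fit(train)     # scaling parameters from training data only
train_scaled = scaler.transform(train)
dev_scaled = scaler.transform(dev)       # dev set reuses the training parameters

# The same operation written out by hand
manual = (train - scaler.mean_) / scaler.scale_

print("same as manual scaling:", np.allclose(train_scaled, manual))
print("scaled training means :", train_scaled.mean(axis=0))
print("scaled training stds  :", train_scaled.std(axis=0))
print("scaled dev means      :", dev_scaled.mean(axis=0))
```

The scaled training columns have mean 0 and standard deviation 1 by construction; the dev columns are only approximately so, because they are scaled with the training set's parameters.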
Training models with normalized features
Using the dev set (dev_FM and dev_L), evaluate the prediction of L made by the SVC and DTC.
Solutions
>>> train_L_pred = model_svc.predict(train_FM)
>>> dev_L_pred = model_svc.predict(dev_FM)
>>> accuracy_score(train_L, train_L_pred)
0.6904166666666667
>>> accuracy_score(dev_L, dev_L_pred)
0.7116666666666667
Saving Your Fingers: Scripts and Functions
By this point you have a feel that machine learning involves a lot of trial and error. Scripts and functions are extremely helpful for saving labor.
Python script
A script is basically a text file that contains Python statements. A script (say, script.py) can be executed using:
exec(open('script.py').read())
You can save and reuse a frequently used sequence of statements by putting them into a script.
Script: Training and validation
Create a script called train_valid.py that will perform training and validation.
Solution
This is just one example. You can make a completely different script that fits your style/taste.
print("Training the SVC model...")
model_svc.fit(train_FM, train_L)
train_L_pred = model_svc.predict(train_FM)
train_accuracy = accuracy_score(train_L, train_L_pred)
print()
print("Accuracy score on training set = ", train_accuracy)
dev_L_pred = model_svc.predict(dev_FM)
dev_accuracy = accuracy_score(dev_L, dev_L_pred)
print("Accuracy score on dev set = ", dev_accuracy)
print()
NOTE: This is not a complete script. It is meant to be used only in the ipython environment, where model_svc, the data arrays, and accuracy_score are already defined, while iterating through the model train–validation cycle.
Python functions
You can shorten a frequently used sequence of commands by using functions. Most programming or scripting languages have this feature. In Python, functions are defined using the keyword def as follows:
def function_name([parameter_list]):
    statement_1()
    statement_2()
    # and so on
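As an example,
import os
import pandas
from sklearn.model_selection import train_test_split

def prepare_data(data_home, file_name, ratio=.8):
    # ratio is the fraction of the data assigned to the training set
    data = pandas.read_csv(os.path.join(data_home, file_name))
    train_data, dev_data = train_test_split(data, train_size=ratio)
    # Training set: separate the "class" column (labels) from the features
    train_labels = train_data["class"]
    train_fm = train_data.copy()
    del train_fm["class"]
    train_fm = train_fm.astype("float64").values
    # categorical_to_numerics is a helper assumed to be defined elsewhere in the lesson
    train_real_labels, labels = categorical_to_numerics(train_labels)
    # Dev set: same treatment, reusing the label mapping from the training set
    dev_fm = dev_data.copy()
    del dev_fm["class"]
    dev_fm = dev_fm.astype("float64").values
    dev_labels = dev_data["class"]
    dev_real_labels, dev_l_cat = categorical_to_numerics(dev_labels, labels)
    return train_fm, train_real_labels, dev_fm, dev_real_labels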
To use your function:
train_FM, train_L, dev_FM, dev_L = prepare_data('/scratch-lustre/DeapSECURE/module03/Drones/data', 'machinelearningdata.csv')
Results must be captured when the function is called.
NOTES: Variables that are defined in the function are LOCAL variables; they are not visible outside the function. If you need them outside the function, you must declare them as global, say, global data, at the start of the function.
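As a further illustration, the training and validation steps from the script above can be wrapped in a function that returns its results instead of relying on global variables. The function name `train_and_validate` is hypothetical, and the data below are synthetic stand-ins generated so that the example runs by itself.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def train_and_validate(model, train_FM, train_L, dev_FM, dev_L):
    """Fit model on the training set; return (training accuracy, dev accuracy)."""
    model.fit(train_FM, train_L)
    # These two names are local: they exist only while the function runs
    train_accuracy = accuracy_score(train_L, model.predict(train_FM))
    dev_accuracy = accuracy_score(dev_L, model.predict(dev_FM))
    return train_accuracy, dev_accuracy

# Synthetic stand-in for the drone data (illustration only)
X, y = make_classification(n_samples=500, n_features=6, random_state=0)
train_FM, dev_FM, train_L, dev_L = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)

# The returned values must be captured by the caller
train_acc, dev_acc = train_and_validate(SVC(gamma='scale'),
                                        train_FM, train_L, dev_FM, dev_L)
print("Accuracy score on training set =", train_acc)
print("Accuracy score on dev set      =", dev_acc)
```

Passing the model as an argument means the same function can be reused for a DecisionTreeClassifier or for an SVC with different hyperparameters.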
Tuning Hyperparameters
Now let’s tune the hyperparameters. We take the SVC model as the example. Start by trying nontrivial values of C and gamma:
- Try C on the sequence of 1 (default), 10, 100, 1000, …
- Try gamma on the sequence of 1, 0.1, 0.01, 0.001, …
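A manual scan over these two sequences can be written as a pair of nested loops. The sketch below is one possible way to do it, again on synthetic data from `make_classification` rather than the drone dataset; the dictionary `results` and the variable `best` are names chosen for this example.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the drone data (illustration only)
X, y = make_classification(n_samples=400, n_features=6, random_state=0)
X_train, X_dev, y_train, y_dev = train_test_split(X, y, test_size=0.2,
                                                  random_state=0)

results = {}
for C in [1, 10, 100, 1000]:
    for gamma in [1, 0.1, 0.01, 0.001]:
        model = SVC(kernel='rbf', C=C, gamma=gamma)
        model.fit(X_train, y_train)
        # Score each combination on the dev set, not on the training set
        results[(C, gamma)] = accuracy_score(y_dev, model.predict(X_dev))

best = max(results, key=results.get)
print("best (C, gamma):", best, "with dev accuracy", results[best])
```

Even this small scan trains 16 models, which hints at why an exhaustive search becomes expensive as more hyperparameters and values are added.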
Automating hyperparameter search
Hyperparameter search can be automated! This is a brute-force search and it is going to be expensive. We use GridSearchCV to perform this search. We will also try different kinds of SVM kernel here.
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report

# Set the parameters by cross-validation
tuned_parameters = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4],
                     'C': [1, 10, 100, 1000, 10000, 100000]},
                    {'kernel': ['linear'], 'C': [1, 10, 100, 1000]}]

scores = ['precision', 'recall']

for score in scores:
    print("# Tuning hyper-parameters for {}".format(score))
    print()

    clf = GridSearchCV(model_svc, tuned_parameters, cv=5,
                       scoring='%s_macro' % score)
    clf.fit(train_FM, train_L)

    print("Best parameters set found on development set:")
    print()
    print(clf.best_params_)
    print()
    print("Grid scores on development set:")
    print()
    means = clf.cv_results_['mean_test_score']
    stds = clf.cv_results_['std_test_score']
    for mean, std, params in zip(means, stds, clf.cv_results_['params']):
        print("%0.3f (+/-%0.03f) for %r" % (mean, std * 2, params))
    print()

    print("Detailed classification report:")
    print()
    print("The model is trained on the full development set.")
    print("The scores are computed on the full evaluation set.")
    print()
    y_true, y_pred = dev_L, clf.predict(dev_FM)
    print(classification_report(y_true, y_pred))
    print()
Parallel execution: Those who have reserved more than one core can now unleash the power of multi-core CPUs: add n_jobs=4 (assuming you reserved 4 cores) to your GridSearchCV constructor call:
clf = GridSearchCV(model_svc, tuned_parameters, cv=5, scoring='%s_macro' % score, n_jobs=4)
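Once a grid search has finished, the winning combination and the model refitted with it are available as attributes of the search object. The following compact sketch shows how they might be retrieved; it uses a deliberately tiny grid and synthetic data (both assumptions for illustration), and the names `param_grid`, `search`, and `best_model` are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the drone data (illustration only)
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

# A tiny grid so that the example finishes quickly: 2 x 2 = 4 combinations
param_grid = {'kernel': ['rbf'], 'C': [1, 10], 'gamma': [0.1, 0.01]}

search = GridSearchCV(SVC(), param_grid, cv=5, scoring='precision_macro')
search.fit(X, y)

print("best hyperparameters:", search.best_params_)
print("best mean CV score  :", search.best_score_)

# With the default refit=True, the best model is retrained on all of X, y
best_model = search.best_estimator_
print("predictions from best model:", best_model.predict(X[:5]))
```

With cv=5, each of the 4 combinations is trained 5 times, plus one final refit, which is why adding n_jobs can shorten the wall-clock time noticeably on larger grids.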
Key Points