Tuning the Machine Learning Model
Overview
Teaching: 20 min
Exercises: 20 min
Questions
What is model tuning and why do we need it?
What are the key procedures to tune a machine learning model for the best performance?
What are the hyperparameters that we need to adjust in the tuning process?
Objectives
Understand the different methods to tune a machine learning model.
In the previous episode, we learned how to build and train simple ML models using scikit-learn, then assessed their performance using metrics such as the accuracy score and the confusion matrix. For simplicity, we manually selected the features in the dataset to use for model training and inference. But we also saw that this manual process can be tedious, with many possible combinations to try.
In this episode, we will learn how to systematically improve the predictive performance of the ML model aimed at classifying the running smartphone apps based on their resource-usage signatures.
Prerequisites
This episode builds on the Python environment already set up in the previous episode. If you have not already done so, you must load the requisite Python modules and preprocess the SherLock dataset; your environment will then be ready for the machine-learning training step. Please execute the following steps if you started Python from scratch.
Solution: Preparing Python Modules and Dataset
First, load all the required Python modules and functions:
import os
import sys
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
import sklearn

# also add more tools:
from sklearn import preprocessing
from sklearn.model_selection import train_test_split

# machine learning models:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# for evaluating model performance
from sklearn.metrics import accuracy_score, confusion_matrix
Next, we load and preprocess the SherLock “2-apps” dataset as we did in the previous episode. All of the necessary steps are now placed in this code snippet:
df2 = pd.read_csv('sherlock/sherlock_mystery_2apps.csv')

# Remove irrelevant feature(s)
df2.drop('Unnamed: 0', axis=1, inplace=True)

# Remove rows with missing values
df2.dropna(inplace=True)

# Remove duplicate features
df2.drop('Mem', axis=1, inplace=True)

# Separate labels from features
df2_labels = df2['ApplicationName']
df2_features = df2.drop('ApplicationName', axis=1)

# Feature scaling
scaler = preprocessing.StandardScaler()
scaler.fit(df2_features)
df2_features_n = pd.DataFrame(scaler.transform(df2_features),
                              columns=df2_features.columns,
                              index=df2_features.index)
Check Your Data!
Before we go on, let us make sure that you have the correct data. Please examine the features after the normalization process:
print("Normalized features:") print(df2_features_n.head(10))
Normalized features:
        CPU_USAGE    cutime       lru  num_threads  otherPrivateDirty  \
176473  -0.159870 -0.429029 -0.041774    -1.300898          -0.780597
176474   4.129610 -0.429029 -0.041774     0.222698          -0.688933
176475   0.213345 -0.429029 -0.041774    -0.292636          -0.321111
176476  -0.159870 -0.429029 -0.041774    -1.300898          -0.785560
176477   3.935538 -0.429029 -0.041774     0.222698          -0.687036
176478   0.213345 -0.429029 -0.041774    -0.292636          -0.323008
176479  -0.159870 -0.429029 -0.041774    -1.300898          -0.785560
176480   3.791228 -0.429029 -0.041774     0.222698          -0.688349
176481   0.213345 -0.429029 -0.041774    -0.292636          -0.328701
176482  -0.159870 -0.429029 -0.041774    -1.300898          -0.786873

        priority     utime     vsize   cminflt  guest_time     queue
176473  0.246368 -0.847813 -0.558714 -0.698484   -0.841396 -0.244324
176474  0.246368 -0.705633  0.242407 -0.698484   -0.705121 -0.244324
176475  0.246368 -0.292064 -0.956849  0.537550   -0.302963 -0.244324
176476  0.246368 -0.847813 -0.558714 -0.698484   -0.841660 -0.244324
176477  0.246368 -0.705633  0.242407 -0.698484   -0.707196 -0.244324
176478  0.246368 -0.292064 -0.956849  0.537550   -0.293689 -0.244324
176479  0.246368 -0.847813 -0.558714 -0.698484   -0.852326 -0.244324
176480  0.246368 -0.705633  0.242407 -0.698484   -0.713160 -0.244324
176481  0.246368 -0.292064 -0.956849  0.537550   -0.287239 -0.244324
176482  0.246368 -0.847813 -0.558714 -0.698484   -0.850737 -0.244324
The contents of your `df2_features_n` dataframe should match the output printed above. At this stage, it is also a good idea to create a backup of the normalized feature matrix, in case we make a mistake later and need to revert:
df2_features_n_backup = df2_features_n.copy()
Feature Selection
In the previous episode, we discovered that the performance of an ML model may be strongly affected by the choice of features. Even an ML method that can potentially perform very well (e.g. a decision tree) may perform poorly when an inappropriate set of features is used in the modeling.
In ML modeling, generally speaking, we want to start with a handful of features (2-4) with the most predictive power. These are the features that have the strongest influence on the model’s output. How do we select such features? We need a way to reason about which columns can be dropped first, so that our model is as compact as possible. In this section, we will devise some ways to reason about the selection of features.
First, let’s review the existing features in the preprocessed “2-apps” SherLock dataset:
df2_features_n.columns
Index(['CPU_USAGE', 'cutime', 'lru', 'num_threads', 'otherPrivateDirty',
'priority', 'utime', 'vsize', 'cminflt', 'guest_time', 'queue'],
dtype='object')
Altogether, there are 11 features.
First, we want to find features that are very similar or even identical; we then drop the (near) duplicate features. We will use two complementary means to detect such duplicates:
- Histogram analysis
- Correlation analysis
Histogram Analysis
A histogram plot visualizes the distribution of values in a feature. Let’s make a panel of histograms for all the normalized features.
~~~python
# plt is a shorthand for matplotlib.pyplot
plt.figure(figsize=(10.0, 8.0))
for (i, col) in enumerate(df2_features_n.columns):
    # Creates a 4 row by 3 cols plot matrix
    plt.subplot(4, 3, i+1)
    plt.hist(df2_features_n[col], bins=50)
    plt.title(col)

plt.subplots_adjust(top=0.92, bottom=0.08, left=0.10, right=0.95,
                    hspace=0.75, wspace=0.35)
plt.show()
~~~
![Histogram ](ML-Session-3-devel_files/ML-Session-3-devel_26_0.png)
Visualizing histograms of multiple features in a panel form
is a powerful tool to identify features that are identical or very similar.
> ## Finding Identical or Similar Features
>
> From the histogram panel plot above,
> can you spot features that are suspected to be identical or similar?
>
> ## Finding Identical or Similar Features, Digging Deeper
>
> Repeat drawing the histogram panel above,
> but color the histogram differently for each category (`ApplicationName`)
> to confirm the identical features.
> Why is this step needed?
```python
df2_labels.unique()
```
array(['Facebook', 'WhatsApp'], dtype=object)
"""Separate the rows in the feature matrix based on the associated app names""";
Apps = df2_labels.unique()
indx_app = {}
features_app = {}
# The first loop filters the rows by the app names
# using the df2_labels
for app in Apps:
print("\nApp:", app)
indx_app[app] = df2_labels[df2_labels == app].index
print("Index:")
print(indx_app[app][:5])
features_app[app] = df2_features_n.loc[indx_app[app]]
print("Features:")
print(features_app[app].head(5))
App: Facebook
Index:
Int64Index([176473, 176474, 176476, 176477, 176479], dtype='int64')
Features:
CPU_USAGE cutime lru num_threads otherPrivateDirty \
176473 -0.159870 -0.429029 -0.041774 -1.300898 -0.780597
176474 4.129610 -0.429029 -0.041774 0.222698 -0.688933
176476 -0.159870 -0.429029 -0.041774 -1.300898 -0.785560
176477 3.935538 -0.429029 -0.041774 0.222698 -0.687036
176479 -0.159870 -0.429029 -0.041774 -1.300898 -0.785560
priority utime vsize cminflt guest_time queue
176473 0.246368 -0.847813 -0.558714 -0.698484 -0.841396 -0.244324
176474 0.246368 -0.705633 0.242407 -0.698484 -0.705121 -0.244324
176476 0.246368 -0.847813 -0.558714 -0.698484 -0.841660 -0.244324
176477 0.246368 -0.705633 0.242407 -0.698484 -0.707196 -0.244324
176479 0.246368 -0.847813 -0.558714 -0.698484 -0.852326 -0.244324
App: WhatsApp
Index:
Int64Index([176475, 176478, 176481, 176484, 176487], dtype='int64')
Features:
CPU_USAGE cutime lru num_threads otherPrivateDirty \
176475 0.213345 -0.429029 -0.041774 -0.292636 -0.321111
176478 0.213345 -0.429029 -0.041774 -0.292636 -0.323008
176481 0.213345 -0.429029 -0.041774 -0.292636 -0.328701
176484 0.213345 -0.429029 -0.041774 -0.270230 -0.324906
176487 0.213345 -0.429029 -0.041774 -0.270230 -0.324906
priority utime vsize cminflt guest_time queue
176475 0.246368 -0.292064 -0.956849 0.53755 -0.302963 -0.244324
176478 0.246368 -0.292064 -0.956849 0.53755 -0.293689 -0.244324
176481 0.246368 -0.292064 -0.956849 0.53755 -0.287239 -0.244324
176484 0.246368 -0.292064 -0.948734 0.53755 -0.298266 -0.244324
176487 0.246368 -0.292064 -0.948734 0.53755 -0.292894 -0.244324
"""Draw the multi-app histogram panel""";
plt.figure(figsize=(12.0, 9.0))
for (i, col) in enumerate(df2_features_n.columns):
    # Creates a 4 row by 3 cols plot matrix
    plt.subplot(4, 3, i+1)
    for app in Apps:
        plt.hist(features_app[app][col], bins=50)
    plt.title(col)
plt.subplots_adjust(top=0.92, bottom=0.08, left=0.10, right=0.95,
                    hspace=0.75, wspace=0.35)
plt.show()
QUESTIONS:
- From this second graph, further confirm that two of the features are identical.
- If you inspect the raw (unnormalized) values, are these two features still identical? This shows the value of normalizing the features: it exposes duplicate features that may be masked by a multiplicative factor (a quick check is sketched below).
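As a quick check (a minimal sketch; it assumes the unnormalized `df2_features` dataframe is still in memory and that `utime` and `guest_time` are the suspected pair), you can compare the raw columns directly:

```python
# Hypothetical check on a suspected duplicate pair (utime vs guest_time):
# are the raw columns exactly identical, or identical up to a scale factor?
pair = ('utime', 'guest_time')
print("Exactly identical:", df2_features[pair[0]].equals(df2_features[pair[1]]))
ratio = df2_features[pair[0]] / df2_features[pair[1]].replace(0, np.nan)
print("Ratio statistics (a near-constant ratio suggests a pure scale factor):")
print(ratio.describe())
```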
# Alternate version: bigger graphs, but only showing 2 apps here
"""
Run this code cell to generate a panel of raw data plots.
Be patient, it will take a few seconds to complete.
Take this code and adapt it for your own analysis.
Feel free to adjust the parameters.
""";
fig = plt.figure(figsize=(16.0, 14.0))
nx = 3
ny = 4
DF = df2_features_n
LABELS = df2_labels
columns = ( c for c in DF.columns if c != "ApplicationName" )
print("Visually inspecting individual values:")
for i, col in enumerate(columns):
    axes = fig.add_subplot(ny, nx, i+1)
    axes.set_xlabel(col)
    vals_FB = DF[LABELS == 'Facebook'][col]
    vals_WA = DF[LABELS == 'WhatsApp'][col]
    min_val = DF[col].min()
    max_val = DF[col].max()
    print('* ', col, ' range:', min_val, '..', max_val)
    plt.hist(vals_FB, label='Facebook', range=(min_val, max_val), bins=50)
    plt.hist(vals_WA, label='WhatsApp', range=(min_val, max_val), bins=50)
    plt.legend()
plt.subplots_adjust(top=0.92, bottom=0.08, left=0.10, right=0.95, hspace=0.25, wspace=0.15)
Visually inspecting individual values:
* CPU_USAGE range: -0.15987005513820488 .. 56.981770146128696
* cutime range: -0.42902903775337814 .. 5.414514283832063
* lru range: -0.04177386015423364 .. 27.71388642761829
* num_threads range: -1.5025506498298415 .. 2.5753095640085486
* otherPrivateDirty range: -0.7912521410158143 .. 11.032493022137361
* priority range: -11.517829624469842 .. 0.24636791222366086
* utime range: -0.8566225850847952 .. 6.445692365289096
* vsize range: -16.16567005070764 .. 4.080380746370644
* cminflt range: -0.6984842444974012 .. 3.1179573468024
* guest_time range: -0.8651330060797121 .. 6.456384348686843
* queue range: -0.24432361328717375 .. 20.43728803389054
3.2 Correlation
Next, we can further narrow the feature set using the correlation between feature pairs. Feature pairs that are highly correlated can be treated as duplicates, so we can drop one feature from each such pair. The pair correlations can be computed using the `DataFrame.corr()` method.
df2_corr = df2_features_n.corr()
df2_corr
 | CPU_USAGE | cutime | lru | num_threads | otherPrivateDirty | priority | utime | vsize | cminflt | guest_time | queue
---|---|---|---|---|---|---|---|---|---|---|---
CPU_USAGE | 1.000000 | 0.006790 | 0.167896 | 0.039330 | 0.197823 | 0.001379 | 0.095689 | 0.072699 | -0.000837 | 0.095685 | -0.000574 |
cutime | 0.006790 | 1.000000 | -0.017922 | -0.095443 | 0.120551 | 0.104557 | 0.151107 | -0.296729 | 0.594047 | 0.151105 | -0.102815 |
lru | 0.167896 | -0.017922 | 1.000000 | -0.043429 | -0.002386 | 0.009580 | 0.052039 | 0.005342 | -0.029178 | 0.052049 | -0.008956 |
num_threads | 0.039330 | -0.095443 | -0.043429 | 1.000000 | 0.529398 | -0.198157 | 0.503220 | 0.859857 | -0.143042 | 0.503206 | 0.195843 |
otherPrivateDirty | 0.197823 | 0.120551 | -0.002386 | 0.529398 | 1.000000 | 0.097185 | 0.630480 | 0.464462 | 0.238920 | 0.630457 | -0.095587 |
priority | 0.001379 | 0.104557 | 0.009580 | -0.198157 | 0.097185 | 1.000000 | 0.136242 | -0.174586 | 0.170894 | 0.136241 | -0.996884 |
utime | 0.095689 | 0.151107 | 0.052039 | 0.503220 | 0.630480 | 0.136242 | 1.000000 | 0.394805 | 0.414287 | 0.999975 | -0.134727 |
vsize | 0.072699 | -0.296729 | 0.005342 | 0.859857 | 0.464462 | -0.174586 | 0.394805 | 1.000000 | -0.491281 | 0.394797 | 0.172313 |
cminflt | -0.000837 | 0.594047 | -0.029178 | -0.143042 | 0.238920 | 0.170894 | 0.414287 | -0.491281 | 1.000000 | 0.414275 | -0.168564 |
guest_time | 0.095685 | 0.151105 | 0.052049 | 0.503206 | 0.630457 | 0.136241 | 0.999975 | 0.394797 | 0.414275 | 1.000000 | -0.134726 |
queue | -0.000574 | -0.102815 | -0.008956 | 0.195843 | -0.095587 | -0.996884 | -0.134727 | 0.172313 | -0.168564 | -0.134726 | 1.000000 |
The `.corr()` method returns a matrix of correlations between feature pairs.
The maximum value is 1 (perfectly correlated, i.e. identical), whereas the minimum value is -1 (perfectly anti-correlated).
A negative correlation means that an increase in one feature is accompanied by a decrease in the other.
We can use a heatmap to visualize the correlation matrix above and find the highly correlated feature pair(s) by using the `seaborn.heatmap()` function.
plt.figure(figsize=(10.0, 10.0))
sns.heatmap(df2_corr, annot=True, vmax=1, square=True, cmap="Blues")
<AxesSubplot:>
QUESTION: From the matrix or heatmap above, please
- Identify the three pairs whose correlation values are the strongest (closest to +1 or -1);
- Identify additional pairs whose correlation values are above 0.5 in absolute value.
Compare your observation with the similar features discovered by the histogram panel earlier! Are they the same pairs?
–> (Enter your responses here) <–
Based on our discussion above, we can definitely delete `vsize`, `queue`, and `guest_time` because of their very high correlations with three other features:
df2_features_n.drop(['vsize', 'queue', 'guest_time'], axis=1, inplace=True)
print(df2_features_n.columns)
Index(['CPU_USAGE', 'cutime', 'lru', 'num_threads', 'otherPrivateDirty',
'priority', 'utime', 'cminflt'],
dtype='object')
Eight features remaining!
The next pairs that can be considered for dropping are:
- (`otherPrivateDirty`, `utime`)
- (`cutime`, `cminflt`)
The first pair also shows similarity in the histogram visuals (see earlier plot).
We can drop `utime` and `cminflt` because of their marked correlations with the other two:
df2_features_n.drop(['utime', 'cminflt'], axis=1, inplace=True)
print(df2_features_n.columns)
Index(['CPU_USAGE', 'cutime', 'lru', 'num_threads', 'otherPrivateDirty',
'priority'],
dtype='object')
3.3 Simple Group Analysis
At this point, we have reduced our feature set to just six for the two applications (“WhatsApp” and “Facebook”).
The next thing we can consider is the distribution of each feature, grouped by the application category.
If a feature has similar value distributions for the two applications, it will have little power to discriminate between them.
Histograms can help uncover such similarities, but descriptive statistics provide a complementary way.
This can be achieved by employing the `.groupby()` method before computing the descriptive statistics.
We temporarily recombine the labels with the features to do this group analysis:
df2_with_label = df2_features_n.copy()
df2_with_label['ApplicationName'] = df2_labels
df2_with_label.head()
 | CPU_USAGE | cutime | lru | num_threads | otherPrivateDirty | priority | ApplicationName
---|---|---|---|---|---|---|---
176473 | -0.159870 | -0.429029 | -0.041774 | -1.300898 | -0.780597 | 0.246368 | Facebook
176474 | 4.129610 | -0.429029 | -0.041774 | 0.222698 | -0.688933 | 0.246368 | Facebook
176475 | 0.213345 | -0.429029 | -0.041774 | -0.292636 | -0.321111 | 0.246368 | WhatsApp
176476 | -0.159870 | -0.429029 | -0.041774 | -1.300898 | -0.785560 | 0.246368 | Facebook
176477 | 3.935538 | -0.429029 | -0.041774 | 0.222698 | -0.687036 | 0.246368 | Facebook
Let’s group the rows by application name using `.groupby()`, then look at the descriptive statistics of each feature for each app.
df2_with_label.groupby('ApplicationName')['CPU_USAGE'].describe()
ApplicationName | count | mean | std | min | 25% | 50% | 75% | max
---|---|---|---|---|---|---|---|---
Facebook | 379054.0 | -0.013990 | 1.193461 | -0.159870 | -0.159870 | -0.105132 | -0.075275 | 56.981770
WhatsApp | 233060.0 | 0.022753 | 0.555877 | -0.134989 | -0.075275 | -0.030489 | 0.014297 | 45.725618
df2_with_label.groupby('ApplicationName')['lru'].describe()
ApplicationName | count | mean | std | min | 25% | 50% | 75% | max
---|---|---|---|---|---|---|---|---
Facebook | 379054.0 | 0.025685 | 1.270086e+00 | -0.041774 | -0.041774 | -0.041774 | -0.041774 | 27.713886
WhatsApp | 233060.0 | -0.041774 | 4.322940e-15 | -0.041774 | -0.041774 | -0.041774 | -0.041774 | -0.041774
QUESTION: Observe how similar or dissimilar the statistical quantities (mean, standard deviation, as well as the quartiles) are between the two applications.
- Do the means of `CPU_USAGE` (for the different applications) overlap within their standard deviations?
- What about `lru`?
"""Compare the descriptive statistics of other features as well...""";
#TODO
for col in df2_features_n.columns:
if col not in ('CPU_USAGE', 'lru'):
print("Column:", col)
display(df2_with_label.groupby('ApplicationName')[col].describe())
Column: cutime

ApplicationName | count | mean | std | min | 25% | 50% | 75% | max
---|---|---|---|---|---|---|---|---
Facebook | 379054.0 | -0.429029 | 2.301717e-12 | -0.429029 | -0.429029 | -0.429029 | -0.429029 | -0.429029
WhatsApp | 233060.0 | 0.697782 | 1.356525e+00 | -0.429029 | -0.429029 | 0.544895 | 1.518819 | 5.414514

Column: num_threads

ApplicationName | count | mean | std | min | 25% | 50% | 75% | max
---|---|---|---|---|---|---|---|---
Facebook | 379054.0 | 0.130986 | 1.246569 | -1.502551 | -1.278492 | 0.267510 | 1.096525 | 2.575310
WhatsApp | 233060.0 | -0.213038 | 0.160584 | -1.502551 | -0.270230 | -0.203013 | -0.135795 | 0.603597

Column: otherPrivateDirty

ApplicationName | count | mean | std | min | 25% | 50% | 75% | max
---|---|---|---|---|---|---|---|---
Facebook | 379054.0 | -0.207624 | 1.032435 | -0.791252 | -0.779721 | -0.648356 | -0.213099 | 11.032493
WhatsApp | 233060.0 | 0.337685 | 0.841812 | -0.791252 | -0.263748 | 0.153994 | 0.793596 | 6.684450

Column: priority

ApplicationName | count | mean | std | min | 25% | 50% | 75% | max
---|---|---|---|---|---|---|---|---
Facebook | 379054.0 | -0.150299 | 1.241631 | -11.51783 | 0.246368 | 0.246368 | 0.246368 | 0.246368
WhatsApp | 233060.0 | 0.244450 | 0.150205 | -11.51783 | 0.246368 | 0.246368 | 0.246368 | 0.246368
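To complement reading these tables by eye, here is a minimal sketch of my own (not part of the original analysis) that computes a crude per-feature separation score: the absolute difference between the two group means divided by a pooled standard deviation. Features with a small score separate the two apps poorly:

```python
# Crude separation score per feature:
# |mean(Facebook) - mean(WhatsApp)| / pooled standard deviation.
grouped = df2_with_label.groupby('ApplicationName')
means, stds = grouped.mean(), grouped.std()
pooled_std = np.sqrt((stds.loc['Facebook']**2 + stds.loc['WhatsApp']**2) / 2)
separation = (means.loc['Facebook'] - means.loc['WhatsApp']).abs() / pooled_std
print(separation.sort_values())   # smallest scores = least discriminative features
```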
DECISION: After some exploration, we found that the averages of `CPU_USAGE` and `lru` for the two apps are much closer to each other than those of the other features. Thus, let us remove these two features.
df2_features_n.drop(['CPU_USAGE','lru'],axis=1,inplace=True)
df2_features_n.head(10)
 | cutime | num_threads | otherPrivateDirty | priority
---|---|---|---|---
176473 | -0.429029 | -1.300898 | -0.780597 | 0.246368 |
176474 | -0.429029 | 0.222698 | -0.688933 | 0.246368 |
176475 | -0.429029 | -0.292636 | -0.321111 | 0.246368 |
176476 | -0.429029 | -1.300898 | -0.785560 | 0.246368 |
176477 | -0.429029 | 0.222698 | -0.687036 | 0.246368 |
176478 | -0.429029 | -0.292636 | -0.323008 | 0.246368 |
176479 | -0.429029 | -1.300898 | -0.785560 | 0.246368 |
176480 | -0.429029 | 0.222698 | -0.688349 | 0.246368 |
176481 | -0.429029 | -0.292636 | -0.328701 | 0.246368 |
176482 | -0.429029 | -1.300898 | -0.786873 | 0.246368 |
3.4 Feature Selection Summary
We now have the four features we want: `cutime`, `num_threads`, `otherPrivateDirty`, `priority`.
# Save this featureset in a new variable:
df2_features_n1 = df2_features_n_backup[['cutime', 'num_threads', 'otherPrivateDirty', 'priority']]
Save these features to files for later use.
# We replace the categories from strings to numbers (0=Facebook, 1=WhatsApp)
# for several reasons: not only to save space, but also because the categories
# need to be 0s and 1s when we reuse this data in the neural network episode.
labels_save = df2_labels.replace(['Facebook', 'WhatsApp'], [0, 1])
labels_save.to_csv('sherlock_2apps_labels.csv',header=True,index=False)
df2_features_n1.to_csv('sherlock_2apps_features.csv',index=False)
labels_save.head(10)
176473 0
176474 0
176475 1
176476 0
176477 0
176478 1
176479 0
176480 0
176481 1
176482 0
Name: ApplicationName, dtype: int64
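As a quick sanity check (a small addition, simply re-reading the two files written above), you can verify that the saved CSV files round-trip as expected:

```python
# Reload the saved files and confirm their shapes and the 0/1 label encoding
features_check = pd.read_csv('sherlock_2apps_features.csv')
labels_check = pd.read_csv('sherlock_2apps_labels.csv')
print(features_check.shape, labels_check.shape)
print(labels_check['ApplicationName'].value_counts())
```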
3.5 Training and Validating Machine Learning Model
EXERCISES: Now follow the same procedure as in the previous notebook to train and validate the machine learning models (logistic regression and decision tree) using the newly selected features. Record the accuracy scores and the necessary details (such as the list of features and any tweaked hyperparameters) in your notebook or spreadsheet.
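The solution cells below call a helper named `model_evaluate()`, which was defined in the previous episode. If you started from scratch and no longer have it, here is a minimal sketch of such a helper, reconstructed to match the metrics printed in the outputs below (it may differ from the original in details):

```python
def model_evaluate(model, test_F, test_L):
    """Print the accuracy score and confusion matrix of a fitted model."""
    test_L_pred = model.predict(test_F)
    print("Evaluation by using model:", type(model).__name__)
    print("accuracy_score:", accuracy_score(test_L, test_L_pred))
    print("confusion_matrix:")
    print(confusion_matrix(test_L, test_L_pred))
```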
"""Train and validate the LogisticRegression model wih the new feature set""";
#train_F1, test_F1, train_L1, test_L1 = train_test_split(#TODO)
model_lr1 = LogisticRegression(solver='lbfgs')
#...TODO
Solution:
train_F1, test_F1, train_L1, test_L1 = train_test_split(df2_features_n1, df2_labels, test_size=0.2, random_state=162639729)
print("Model training with features:", list(df2_features_n1.columns))
model_lr1 = LogisticRegression(solver='lbfgs')
print("Training model_lr1")
%time model_lr1.fit(train_F1,train_L1)
model_dtc1 = DecisionTreeClassifier(criterion='entropy', max_depth=3, min_samples_split=8)
print("Training model_dtc1")
%time model_dtc1.fit(train_F1, train_L1)
model_evaluate(model_lr1, test_F1, test_L1)
model_evaluate(model_dtc1, test_F1, test_L1)
Model training with features: ['cutime', 'num_threads', 'otherPrivateDirty', 'priority']
Training model_lr1
CPU times: user 3.31 s, sys: 117 ms, total: 3.43 s
Wall time: 2.53 s
Training model_dtc1
CPU times: user 1.44 s, sys: 34 ms, total: 1.47 s
Wall time: 1.31 s
Evaluation by using model: LogisticRegression
accuracy_score: 0.8507878421538436
confusion_matrix:
[[73919 1978]
[16289 30237]]
Evaluation by using model: DecisionTreeClassifier
accuracy_score: 0.9871347704271256
confusion_matrix:
[[75310 587]
[ 988 45538]]
QUESTIONS:
- Compare the performance of the two trained models.
- Discuss which model may be better for our dataset and think about the possible reasons.
- Have we achieved the maximum accuracy of these methods that we saw in the previous notebook (`ML-session-2.ipynb`)? Why, or why not?
The last question is very important to ponder. If the current featureset is indeed a perfect reduced set of features, then the accuracy should be pretty close to the maximum possible accuracy. Otherwise there is still something amiss!
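One way to estimate that maximum achievable accuracy (a minimal sketch of my own, not part of the original notebook) is to train the same decision tree on the full, backed-up 11-feature matrix and compare the scores:

```python
# Baseline: the same decision tree trained on all 11 normalized features,
# to compare against the reduced four-feature model above.
train_Fa, test_Fa, train_La, test_La = train_test_split(
    df2_features_n_backup, df2_labels, test_size=0.2, random_state=162639729)
model_dtc_all = DecisionTreeClassifier(criterion='entropy', max_depth=3, min_samples_split=8)
model_dtc_all.fit(train_Fa, train_La)
model_evaluate(model_dtc_all, test_Fa, test_La)
```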
Post-Analysis: A Caveat on the Simple Group Analysis
Let's visually examine per-class boxplots for every feature, excluding the outliers: they are a small fraction of the data but can distort the main trends.
# Do a massive panel of per-class boxplots, one subplot per feature
fig = plt.figure(figsize=(16.0, 10.0))
nx = 3
ny = 5
columns = ( c for c in df2_with_label.columns if c != "ApplicationName" )
print("Visually inspecting value spread (Facebook vs WA datasets): ", end="")
for i, col in enumerate(columns):
    print(" ", col, sep="", end="")
    ax = fig.add_subplot(ny, nx, i+1)
    sns.boxplot(x='ApplicationName', y=col,
                data=df2_with_label, ax=ax, showfliers=False)
print()
Visually inspecting value spread (Facebook vs WA datasets): CPU_USAGE cutime lru num_threads otherPrivateDirty priority
These boxplots reveal some issues with the choices we made in the simple group analysis above: judged this way, the features that should be dropped first are `priority` and `lru`. Why did we come to a different conclusion earlier? Because the boxplots exclude the outliers and compare medians rather than means, whereas the simple group analysis relied on means computed over all values; the outliers may have skewed those means.
Alternative Feature Selection
display(df2_corr[ df2_corr.abs() > 0.5 ])
plt.figure(figsize=(10.0, 10.0))
sns.heatmap(df2_corr[ df2_corr.abs() > 0.5 ], annot=True, vmax=1, square=True, cmap="Blues")
 | CPU_USAGE | cutime | lru | num_threads | otherPrivateDirty | priority | utime | vsize | cminflt | guest_time | queue
---|---|---|---|---|---|---|---|---|---|---|---
CPU_USAGE | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
cutime | NaN | 1.000000 | NaN | NaN | NaN | NaN | NaN | NaN | 0.594047 | NaN | NaN |
lru | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
num_threads | NaN | NaN | NaN | 1.000000 | 0.529398 | NaN | 0.503220 | 0.859857 | NaN | 0.503206 | NaN |
otherPrivateDirty | NaN | NaN | NaN | 0.529398 | 1.000000 | NaN | 0.630480 | NaN | NaN | 0.630457 | NaN |
priority | NaN | NaN | NaN | NaN | NaN | 1.000000 | NaN | NaN | NaN | NaN | -0.996884 |
utime | NaN | NaN | NaN | 0.503220 | 0.630480 | NaN | 1.000000 | NaN | NaN | 0.999975 | NaN |
vsize | NaN | NaN | NaN | 0.859857 | NaN | NaN | NaN | 1.000000 | NaN | NaN | NaN |
cminflt | NaN | 0.594047 | NaN | NaN | NaN | NaN | NaN | NaN | 1.000000 | NaN | NaN |
guest_time | NaN | NaN | NaN | 0.503206 | 0.630457 | NaN | 0.999975 | NaN | NaN | 1.000000 | NaN |
queue | NaN | NaN | NaN | NaN | NaN | -0.996884 | NaN | NaN | NaN | NaN | 1.000000 |
<AxesSubplot:>
Of the following pairs, one feature in each pair should be dropped:
- (`utime`, `guest_time`)
- (`priority`, `queue`)
- (`vsize`, `num_threads`)
The next pairs that can be considered for dropping are:
- (`otherPrivateDirty`, `utime`)
- (`cutime`, `cminflt`)
# Using SelectKBest
from sklearn.feature_selection import SelectKBest, f_classif
fea_selector = SelectKBest(score_func=f_classif, k="all")
fea_selector.fit(df2_features_n_backup, df2_labels)
print(fea_selector.scores_)
# NOTE: We really care for the scores here, so we can manually make a cut
[1.94898967e+02 2.61546079e+05 6.57465535e+02 1.75712797e+04
4.61520066e+04 2.33471349e+04 7.00942800e+04 2.84285482e+05
2.35210130e+06 7.00867936e+04 2.26591254e+04]
feature_scores = pd.Series(fea_selector.scores_, index=df2_features_n_backup.columns)
feature_scores
CPU_USAGE 1.948990e+02
cutime 2.615461e+05
lru 6.574655e+02
num_threads 1.757128e+04
otherPrivateDirty 4.615201e+04
priority 2.334713e+04
utime 7.009428e+04
vsize 2.842855e+05
cminflt 2.352101e+06
guest_time 7.008679e+04
queue 2.265913e+04
dtype: float64
# Sort it, then we will select the most weighted features
feature_scores.sort_values(ascending=False)
cminflt 2.352101e+06
vsize 2.842855e+05
cutime 2.615461e+05
utime 7.009428e+04
guest_time 7.008679e+04
otherPrivateDirty 4.615201e+04
priority 2.334713e+04
queue 2.265913e+04
num_threads 1.757128e+04
lru 6.574655e+02
CPU_USAGE 1.948990e+02
dtype: float64
At this point, we combine the scoring above with the correlation analysis; it then becomes clearer which feature of each highly correlated pair should be dropped.
DECISION: Features to be selected: `cminflt`, `vsize`, `cutime`, `utime`.
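One way this combination could be automated (my own illustration of a simple greedy rule, not the procedure used originally): walk through the features from the highest to the lowest score and keep a feature only if it is not strongly correlated with one already kept. The 0.85 threshold is an arbitrary choice:

```python
# Greedy selection: keep high-scoring features that are not strongly correlated
# (|r| > 0.85, an arbitrary cutoff) with any feature already kept.
threshold = 0.85
selected = []
for feat in feature_scores.sort_values(ascending=False).index:
    if all(abs(df2_corr.loc[feat, kept]) < threshold for kept in selected):
        selected.append(feat)
    if len(selected) == 4:
        break
print("Selected features:", selected)
```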
# Suppose we run with k=4, still the scores are the same
fea_selector4 = SelectKBest(score_func=f_classif, k=4)
fea_selector4.fit(df2_features_n_backup, df2_labels)
fea_selector4.scores_
array([1.94898967e+02, 2.61546079e+05, 6.57465535e+02, 1.75712797e+04,
4.61520066e+04, 2.33471349e+04, 7.00942800e+04, 2.84285482e+05,
2.35210130e+06, 7.00867936e+04, 2.26591254e+04])
Machine Learning with new featureset: cminflt, vsize, cutime, utime
df2_features_n_backup.columns
Index(['CPU_USAGE', 'cutime', 'lru', 'num_threads', 'otherPrivateDirty',
'priority', 'utime', 'vsize', 'cminflt', 'guest_time', 'queue'],
dtype='object')
# Save this featureset in a new variable:
df2_features_n2 = df2_features_n_backup[['cminflt', 'vsize', 'cutime', 'utime']]
train_F2, test_F2, train_L2, test_L2 = train_test_split(df2_features_n2, df2_labels, test_size=0.2, random_state=162639729)
print("Model training with features:", list(df2_features_n2.columns))
model_lr2 = LogisticRegression(solver='lbfgs')
print("Training model_lr2")
%time model_lr2.fit(train_F2,train_L2)
model_dtc2 = DecisionTreeClassifier(criterion='entropy', max_depth=3, min_samples_split=8)
print("Training model_dtc2")
%time model_dtc2.fit(train_F2, train_L2)
model_evaluate(model_lr2, test_F2, test_L2)
model_evaluate(model_dtc2, test_F2, test_L2)
Model training with features: ['cminflt', 'vsize', 'cutime', 'utime']
Training model_lr2
CPU times: user 3.13 s, sys: 75.8 ms, total: 3.21 s
Wall time: 2.51 s
Training model_dtc2
CPU times: user 1.33 s, sys: 29.5 ms, total: 1.36 s
Wall time: 1.18 s
Evaluation by using model: LogisticRegression
accuracy_score: 0.9999836632005423
confusion_matrix:
[[75897 0]
[ 2 46524]]
Evaluation by using model: DecisionTreeClassifier
accuracy_score: 1.0
confusion_matrix:
[[75897 0]
[ 0 46526]]
4. Better Validation in the Training Phase
In the previous ML modeling, we used only the training dataset to train the model. The evaluation of a model’s performance should not rely on the training dataset, otherwise it would result in a biased performance score. We therefore held out a portion of the data as a test dataset to obtain an unbiased estimate of the performance. One problem remains: we do not know the uncertainty of this performance score (e.g. the accuracy score).
Here we introduce the k-fold cross-validation approach. In the k-fold cross-validation, the data is divided into k folds. The model is trained on k-1 folds with one fold held back for testing. This process gets repeated to ensure each fold of the dataset gets the chance to be the “test” set. Once the process is completed, we can summarize the evaluation metric using the mean and quantify its uncertainty using the measured standard deviation.
from sklearn import model_selection
kfold = model_selection.KFold(n_splits=10)
model_kfold = LogisticRegression(solver='lbfgs')
results_kfold = model_selection.cross_val_score(model_kfold, train_F1, train_L1, cv=kfold)
print("Accuracy: %.2f%%" % (results_kfold.mean()*100.0))
Accuracy: 84.95%
results_kfold
array([0.84572187, 0.85084441, 0.85057894, 0.8510282 , 0.84771999,
0.84955788, 0.85074231, 0.84955788, 0.84937409, 0.84976209])
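To quantify the spread mentioned earlier, we can also report the standard deviation across the folds alongside the mean (a small addition using the `results_kfold` array just computed):

```python
# Mean accuracy and its spread over the 10 folds
print("Accuracy: %.2f%% +/- %.2f%%"
      % (results_kfold.mean() * 100.0, results_kfold.std() * 100.0))
```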
This answer is consistent with the previous `train_test_split` approach.
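The same cross-validation machinery can drive hyperparameter tuning as well. Below is a minimal sketch (not part of the original notebook) using scikit-learn's `GridSearchCV` to search over two hyperparameters of the decision tree; the grid values are only examples, and the search may take a few minutes on this dataset:

```python
from sklearn.model_selection import GridSearchCV

# Example grid over two DecisionTreeClassifier hyperparameters
param_grid = {
    'max_depth': [2, 3, 4, 6],
    'min_samples_split': [2, 8, 32],
}
grid = GridSearchCV(DecisionTreeClassifier(criterion='entropy'),
                    param_grid, cv=kfold, scoring='accuracy')
grid.fit(train_F1, train_L1)
print("Best hyperparameters:", grid.best_params_)
print("Best cross-validated accuracy: %.2f%%" % (grid.best_score_ * 100.0))
```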
Answer Keys
Take a look at the file `solutions/ML-session-3-solutions.txt` if you need the answers to some of the questions asked in this notebook.
Key Points
The key methods for machine learning model tuning include feature selection and model hyperparameter adjustment.