
Tuning the Machine Learning Model

Overview

Teaching: 20 min
Exercises: 20 min
Questions
  • What is model tuning and why do we need it?

  • What are the key procedures to tune a machine learning model for the best performance?

  • What are the hyperparameters that we need to adjust in the tuning process?

Objectives
  • Understand the different methods used to tune a machine learning model.

In the previous episode, we learned how to build and train simple ML models using scikit-learn, then assessed the performance of these models using several metrics such as the accuracy score and the confusion matrix. For simplicity, we manually selected the features in the dataset that would be used in model training and inference. But we also saw that this manual process can be tedious, with many possible combinations to try.

In this episode, we will learn how to systematically improve the predictive performance of the ML model aimed at classifying the running smartphone apps based on their resource-usage signatures.

Prerequisites

This episode builds on the Python environment already set up in the previous episode. If you have not done so already, you must load the requisite Python modules and preprocess the SherLock dataset; then your environment will be ready for the machine learning training step. Please execute the following steps if you are starting Python from scratch.

Solution: Preparing Python Modules and Dataset

First, load all the required Python modules and functions:

import os
import sys
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
import sklearn

# also add more tools:
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
# machine learning models:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
# for evaluating model performance
from sklearn.metrics import accuracy_score, confusion_matrix
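
Later in this episode we will also reuse the model_evaluate helper function from the previous episode, which prints a model’s accuracy score and confusion matrix. If you no longer have it defined, the following minimal sketch reproduces the behavior assumed by the outputs shown below (the exact original definition may differ in details):

def model_evaluate(model, test_F, test_L):
    """Minimal re-creation of the previous episode's helper (assumed form)."""
    test_L_pred = model.predict(test_F)
    print("Evaluation by using model:", type(model).__name__)
    print("accuracy_score:", accuracy_score(test_L, test_L_pred))
    print("confusion_matrix: \n", confusion_matrix(test_L, test_L_pred))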

Next, we load and preprocess the SherLock “2-apps” dataset as we did in the previous episode. All of the necessary steps are now placed in this code snippet:

df2 = pd.read_csv('sherlock/sherlock_mystery_2apps.csv')

# Remove irrelevant feature(s)
df2.drop('Unnamed: 0', axis=1, inplace=True)

# Remove rows with missing values
df2.dropna(inplace=True)

# Remove duplicate features
df2.drop('Mem', axis=1, inplace=True)

# Separate labels from features
df2_labels = df2['ApplicationName']
df2_features = df2.drop('ApplicationName', axis=1)

# Feature scaling
scaler = preprocessing.StandardScaler()
scaler.fit(df2_features)
df2_features_n = pd.DataFrame(scaler.transform(df2_features),
                              columns=df2_features.columns,
                              index=df2_features.index)
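
The StandardScaler standardizes each feature to zero mean and unit variance, i.e. z = (x - mean) / std. As a quick optional sanity check (not part of the original lesson), verify that the standardized columns indeed show means near 0 and standard deviations near 1:

# Optional sanity check: each standardized column should have
# mean ~ 0 and standard deviation ~ 1
print(df2_features_n.mean().round(4))
print(df2_features_n.std().round(4))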

Check Your Data!

Before we go on, let us make sure that you have the correct data. Please examine the features after the normalization process:

print("Normalized features:")
print(df2_features_n.head(10))
Normalized features:
        CPU_USAGE    cutime       lru  num_threads  otherPrivateDirty  \
176473  -0.159870 -0.429029 -0.041774    -1.300898          -0.780597   
176474   4.129610 -0.429029 -0.041774     0.222698          -0.688933   
176475   0.213345 -0.429029 -0.041774    -0.292636          -0.321111   
176476  -0.159870 -0.429029 -0.041774    -1.300898          -0.785560   
176477   3.935538 -0.429029 -0.041774     0.222698          -0.687036   
176478   0.213345 -0.429029 -0.041774    -0.292636          -0.323008   
176479  -0.159870 -0.429029 -0.041774    -1.300898          -0.785560   
176480   3.791228 -0.429029 -0.041774     0.222698          -0.688349   
176481   0.213345 -0.429029 -0.041774    -0.292636          -0.328701   
176482  -0.159870 -0.429029 -0.041774    -1.300898          -0.786873   

        priority     utime     vsize   cminflt  guest_time     queue  
176473  0.246368 -0.847813 -0.558714 -0.698484   -0.841396 -0.244324  
176474  0.246368 -0.705633  0.242407 -0.698484   -0.705121 -0.244324  
176475  0.246368 -0.292064 -0.956849  0.537550   -0.302963 -0.244324  
176476  0.246368 -0.847813 -0.558714 -0.698484   -0.841660 -0.244324  
176477  0.246368 -0.705633  0.242407 -0.698484   -0.707196 -0.244324  
176478  0.246368 -0.292064 -0.956849  0.537550   -0.293689 -0.244324  
176479  0.246368 -0.847813 -0.558714 -0.698484   -0.852326 -0.244324  
176480  0.246368 -0.705633  0.242407 -0.698484   -0.713160 -0.244324  
176481  0.246368 -0.292064 -0.956849  0.537550   -0.287239 -0.244324  
176482  0.246368 -0.847813 -0.558714 -0.698484   -0.850737 -0.244324  

The contents of your df2_features_n dataframe should match the output printed above.

At this stage, it is also a good idea to create a backup of the normalized feature matrix, in case we make a mistake later and need to revert:

df2_features_n_backup = df2_features_n.copy()

Feature Selection

In the previous episode, we discovered that the performance of an ML model may be strongly affected by the choice of features. Even an ML method that can potentially perform very well (e.g. a decision tree) may perform poorly when an inappropriate set of features is used in the modeling.

In ML modeling, generally speaking, we want to start with a handful of features (2-4) that have the most predictive power, i.e. the strongest influence on the model’s output. How do we select such features? We need a principled way to decide which columns can be dropped first, so that our model is as compact as possible. In this section, we will devise some ways to reason about the selection of features.

First, let’s review the existing features in the preprocessed “2-apps” SherLock dataset:

df2_features_n.columns
Index(['CPU_USAGE', 'cutime', 'lru', 'num_threads', 'otherPrivateDirty',
       'priority', 'utime', 'vsize', 'cminflt', 'guest_time', 'queue'],
      dtype='object')

Altogether, there are 11 features.

First, we want to find features that are very similar or even identical, and then drop the (near) duplicates. We will use two complementary means to detect such duplicates: histogram analysis and pairwise correlation.

3.1 Histogram Analysis

A histogram provides a visualization of the distribution of values in a feature. Let’s make a panel of histograms for all the normalized features:

# plt is a shorthand for matplotlib.pyplot
plt.figure(figsize=(10.0, 8.0))
for (i, col) in enumerate(df2_features_n.columns):
    # Creates a 4 row by 3 cols plot matrix
    plt.subplot(4, 3, i+1)
    plt.hist(df2_features_n[col], bins=50)
    plt.title(col)

plt.subplots_adjust(top=0.92, bottom=0.08, left=0.10, right=0.95,
                    hspace=0.75, wspace=0.35)
plt.show()


(Figure: panel of histograms for all 11 normalized features)

Visualizing histograms of multiple features in a panel form
is a powerful tool to identify features that are identical or very similar.

> ## Finding Identical or Similar Features
>
> From the histogram panel plot above,
> can you spot features that are suspected to be identical or similar?
>

> ## Finding Identical or Similar Features, Digging Deeper
>
> Repeat drawing the histogram panel above,
> but color the histogram differently for each category (`ApplicationName`)
> to confirm the identical features.
> Why is this step needed?


df2_labels.unique()
array(['Facebook', 'WhatsApp'], dtype=object)

"""Separate the rows in the feature matrix based on the associated app names""";
Apps = df2_labels.unique()
indx_app = {}
features_app = {}
# This loop filters the rows by the app names
# using the df2_labels
for app in Apps:
    print("\nApp:", app)
    indx_app[app] = df2_labels[df2_labels == app].index
    print("Index:")
    print(indx_app[app][:5])
    features_app[app] = df2_features_n.loc[indx_app[app]]
    print("Features:")
    print(features_app[app].head(5))
App: Facebook
Index:
Int64Index([176473, 176474, 176476, 176477, 176479], dtype='int64')
Features:
        CPU_USAGE    cutime       lru  num_threads  otherPrivateDirty  \
176473  -0.159870 -0.429029 -0.041774    -1.300898          -0.780597   
176474   4.129610 -0.429029 -0.041774     0.222698          -0.688933   
176476  -0.159870 -0.429029 -0.041774    -1.300898          -0.785560   
176477   3.935538 -0.429029 -0.041774     0.222698          -0.687036   
176479  -0.159870 -0.429029 -0.041774    -1.300898          -0.785560   

        priority     utime     vsize   cminflt  guest_time     queue  
176473  0.246368 -0.847813 -0.558714 -0.698484   -0.841396 -0.244324  
176474  0.246368 -0.705633  0.242407 -0.698484   -0.705121 -0.244324  
176476  0.246368 -0.847813 -0.558714 -0.698484   -0.841660 -0.244324  
176477  0.246368 -0.705633  0.242407 -0.698484   -0.707196 -0.244324  
176479  0.246368 -0.847813 -0.558714 -0.698484   -0.852326 -0.244324  

App: WhatsApp
Index:
Int64Index([176475, 176478, 176481, 176484, 176487], dtype='int64')
Features:
        CPU_USAGE    cutime       lru  num_threads  otherPrivateDirty  \
176475   0.213345 -0.429029 -0.041774    -0.292636          -0.321111   
176478   0.213345 -0.429029 -0.041774    -0.292636          -0.323008   
176481   0.213345 -0.429029 -0.041774    -0.292636          -0.328701   
176484   0.213345 -0.429029 -0.041774    -0.270230          -0.324906   
176487   0.213345 -0.429029 -0.041774    -0.270230          -0.324906   

        priority     utime     vsize  cminflt  guest_time     queue  
176475  0.246368 -0.292064 -0.956849  0.53755   -0.302963 -0.244324  
176478  0.246368 -0.292064 -0.956849  0.53755   -0.293689 -0.244324  
176481  0.246368 -0.292064 -0.956849  0.53755   -0.287239 -0.244324  
176484  0.246368 -0.292064 -0.948734  0.53755   -0.298266 -0.244324  
176487  0.246368 -0.292064 -0.948734  0.53755   -0.292894 -0.244324  
"""Draw the multi-app histogram panel""";
pyplot.figure(figsize=(12.0, 9.0))
for (i, col) in enumerate(df2_features_n.columns):
    # Creates a 4 row by 3 cols plot matrix
    pyplot.subplot(4,3,i+1)
    for app in Apps:
        pyplot.hist(features_app[app][col], bins=50)
    pyplot.title(col)

pyplot.subplots_adjust(top=0.92, bottom=0.08, left=0.10, right=0.95, hspace=0.75,
                       wspace=0.35)
pyplot.show()

(Figure: per-app histogram panel; each feature’s histogram is colored by application)

# Alternate version: bigger graphs, but only showing 2 apps here
"""
Run this code cell to generate a panel of raw data plots.
Be patient, it will take a few seconds to complete.
Take this code and adapt it for your own analysis.
Feel free to adjust the parameters.
""";

fig = plt.figure(figsize=(16.0, 14.0))
nx = 3
ny = 4
DF = df2_features_n
LABELS = df2_labels
columns = ( c for c in DF.columns if c != "ApplicationName" )

print("Visually inspecting individual values:")
for i, col in enumerate(columns):
    axes = fig.add_subplot(ny, nx, i+1)
    axes.set_xlabel(col)

    vals_FB = DF[LABELS == 'Facebook'][col]
    vals_WA = DF[LABELS == 'WhatsApp'][col]
    min_val = DF[col].min()
    max_val = DF[col].max()

    print('* ', col, '  range:', min_val, '..', max_val)

    plt.hist(vals_FB, label='Facebook', range=(min_val, max_val), bins=50)
    plt.hist(vals_WA, label='WhatsApp', range=(min_val, max_val), bins=50)
    plt.legend()

plt.subplots_adjust(top=0.92, bottom=0.08, left=0.10, right=0.95, hspace=0.25, wspace=0.15)
Visually inspecting individual values:
*  CPU_USAGE   range: -0.15987005513820488 .. 56.981770146128696
*  cutime   range: -0.42902903775337814 .. 5.414514283832063
*  lru   range: -0.04177386015423364 .. 27.71388642761829
*  num_threads   range: -1.5025506498298415 .. 2.5753095640085486
*  otherPrivateDirty   range: -0.7912521410158143 .. 11.032493022137361
*  priority   range: -11.517829624469842 .. 0.24636791222366086
*  utime   range: -0.8566225850847952 .. 6.445692365289096
*  vsize   range: -16.16567005070764 .. 4.080380746370644
*  cminflt   range: -0.6984842444974012 .. 3.1179573468024
*  guest_time   range: -0.8651330060797121 .. 6.456384348686843
*  queue   range: -0.24432361328717375 .. 20.43728803389054

(Figure: enlarged per-app histogram panel with the per-feature value ranges listed above)

3.2 Correlation

Next, we can do further feature selection using the correlation between each pair of features. Feature pairs that are highly correlated can be regarded as duplicates, so we can drop one feature from each such pair. The pairwise correlations can be computed using the DataFrame.corr() method.

df2_corr = df2_features_n.corr()
df2_corr
CPU_USAGE cutime lru num_threads otherPrivateDirty priority utime vsize cminflt guest_time queue
CPU_USAGE 1.000000 0.006790 0.167896 0.039330 0.197823 0.001379 0.095689 0.072699 -0.000837 0.095685 -0.000574
cutime 0.006790 1.000000 -0.017922 -0.095443 0.120551 0.104557 0.151107 -0.296729 0.594047 0.151105 -0.102815
lru 0.167896 -0.017922 1.000000 -0.043429 -0.002386 0.009580 0.052039 0.005342 -0.029178 0.052049 -0.008956
num_threads 0.039330 -0.095443 -0.043429 1.000000 0.529398 -0.198157 0.503220 0.859857 -0.143042 0.503206 0.195843
otherPrivateDirty 0.197823 0.120551 -0.002386 0.529398 1.000000 0.097185 0.630480 0.464462 0.238920 0.630457 -0.095587
priority 0.001379 0.104557 0.009580 -0.198157 0.097185 1.000000 0.136242 -0.174586 0.170894 0.136241 -0.996884
utime 0.095689 0.151107 0.052039 0.503220 0.630480 0.136242 1.000000 0.394805 0.414287 0.999975 -0.134727
vsize 0.072699 -0.296729 0.005342 0.859857 0.464462 -0.174586 0.394805 1.000000 -0.491281 0.394797 0.172313
cminflt -0.000837 0.594047 -0.029178 -0.143042 0.238920 0.170894 0.414287 -0.491281 1.000000 0.414275 -0.168564
guest_time 0.095685 0.151105 0.052049 0.503206 0.630457 0.136241 0.999975 0.394797 0.414275 1.000000 -0.134726
queue -0.000574 -0.102815 -0.008956 0.195843 -0.095587 -0.996884 -0.134727 0.172313 -0.168564 -0.134726 1.000000

The .corr() method returns the matrix of correlations between feature pairs. The maximum value is 1 (perfectly correlated, i.e. identical), whereas the minimum value is -1 (perfectly anti-correlated). A negative correlation means that an increase in one feature is accompanied by a decrease in the other.
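
Scanning an 11×11 matrix by eye is error-prone. As a convenience (this helper is not part of the original lesson), we can list each feature pair once, sorted by the magnitude of its correlation; the 0.5 cutoff below is an arbitrary illustrative choice:

# Keep only the upper triangle so each pair appears once, then flatten
pairs = df2_corr.where(np.triu(np.ones(df2_corr.shape, dtype=bool), k=1))
pairs = pairs.stack().sort_values(key=abs, ascending=False)  # key= needs pandas >= 1.1
print(pairs[pairs.abs() > 0.5])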

We can use a heatmap to visualize the correlation matrix above and find the highly-correlated feature pair(s) by using the seaborn.heatmap() function.

plt.figure(figsize=(10.0, 10.0))
sns.heatmap(df2_corr, annot=True, vmax=1, square=True, cmap="Blues")
<AxesSubplot:>

(Figure: annotated heatmap of the correlation matrix)

QUESTION: From the matrix or heatmap above, please identify the feature pairs that are very highly correlated (absolute correlation close to 1).

Compare your observation with the similar features discovered by the histogram panel earlier! Are they the same pairs?

–> (Enter your responses here) <–

Based on our discussion above, we can definitely delete vsize, queue, and guest_time because of their very high correlations with three other features (vsize with num_threads, queue with priority, and guest_time with utime):

df2_features_n.drop(['vsize', 'queue', 'guest_time'], axis=1, inplace=True)
print(df2_features_n.columns)
Index(['CPU_USAGE', 'cutime', 'lru', 'num_threads', 'otherPrivateDirty',
       'priority', 'utime', 'cminflt'],
      dtype='object')

Eight features remaining!

The next pairs that can be considered for dropping would be:

  • utime and otherPrivateDirty (correlation ≈ 0.63)

  • cminflt and cutime (correlation ≈ 0.59)

The first pair also shows similarity in the histogram visuals (see the earlier plot). We can drop utime and cminflt because of their marked correlations with the other two features.

df2_features_n.drop(['utime', 'cminflt'], axis=1, inplace=True)
print(df2_features_n.columns)
Index(['CPU_USAGE', 'cutime', 'lru', 'num_threads', 'otherPrivateDirty',
       'priority'],
      dtype='object')

3.3 Simple Group Analysis

At this point, we have reduced our feature set to just six features for the two applications (“WhatsApp” and “Facebook”). The next thing we can consider is the distribution of each feature grouped by the application category. If a feature has nearly the same distribution for both applications, it carries little power to discriminate between them. Histograms can help uncover such similarities, but descriptive statistics provide a complementary view. This can be achieved by employing the .groupby() method before computing the descriptive statistics.

We temporarily recombine the labels with the features to do this group analysis:

df2_with_label = df2_features_n.copy()
df2_with_label['ApplicationName'] = df2_labels
df2_with_label.head()
CPU_USAGE cutime lru num_threads otherPrivateDirty priority ApplicationName
176473 -0.159870 -0.429029 -0.041774 -1.300898 -0.780597 0.246368 Facebook
176474 4.129610 -0.429029 -0.041774 0.222698 -0.688933 0.246368 Facebook
176475 0.213345 -0.429029 -0.041774 -0.292636 -0.321111 0.246368 WhatsApp
176476 -0.159870 -0.429029 -0.041774 -1.300898 -0.785560 0.246368 Facebook
176477 3.935538 -0.429029 -0.041774 0.222698 -0.687036 0.246368 Facebook

Let’s group the feature values by the app name using .groupby(), then compute the descriptive statistics of each feature for each app:

df2_with_label.groupby('ApplicationName')['CPU_USAGE'].describe()
count mean std min 25% 50% 75% max
ApplicationName
Facebook 379054.0 -0.013990 1.193461 -0.159870 -0.159870 -0.105132 -0.075275 56.981770
WhatsApp 233060.0 0.022753 0.555877 -0.134989 -0.075275 -0.030489 0.014297 45.725618
df2_with_label.groupby('ApplicationName')['lru'].describe()
count mean std min 25% 50% 75% max
ApplicationName
Facebook 379054.0 0.025685 1.270086e+00 -0.041774 -0.041774 -0.041774 -0.041774 27.713886
WhatsApp 233060.0 -0.041774 4.322940e-15 -0.041774 -0.041774 -0.041774 -0.041774 -0.041774

QUESTION: Observe how similar or dissimilar the statistical quantities (the mean, the standard deviation, as well as the quartiles) are:

  1. Do the means of CPU_USAGE (for the different applications) overlap within their standard deviations?
  2. What about lru?
"""Compare the descriptive statistics of other features as well...""";
#TODO
#RUNIT
for col in df2_features_n.columns:
    if col not in ('CPU_USAGE', 'lru'):
        print("Column:", col)
        display(df2_with_label.groupby('ApplicationName')[col].describe())
Column: cutime
count mean std min 25% 50% 75% max
ApplicationName
Facebook 379054.0 -0.429029 2.301717e-12 -0.429029 -0.429029 -0.429029 -0.429029 -0.429029
WhatsApp 233060.0 0.697782 1.356525e+00 -0.429029 -0.429029 0.544895 1.518819 5.414514
Column: num_threads
count mean std min 25% 50% 75% max
ApplicationName
Facebook 379054.0 0.130986 1.246569 -1.502551 -1.278492 0.267510 1.096525 2.575310
WhatsApp 233060.0 -0.213038 0.160584 -1.502551 -0.270230 -0.203013 -0.135795 0.603597
Column: otherPrivateDirty
count mean std min 25% 50% 75% max
ApplicationName
Facebook 379054.0 -0.207624 1.032435 -0.791252 -0.779721 -0.648356 -0.213099 11.032493
WhatsApp 233060.0 0.337685 0.841812 -0.791252 -0.263748 0.153994 0.793596 6.684450
Column: priority
count mean std min 25% 50% 75% max
ApplicationName
Facebook 379054.0 -0.150299 1.241631 -11.51783 0.246368 0.246368 0.246368 0.246368
WhatsApp 233060.0 0.244450 0.150205 -11.51783 0.246368 0.246368 0.246368 0.246368


DECISION: After some exploration, we find that the averages of CPU_USAGE and lru for the two different apps are much closer to each other than those of the other features. Thus let us remove these two features.

df2_features_n.drop(['CPU_USAGE','lru'],axis=1,inplace=True)
df2_features_n.head(10)
cutime num_threads otherPrivateDirty priority
176473 -0.429029 -1.300898 -0.780597 0.246368
176474 -0.429029 0.222698 -0.688933 0.246368
176475 -0.429029 -0.292636 -0.321111 0.246368
176476 -0.429029 -1.300898 -0.785560 0.246368
176477 -0.429029 0.222698 -0.687036 0.246368
176478 -0.429029 -0.292636 -0.323008 0.246368
176479 -0.429029 -1.300898 -0.785560 0.246368
176480 -0.429029 0.222698 -0.688349 0.246368
176481 -0.429029 -0.292636 -0.328701 0.246368
176482 -0.429029 -1.300898 -0.786873 0.246368

3.4 Feature Selection Summary

We now have the four features we want: cutime, num_threads, otherPrivateDirty, priority.

# Save this featureset in a new variable:
df2_features_n1 = df2_features_n_backup[['cutime', 'num_threads', 'otherPrivateDirty', 'priority']]

Save these features into a file for later use.

# We replace the categories from strings to numbers (0=Facebook, 1=WhatsApp)
# for several reasons: not only to save space, but also because a later
# episode (neural networks) will need the categories encoded as 0s and 1s.
labels_save = df2_labels.replace(['Facebook', 'WhatsApp'], [0, 1])
labels_save.to_csv('sherlock_2apps_labels.csv',header=True,index=False)

df2_features_n1.to_csv('sherlock_2apps_features.csv',index=False)
labels_save.head(10)
176473    0
176474    0
176475    1
176476    0
176477    0
176478    1
176479    0
176480    0
176481    1
176482    0
Name: ApplicationName, dtype: int64

3.5 Training and Validating the Machine Learning Model

EXERCISE: Now follow the same procedure as elaborated in the previous episode to train and validate the machine learning models (logistic regression and decision tree) with the newly selected features. Record the accuracy scores and the necessary details (such as the list of features and the tweaked hyperparameters) in your notebook or spreadsheet.

"""Train and validate the LogisticRegression model wih the new feature set""";

#train_F1, test_F1, train_L1, test_L1 = train_test_split(#TODO)
model_lr1 = LogisticRegression(solver='lbfgs')
#...TODO
#RUNIT
train_F1, test_F1, train_L1, test_L1 = train_test_split(df2_features_n1, df2_labels, test_size=0.2, random_state=162639729)

print("Model training with features:", list(df2_features_n1.columns))
model_lr1 = LogisticRegression(solver='lbfgs')
print("Training model_lr1")
%time model_lr1.fit(train_F1,train_L1)
model_dtc1 = DecisionTreeClassifier(criterion='entropy', max_depth=3, min_samples_split=8)
print("Training model_dtc1")
%time model_dtc1.fit(train_F1, train_L1)
model_evaluate(model_lr1, test_F1, test_L1)
model_evaluate(model_dtc1, test_F1, test_L1)
Model training with features: ['cutime', 'num_threads', 'otherPrivateDirty', 'priority']
Training model_lr1
CPU times: user 3.31 s, sys: 117 ms, total: 3.43 s
Wall time: 2.53 s
Training model_dtc1
CPU times: user 1.44 s, sys: 34 ms, total: 1.47 s
Wall time: 1.31 s
Evaluation by using model: LogisticRegression
accuracy_score: 0.8507878421538436
confusion_matrix: 
 [[73919  1978]
 [16289 30237]]
Evaluation by using model: DecisionTreeClassifier
accuracy_score: 0.9871347704271256
confusion_matrix: 
 [[75310   587]
 [  988 45538]]


QUESTION: Do you think the current feature set already gives the best accuracy that this dataset can support?

This question is very important to ponder. If the current feature set were indeed a perfect reduced set of features, then the accuracy should be pretty close to the maximum possible accuracy. Otherwise, there is still something amiss!

(Developer’s Notes) Post-analysis, June 2021

Let’s visually examine the boxplots of the two categories for every feature, excluding the outliers: they are a small fraction of the data, but they disturb the main trends.

# Do a massive panel boxplot
fig = plt.figure(figsize=(16.0, 10.0))
nx = 3
ny = 5
columns = ( c for c in df2_with_label.columns if c != "ApplicationName" )

print("Visually inspecting value spread (Facebook vs WA datasets): ", end="")
for i, col in enumerate(columns):
    print(" ", col, sep="", end="")
    ax = fig.add_subplot(ny, nx, i+1)
    sns.boxplot(x='ApplicationName', y=col,
                data=df2_with_label, ax=ax, showfliers=False)
print()
Visually inspecting value spread (Facebook vs WA datasets):  CPU_USAGE cutime lru num_threads otherPrivateDirty priority

(Figure: per-app boxplot panel of the six remaining features, outliers hidden)

WP Comment 20210614 – We found some issues with the choices above based on the “simple group analysis”. The features that need to be dropped immediately are priority and lru. Why did we come to a different conclusion here? Because the boxplot analysis above excludes the outliers and considers the medians instead of the means; the outliers may have skewed the earlier “simple group analysis”.
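
To see the per-app medians that this comment refers to (a one-line check, not part of the original notebook):

# Per-app medians are robust to outliers, unlike the means used earlier
print(df2_with_label.groupby('ApplicationName').median())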


Alternative Feature Selection (UNDER CONSTRUCTION)

display(df2_corr[ df2_corr.abs() > 0.5 ])
plt.figure(figsize=(10.0, 10.0))
sns.heatmap(df2_corr[ df2_corr.abs() > 0.5 ], annot=True, vmax=1, square=True, cmap="Blues")
CPU_USAGE cutime lru num_threads otherPrivateDirty priority utime vsize cminflt guest_time queue
CPU_USAGE 1.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
cutime NaN 1.000000 NaN NaN NaN NaN NaN NaN 0.594047 NaN NaN
lru NaN NaN 1.0 NaN NaN NaN NaN NaN NaN NaN NaN
num_threads NaN NaN NaN 1.000000 0.529398 NaN 0.503220 0.859857 NaN 0.503206 NaN
otherPrivateDirty NaN NaN NaN 0.529398 1.000000 NaN 0.630480 NaN NaN 0.630457 NaN
priority NaN NaN NaN NaN NaN 1.000000 NaN NaN NaN NaN -0.996884
utime NaN NaN NaN 0.503220 0.630480 NaN 1.000000 NaN NaN 0.999975 NaN
vsize NaN NaN NaN 0.859857 NaN NaN NaN 1.000000 NaN NaN NaN
cminflt NaN 0.594047 NaN NaN NaN NaN NaN NaN 1.000000 NaN NaN
guest_time NaN NaN NaN 0.503206 0.630457 NaN 0.999975 NaN NaN 1.000000 NaN
queue NaN NaN NaN NaN NaN -0.996884 NaN NaN NaN NaN 1.000000
<AxesSubplot:>

(Figure: heatmap of the correlation matrix, masked to |correlation| > 0.5)

Of the following pairs, one feature in each pair should be dropped:

  • utime and guest_time (correlation ≈ 1.00)

  • priority and queue (correlation ≈ -1.00)

  • num_threads and vsize (correlation ≈ 0.86)

The next pairs that can be considered for dropping would be:

  • otherPrivateDirty and utime (or guest_time), correlation ≈ 0.63

  • cutime and cminflt (correlation ≈ 0.59)

# Using SelectKBest
from sklearn.feature_selection import SelectKBest, f_classif
fea_selector = SelectKBest(score_func=f_classif, k="all")
fea_selector.fit(df2_features_n_backup, df2_labels)
print(fea_selector.scores_)
# NOTE: We really care about the scores here, so that we can make the cut manually
[1.94898967e+02 2.61546079e+05 6.57465535e+02 1.75712797e+04
 4.61520066e+04 2.33471349e+04 7.00942800e+04 2.84285482e+05
 2.35210130e+06 7.00867936e+04 2.26591254e+04]
feature_scores = pd.Series(fea_selector.scores_, index=df2_features_n_backup.columns)
feature_scores
CPU_USAGE            1.948990e+02
cutime               2.615461e+05
lru                  6.574655e+02
num_threads          1.757128e+04
otherPrivateDirty    4.615201e+04
priority             2.334713e+04
utime                7.009428e+04
vsize                2.842855e+05
cminflt              2.352101e+06
guest_time           7.008679e+04
queue                2.265913e+04
dtype: float64
# Sort the scores, then select the most heavily weighted features
feature_scores.sort_values(ascending=False)
cminflt              2.352101e+06
vsize                2.842855e+05
cutime               2.615461e+05
utime                7.009428e+04
guest_time           7.008679e+04
otherPrivateDirty    4.615201e+04
priority             2.334713e+04
queue                2.265913e+04
num_threads          1.757128e+04
lru                  6.574655e+02
CPU_USAGE            1.948990e+02
dtype: float64

At this point, we combine the scoring above with the correlation analyses; then it becomes clearer which feature to drop from each correlated pair, as sketched below.
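
A sketch of this combination (the decision rule below is our illustrative assumption, not spelled out in the notebook): within each highly correlated pair, keep the member with the higher SelectKBest score.

# For each highly correlated pair, keep the feature with the higher score
for f1, f2 in [('utime', 'guest_time'), ('priority', 'queue'),
               ('num_threads', 'vsize')]:
    keep = f1 if feature_scores[f1] >= feature_scores[f2] else f2
    print("pair (%s, %s): keep %s" % (f1, f2, keep))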

DECISION: Features to be selected: cminflt, vsize, cutime, utime

# Suppose we run with k=4 -- the scores stay the same
fea_selector4 = SelectKBest(score_func=f_classif, k=4)
fea_selector4.fit(df2_features_n_backup, df2_labels)
fea_selector4.scores_
array([1.94898967e+02, 2.61546079e+05, 6.57465535e+02, 1.75712797e+04,
       4.61520066e+04, 2.33471349e+04, 7.00942800e+04, 2.84285482e+05,
       2.35210130e+06, 7.00867936e+04, 2.26591254e+04])
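
To confirm which four features the k=4 selector keeps, we can query its boolean support mask (a short sketch, not part of the original notebook):

# Names of the k=4 selected features (the four highest scores above)
selected = df2_features_n_backup.columns[fea_selector4.get_support()]
print(list(selected))   # expected from the scores: cutime, utime, vsize, cminflt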


Machine Learning with new featureset: cminflt, vsize, cutime, utime

df2_features_n_backup.columns
Index(['CPU_USAGE', 'cutime', 'lru', 'num_threads', 'otherPrivateDirty',
       'priority', 'utime', 'vsize', 'cminflt', 'guest_time', 'queue'],
      dtype='object')
# Save this featureset in a new variable:
df2_features_n2 = df2_features_n_backup[['cminflt', 'vsize', 'cutime', 'utime']]
train_F2, test_F2, train_L2, test_L2 = train_test_split(df2_features_n2, df2_labels, test_size=0.2, random_state=162639729)

print("Model training with features:", list(df2_features_n2.columns))
model_lr2 = LogisticRegression(solver='lbfgs')
print("Training model_lr2")
%time model_lr2.fit(train_F2,train_L2)
model_dtc2 = DecisionTreeClassifier(criterion='entropy', max_depth=3, min_samples_split=8)
print("Training model_dtc2")
%time model_dtc2.fit(train_F2, train_L2)
model_evaluate(model_lr2, test_F2, test_L2)
model_evaluate(model_dtc2, test_F2, test_L2)
Model training with features: ['cminflt', 'vsize', 'cutime', 'utime']
Training model_lr2
CPU times: user 3.13 s, sys: 75.8 ms, total: 3.21 s
Wall time: 2.51 s
Training model_dtc2
CPU times: user 1.33 s, sys: 29.5 ms, total: 1.36 s
Wall time: 1.18 s
Evaluation by using model: LogisticRegression
accuracy_score: 0.9999836632005423
confusion_matrix: 
 [[75897     0]
 [    2 46524]]
Evaluation by using model: DecisionTreeClassifier
accuracy_score: 1.0
confusion_matrix: 
 [[75897     0]
 [    0 46526]]


4. Better Validation in the Training Phase

In the previous ML modeling, we used a single split of the data: the model was trained on the training dataset and evaluated once on the held-out test dataset. The evaluation of a model’s performance should not rely on the training dataset itself, otherwise it would result in a biased performance score; the held-out test dataset gives an unbiased estimate of the performance. One problem remains: we do not know the uncertainty of this performance score (e.g. the accuracy score).

Here we introduce the k-fold cross-validation approach. In k-fold cross-validation, the data is divided into k folds. The model is trained on k-1 folds, with one fold held back for testing. This process is repeated so that each fold of the dataset gets a chance to be the “test” set. Once the process is complete, we can summarize the evaluation metric using its mean and quantify its uncertainty using the measured standard deviation.

from sklearn import model_selection

kfold = model_selection.KFold(n_splits=10)
model_kfold = LogisticRegression(solver='lbfgs')
results_kfold = model_selection.cross_val_score(model_kfold, train_F1, train_L1, cv=kfold)
print("Accuracy: %.2f%%" % (results_kfold.mean()*100.0)) 
Accuracy: 84.95%
results_kfold
array([0.84572187, 0.85084441, 0.85057894, 0.8510282 , 0.84771999,
       0.84955788, 0.85074231, 0.84955788, 0.84937409, 0.84976209])
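
To quantify the uncertainty mentioned above, we can also report the standard deviation of the per-fold scores (a one-line addition to the cell above):

print("Accuracy: %.2f%% +/- %.2f%%"
      % (results_kfold.mean()*100.0, results_kfold.std()*100.0))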

This answer is consistent with the previous train_test_split approach.
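
Cross-validation also pairs naturally with a systematic hyperparameter sweep, the other tuning method named in the Key Points. The following sketch is not part of the original notebook, and the parameter grid is an arbitrary illustrative choice; it uses scikit-learn’s GridSearchCV to pick DecisionTreeClassifier hyperparameters by 5-fold cross-validated accuracy:

from sklearn.model_selection import GridSearchCV

# Try a small grid of decision-tree hyperparameters with 5-fold CV
param_grid = {'max_depth': [3, 5, 7],
              'min_samples_split': [2, 8, 32]}
search = GridSearchCV(DecisionTreeClassifier(criterion='entropy'),
                      param_grid, cv=5, scoring='accuracy')
search.fit(train_F1, train_L1)
print("Best parameters:", search.best_params_)
print("Best cross-validated accuracy: %.2f%%" % (search.best_score_ * 100.0))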



Answer Keys

Take a look at the file solutions/ML-session-3-solutions.txt if you need the answers to some of the questions asked in this episode.

Key Points

  • The key methods for machine learning model tuning include feature selection and model hyperparameter adjustment.