Tuning the Machine Learning Model
Overview
Teaching: 20 min
Exercises: 20 min
Questions
What is model tuning and why do we need it?
What are the key procedures to tune a machine learning model for the best performance?
What are the hyperparameters that we need to adjust in the tuning process?
Objectives
Understand the different methods to tune a machine learning model.
In the previous episode, we learned how to build and train simple ML models using scikit-learn, then assessed their performance using metrics such as the accuracy score and the confusion matrix. For simplicity, we manually selected the features in the dataset to use for model training and inference. But we also saw that this manual process can be tedious, with many possible combinations to try.
In this episode, we will learn how to systematically improve the predictive performance of the ML model aimed at classifying the running smartphone apps based on their resource-usage signatures.
Prerequisites
This episode builds on the Python environment already set up in the previous episode. If you have not already done so, you must load the requisite Python modules and preprocess the SherLock dataset; your environment will then be ready for the machine-learning training step. Please execute the following steps if you started Python from scratch.
Solution: Preparing Python Modules and Dataset
First, load all the required Python modules and functions:
import os
import sys
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
import sklearn

# also add more tools:
from sklearn import preprocessing
from sklearn.model_selection import train_test_split

# machine learning models:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# for evaluating model performance
from sklearn.metrics import accuracy_score, confusion_matrix
Next, we load and preprocess the SherLock “2-apps” dataset as we did in the previous episode. All of the necessary steps are now placed in this code snippet:
df2 = pd.read_csv('sherlock/sherlock_mystery_2apps.csv')

# Remove irrelevant feature(s)
df2.drop('Unnamed: 0', axis=1, inplace=True)

# Remove rows with missing values
df2.dropna(inplace=True)

# Remove duplicate features
df2.drop('Mem', axis=1, inplace=True)

# Separate labels from features
df2_labels = df2['ApplicationName']
df2_features = df2.drop('ApplicationName', axis=1)

# Feature scaling
scaler = preprocessing.StandardScaler()
scaler.fit(df2_features)
df2_features_n = pd.DataFrame(scaler.transform(df2_features),
                              columns=df2_features.columns,
                              index=df2_features.index)
Check Your Data!
Before we go on, let us make sure that you have the correct data. Please examine the features after the normalization process:
print("Normalized features:") print(df2_features_n.head(10))
Normalized features:
        CPU_USAGE    cutime       lru  num_threads  otherPrivateDirty  \
176473  -0.159870 -0.429029 -0.041774    -1.300898          -0.780597
176474   4.129610 -0.429029 -0.041774     0.222698          -0.688933
176475   0.213345 -0.429029 -0.041774    -0.292636          -0.321111
176476  -0.159870 -0.429029 -0.041774    -1.300898          -0.785560
176477   3.935538 -0.429029 -0.041774     0.222698          -0.687036
176478   0.213345 -0.429029 -0.041774    -0.292636          -0.323008
176479  -0.159870 -0.429029 -0.041774    -1.300898          -0.785560
176480   3.791228 -0.429029 -0.041774     0.222698          -0.688349
176481   0.213345 -0.429029 -0.041774    -0.292636          -0.328701
176482  -0.159870 -0.429029 -0.041774    -1.300898          -0.786873

        priority     utime     vsize   cminflt  guest_time     queue
176473  0.246368 -0.847813 -0.558714 -0.698484   -0.841396 -0.244324
176474  0.246368 -0.705633  0.242407 -0.698484   -0.705121 -0.244324
176475  0.246368 -0.292064 -0.956849  0.537550   -0.302963 -0.244324
176476  0.246368 -0.847813 -0.558714 -0.698484   -0.841660 -0.244324
176477  0.246368 -0.705633  0.242407 -0.698484   -0.707196 -0.244324
176478  0.246368 -0.292064 -0.956849  0.537550   -0.293689 -0.244324
176479  0.246368 -0.847813 -0.558714 -0.698484   -0.852326 -0.244324
176480  0.246368 -0.705633  0.242407 -0.698484   -0.713160 -0.244324
176481  0.246368 -0.292064 -0.956849  0.537550   -0.287239 -0.244324
176482  0.246368 -0.847813 -0.558714 -0.698484   -0.850737 -0.244324
The contents of your `df2_features_n` dataframe should match the output printed above. At this stage, it is also a good idea to create a backup of the normalized feature matrix, in case we make a mistake later and need to revert:
df2_features_n_backup = df2_features_n.copy()
Feature Selection
In the previous episode, we discovered that the performance of an ML model may be strongly affected by the choice of features. Even an ML method that can potentially perform very well (e.g. a decision tree) may perform poorly when an inappropriate set of features is used in the modeling.
In ML modeling, generally speaking, we want to start with a handful of features (2-4) with the most predictive power. These are the features that have the strongest influence on the model’s output. How do we select such features? We need a way to reason about which columns can be dropped first, so that our model is as compact as possible. In this section, we will devise some ways to reason about the selection of features.
First, let’s review the existing features in the preprocessed “2-apps” SherLock dataset:
df2_features_n.columns
Index(['CPU_USAGE', 'cutime', 'lru', 'num_threads', 'otherPrivateDirty',
'priority', 'utime', 'vsize', 'cminflt', 'guest_time', 'queue'],
dtype='object')
Altogether, there are 11 features.
First, we want to find features that are very similar or even identical; we then drop the (near) duplicate features. We will use two complementary means to detect such duplicates:
- Histogram analysis
- Correlation analysis
Histogram Analysis
A histogram plot visualizes the distribution of values in a feature. Let’s make a panel of histograms for all the normalized features.
~~~python
# plt is a shorthand for matplotlib.pyplot
plt.figure(figsize=(10.0, 8.0))
for (i, col) in enumerate(df2_features_n.columns):
    # Creates a 4 row by 3 cols plot matrix
    plt.subplot(4, 3, i+1)
    plt.hist(df2_features_n[col], bins=50)
    plt.title(col)

plt.subplots_adjust(top=0.92, bottom=0.08, left=0.10, right=0.95,
                    hspace=0.75, wspace=0.35)
plt.show()
~~~
![Histogram ](ML-Session-3-devel_files/ML-Session-3-devel_26_0.png)
Visualizing histograms of multiple features in a panel form
is a powerful tool to identify features that are identical or very similar.
> ## Finding Identical or Similar Features
>
> From the histogram panel plot above,
> can you spot features that are suspected to be identical or similar?
>
> ## Finding Identical or Similar Features, Digging Deeper
>
> Repeat drawing the histogram panel above,
> but color the histogram differently for each category (`ApplicationName`)
> to confirm the identical features.
> Why is this step needed?
```python
df2_labels.unique()
```
array(['Facebook', 'WhatsApp'], dtype=object)
"""Separate the rows in the feature matrix based on the associated app names""";
Apps = df2_labels.unique()
indx_app = {}
features_app = {}
# The first loop filters the rows by the app names
# using the df2_labels
for app in Apps:
print("\nApp:", app)
indx_app[app] = df2_labels[df2_labels == app].index
print("Index:")
print(indx_app[app][:5])
features_app[app] = df2_features_n.loc[indx_app[app]]
print("Features:")
print(features_app[app].head(5))
App: Facebook
Index:
Int64Index([176473, 176474, 176476, 176477, 176479], dtype='int64')
Features:
CPU_USAGE cutime lru num_threads otherPrivateDirty \
176473 -0.159870 -0.429029 -0.041774 -1.300898 -0.780597
176474 4.129610 -0.429029 -0.041774 0.222698 -0.688933
176476 -0.159870 -0.429029 -0.041774 -1.300898 -0.785560
176477 3.935538 -0.429029 -0.041774 0.222698 -0.687036
176479 -0.159870 -0.429029 -0.041774 -1.300898 -0.785560
priority utime vsize cminflt guest_time queue
176473 0.246368 -0.847813 -0.558714 -0.698484 -0.841396 -0.244324
176474 0.246368 -0.705633 0.242407 -0.698484 -0.705121 -0.244324
176476 0.246368 -0.847813 -0.558714 -0.698484 -0.841660 -0.244324
176477 0.246368 -0.705633 0.242407 -0.698484 -0.707196 -0.244324
176479 0.246368 -0.847813 -0.558714 -0.698484 -0.852326 -0.244324
App: WhatsApp
Index:
Int64Index([176475, 176478, 176481, 176484, 176487], dtype='int64')
Features:
CPU_USAGE cutime lru num_threads otherPrivateDirty \
176475 0.213345 -0.429029 -0.041774 -0.292636 -0.321111
176478 0.213345 -0.429029 -0.041774 -0.292636 -0.323008
176481 0.213345 -0.429029 -0.041774 -0.292636 -0.328701
176484 0.213345 -0.429029 -0.041774 -0.270230 -0.324906
176487 0.213345 -0.429029 -0.041774 -0.270230 -0.324906
priority utime vsize cminflt guest_time queue
176475 0.246368 -0.292064 -0.956849 0.53755 -0.302963 -0.244324
176478 0.246368 -0.292064 -0.956849 0.53755 -0.293689 -0.244324
176481 0.246368 -0.292064 -0.956849 0.53755 -0.287239 -0.244324
176484 0.246368 -0.292064 -0.948734 0.53755 -0.298266 -0.244324
176487 0.246368 -0.292064 -0.948734 0.53755 -0.292894 -0.244324
"""Draw the multi-app histogram panel""";
plt.figure(figsize=(12.0, 9.0))
for (i, col) in enumerate(df2_features_n.columns):
    # Creates a 4 row by 3 cols plot matrix
    plt.subplot(4, 3, i+1)
    for app in Apps:
        plt.hist(features_app[app][col], bins=50)
    plt.title(col)
plt.subplots_adjust(top=0.92, bottom=0.08, left=0.10, right=0.95,
                    hspace=0.75, wspace=0.35)
plt.show()
QUESTIONS:
- From this second graph, further confirm that two of the features are identical.
- If you inspect the raw (unnormalized) values, are these two features still identical? This shows the value of normalizing the features: it exposes duplicate features that may be masked by a multiplicative factor (a quick check is sketched below).
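As a quick check (a minimal sketch; it assumes the unnormalized `df2_features` dataframe is still in memory and that `utime` and `guest_time` are the suspected pair), you can compare the raw columns directly:

```python
# Hypothetical check on a suspected duplicate pair (utime vs guest_time):
# are the raw columns exactly identical, or identical up to a scale factor?
pair = ('utime', 'guest_time')
print("Exactly identical:", df2_features[pair[0]].equals(df2_features[pair[1]]))
ratio = df2_features[pair[0]] / df2_features[pair[1]].replace(0, np.nan)
print("Ratio statistics (a near-constant ratio suggests a pure scale factor):")
print(ratio.describe())
```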
# Alternate version: bigger graphs, but only showing 2 apps here
"""
Run this code cell to generate a panel of raw data plots.
Be patient, it will take a few seconds to complete.
Take this code and adapt it for your own analysis.
Feel free to adjust the parameters.
""";
fig = plt.figure(figsize=(16.0, 14.0))
nx = 3
ny = 4
DF = df2_features_n
LABELS = df2_labels
columns = ( c for c in DF.columns if c != "ApplicationName" )
print("Visually inspecting individual values:")
for i, col in enumerate(columns):
    axes = fig.add_subplot(ny, nx, i+1)
    axes.set_xlabel(col)
    vals_FB = DF[LABELS == 'Facebook'][col]
    vals_WA = DF[LABELS == 'WhatsApp'][col]
    min_val = DF[col].min()
    max_val = DF[col].max()
    print('* ', col, ' range:', min_val, '..', max_val)
    plt.hist(vals_FB, label='Facebook', range=(min_val, max_val), bins=50)
    plt.hist(vals_WA, label='WhatsApp', range=(min_val, max_val), bins=50)
    plt.legend()
plt.subplots_adjust(top=0.92, bottom=0.08, left=0.10, right=0.95, hspace=0.25, wspace=0.15)
Visually inspecting individual values:
* CPU_USAGE range: -0.15987005513820488 .. 56.981770146128696
* cutime range: -0.42902903775337814 .. 5.414514283832063
* lru range: -0.04177386015423364 .. 27.71388642761829
* num_threads range: -1.5025506498298415 .. 2.5753095640085486
* otherPrivateDirty range: -0.7912521410158143 .. 11.032493022137361
* priority range: -11.517829624469842 .. 0.24636791222366086
* utime range: -0.8566225850847952 .. 6.445692365289096
* vsize range: -16.16567005070764 .. 4.080380746370644
* cminflt range: -0.6984842444974012 .. 3.1179573468024
* guest_time range: -0.8651330060797121 .. 6.456384348686843
* queue range: -0.24432361328717375 .. 20.43728803389054
3.2 Correlation
Next, we can further narrow the feature set using the correlation between feature pairs. Feature pairs that are highly correlated can be treated as duplicates, so we can drop one feature from each such pair. The pair correlations can be computed using the `DataFrame.corr()` method.
df2_corr = df2_features_n.corr()
df2_corr
 | CPU_USAGE | cutime | lru | num_threads | otherPrivateDirty | priority | utime | vsize | cminflt | guest_time | queue
---|---|---|---|---|---|---|---|---|---|---|---
CPU_USAGE | 1.000000 | 0.006790 | 0.167896 | 0.039330 | 0.197823 | 0.001379 | 0.095689 | 0.072699 | -0.000837 | 0.095685 | -0.000574 |
cutime | 0.006790 | 1.000000 | -0.017922 | -0.095443 | 0.120551 | 0.104557 | 0.151107 | -0.296729 | 0.594047 | 0.151105 | -0.102815 |
lru | 0.167896 | -0.017922 | 1.000000 | -0.043429 | -0.002386 | 0.009580 | 0.052039 | 0.005342 | -0.029178 | 0.052049 | -0.008956 |
num_threads | 0.039330 | -0.095443 | -0.043429 | 1.000000 | 0.529398 | -0.198157 | 0.503220 | 0.859857 | -0.143042 | 0.503206 | 0.195843 |
otherPrivateDirty | 0.197823 | 0.120551 | -0.002386 | 0.529398 | 1.000000 | 0.097185 | 0.630480 | 0.464462 | 0.238920 | 0.630457 | -0.095587 |
priority | 0.001379 | 0.104557 | 0.009580 | -0.198157 | 0.097185 | 1.000000 | 0.136242 | -0.174586 | 0.170894 | 0.136241 | -0.996884 |
utime | 0.095689 | 0.151107 | 0.052039 | 0.503220 | 0.630480 | 0.136242 | 1.000000 | 0.394805 | 0.414287 | 0.999975 | -0.134727 |
vsize | 0.072699 | -0.296729 | 0.005342 | 0.859857 | 0.464462 | -0.174586 | 0.394805 | 1.000000 | -0.491281 | 0.394797 | 0.172313 |
cminflt | -0.000837 | 0.594047 | -0.029178 | -0.143042 | 0.238920 | 0.170894 | 0.414287 | -0.491281 | 1.000000 | 0.414275 | -0.168564 |
guest_time | 0.095685 | 0.151105 | 0.052049 | 0.503206 | 0.630457 | 0.136241 | 0.999975 | 0.394797 | 0.414275 | 1.000000 | -0.134726 |
queue | -0.000574 | -0.102815 | -0.008956 | 0.195843 | -0.095587 | -0.996884 | -0.134727 | 0.172313 | -0.168564 | -0.134726 | 1.000000 |
The `.corr()` method returns a matrix of correlations between feature pairs.
The maximum value is 1 (perfectly correlated, i.e. identical), whereas the minimum value is -1 (perfectly anti-correlated).
A negative correlation means that an increase in one feature is accompanied by a decrease in the other.
We can use a heatmap to visualize the correlation matrix above and find the highly correlated feature pair(s) by using the `seaborn.heatmap()` function.
plt.figure(figsize=(10.0, 10.0))
sns.heatmap(df2_corr, annot=True, vmax=1, square=True, cmap="Blues")
<AxesSubplot:>
QUESTION: From the matrix or heatmap above, please
- Identify the three pairs whose correlation values are the strongest (closest to +1 or -1);
- Identify additional pairs whose correlation values are above 0.5 in absolute value.
Compare your observation with the similar features discovered by the histogram panel earlier! Are they the same pairs?
–> (Enter your responses here) <–
Based on our discussion above, we can definitely delete `vsize`, `queue`, and `guest_time` because of their very high correlations with three other features:
df2_features_n.drop(['vsize', 'queue', 'guest_time'], axis=1, inplace=True)
print(df2_features_n.columns)
Index(['CPU_USAGE', 'cutime', 'lru', 'num_threads', 'otherPrivateDirty',
'priority', 'utime', 'cminflt'],
dtype='object')
Eight features remaining!
The next pairs that can be considered for dropping are:
- (`otherPrivateDirty`, `utime`)
- (`cutime`, `cminflt`)
The first pair also shows similarity in the histogram visuals (see earlier plot).
We can drop `utime` and `cminflt` because of their marked correlations with the other two:
df2_features_n.drop(['utime', 'cminflt'], axis=1, inplace=True)
print(df2_features_n.columns)
Index(['CPU_USAGE', 'cutime', 'lru', 'num_threads', 'otherPrivateDirty',
'priority'],
dtype='object')
3.3 Simple Group Analysis
At this point, we have reduced our feature set to just six for the two applications (“WhatsApp” and “Facebook”).
The next thing we can consider is the distribution of each feature, grouped by the application category.
If a feature has similar value distributions for the two applications, it will have little power to discriminate between them.
Histograms can help uncover such similarities, but descriptive statistics provide a complementary way.
This can be achieved by employing the `.groupby()` method before computing the descriptive statistics.
We temporarily recombine the labels with the features to do this group analysis:
df2_with_label = df2_features_n.copy()
df2_with_label['ApplicationName'] = df2_labels
df2_with_label.head()
 | CPU_USAGE | cutime | lru | num_threads | otherPrivateDirty | priority | ApplicationName
---|---|---|---|---|---|---|---
176473 | -0.159870 | -0.429029 | -0.041774 | -1.300898 | -0.780597 | 0.246368 | Facebook
176474 | 4.129610 | -0.429029 | -0.041774 | 0.222698 | -0.688933 | 0.246368 | Facebook
176475 | 0.213345 | -0.429029 | -0.041774 | -0.292636 | -0.321111 | 0.246368 | WhatsApp
176476 | -0.159870 | -0.429029 | -0.041774 | -1.300898 | -0.785560 | 0.246368 | Facebook
176477 | 3.935538 | -0.429029 | -0.041774 | 0.222698 | -0.687036 | 0.246368 | Facebook
Let’s group the rows by application name using `.groupby()`, then look at the descriptive statistics of each feature for each app.
df2_with_label.groupby('ApplicationName')['CPU_USAGE'].describe()
ApplicationName | count | mean | std | min | 25% | 50% | 75% | max
---|---|---|---|---|---|---|---|---
Facebook | 379054.0 | -0.013990 | 1.193461 | -0.159870 | -0.159870 | -0.105132 | -0.075275 | 56.981770
WhatsApp | 233060.0 | 0.022753 | 0.555877 | -0.134989 | -0.075275 | -0.030489 | 0.014297 | 45.725618
df2_with_label.groupby('ApplicationName')['lru'].describe()
ApplicationName | count | mean | std | min | 25% | 50% | 75% | max
---|---|---|---|---|---|---|---|---
Facebook | 379054.0 | 0.025685 | 1.270086e+00 | -0.041774 | -0.041774 | -0.041774 | -0.041774 | 27.713886
WhatsApp | 233060.0 | -0.041774 | 4.322940e-15 | -0.041774 | -0.041774 | -0.041774 | -0.041774 | -0.041774
QUESTION: Observe how similar or dissimilar the statistical quantities (mean, standard deviation, as well as the quartiles) are between the two applications.
- Do the means of `CPU_USAGE` (for the different applications) overlap within their standard deviations?
- What about `lru`?
"""Compare the descriptive statistics of other features as well...""";
#TODO
for col in df2_features_n.columns:
if col not in ('CPU_USAGE', 'lru'):
print("Column:", col)
display(df2_with_label.groupby('ApplicationName')[col].describe())
Column: cutime

ApplicationName | count | mean | std | min | 25% | 50% | 75% | max
---|---|---|---|---|---|---|---|---
Facebook | 379054.0 | -0.429029 | 2.301717e-12 | -0.429029 | -0.429029 | -0.429029 | -0.429029 | -0.429029
WhatsApp | 233060.0 | 0.697782 | 1.356525e+00 | -0.429029 | -0.429029 | 0.544895 | 1.518819 | 5.414514

Column: num_threads

ApplicationName | count | mean | std | min | 25% | 50% | 75% | max
---|---|---|---|---|---|---|---|---
Facebook | 379054.0 | 0.130986 | 1.246569 | -1.502551 | -1.278492 | 0.267510 | 1.096525 | 2.575310
WhatsApp | 233060.0 | -0.213038 | 0.160584 | -1.502551 | -0.270230 | -0.203013 | -0.135795 | 0.603597

Column: otherPrivateDirty

ApplicationName | count | mean | std | min | 25% | 50% | 75% | max
---|---|---|---|---|---|---|---|---
Facebook | 379054.0 | -0.207624 | 1.032435 | -0.791252 | -0.779721 | -0.648356 | -0.213099 | 11.032493
WhatsApp | 233060.0 | 0.337685 | 0.841812 | -0.791252 | -0.263748 | 0.153994 | 0.793596 | 6.684450

Column: priority

ApplicationName | count | mean | std | min | 25% | 50% | 75% | max
---|---|---|---|---|---|---|---|---
Facebook | 379054.0 | -0.150299 | 1.241631 | -11.51783 | 0.246368 | 0.246368 | 0.246368 | 0.246368
WhatsApp | 233060.0 | 0.244450 | 0.150205 | -11.51783 | 0.246368 | 0.246368 | 0.246368 | 0.246368
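To complement reading these tables by eye, here is a minimal sketch of my own (not part of the original analysis) that computes a crude per-feature separation score: the absolute difference between the two group means divided by a pooled standard deviation. Features with a small score separate the two apps poorly:

```python
# Crude separation score per feature:
# |mean(Facebook) - mean(WhatsApp)| / pooled standard deviation.
grouped = df2_with_label.groupby('ApplicationName')
means, stds = grouped.mean(), grouped.std()
pooled_std = np.sqrt((stds.loc['Facebook']**2 + stds.loc['WhatsApp']**2) / 2)
separation = (means.loc['Facebook'] - means.loc['WhatsApp']).abs() / pooled_std
print(separation.sort_values())   # smallest scores = least discriminative features
```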
DECISION: After some exploration, we found that the averages of `CPU_USAGE` and `lru` for the two apps are much closer to each other than those of the other features. Thus, let us remove these two features.
df2_features_n.drop(['CPU_USAGE','lru'],axis=1,inplace=True)
df2_features_n.head(10)
 | cutime | num_threads | otherPrivateDirty | priority
---|---|---|---|---
176473 | -0.429029 | -1.300898 | -0.780597 | 0.246368 |
176474 | -0.429029 | 0.222698 | -0.688933 | 0.246368 |
176475 | -0.429029 | -0.292636 | -0.321111 | 0.246368 |
176476 | -0.429029 | -1.300898 | -0.785560 | 0.246368 |
176477 | -0.429029 | 0.222698 | -0.687036 | 0.246368 |
176478 | -0.429029 | -0.292636 | -0.323008 | 0.246368 |
176479 | -0.429029 | -1.300898 | -0.785560 | 0.246368 |
176480 | -0.429029 | 0.222698 | -0.688349 | 0.246368 |
176481 | -0.429029 | -0.292636 | -0.328701 | 0.246368 |
176482 | -0.429029 | -1.300898 | -0.786873 | 0.246368 |
3.4 Feature Selection Summary
We now have the four features we want: `cutime`, `num_threads`, `otherPrivateDirty`, `priority`.
# Save this featureset in a new variable:
df2_features_n1 = df2_features_n_backup[['cutime', 'num_threads', 'otherPrivateDirty', 'priority']]
Save these features to files for later use.
# We replace the categories from strings to numbers (0=Facebook, 1=WhatsApp)
# for several reasons: not only to save space, but also because the categories
# need to be 0s and 1s when we reuse this data in the neural network episode.
labels_save = df2_labels.replace(['Facebook', 'WhatsApp'], [0, 1])
labels_save.to_csv('sherlock_2apps_labels.csv',header=True,index=False)
df2_features_n1.to_csv('sherlock_2apps_features.csv',index=False)
labels_save.head(10)
176473 0
176474 0
176475 1
176476 0
176477 0
176478 1
176479 0
176480 0
176481 1
176482 0
Name: ApplicationName, dtype: int64
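As a quick sanity check (a small addition, simply re-reading the two files written above), you can verify that the saved CSV files round-trip as expected:

```python
# Reload the saved files and confirm their shapes and the 0/1 label encoding
features_check = pd.read_csv('sherlock_2apps_features.csv')
labels_check = pd.read_csv('sherlock_2apps_labels.csv')
print(features_check.shape, labels_check.shape)
print(labels_check['ApplicationName'].value_counts())
```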
3.5 Training and Validating Machine Learning Model
EXERCISES: Now follow the same procedure as in the previous notebook to train and validate the machine learning models (logistic regression and decision tree) using the newly selected features. Record the accuracy scores and the necessary details (such as the list of features and any tweaked hyperparameters) in your notebook or spreadsheet.
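The solution cells below call a helper named `model_evaluate()`, which was defined in the previous episode. If you started from scratch and no longer have it, here is a minimal sketch of such a helper, reconstructed to match the metrics printed in the outputs below (it may differ from the original in details):

```python
def model_evaluate(model, test_F, test_L):
    """Print the accuracy score and confusion matrix of a fitted model."""
    test_L_pred = model.predict(test_F)
    print("Evaluation by using model:", type(model).__name__)
    print("accuracy_score:", accuracy_score(test_L, test_L_pred))
    print("confusion_matrix:")
    print(confusion_matrix(test_L, test_L_pred))
```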
"""Train and validate the LogisticRegression model wih the new feature set""";
#train_F1, test_F1, train_L1, test_L1 = train_test_split(#TODO)
model_lr1 = LogisticRegression(solver='lbfgs')
#...TODO
Solution:
train_F1, test_F1, train_L1, test_L1 = train_test_split(df2_features_n1, df2_labels, test_size=0.2, random_state=162639729)
print("Model training with features:", list(df2_features_n1.columns))
model_lr1 = LogisticRegression(solver='lbfgs')
print("Training model_lr1")
%time model_lr1.fit(train_F1,train_L1)
model_dtc1 = DecisionTreeClassifier(criterion='entropy', max_depth=3, min_samples_split=8)
print("Training model_dtc1")
%time model_dtc1.fit(train_F1, train_L1)
model_evaluate(model_lr1, test_F1, test_L1)
model_evaluate(model_dtc1, test_F1, test_L1)
Model training with features: ['cutime', 'num_threads', 'otherPrivateDirty', 'priority']
Training model_lr1
CPU times: user 3.31 s, sys: 117 ms, total: 3.43 s
Wall time: 2.53 s
Training model_dtc1
CPU times: user 1.44 s, sys: 34 ms, total: 1.47 s
Wall time: 1.31 s
Evaluation by using model: LogisticRegression
accuracy_score: 0.8507878421538436
confusion_matrix:
[[73919 1978]
[16289 30237]]
Evaluation by using model: DecisionTreeClassifier
accuracy_score: 0.9871347704271256
confusion_matrix:
[[75310 587]
[ 988 45538]]
QUESTIONS:
- Compare the performance of the two trained models.
- Discuss which model may be better for our dataset and think about the possible reasons.
- Have we achieved the maximum accuracy of these methods that we saw in the previous notebook (`ML-session-2.ipynb`)? Why, or why not?
The last question is very important to ponder. If the current featureset is indeed a perfect reduced set of features, then the accuracy should be pretty close to the maximum possible accuracy. Otherwise there is still something amiss!
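One way to estimate that maximum achievable accuracy (a minimal sketch of my own, not part of the original notebook) is to train the same decision tree on the full, backed-up 11-feature matrix and compare the scores:

```python
# Baseline: the same decision tree trained on all 11 normalized features,
# to compare against the reduced four-feature model above.
train_Fa, test_Fa, train_La, test_La = train_test_split(
    df2_features_n_backup, df2_labels, test_size=0.2, random_state=162639729)
model_dtc_all = DecisionTreeClassifier(criterion='entropy', max_depth=3, min_samples_split=8)
model_dtc_all.fit(train_Fa, train_La)
model_evaluate(model_dtc_all, test_Fa, test_La)
```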
Post-Analysis: A Caveat on the Simple Group Analysis
Let's visually examine per-class boxplots for every feature, excluding the outliers: they are a small fraction of the data but can distort the main trends.
# Do a massive panel of per-class boxplots, one subplot per feature
fig = plt.figure(figsize=(16.0, 10.0))
nx = 3
ny = 5
columns = ( c for c in df2_with_label.columns if c != "ApplicationName" )
print("Visually inspecting value spread (Facebook vs WA datasets): ", end="")
for i, col in enumerate(columns):
    print(" ", col, sep="", end="")
    ax = fig.add_subplot(ny, nx, i+1)
    sns.boxplot(x='ApplicationName', y=col,
                data=df2_with_label, ax=ax, showfliers=False)
print()
Visually inspecting value spread (Facebook vs WA datasets): CPU_USAGE cutime lru num_threads otherPrivateDirty priority
These boxplots reveal some issues with the choices we made in the simple group analysis above: judged this way, the features that should be dropped first are `priority` and `lru`. Why did we come to a different conclusion earlier? Because the boxplots exclude the outliers and compare medians rather than means, whereas the simple group analysis relied on means computed over all values; the outliers may have skewed those means.
Alternative Feature Selection
display(df2_corr[ df2_corr.abs() > 0.5 ])
plt.figure(figsize=(10.0, 10.0))
sns.heatmap(df2_corr[ df2_corr.abs() > 0.5 ], annot=True, vmax=1, square=True, cmap="Blues")
 | CPU_USAGE | cutime | lru | num_threads | otherPrivateDirty | priority | utime | vsize | cminflt | guest_time | queue
---|---|---|---|---|---|---|---|---|---|---|---
CPU_USAGE | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
cutime | NaN | 1.000000 | NaN | NaN | NaN | NaN | NaN | NaN | 0.594047 | NaN | NaN |
lru | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
num_threads | NaN | NaN | NaN | 1.000000 | 0.529398 | NaN | 0.503220 | 0.859857 | NaN | 0.503206 | NaN |
otherPrivateDirty | NaN | NaN | NaN | 0.529398 | 1.000000 | NaN | 0.630480 | NaN | NaN | 0.630457 | NaN |
priority | NaN | NaN | NaN | NaN | NaN | 1.000000 | NaN | NaN | NaN | NaN | -0.996884 |
utime | NaN | NaN | NaN | 0.503220 | 0.630480 | NaN | 1.000000 | NaN | NaN | 0.999975 | NaN |
vsize | NaN | NaN | NaN | 0.859857 | NaN | NaN | NaN | 1.000000 | NaN | NaN | NaN |
cminflt | NaN | 0.594047 | NaN | NaN | NaN | NaN | NaN | NaN | 1.000000 | NaN | NaN |
guest_time | NaN | NaN | NaN | 0.503206 | 0.630457 | NaN | 0.999975 | NaN | NaN | 1.000000 | NaN |
queue | NaN | NaN | NaN | NaN | NaN | -0.996884 | NaN | NaN | NaN | NaN | 1.000000 |
<AxesSubplot:>
Of the following pairs, one feature in each pair should be dropped:
- (`utime`, `guest_time`)
- (`priority`, `queue`)
- (`vsize`, `num_threads`)
The next pairs that can be considered for dropping are:
- (`otherPrivateDirty`, `utime`)
- (`cutime`, `cminflt`)
# Using SelectKBest
from sklearn.feature_selection import SelectKBest, f_classif
fea_selector = SelectKBest(score_func=f_classif, k="all")
fea_selector.fit(df2_features_n_backup, df2_labels)
print(fea_selector.scores_)
# NOTE: We really care for the scores here, so we can manually make a cut
[1.94898967e+02 2.61546079e+05 6.57465535e+02 1.75712797e+04
4.61520066e+04 2.33471349e+04 7.00942800e+04 2.84285482e+05
2.35210130e+06 7.00867936e+04 2.26591254e+04]
feature_scores = pd.Series(fea_selector.scores_, index=df2_features_n_backup.columns)
feature_scores
CPU_USAGE 1.948990e+02
cutime 2.615461e+05
lru 6.574655e+02
num_threads 1.757128e+04
otherPrivateDirty 4.615201e+04
priority 2.334713e+04
utime 7.009428e+04
vsize 2.842855e+05
cminflt 2.352101e+06
guest_time 7.008679e+04
queue 2.265913e+04
dtype: float64
# Sort it, then we will select the most weighted features
feature_scores.sort_values(ascending=False)
cminflt 2.352101e+06
vsize 2.842855e+05
cutime 2.615461e+05
utime 7.009428e+04
guest_time 7.008679e+04
otherPrivateDirty 4.615201e+04
priority 2.334713e+04
queue 2.265913e+04
num_threads 1.757128e+04
lru 6.574655e+02
CPU_USAGE 1.948990e+02
dtype: float64
At this point, we combine the scoring above with the correlation analysis; it then becomes clearer which feature of each highly correlated pair should be dropped.
DECISION: Features to be selected: `cminflt`, `vsize`, `cutime`, `utime`.
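One way this combination could be automated (my own illustration of a simple greedy rule, not the procedure used originally): walk through the features from the highest to the lowest score and keep a feature only if it is not strongly correlated with one already kept. The 0.85 threshold is an arbitrary choice:

```python
# Greedy selection: keep high-scoring features that are not strongly correlated
# (|r| > 0.85, an arbitrary cutoff) with any feature already kept.
threshold = 0.85
selected = []
for feat in feature_scores.sort_values(ascending=False).index:
    if all(abs(df2_corr.loc[feat, kept]) < threshold for kept in selected):
        selected.append(feat)
    if len(selected) == 4:
        break
print("Selected features:", selected)
```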
# Suppose we run with k=4, still the scores are the same
fea_selector4 = SelectKBest(score_func=f_classif, k=4)
fea_selector4.fit(df2_features_n_backup, df2_labels)
fea_selector4.scores_
array([1.94898967e+02, 2.61546079e+05, 6.57465535e+02, 1.75712797e+04,
4.61520066e+04, 2.33471349e+04, 7.00942800e+04, 2.84285482e+05,
2.35210130e+06, 7.00867936e+04, 2.26591254e+04])
Machine Learning with new featureset: cminflt, vsize, cutime, utime
df2_features_n_backup.columns
Index(['CPU_USAGE', 'cutime', 'lru', 'num_threads', 'otherPrivateDirty',
'priority', 'utime', 'vsize', 'cminflt', 'guest_time', 'queue'],
dtype='object')
# Save this featureset in a new variable:
df2_features_n2 = df2_features_n_backup[['cminflt', 'vsize', 'cutime', 'utime']]
train_F2, test_F2, train_L2, test_L2 = train_test_split(df2_features_n2, df2_labels, test_size=0.2, random_state=162639729)
print("Model training with features:", list(df2_features_n2.columns))
model_lr2 = LogisticRegression(solver='lbfgs')
print("Training model_lr2")
%time model_lr2.fit(train_F2,train_L2)
model_dtc2 = DecisionTreeClassifier(criterion='entropy', max_depth=3, min_samples_split=8)
print("Training model_dtc2")
%time model_dtc2.fit(train_F2, train_L2)
model_evaluate(model_lr2, test_F2, test_L2)
model_evaluate(model_dtc2, test_F2, test_L2)
Model training with features: ['cminflt', 'vsize', 'cutime', 'utime']
Training model_lr2
CPU times: user 3.13 s, sys: 75.8 ms, total: 3.21 s
Wall time: 2.51 s
Training model_dtc2
CPU times: user 1.33 s, sys: 29.5 ms, total: 1.36 s
Wall time: 1.18 s
Evaluation by using model: LogisticRegression
accuracy_score: 0.9999836632005423
confusion_matrix:
[[75897 0]
[ 2 46524]]
Evaluation by using model: DecisionTreeClassifier
accuracy_score: 1.0
confusion_matrix:
[[75897 0]
[ 0 46526]]
4. Better Validation in the Training Phase
In the previous ML modeling, we used only the training dataset to train the model. The evaluation of a model’s performance should not rely on the training dataset, otherwise it would result in a biased performance score. We therefore held out a portion of the data as a test dataset to obtain an unbiased estimate of the performance. One problem remains: we do not know the uncertainty of this performance score (e.g. the accuracy score).
Here we introduce the k-fold cross-validation approach. In the k-fold cross-validation, the data is divided into k folds. The model is trained on k-1 folds with one fold held back for testing. This process gets repeated to ensure each fold of the dataset gets the chance to be the “test” set. Once the process is completed, we can summarize the evaluation metric using the mean and quantify its uncertainty using the measured standard deviation.
from sklearn import model_selection
kfold = model_selection.KFold(n_splits=10)
model_kfold = LogisticRegression(solver='lbfgs')
results_kfold = model_selection.cross_val_score(model_kfold, train_F1, train_L1, cv=kfold)
print("Accuracy: %.2f%%" % (results_kfold.mean()*100.0))
Accuracy: 84.95%
results_kfold
array([0.84572187, 0.85084441, 0.85057894, 0.8510282 , 0.84771999,
0.84955788, 0.85074231, 0.84955788, 0.84937409, 0.84976209])
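To quantify the spread mentioned earlier, we can also report the standard deviation across the folds alongside the mean (a small addition using the `results_kfold` array just computed):

```python
# Mean accuracy and its spread over the 10 folds
print("Accuracy: %.2f%% +/- %.2f%%"
      % (results_kfold.mean() * 100.0, results_kfold.std() * 100.0))
```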
This answer is consistent with the previous `train_test_split` approach.
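The same cross-validation machinery can drive hyperparameter tuning as well. Below is a minimal sketch (not part of the original notebook) using scikit-learn's `GridSearchCV` to search over two hyperparameters of the decision tree; the grid values are only examples, and the search may take a few minutes on this dataset:

```python
from sklearn.model_selection import GridSearchCV

# Example grid over two DecisionTreeClassifier hyperparameters
param_grid = {
    'max_depth': [2, 3, 4, 6],
    'min_samples_split': [2, 8, 32],
}
grid = GridSearchCV(DecisionTreeClassifier(criterion='entropy'),
                    param_grid, cv=kfold, scoring='accuracy')
grid.fit(train_F1, train_L1)
print("Best hyperparameters:", grid.best_params_)
print("Best cross-validated accuracy: %.2f%%" % (grid.best_score_ * 100.0))
```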
Answer Keys
Take a look at the file `solutions/ML-session-3-solutions.txt` if you need the answers to some of the questions asked in this notebook.
Key Points
The key methods for machine learning model tuning include feature selection and model hyperparameter adjustment.