DeapSECURE module 3: Machine Learning

Session 3: Tuning the Machine Learning Model

Welcome to the DeapSECURE online training program! This is a Jupyter notebook for the hands-on learning activities of the "Machine Learning" module, Episode 6: "Tuning the Machine Learning Model" (new episode to be written, as of 2021--stay tuned!). Please visit the DeapSECURE website to learn more about our training program.

In this session, we will use this notebook to learn how to optimize the predictive performance of a model that classifies the running applications based on their resource usage signatures.

Quick Links (sections of this notebook):

  1. Setup Instructions
  2. Preprocessing Sherlock Dataset
  3. Feature Selection
  4. Better Validation in the Training Phase

1. Setup Instructions

If you are opening this notebook from Wahab cluster's OnDemand interface, you're all set.

If you see this notebook elsewhere and want to perform the exercises on Wahab cluster, please follow the steps outlined in our setup procedure.

  1. Make sure you have activated your HPC service.
  2. Point your web browser to https://ondemand.wahab.hpc.odu.edu/ and sign in with your MIDAS ID and password.
  3. Create a new Jupyter session using "legacy" Python suite, then create a new "Python3" notebook. (See ODU HPC wiki for more detailed help.)

  4. Get the necessary files using commands below within Jupyter:

    mkdir -p ~/CItraining/module-ml
    cp -pr /shared/DeapSECURE/module-ml/. ~/CItraining/module-ml
    cd ~/CItraining/module-ml

The file name of this notebook is ML-session-3.ipynb.

1.1 Reminder

1.2 Loading Python Libraries

As the next step, we need to import the required Python libraries into this Jupyter Notebook.

For Wahab cluster only: before importing these libraries, we need to load the DeapSECURE environment modules:

In [ ]:
module("load", "DeapSECURE")

Now we can import the requisite Python libraries, most notably: pandas, NumPy, Matplotlib, Seaborn, and Scikit-learn.

In [ ]:
"""Import the necessary Python modules""";

import os
import sys
import pandas
import numpy
import seaborn
from matplotlib import pyplot
import sklearn

# also add more tools:
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
# machine learning models:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
# for evaluating model performance
from sklearn.metrics import accuracy_score, confusion_matrix

%matplotlib inline
In [ ]:
# Some advanced learners may like to use shortcuts,
# so we give them here:
pd = pandas
np = numpy
plt = pyplot
sns = seaborn

We also copy some functions we defined in the previous notebook:

In [ ]:
def model_evaluate(model,test_F,test_L):
    test_L_pred = model.predict(test_F)
    print("Evaluation by using model:",type(model).__name__)
    print("accuracy_score:",accuracy_score(test_L, test_L_pred))
    print("confusion_matrix:","\n",confusion_matrix(test_L, test_L_pred))
    return

2. Preprocessing Sherlock Dataset

First, we load and preprocess the SherLock "2-apps" dataset as we did in the previous notebook. Instead of doing them cell-by-cell, let's put all the steps into one cell and execute them in one shot:

In [ ]:
df2 = pandas.read_csv('sherlock/sherlock_mystery_2apps.csv')

# Remove irrelevant feature(s)
df2.drop('Unnamed: 0', axis=1, inplace=True)

# Remove rows with missing values
df2.dropna(inplace=True)

# Remove duplicate features
df2.drop('Mem', axis=1, inplace=True)

# Separate labels from features
df2_labels = df2['ApplicationName']
df2_features = df2.drop('ApplicationName', axis=1)

# Feature scaling
scaler = preprocessing.StandardScaler()
scaler.fit(df2_features)
df2_features_n = pandas.DataFrame(scaler.transform(df2_features),
                                  columns=df2_features.columns,
                                  index=df2_features.index)
print("Normalized features:")
df2_features_n.head(10)
In [ ]:
# Create a backup
df2_features_n_backup = df2_features_n.copy()

HINT: If you did not finish notebook 2, the sherlock_features.csv file does not exist yet. In that case, please use solutions/sherlock_features.csv instead.
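
If you need that fallback, a minimal sketch is shown below; it assumes the solutions file contains the same normalized feature columns produced by the preprocessing cell above, and it should only be run if you actually need it:

In [ ]:
# Fallback only: load the preprocessed, normalized features prepared in the solutions folder
df2_features_n = pandas.read_csv('solutions/sherlock_features.csv')
df2_features_n.head()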

In [ ]:
print("Features:")
print(df2_features.head(10))

3. Feature Selection

In the previous notebook (ML-session-2.ipynb) we discovered that the performance of a machine learning model may be strongly affected by the choice of features. Even a model that can perform very well may perform poorly when an inappropriate set of features is used for learning.

In a machine learning project, generally speaking, we want to start with a handful of features (2-4) with the most predictive power, i.e. the features that have the strongest influence on the model's output. How do we select such features? We need a way to reason about which columns can be dropped first, so that our model is as compact as possible. In this section, we will build some systematic ways to reason about the selection of features.

First, let's review the existing features in the preprocessed "2-apps" SherLock dataset:

In [ ]:
df2_features_n.columns

There are 11 features.

First, we want to find features that are very similar, then drop the (near) duplicate features. We will use two complementary means to detect such duplicates:

  • Histograms
  • Correlation plot

3.1 Histograms

A histogram is a visualization of the distribution of values in a feature. Let's make a panel of histograms for all the normalized features; this will make it easy to spot features that may be duplicates of one another:

In [ ]:
# plt stands for matplotlib.pyplot
plt.figure(figsize=(10.0, 8.0))
for (i, col) in enumerate(df2_features_n.columns):
    # Creates a 4 row by 3 cols plot matrix
    plt.subplot(4,3,i+1)
    plt.hist(df2_features_n[col], bins=50)
    plt.title(col)

plt.subplots_adjust(top=0.92, bottom=0.08, left=0.10, right=0.95, hspace=0.75,
                    wspace=0.35)
plt.show()

Visualizing histograms of multiple features in a panel form is a powerful tool to detect features that are identical or very similar.

QUESTION: From the histogram above, can you spot features that are suspected to be identical or similar?

EXERCISE: Repeat the histogram panel above, but color the histogram differently for each category (ApplicationName) to verify the identical features.

In [ ]:
df2_labels.unique()
In [ ]:
"""Separate the rows in the feature matrix based on the associated app names""";
Apps = df2_labels.unique()
indx_app = {}
features_app = {}
# The first loop filters the rows by the app names
# using the df2_labels
for app in Apps:
    print("\nApp:", app)
    indx_app[app] = df2_labels[df2_labels == app].index
    print("Index:")
    print(indx_app[app][:5])
    features_app[app] = df2_features_n.loc[indx_app[app]]
    print("Features:")
    print(features_app[app].head(5))
In [ ]:
"""Draw the multi-app histogram panel""";
pyplot.figure(figsize=(12.0, 9.0))
for (i, col) in enumerate(df2_features_n.columns):
    # Creates a 4 row by 3 cols plot matrix
    pyplot.subplot(4,3,i+1)
    for app in Apps:
        pyplot.hist(features_app[app][col], bins=50)
    pyplot.title(col)

pyplot.subplots_adjust(top=0.92, bottom=0.08, left=0.10, right=0.95, hspace=0.75,
                       wspace=0.35)
pyplot.show()

QUESTIONS:

  • From this second graph, confirm further that two of the features are indeed identical.

  • If you inspect the raw (unnormalized) values, are these two features still identical? (A small sketch to check this programmatically follows this list.) This shows the value of normalizing the features: it exposes duplicate features that would otherwise be masked by a multiplicative factor.
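
Here is the sketch mentioned above. The feature names featA and featB are placeholders; substitute the pair you suspect from the histograms (the pair used here is only a hypothetical example):

In [ ]:
# Hypothetical pair -- replace with the two features you identified above
featA, featB = 'utime', 'guest_time'
print("Raw values identical?       ",
      (df2_features[featA] == df2_features[featB]).all())
print("Normalized values identical?",
      numpy.allclose(df2_features_n[featA], df2_features_n[featB]))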

3.2 Correlation

At this point, we can take feature selection further by examining the correlation between each pair of features. Feature pairs that are highly correlated can be deemed duplicate features, so we can delete one feature of each such pair. The pair correlations can be computed using the DataFrame.corr() method.

In [ ]:
df2_corr = df2_features_n.corr()
df2_corr

The .corr() method returns a matrix of correlations between feature pairs. The maximum value is 1 (perfectly correlated, i.e. identical), whereas the minimum value is -1 (perfectly anti-correlated). A negative correlation means that an increase in one feature is accompanied by a decrease in the other.

We can use a heatmap to visualize the correlation matrix above and find the highly-correlated feature pair(s) by using the seaborn.heatmap() function.

In [ ]:
pyplot.figure(figsize=(10.0,10.0))
seaborn.heatmap(df2_corr, annot=True, vmax=1, square=True, cmap="Blues")

QUESTION: From the matrix or heatmap above, please

  • Identify three pairs whose correlation values are the highest (close to +1 or -1);
  • Identify additional pairs whose correlation values are above 0.5 in magnitude.

Compare your observation with the similar features discovered by the histogram panel earlier! Are they the same pairs?

--> (Enter your responses here) <--
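
If you would rather extract the strongly correlated pairs programmatically instead of reading them off the heatmap, a minimal sketch follows (the 0.5 cutoff is only an illustration, not a hard rule):

In [ ]:
# List feature pairs whose absolute correlation exceeds a chosen threshold.
# Keep only the upper triangle so each pair appears once (and skip the diagonal).
upper = numpy.triu(numpy.ones(df2_corr.shape, dtype=bool), k=1)
corr_pairs = df2_corr.where(upper).stack()   # Series indexed by (feature1, feature2)
strong_pairs = corr_pairs[corr_pairs.abs() > 0.5]
print(strong_pairs.sort_values(ascending=False))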

Based on our discussion above, we can definitely delete vsize, queue and guest_time because of their very high correlations with three other features:

In [ ]:
df2_features_n.drop(['vsize', 'queue', 'guest_time'], axis=1, inplace=True)
print(df2_features_n.columns)

Eight features remaining!

The next pairs that could be considered for dropping are:

  • (otherPrivateDirty, utime)
  • (cutime, cminflt)

The first pair also shows similarity in the histogram visuals (see earlier plot). We can drop utime and cminflt because of their marked correlations with the other two.

In [ ]:
df2_features_n.drop(['utime', 'cminflt'], axis=1, inplace=True)
print(df2_features_n.columns)

3.3 Simple Group Analysis

At this point, we have reduced our feature set to just six features for the two applications ("WhatsApp" and "Facebook"). The next thing we can consider is the distribution of each feature grouped by the application category. If a feature's distribution is nearly the same for both applications, that feature has little power to discriminate between them. Histograms can help uncover such similarities, but descriptive statistics provide a complementary way. This can be achieved by employing the .groupby() method before computing the descriptive statistics.

We recombine the label temporarily to do this group analysis:

In [ ]:
df2_with_label = df2_features_n.copy()
df2_with_label['ApplicationName'] = df2_labels
df2_with_label.head()

Let's group the feature values by the application name using .groupby(), then examine the descriptive statistics of each feature within each app.

In [ ]:
df2_with_label.groupby('ApplicationName')['CPU_USAGE'].describe()
In [ ]:
df2_with_label.groupby('ApplicationName')['lru'].describe()

QUESTION: Observe how similar or dissimilar the statistical quantities (mean, standard deviation, as well as the quartiles) are:

  1. Do the means of CPU_USAGE (for the different applications) overlap within their standard deviations?
  2. What about lru?
In [ ]:
"""Compare the descriptive statistics of other features as well...""";
#TODO
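
One possible way to scan all of the remaining features at once is sketched below (which statistics to compare is up to you):

In [ ]:
# A sketch: compare the per-app mean and standard deviation of every remaining feature
for col in df2_with_label.columns.drop('ApplicationName'):
    print("\n==", col, "==")
    print(df2_with_label.groupby('ApplicationName')[col].agg(['mean', 'std']))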

DECISION: After some exploration, we found that the averages of CPU_USAGE and lru for the two different apps are much closer to each other than those of the other features. Thus let us remove these two features.

In [ ]:
df2_features_n.drop(['CPU_USAGE','lru'],axis=1,inplace=True)
df2_features_n.head(10)

3.4 Feature Selection Summary

We now have the four features we want: cutime, num_threads, otherPrivateDirty, priority.

In [ ]:
# Save this featureset in a new variable:
df2_features_n1 = df2_features_n_backup[['cutime', 'num_threads', 'otherPrivateDirty', 'priority']]

Save these features into a file for later use.

In [ ]:
labels_save = df2_labels.replace(['Facebook', 'WhatsApp'], [0, 1])
labels_save.to_csv('sherlock_2apps_labels.csv',header=True,index=False)

df2_features_n1.to_csv('sherlock_2apps_features.csv',index=False)
In [ ]:
labels_save.head(10)

3.5 Training and Validating Machine Learning Model

EXERCISES: Now follow the same procedure as elaborated in the previous notebook to train and validate the machine learning models (logistic regression and decision tree) using the newly selected features. Record the accuracy scores and the necessary details (such as the list of features and any tweaked hyperparameters) in your notebook/spreadsheet.

In [ ]:
"""Train and validate the LogisticRegression model wih the new feature set""";

#train_F1, test_F1, train_L1, test_L1 = train_test_split(#TODO)
model_lr1 = LogisticRegression(solver='lbfgs')
#...TODO
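
If you get stuck, a minimal sketch of one possible solution is given below; the test_size, random_state, and decision-tree hyperparameters are illustrative choices, not prescriptions. Note that the cross-validation cell in Section 4 reuses the train_F1 and train_L1 variables defined here.

In [ ]:
"""A sketch of one possible solution (hyperparameter values are only examples)""";

# Split the reduced feature set and the labels into training & testing portions
train_F1, test_F1, train_L1, test_L1 = train_test_split(
    df2_features_n1, df2_labels, test_size=0.2, random_state=34)

# Logistic regression
model_lr1 = LogisticRegression(solver='lbfgs')
model_lr1.fit(train_F1, train_L1)
model_evaluate(model_lr1, test_F1, test_L1)

# Decision tree
model_dtc1 = DecisionTreeClassifier(criterion='entropy', max_depth=6)
model_dtc1.fit(train_F1, train_L1)
model_evaluate(model_dtc1, test_F1, test_L1)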
In [ ]:
 

QUESTIONS:

  • Compare the performance of the two trained models.

  • Discuss which model may be better for our dataset and think about the possible reasons.

  • Have we achieved the maximum accuracy of the methods that we saw in the previous notebook (ML-session-2.ipynb)? Why, or why not?

The last question is very important to ponder. If the current feature set is indeed a perfect reduced set of features, then the accuracy should be pretty close to the maximum possible accuracy. Otherwise, there is still something amiss!

In [ ]:
df2_features_n_backup.columns

4. Better Validation in the Training Phase

In the previous ML modeling, we used only the training dataset to train the model. The evaluation of a model's performance should not rely on the training dataset, otherwise it would result in a biased performance score. We therefore held out a portion of the data as a test dataset to obtain an unbiased estimate of the performance. One problem remains: we do not know the uncertainty of this performance score (e.g. the accuracy score).

Here we introduce the k-fold cross-validation approach. In k-fold cross-validation, the data is divided into k folds; the model is trained on k-1 folds with the remaining fold held back for testing. This process is repeated so that each fold of the dataset gets the chance to be the "test" set. Once the process is complete, we can summarize the evaluation metric by its mean and quantify its uncertainty by the measured standard deviation.

In [ ]:
from sklearn import model_selection

kfold = model_selection.KFold(n_splits=10)
model_kfold = LogisticRegression(solver='lbfgs')
results_kfold = model_selection.cross_val_score(model_kfold, train_F1, train_L1, cv=kfold)
print("Accuracy: %.2f%%" % (results_kfold.mean()*100.0)) 
In [ ]:
results_kfold

This result is consistent with the previous train_test_split approach.

In [ ]: