DeapSECURE module 3: Machine Learning

Session 2: Data Preprocessing for Machine Learning

Welcome to the DeapSECURE online training program! This is a Jupyter notebook for the hands-on learning activities of the "Machine Learning" module, Episodes 4 and 5: "Data Preprocessing for Machine Learning", "Machine Learning for Smartphone Application Classification".

Please visit the DeapSECURE website to learn more about our training program.

In this session, we will use this notebook to perform data preparation and an initial experiment with machine learning, so that learners can see all the steps of a machine learning workflow. We will build upon the skills and insight already acquired in the "Big Data" module.

1. Setup Instructions

If you are opening this notebook from Wahab cluster's OnDemand interface, you're all set.

If you see this notebook elsewhere and want to perform the exercises on Wahab cluster, please follow the steps outlined in our setup procedure.

  1. Make sure you have activated your HPC service.
  2. Point your web browser to https://ondemand.wahab.hpc.odu.edu/ and sign in with your MIDAS ID and password.
  3. Create a new Jupyter session using the "legacy" Python suite, then create a new "Python3" notebook. (See ODU HPC wiki for more detailed help.)

  4. Get the necessary files using the commands below within Jupyter:

    mkdir -p ~/CItraining/module-ml
    cp -pr /shared/DeapSECURE/module-ml/. ~/CItraining/module-ml
    cd ~/CItraining/module-ml

The file name of this notebook is ML-session-2.ipynb.

1.1 Reminder

1.2 Loading Python Libraries

Next, we need to import the required libraries into this Jupyter notebook.

For Wahab cluster only: before importing these libraries, we need to load the DeapSECURE environment modules:

In [ ]:
module("load", "DeapSECURE")

Now we can import the requisite Python libraries, most notably: pandas, NumPy, Matplotlib, Seaborn, and Scikit-learn.

In [ ]:
"""Import the necessary Python modules""";

import os
import sys
import pandas
import numpy
import seaborn
from matplotlib import pyplot
import sklearn

# also add more tools:
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
# machine learning models:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
# for evaluating model performance
from sklearn.metrics import accuracy_score, confusion_matrix

%matplotlib inline
In [ ]:
# Some advanced learners may like to use shortcuts,
# so we give them here:
pd = pandas
np = numpy
plt = pyplot
sns = seaborn

2. Loading Sherlock Dataset

Let us load Sherlock's "Applications" dataset into a DataFrame for analysis and machine learning. The dataset contains measurements of resource utilization from two applications on a smartphone, namely Facebook and WhatsApp. This is the same dataset used in the Data Wrangling & Visualization notebook of DeapSECURE's Big Data lesson, where we familiarized ourselves with this dataset and identified the necessary steps to clean the data. In the present notebook, we will continue the data preprocessing to make the data ready for machine learning.

In [ ]:
df2 = pd.read_csv('sherlock/sherlock_mystery_2apps.csv')
In [ ]:
df2.head(10)

Dataset Features

The sherlock_mystery_2apps.csv file actually contains a small subset of a much larger Application.csv data file. There are 14 columns in this subset:

  • Unnamed: 0 [int]: Record index.

  • ApplicationName [str]: Name of the application.

  • CPU_USAGE [float]: CPU utilization (100% = completely busy CPU).

  • cutime [int]: CPU "user time" spent by the spawned (child) processes.

  • lru [int]: "Least Recently Used"; this is a parameter of the Android application memory management.

  • num_threads [int]: Number of threads in this process.

  • otherPrivateDirty [int]: The private dirty pages used by everything else other than Dalvik heap and native heap.

  • priority [int]: Process's scheduling priority.

  • utime [int]: Measured CPU "user time".

  • vsize [int]: The size of the virtual memory, in bytes.

  • cminflt [int]: Count of minor faults made by the process's child processes.

  • guest_time [int]: Running time of "virtual CPU".

  • Mem [int]: Size of memory, in bytes.

  • queue [int]: The waiting order (priority).

3. Data Preprocessing

Up to two-thirds of the time in data analysis is spent on data preparation, in order to bring the data to a clean, consistent, and processable state. Data preparation is absolutely crucial to obtaining trustworthy insight from the data. We have covered this topic in great detail in the lesson on Data Wrangling and Visualization. (See also the corresponding notebook, BigData-session-3.ipynb.)

IMPORTANT!

You must do all the EXERCISEs in this section (do not skip any one), so that you obtain the clean dataset.

3.1. Known Issues in sherlock_mystery_2apps.csv

From the Data Wrangling & Visualization notebook, we identified the following issues with the raw data:

  • irrelevant data (column: Unnamed: 0),
  • missing data (about 22% of the values in the cminflt column are undefined),
  • a duplicate feature (Mem, a duplicate of vsize).

We also identified the necessary course of action to address these defects. In this notebook, we will simply execute the necessary steps in order to prepare, or preprocess, the data for machine learning.

3.2 Removing Irrelevant Features

Dropping Unnamed: 0 Column

EXERCISE: Remove the Unnamed: 0 column from df2 because it is irrelevant for our analysis.

In [ ]:
"""Drop the `Unnamed: 0` column from df2""";
#df2.drop(#TODO, inplace=True)
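
To illustrate how drop() works with columns, here is a minimal sketch on a small made-up DataFrame. The name `toy` and its values are assumptions for illustration only; they are not part of the Sherlock dataset.

```python
import pandas as pd

# Toy DataFrame standing in for the real dataset (values are made up)
toy = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "CPU_USAGE": [0.1, 0.5, 0.3],
    "vsize": [100, 200, 300],
})

# axis=1 means "drop a column"; inplace=True modifies `toy` itself
# instead of returning a new DataFrame
toy.drop(["Unnamed: 0"], axis=1, inplace=True)
print(toy.columns.tolist())
```

Without inplace=True, drop() leaves the original DataFrame unchanged and returns a modified copy, which must then be assigned to a variable.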

3.3 Dealing with Missing Data

Removing Missing Data from cminflt

Missing data can be caused by several reasons. We can use the .isna().sum() operation to identify features with missing data and how many values are missing:

In [ ]:
df2.isna().sum()

OPTIONAL QUESTION: What fraction of the data is missing in that one column?

Hint: One way is to use df2[COLUMN_NAME].size to get the total number of rows.

In [ ]:
"""Compute the fraction of missing data to the total number of rows in the `cminflt` column""";
#TODO
In [ ]:
df2['cminflt'].isna().sum() / df2['cminflt'].size

OPTIONAL: It is interesting to plot where the cminflt is missing, using the following trick.

In [ ]:
df2['cminflt'].isna().astype(int).plot()

DECISION: we decided to drop the rows where the cminflt values are missing. Reason: The number of records missing data in cminflt is large; but we also have a lot of data to begin with (nearly 800k rows in the raw dataset).

EXERCISE: Use DataFrame's dropna() method to remove rows that have missing values. Perform the operation in-place. Then verify that the new df2 no longer has any missing data.

In [ ]:
"""Remove rows with missing values from df2""";
#TODO

After this step, df2 should no longer contain any missing values.
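
The effect of dropna() can be seen on a small example. This is only a sketch on made-up data (the `toy` DataFrame is an assumption, not the real dataset); it shows the before-and-after check with isna().sum().

```python
import numpy as np
import pandas as pd

# Made-up data with two missing values in the `cminflt` column
toy = pd.DataFrame({
    "cminflt": [1.0, np.nan, 3.0, np.nan],
    "utime": [10, 20, 30, 40],
})
print("Missing values before:", toy.isna().sum().to_dict())

# Remove every row that has at least one missing value
toy.dropna(inplace=True)
print("Missing values after: ", toy.isna().sum().to_dict())
print("Remaining rows:", len(toy))
print("Remaining index labels:", toy.index.tolist())
```

Note that the surviving rows keep their original index labels; dropna() does not renumber them.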

3.4 Removing Duplicate Features

Now let's remove Mem from the dataset because it is a duplicate of vsize. (We keep the other columns for now; queue, for example, is used in a later experiment.)

In [ ]:
"""Drop the `Mem` column from the DataFrame""";
#df2.drop(#TODO)
In [ ]:
df2.drop(['Mem'], axis=1, inplace=True)

EXERCISE: Please verify that the unwanted columns above have been removed from df2 now!

In [ ]:
df2.columns

3.5 Separating Labels from Features

Quite frequently, labels (output values) come in the same dataframe as the features. In this case, we need to separate the label column(s) from the input features. In our dataset, the ApplicationName column contains the labels for classification machine learning. Let’s extract that into df2_labels, whereas the features go to df2_features.

In [ ]:
df2_labels = df2['ApplicationName']
df2_features = df2.drop('ApplicationName', axis=1)

Note: We do NOT drop the ApplicationName column in-place, so we have the backup of the cleaned data. Therefore we assign the feature-only dataframe to a new variable called df2_features.

Let's inspect the cleaned data (features & labels):

In [ ]:
"""Print first few rows of the labels and features.
   You can also inspect the descriptive statistics of the features.""";
print("Labels:")
#TODO
print("Features:")
#TODO
print("Statistics and range of values:")
#TODO

If a dataset contains categorical features, these features will need to be converted to numbers using schemes such as integer encoding or one-hot encoding to be properly represented as numbers. Consult the Data Preprocessing episode to learn more.
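
As a brief illustration of the two encoding schemes mentioned above, here is a sketch using pandas. The `state` column is hypothetical (our dataset has no such feature); it is used only to show what the encodings look like.

```python
import pandas as pd

# Hypothetical categorical feature, for illustration only
toy = pd.DataFrame({"state": ["running", "sleeping", "running", "stopped"]})

# Integer encoding: each category is mapped to an integer code
codes, categories = pd.factorize(toy["state"])
print(codes)
print(list(categories))

# One-hot encoding: one indicator column per category
onehot = pd.get_dummies(toy["state"], prefix="state")
print(onehot)
```

Integer encoding is compact but implies an ordering among the categories, whereas one-hot encoding avoids that at the cost of adding more columns.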

At this point, the df2_features object contains only numerical values. This condition is a prerequisite for using the dataset for machine learning. The feature-only part of the data is often referred to as the feature matrix, because it is now in a matrix form. There are two more steps required before we actually perform the training step in machine learning: feature scaling and train-test split.

3.6 Feature Scaling (Data Normalization)

Many ML algorithms work best when the typical values of the features are of the same order of magnitude. The range of a feature is the difference between the minimum and maximum values of that feature. Because each feature generally has its own value range, feature scaling is necessary to bring all the features into a similar order of magnitude.

Scikit-learn provides a number of scalers, each tailored for certain conditions in the dataset. We will use the standard scaler, which normalizes each feature by subtracting its mean and dividing by its standard deviation. (This is usually a reasonable starting point; you may want to try out other scalers, see Scikit-learn's documentation on data preprocessing.)

In [ ]:
scaler = preprocessing.StandardScaler()
scaler.fit(df2_features)
df2_features_n = pd.DataFrame(scaler.transform(df2_features),
                              columns=df2_features.columns,
                              index=df2_features.index)
df2_features_n.head(10)

The normalized features are stored in a new variable, df2_features_n.
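
To see what the standard scaler does, the sketch below compares a by-hand calculation, (x - mean) / std for each column, against StandardScaler on a tiny made-up array. The array values are arbitrary and chosen only so that the two columns have very different magnitudes.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up feature matrix: 3 samples, 2 features of different magnitudes
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])

# By hand: subtract each column's mean, divide by its standard deviation
manual = (X - X.mean(axis=0)) / X.std(axis=0)

# The same operation using Scikit-learn
scaled = StandardScaler().fit_transform(X)

print(manual)
print("Same result:", np.allclose(manual, scaled))
```

After scaling, every column has a mean of zero and a standard deviation of one, so both features end up on a comparable scale.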

4. Initial Machine Learning Experiments

Our data is now ready and we can try out some machine learning models! Let us now do our first experiment with machine learning. Remember our goal?

We want to build a machine learning model to predict the name of the running app on the phone, given the observation of the behavior of the app. Thus we want to do an application classification task, given their measured usage of CPU_USAGE, cutime, num_threads, etc.

The main goal of this section is to guide you through all the steps necessary to train and assess the quality of the machine learning model. There are many models that we can try out, but all of them follow the same set of steps.

One important art in machine learning is choosing the best set of features to go into the model to achieve the best predictive ability. From the observations in the "Data Wrangling and Visualization" episode of the Big Data module, we can intuitively guess that CPU_USAGE and vsize may be two important features for an application detection task. After all, different applications would differ in their CPU and memory usage. Let us build our first machine learning model with these two features and observe the outcome of the prediction.

In [ ]:
features = df2_features_n[['CPU_USAGE', 'vsize']]
features.head()
In [ ]:
labels = df2_labels.copy()
labels.head()

The features variable contains the selected features to use in the model, and labels contains the associated application labels. (Scikit-learn classification models know how to handle different classes in the labels, so we do not need to perform special encoding for the labels.)

4.1 Train-Test Split

As the last step before building and training a ML model, we need to split the dataset into "training" and "testing" sets (both features and labels). The training set is used to train the model, whereas the testing set will be used to validate the performance of the trained model.

In [ ]:
"""Uncomment and run""";
#from sklearn.model_selection import train_test_split
#train_F, test_F, train_L, test_L = train_test_split(features, labels, test_size=0.2)

The _F and _L suffixes in the variables above refer to the features and labels, respectively. We reserve 80% of the dataset for training, and 20% for testing (test_size=0.2).

In [ ]:
print("Training set shapes:")
print(train_F.shape)
print(train_L.shape)
print("Testing set shapes:")
print(test_F.shape)
print(test_L.shape)
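
The split is random, so each run produces a different partition. The sketch below, on made-up data, shows two optional parameters of train_test_split that are not used in this notebook but may be helpful: random_state (to make the split reproducible) and stratify (to keep the class proportions the same in both sets). The variable names here are chosen to avoid clashing with the notebook's own variables.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 20 samples, 2 features, balanced between two classes
X = np.arange(40).reshape(20, 2)
y = np.array(["Facebook"] * 10 + ["WhatsApp"] * 10)

tr_F, te_F, tr_L, te_L = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

print("Training shapes:", tr_F.shape, tr_L.shape)
print("Testing shapes: ", te_F.shape, te_L.shape)
print("Test class counts:", np.unique(te_L, return_counts=True))
```

Fixing random_state is useful when you want to compare models on exactly the same training and testing sets.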

4.2 Building and Testing Machine Learning Models

In this notebook, we will experiment with two machine learning models: Decision Tree and Logistic Regression. We will start with establishing a Logistic Regression model.

Training the ML Model: Logistic Regression

In [ ]:
model_lr = LogisticRegression(solver='lbfgs')
model_lr.fit(train_F, train_L)

The first statement above creates a LogisticRegression object, named model_lr, that will perform the logistic regression classification. In the second statement, we train the model using the training dataset (features and labels).

Timing a Python statement

Did you notice that model_lr.fit does not return immediately? Indeed, training an ML model can take a while. It is useful to note how long the training takes. With Jupyter, you can time the function call easily, like this:

%time model_lr.fit(train_F, train_L)

Please do this from this time on so you will get the timing.
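
The %time magic only works inside Jupyter/IPython. In a plain Python script, a similar measurement can be made with the standard library's time.perf_counter. The following is a minimal sketch on synthetic data (the data and the name `toy_model` are assumptions for illustration):

```python
import time
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data: 1000 samples, 2 features, binary labels
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

toy_model = LogisticRegression(solver='lbfgs')

t0 = time.perf_counter()          # start the stopwatch
toy_model.fit(X, y)
elapsed = time.perf_counter() - t0
print(f"Training took {elapsed:.4f} seconds")
```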

After the training, model_lr is ready to do the classification task. But first we need to evaluate the model using the testing dataset (test_F and test_L) to measure its ability to make correct predictions. We will introduce two common metrics to evaluate our model's performance: accuracy_score and confusion_matrix. The accuracy score is the most popular metric, defined as the fraction of correct predictions (i.e. classifications) over the total number of predictions made.

Evaluating the ML Model

To evaluate, we use the trained model to predict the applications based on the test features:

In [ ]:
test_pred = model_lr.predict(test_F)
In [ ]:
test_pred[:20]

The prediction result is a numpy array. These predictions can be compared to the elements of test_L.

EXERCISE: Manually compare the first 20 elements of test_pred and test_L: how many correct answers do you get?

accuracy_score

Well, Scikit-learn has a lot of tools to make our lives easier! To measure accuracy (the fraction of the correct answers), we can simply use the accuracy_score:

In [ ]:
print("accuracy_score:        ", accuracy_score(test_L, test_pred))
print("num of correct answers:", accuracy_score(test_L, test_pred, normalize=False))
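
To make the definition of accuracy concrete, here is a small sketch that counts the correct answers by hand on five made-up predictions and compares the result with accuracy_score. The arrays `true_L` and `pred_L` are invented for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up true labels and predictions
true_L = np.array(["Facebook", "WhatsApp", "Facebook", "Facebook", "WhatsApp"])
pred_L = np.array(["Facebook", "Facebook", "Facebook", "WhatsApp", "WhatsApp"])

# Element-wise comparison gives True wherever the prediction is correct
num_correct = (true_L == pred_L).sum()
accuracy = num_correct / len(true_L)

print("correct answers:", num_correct, "out of", len(true_L))
print("accuracy by hand:      ", accuracy)
print("accuracy_score result: ", accuracy_score(true_L, pred_L))
```

The same element-wise comparison can be applied to the first 20 elements of the real predictions and labels to check a manual count.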

QUESTION:

  • What accuracy score did you obtain? Is this a good accuracy?

  • Where do the wrong answers go? What mispredictions does the model make?

Compare your result with ours: We obtained just below 70%, meaning only about 70% of the answers are correct.

confusion_matrix

The confusion_matrix function computes the confusion matrix, which tabulates the correct and incorrect answers, showing the counts of the various mispredictions:

In [ ]:
print("confusion_matrix:\n", confusion_matrix(test_L, test_pred))

This is a two-dimensional array where the row position corresponds to the true classes whereas the column position corresponds to the ML-predicted classes. But what are the classes in this matrix? We can query the ML object:

In [ ]:
model_lr.classes_

So class number 0 is Facebook and 1 is WhatsApp. Scikit-learn can even plot the confusion matrix in a nice-looking graph:

In [ ]:
# On scikit-learn >= 1.2, use sklearn.metrics.ConfusionMatrixDisplay.from_estimator(...) instead
sklearn.metrics.plot_confusion_matrix(model_lr, test_F, test_L, cmap='Blues_r')

In the example above, we have a total of ~75k Facebook records and ~47k WhatsApp records.
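
A confusion matrix is easier to read when its rows and columns are labeled. The sketch below builds one from six made-up predictions and wraps it in a DataFrame; the label lists and the "true"/"pred" prefixes are our own choices for illustration.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Made-up true labels and predictions
true_L = ["Facebook", "Facebook", "Facebook", "WhatsApp", "WhatsApp", "WhatsApp"]
pred_L = ["Facebook", "Facebook", "WhatsApp", "Facebook", "WhatsApp", "WhatsApp"]
classes = ["Facebook", "WhatsApp"]

# Rows are the true classes, columns are the predicted classes
cm = confusion_matrix(true_L, pred_L, labels=classes)
cm_df = pd.DataFrame(cm,
                     index=["true " + c for c in classes],
                     columns=["pred " + c for c in classes])
print(cm_df)

# Correct predictions lie on the diagonal
print("accuracy from matrix:", cm.trace() / cm.sum())
```

The off-diagonal entries show which class gets mistaken for which, something the accuracy score alone cannot tell us.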

Since we will be evaluating different models, let us define a function to evaluate the accuracy of a model:

In [ ]:
def model_evaluate(model, test_F, test_L):
    """Print the accuracy score and confusion matrix of `model` on the test set."""
    test_L_pred = model.predict(test_F)
    print("Evaluation by using model:", type(model).__name__)
    print("accuracy_score:", accuracy_score(test_L, test_L_pred))
    print("confusion_matrix:", "\n", confusion_matrix(test_L, test_L_pred))

Now we can use the model_evaluate function to evaluate the model, like this:

In [ ]:
model_evaluate(model_lr,test_F,test_L)

Alternative Model: Decision Tree

Next, we can try the Decision Tree model. We are still using the same training and testing sets, just a different model.

In [ ]:
"""Uncomment, complete, and run this code cell to train a decision tree.
   Use the same `fit()` function call as before to train the model.""";
#model_dtc = DecisionTreeClassifier(criterion='entropy', max_depth=3, min_samples_split=8)
#model_dtc.fit(#TODO)

We created a decision tree classifier and set three (hyper)parameters: criterion, max_depth, and min_samples_split. Check the accuracy of this model:

In [ ]:
"""Evaluate the accuracy of model_dtc""";
#model_evaluate(#TODO)
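
To get a feel for what a trained decision tree looks like, the sketch below fits a tree with the same hyperparameters on synthetic data and prints its rules using Scikit-learn's export_text. The data is made up (the label depends only on the first feature), and the feature names are borrowed from our dataset purely as labels.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic data: the class is decided by the first feature alone
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 2))
y = np.where(X[:, 0] > 0.5, "WhatsApp", "Facebook")

toy_tree = DecisionTreeClassifier(criterion='entropy', max_depth=3,
                                  min_samples_split=8, random_state=0)
toy_tree.fit(X, y)

# Print the learned if/else rules in text form
print(export_text(toy_tree, feature_names=["CPU_USAGE", "vsize"]))
print("depth of the tree:", toy_tree.get_depth())
```

The max_depth parameter caps how many successive questions the tree may ask, and min_samples_split sets the minimum number of samples a node must have before it can be split further.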

QUESTION:

How does the accuracy of the decision tree compare to that of the previous model (model_lr, logistic regression)? Discuss your findings!

Using the ML Model

Once a model has been trained and evaluated for accuracy, it is ready to be deployed! For example, we can use this model as part of a monitoring system on the phone: the software gathers the resource utilization data in real time, preprocesses it to extract the FEATURE_MATRIX, then invokes MODEL.predict(FEATURE_MATRIX) to classify the running apps. Isn't this cool?
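
One detail matters in deployment: new data must be scaled with the same scaler that was fitted on the training data. The sketch below shows one possible way to ensure this, by bundling the scaler and the classifier with Scikit-learn's make_pipeline. This is not how the notebook above is organized; the training data and the rule used to generate the labels are made up.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
# Made-up training data: two features on very different scales
X_train = np.column_stack([rng.uniform(0, 100, 500),
                           rng.uniform(1e9, 3e9, 500)])
# Made-up rule: the label depends on the second feature only
y_train = np.where(X_train[:, 1] > 2e9, "WhatsApp", "Facebook")

# The pipeline remembers the scaler fitted on the training data
MODEL = make_pipeline(StandardScaler(), LogisticRegression(solver='lbfgs'))
MODEL.fit(X_train, y_train)

# New measurements arriving later: same columns, same order, raw units
FEATURE_MATRIX = np.array([[12.0, 1.2e9],
                           [80.0, 2.9e9]])
print(MODEL.predict(FEATURE_MATRIX))
```

Calling predict on the pipeline applies the stored scaling first and then the classifier, so the raw measurements can be passed in directly.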

DISCUSSION: Consider how this type of machine learning can be used to detect known malware.

  • How does this differ from the traditional approach to malware/virus detection?
  • What are the strengths and weaknesses of each approach?

Know Your Model! Read the References

Scikit-learn makes it convenient to construct, train, and validate machine learning models. These models may appear to be "magic black boxes" that will give us the best solution for any given task. They are NOT. In fact, you need to become familiar with the basic ideas behind the models, their general behavior, their (hyper)parameters, etc. Do read the Scikit-learn references and gain familiarity with them.

The user's guide contains an excellent overview of these models with plenty of examples.

4.3 Summary: A Complete Pipeline for Machine Learning

Congratulations! You have completed all the necessary steps to perform supervised machine learning.

First, we need to define the purpose of the model, i.e. what the model is supposed to perform or predict. Once we are clear about this goal, then we follow a step-by-step procedure to do the machine learning modeling:

  1. Preprocessing the data;
  2. Separating the labels from the features, if necessary;
  3. Selecting the features to use for the model;
  4. Splitting the dataset into training data and testing data;
  5. Deciding which machine learning model to use: this largely depends on the nature of the task, and on the characteristics of the input data;
  6. Training (fitting) the machine learning model using the training data;
  7. Evaluating the model's performance.
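
The steps above can be sketched end-to-end in a few lines. The example below runs on a synthetic dataset (NOT the Sherlock data); the column `noise`, the numbers used to generate `vsize`, and the `toy_` variable names are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in for a raw dataset
rng = np.random.default_rng(42)
n = 400
toy_df = pd.DataFrame({
    "ApplicationName": rng.choice(["Facebook", "WhatsApp"], size=n),
    "CPU_USAGE": rng.uniform(0, 100, size=n),
    "noise": rng.normal(size=n),
})
base = np.where(toy_df["ApplicationName"] == "WhatsApp", 2.5e9, 2.0e9)
toy_df["vsize"] = base + rng.normal(scale=1e8, size=n)
toy_df.loc[::50, "vsize"] = np.nan               # inject some missing values

toy_df = toy_df.dropna()                                   # 1. preprocess
toy_labels = toy_df["ApplicationName"]                     # 2. separate labels
toy_feats = toy_df.drop("ApplicationName", axis=1)
toy_feats = toy_feats[["CPU_USAGE", "vsize"]]              # 3. select features
toy_feats_n = pd.DataFrame(StandardScaler().fit_transform(toy_feats),
                           columns=toy_feats.columns,
                           index=toy_feats.index)          #    (scaling)
toy_train_F, toy_test_F, toy_train_L, toy_test_L = train_test_split(
    toy_feats_n, toy_labels, test_size=0.2, random_state=0)  # 4. split
toy_model = DecisionTreeClassifier(max_depth=3, random_state=0)  # 5. choose
toy_model.fit(toy_train_F, toy_train_L)                    # 6. train
toy_acc = accuracy_score(toy_test_L, toy_model.predict(toy_test_F))  # 7. evaluate
print("accuracy:", toy_acc)
```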

In the upcoming notebook, we will focus on tuning the model in order to improve its performance. Our goal is to obtain the best model to accomplish the task at hand.

4.4 Additional Experiments

Next to the data wrangling/preparation step, the tweaking of machine learning models requires a lot of exploration and experimentation. Without some experimentation, it is difficult to ascertain that we have arrived at the best model. We encourage you to apply the same procedure summarized above to try out a few variants of ML models and evaluate their performance.

EXERCISE: Establish new ML models by using different sets of features as the input:

  1. Feature set 2: (CPU_USAGE, cutime)
  2. Feature set 3: (CPU_USAGE, queue)

Evaluate the accuracy of logistic regression and decision tree models when using these feature sets.

Experiment 2: Using feature set (CPU_USAGE, cutime)

Try both logistic regression and decision tree.

Hint: Start with redefining the features by choosing the new set of columns:

In [ ]:
"""Select `CPU_USAGE` and `cutime` to try out a new model""";
#features = df2_features_n[#TODO]
#features.head()
In [ ]:
"""Split the dataset into training and testing sets
   (the 80%-20% split used above is fine)""";

#train_F, test_F, train_L, test_L = train_test_split(#TODO...)
In [ ]:
"""Create and train a new Logistic Regression model using the new training dataset""";

model_lr2 = LogisticRegression(solver='lbfgs')
#model_lr2.fit(#TODO)
In [ ]:
"""Evaluate the new model_lr2""";

#model_evaluate(#TODO)

DISCUSSIONS:

  • Compare the performance of model_lr2 against the previous logistic regression model (model_lr). Which one does better?
  • Compare not only the accuracy, but also the confusion matrix.
  • Which model is more apt at getting the Facebook class correct? And which one is more reliable at predicting WhatsApp?

Keep Track of Your Results!

Your workshop instructor may set up a shared spreadsheet to save your results (at least accuracy); if so, please use that to keep track of all your experiment results! You can also store your result in a text file or a spreadsheet of your own. This will ease analysis and comparison later on.

WARNING Regarding Jupyter

Jupyter allows you to go back and re-run earlier code cells, but there are potential pitfalls you have to be aware of when doing this. Here is one of them: In the last few cells we redefined the variables features, test_F, test_L, ...; when we do this, we should not go back to the earlier cells where those variables held different values and re-run the cells. For example, if we re-train model_lr declared earlier with the new train_F and train_L, then the model will change (it becomes the same as model_lr2). On the other hand, we have lost access to the dataset we used to train and validate model_lr. One way to get around this problem is to define a new set of variables (i.e. features2, train_F2, train_L2, ...) that correspond to the new model (model_lr2). We will later introduce a different approach to deal with this sprawl of new variables.
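
One possible way to avoid overwriting variables (not necessarily the approach introduced later in this lesson) is to wrap one whole experiment in a function, so that the training and testing sets live only inside the function call. The function name `run_experiment`, its arguments, and the synthetic data below are all hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score


def run_experiment(all_features, all_labels, columns, model, seed=0):
    """Train and score `model` on the chosen columns; return a result record."""
    tr_F, te_F, tr_L, te_L = train_test_split(
        all_features[columns], all_labels, test_size=0.2, random_state=seed)
    model.fit(tr_F, tr_L)
    return {"model": type(model).__name__,
            "features": ", ".join(columns),
            "accuracy": accuracy_score(te_L, model.predict(te_F))}


# Synthetic stand-in data (not the Sherlock dataset)
rng = np.random.default_rng(0)
demo_F = pd.DataFrame(rng.normal(size=(300, 3)), columns=["a", "b", "c"])
demo_L = pd.Series(np.where(demo_F["a"] > 0, "app1", "app2"))

results = [
    run_experiment(demo_F, demo_L, ["a", "b"], LogisticRegression(solver='lbfgs')),
    run_experiment(demo_F, demo_L, ["b", "c"], LogisticRegression(solver='lbfgs')),
    run_experiment(demo_F, demo_L, ["a", "b"], DecisionTreeClassifier(max_depth=3)),
]
results_df = pd.DataFrame(results)
print(results_df)
```

Collecting the returned records in a list also gives us a results table for free, which helps with keeping track of the experiments.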

In [ ]:
"""Redo the same with a new Decision Tree model: create, train, evaluate""";

model_dtc2 = DecisionTreeClassifier(criterion='entropy',
                                    max_depth=3, min_samples_split=8)
#model_dtc2.fit(#TODO...)
#model_evaluate(#TODO...)

Notice how things also get more routine and boring now? This is where we can start leveraging the old-fashioned way of running Python: scripting!

DISCUSSION: Again, carefully examine the results of this model and compare it against previous models. Which model performs the best so far?

Experiment 3: Using feature set (CPU_USAGE, queue)

Now try creating a model using the features CPU_USAGE and queue. Again, try both logistic regression and decision tree. Use exactly the same procedure as we have practiced above.

In [ ]:
"""Use features `CPU_USAGE` and `queue`; create and train `model_lr3` and `model_dtc3`""";

#features = #TODO
#features.head()
In [ ]:
"""Create, train, validate the Logistic Regression model""";

#model_lr3 = LogisticRegression(solver='lbfgs')
#TODO
In [ ]:
"""Create the Decision Tree model""";

model_dtc3 = DecisionTreeClassifier(criterion='entropy',
                                    max_depth=3, min_samples_split=8)
#TODO

DISCUSSION

The scores for the new models that replaced vsize with queue were relatively low. Features CPU_USAGE and queue also happen to be the least correlated out of all the features. Is it clear that the (CPU_USAGE, vsize) is the best set of features to use? Clearly we need a way to select the best features to go into the model.
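
Correlation between features can be quantified with the DataFrame's corr() method, which returns the matrix of pairwise correlation coefficients. Below is a sketch on made-up data; the columns `f1`, `f2`, and `f3` are hypothetical and constructed so that `f2` is nearly a scaled copy of `f1` while `f3` is unrelated.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(size=500)
toy = pd.DataFrame({
    "f1": x,
    "f2": 2.0 * x + rng.normal(scale=0.1, size=500),  # nearly a copy of f1
    "f3": rng.normal(size=500),                        # unrelated to f1
})

# Pairwise (Pearson) correlation coefficients between all columns
corr = toy.corr()
print(corr.round(2))
```

A coefficient close to +1 or -1 indicates that two features carry nearly the same information, whereas a value near 0 indicates that they are not linearly related.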

QUESTIONS:

  • Can you think of any other features to select?
  • What would happen if we selected all features?

--> (Enter your responses here) <--

Unleashing It All: Using All Features

CHALLENGE: Use the cell below to test the machine learning results when all the available features are used.

In [ ]:
"""Use all features to train and build LR and DTC models!""";

QUESTIONS

If you have reached this point and done the last experiment, then:

  • What is the ultimate accuracy of the two models (LR and DTC) given the original dataset?

  • Why don't we want to use all the features to build a machine learning model in real-life problems?

--> (Enter your responses here) <--

5. Summary

(Edit this cell to produce your own summary: replace the questions below with answers based on your experimental results.)

  • The accuracy of the model is _____ on the input features.

  • Is training a ML model a fast or slow process?

  • What is the purpose of experimentation with different models in machine learning?

  • Write down a summary table / list of the accuracies and confusion matrices from different models.

  • Take heed of the Jupyter warning above regarding re-running earlier cells.

  • The process of overwriting the variables and running similar code cells in a strict, sequential order presents an annoyance; is there a better way to test these machine learning models outside of a Jupyter Notebook environment?