DeapSECURE module 3: Machine Learning
Welcome to the DeapSECURE online training program! This is a Jupyter notebook for the hands-on learning activities of the "Machine Learning" module, Episodes 4 and 5: "Data Preprocessing for Machine Learning" and "Machine Learning for Smartphone Application Classification".
Please visit the DeapSECURE website to learn more about our training program.
In this session, we will use this notebook to perform data preparation and initial experiments with machine learning, so that learners will see the complete steps of a machine learning workflow. We will build upon the skills and insights already acquired in the "Big Data" module.
If you are opening this notebook from Wahab cluster's OnDemand interface, you're all set.
If you see this notebook elsewhere and want to perform the exercises on Wahab cluster, please follow the steps outlined in our setup procedure.
Create a new Jupyter session using the "legacy" Python suite, then create a new "Python3" notebook. (See the ODU HPC wiki for more detailed help.)
Get the necessary files using commands below within Jupyter:
mkdir -p ~/CItraining/module-ml
cp -pr /shared/DeapSECURE/module-ml/. ~/CItraining/module-ml
cd ~/CItraining/module-ml
The file name of this notebook is ML-session-2.ipynb. In the code cells below, #TODO is used as a placeholder where you need to fill in something appropriate; run a cell by pressing Shift+Enter. Use ls to view the contents of a directory.
Quick reference: the summary table of the commonly used indexing syntax from our own lesson.
If you have not done so, we recommend that you review the Data Wrangling and Visualization episode of our Big Data lesson module.
Next, we need to import the required libraries into this Jupyter notebook: pandas, matplotlib.pyplot, and seaborn.
For Wahab cluster only: before importing these libraries, we need to load the DeapSECURE environment modules:
module("load", "DeapSECURE")
Now we can import the requisite Python libraries, most notably: pandas, NumPy, Matplotlib, Seaborn, and Scikit-learn.
"""Import the necessary Python modules""";
import os
import sys
import pandas
import numpy
import seaborn
from matplotlib import pyplot
import sklearn
# also add more tools:
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
# machine learning models:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
# for evaluating model performance
from sklearn.metrics import accuracy_score, confusion_matrix
%matplotlib inline
# Some advanced learners may like to use shortcuts,
# so we give them here:
pd = pandas
np = numpy
plt = pyplot
sns = seaborn
Let us load Sherlock's "Applications" dataset into a DataFrame for analysis and machine learning. The dataset contains measurements of resource utilization from two applications on a smartphone, namely Facebook and WhatsApp. This is the same dataset used in the Data Wrangling & Visualization notebook of DeapSECURE's Big Data lesson, where we familiarized ourselves with this dataset and identified the necessary steps to clean the data. In the present notebook, we will continue the data preprocessing to make it ready for machine learning.
df2 = pd.read_csv('sherlock/sherlock_mystery_2apps.csv')
df2.head(10)
The sherlock_mystery_2apps.csv file actually contains a small subset of a much larger Application.csv data file.
There are 14 columns in this subset:
- Unnamed: 0 [int]: Record index.
- ApplicationName [str]: Name of the application.
- CPU_USAGE [float]: CPU utilization (100% = completely busy CPU).
- cutime [int]: CPU "user time" spent by the spawned (child) processes.
- lru [int]: "Least Recently Used"; a parameter of the Android application memory management.
- num_threads [int]: Number of threads in this process.
- otherPrivateDirty [int]: The private dirty pages used by everything else other than the Dalvik heap and native heap.
- priority [int]: Process's scheduling priority.
- utime [int]: Measured CPU "user time".
- vsize [int]: The size of the virtual memory, in bytes.
- cminflt [int]: Count of minor faults incurred by the process's child processes.
- guest_time [int]: Running time of the "virtual CPU".
- Mem [int]: Size of memory, in bytes.
- queue [int]: The waiting order (priority).
Up to two-thirds of the time in data analysis is spent on data preparation, in order to achieve a clean, consistent, and processable state of the data. Data preparation is absolutely crucial to obtaining trustworthy insights from the data. We covered this topic in great detail in the Data Wrangling and Visualization lesson. (See also the corresponding notebook, BigData-session-3.ipynb.)
IMPORTANT!
You must do all the EXERCISEs in this section (do not skip any), so that you obtain the clean dataset.
Cleaning sherlock_mystery_2apps.csv
From the Data Wrangling & Visualization notebook, we identified the following issues with the raw data:
- an irrelevant column (Unnamed: 0),
- missing data (many values in the cminflt column are undefined),
- a redundant column (Mem, a duplicate of vsize).
We also identified the necessary course of action to address these defects. In this notebook, we will simply execute the necessary steps in order to prepare, or preprocess, the data for machine learning.
Removing the Unnamed: 0 Column
EXERCISE: Remove the Unnamed: 0 column from df2 because it is irrelevant for our analysis.
"""Drop the `Unnamed: 0` column from df2""";
#df2.drop(#TODO, inplace=True)
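If you need a hint, one possible completion of the cell above is sketched below (dropping the column by name); the next cell then checks for missing data:
"""One possible solution to the exercise above""";
df2.drop('Unnamed: 0', axis=1, inplace=True)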
df2.isna().sum()
OPTIONAL QUESTION: What fraction of the data is missing in that one column?
Hint: One way is to use df2[COLUMN_NAME].size to get the total number of rows.
"""Compute the fraction of mising data to the total number of rows in the `cminflt` column""";
#TODO
df2['cminflt'].isna().sum() / df2['cminflt'].size
OPTIONAL: It is interesting to plot where cminflt is missing, using the following trick.
df2['cminflt'].isna().astype(int).plot()
DECISION: We decided to drop the rows where the cminflt values are missing.
Reason: although the number of records with missing cminflt is large, we have a lot of data to begin with (nearly 800k rows in the raw dataset), so we can afford to drop them.
EXERCISE: Use the DataFrame's dropna() method to remove rows that have missing values. Perform the operation in-place. Then verify that the new df2 no longer has any missing data.
"""Remove rows with missing values from df2""";
#TODO
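One possible completion of this exercise is sketched below; the final statement verifies that no missing values remain:
"""One possible solution: drop rows with missing values, then verify""";
df2.dropna(inplace=True)
df2.isna().sum()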
No more duplicates are found.
Now let's remove Mem, guest_time and queue from the dataset, because they are duplicates of other features or have a very strong linear correlation with those features.
"""Drop the following columns from the DataFrame: Mem, guest_time, queue""";
#df2.drop(#TODO)
df2.drop(['Mem'], axis=1, inplace=True)
EXERCISE: Please verify that the unwanted columns above have now been removed from df2!
df2.columns
Quite frequently, labels (output values) come in the same dataframe as the features. In this case, we need to separate the label column(s) from the input features. In our dataset, the ApplicationName column contains the labels for classification machine learning. Let's extract that into df2_labels, whereas the features go to df2_features.
df2_labels = df2['ApplicationName']
df2_features = df2.drop('ApplicationName', axis=1)
Note: We do NOT drop the ApplicationName column in-place, so that we keep a backup of the cleaned data. Therefore we assign the feature-only dataframe to a new variable called df2_features.
Let's inspect the cleaned data (features & labels):
"""Print first few rows of the labels and features.
You can also inspect the descriptive statistics of the features.""";
print("Labels:")
#TODO
print("Features:")
#TODO
print("Statistics and range of values:")
#TODO
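If you need a hint, here is one possible way to fill in the cell above, using head() and describe() (other inspection methods work just as well):
"""One possible solution for inspecting the labels and features""";
print("Labels:")
print(df2_labels.head())
print("Features:")
print(df2_features.head())
print("Statistics and range of values:")
print(df2_features.describe())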
If a dataset contains categorical features, these features will need to be converted to numbers using schemes such as integer encoding or one-hot encoding to be properly represented as numbers. Consult the Data Preprocessing episode to learn more.
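As a quick illustration (not needed for our dataset, whose features are already numerical), here is how a hypothetical categorical column could be one-hot encoded with pandas; the column name conn_type and its values are made up for this demo:
"""Hypothetical example: one-hot encoding a made-up categorical feature""";
demo = pd.DataFrame({'conn_type': ['wifi', 'cellular', 'wifi', 'bluetooth']})
pd.get_dummies(demo, columns=['conn_type'])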
At this point, df2_features contains only numerical values. This condition is a prerequisite for using the dataset for machine learning. The feature-only part of the data is often referred to as the feature matrix, because by now it is in matrix form.
There are two more steps required before we actually perform the training step in machine learning:
feature scaling and test-train split.
Many ML algorithms work best when the typical values of the features are of the same order of magnitude. The range of a feature is the difference between the minimum and maximum values of that feature. Because in general each feature has its own value range, feature scaling is necessary to bring all the features into a similar order of magnitude.
Scikit-learn contains a number of scalers, each tailored to certain kinds of conditions in the dataset. We will use the standard scaler, which normalizes each feature according to its mean and standard deviation (each value x is transformed to (x - mean) / std). This is usually a reasonable starting point; you may want to try out other scalers, see Scikit-learn's documentation on data preprocessing.
scaler = preprocessing.StandardScaler()
scaler.fit(df2_features)
df2_features_n = pd.DataFrame(scaler.transform(df2_features),
columns=df2_features.columns,
index=df2_features.index)
df2_features_n.head(10)
The normalized features are stored in a new variable, df2_features_n.
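As an optional sanity check, each column of the standard-scaled feature matrix should now have a mean of approximately 0 and a standard deviation of approximately 1:
"""Optional sanity check on the scaled features""";
print(df2_features_n.mean().round(3))
print(df2_features_n.std().round(3))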
Our data is now ready and we can try out some machine learning models! Let us now do our first experiment with machine learning. Remember our goal?
We want to build a machine learning model to predict the name of the app running on the phone, given observations of the app's behavior. In other words, we want to perform an application classification task, given the measured usage of CPU_USAGE, cutime, num_threads, etc.
The main goal of this section is to guide you through all the steps necessary to train and assess the quality of a machine learning model. There are many models that we can try out, but all of them follow the same set of steps.
One important art in machine learning is choosing the best set of features to go into the model to achieve the best predictive ability.
From the observations in the "Data Wrangling and Visualization" episode of the Big Data module, we can intuitively guess that CPU_USAGE and vsize may be two important features for an application detection task. After all, different applications differ in their CPU and memory usage. Let us build our first machine learning model with these two features and observe the outcome of the prediction.
features = df2_features_n[['CPU_USAGE', 'vsize']]
features.head()
labels = df2_labels.copy()
labels.head()
The features variable contains the selected input features for the model, and labels contains the associated application labels. (Scikit-learn classification models know how to handle different classes in the labels, so we do not need to perform any special encoding for the labels.)
As the last step before building and training a ML model, we need to split the dataset into "training" and "testing" sets (both features and labels). The training set is used to train the model, whereas the testing set will be used to validate the performance of the trained model.
"""Uncomment and run""";
#from sklearn.model_selection import train_test_split
#train_F, test_F, train_L, test_L = train_test_split(features, labels, test_size=0.2)
The _F and _L suffixes in the variables above refer to the features and labels, respectively. We reserve 80% of the dataset for training and 20% for testing (test_size=0.2).
print("Training set shapes:")
print(train_F.shape)
print(train_L.shape)
print("Testing set shapes:")
print(test_F.shape)
print(test_L.shape)
model_lr = LogisticRegression(solver='lbfgs')
model_lr.fit(train_F, train_L)
The first statement above creates a LogisticRegression object, named model_lr, that will perform the logistic regression classification. In the second statement, we train the model using the training dataset (features and labels).
Timing a Python statement
Do you notice that model_lr.fit does not return immediately? Indeed, training an ML model can take a while. It is useful to note how long the training takes. With Jupyter, you can time the function call easily, like this:
%time model_lr.fit(train_F, train_L)
Please do this from this point on so you will get the timing.
After the training, model_lr is ready to do the classification task. But first we need to evaluate the model using the testing dataset (test_F and test_L) to measure its ability to make correct predictions. The accuracy score is the most popular metric, defined as the fraction of the number of correct predictions (i.e. classifications) over the total number of predictions made. We will introduce two common tools to evaluate our model's performance: accuracy_score and confusion_matrix.
To evaluate, we use the trained model to predict the applications based on the test features:
test_pred = model_lr.predict(test_F)
test_pred[:20]
The prediction result is a NumPy array. These predictions can be compared to the elements of test_L.
EXERCISE: Manually compare the first 20 elements of test_pred and test_L: how many correct answers do you get?
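If counting by eye gets tedious, here is one possible way to do the same comparison programmatically (a sketch using NumPy elementwise comparison):
"""One possible programmatic comparison of the first 20 predictions""";
matches = (test_pred[:20] == test_L[:20].values)
print(matches)
print("correct answers:", matches.sum(), "out of 20")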
accuracy_score
Well, Scikit-learn has a lot of tools to make our lives easier! To measure the accuracy (the fraction of the correct answers), we can simply use the accuracy_score function:
print("accuracy_score: ", accuracy_score(test_L, test_pred))
print("num of correct answers:", accuracy_score(test_L, test_pred, normalize=False))
QUESTIONS:
- What accuracy score did you obtain? Is this a good accuracy?
- Where do the wrong answers go? What mispredictions does the model make?
Compare your result with ours: we obtained just below 70%, meaning only about 70% of the answers are correct.
confusion_matrix
The confusion_matrix function computes the confusion matrix, which quantifies the correct and incorrect answers by showing the counts of the various mispredictions:
print("confusion_matrix:\n", confusion_matrix(test_L, test_pred))
This is a two-dimensional array where the row position corresponds to the true classes whereas the column position corresponds to the ML-predicted classes. But what are the classes in this matrix? We can query the ML object:
model_lr.classes_
So class number 0 is Facebook and 1 is WhatsApp.
Scikit-learn can even plot the confusion matrix in a nice-looking graph:
sklearn.metrics.plot_confusion_matrix(model_lr, test_F, test_L, cmap='Blues_r')
In the example above, we have a total of ~75k Facebook records and ~47k WhatsApp records.
Since we will be evaluating different models, let us define a function to evaluate the accuracy of a model:
def model_evaluate(model, test_F, test_L):
    """Evaluate a trained classifier: print its accuracy score and confusion matrix."""
    test_L_pred = model.predict(test_F)
    print("Evaluation by using model:", type(model).__name__)
    print("accuracy_score:", accuracy_score(test_L, test_L_pred))
    print("confusion_matrix:\n", confusion_matrix(test_L, test_L_pred))
Now we can use the model_evaluate function to evaluate the model, like this:
model_evaluate(model_lr, test_F, test_L)
Next, we can try the Decision Tree model. We are still using the same training and testing sets, just a different model.
"""Uncomment, complete, and run this code cell to train a decision tree.
Use the same `fit()` function call as before to train the model.""";
#model_dtc = DecisionTreeClassifier(criterion='entropy', max_depth=3, min_samples_split=8)
#model_dtc.fit(#TODO)
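If you need a hint, here is one possible completion of the cell above; the %time magic reports the training time, as suggested earlier:
"""One possible solution: create and train the decision tree""";
model_dtc = DecisionTreeClassifier(criterion='entropy', max_depth=3, min_samples_split=8)
%time model_dtc.fit(train_F, train_L)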
We created a decision tree classifier and adjusted two (hyper)parameters: max_depth and min_samples_split.
Check the accuracy of this model:
"""Evaluate the accuracy of model_dtc""";
#model_evaluate(#TODO)
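One possible completion of the cell above:
"""One possible solution""";
model_evaluate(model_dtc, test_F, test_L)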
QUESTION: How does the accuracy of the decision tree compare to that of logistic regression? Discuss your finding and compare with the previous model (model_lr)!
Once a model has been trained and evaluated for accuracy, it is ready to be deployed!
For example, we can use this model as part of a monitoring system on the phone: the software gathers the resource utilization data in real time, preprocesses it to extract the FEATURE_MATRIX, and invokes MODEL.predict(FEATURE_MATRIX) to classify the running apps.
Is this cool?
DISCUSSION: Consider how this type of machine learning can be used to detect known malware.
Scikit-learn makes it convenient to construct, train, and validate machine learning models. The models may appear to be "magic black boxes" that will give us the best solution for a given task. They are NOT. In fact, you need to become familiar with the basic ideas behind the models, their general behavior, their (hyper)parameters, etc. Do read the Scikit-learn references and gain familiarity with them. The user's guide contains an excellent overview of these models with plenty of examples.
Congratulations! You have completed all the necessary steps to perform supervised machine learning.
First, we need to define the purpose of the model, i.e. what the model is supposed to perform or predict. Once we are clear about this goal, we follow a step-by-step procedure to do the machine learning modeling:
1. Select the input features (and the corresponding labels).
2. Split the dataset into training and testing sets.
3. Create the model object.
4. Train the model on the training set with fit().
5. Evaluate the trained model on the testing set (accuracy score, confusion matrix).
In the upcoming notebook, we will focus on tuning the model in order to improve its performance. Our goal is to obtain the best model to accomplish the task at hand.
Next to the data wrangling/preparation step, the tweaking of machine learning models requires a lot of exploration and experimentation. Without some experimentation, it is difficult to ascertain that we have arrived at the best model. We encourage you to apply the same procedure summarized above to try out a few variants of ML models and evaluate their performance.
EXERCISE: Establish new ML models by using different sets of features as the input:
- (CPU_USAGE, cutime)
- (CPU_USAGE, queue)
Evaluate the accuracy of the logistic regression and decision tree models when using these feature sets.
Feature set (CPU_USAGE, cutime)
Try both logistic regression and decision tree.
Hint: Start by redefining features with the new set of columns:
"""Select `CPU_USAGE` and `cutime` to try out a new model""";
#features = df2_features_n[#TODO]
#features.head()
"""Split the dataset into training and testing sets
(the 80%-20% split used above is fine)""";
#train_F, test_F, train_L, test_L = train_test_split(#TODO...)
"""Create and train a new Logistic Regression model using the new training dataset""";
model_lr2 = LogisticRegression(solver='lbfgs')
#model_lr2.fit(#TODO)
"""Evaluate the new model_lr2""";
#model_evaluate(#TODO)
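For reference, here is one possible end-to-end completion of the three cells above (select the features, split the dataset, then train and evaluate the model):
"""One possible solution for the (CPU_USAGE, cutime) feature set""";
features = df2_features_n[['CPU_USAGE', 'cutime']]
train_F, test_F, train_L, test_L = train_test_split(features, labels, test_size=0.2)
model_lr2 = LogisticRegression(solver='lbfgs')
%time model_lr2.fit(train_F, train_L)
model_evaluate(model_lr2, test_F, test_L)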
DISCUSSION: Compare model_lr2 against the previous logistic regression model (model_lr). Which one does better?
Keep Track of Your Results!
Your workshop instructor may set up a shared spreadsheet to save your results (at least the accuracy); if so, please use it to keep track of all your experiment results! You can also store your results in a text file or a spreadsheet of your own. This will ease analysis and comparison later on.
WARNING Regarding Jupyter
Jupyter allows you to go back and re-run earlier code cells, but there are potential pitfalls you have to be aware of when doing this. Here is one of them: in the last few cells we redefined the variables features, test_F, test_L, ...; once we do this, we should not go back to the earlier cells where those variables held different values and re-run them. For example, if we re-train model_lr (declared earlier) with the new train_F and train_L, then the model will change (it becomes the same as model_lr2), and we will have lost access to the dataset we used to train and validate model_lr. One way to get around this problem is to define a new set of variables (i.e. features2, train_F2, train_L2, ...) that correspond to the new model (model_lr2). We will later introduce a different approach to deal with the sprawl of new variables.
"""Redo the same with a new Decision Tree model: create, train, evaluate""";
model_dtc2 = DecisionTreeClassifier(criterion='entropy',
max_depth=3, min_samples_split=8)
#model_dtc2.fit(#TODO...)
#model_evaluate(#TODO...)
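One possible completion (model_dtc2 was already created in the cell above):
"""One possible solution: train and evaluate model_dtc2""";
%time model_dtc2.fit(train_F, train_L)
model_evaluate(model_dtc2, test_F, test_L)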
Notice how things are getting more routine and boring now? This is where we can start leveraging the old-fashioned way of running Python: scripting!
DISCUSSION: Again, carefully examine the results of this model and compare it against previous models. Which model performs the best so far?
Feature set (CPU_USAGE, queue)
Now try creating a model using the features CPU_USAGE and queue.
Try both logistic regression and decision tree, again.
Use exactly the same procedure as we have practiced above.
"""Use features `CPU_USAGE` and `queue`; create and train `model_lr3` and `model_dtc3`""";
#features = #TODO
#features.head()
"""Create, train, validate the Logistic Regression model""";
#model_lr3 = LogisticRegression(solver='lbfgs')
#TODO
"""Create the Decision Tree model""";
model_dtc3 = DecisionTreeClassifier(criterion='entropy',
max_depth=3, min_samples_split=8)
#TODO
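One possible completion is sketched below. Note that it assumes the queue column is still present in your df2_features_n (i.e. that it was not dropped from the feature matrix during cleaning):
"""One possible solution for the (CPU_USAGE, queue) feature set""";
features = df2_features_n[['CPU_USAGE', 'queue']]
train_F, test_F, train_L, test_L = train_test_split(features, labels, test_size=0.2)
model_lr3 = LogisticRegression(solver='lbfgs')
%time model_lr3.fit(train_F, train_L)
model_evaluate(model_lr3, test_F, test_L)
%time model_dtc3.fit(train_F, train_L)
model_evaluate(model_dtc3, test_F, test_L)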
DISCUSSION
The scores for the new models that replaced vsize with queue were relatively low. The features CPU_USAGE and queue also happen to be the least correlated of all the features. Is it clear that (CPU_USAGE, vsize) is the best set of features to use?
Clearly we need a way to select the best features to go into the model.
QUESTIONS:
--> (Enter your responses here) <--
CHALLENGE: Use the cell below to test the machine learning results when all the available features are used.
"""Use all features to train and build LR and DTC models!""";
QUESTIONS
If you have reached this point and done the last experiment, then:
What is the ultimate accuracy of the two models (LR and DTC) given the original dataset?
Why don't we want to use all the features to build a machine learning model in real-life problems?
--> (Enter your responses here) <--
(Edit this cell to produce your own summary. Replace the questions below with answers based on your experimental results.)
The accuracy of the model is _____ on the input features.
Is training a ML model a fast or slow process?
What is the purpose of experimentation with different models in machine learning?
Write down a summary table / list of the accuracies and confusion matrices from different models.
Take heed of the Jupyter warning above regarding re-running earlier cells.
The process of overwriting variables and running similar code cells in a strict, sequential order presents an annoyance; is there a better way to test these machine learning models outside of a Jupyter Notebook environment?