DeapSECURE module 4: Deep Learning
Welcome to the DeapSECURE online training program! This is a Jupyter notebook for the hands-on learning activities of the "Deep Learning" module. Please visit the DeapSECURE website to learn more about our training program.
In this session, we will use this notebook to prepare the Sherlock dataset for the DL lesson.
In data analytics and machine learning, up to two-thirds of the time is actually spent preparing the data. This may sound like a waste of time, but that step is absolutely crucial to obtaining trustworthy insight from the data. The goal of data preparation is to achieve a clean, consistent, and processable state of data.
In this session, you will perform the data preparation steps used in the previous ML workshop.
QUICK LINKS
If you are opening this notebook from the Wahab OnDemand interface, you're all set.
If you see this notebook elsewhere, and want to perform the exercises on Wahab cluster, please follow the steps outlined in our setup procedure.
Get the necessary files using commands below within Jupyter:
mkdir -p ~/CItraining/module-nn
cp -pr /shared/DeapSECURE/module-nn/. ~/CItraining/module-nn
cd ~/CItraining/module-nn
The file name of this notebook is NN-session-2.ipynb.
Throughout this notebook, #TODO is used as a placeholder where you need to fill in something appropriate.
To run the code in a cell, press Shift+Enter.
Summary table of the commonly used indexing syntax from our own lesson.
We recommend you open these in separate tabs or print them; they are a handy reference when writing your own code.
As the next step, we need to import the required libraries into this Jupyter Notebook: pandas, numpy, matplotlib.pyplot, sklearn, and tensorflow.
For Wahab cluster only: before importing these libraries, we have to load the DeapSECURE environment module:
# Run to load environment modules on HPC
module("load", "DeapSECURE")
A few additional modules need to be loaded to access the GPU via CUDA and the TensorFlow library. Keras is now part of TensorFlow:
module("load", "cuda")
module("load", "py-tensorflow")
module("list")
Now we can import all the required modules into Python:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn
from sklearn import preprocessing
import tensorflow as tf
import tensorflow.keras as keras
%matplotlib inline
# tools for machine learning:
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
# for evaluating model performance
from sklearn.metrics import accuracy_score, confusion_matrix
# classic machine learning models:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
# Import KERAS objects
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras import optimizers
First of all, let us review the data preparation, data wrangling, and machine learning steps one by one on this bigger dataset. For the first step above we actually use two Python scripts: Prep_ML.py and analysis_sherlock_ML.py.
The script Prep_ML.py contains all the steps necessary to read the data, remove useless data, handle missing data, extract the feature matrix and labels, and then perform the train/test split. Load the commands contained in this script into your current Jupyter notebook using IPython's %load magic; then you can run those commands.
The script analysis_sherlock_ML.py is a library of functions which contains the steps we described in the earlier lesson. These functions are clearly named, such as preprocess_sherlock_19F17C, step0_label_features, step_onehot_encoding, and step_feature_scaling.
Uncomment and run the magic statement %load Prep_ML.py below. (It will replace the cell with the contents of Prep_ML.py.) You may have to run the cell twice with Shift+Enter to actually execute the loaded code.
#%load Prep_ML.py
"""^^^ Uncomment and run the magic statement above.
You may have to run the cell twice to actually run this cell!""";
After the cell above is executed, you will find the training & test data in the following members of the Rec object:
- Rec.df_features: DataFrame of the features for the machine learning models
- Rec.labels: the labels (expected output of the ML models)
- Rec.train_features: training data's features
- Rec.test_features: testing data's features
- Rec.train_labels: training data's labels
- Rec.test_labels: testing data's labels
We use this approach to manage the complexity of having too many variables (e.g. train_F, train_F2, train_F3, ...).
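As a quick sanity check (assuming the %load Prep_ML.py cell above has been run), you can inspect the shapes of these members:
# Quick sanity check on the prepared data
print("feature matrix:", Rec.df_features.shape)
print("training set:  ", Rec.train_features.shape, Rec.train_labels.shape)
print("test set:      ", Rec.test_features.shape, Rec.test_labels.shape)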
This is a more diverse subset of the SherLock Application dataset, covering significantly more applications and features.
Your challenge is to train a similar model (like in the previous notebooks) using the "18-apps" dataset to correctly classify running apps on the smartphone with very high accuracy (> 99%).
EXERCISE
Take a peek at the training feature DataFrame.
"""Take a peek at the training feature DataFrame.""";
#TODO
From the above, we know that we are working with a significantly larger data file, sherlock/sherlock_18apps.csv.
Question: open analysis_sherlock_ML.py and examine how this function is defined. This dataset has 19 features for each record and 18 applications in total.
Now, we first try the traditional machine learning algorithms we learned in the previous session. Here we test Decision Tree and Logistic Regression.
To simplify the code, we will use the model_evaluate function to evaluate the performance of a machine learning model (whether a traditional ML model or a neural network model).
def model_evaluate(model,test_F,test_L):
test_L_pred = model.predict(test_F)
print("Evaluation by using model:",type(model).__name__)
print("accuracy_score:",accuracy_score(test_L, test_L_pred))
print("confusion_matrix:","\n",confusion_matrix(test_L, test_L_pred))
return
ML_dtc = DecisionTreeClassifier(criterion='entropy',
max_depth=6,
min_samples_split=8)
%time ML_dtc.fit(Rec.train_features, Rec.train_labels)
model_evaluate(ML_dtc, Rec.test_features, Rec.test_labels)
ML_log = LogisticRegression(solver='lbfgs')
%time ML_log.fit(Rec.train_features, Rec.train_labels)
model_evaluate(ML_log, Rec.test_features, Rec.test_labels)
QUESTIONS:
By now, we have a pretty good background knowledge of this dataset, and we know the accuracy scores we can get by using the Decision Tree and Logistic Regression methods; they are reasonably good, but not close to 99%.
Do you notice that the training of the logistic regression model takes a while? Often we want to know how long it actually takes. We can get this timing easily in Jupyter by prepending %time to the Python statement whose execution time we want to measure.
EXERCISE: If you haven't already, let's retrain the logistic regression model here and get the timing:
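For example (this simply repeats the earlier training cell with the %time magic; feel free to type it yourself):
# Re-train the logistic regression model and measure the wall-clock time
ML_log = LogisticRegression(solver='lbfgs')
%time ML_log.fit(Rec.train_features, Rec.train_labels)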
About the Warning Message
The training phase ends with a warning:
ConvergenceWarning: lbfgs failed to converge (status=1): STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
This happens because the solver fails to reach convergence within the maximum number of iterations (default=100). You may want to investigate by trying different solvers in the LogisticRegression object. Please see the Scikit-learn documentation on Logistic Regression, specifically the solver argument, if you are interested. Our internal test showed that with another solver ...
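If you would like to experiment, here is a minimal sketch; the alternative solver (newton-cg) and the larger max_iter value are illustrative choices only:
# Illustrative only: try a different solver and a higher iteration cap.
# 'newton-cg' and max_iter=1000 are example values; see the scikit-learn
# documentation for the full list of solvers and their trade-offs.
ML_log2 = LogisticRegression(solver='newton-cg', max_iter=1000)
%time ML_log2.fit(Rec.train_features, Rec.train_labels)
model_evaluate(ML_log2, Rec.test_features, Rec.test_labels)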
Let us now proceed by building some neural network models to classify smartphone apps.
When using neural networks for a classification task, we need to encode the labels using one-hot encoding. This is necessary because many machine learning algorithms require numeric labels for implementation efficiency; as such, any categorical data must be converted to numerical data.
For more information on why we need one-hot encoding, see these articles:
Comment: We did not have to do one-hot encoding in scikit-learn, because ML objects such as DecisionTreeClassifier do it for us behind the scenes.
Similarly, any input features that are of categorical data type will also have to be encoded using either integer encoding or one-hot encoding.
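For illustration, here is a tiny, self-contained pandas example of both encodings; the column values are made up and unrelated to the Sherlock data:
import pandas as pd

# A made-up categorical column, for illustration only
state = pd.Series(['idle', 'active', 'idle', 'background'])

# Integer encoding: each category is mapped to an integer code
codes, categories = pd.factorize(state)
print(codes)        # array of integer codes, e.g. [0 1 0 2]
print(categories)   # the distinct categories

# One-hot encoding: one indicator (0/1) column per category
print(pd.get_dummies(state))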
Rec.train_L_onehot = pd.get_dummies(Rec.train_labels)
Rec.test_L_onehot = pd.get_dummies(Rec.test_labels)
For one-hot encoding, there is a 1 in a distinct spot for every category and a 0 everywhere else. The cell below shows the first five rows; notice that there is only a single 1 in each row, with the rest being 0.
Rec.train_L_onehot.head()
Here, we first give an example of a neural network model without any hidden layer.
def NN_Model_no_hidden(learning_rate):
"""Definition of deep learning model with one dense hidden layer"""
model = Sequential([
Dense(18, activation='softmax',input_shape=(19,),kernel_initializer='random_normal')
])
    adam = tf.keras.optimizers.Adam(learning_rate=learning_rate, beta_1=0.9, beta_2=0.999, amsgrad=False)
model.compile(optimizer=adam,
loss='categorical_crossentropy',
metrics=['accuracy'])
return model
We train this model with an initial learning rate of 0.0003.
model_0 = NN_Model_no_hidden(0.0003)
model_0_history = model_0.fit(Rec.train_features,
Rec.train_L_onehot,
epochs=5, batch_size=32,
validation_data=(Rec.test_features, Rec.test_L_onehot),
verbose=2)
To better analyze the training process, we would like to visualize the model training history. In Keras, the fit method returns a History object; from its history attribute we create two charts:
- A plot of accuracy on the training and validation datasets over training epochs.
- A plot of loss on the training and validation datasets over training epochs.
model_0_history.history.keys()
def plot_loss(model_history):
# summarize history for loss
plt.plot(model_history.history['loss'])
plt.plot(model_history.history['val_loss'])
plt.title('Model Loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper right')
plt.show()
def plot_acc(model_history):
    # summarize history for accuracy
    # NOTE: the metric key is 'accuracy' in recent TensorFlow/Keras versions,
    # but 'acc' in older ones; use whichever is present
    acc_key = 'accuracy' if 'accuracy' in model_history.history else 'acc'
    plt.plot(model_history.history[acc_key])
    plt.plot(model_history.history['val_' + acc_key])
plt.title('Model Accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()
plot_loss(model_0_history)
plot_acc(model_0_history)
QUESTIONS:
What is the effect of changing the learning rate? (Examples: 0.03, 0.003, or 0.00003)
What will happen if we increase the epochs value? (To 10, 20?)
What is the ultimate accuracy of the one-layer model compared to Decision Tree and Logistic Regression?
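One way to probe these questions is to rebuild the model with a different learning rate and more epochs, then compare the curves; the values below (0.003 and 20 epochs) are example choices only:
# Example exploration: a larger learning rate and more epochs (values are illustrative)
model_0b = NN_Model_no_hidden(0.003)
model_0b_history = model_0b.fit(Rec.train_features, Rec.train_L_onehot,
                                epochs=20, batch_size=32,
                                validation_data=(Rec.test_features, Rec.test_L_onehot),
                                verbose=2)
plot_loss(model_0b_history)
plot_acc(model_0b_history)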
Apparently, the first NN model that we created above did not perform very well.
One way to improve the performance of a NN model is to add one or more hidden layers.
The function below has a hidden layer and an output layer, and uses the adam optimizer that was used in the previous notebook.
We will use this function to test the performance based on different parameters (number of hidden neurons, hidden layers, learning rate, etc.).
To start, let us try an example of a model with 1 hidden layer, 18 hidden neurons, and a learning rate of 0.0003.
def NN_Model(hidden_neurons,learning_rate):
"""Definition of deep learning model with one dense hidden layer"""
model = Sequential([
# More hidden layers can be added here
Dense(hidden_neurons, activation='relu',input_shape=(19,),kernel_initializer='random_normal'), # Hidden Layer
Dense(18, activation='softmax') # Output Layer
])
    adam = tf.keras.optimizers.Adam(learning_rate=learning_rate, beta_1=0.9, beta_2=0.999, amsgrad=False)
model.compile(optimizer=adam,
loss='categorical_crossentropy',
metrics=['accuracy'])
return model
model_1 = NN_Model(18,0.0003)
model_1_history=model_1.fit(Rec.train_features,
Rec.train_L_onehot,
epochs=10, batch_size=32,
validation_data=(Rec.test_features, Rec.test_L_onehot),
verbose=2)
Self Exploration:
Now that we know how to use the function NN_Model, we can use it to run a variety of tests with different parameters.
Below is a list of what could be interesting to explore; feel free to experiment with your own ideas as well.
NOTE:
The easiest way to do this exploration is to simply copy the code in the cell above and paste it into a new cell below, since most of the parameters (hidden_neurons, learning_rate, batch_size, etc.) can be changed when calling the function or when fitting the model.
However, to change the number of hidden layers, the original function NN_Model will need to be modified; therefore, it is best to do this last. A sketch of such a modification is given below.
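For reference, here is a minimal sketch (not part of the original hands-on package) of a variant of NN_Model with two hidden layers; the layer sizes are illustrative only:
def NN_Model_2_hidden(hidden_neurons, learning_rate):
    """Sketch: deep learning model with two dense hidden layers."""
    model = Sequential([
        Dense(hidden_neurons, activation='relu', input_shape=(19,),
              kernel_initializer='random_normal'),   # hidden layer 1
        Dense(hidden_neurons, activation='relu'),    # hidden layer 2
        Dense(18, activation='softmax')              # output layer
    ])
    adam = tf.keras.optimizers.Adam(learning_rate=learning_rate,
                                    beta_1=0.9, beta_2=0.999, amsgrad=False)
    model.compile(optimizer=adam,
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model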
"""Start self exploration here""";
This process of experimentation with different parameters for the neural network can get repetitive and cause this notebook to become very long. Instead, it would be more beneficial to run experiments like this in a scripting environment. To do this, we need to identify the relevant code elements from this notebook to pick out for our script.
In brief, once the initial experiments are done and we have established a working pipeline for machine learning, we need to change the way we work. Real machine-learning work requires many repetitive experiments, each of which may take a long time to complete. Instead of running many experiments in Jupyter notebooks, where each requires us to wait a while to finish, we need to be able to carry out many experiments in parallel so that we can obtain our results in a timely manner. This is the key reason why we really should turn these experiments into scripts and submit them to run in batch (non-interactive) mode. HPC is well suited for this type of workflow; in fact, it is most efficient when used in this way. The key components of this "batch" way of working are a Python script that performs one experiment and a job script that submits it to the cluster, as described below.
In your hands-on package, there is a folder called expts-sherlock which contains a sample Python script and a SLURM job script that you can submit to the HPC cluster:
- NN_Model-064n.py shows an example of how a script converted from this notebook would look (a condensed sketch is given after this list). We recommend only one experiment per script to avoid complications.
- NN_Model-064n.wahab.job is the corresponding job script for ODU's Wahab cluster.
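Below is a condensed, hypothetical sketch of what such a converted Python script might contain; the actual NN_Model-064n.py in the expts-sherlock folder is the reference version, and the way Prep_ML.py is executed here (via exec), the 64-neuron layer, and the output file name are assumptions for illustration:
# Hypothetical condensed sketch of a batch experiment script (illustrative only)
import pandas as pd
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Re-run the data preparation to rebuild the Rec object
# (assumes Prep_ML.py defines Rec when executed; adjust the path if the
#  script lives in a different folder than Prep_ML.py)
exec(open("Prep_ML.py").read())

train_L_onehot = pd.get_dummies(Rec.train_labels)
test_L_onehot = pd.get_dummies(Rec.test_labels)

# One experiment per script: a single hidden layer with 64 neurons (example value)
model = Sequential([
    Dense(64, activation='relu', input_shape=(19,)),
    Dense(18, activation='softmax')
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.0003),
              loss='categorical_crossentropy',
              metrics=['accuracy'])

history = model.fit(Rec.train_features, train_L_onehot,
                    epochs=30, batch_size=32,
                    validation_data=(Rec.test_features, test_L_onehot),
                    verbose=2)

# Save the training history so it can be analyzed later without re-running
pd.DataFrame(history.history).to_csv("NN_Model-064n-history.csv")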