This lesson is still being designed and assembled (Pre-Alpha version)

Classifying Smartphone Apps with Keras

Overview

Teaching: 20 min
Exercises: 40 min
Questions
  • How do we build a neural network in Keras to perform multi-class classification?

  • How do we monitor the progress of neural-network training?

  • How do we set up appropriate hyperparameters?

Objectives
  • Understanding how to build a general neural network with Keras.

  • Understanding how to tune the hyperparameters based on the results.

Prior to this episode, we focused on an extremely simple problem: a binary classification task using the sherlock_2apps (“2-apps”) data. This is a very basic problem; real-world problems are far more complex. Beginning with this episode, we will use a richer subset of Sherlock’s Applications.csv to build a classifier that distinguishes 18 smartphone apps. As in the previous episode, we will build simple neural networks with Keras (using all the essential building blocks of a neural network introduced in that episode) and quantify their performance metrics. As we progress, we will introduce additional techniques that are helpful in practical ML modeling.

Along the way, we will be guided to answer the following questions:

Loading Required Python Libraries and Objects

Please make sure that the necessary libraries and objects are loaded into your environment:

import os
import sys

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Tools for machine learning:
import sklearn
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
# for evaluating model performance
from sklearn.metrics import accuracy_score, confusion_matrix
# classic machine learning models
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Tools for deep learning:
import tensorflow as tf
import tensorflow.keras as keras

# Import key Keras objects
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam

These are the same imports used at the beginning of the previous episode on binary classification. We recommend starting this episode with a fresh Python session to avoid any confusion caused by using a different dataset.

The sherlock_18apps Dataset

The dataset used for this episode is the “18-apps” dataset, a significantly more diverse subset of the SherLock dataset that includes 18 apps. Not only are there more classes (apps), but the dataset also contains more features. As with the 2-apps counterpart, the rows stored in this table were generated by periodically measuring the resource (CPU, memory, network, input/output) utilization stats for the individual apps. The table below presents the features of this dataset and offers a brief explanation of each:

No Feature Data Type Meaning
0 CPU_USAGE float Instantaneous percent utilization of CPU
1 UidRxBytes int Number of bytes received by this app via network
2 UidRxPackets int Number of network packets received by this app
3 UidTxBytes int Number of bytes transmitted (sent) by this app via network
4 UidTxPackets int Number of network packets transmitted by this app
5 cutime int (Linux) Amount of CPU time spent in “user-mode” by the spawned & waited on child process
6 guest_time int (Linux) Amount of CPU time spent running a virtual CPU
7 importance int (Android) The relative importance of this app, as set by the Android system
8 lru int (Android) An additional ordering within a particular Android importance category
9 num_threads int (Linux) Number of threads in this app
10 otherPrivateDirty int (Android) Amount of dirty memory (i.e. written by this app), in units of kiB
11 priority int (Linux) The process’s priority in terms of CPU scheduling policy.
12 rss int (Linux) The amount of memory (RAM) actually occupied by this app, in units of kiB
13 state char (Linux) The state of the app’s process (Sleeping (S), Running (R), Busy I/O (D), Zombie (Z))
14 stime int (Linux) Amount of CPU time spent in “system-mode” by the app
15 utime int (Linux) Amount of CPU time spent in “user-mode” by the app
16 vsize int (Linux) Amount of virtual memory allocated for the app, in units of bytes
17 cminflt int (Linux) Number of minor page faults of the spawned & waited child process

(Source: Sherlock Dataset Data Field Description, version 2.4.1 by the SherLock team at BGU.) The explanations in the “Meaning” column above are terse; they do not precisely explain everything we need to know about these fields. Those marked with “(Linux)” are tracked and reported by the operating system (Linux OS), those with “(Android)” come from the Android system, and the rest are synthesized from OS or other measurements by the SherLock agent.

For comparison, the “2-apps” dataset introduced earlier, after cleaning and preprocessing, contains the following fields:

CPU_USAGE, cutime, lru, num_threads, otherPrivateDirty, priority, utime, vsize, cminflt.

Advice to Learners and Instructors

We strongly encourage all learners to familiarize themselves with the data by actually doing the data exploration and identifying issues with the data before running the cleaning steps below. However, the complete code for cleaning and preprocessing is given in the “Solution” boxes below, in case it is needed. In any case, the cleaning steps must be executed before proceeding to the preprocessing and modeling steps. While executing the code, please read it and understand which steps were needed to make the data ready for machine learning modeling.

Initial Exploration

Because this is a new dataset, we need to redo all the data wrangling and preparation steps. The steps will be similar to those used for the “2-apps” dataset. However, each dataset requires its own cleaning and preparation procedure, so we cannot blindly apply the same recipe to every dataset. We must explore a new dataset in order to know how to clean and prepare it for use in machine learning. We will first load and explore the new dataset, then identify the necessary preprocessing and cleaning.

df = pd.read_csv("sherlock/sherlock_18apps.csv", index_col=0)

## Summarize the dataset
print("* shape:", df.shape)
print()
print("* info::\n")
df.info()
print()
print("* describe::\n")
print(df.describe().T)
print()

Output (click/tap to reveal)

* shape: (273129, 19)

* info::

<class 'pandas.core.frame.DataFrame'>
Int64Index: 273129 entries, 0 to 999994
Data columns (total 19 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   ApplicationName    273129 non-null  object 
 1   CPU_USAGE          273077 non-null  float64
 2   UidRxBytes         273129 non-null  int64  
 3   UidRxPackets       273129 non-null  int64  
 4   UidTxBytes         273129 non-null  int64  
 5   UidTxPackets       273129 non-null  int64  
 6   cutime             273077 non-null  float64
 7   guest_time         273077 non-null  float64
 8   importance         273129 non-null  int64  
 9   lru                273129 non-null  int64  
 10  num_threads        273077 non-null  float64
 11  otherPrivateDirty  273129 non-null  int64  
 12  priority           273077 non-null  float64
 13  rss                273077 non-null  float64
 14  state              273077 non-null  object 
 15  stime              273077 non-null  float64
 16  utime              273077 non-null  float64
 17  vsize              273077 non-null  float64
 18  cminflt            0 non-null       float64
dtypes: float64(10), int64(7), object(2)
memory usage: 41.7+ MB

* describe::

                      count          mean           std    min           25%  \
CPU_USAGE          273077.0  6.618322e-01  3.207833e+00    0.0  5.000000e-02   
UidRxBytes         273129.0  3.922973e+02  3.693198e+04 -280.0  0.000000e+00   
UidRxPackets       273129.0  4.204643e-01  2.790607e+01  -11.0  0.000000e+00   
UidTxBytes         273129.0  2.454729e+02  2.977305e+04  -60.0  0.000000e+00   
UidTxPackets       273129.0  3.878826e-01  2.420920e+01   -1.0  0.000000e+00   
cutime             273077.0  3.279844e-01  1.768488e+00    0.0  0.000000e+00   
guest_time         273077.0  0.000000e+00  0.000000e+00    0.0  0.000000e+00   
importance         273129.0  3.139921e+02  8.891191e+01  100.0  3.000000e+02   
lru                273129.0  4.712480e+00  6.348188e+00    0.0  0.000000e+00   
num_threads        273077.0  3.928061e+01  2.682408e+01    2.0  1.700000e+01   
otherPrivateDirty  273129.0  1.211232e+04  2.026702e+04    0.0  1.480000e+03   
priority           273077.0  1.975093e+01  1.170649e+00    9.0  2.000000e+01   
rss                273077.0  8.500590e+03  4.942350e+03    0.0  4.894000e+03   
stime              273077.0  1.378527e+03  3.568420e+03    3.0  9.100000e+01   
utime              273077.0  2.509427e+03  5.325113e+03    2.0  1.020000e+02   
vsize              273077.0  2.049264e+09  1.179834e+08    0.0  1.958326e+09   
cminflt                 0.0           NaN           NaN    NaN           NaN   

                            50%           75%           max  
CPU_USAGE          1.300000e-01  3.700000e-01  1.108900e+02  
UidRxBytes         0.000000e+00  0.000000e+00  8.872786e+06  
UidRxPackets       0.000000e+00  0.000000e+00  6.165000e+03  
UidTxBytes         0.000000e+00  0.000000e+00  9.830372e+06  
UidTxPackets       0.000000e+00  0.000000e+00  6.748000e+03  
cutime             0.000000e+00  0.000000e+00  1.100000e+01  
guest_time         0.000000e+00  0.000000e+00  0.000000e+00  
importance         3.000000e+02  4.000000e+02  4.000000e+02  
lru                0.000000e+00  1.100000e+01  1.600000e+01  
num_threads        3.000000e+01  5.500000e+01  1.410000e+02  
otherPrivateDirty  4.308000e+03  1.354800e+04  1.928560e+05  
priority           2.000000e+01  2.000000e+01  2.000000e+01  
rss                6.959000e+03  1.120600e+04  5.466800e+04  
stime              3.450000e+02  1.474000e+03  4.662900e+04  
utime              5.650000e+02  2.636000e+03  4.284500e+04  
vsize              2.026893e+09  2.125877e+09  2.456613e+09  
cminflt                     NaN           NaN           NaN  

Exploring the New “18-apps” Dataset

Please use the standard pandas functions to explore the new dataset (e.g. info(), describe(), head(), tail(), and so on) and answer the following questions:

  1. How many features exist in the original table? Which column contains the label?
  2. From the pandas output in the previous cell, do you see any irregularities in the dataset?
  3. What are the names of the applications contained in this “18-apps” dataset? Do you recognize some of these apps?
  4. What are the frequencies of these apps in the dataset? Are there apps that are overrepresented or underrepresented in the dataset? According to this data, which apps are used most often by this user?

Solution

  1. The original table has 18 features, from CPU_USAGE through cminflt. The ApplicationName column contains the labels.

  2. Several irregularities can be uncovered by carefully looking at the outputs of df.info() and df.describe():

    • cminflt column does not contain any data.
    • guest_time column contains all zeros.
    • Several columns have missing data (52 missing values each): CPU_USAGE, cutime, guest_time, num_threads, priority, rss, state, stime, utime, vsize.
  3. The answers to questions 3 and 4 are provided and discussed below.

Questions 3 and 4 in the challenge box above pertain to the distribution of the classes (i.e. labels) in the dataset. As the table name suggests (sherlock_18apps), there are 18 apps contained in the dataset, and the frequencies of these apps appearing in the table are as follows:

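app_frequencies = df['ApplicationName'].value_counts()
# Print the number of records per app, then the number of distinct apps
print(app_frequencies)
print('Total num of apps = ', len(app_frequencies))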
Google App          60001
Chrome              28046
Facebook            20103
Geo News            19991
Messenger           19989
WhatsApp            19985
Photos              17382
ES File Explorer    16667
Gmail               16417
Calendar             8996
Moovit               8365
Waze                 8237
Hangouts             7608
YouTube              5173
Maps                 5159
Skype                4877
Moriarty             3616
Messages             2517
Name: ApplicationName, dtype: int64
Total num of apps =  18

The frequencies of the apps recorded in the table are representative of how frequently these apps are running. (This stems from the fact that Sherlock’s Applications.csv table contains records of the apps running on the phone, taken periodically at a regular interval of 5 seconds.)

We can infer, although not definitively, that the Google App is the most frequently run app, significantly more than the other apps, followed by Chrome. Social media and messaging apps (Facebook, Messenger, WhatsApp) also appear frequently, as does a news app (Geo News). All this suggests (though does not prove) that this user spent much time on the web, on social media, and on messaging platforms. The user also spent some time on a news site.
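As a rough sketch of how this class imbalance can be quantified, the snippet below rebuilds the frequency table printed above as a pandas Series (the numbers are copied from that output; the variable names are our own) and converts the counts into fractions. The largest fraction is also the accuracy that a trivial classifier would reach by always predicting the most frequent app, which is a useful baseline for the models trained later.

```python
import pandas as pd

# App frequencies copied from the value_counts() output above
app_counts = pd.Series({
    "Google App": 60001, "Chrome": 28046, "Facebook": 20103,
    "Geo News": 19991, "Messenger": 19989, "WhatsApp": 19985,
    "Photos": 17382, "ES File Explorer": 16667, "Gmail": 16417,
    "Calendar": 8996, "Moovit": 8365, "Waze": 8237,
    "Hangouts": 7608, "YouTube": 5173, "Maps": 5159,
    "Skype": 4877, "Moriarty": 3616, "Messages": 2517,
})

# Fraction of all records that belongs to each app
app_shares = app_counts / app_counts.sum()
print(app_shares.round(4).head(3))

# Accuracy of a classifier that always predicts the most frequent app
baseline_accuracy = app_shares.max()
# Ratio between the most frequent and the least frequent app
imbalance_ratio = app_counts.max() / app_counts.min()

print("baseline accuracy: %.3f" % baseline_accuracy)
print("imbalance ratio:   %.1f" % imbalance_ratio)
```

On the real table, df['ApplicationName'].value_counts(normalize=True) gives the same fractions directly.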

Data Cleaning and Preprocessing

This section will guide you to clean and preprocess the “18-apps” data so that it is suitable for neural network modeling.

Follow All the Steps

All the exercises below are mandatory. They constitute all the required steps to make the data ready for machine learning.

Required: Cleaning the “18-apps” Dataset

Let us first clean the data, based on the issues identified in the previous exercise box. Create a new dataframe called df2 which contains the cleaned data.

Hint: Only two pandas statements are required: one to remove bad data and the other to address missing data. As this is very similar to the previous dataset (“2-apps”), please do your best to work this out before looking at the solution.

Solution (Minimal)

The absolute bare minimum cleaning steps for the Sherlock’s “18-apps” data would be like this:

df2 = df.drop(['cminflt', 'guest_time'], axis=1)
df2.dropna(inplace=True)

Solution (Comprehensive)

Verbose code is often helpful, especially when automating the machine learning workflow, which we will do in a later episode of this lesson. The following code segments are examples of self-documenting code that also prints clear messages as it processes the data.

STEP 1: Columns with obviously irrelevant and missing data are removed.

# Missing data or bad data or irrelevant data
del_features_bad = [
    'cminflt', # all-missing feature
    'guest_time', # all-flat feature
]
df2 = df.drop(del_features_bad, axis=1)

print("Cleaning:")
print("- dropped %d columns: %s" % (len(del_features_bad), del_features_bad))
Cleaning:
- dropped 2 columns: ['cminflt', 'guest_time']

STEP 2: Remove rows with missing data.

print("- remaining missing data (per feature):")

isna_counts = df2.isna().sum()
print(isna_counts[isna_counts > 0])
print("- dropping the rest of missing data")

df2.dropna(inplace=True)

print("- remaining shape: %s" % (df2.shape,))
- remaining missing data (per feature):
CPU_USAGE      52
cutime         52
num_threads    52
priority       52
rss            52
state          52
stime          52
utime          52
vsize          52
dtype: int64
- dropping the rest of missing data
- remaining shape: (273077, 17)
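The two dropped columns were found by reading the summary output. As a hedged sketch, the same kinds of columns can also be flagged programmatically. The toy DataFrame below is invented for illustration (it only mimics the defects seen in the 18-apps table), and the helper name find_uninformative_columns is our own, not part of pandas.

```python
import numpy as np
import pandas as pd

def find_uninformative_columns(frame):
    """Return (all_missing, constant) lists of column names."""
    all_missing = [c for c in frame.columns if frame[c].isna().all()]
    # nunique() ignores NaN by default, so a constant column has exactly 1 unique value
    constant = [c for c in frame.columns
                if c not in all_missing and frame[c].nunique() == 1]
    return all_missing, constant

# Toy data (hypothetical values) with the same kinds of defects
toy = pd.DataFrame({
    "CPU_USAGE":  [0.1, 0.5, np.nan, 0.2],
    "guest_time": [0.0, 0.0, np.nan, 0.0],
    "cminflt":    [np.nan] * 4,
})

all_missing, constant = find_uninformative_columns(toy)
print("all-missing:", all_missing)
print("constant:   ", constant)

# Same two cleaning steps as above: drop bad columns, then drop incomplete rows
toy_clean = toy.drop(all_missing + constant, axis=1).dropna()
print("remaining shape:", toy_clean.shape)
```

A check like this is only a starting point; whether a flagged column is truly irrelevant still requires judgment about the data.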

Required: Separating Labels from Features

After the data is cleaned, we must separate the label column from the features. Create two variables named labels and df_features to contain the separated label array and feature matrix, respectively.

Solution

labels = df2['ApplicationName']
df_features = df2.drop('ApplicationName', axis=1)

One-Hot Encoding

In order to properly build and train neural networks for classification tasks, we need to encode the labels using one-hot encoding. This is necessary because most machine learning algorithms must treat both inputs and outputs as numerical values. Labels in classification tasks are variables of categorical type. Categorical variables, however, do not possess any numerical significance. Take the 18 apps in the Sherlock table we just loaded as an example: there is clearly no intrinsic order among these apps. Categorical variables are frequently represented in computers as integers for efficiency, or as text for human convenience. The integers that represent the different classes, however, do not possess any order in the numerical sense, nor can they be operated on mathematically. (We have briefly discussed this in our Big Data lesson, under “Data Wrangling and Visualization”.)

Converting Labels to One-Hot Encoding

One-hot encoding gets around this dilemma by representing each class by a separate integer, which can only be 0 or 1. An 18-class variable will be represented by a vector of 18 integers that are all zeros except for a single 1. Here is an example of such an encoding:

App name One-hot representation
Calendar 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Chrome 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ES File Explorer 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Facebook 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 
WhatsApp 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
Zelle 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1

This representation is reminiscent of the button-and-light interfaces in old electronic devices such as cassette tape players or old-fashioned DVD players! Training a classification NN model, therefore, amounts to training the model to switch on the correct light and switch off the rest.
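To make the encoding concrete, here is a minimal NumPy sketch that builds such vectors by hand. It assumes the 18 app names from the frequency table, listed in alphabetical order; the names app_names, class_index and one_hot are our own.

```python
import numpy as np

# The 18 apps from the frequency table, in alphabetical order
app_names = [
    "Calendar", "Chrome", "ES File Explorer", "Facebook", "Geo News",
    "Gmail", "Google App", "Hangouts", "Maps", "Messages", "Messenger",
    "Moovit", "Moriarty", "Photos", "Skype", "Waze", "WhatsApp", "YouTube",
]
# Map each app name to its position in the one-hot vector
class_index = {name: i for i, name in enumerate(app_names)}

def one_hot(name):
    """Return a vector of zeros with a single 1 at the position of `name`."""
    vec = np.zeros(len(app_names), dtype=int)
    vec[class_index[name]] = 1
    return vec

print("Chrome  ", one_hot("Chrome"))
print("WhatsApp", one_hot("WhatsApp"))
```

In practice we rarely write this by hand, because library functions do the same job for a whole column at once.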

Pandas has a built-in function, get_dummies, that converts non-numerical, non-boolean values into a one-hot representation:

df_labels_onehot = pd.get_dummies(labels)

Each variable of N unique categorical values will be substituted with N columns containing ones and zeroes. Below is a snippet of the labels (df_labels_onehot) after one-hot encoding.

df_labels_onehot.head(5)

Output of one-hot encoding of labels

The table above shows the first five rows of df_labels_onehot. Each row contains a single 1 in the column of the selected category and a 0 in every other column. For example, the first row contains a 1 in the Maps column, with the rest of the columns containing 0. The following two rows contain a 1 in the Gmail column only. Finally, notice that the input Series (labels) was converted into a DataFrame.
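The encoding can also be reversed, which is handy later when turning model outputs back into app names. The sketch below uses a tiny made-up label Series (not the real data). The dtype=int argument is our own choice so that the columns hold integer 0/1 values regardless of the pandas version.

```python
import pandas as pd

# Tiny made-up label column for illustration
toy_labels = pd.Series(["Maps", "Gmail", "Gmail", "Chrome"],
                       name="ApplicationName")

# One column per distinct label; columns come out in sorted order
toy_onehot = pd.get_dummies(toy_labels, dtype=int)
print(toy_onehot)

# Every row must contain exactly one "hot" entry
rows_ok = bool((toy_onehot.sum(axis=1) == 1).all())
print("exactly one 1 per row:", rows_ok)

# idxmax(axis=1) returns, for each row, the column name holding the largest value,
# which recovers the original label
recovered = toy_onehot.idxmax(axis=1)
print(recovered.tolist())
```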

Categorical Features Need One-Hot Encoding, Too!

Similarly, any input features that are of categorical data type must also be encoded using either integer encoding or one-hot encoding.

Which Feature Is Categorical?

There is one categorical variable among the features of the sherlock_18apps table. Can you identify which feature is categorical? (Hint: consider the df.head() or df.info() output.)

Solution

The state feature is a four-class categorical variable. This is evidenced by the data type of state printed by df.info(),

<class 'pandas.core.frame.DataFrame'>
Int64Index: 273129 entries, 0 to 999994
Data columns (total 19 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   ApplicationName    273129 non-null  object 
 1   CPU_USAGE          273077 non-null  float64
 2   UidRxBytes         273129 non-null  int64  
 3   UidRxPackets       273129 non-null  int64  
 4   UidTxBytes         273129 non-null  int64  
 5   UidTxPackets       273129 non-null  int64  
 6   cutime             273077 non-null  float64
 7   guest_time         273077 non-null  float64
 8   importance         273129 non-null  int64  
 9   lru                273129 non-null  int64  
 10  num_threads        273077 non-null  float64
 11  otherPrivateDirty  273129 non-null  int64  
 12  priority           273077 non-null  float64
 13  rss                273077 non-null  float64
 14  state              273077 non-null  object 
 15  stime              273077 non-null  float64
 16  utime              273077 non-null  float64
 17  vsize              273077 non-null  float64
 18  cminflt            0 non-null       float64
dtypes: float64(10), int64(7), object(2)
memory usage: 41.7+ MB

state is the only feature that has an object datatype, which makes it the most likely candidate. All the other features are numerical in type and meaning. Take a peek at the values:

df_features.head(5)

output of df_features.head(5)

Clearly, the state feature is non-numerical. The number of classes in this variable can be discovered by the value_counts() method:

df_features['state'].value_counts()
S    271951
R       995
D       114
Z        17
Name: state, dtype: int64

The state feature indicates the state of the process (i.e. a running program):

  • S stands for “sleeping” (where the process is idle);
  • R indicates that the process is “running”, i.e. using much CPU;
  • D usually means the process is busy waiting for data from/to storage device;
  • Z means the process is already terminated but has not been cleaned up by the operating system.

The categorical columns in the raw table can be converted to one-hot encoding in the same way we converted the labels:

df_features = pd.get_dummies(df_features)
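When get_dummies is given a whole DataFrame, it encodes only the non-numerical columns and leaves numerical columns untouched. The sketch below shows this on a small invented table; the values are hypothetical, and dtype=int is our own choice to get integer 0/1 columns on any pandas version.

```python
import pandas as pd

# Small invented feature table: one numerical and one categorical column
toy_features = pd.DataFrame({
    "num_threads": [12, 30, 55, 17],
    "state":       ["S", "R", "S", "D"],
})

toy_encoded = pd.get_dummies(toy_features, dtype=int)

# The numerical column is kept; 'state' is replaced by one column per value
print(toy_encoded.columns.tolist())
print(toy_encoded)
```

Note that a category that never occurs in the data (Z in this toy table) produces no column, so the set of encoded columns depends on the data being encoded.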

One-Hot Encoding in Scikit-Learn

One-hot encoding of labels is generally not necessary in scikit-learn. Why did we not have to explicitly apply one-hot encoding to the labels there? This is because ML objects such as DecisionTreeClassifier handle the label encoding for us behind the scenes.

One-hot encoding is still necessary for categorical features in scikit-learn. Interested learners may read Encoding of Categorical Variables from the Scikit-learn MOOC by INRIA for an in-depth discussion.


To summarize: With one-hot encoding, the categorical variables (including the label array, which was a vector of strings), are converted to a matrix of ones and zeros to represent input or output categorical values for NN models.

Feature Scaling

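The next step is to scale the features in order to normalize their values. Here we use scikit-learn’s StandardScaler, which shifts and rescales each feature to zero mean and unit variance.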

print("Step: Feature scaling with StandardScaler")

df_features_unscaled = df_features
scaler = preprocessing.StandardScaler()
scaler.fit(df_features_unscaled)

# Recast the features still in a dataframe form
df_features = pd.DataFrame(scaler.transform(df_features_unscaled),
                           columns=df_features_unscaled.columns,
                           index=df_features_unscaled.index)
print("After scaling:")
print(df_features.head(10))
print()
Step: Feature scaling with StandardScaler
After scaling:
    CPU_USAGE  UidRxBytes  UidRxPackets  UidTxBytes  UidTxPackets    cutime  \
0   -0.165792   -0.010623     -0.015068   -0.008245     -0.016022 -0.185461   
6    0.308049   -0.010623     -0.015068   -0.008245     -0.016022 -0.185461   
11  -0.140853   -0.010623     -0.015068   -0.008245     -0.016022 -0.185461   
18  -0.196966   -0.010623     -0.015068   -0.008245     -0.016022 -0.185461   
19  -0.143970   -0.010623     -0.015068   -0.008245     -0.016022 -0.185461   
28  -0.047332   -0.003421      0.092426   -0.001260      0.149189 -0.185461   
29  -0.196966   -0.010623     -0.015068   -0.008245     -0.016022 -0.185461   
32  -0.200083   -0.010623     -0.015068   -0.008245     -0.016022 -0.185461   
35  -0.181379   -0.010623     -0.015068   -0.008245     -0.016022 -0.185461   
39  -0.206318   -0.010623     -0.015068   -0.008245     -0.016022 -0.185461   

    importance       lru  num_threads  otherPrivateDirty  priority       rss  \
0     0.967513  1.621094    -0.271421          -0.188207  0.212762  0.497013   
6     0.967513  1.621094    -0.830621           0.748432  0.212762  2.335413   
11   -0.157189 -0.742150     1.219779          -0.039008  0.212762  1.090659   
18   -0.157189 -0.742150    -0.942461          -0.555284  0.212762 -0.567867   
19   -0.157189 -0.742150     1.406179          -0.312737  0.212762  0.002106   
28   -0.157189 -0.742150     2.039939           0.680739  0.212762  1.858918   
29    0.967513  1.621094    -0.532381          -0.336024  0.212762 -0.112617   
32   -0.157189 -0.742150    -1.054301          -0.593373  0.212762 -0.918915   
35    0.967513  1.305995    -0.718781          -0.435688  0.212762 -0.369580   
39   -1.281891 -0.742150    -1.091581          -0.570282  0.212762 -1.120843   

       stime     utime     vsize   state_D   state_R   state_S  state_Z  
0  -0.335871 -0.412842  0.127145 -0.020436 -0.060473  0.064346 -0.00789  
6  -0.367538 -0.431809 -0.010889 -0.020436 -0.060473  0.064346 -0.00789  
11 -0.275620 -0.369463  0.487610 -0.020436 -0.060473  0.064346 -0.00789  
18 -0.374264 -0.463921 -1.320928 -0.020436 -0.060473  0.064346 -0.00789  
19 -0.273939 -0.384110  1.316752 -0.020436 -0.060473  0.064346 -0.00789  
28 -0.108319 -0.140735  1.898536 -0.020436 -0.060473  0.064346 -0.00789  
29 -0.379588 -0.463921 -1.018926 -0.020436 -0.060473  0.064346 -0.00789  
32 -0.381830 -0.461104 -1.064197 -0.020436 -0.060473  0.064346 -0.00789  
35 -0.352405 -0.439696 -0.751398 -0.020436 -0.060473  0.064346 -0.00789  
39 -0.343438 -0.459038 -1.089401 -0.020436 -0.060473  0.064346 -0.00789  
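For reference, StandardScaler with its default settings replaces each feature value x with (x - mean) / std, where the mean and the population standard deviation are computed per column. The NumPy sketch below applies that formula to a small made-up array. It is meant only to illustrate the transformation, not to replace the scikit-learn code above.

```python
import numpy as np

# Made-up feature matrix: 4 samples, 2 features on very different scales
toy = np.array([[1.0, 100.0],
                [2.0, 300.0],
                [3.0, 500.0],
                [6.0, 700.0]])

col_mean = toy.mean(axis=0)
col_std = toy.std(axis=0)        # population standard deviation (ddof=0)
toy_scaled = (toy - col_mean) / col_std

print(toy_scaled.round(3))
# After scaling, every column has mean 0 and standard deviation 1
scaled_means = toy_scaled.mean(axis=0)
scaled_stds = toy_scaled.std(axis=0)
print("means close to 0:", np.allclose(scaled_means, 0.0))
print("stds close to 1: ", np.allclose(scaled_stds, 1.0))
```

The same formula explains why the one-hot columns such as state_S no longer contain plain 0 and 1 values in the scaled output above.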

Splitting to Training and Validation Datasets

As the final step, we split the original dataset into training and validation sets:

test_size = 0.2
random_state = np.random.randint(1000000)

print("Step: Train-test split  test_size=%s  random_state=%s" \
      % (test_size, random_state))

train_features, test_features, train_L_onehot, test_L_onehot = \
    train_test_split(df_features, df_labels_onehot,
                     test_size=test_size, random_state=random_state)

print("- training dataset: %d records" % (len(train_features),))
print("- testing dataset:  %d records" % (len(test_features),))
print("Now the data is ready for machine learning!")
sys.stdout.flush()
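The code above prints the randomly chosen random_state so that the same split can be reproduced later. A minimal sketch of that idea, using a tiny made-up array and an arbitrary seed value standing in for the printed one:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Tiny made-up dataset: 10 samples, 2 features
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

seed = 12345   # stands in for the random_state value printed by an earlier run

# Two splits with the same random_state select exactly the same rows
X_tr1, X_te1, y_tr1, y_te1 = train_test_split(
    X, y, test_size=0.2, random_state=seed)
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X, y, test_size=0.2, random_state=seed)

print("train/test sizes:", len(X_tr1), len(X_te1))
same_split = np.array_equal(y_te1, y_te2) and np.array_equal(y_tr1, y_tr2)
print("same split:", same_split)
```

Recording the seed this way keeps each run different by default while still allowing any particular run to be repeated.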

Training and Validating Neural Network Models

Let us train and validate a couple of neural network models and observe how well they classify the various running apps.

Model with No Hidden Layers

We will begin by defining the simplest neural network model, which has no hidden layers:

model = Sequential([
    Dense(18, activation='softmax', input_shape=(19,),
          kernel_initializer='random_normal')
])

This code construct is similar to that introduced in the previous episode to create a single-neuron model. A thorough explanation of this code has been given in the previous episode, so we only recap the most important points and make mention of the additional options.

It is often helpful to define a function that prepares a neural network model ready for training, like this one for a model without a hidden layer:

def NN_Model_no_hidden(learning_rate):
    """Definition of deep learning model with no hidden layer"""
    # (optional if these were already imported earlier)
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Dense
    from tensorflow.keras.optimizers import Adam

    model = Sequential([
        Dense(18, activation='softmax', input_shape=(19,),
              kernel_initializer='random_normal')
    ])
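    # `learning_rate` replaces the older `lr` argument, which recent
    # Keras versions no longer accept
    adam = Adam(learning_rate=learning_rate, beta_1=0.9, beta_2=0.999,
                amsgrad=False)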
    model.compile(optimizer=adam,
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model

In this function, a single dense layer is defined. The Adam optimizer is created, and the model is “compiled” by combining the layer definition, the optimizer, and the loss function. The returned model is ready for training.
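To see what this one-layer model computes, here is a NumPy-only sketch of its forward pass and loss. It is not how Keras is implemented internally. The random inputs, the weight scale of 0.05 and all variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_classes = 19, 18

# Weights and biases of the single Dense layer (random, untrained)
W = rng.normal(scale=0.05, size=(n_features, n_classes))
b = np.zeros(n_classes)

def softmax(z):
    """Row-wise softmax; subtracting the row maximum avoids overflow."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def categorical_crossentropy(y_true, y_prob):
    """Mean over samples of -log(probability assigned to the true class)."""
    return -np.mean(np.sum(y_true * np.log(y_prob), axis=1))

# A fake batch of 4 scaled feature vectors with one-hot labels
X = rng.normal(size=(4, n_features))
y_true = np.eye(n_classes)[[0, 5, 5, 17]]

probs = softmax(X @ W + b)
loss = categorical_crossentropy(y_true, probs)
n_params = W.size + b.size

print("output shape:", probs.shape)
print("rows sum to 1:", np.allclose(probs.sum(axis=1), 1.0))
print("loss: %.3f  (ln 18 = %.3f)" % (loss, np.log(n_classes)))
print("trainable parameters:", n_params)
```

With small random weights the predicted probabilities are nearly uniform, so the loss starts close to ln 18; training lowers it by adjusting W and b.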

Training and Validation: No Hidden Layer

Next, we call this function to construct the model, then train it. A learning rate of 0.0003 is used along with 5 epochs and a batch size of 32. In the following episode, we will run experiments and vary these parameters.

model_0 = NN_Model_no_hidden(0.0003)
model_0_history = model_0.fit(train_features,
            train_L_onehot,
            epochs=5, batch_size=32,
            validation_data=(test_features, test_L_onehot),
            verbose=2)
Epoch 1/5
 - 7s - loss: 1.6841 - acc: 0.5622 - val_loss: 1.2613 - val_acc: 0.7086
Epoch 2/5
 - 7s - loss: 1.1109 - acc: 0.7380 - val_loss: 1.0026 - val_acc: 0.7739
Epoch 3/5
 - 7s - loss: 0.9253 - acc: 0.7854 - val_loss: 0.8670 - val_acc: 0.7985
Epoch 4/5
 - 7s - loss: 0.8160 - acc: 0.8050 - val_loss: 0.7785 - val_acc: 0.8109
Epoch 5/5
 - 7s - loss: 0.7409 - acc: 0.8188 - val_loss: 0.7143 - val_acc: 0.8208

Note how the values change after each epoch and note the timing of each epoch. Specifically, observe that the validation accuracy (val_acc:) starts at around 71% and increases with every epoch, by progressively smaller amounts, until it reaches 82% at the end of the final epoch.
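To quantify how the validation accuracy evolves, the short sketch below computes the change between consecutive epochs from the val_acc values in the log above (copied by hand; your own run will give slightly different numbers). This makes it easy to see whether the improvement is steady or tapering off.

```python
import numpy as np

# val_acc values copied from the training log above (epochs 1 to 5)
val_acc = np.array([0.7086, 0.7739, 0.7985, 0.8109, 0.8208])

# Gain in validation accuracy relative to the previous epoch
gains = np.diff(val_acc)
for epoch, gain in enumerate(gains, start=2):
    print("epoch %d: %+.4f" % (epoch, gain))
```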

Visualizing Model Training

To better analyze the training process, we would like to visualize the model training history. In Keras, the fit() method returns a History object whose history attribute is a dictionary holding the loss and metric values recorded at each epoch. We use it to create two charts:

  • A plot of accuracy on the training and validation datasets over training epochs.
  • A plot of loss on the training and validation datasets over training epochs.

model_0_history.history.keys()
dict_keys(['loss', 'acc', 'val_loss', 'val_acc'])

def plot_loss(model_history):
    # summarize history for loss
    plt.plot(model_history.history['loss'])
    plt.plot(model_history.history['val_loss'])
    plt.title('Model Loss')
    plt.ylabel('loss')
    plt.xlabel('epoch')
    plt.legend(['train', 'test'], loc='upper right')
    plt.show()

def plot_acc(model_history):
    # summarize history for accuracy
    # (older Keras versions name the key 'acc'; newer ones use 'accuracy')
    hist = model_history.history
    acc_key = 'acc' if 'acc' in hist else 'accuracy'
    plt.plot(hist[acc_key])
    plt.plot(hist['val_' + acc_key])
    plt.title('Model Accuracy')
    plt.ylabel('accuracy')
    plt.xlabel('epoch')
    plt.legend(['train', 'test'], loc='upper left')
    plt.show()

plot_loss(model_0_history)
plot_acc(model_0_history)

Graph of Loss per Epoch

Graph of Accuracy per Epoch

Model with One Hidden Layer

Now we look at an example which builds a simple neural network model with only one hidden layer. It is the same process as it was when using no hidden layer, except that another layer is added between the input and output layer.

model = Sequential([
            Dense(hidden_neurons, input_shape=(19,), activation='relu',
                  kernel_initializer='random_normal'),
            Dense(18, activation='softmax',
                  kernel_initializer='random_normal')
        ])

Two dense layers are defined here. The first dense layer is the hidden layer, which takes its inputs directly from the input data (after one-hot encoding where needed). The second layer connects to, and takes its inputs from, the first layer. We follow the standard practice of using the relu activation function for the hidden layer and softmax for the output layer.

Below is a graphical depiction of the neural network model with hidden_neurons = 24. The input layer is shown in yellow and contains 19 neurons. The green-colored layer in the middle is the hidden layer, and the red-colored layer is the output layer with 18 neurons.

A simple NN model with 24 hidden neurons
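As a quick sanity check on the size of such a network, the sketch below counts its trainable parameters. A dense layer has one weight per input-output connection plus one bias per output neuron. The helper names are our own.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus biases (n_units) of one dense layer."""
    return n_inputs * n_units + n_units

def one_hidden_layer_params(hidden_neurons, n_features=19, n_classes=18):
    """Total trainable parameters: hidden layer plus output layer."""
    return (dense_params(n_features, hidden_neurons)
            + dense_params(hidden_neurons, n_classes))

for h in (8, 24):
    print("hidden_neurons = %2d -> %d parameters"
          % (h, one_hidden_layer_params(h)))
```

For a model built with Keras, model.summary() should report the same totals.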

def NN_Model(hidden_neurons,learning_rate):
    """Definition of deep learning model with one dense hidden layer"""
    # (optional if these were already imported earlier)
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Dense
    from tensorflow.keras.optimizers import Adam
    # define the network
    model = Sequential([
        Dense(hidden_neurons, activation='relu',
              input_shape=(19,),
              kernel_initializer='random_normal'),
        Dense(18, activation='softmax',
              kernel_initializer='random_normal')
    ])
    # define the optimization algorithm
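    # (Adam was imported directly above; `learning_rate` replaces the
    # older `lr` argument, which recent Keras versions no longer accept)
    adam = Adam(learning_rate=learning_rate,
                beta_1=0.9, beta_2=0.999,
                amsgrad=False)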
    model.compile(optimizer=adam,
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model

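This function takes two parameters:

  • hidden_neurons: the number of neurons in the hidden layer;

  • learning_rate: the learning rate used by the Adam optimizer.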

Training and Validation: Model with One Hidden Layer

Now, let’s create a practical model (saving it to a Python variable named model_1) and train it! During the training process, we measure the model accuracy using two sets of data: the training data themselves, and the validation (test) data. We use hidden_neurons = 8, learning_rate = 0.0003 as our baseline:

model_1 = NN_Model(8, 0.0003)
train1 = model_1.fit(train_features,
                     train_L_onehot,
                     epochs=10, batch_size=32,
                     validation_data=(test_features, test_L_onehot),
                     verbose=2)

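epochs is a hyperparameter that is set before training a model. One epoch is one complete pass of the entire training dataset through the neural network, both forward and backward. The whole dataset is usually too large to feed to the network at once, so it is divided into smaller batches. batch_size sets the number of samples in each batch (32 in this example), and the network weights are updated after each batch.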
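The relation between epochs and batches can be worked out with a few lines of arithmetic. This is only a sketch: it uses the training-set size reported in the training log (218461 samples) together with the batch_size and epochs values passed to model.fit.

```python
import math

n_train = 218461   # number of training samples (from the training log)
batch_size = 32    # value passed to model.fit
epochs = 10        # value passed to model.fit

# Each epoch is split into batches; the last batch may be smaller
batches_per_epoch = math.ceil(n_train / batch_size)
last_batch_size = n_train - (batches_per_epoch - 1) * batch_size

# The weights are updated once per batch
total_updates = batches_per_epoch * epochs

print("batches per epoch:", batches_per_epoch)
print("samples in last batch:", last_batch_size)
print("weight updates in total:", total_updates)
```

A larger batch_size means fewer weight updates per epoch, which is one reason why batch_size and learning_rate are often tuned together.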

Train on 218461 samples, validate on 54616 samples
Epoch 1/10
 - 7s - loss: 1.4324 - acc: 0.5323 - val_loss: 0.8332 - val_acc: 0.7153
Epoch 2/10
 - 7s - loss: 0.6667 - acc: 0.8175 - val_loss: 0.5515 - val_acc: 0.8768
Epoch 3/10
 - 7s - loss: 0.4765 - acc: 0.8998 - val_loss: 0.4242 - val_acc: 0.9138
Epoch 4/10
 - 7s - loss: 0.3849 - acc: 0.9164 - val_loss: 0.3579 - val_acc: 0.9196
Epoch 5/10
 - 7s - loss: 0.3355 - acc: 0.9233 - val_loss: 0.3192 - val_acc: 0.9265
Epoch 6/10
 - 7s - loss: 0.3032 - acc: 0.9273 - val_loss: 0.2911 - val_acc: 0.9295
Epoch 7/10
 - 7s - loss: 0.2788 - acc: 0.9305 - val_loss: 0.2690 - val_acc: 0.9327
Epoch 8/10
 - 7s - loss: 0.2589 - acc: 0.9343 - val_loss: 0.2504 - val_acc: 0.9367
Epoch 9/10
 - 7s - loss: 0.2424 - acc: 0.9391 - val_loss: 0.2364 - val_acc: 0.9423
Epoch 10/10
 - 7s - loss: 0.2295 - acc: 0.9434 - val_loss: 0.2243 - val_acc: 0.9438

The training process above runs for 10 epochs. At the end of each epoch, three kinds of information are printed: the time taken to complete the epoch (about 7 seconds), the value of the loss function (loss:), and the accuracy (acc:) of the predictions. The accuracy is the fraction of training samples that are correctly classified by the network at that point in the training. As we see, the accuracy increases as the training goes through more epochs. Finally, val_loss and val_acc are the loss and accuracy computed on the validation data, which we set aside earlier for this purpose. Your loss and accuracy numbers will not be identical to those printed above, because the initial weights of the network are set randomly and differ from run to run; however, the numbers should be very close.

We get a fairly good accuracy with this model (about 94% on the validation data after 10 epochs), but it is still several percentage points short of 100%.

Plotting Training Progress

Note that we saved the return value of model.fit to a variable (e.g. train1, train2, …). This variable stores the training history. For example, train1.history is a dictionary that contains the loss and accuracy values printed at each epoch during the training.

Exercise: Make two plots—one for the loss and the other for accuracy values (both values from training and validation data) to show how the model improves as a result of the training.

Solution

train_acc = train1.history['acc']
val_acc = train1.history['val_acc']
train_loss = train1.history['loss']
val_loss = train1.history['val_loss']
epoch_counter = range(1, len(train_acc)+1)  # 1, 2, 3, ... epochs

plt.plot(epoch_counter, train_acc, label="train")
plt.plot(epoch_counter, val_acc, label="val")
plt.xlabel("Epoch")
plt.ylabel("Accuracy")
# plt.ylim(bottom=0.9, top=1.0)  # use this to adjust the y values displayed on the plot
plt.legend()

Training Progress: Accuracy

plt.plot(epoch_counter, train_loss, label="train")
plt.plot(epoch_counter, val_loss, label="val")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.legend()

Training Progress: Loss function
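The key names in the history dictionary depend on the library version: the log above uses acc and val_acc, whereas newer TensorFlow releases record the same quantities as accuracy and val_accuracy. The sketch below shows one way to write the lookup so that it works with either naming. The helper get_history_series is hypothetical, and the dictionary is a small stand-in for train1.history filled with the first three epochs of the log above.

```python
def get_history_series(history, *candidates):
    """Return the first series found in `history` under any candidate key."""
    for key in candidates:
        if key in history:
            return history[key]
    raise KeyError("none of %r found; available keys: %r"
                   % (candidates, sorted(history)))

# Stand-in for train1.history, using the newer key names
history = {
    'loss':         [1.4324, 0.6667, 0.4765],
    'accuracy':     [0.5323, 0.8175, 0.8998],
    'val_loss':     [0.8332, 0.5515, 0.4242],
    'val_accuracy': [0.7153, 0.8768, 0.9138],
}

train_acc = get_history_series(history, 'acc', 'accuracy')
val_acc = get_history_series(history, 'val_acc', 'val_accuracy')
epoch_counter = range(1, len(train_acc) + 1)

for epoch, ta, va in zip(epoch_counter, train_acc, val_acc):
    print(epoch, ta, va)
```

With a real training run, passing train1.history in place of the stand-in dictionary gives the series needed for the plots above.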

Limit of Accuracy

What will be the ultimate accuracy of the network defined in model_1 above, if we can train longer?

Saving and Loading Model

At the end of the training, the model can be saved to disk for later usage:

from tensorflow.keras.models import save_model, load_model
model_name = "deeplearning_1"
# save the trained model (architecture and weights) to a single HDF5 file
save_model(model_1, model_name + ".h5")

Loading is just as easy:

model_reloaded = load_model(model_name + ".h5")

Traditional Machine Learning

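Now we will compare our deep learning model to the traditional machine learning algorithms covered in the previous session. Here we test a decision tree and logistic regression. To simplify the code, we use the model_evaluate function below to evaluate the performance of a model. As written, it works directly with scikit-learn classifiers, whose predict method returns class labels. The predict method of a KERAS model returns class probabilities instead, so these must be converted to class labels before accuracy_score and confusion_matrix can be applied.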

def model_evaluate(model,test_F,test_L):
    test_L_pred = model.predict(test_F)
    print("Evaluation by using model:",type(model).__name__)
    print("accuracy_score:",accuracy_score(test_L, test_L_pred))
    print("confusion_matrix:","\n",confusion_matrix(test_L, test_L_pred))
    return
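For a neural network model, the output of predict is an array of class probabilities with one column per class, so it cannot be passed to accuracy_score as it is. A minimal sketch of the conversion is given below, using a toy array with 3 classes in place of real predictions. The helper names probs_to_labels and simple_accuracy are illustrative, and the sketch assumes that the columns of the one-hot labels and of the predictions follow the same class order.

```python
import numpy as np

def probs_to_labels(probs):
    """Pick the class index with the highest probability in each row."""
    return np.argmax(probs, axis=1)

def simple_accuracy(true_labels, pred_labels):
    """Fraction of samples whose predicted label equals the true label."""
    return float(np.mean(np.asarray(true_labels) == np.asarray(pred_labels)))

# Toy stand-in for model.predict(test_features): 4 samples, 3 classes
pred_probs = np.array([[0.7, 0.2, 0.1],
                       [0.1, 0.8, 0.1],
                       [0.3, 0.3, 0.4],
                       [0.6, 0.3, 0.1]])

# Toy stand-in for the one-hot encoded true labels
true_onehot = np.array([[1, 0, 0],
                        [0, 1, 0],
                        [0, 0, 1],
                        [0, 1, 0]])

pred_labels = probs_to_labels(pred_probs)
true_labels = probs_to_labels(true_onehot)  # argmax also decodes one-hot rows

print("predicted:", pred_labels)
print("true:     ", true_labels)
print("accuracy: ", simple_accuracy(true_labels, pred_labels))
```

Once both arrays hold class indices, they can also be passed to accuracy_score and confusion_matrix in the same way as the scikit-learn predictions.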

Decision Tree

ML_dtc = DecisionTreeClassifier(criterion='entropy',
                                max_depth=6,
                                min_samples_split=8)
%time ML_dtc.fit(train_features, train_labels)
CPU times: user 897 ms, sys: 8.37 ms, total: 906 ms
Wall time: 906 ms

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='entropy',
                       max_depth=6, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=8,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='best')
model_evaluate(ML_dtc, test_features, test_labels)
Evaluation by using model: DecisionTreeClassifier
accuracy_score: 0.9497216932766954
confusion_matrix: 
 [[ 1829     1     0     0     0     0     0     0     0     0     0    18     0     0     0     0     1     0]
 [    0  5477     0     0     0    69     0     0     0     0     0     0     0     5     0     0     2     0]
 [    1   610  2753     0     0    25     0     5     0     1     1     1     0     2     0     0     0     0]
 [    0     0     0  4029     0     0    15     0     0     0     0     0     0     0     0     0     0    10]
 [    0     0     0     0  4006     0     0     0     0     0     0     0     0     0     0     0     0     0]
 [   64    28     0     0     0  3183     1     0     0     0     0     1     0    49     0     0     0     0]
 [    0   143     0     0     0     2 10459     0     0     0    15     0     0     0     0     0  1369     0]
 [    0    58     0     0     0    24     4  1408     0     1     0     0     0     1     0     0    11     0]
 [    3    39     0     0     0     0     1     0   935     0     0     0     0     0     1     0     4     0]
 [    0     0     0     0     0     0     0     0     1   486     0     0     0     8     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0  4016     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0     0  1697     0     0     0     0     0     0]
 [    0    13     0     0     4     0     0     0     0     0     0     0   680     1     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0     0     6     0  3473     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0     0     0     0     0  1003     0     0     0]
 [    0     0     0     0     3     0     0     0     0     0     0     0     0     0     0  1642     0     0]
 [    0     4     0     0     0     0     4     0     0     0     0     0     0     0     0     0  3897     0]
 [    0     0     0     0     0   116     0     0     0     0     0     0     0     0     0     0     0   897]]

Logistic Regression

ML_log = LogisticRegression(solver='lbfgs')
%time ML_log.fit(train_features, train_labels)
CPU times: user 20.6 s, sys: 2.48 s, total: 23 s
Wall time: 23.1 s

/shared/apps/auto/py-scikit-learn/0.22.2.post1-gcc-7.3.0-wpia/lib/python3.7/site-packages/sklearn/linear_model/_logistic.py:940: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
model_evaluate(ML_log, test_features, test_labels)
Evaluation by using model: LogisticRegression
accuracy_score: 0.9197854108686099
confusion_matrix: 
 [[ 1387     3    63     0     0   319     0     0     0     0     0     0     0    72     5     0     0     0]
 [    0  4590   390     0     0    37    77    10     6    64     0     0     0    72    31   273     3     0]
 [   60   271  2817     0     0     7    13     4     0     0     0     0     0    85   141     1     0     0]
 [    0     1     0  4021     0     2    11     4     0     0     5     0     0     0     0     2     0     8]
 [    0     0     0     0  3999     0     0     0     0     7     0     0     0     0     0     0     0     0]
 [   47    39    14     0     0  3189    24    10     1     0     0     0     0     2     0     0     0     0]
 [    7    93     0    51     0    19 11628     8     0     0    29    58     0     0     0     0    93     2]
 [    0    28     0     2     0    33     1  1442     0     0     0     1     0     0     0     0     0     0]
 [  147    27   673     0     0     1     0     0   113     0     0     0     0     4     7     0    11     0]
 [    0     0     0     0     0     0     0     0     0   433     0     0     0    24     0    38     0     0]
 [    0     0     0     0     0     0     0     0     0     0  4016     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     3     0     0     0     0  1642     0    52     0     0     0     0]
 [    0     1     0     0     0     0     0     1     0     4     0     0   692     0     0     0     0     0]
 [   17     2   239     0     0     0     0     0    17    31     0     0     0  3080    80    13     0     0]
 [   99     5   172     0     0    45     0     0     7     0     0     0     0     2   673     0     0     0]
 [    0     3     0     2     0     0     0     0     0     0     0     0     0     6     0  1634     0     0]
 [    0     0     0     0     0     0    33     0     0     0     0     0     0     0     0     0  3872     0]
 [    0     0     0     0     0     0     0     0     0     0     0     0     0     0     6     0     0  1007]]

By now, we have good background knowledge of this dataset, and we know the accuracy scores obtained with the decision tree (about 95%) and logistic regression (about 92%) methods. These scores are reasonably good, but still several percentage points away from 99%. The decision tree performs about as well as our neural network with one hidden layer (about 94% validation accuracy after 10 epochs). For neural networks to be worthwhile here, we need to find ways to build models with higher accuracy. We will explore these ways later.

Further improvement

Besides tuning the two hyperparameters above (hidden_neurons and learning_rate), there are many other things we can try with a neural network. For example:

  1. Using activation function sigmoid instead of relu
  2. Using 1000 epochs instead of 10 epochs
  3. Adding more hidden layers
  4. Using different optimizer algorithms

Overall, the trends of the training and validation loss and accuracy should be monitored, and the relevant hyperparameters should be adjusted based on the results.
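As an illustration of two of the ideas listed above (a different activation function and more hidden layers), the one-hidden-layer function can be generalized as sketched below. This is only one possible design, not the lesson's own code: the function name NN_Model_deep and its hidden_layers and activation parameters are hypothetical, and the layer sizes in the usage line are arbitrary.

```python
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam

def NN_Model_deep(hidden_layers, learning_rate, activation='relu'):
    """Sketch: dense network with one hidden layer per entry of hidden_layers."""
    model = Sequential()
    model.add(keras.Input(shape=(19,)))       # 19 input features
    for n_neurons in hidden_layers:
        model.add(Dense(n_neurons, activation=activation,
                        kernel_initializer='random_normal'))
    model.add(Dense(18, activation='softmax',  # 18 output classes
                    kernel_initializer='random_normal'))
    model.compile(optimizer=Adam(learning_rate=learning_rate),
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model

# Example: two hidden layers of 18 neurons each, same learning rate as before
model_2 = NN_Model_deep([18, 18], 0.0003)
model_2.summary()
```

Such a model can be trained with model_2.fit in the same way as model_1, so that the training and validation curves of the two models can be compared directly.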

Key Points

  • With KERAS, we can easily build a network by defining its layers and connecting them together.