Classifying Smartphone Apps with Keras
Overview
Teaching: 20 min
Exercises: 40 min
Questions
How do we build a neural network with Keras to perform multi-class classification?
How do we monitor the progress of neural-network training?
How do we set up appropriate hyperparameters?
Objectives
Understanding how to build a general neural network with Keras.
Understanding how to tune the hyperparameters based on the results.
Prior to this episode, we focused on an extremely simple problem:
a binary classification task using the sherlock_2apps
(“2-apps”) data.
Certainly, this is a very basic problem;
real-world problems have much richer complexities.
Beginning from this episode, we will use a richer subset of Sherlock’s
Applications.csv
to build a classifier to distinguish nearly 20 smartphone apps.
As in the previous episode, we will build simple neural networks with Keras
(using all the essential building blocks of a neural network introduced in that
episode) and quantify their performance metrics. As we progress, we will
introduce additional techniques that are helpful in practical ML modeling.
Along the way, we will be guided to answer the following questions:
-
How far can we push the accuracy of a neural network model for smartphone app classification?
-
As a bonus exercise, what will the accuracies be if we build decision tree and logistic regression models to perform the same classification task?
Loading Required Python Libraries and Objects
Please make sure that the necessary libraries and objects are loaded into your environment:
import os
import sys
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Tools for machine learning:
import sklearn
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
# for evaluating model performance
from sklearn.metrics import accuracy_score, confusion_matrix
# classic machine learning models
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Tools for deep learning:
import tensorflow as tf
import tensorflow.keras as keras
# Import key Keras objects
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam
These are the same imports used at the beginning of the previous episode on binary classification. We recommend that you start with a fresh Python session in this episode to avoid any confusion caused by using a different dataset.
The sherlock_18apps
Dataset
The dataset used for this episode is the “18-apps” dataset, a significantly more diverse subset of the SherLock dataset that includes nearly 20 apps. Not only are there more classes (apps), but the dataset also contains more features. As with the 2-apps counterpart, the rows stored in this table were generated by periodically measuring the resource (CPU, memory, network, input/output) utilization stats for the individual apps. The table below presents the features of this dataset and offers a brief explanation of each:
No | Feature | Data Type | Meaning |
---|---|---|---|
0 | CPU_USAGE | float | Instantaneous percent utilization of CPU |
1 | UidRxBytes | int | Number of bytes received by this app via network |
2 | UidRxPackets | int | Number of network packets received by this app |
3 | UidTxBytes | int | Number of bytes transmitted (sent) by this app via network |
4 | UidTxPackets | int | Number of network packets transmitted by this app |
5 | cutime | int | (Linux) Amount of CPU time spent in “user-mode” by the spawned & waited on child process |
6 | guest_time | int | (Linux) Amount of CPU time spent running a virtual CPU |
7 | importance | int | (Android) The relative importance of this app, as set by the Android system |
8 | lru | int | (Android) An additional ordering within a particular Android importance category |
9 | num_threads | int | (Linux) Number of threads in this app |
10 | otherPrivateDirty | int | (Android) Amount of dirty memory (i.e. written by this app), in units of kiB |
11 | priority | int | (Linux) The process’s priority in terms of CPU scheduling policy |
12 | rss | int | (Linux) The amount of memory (RAM) actually occupied by this app, in units of kiB |
13 | state | char | (Linux) The state of the app’s process (Sleeping, Running, Busy I/O (D), Zombie) |
14 | stime | int | (Linux) Amount of CPU time spent in “system-mode” by the app |
15 | utime | int | (Linux) Amount of CPU time spent in “user-mode” by the app |
16 | vsize | int | (Linux) Amount of virtual memory allocated for the app, in units of bytes |
17 | cminflt | int | (Linux) Number of minor page faults of the spawned & waited on child process |
(Source: Sherlock Dataset Data Field Description, version 2.4.1 by the SherLock team at BGU.) The explanations in the “Meaning” column above are terse; they do not precisely explain everything we need to know about these fields. Those marked with “(Linux)” are tracked and reported by the operating system (Linux OS), those marked with “(Android)” come from the Android system, and the rest are synthesized from OS or other measurements by the SherLock agent.
For comparison, the “2-apps” dataset introduced earlier, after cleaning and preprocessing, contains the following fields:
CPU_USAGE, cutime, lru, num_threads, otherPrivateDirty, priority, utime, vsize, cminflt.
Advice to Learners and Instructors
We strongly encourage all learners to familiarize themselves with the data by actually doing the data exploration and identifying issues with the data before running the cleaning steps below. However, the complete code for cleaning and preprocessing is given in the “Solution” boxes below, in case it is needed. In any case, the cleaning steps must be executed before proceeding to the preprocessing and modeling steps. While executing the code, please read it and understand what steps were needed to make the data ready for machine learning modeling.
Initial Exploration
Because this is a new dataset, we will need to reconstruct the entire data wrangling and preparation steps. The steps will be similar to those used for the “2-apps” dataset, but each dataset requires its own cleaning and preparation procedure, so we cannot blindly apply the same recipe to every dataset. An exploration of a new dataset is required in order to know how to clean and prepare the dataset for use in machine learning. We will first load and explore the new dataset, then identify the necessary preprocessing and cleaning.
df = pd.read_csv("sherlock/sherlock_18apps.csv", index_col=0)
## Summarize the dataset
print("* shape:", df.shape)
print()
print("* info::\n")
df.info()
print()
print("* describe::\n")
print(df.describe().T)
print()
Output
* shape: (273129, 19)

* info::

<class 'pandas.core.frame.DataFrame'>
Int64Index: 273129 entries, 0 to 999994
Data columns (total 19 columns):
 #   Column             Non-Null Count   Dtype
---  ------             --------------   -----
 0   ApplicationName    273129 non-null  object
 1   CPU_USAGE          273077 non-null  float64
 2   UidRxBytes         273129 non-null  int64
 3   UidRxPackets       273129 non-null  int64
 4   UidTxBytes         273129 non-null  int64
 5   UidTxPackets       273129 non-null  int64
 6   cutime             273077 non-null  float64
 7   guest_time         273077 non-null  float64
 8   importance         273129 non-null  int64
 9   lru                273129 non-null  int64
 10  num_threads        273077 non-null  float64
 11  otherPrivateDirty  273129 non-null  int64
 12  priority           273077 non-null  float64
 13  rss                273077 non-null  float64
 14  state              273077 non-null  object
 15  stime              273077 non-null  float64
 16  utime              273077 non-null  float64
 17  vsize              273077 non-null  float64
 18  cminflt            0 non-null       float64
dtypes: float64(10), int64(7), object(2)
memory usage: 41.7+ MB

* describe::

                      count          mean           std    min           25%  \
CPU_USAGE          273077.0  6.618322e-01  3.207833e+00    0.0  5.000000e-02
UidRxBytes         273129.0  3.922973e+02  3.693198e+04 -280.0  0.000000e+00
UidRxPackets       273129.0  4.204643e-01  2.790607e+01  -11.0  0.000000e+00
UidTxBytes         273129.0  2.454729e+02  2.977305e+04  -60.0  0.000000e+00
UidTxPackets       273129.0  3.878826e-01  2.420920e+01   -1.0  0.000000e+00
cutime             273077.0  3.279844e-01  1.768488e+00    0.0  0.000000e+00
guest_time         273077.0  0.000000e+00  0.000000e+00    0.0  0.000000e+00
importance         273129.0  3.139921e+02  8.891191e+01  100.0  3.000000e+02
lru                273129.0  4.712480e+00  6.348188e+00    0.0  0.000000e+00
num_threads        273077.0  3.928061e+01  2.682408e+01    2.0  1.700000e+01
otherPrivateDirty  273129.0  1.211232e+04  2.026702e+04    0.0  1.480000e+03
priority           273077.0  1.975093e+01  1.170649e+00    9.0  2.000000e+01
rss                273077.0  8.500590e+03  4.942350e+03    0.0  4.894000e+03
stime              273077.0  1.378527e+03  3.568420e+03    3.0  9.100000e+01
utime              273077.0  2.509427e+03  5.325113e+03    2.0  1.020000e+02
vsize              273077.0  2.049264e+09  1.179834e+08    0.0  1.958326e+09
cminflt                 0.0           NaN           NaN    NaN           NaN

                            50%           75%           max
CPU_USAGE          1.300000e-01  3.700000e-01  1.108900e+02
UidRxBytes         0.000000e+00  0.000000e+00  8.872786e+06
UidRxPackets       0.000000e+00  0.000000e+00  6.165000e+03
UidTxBytes         0.000000e+00  0.000000e+00  9.830372e+06
UidTxPackets       0.000000e+00  0.000000e+00  6.748000e+03
cutime             0.000000e+00  0.000000e+00  1.100000e+01
guest_time         0.000000e+00  0.000000e+00  0.000000e+00
importance         3.000000e+02  4.000000e+02  4.000000e+02
lru                0.000000e+00  1.100000e+01  1.600000e+01
num_threads        3.000000e+01  5.500000e+01  1.410000e+02
otherPrivateDirty  4.308000e+03  1.354800e+04  1.928560e+05
priority           2.000000e+01  2.000000e+01  2.000000e+01
rss                6.959000e+03  1.120600e+04  5.466800e+04
stime              3.450000e+02  1.474000e+03  4.662900e+04
utime              5.650000e+02  2.636000e+03  4.284500e+04
vsize              2.026893e+09  2.125877e+09  2.456613e+09
cminflt                     NaN           NaN           NaN
Exploring the New “18-apps” Dataset
Please use the standard pandas functions to explore the new dataset (e.g. info(), describe(), head(), tail(), and so on) and answer the following questions:
- How many features exist in the original table? Which column contains the label?
- From the pandas output in the previous cell do you see any irregularities in the dataset?
- What are the names of the applications contained in this “18-apps” dataset? Do you recognize some of these apps?
- What are the frequencies of these apps in the dataset? Are there apps that are overrepresented or underrepresented in the dataset? According to this data, which apps are used most often by this user?
Solution
The original table has 18 features, from CPU_USAGE through cminflt. The ApplicationName column contains the labels.
Several irregularities can be uncovered by carefully looking at the outputs of df.info() and df.describe():
- The cminflt column does not contain any data.
- The guest_time column contains all zeros.
- Several columns have missing data: CPU_USAGE, cutime, num_threads, priority, rss, state, stime, utime, vsize.
The answers to questions 3 and 4 are provided and discussed below.
Questions 3 and 4 in the challenge box above pertain to the distribution of the classes (i.e. labels) in the dataset.
As the table name (sherlock_18apps) suggests, there are 18 apps contained in the dataset, and the frequencies of these apps appearing in the table are as follows:
app_frequencies = df['ApplicationName'].value_counts()
print(app_frequencies)
print('Total num of apps =', len(app_frequencies))
Google App 60001
Chrome 28046
Facebook 20103
Geo News 19991
Messenger 19989
WhatsApp 19985
Photos 17382
ES File Explorer 16667
Gmail 16417
Calendar 8996
Moovit 8365
Waze 8237
Hangouts 7608
YouTube 5173
Maps 5159
Skype 4877
Moriarty 3616
Messages 2517
Name: ApplicationName, dtype: int64
Total num of apps = 18
The frequencies of the apps recorded in the table are representative of how frequently these apps are running. (This stems from the fact that Sherlock’s Applications.csv table contains records of the running apps on the phone, which were taken periodically at a regular interval: every 5 seconds.)
We can infer, although not definitively, that the Google app is the most frequently run app, significantly more than the other apps, followed by Chrome. Social media and messaging apps (Facebook, Messenger, WhatsApp) also appear frequently, as does a news app (Geo News). All of this suggests (though does not prove) that this user spent much time on the web, on social media, and on messaging platforms. The user also spent some time on a news site.
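To judge how unevenly the classes are represented, it can help to look at relative rather than absolute frequencies. The following is a minimal sketch on a small made-up Series; the name `toy_labels` is hypothetical and stands in for `df['ApplicationName']`.

```python
import pandas as pd

# Hypothetical stand-in for df['ApplicationName']
toy_labels = pd.Series(["Chrome", "Gmail", "Chrome", "Maps", "Chrome"])

# Absolute counts, sorted from most to least frequent
counts = toy_labels.value_counts()

# Relative frequencies (fractions that sum to 1)
fractions = toy_labels.value_counts(normalize=True)

print(counts)
print(fractions)
```

Applied to the counts listed above, the Google App accounts for 60001 of the 273129 records (about 22%), while Messages accounts for 2517 (under 1%).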
Data Cleaning and Preprocessing
This section will guide you to clean and preprocess the “18-apps” data so that it is suitable for neural network modeling.
Follow All the Steps
All the exercises below are mandatory. They constitute all the required steps to make the data ready for machine learning.
Required: Cleaning the “18-apps” Dataset
Let us first clean the data, based on the issues identified in the previous exercise box. Create a new dataframe called df2 which contains the cleaned data.
Hint: Only two pandas statements are required: one to remove bad data and the other to address missing data. As this is very similar to the previous dataset (“2-apps”), please do your best to work this out before looking at the solution.
Solution (Minimal)
The bare minimum cleaning steps for Sherlock’s “18-apps” data would look like this:
df2 = df.drop(['cminflt', 'guest_time'], axis=1)
df2.dropna(inplace=True)
Solution (Comprehensive)
Verbose code is often helpful, especially when automating the machine learning workflow, which we will do in a later episode of this lesson. The following code segments are examples of self-documenting code that also prints clear messages as it processes the data.
STEP 1: Columns with obviously irrelevant and missing data are removed.
# Missing data or bad data or irrelevant data
del_features_bad = [
    'cminflt',     # all-missing feature
    'guest_time',  # all-flat feature
]
df2 = df.drop(del_features_bad, axis=1)
print("Cleaning:")
print("- dropped %d columns: %s" % (len(del_features_bad), del_features_bad))
Cleaning:
- dropped 2 columns: ['cminflt', 'guest_time']
STEP 2: Remove rows with missing data.
print("- remaining missing data (per feature):")
isna_counts = df2.isna().sum()
print(isna_counts[isna_counts > 0])
print("- dropping the rest of missing data")
df2.dropna(inplace=True)
print("- remaining shape: %s" % (df2.shape,))
- remaining missing data (per feature):
CPU_USAGE      52
cutime         52
num_threads    52
priority       52
rss            52
state          52
stime          52
utime          52
vsize          52
dtype: int64
- dropping the rest of missing data
- remaining shape: (273077, 17)
Required: Separating Labels from Features
After the data is cleaned, we must separate the label column from the features. Create two variables named labels and df_features to contain the separated label array and feature matrix, respectively.
Solution
labels = df2['ApplicationName']
df_features = df2.drop('ApplicationName', axis=1)
One-Hot Encoding
In order to properly build and train neural networks for classification tasks, we need to encode the labels using one-hot encoding. This is necessary because most machine learning algorithms must treat both inputs and outputs as numerical values. Labels in classification tasks are variables of categorical type. Categorical variables, however, do not possess any numerical significance. Take the 18 apps in the Sherlock table we just loaded as an example: there is clearly no intrinsic order among these apps. Categorical variables are frequently represented in computers as integers for efficiency, or as text for human convenience. The integers that represent the different classes, however, do not possess any order in the numerical sense, nor can they be operated on mathematically. (We have briefly discussed this in our Big Data lesson, under “Data Wrangling and Visualization”.)
Converting Labels to One-Hot Encoding
One-hot encoding gets around this dilemma by representing each class by a separate integer, which can only be 0 or 1. An 18-class variable will be represented by a vector of 18 integers that are mostly zeros except for one. Here is an example of such an encoding:
App name | One-hot representation |
---|---|
Calendar | 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 |
Chrome | 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 |
ES File Explorer | 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 |
 | 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 |
… | … |
 | 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 |
Zelle | 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 |
This representation is reminiscent of the button-and-light interfaces of old electronic devices such as cassette tape players or old-fashioned DVD players! Training a classification NN model, therefore, amounts to training the model to switch on the correct light and switch off the rest.
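The encoding described above can also be built by hand, which makes the "one light per row" picture concrete. This is a minimal NumPy sketch using a hypothetical three-class example (the class names and labels below are illustrative, not the full 18-app list):

```python
import numpy as np

# Hypothetical three-class example (the real dataset has 18 classes)
class_names = ["Calendar", "Chrome", "Gmail"]
labels = ["Chrome", "Gmail", "Calendar", "Chrome"]

# Map each class name to its column position
class_index = {name: i for i, name in enumerate(class_names)}

# Start from all zeros, then switch on one "light" per row
onehot = np.zeros((len(labels), len(class_names)), dtype=int)
for row, name in enumerate(labels):
    onehot[row, class_index[name]] = 1

print(onehot)

# Decoding: the position of the 1 in each row gives back the class
decoded = [class_names[i] for i in onehot.argmax(axis=1)]
print(decoded)
```

Decoding with `argmax` is the same operation used later to turn a network's output vector into a predicted class.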
Pandas has a built-in tool to convert non-numerical, non-boolean values into one-hot representation, namely the get_dummies function:
df_labels_onehot = pd.get_dummies(labels)
Each variable of N unique categorical values will be substituted with N columns containing ones and zeroes.
Below is a snippet of the labels (df_labels_onehot) after one-hot encoding.
df_labels_onehot.head(5)
For each row, there is a single 1 corresponding to the selected category, while there is a 0 in each of the remaining columns of that row.
The table above shows the first five rows of df_labels_onehot.
For example, the first row contains a 1 in the Maps column, with the rest of the columns containing 0.
The following two rows contain 1 in the Gmail column only.
Finally, notice that the input Series (labels) was converted into a DataFrame.
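To see the conversion and its inverse on something small, here is a hedged sketch with a made-up label Series (`toy_labels` is hypothetical). It passes `dtype=int` explicitly because recent pandas versions return True/False columns from `get_dummies` when no dtype is given.

```python
import pandas as pd

toy_labels = pd.Series(["Maps", "Gmail", "Gmail", "Chrome"])

# dtype=int requests 0/1 integers; recent pandas versions
# return True/False columns when dtype is not given
toy_onehot = pd.get_dummies(toy_labels, dtype=int)
print(toy_onehot)

# Columns are the unique class names in sorted order
print(list(toy_onehot.columns))

# idxmax(axis=1) returns, for each row, the column holding the 1
recovered = toy_onehot.idxmax(axis=1)
print(recovered.tolist())
```

The column order matters later: position k in the network's output corresponds to column k of the one-hot label table.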
Categorical Features Need One-Hot Encoding, Too!
Similarly, any input features that are of categorical data type must also be encoded using either integer encoding or one-hot encoding.
Which Feature Is Categorical?
There is one categorical variable among the features of the sherlock_18apps table. Can you identify which feature is categorical? (Hint: consider the df.head() or df.info() output.)
Solution
The state feature is a four-class categorical variable. This is evidenced by the data type of state printed by df.info():
<class 'pandas.core.frame.DataFrame'>
Int64Index: 273129 entries, 0 to 999994
Data columns (total 19 columns):
 #   Column             Non-Null Count   Dtype
---  ------             --------------   -----
 0   ApplicationName    273129 non-null  object
 1   CPU_USAGE          273077 non-null  float64
 2   UidRxBytes         273129 non-null  int64
 3   UidRxPackets       273129 non-null  int64
 4   UidTxBytes         273129 non-null  int64
 5   UidTxPackets       273129 non-null  int64
 6   cutime             273077 non-null  float64
 7   guest_time         273077 non-null  float64
 8   importance         273129 non-null  int64
 9   lru                273129 non-null  int64
 10  num_threads        273077 non-null  float64
 11  otherPrivateDirty  273129 non-null  int64
 12  priority           273077 non-null  float64
 13  rss                273077 non-null  float64
 14  state              273077 non-null  object
 15  stime              273077 non-null  float64
 16  utime              273077 non-null  float64
 17  vsize              273077 non-null  float64
 18  cminflt            0 non-null       float64
dtypes: float64(10), int64(7), object(2)
memory usage: 41.7+ MB
state is the only feature that has an object datatype; this is the most likely candidate. All the other features are numerical in type and meaning. Take a peek at the values:
df_features.head(5)
Clearly, the state feature is non-numerical. The number of classes in this variable can be discovered by the value_counts() method:
df_features['state'].value_counts()
S    271951
R       995
D       114
Z        17
Name: state, dtype: int64
The state feature indicates the state of the process (i.e. a running program):
- S stands for “sleeping” (where the process is idle);
- R indicates that the process is “running”, i.e. using much CPU;
- D usually means the process is busy waiting for data from/to storage device;
- Z means the process is already terminated but has not been cleaned up by the operating system.
The categorical columns in the raw table can be converted to one-hot encoding in the same way we converted the labels:
df_features = pd.get_dummies(df_features)
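When `get_dummies` is applied to a whole DataFrame, only the non-numerical columns are expanded, and the new columns are named after the original column and the category. A small sketch with a made-up two-column table (`toy_features` is hypothetical) shows the effect:

```python
import pandas as pd

# Hypothetical miniature of the feature table
toy_features = pd.DataFrame({
    "rss": [4894, 6959, 11206],
    "state": ["S", "R", "S"],
})

# Numerical columns pass through; 'state' becomes state_R, state_S
encoded = pd.get_dummies(toy_features, dtype=int)
print(encoded)
print(list(encoded.columns))
```

In the real table, the four values of state (D, R, S, Z) become four columns, which is why the feature count grows from 16 to 19.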
One-Hot Encoding in Scikit-Learn
One-hot encoding for labels is generally not necessary in Scikit-learn. Why did we not have to explicitly apply one-hot encoding to the labels in scikit-learn? This is because ML objects such as DecisionTreeClassifier perform this for us, behind the scenes.
One-hot encoding is still necessary for categorical features in Scikit-learn. Interested learners are encouraged to read Encoding of Categorical Variables from the Scikit-learn MOOC by INRIA for an in-depth discussion.
To summarize: With one-hot encoding, the categorical variables (including the label array, which was a vector of strings), are converted to a matrix of ones and zeros to represent input or output categorical values for NN models.
Feature Scaling
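The next step is to scale the features in order to normalize the values. StandardScaler does this by subtracting each column's mean and dividing by its standard deviation, so that every feature ends up with zero mean and unit variance.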
print("Step: Feature scaling with StandardScaler")
df_features_unscaled = df_features
scaler = preprocessing.StandardScaler()
scaler.fit(df_features_unscaled)
# Recast the features still in a dataframe form
df_features = pd.DataFrame(scaler.transform(df_features_unscaled),
columns=df_features_unscaled.columns,
index=df_features_unscaled.index)
print("After scaling:")
print(df_features.head(10))
print()
Step: Feature scaling with StandardScaler
After scaling:
CPU_USAGE UidRxBytes UidRxPackets UidTxBytes UidTxPackets cutime \
0 -0.165792 -0.010623 -0.015068 -0.008245 -0.016022 -0.185461
6 0.308049 -0.010623 -0.015068 -0.008245 -0.016022 -0.185461
11 -0.140853 -0.010623 -0.015068 -0.008245 -0.016022 -0.185461
18 -0.196966 -0.010623 -0.015068 -0.008245 -0.016022 -0.185461
19 -0.143970 -0.010623 -0.015068 -0.008245 -0.016022 -0.185461
28 -0.047332 -0.003421 0.092426 -0.001260 0.149189 -0.185461
29 -0.196966 -0.010623 -0.015068 -0.008245 -0.016022 -0.185461
32 -0.200083 -0.010623 -0.015068 -0.008245 -0.016022 -0.185461
35 -0.181379 -0.010623 -0.015068 -0.008245 -0.016022 -0.185461
39 -0.206318 -0.010623 -0.015068 -0.008245 -0.016022 -0.185461
importance lru num_threads otherPrivateDirty priority rss \
0 0.967513 1.621094 -0.271421 -0.188207 0.212762 0.497013
6 0.967513 1.621094 -0.830621 0.748432 0.212762 2.335413
11 -0.157189 -0.742150 1.219779 -0.039008 0.212762 1.090659
18 -0.157189 -0.742150 -0.942461 -0.555284 0.212762 -0.567867
19 -0.157189 -0.742150 1.406179 -0.312737 0.212762 0.002106
28 -0.157189 -0.742150 2.039939 0.680739 0.212762 1.858918
29 0.967513 1.621094 -0.532381 -0.336024 0.212762 -0.112617
32 -0.157189 -0.742150 -1.054301 -0.593373 0.212762 -0.918915
35 0.967513 1.305995 -0.718781 -0.435688 0.212762 -0.369580
39 -1.281891 -0.742150 -1.091581 -0.570282 0.212762 -1.120843
stime utime vsize state_D state_R state_S state_Z
0 -0.335871 -0.412842 0.127145 -0.020436 -0.060473 0.064346 -0.00789
6 -0.367538 -0.431809 -0.010889 -0.020436 -0.060473 0.064346 -0.00789
11 -0.275620 -0.369463 0.487610 -0.020436 -0.060473 0.064346 -0.00789
18 -0.374264 -0.463921 -1.320928 -0.020436 -0.060473 0.064346 -0.00789
19 -0.273939 -0.384110 1.316752 -0.020436 -0.060473 0.064346 -0.00789
28 -0.108319 -0.140735 1.898536 -0.020436 -0.060473 0.064346 -0.00789
29 -0.379588 -0.463921 -1.018926 -0.020436 -0.060473 0.064346 -0.00789
32 -0.381830 -0.461104 -1.064197 -0.020436 -0.060473 0.064346 -0.00789
35 -0.352405 -0.439696 -0.751398 -0.020436 -0.060473 0.064346 -0.00789
39 -0.343438 -0.459038 -1.089401 -0.020436 -0.060473 0.064346 -0.00789
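The scaling step can be checked by hand: standardization computes z = (x - mean) / std for each column. The sketch below uses a small made-up matrix (the numbers are illustrative only) and compares the manual formula with StandardScaler.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix: 4 samples, 2 features
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0],
              [4.0, 700.0]])

# Manual standardization: z = (x - mean) / std, column by column
# (np.std uses the population standard deviation by default)
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

# The same operation with scikit-learn
X_sklearn = StandardScaler().fit_transform(X)

print(np.allclose(X_manual, X_sklearn))
print(X_sklearn.mean(axis=0), X_sklearn.std(axis=0))
```

This also explains why the one-hot state_D, state_R, state_S and state_Z columns in the output above no longer contain plain 0 and 1: they were standardized along with all the other features.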
Splitting to Training and Validation Datasets
As the final step, we split the original dataset into training and validation sets:
test_size = 0.2
random_state = np.random.randint(1000000)
print("Step: Train-test split test_size=%s random_state=%s" \
% (test_size, random_state))
train_features, test_features, train_L_onehot, test_L_onehot = \
train_test_split(df_features, df_labels_onehot,
test_size=test_size, random_state=random_state)
print("- training dataset: %d records" % (len(train_features),))
print("- testing dataset: %d records" % (len(test_features),))
print("Now the data is ready for machine learning!")
sys.stdout.flush()
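The code above draws a random random_state and prints it, so that a particular split can be reproduced later by passing the printed value back in. The following sketch illustrates the idea on toy arrays; the seed 12345 is an arbitrary, hypothetical choice.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)   # 10 toy samples, 2 features
y = np.arange(10)                  # toy labels

# Two splits with the same random_state select the same rows
X_tr1, X_te1, y_tr1, y_te1 = train_test_split(
    X, y, test_size=0.2, random_state=12345)
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X, y, test_size=0.2, random_state=12345)

print(len(X_tr1), len(X_te1))
print(np.array_equal(y_te1, y_te2))
```

With test_size=0.2, one fifth of the rows are held out for validation and the rest are used for training.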
Training and Validating Neural Network Models
Let us train and validate a couple of neural network models and observe how well they classify the various running apps.
Model with No Hidden Layers
We will begin by defining the simplest neural network model, which has no hidden layers:
model = Sequential([
Dense(18, activation='softmax', input_shape=(19,),
kernel_initializer='random_normal')
])
This code construct is similar to that introduced in the previous episode to create a single-neuron model. A thorough explanation of this code has been given in the previous episode, so we only recap the most important points and make mention of the additional options.
-
Only one dense layer is defined, which will be the output layer. The input layer is not presented as a separate layer here, as it simply passes on the input features with no modification. The output layer feeds directly from the input; there is no hidden layer. After one-hot encoding, the sherlock_18apps input dataset contains 19 features; thus, the input_shape=(19,) argument. (As a reminder, (19,) refers to a tuple with one element, not a mere number 19. Do not omit the trailing comma.) There are 18 neurons in this layer, equivalent to the 18 output elements of the one-hot-encoded labels. The number of neurons must be 18, because the model must distinguish among the 18 applications contained in the dataset.
-
The activation argument defines the type of nonlinear function used to yield the response of each neuron in this layer. Today, the relu (ReLU = Rectified Linear Unit) function is a popular choice for hidden neurons. In classification models, the output layer needs to yield the (approximately) one-hot outcome: one for the target class, and zero elsewhere. Therefore, the output layer typically uses the softmax or sigmoid activation function. The softmax function is more widely used today.
-
(Optional) The kernel_initializer argument determines what kind of initial values are given to the weights of the neurons. The random_normal choice sets the weights to random values drawn from a normal (Gaussian) distribution; by default, Keras uses a mean of 0 and a standard deviation of 0.05 for this initializer. This is a more advanced feature that may need to be tweaked later, as necessary. To learn more, refer to this article: A Gentle Introduction To Weight Initialization for Neural Networks. Generally, we do not need to tweak this option at the initial stage of modeling.
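To make the points above concrete, here is a minimal NumPy sketch of what a single softmax dense layer computes: probs = softmax(x W + b). It is an illustration under stated assumptions, not the Keras implementation; the weights, the random seed, and the input rows are all made up.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_classes = 19, 18

# Randomly initialized parameters (illustrative values only)
W = rng.normal(0.0, 0.05, size=(n_features, n_classes))
b = np.zeros(n_classes)

def softmax(z):
    """Convert raw scores to probabilities that sum to 1 per row."""
    z = z - z.max(axis=1, keepdims=True)   # for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# A batch of 4 hypothetical (already scaled) input rows
x = rng.normal(size=(4, n_features))
probs = softmax(x @ W + b)

print(probs.shape)
print(probs.sum(axis=1))
print("trainable parameters:", W.size + b.size)
```

The parameter count is 19 × 18 weights plus 18 biases, i.e. 360 trainable parameters for the model with no hidden layer.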
It is often helpful to define a function that prepares a neural network model ready for training, like this one for a model without a hidden layer:
def NN_Model_no_hidden(learning_rate):
    """Definition of deep learning model with no hidden layer"""
    # (optional if these were already imported earlier)
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Dense
    from tensorflow.keras.optimizers import Adam
    model = Sequential([
        Dense(18, activation='softmax', input_shape=(19,),
              kernel_initializer='random_normal')
    ])
    # `learning_rate` is the current name of this argument
    # (older releases also accepted the alias `lr`)
    adam = Adam(learning_rate=learning_rate, beta_1=0.9, beta_2=0.999,
                amsgrad=False)
    model.compile(optimizer=adam,
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model
In this function, a single dense layer is defined.
The Adam optimizer is created, and the model is “compiled” by combining the layer definition, the optimizer, and the loss function.
The returned model is ready for training.
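The loss function named in the compile step, categorical cross-entropy, compares the one-hot target with the softmax output. For a single sample it is -sum(y_true * log(y_pred)), which reduces to minus the logarithm of the probability assigned to the true class. A short worked sketch with made-up probabilities:

```python
import numpy as np

# One-hot target: the true class is the second of three classes
y_true = np.array([0.0, 1.0, 0.0])

# Two hypothetical softmax outputs for the same sample
good_pred = np.array([0.05, 0.90, 0.05])
bad_pred = np.array([0.70, 0.20, 0.10])

def categorical_crossentropy(y_true, y_pred):
    """Loss for one sample: -sum(y_true * log(y_pred))."""
    return -np.sum(y_true * np.log(y_pred))

loss_good = categorical_crossentropy(y_true, good_pred)
loss_bad = categorical_crossentropy(y_true, bad_pred)
print(loss_good, loss_bad)
```

A confident, correct prediction gives a loss near zero, while a prediction that puts little probability on the true class is penalized heavily. The loss value printed during training is an average of this quantity over samples.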
Training and Validation: No Hidden Layer
Next, we call this function to construct the model, then train it.
A learning rate of 0.0003 is used, along with 5 epochs and a batch size of 32.
In the following episode, we will run experiments and vary these parameters.
model_0 = NN_Model_no_hidden(0.0003)
model_0_history = model_0.fit(train_features,
train_L_onehot,
epochs=5, batch_size=32,
validation_data=(test_features, test_L_onehot),
verbose=2)
Epoch 1/5
- 7s - loss: 1.6841 - acc: 0.5622 - val_loss: 1.2613 - val_acc: 0.7086
Epoch 2/5
- 7s - loss: 1.1109 - acc: 0.7380 - val_loss: 1.0026 - val_acc: 0.7739
Epoch 3/5
- 7s - loss: 0.9253 - acc: 0.7854 - val_loss: 0.8670 - val_acc: 0.7985
Epoch 4/5
- 7s - loss: 0.8160 - acc: 0.8050 - val_loss: 0.7785 - val_acc: 0.8109
Epoch 5/5
- 7s - loss: 0.7409 - acc: 0.8188 - val_loss: 0.7143 - val_acc: 0.8208
Note how the values change after each epoch, and note the time taken by each epoch.
Specifically, observe that the validation accuracy (val_acc:) starts at around 71% and keeps increasing, by a smaller amount each epoch, until it reaches 82% at the end of the final epoch.
Visualizing Model Training
To better analyze the training process, we would like to visualize the model training history. In Keras, the fit method returns a History object whose history attribute records the loss and metrics at the end of each epoch. We use it to create two charts:
- A plot of accuracy on the training and validation datasets over training epochs.
- A plot of loss on the training and validation datasets over training epochs.
model_0_history.history.keys()
dict_keys(['loss', 'acc', 'val_loss', 'val_acc'])
def plot_loss(model_history):
    # summarize history for loss
    plt.plot(model_history.history['loss'])
    plt.plot(model_history.history['val_loss'])
    plt.title('Model Loss')
    plt.ylabel('loss')
    plt.xlabel('epoch')
    plt.legend(['train', 'test'], loc='upper right')
    plt.show()

def plot_acc(model_history):
    # summarize history for accuracy
    plt.plot(model_history.history['acc'])
    plt.plot(model_history.history['val_acc'])
    plt.title('Model Accuracy')
    plt.ylabel('accuracy')
    plt.xlabel('epoch')
    plt.legend(['train', 'test'], loc='upper left')
    plt.show()

plot_loss(model_0_history)
plot_acc(model_0_history)
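The key names in the history dictionary depend on the Keras version: the output above shows 'acc' and 'val_acc', whereas TensorFlow 2.x releases report 'accuracy' and 'val_accuracy' for a model compiled with metrics=['accuracy']. The helper below is a hypothetical convenience function (not part of Keras) that looks up whichever name is present; the two dictionaries are made-up examples.

```python
def get_metric(history_dict, *candidates):
    """Return the first series whose key exists in history_dict."""
    for key in candidates:
        if key in history_dict:
            return history_dict[key]
    raise KeyError("none of %s found in history" % (candidates,))

# Hypothetical history dictionaries in the two naming styles
old_style = {"loss": [1.7, 1.1], "acc": [0.56, 0.74]}
new_style = {"loss": [1.7, 1.1], "accuracy": [0.56, 0.74]}

print(get_metric(old_style, "acc", "accuracy"))
print(get_metric(new_style, "acc", "accuracy"))
```

Inside a plotting function, a call such as `get_metric(model_history.history, 'acc', 'accuracy')` could replace the hard-coded key so that the same code runs on either version.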
Model with One Hidden Layer
Now we look at an example that builds a simple neural network model with only one hidden layer. The process is the same as for the model with no hidden layer, except that another layer is added between the input and output layers.
model = Sequential([
    Dense(hidden_neurons, input_shape=(19,), activation='relu',
          kernel_initializer='random_normal'),
    Dense(18, activation='softmax',
          kernel_initializer='random_normal')
])
Two dense layers are defined here.
The first dense layer is the hidden layer, which feeds directly from the input.
The input dataset (after one-hot encoding where needed) has 19 features, hence the input_shape=(19,) argument.
The second layer connects to, and takes its inputs from, the first layer.
We follow the standard practice of using the relu activation function for the hidden layer, and softmax for the output layer.
Below is a graphical depiction of the neural network model with hidden_neurons = 24. The input layer is shown in yellow and contains 19 neurons.
The green-colored layer in the middle is the hidden layer, and the red-colored layer is the output layer with 18 neurons.
def NN_Model(hidden_neurons, learning_rate):
    """Definition of deep learning model with one dense hidden layer"""
    # (optional if these were already imported earlier)
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Dense
    from tensorflow.keras.optimizers import Adam
    # define the network
    model = Sequential([
        Dense(hidden_neurons, activation='relu',
              input_shape=(19,),
              kernel_initializer='random_normal'),
        Dense(18, activation='softmax',
              kernel_initializer='random_normal')
    ])
    # define the optimization algorithm
    # (`Adam` is imported above; `learning_rate` is the current
    # name of the argument formerly also accepted as `lr`)
    adam = Adam(learning_rate=learning_rate,
                beta_1=0.9, beta_2=0.999,
                amsgrad=False)
    model.compile(optimizer=adam,
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model
This function takes two parameters:
- hidden_neurons, specifying the number of neurons in the hidden layer;
- learning_rate, determining the step size at each iteration.
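As an illustration of what the two-layer network computes, the following NumPy sketch runs a forward pass through a hidden ReLU layer followed by a softmax output layer, and counts the trainable parameters. The weights, seed, and inputs are made up, and hidden_neurons = 8 is chosen only as an example value.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, hidden_neurons, n_classes = 19, 8, 18

# Illustrative random parameters for the two dense layers
W1 = rng.normal(0.0, 0.05, size=(n_features, hidden_neurons))
b1 = np.zeros(hidden_neurons)
W2 = rng.normal(0.0, 0.05, size=(hidden_neurons, n_classes))
b2 = np.zeros(n_classes)

x = rng.normal(size=(4, n_features))       # 4 hypothetical input rows

hidden = np.maximum(0.0, x @ W1 + b1)      # ReLU: negatives become 0
scores = hidden @ W2 + b2
exp_scores = np.exp(scores - scores.max(axis=1, keepdims=True))
probs = exp_scores / exp_scores.sum(axis=1, keepdims=True)  # softmax

n_params = W1.size + b1.size + W2.size + b2.size
print(hidden.shape, probs.shape)
print("trainable parameters:", n_params)
```

With 8 hidden neurons, the count is 19 × 8 + 8 + 8 × 18 + 18 = 322 parameters; increasing hidden_neurons raises this number and with it the capacity of the model.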
Training and Validation: Model with One Hidden Layer
Now, let’s create a practical model (saving it to a Python variable named model_1) and train it!
During the training process, we measure the model accuracy using two sets of data: the training data themselves, and the validation (test) data.
We use hidden_neurons = 8 and learning_rate = 0.0003 as our baseline:
model_1 = NN_Model(8, 0.0003)
train1 = model_1.fit(train_features,
train_L_onehot,
epochs=10, batch_size=32,
validation_data=(test_features, test_L_onehot),
verbose=2)
epochs is a hyperparameter that is set before training a model.
One epoch is one complete pass of the entire training dataset, forward and backward, through the neural network. The whole dataset is usually too big to feed to the computer at once, so we divide it into several smaller batches.
Train on 218461 samples, validate on 54616 samples
Epoch 1/10
- 7s - loss: 1.4324 - acc: 0.5323 - val_loss: 0.8332 - val_acc: 0.7153
Epoch 2/10
- 7s - loss: 0.6667 - acc: 0.8175 - val_loss: 0.5515 - val_acc: 0.8768
Epoch 3/10
- 7s - loss: 0.4765 - acc: 0.8998 - val_loss: 0.4242 - val_acc: 0.9138
Epoch 4/10
- 7s - loss: 0.3849 - acc: 0.9164 - val_loss: 0.3579 - val_acc: 0.9196
Epoch 5/10
- 7s - loss: 0.3355 - acc: 0.9233 - val_loss: 0.3192 - val_acc: 0.9265
Epoch 6/10
- 7s - loss: 0.3032 - acc: 0.9273 - val_loss: 0.2911 - val_acc: 0.9295
Epoch 7/10
- 7s - loss: 0.2788 - acc: 0.9305 - val_loss: 0.2690 - val_acc: 0.9327
Epoch 8/10
- 7s - loss: 0.2589 - acc: 0.9343 - val_loss: 0.2504 - val_acc: 0.9367
Epoch 9/10
- 7s - loss: 0.2424 - acc: 0.9391 - val_loss: 0.2364 - val_acc: 0.9423
Epoch 10/10
- 7s - loss: 0.2295 - acc: 0.9434 - val_loss: 0.2243 - val_acc: 0.9438
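As a quick arithmetic check of how epochs and batches relate, the sketch below uses the training-set size reported in the log above and the batch_size passed to fit. It assumes that the final, incomplete batch is kept rather than dropped, which is the usual behavior when training on in-memory arrays.

```python
import math

n_train = 218461    # training samples, from the log above
batch_size = 32
epochs = 10

# number of weight updates (batches) in one epoch
batches_per_epoch = math.ceil(n_train / batch_size)
# the last batch holds whatever samples are left over
last_batch_size = n_train - (batches_per_epoch - 1) * batch_size
# total number of weight updates over the whole training run
total_updates = batches_per_epoch * epochs

print(batches_per_epoch)   # 6827
print(last_batch_size)     # 29
print(total_updates)       # 68270
```

So each of the 10 epochs above consists of several thousand small weight updates, not a single one.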
The training process above ran for 10 epochs.
At the end of each epoch, Keras prints the time taken to complete the epoch
(about 7 seconds), the value of the loss function (loss:), and the
accuracy (acc:) of the prediction.
The accuracy refers to the fraction of training data outcomes that
are correctly categorized by the network at that point in the training.
As we see, the accuracy increases as we train for more epochs.
Finally, val_loss and val_acc are computed using the
validation data which we set aside earlier for this purpose.
Your accuracy and loss numbers will not be identical to those printed above,
because the initial weights of the network are randomly initialized and therefore
differ from run to run; however, they should be very close.
We get a pretty good accuracy using this model, but it is still several percentage points away from 100%.
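To illustrate what the loss and accuracy numbers measure, here is a small NumPy sketch with three made-up samples and four classes. It follows the textbook definitions of categorical cross-entropy and accuracy; the Keras implementation additionally clips the probabilities for numerical safety, so treat this as an approximation of what Keras reports, not its exact code.

```python
import numpy as np

# three made-up samples, four classes (one-hot true labels)
y_true = np.array([[1, 0, 0, 0],
                   [0, 0, 1, 0],
                   [0, 1, 0, 0]])
# made-up softmax outputs; each row sums to 1
y_prob = np.array([[0.7, 0.1, 0.1, 0.1],
                   [0.2, 0.5, 0.2, 0.1],
                   [0.1, 0.6, 0.2, 0.1]])

# categorical cross-entropy: mean of -log(probability given to the true class)
p_true = (y_true * y_prob).sum(axis=1)
loss = -np.log(p_true).mean()

# accuracy: fraction of samples whose most probable class is the true class
accuracy = (y_prob.argmax(axis=1) == y_true.argmax(axis=1)).mean()

print("loss:", loss)
print("accuracy:", accuracy)
```

The second sample is misclassified (the network favors class 1 while the true class is 2), so the accuracy is 2/3. The loss is also sensitive to how confident the network is, which is why the loss can keep decreasing even when the accuracy changes little.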
Plotting Training Progress
Note that we saved the output of model.fit to a variable (e.g. train1, train2, …). The training history is stored comprehensively in this variable. For example, train1.history contains the detailed history of the loss and accuracy values printed during the training.
Exercise: Make two plots, one for the loss and the other for the accuracy (both for the training and validation data), to show how the model improves as a result of the training.
Solution
# the keys match the training log above; newer Keras versions
# name them 'accuracy' and 'val_accuracy' instead
train_acc = train1.history['acc']
val_acc = train1.history['val_acc']
train_loss = train1.history['loss']
val_loss = train1.history['val_loss']
epoch_counter = range(1, len(train_acc)+1)  # 1, 2, 3, ... epochs

# first plot: accuracy
plt.plot(epoch_counter, train_acc, label="train")
plt.plot(epoch_counter, val_acc, label="val")
plt.xlabel("Epoch")
plt.ylabel("Accuracy")
# plt.ylim(bottom=0.9, top=1.0)  # use this to adjust the y values displayed on the plot
plt.legend()
plt.show()

# second plot: loss
plt.plot(epoch_counter, train_loss, label="train")
plt.plot(epoch_counter, val_loss, label="val")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.legend()
plt.show()
Limit of Accuracy
What would be the ultimate accuracy of the network defined in model_1 above if we could train it longer?
Saving and Loading Model
At the end of the training, the model can be saved to disk for later usage:
from tensorflow.keras.models import save_model, load_model
model_name = "deeplearning_1"
# save the architecture, weights, and optimizer state to a single HDF5 file
save_model(model_1, model_name + ".h5")
Loading is just as easy:
model_reloaded = load_model(model_name + ".h5")
Traditional Machine Learning
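Now, we will compare our deep learning models to the traditional machine learning algorithms learned
in the previous session.
Here we test the Decision Tree and Logistic Regression models.
To simplify the code, we will use the model_evaluate
function to evaluate the performance of a machine learning model.
As written, it works directly with scikit-learn models, whose predict method returns class labels;
a Keras model returns class probabilities from predict, so its output would first need to be
converted to class labels.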
def model_evaluate(model, test_F, test_L):
    """Print the accuracy score and confusion matrix of `model` on the test data."""
    test_L_pred = model.predict(test_F)
    print("Evaluation by using model:", type(model).__name__)
    print("accuracy_score:", accuracy_score(test_L, test_L_pred))
    print("confusion_matrix:", "\n", confusion_matrix(test_L, test_L_pred))
    return
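For a neural network with a softmax output layer, predict returns one row of class probabilities per sample, not class labels. The following is one possible way to bridge that gap; the helper names nn_predict_labels and nn_accuracy are hypothetical and not part of the lesson's code, and the sketch assumes the true labels are one-hot encoded as in the training step.

```python
import numpy as np

def nn_predict_labels(model, features):
    """Return the index of the most probable class for each sample.

    Assumes model.predict() returns one row of class probabilities per
    sample, as a Keras model with a softmax output layer does.
    """
    probs = np.asarray(model.predict(features))
    return probs.argmax(axis=1)

def nn_accuracy(model, features, labels_onehot):
    """Accuracy of a probability-predicting model against one-hot labels."""
    true_idx = np.asarray(labels_onehot).argmax(axis=1)
    pred_idx = nn_predict_labels(model, features)
    return float((pred_idx == true_idx).mean())
```

With the objects defined earlier, a call could look like nn_accuracy(model_1, test_features, test_L_onehot). The two index arrays could equally be passed to accuracy_score and confusion_matrix to get the same report as model_evaluate prints.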
Decision Tree
ML_dtc = DecisionTreeClassifier(criterion='entropy',
max_depth=6,
min_samples_split=8)
%time ML_dtc.fit(train_features, train_labels)
CPU times: user 897 ms, sys: 8.37 ms, total: 906 ms
Wall time: 906 ms
DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='entropy',
max_depth=6, max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=8,
min_weight_fraction_leaf=0.0, presort='deprecated',
random_state=None, splitter='best')
model_evaluate(ML_dtc, test_features, test_labels)
Evaluation by using model: DecisionTreeClassifier
accuracy_score: 0.9497216932766954
confusion_matrix:
[[ 1829 1 0 0 0 0 0 0 0 0 0 18 0 0 0 0 1 0]
[ 0 5477 0 0 0 69 0 0 0 0 0 0 0 5 0 0 2 0]
[ 1 610 2753 0 0 25 0 5 0 1 1 1 0 2 0 0 0 0]
[ 0 0 0 4029 0 0 15 0 0 0 0 0 0 0 0 0 0 10]
[ 0 0 0 0 4006 0 0 0 0 0 0 0 0 0 0 0 0 0]
[ 64 28 0 0 0 3183 1 0 0 0 0 1 0 49 0 0 0 0]
[ 0 143 0 0 0 2 10459 0 0 0 15 0 0 0 0 0 1369 0]
[ 0 58 0 0 0 24 4 1408 0 1 0 0 0 1 0 0 11 0]
[ 3 39 0 0 0 0 1 0 935 0 0 0 0 0 1 0 4 0]
[ 0 0 0 0 0 0 0 0 1 486 0 0 0 8 0 0 0 0]
[ 0 0 0 0 0 0 0 0 0 0 4016 0 0 0 0 0 0 0]
[ 0 0 0 0 0 0 0 0 0 0 0 1697 0 0 0 0 0 0]
[ 0 13 0 0 4 0 0 0 0 0 0 0 680 1 0 0 0 0]
[ 0 0 0 0 0 0 0 0 0 0 0 6 0 3473 0 0 0 0]
[ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1003 0 0 0]
[ 0 0 0 0 3 0 0 0 0 0 0 0 0 0 0 1642 0 0]
[ 0 4 0 0 0 0 4 0 0 0 0 0 0 0 0 0 3897 0]
[ 0 0 0 0 0 116 0 0 0 0 0 0 0 0 0 0 0 897]]
Logistic Regression
ML_log = LogisticRegression(solver='lbfgs')
%time ML_log.fit(train_features, train_labels)
CPU times: user 20.6 s, sys: 2.48 s, total: 23 s
Wall time: 23.1 s
/shared/apps/auto/py-scikit-learn/0.22.2.post1-gcc-7.3.0-wpia/lib/python3.7/site-packages/sklearn/linear_model/_logistic.py:940: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, l1_ratio=None, max_iter=100,
multi_class='auto', n_jobs=None, penalty='l2',
random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
warm_start=False)
model_evaluate(ML_log, test_features, test_labels)
Evaluation by using model: LogisticRegression
accuracy_score: 0.9197854108686099
confusion_matrix:
[[ 1387 3 63 0 0 319 0 0 0 0 0 0 0 72 5 0 0 0]
[ 0 4590 390 0 0 37 77 10 6 64 0 0 0 72 31 273 3 0]
[ 60 271 2817 0 0 7 13 4 0 0 0 0 0 85 141 1 0 0]
[ 0 1 0 4021 0 2 11 4 0 0 5 0 0 0 0 2 0 8]
[ 0 0 0 0 3999 0 0 0 0 7 0 0 0 0 0 0 0 0]
[ 47 39 14 0 0 3189 24 10 1 0 0 0 0 2 0 0 0 0]
[ 7 93 0 51 0 19 11628 8 0 0 29 58 0 0 0 0 93 2]
[ 0 28 0 2 0 33 1 1442 0 0 0 1 0 0 0 0 0 0]
[ 147 27 673 0 0 1 0 0 113 0 0 0 0 4 7 0 11 0]
[ 0 0 0 0 0 0 0 0 0 433 0 0 0 24 0 38 0 0]
[ 0 0 0 0 0 0 0 0 0 0 4016 0 0 0 0 0 0 0]
[ 0 0 0 0 0 0 3 0 0 0 0 1642 0 52 0 0 0 0]
[ 0 1 0 0 0 0 0 1 0 4 0 0 692 0 0 0 0 0]
[ 17 2 239 0 0 0 0 0 17 31 0 0 0 3080 80 13 0 0]
[ 99 5 172 0 0 45 0 0 7 0 0 0 0 2 673 0 0 0]
[ 0 3 0 2 0 0 0 0 0 0 0 0 0 6 0 1634 0 0]
[ 0 0 0 0 0 0 33 0 0 0 0 0 0 0 0 0 3872 0]
[ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 6 0 0 1007]]
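The ConvergenceWarning above suggests two remedies: scaling the data and increasing max_iter. The sketch below shows one way to combine both with a scikit-learn pipeline. It runs on synthetic stand-in data, not the Sherlock dataset, and the value max_iter=1000 is an arbitrary choice for illustration. If the features were already standardized in an earlier step, only the larger max_iter would be relevant.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# synthetic stand-in data (NOT the Sherlock dataset): 19 features, 4 classes
X, y = make_classification(n_samples=600, n_features=19, n_informative=8,
                           n_classes=4, n_clusters_per_class=1,
                           class_sep=2.0, random_state=0)
X[:, 0] *= 1000.0   # give one feature a much larger scale, as raw data often has

# standardize the features, then fit with a larger iteration budget
ML_log_scaled = make_pipeline(StandardScaler(),
                              LogisticRegression(solver='lbfgs', max_iter=1000))
ML_log_scaled.fit(X, y)
print("training accuracy:", ML_log_scaled.score(X, y))
```

On the real data, the same pipeline object could be fitted with train_features and train_labels and then passed to model_evaluate, since a pipeline offers the same predict method as a plain classifier. Whether this improves the accuracy reported above would have to be checked by running it.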
By now, we have good background knowledge about this dataset, and we know the accuracy scores we can get with the Decision Tree and Logistic Regression methods (about 95% and 92%, respectively). These are reasonably good, but still a few percentage points away from 99%. Our Decision Tree model performed with nearly identical accuracy to our neural network with one hidden layer (about 94% after 10 epochs), which means we need to find ways to create models with higher accuracy for neural networks to be worth using. We will explore these ways later.
Further improvement
Besides tuning those two hyperparameters, there are many other things we can try with a neural network. For example:
- Using the sigmoid activation function instead of relu
- Using 1000 epochs instead of 10 epochs
- Adding more hidden layers
- Using different optimizer algorithms.
Overall, the trends of the training and validation loss and accuracy should be monitored, and the relevant hyperparameters should be adjusted based on the results.
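The variations listed above could be explored with a more flexible builder function, sketched below. NN_Model_flex and its arguments are hypothetical names introduced here, not part of the lesson's code; the sketch keeps the 19 inputs and 18 outputs of NN_Model but uses the default Keras weight initializer and supports only three optimizers for brevity.

```python
from tensorflow import keras
from tensorflow.keras.layers import Dense

def NN_Model_flex(hidden_layers, learning_rate,
                  activation='relu', optimizer_name='adam'):
    """Hypothetical variant of NN_Model: several hidden layers,
    selectable activation function and optimizer."""
    model = keras.Sequential()
    model.add(keras.Input(shape=(19,)))           # 19 input features
    for n_neurons in hidden_layers:               # one Dense layer per entry
        model.add(Dense(n_neurons, activation=activation))
    model.add(Dense(18, activation='softmax'))    # 18 output classes
    if optimizer_name == 'adam':
        optimizer = keras.optimizers.Adam(learning_rate=learning_rate)
    elif optimizer_name == 'sgd':
        optimizer = keras.optimizers.SGD(learning_rate=learning_rate)
    elif optimizer_name == 'rmsprop':
        optimizer = keras.optimizers.RMSprop(learning_rate=learning_rate)
    else:
        raise ValueError("unsupported optimizer: " + optimizer_name)
    model.compile(optimizer=optimizer,
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model

# example: two hidden layers with sigmoid activation, trained with plain SGD
model_2 = NN_Model_flex([24, 24], 0.0003,
                        activation='sigmoid', optimizer_name='sgd')
model_2.summary()
```

The returned model can be trained with the same fit call as model_1, changing epochs as desired, so that the resulting loss and accuracy curves can be compared against the baseline.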
Key Points
With KERAS, we can easily build a network by defining the layers and connecting them together.