This lesson is still being designed and assembled (Pre-Alpha version)

An Introduction to Keras with Binary Classification Task

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • What is Keras?

  • What are the advantages of using Keras to build neural network models?

  • What are the basic steps for building a neural network model in Keras?

Objectives
  • Understanding Keras and its relationship to other neural network libraries available in the community.

  • Understanding the steps required to build, train, and validate basic neural network models in Keras.

Keras is a high-level software framework for building, training, and deploying neural network (NN) models. Keras provides a high-level interface to another framework named TensorFlow, which implements the actual complex mathematical algorithms for NN training and inference. Keras offers an easy-to-use application programming interface (API) in the Python programming language. As we shall see shortly, expressing an NN model using Keras is quite intuitive; this makes Keras suitable for new users. They can focus on the structure of the network model without having to be concerned with low-level mathematical details.

In this episode, we will use Keras and the preprocessed sherlock_2apps dataset to build and train a simple NN model for a binary classification task, i.e. distinguishing two smartphone apps. This is an extremely simple model for such a powerful ML method, but with it we shall learn all the fundamentals of building, training, and validating an NN model. This fundamental knowledge and skillset will be applicable to all realistic and most complicated NN models.

Defining a Neural Network in Keras

Required Python Libraries

First, we need to import the necessary Python libraries. We will be using the sequential model as stated previously.

import os
import sys
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import sklearn

# Borrow data preparation and model validation tools from scikit-learn:
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix
# Classic machine learning models:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# TensorFlow & Keras
import tensorflow as tf
import tensorflow.keras as keras

# Import Keras objects
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam

The Keras API is accessed via the tensorflow.keras Python module. Since Keras is quite a complex library, it is subdivided into many submodules. There are three basic Keras submodules that we should know:

(Many other submodules are avaiable, which we will likely need to use in real research projects.)

These three submodules are interrelated and represent the three basic building blocks of a NN model in Keras: A Keras model consists of one or more layers that are arranged in a specific order; an optimizer defines an algorithm by which the model is trained according to the given training dataset.

Data Preparation

Let us load and prepare the sherlock_2apps data (features and labels) to train a classification NN model. The data are already preprocessed and only need to be split into training and testing datasets using train_test_split, provided by scikit-learn:

df2_features = pd.read_csv('sherlock/2apps_4f/sherlock_2apps_features.csv')
df2_labels = pd.read_csv('sherlock/2apps_4f/sherlock_2apps_labels.csv')

train_F, test_F, train_L, test_L = train_test_split(df2_features, df2_labels, test_size=0.2)

In the split above, 80% of the data are used for training and 20% for validating (testing) the model.

Neuron, Layer, Network: Constructing a Model

There are two main ways we can programmatically build models in Keras, corresponding to two different APIs: sequential model API and functional model API.

Sequential Model API

A sequential model creates models layer-by-layer, with the outputs of the previous layer connecting to the inputs of the subsequent layer. This easy-to-use API is useful to construct a straightforward NN model that has exactly one input data and one output data. (A single input could be a multi-dimensional array of values; similarly for the output.)

Limitations of sequential models:

A Sequential NN Model

Figure: An example of a sequential NN model, visualized by the Keras API. This network may be trained to perform digit recognition by using the MNIST dataset (having 28x28 pixel input data and 10-category output data). Source: Keras Developer’s Guide.

Functional Model API

The functional model provides a way to create arbitrarily complicated models that include shared layers, or layers with multiple inputs and/or outputs. In this lesson, we will focus on the sequential model. Once we understand how to build a network with the sequential model, it will be straightforward to learn the functional model.

Functional Model

Figure: An example of a complex functional model from the Keras API. Source: Keras Developer’s Guide.

Examples of Complex Neural Network Models

The model depicted in the figure above could be useful for analyzing text-based documents or emails. The model has three distinct inputs (title, text body, and tags) and returns two output values (the priority and the appropriate department for the document). The three inputs are fed together into one network, and the network is trained to classify both the appropriate priority and department of the documents it analyzes. Other complex networks that need the functional API include those models with feedback loops, such as recurrent neural networks. These complex models are beyond the scope of this introductory lesson.

Defining a Neuron Layer

A single layer of neurons is defined by creating a Dense object. Let us create a single layer, with only one neuron, that takes in four input features and outputs one number (0 or 1, to indicate the app described by the input features):

dense1 = Dense(1, activation='sigmoid', input_shape=(4,))

The Dense object declares a standard, fully-connected neuron layer, which can be a hidden layer or an output layer. The Dense arguments have the following meaning:

Please refer to Keras documentation for the Dense layer for more information and additional parameters.

Defining a Neural Network Model

Keras’ Sequential object defines an NN model that takes the form of a simple sequence of layers. The layers can be declared at the same time that the sequential model is constructed, as follows:

model = Sequential([
            Dense(1, activation='sigmoid', input_shape=(4,))
        ])

In this example, there is only one layer, which is the output layer.

Where Are the Parameters?

Each Dense layer contains parameters (i.e. weights) that will be optimized while training the network. Further, each neuron can have a bias parameter that can also be optimized. For a fully-connected layer that has N neurons connected to M input values, there will be N × (M + 1) weights, where the extra factor 1 comes from the bias. As a consequence, a deep neural network with many layers will quickly accummulate an enormous number of weights, especially if the fully-connected layers have many neurons.

How Many Parameters Are There?

How many adjustable parameters are in the following NN model?

a NN model with 3 input values, 4 hidden neurons, and 2 output values

Solution

The network consists of 3 input values, a layer with 4 hidden neurons, and 2 output values. In the hidden layer, there are 4 × 4 parameters; in the output layer, there are 2 × 5 parameters. In total, therefore, there are 26 parameters.

Model Optimizer: Preparation for Training

In order to train the NN model we defined above, we need a few additional ingredients:

How an Optimization Algorithm Works in Training a Neural Network Model

Training an NN model involves progressively updating the weights of the neurons through backpropagation until the errors made by the model’s prediction are sufficiently minimized. These errors are quantified by a loss function, sometimes also called a cost function. Training (i.e. optimization) a neural network model is as much an art as it is science. The use of an appropriate optimization algorithm is critical for ensuring that we obtain the best model.

When a new model is trained for the first time, the weights are typically initialized to random values. Among the various methods for optimizing NN models, gradient descent has become a general method of choice, optimizing an NN model by minimizing its error function. The following graphics illustrate how gradient descent works:

Gradient descent illustration

Figure: An illustration of gradient descent for two-parameter optimization. Three optimization scenarios are indicated by the three black points and their tracks. Two points converge to the same terminal endpoint, but one converges to a different endpoint. The linked Wikimedia source includes animated graphics that demonstrate the progression of the optimization algorithm. Source: Jacopo Bertolotti, Wikimedia.

Imagine a single-neuron model that has only two weight parameters (e.g. two inputs and no adjustable bias). The loss function, plotted as a function of these parameters, will look like the mountainous terrain shown in the picture above. The model will start with a set of parameters, represented as a tiny “dot” on this terrain. The slope (negative gradient) of the terrain plus the imaginary “downward gravity” will pull this “dot” along a certain trajectory until it meets the lowest point of a valley. Gradient descent is an iterative algorithm by which the “dot” is moved through successive steps until the lowest point is closely reached. This is, in a nutshell, how the gradient descent algorithm works!

However, as the same illustration shows, not all the “dots” reach the same final point, as the parameters could get stuck in a higher valley, often called a local minimum. For this reason, it is often necessary to repeat the training process by starting at a different starting point to provide more confidence regarding the optimality of the trained model.

In an actual network optimization scenario, the gradient descent algorithm must work in a high-dimensional space, according to the number of adjustable parameters in the network model. This can involve many millions, or even billions, of parameters. For example, the AlexNet model, which set the record for image classification accuracy in 2012, has over 62 million parameters. More recently, a powerful text-generation NN model called GPT-3, introduced in 2021, has over 175 billion parameters! The sheer dimensionality of the parameter space alone indicates that it can be very challenging to train an NN model correctly and optimally. There are many tricks to help reduce the amount of time needed to train a model. For example, weights saved from a previously trained model can often be used as the initial values to speed up the training.

A detailed discussion on NN optimization algorithms is beyond the scope of this short training program; serious learners are advised to either take a reputable course on machine learning and deep learning or read a book on this subject. Some examples can be found in the reference section of this lesson.

Defining an Optimizer

Let us now create an optimizer object, which encapsulates the optimization algorithm used to improve the network during the training process. The tensorflow.keras.optimizers module contains many algorithms that can be used to train the network. Each algorithm has its own strengths and weaknesses that make it suitable for certain types of use cases.

We will be using the Adam optimizer, which was designed to effectively optimize neural networks in a wide range of scenarios:

adam_opt = Adam(learning_rate=0.0003, beta_1=0.9, beta_2=0.999, amsgrad=False)

Adam is a specific variant of the gradient descent algorithm implementation and has recently seen broad adoption in many application areas, such as computer vision and natural language processing. An article by Jason Brownlee, Gentle Introduction to the Adam Optimization Algorithm for Deep Learning, provides a gentle introduction to the Adam optimizer. Interested readers can also learn more about this optimizer in a paper, “Adam: A Method for Stochastic Optimization”, written by Diederik P. Kingma and Jimmy Ba, the original inventors of this algorithm.

The learning rate (explained below) must be determined when the optimizer object is created. The Adam optimizer will automatically adjust the learning rate as the training progresses. The rate of this adjustment is determined by two hyperparameters: β1 and β2. The following optional arguments to Adam object creation can tweak the optimizer’s behavior:

Please consult Keras API for Adam optimizer for more information about the Keras API for the Adam optimizer.

Learning Rate

The learning rate is a most important hyperparameter that controls how much the model parameters (neuron weights and biases) are changed according to the corrections estimated by backpropagation. The learning rate must be a real number between 0 and 1: 0 means that the model is not changed at all (i.e. ignoring changes estimated by backpropagation), whereas 1 means that an aggressive correction based on the backpropagation results is necessary. Typically, we use a value that is very small (closer to 0). For example, Keras’ default value for the learning rate is 0.001; we will start with a smaller value, which means a more conservative model update.

In choosing an appropriate learning rate, there is a trade-off between the rate of convergence (how fast the weights are changed from one iteration to the other) and the stability (whether the algorithm is making its way toward the lowest minimum of the loss function) of the algorithm. Additionally, a learning rate that is too small may lead to a suboptimal model because the parameters may become stuck in a local minimum. We must therefore find a “sweet spot” to obtain the most optimal model within a reasonable amount of time. For a more detailed explanation, see this article: Understand the Impact of Learning Rate on Neural Network Performance. We will revisit the issue of learning rate in a later episode.

Loss Function

The loss function determines how the error in the NN model’s prediction is quantified. Just like the optimizers, many kinds of loss functions are used in real applications, each tailored for a specific kind of machine learning task. There are two important functions that we must know when working with classification tasks:

  1. Binary cross-entropy loss, designed for binary classification tasks such as the ML task at hand with the sherlock_2apps dataset.

  2. Categorical cross-entropy loss, for multi-class classification tasks (i.e. classification involving more than two categories).

These loss functions are provided by Keras under the tensorflow.keras.losses submodule. In Keras, the choice of the loss function is specified when we compile the NN model (see the loss argument of the compile function in the next section). The various loss functions provided by the Keras package can be found in Keras API documentation on Losses. The most frequently used loss functions and their specific applications are described briefly in this article: “Understanding Different Loss Functions for Neural Networks.

Compiling the Model: Putting Them All Together

As a final step before training the model in Keras, we need to compile the model. The compile function of the model object ties three components together: the neural model itself, the optimizer, and the loss function. This step furnishes a workable model that can be trained and later used for inference. Here is the code:

model.compile(optimizer=adam_opt,
              loss='binary_crossentropy',
              metrics=['accuracy'])

The key compile arguments are:

For the loss function specification, Keras allows us to specify the function name as a string (as we did above) or the constructed object (i.e. tf.keras.losses.BinaryCrossentropy()). The loss functions for classification problems can be specified in this way:

  1. For binary classification tasks: loss='binary_crossentropy' or loss=tf.keras.losses.BinaryCrossentropy().

  2. For multi-class classification tasks: loss='categorical_crossentropy' or loss=tf.keras.losses.CategoricalCrossentropy().

Congratulations, now the model is truly ready for training! Let us now train the model and observe how well it performs.

Recap: Steps in Neural Network Model Construction

Based on the discussion and hands-on code you have learned so far, please recap the steps required to construct an NN model in Keras using the model sequential API.

Solution

Step 1: Define the NN model by creating the Sequential object and declaring the layers of the network in its argument.

model = Sequential([
            Dense(1, activation='sigmoid', input_shape=(4,))
        ])

Step 2: Create an optimizer object to be used in the training phase. This includes defining the learning rate value.

adam_opt = Adam(learning_rate=0.0003, beta_1=0.9, beta_2=0.999, amsgrad=False)

Step 3: Compile the model. This step will integrate the optimizer defined in step 2, the loss function, and the metric(s) for model evaluation.

model.compile(optimizer=adam_opt,
              loss='binary_crossentropy',
              metrics=['accuracy'])

Training and Validating the Model

The Keras NN model is trained by calling the fit function of the model object:

model_history = model.fit(train_F, train_L,
                          epochs=5, batch_size=32,
                          validation_data=(test_F, test_L),
                          verbose=2)

The key arguments to fit are as follows:

More on Training Algorithm: Epoch and Batch Size

Training a network is an iterative process. In each step, the optimization algorithm will perform a full sweep over the entire training dataset (often called an epoch), wherein the network weights are updated through the forward and backward propagation steps. In normal cases, as more iterations are performed, the loss function will eventually decrease to a minimum value, in which case the optimization has reached its goal. We therefore must supply an adequate number of epochs in order to adequately train a network model.

Another important argument to the fit function is the batch_size. Many algorithms that are used in practice perform the network weight updates in small batches (the so-called “mini-batch gradient descent”). This approach was empirically found to offer the best trade-off between computational cost and algorithm accuracy. A larger batch size results in a more accurate gradient estimate at the cost of very expensive computation. The size of the batch is an important hyperparameter that must be carefully controlled in the training phase, since it can affect the quality (e.g. accuracy) of the trained model.

Training Results

Here is an example of the results of the training, printed by the fit function:

Epoch 1/5
15303/15303 - 38s - loss: 0.4681 - accuracy: 0.7984 - val_loss: 0.3646 - val_accuracy: 0.8491
Epoch 2/5
15303/15303 - 23s - loss: 0.3554 - accuracy: 0.8491 - val_loss: 0.3526 - val_accuracy: 0.8491
Epoch 3/5
15303/15303 - 26s - loss: 0.3497 - accuracy: 0.8491 - val_loss: 0.3500 - val_accuracy: 0.8492
Epoch 4/5
15303/15303 - 26s - loss: 0.3480 - accuracy: 0.8494 - val_loss: 0.3490 - val_accuracy: 0.8495
Epoch 5/5
15303/15303 - 25s - loss: 0.3472 - accuracy: 0.8497 - val_loss: 0.3484 - val_accuracy: 0.8496

Keras prints out a progress report at the end of every epoch (this behavior is determined by the verbose=2 argument to the fit function). Consider the first record printed, where the meaning of each reported field is as follows:

The fit function’s output shows the results of training from 5 iterations or epochs. In each epoch, the training algorithm makes the model go through the all training data once, then adjusts the model parameters (neuron weights and biases) to better conform to the training data. As the result shows, each epoch took ~20 seconds to complete (this timing will vary based on the actual computer hardware used to run this training process). At the end, the loss function drops down to just below ~0.35, and the accuracy is ~0.85.

Interestingly, Keras by default validates the model at the end of every epoch; therefore, we do not need to perform a separate validation.

Some Caveats

Your exact printout of the metrics will be different from the printout above, since there are variabilities and randomness associated with the NN training process (different computers, whether a GPU is used or not, and different random numbers being used). As long as the results of the same epoch number fluctuate around similar values, these results are all valid.

Keras computes the metrics from the training data before the model’s weights are updated but the metrics from the validation data after the weights are updated. As a result, the validation metrics seem to be better, but that is not the case. This difference will become negligible as the model is further refined with more epochs. As with any machine learning method in general, it is better to consider only the validation metrics to judge the performance of the trained network, because validation data provide an unbiased estimate of the model’s performance.

Recording Training History

The fit function returns a data structure that contains the history of the optimization (such as the values of the loss function, measured accuracy using training and validation data, etc.). We will learn how to use this to create training plots in the next episode.

Did We Train Enough?

Consider the values of the val_accuracy metric in the training output above. Do you think we have trained the model reasonably well? What will happen if we train the network with more epochs?

Solution

Optimal model training involves a trade-off between how much computing we are willing to perform (which translates to how long we have to wait until we get the best model) and the accuracy of the model. (There is also an issue of overfitting vs. underfitting, as will be explained in a later episode.) In the example above, improvement in the accuracy can only be seen in the fourth digit; hence, it is negligible. Therefore, five epochs is sufficient in this example.

How Did Our Model Perform?

The final accuracy of this model, according to the output printed above, is about 85% (the last val_accuracy value). The final loss function is just below 0.35, which did not drop much from the first value of 0.36. Considering that our model had only one neuron layer, which also serves as the output layer, this is actually not bad, though it is far from being a reliable model. Therefore, we should expect a meager accuracy outcome. The purpose of neural network modeling is to improve its performance by adding as many layers as possible to achieve the best result, within the bounds of what is computationally feasible.

Comparing Neural Network vs Traditional Machine Learning Models

We just finished constructing and testing our first NN model with Keras. It is instructive that we compare the performance results of our one-neuron model against the performance of traditional ML models.

Results from Traditional Machine Learning Methods

In the Machine Learning lesson, we introduced and studied in detail two ML models: the logistic regression and decision tree classifiers. Train these models and evaluate their accuracy scores. Please compare the accuracy of the NN model we just trained above with that of these two classic ML models.

Hint: For comparison purposes, it is helpful to define a function to evaluate a classic ML model’s performance (accuracy and confusion matrix):

def model_evaluate(model,test_F,test_L):
    test_L_pred = model.predict(test_F)
    print("Evaluation by using model:",type(model).__name__)
    print("accuracy_score:",accuracy_score(test_L, test_L_pred))
    print("confusion_matrix:\n",confusion_matrix(test_L, test_L_pred))
    return

Decision Tree

Example code & output:

model_dtc = DecisionTreeClassifier(criterion='entropy', max_depth=3, min_samples_split=8)
model_dtc.fit(train_F, train_L)
model_evaluate(model_dtc, test_F, test_L)
Evaluation by using model: DecisionTreeClassifier
accuracy_score: 0.9896506375435988
confusion_matrix:
[[75592   228]
[ 1039 45564]]

Logistic Regression

Example code & output:

model_lr = LogisticRegression(solver='lbfgs')
model_lr.fit(train_F, train_L.to_numpy().ravel())
model_evaluate(model_lr, test_F, test_L)
Evaluation by using model: LogisticRegression
accuracy_score: 0.8495789189939799
confusion_matrix:
[[73706  1973]
 [16442 30302]]

Which ML model performs the best? Do you suspect that the one-neuron NN model is equal to one of these traditional models in terms of performance? What is your basis for making that claim?

Solution

As it currently stands, the best performing ML model we’ve encountered so far is actually the decision tree classifier, at around 99% accuracy! This means that the decision tree’s model architecture is flexible enough to capture the complexity in the data. In contrast, both logistic regression and the one-neuron NN have an accuracy that is about 85%, which is not very high.

Interestingly, a one-neuron NN model with a sigmoid activation function is mathematically identical to the logistic regression function. (Some minute differences exist in the training process, maybe due to the different training algorithms and different regularization terms used; however, these should not cause a qualitative difference.) This is why they have the same accuracy score. In fact, logistic regression can be thought of as the original building block of the NN model, before more variations in the activation functions and types of layers were introduced.

Programming Exercise: Encapsulating the Neural Network Model Construction

In computer programming, writing functions is an important way to help us repeat a sequence of commands efficiently. A function is essentially a callable computer subprogram that may accept user-specified parameters to vary the actions of the subprogram’s behavior.

Back to our NN modeling, it is very convenient to define a function to construct a ready-to-train NN model since we will use this many times in our experiments. As we shall see going forward, deep learning involves tedious trial-and-error runs. We will have to repeatedly modify, train, and validate our network models in order to find the most optimal one.

As an exercise, let’s create a function to construct a one-neuron model that is ready to be trained to perform the binary classification task on the sherlock_2apps data. It must have four inputs and one output. Here’s the skeleton:

def NN_binary_clf(learning_rate):
    """Constructs a one-neuron binary classifier using Keras."""
    model = ...
    ...
    return model

After this function, one should be able to simply construct and train the Keras model in this way:

# An example invocation:
model1 = NN_binary_clf(0.0003)
history1 = model1.fit(...)

Question 1: First, remind yourself: What are the required steps to include in the NN_binary_clf function above?

Solution 1

  1. Define the NN model
  2. Define the optimizer
  3. Compile the NN model to bring together the optimizer, loss function, and model evaluation metrics.

Question 2: Now, write the contents (definition) of the NN_binary_clf function. Use the same choices as we used earlier for the optimizer, loss function, etc.

Solution 2

def NN_binary_clf(learning_rate):
    """Create a one-neuron binary classifier using Keras.
    `learning_rate` is the only adjustable hyperparameter here."""
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Dense
    from tensorflow.keras.optimizers import Adam
    model = Sequential([
                Dense(1, activation='sigmoid', input_shape=(4,))
            ])
    adam_opt = Adam(learning_rate=learning_rate,
                    beta_1=0.9, beta_2=0.999, amsgrad=False)
    model.compile(optimizer=adam_opt,
                  loss='binary_crossentropy',
                  metrics=['accuracy'])
    return model

Note: The importation of Keras objects (Sequential, etc.) is optional, but this will allow the function to be used anywhere regardless of the availability of these objects as global symbols.

Improving the Neural Network Model

Discuss with your colleague: What can we do to improve our one-neuron NN model? What will be the simplest next step? If possible, implement the improvement by creating and training a new model. What is the accuracy of the new model?

Solution

The easiest next step is to add one hidden layer. As a starting example, this hidden layer can have five neurons. The training and validating of this model is left for learners to do on their own as a very simple exercise. One may ask, what about having only three or two neurons in this layer? Which version will be the best?

Summary

In this episode, we learned the basic mechanics of building and training a neural network model using Keras. We used a binary classification task to distinguish between only two apps: Facebook and WhatsApp. While this is a rather basic problem, we have learned a lot of important concepts that we must know to carry out a deep learning modeling.

Spoiler: Performance Summary Notes for sherlock_2apps Classification

In terms of performance, the correctly trained one-neuron model above had an accuracy score (val_acc) of around 85%. Compared to the results of classic ML methods in the Machine Learning lesson, this NN performance was the same as that of logistic regression, significantly below the ~99% achievable with the decision tree model.

By adding more layers, the NN method can afford us extremely flexible models that can easily outperform any traditional ML technique. In later episodes, we will learn how to achieve better results with neural networks by adding more hidden neurons. We will also discuss when one ought to use deep learning approaches and when traditional ML methods should be used insetad.

Steps of Deep Learning with Keras

  1. Import necessary libraries (TensorFlow, scikit-learn);
  2. Load the clean, preprocessed datasets (labels and features);
  3. Split the data into training and testing datasets;
  4. Design the neural network model architecture (i.e. what types of layers to use, how many neurons per layer, and which activation function to choose);
  5. Declare the neural network model using either a sequential or functional API;
  6. Select an optimizer and determine its hyperparameters;
  7. Select a loss function;
  8. Compile the model;
  9. Fit the model using training data, and validate with testing data.

Key Points

  • Keras is an easy-to-use high-level API for building neural networks.

  • Main parts of a neural network layer: number of hidden neurons, activation function.

  • Main parts of a neural network model: layers, optimizer, learning rate, loss function, performance metrics.