{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "**DeapSECURE module 4: Deap Learning**\n", "\n", "# Session 1: Binary Classification\n", "\n", "Welcome to the DeapSECURE online training program!\n", "This is a Jupyter notebook for the hands-on learning activities of the\n", "[\"Deap Learning\" module](https://deapsecure.gitlab.io/deapsecure-lesson04-nn/), episode 4: \n", "Please visit the [DeapSECURE](https://deapsecure.gitlab.io/) website to learn more about our training program.\n", "\n", "In this notebook, we will learn how to use the Keras framework to build a very simple \"binary classification model\".\n", "We will build a one-neuron model to perform the \"application classification task\" using SherLock's \"**2-apps**\" dataset introduced in the [\"Machine Learning\"](https://deapsecure.gitlab.io/deapsecure-lesson03-ml/) module.\n", "A single neuron is the simplest neural network model for this classification task, because only one output is needed to distinguish the two different apps.\n", "\n", "\n", "**QUICK LINKS**\n", "* [Setup](#sec-setup)\n", "* [Loading Sherlock Data](#sec-load-sherlock)\n", "* [Binary Classification](#sec-Binary_clf)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## 1. Setup Instructions\n", "\n", "If you are opening this notebook from the Wahab OnDemand interface, you're all set.\n", "\n", "If you see this notebook elsewhere and want to perform the exercises on the Wahab cluster, please follow the steps outlined in our setup procedure.\n", "\n", "1. Make sure you have activated your HPC service.\n", "2. Point your web browser to https://ondemand.wahab.hpc.odu.edu/ and sign in with your MIDAS ID and password.\n", "3. Create a new Jupyter session using the \"legacy\" Python suite, then create a new \"Python3\" notebook. (See the ODU HPC wiki for more detailed help.)\n", "4. 
Get the necessary files using the commands below within Jupyter:\n", "\n", " mkdir -p ~/CItraining/module-nn\n", " cp -pr /shared/DeapSECURE/module-nn/. ~/CItraining/module-nn\n", " cd ~/CItraining/module-nn\n", "\n", "The file name of this notebook is `NN-session-1.ipynb`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.1 Reminder\n", "\n", "* Throughout this notebook, `#TODO` is used as a placeholder where you need to fill in something appropriate. \n", "\n", "* To run the code in a cell, press `Shift+Enter`.\n", "\n", "* Pandas cheatsheet\n", "\n", "* Summary table of the commonly used indexing syntax from our own lesson.\n", "\n", "* Keras API documentation\n", "\n", "We recommend you open these in separate tabs or print them;\n", "they are handy references for writing your own code." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.2 Loading Python Libraries\n", "\n", "Next, we need to import the required libraries into this Jupyter notebook:\n", "`pandas`, `numpy`, `matplotlib.pyplot`, `sklearn`, and `tensorflow`.\n", "\n", "**For Wahab cluster only**: before importing these libraries, we have to load the `DeapSECURE` environment module:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Run to load environment modules on HPC\n", "module(\"load\", \"DeapSECURE\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A few additional modules need to be loaded:\n", "* `cuda` module for the calculations on GPU\n", "* `py-tensorflow` module for TensorFlow and Keras" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "\n", "\n", "\n", "Currently Loaded Modules:\n", " 1) intel-mkl/2019.4.243 22) py-cycler/0.10.0\n", " 2) texlive/2020 23) py-kiwisolver/1.1.0\n", " 3) zlib/1.2.11 24) py-pyparsing/2.4.2\n", " 4) libpng/1.6.37 25) libjpeg-turbo/2.0.3\n", " 5) bzip2/1.0.8 26) py-pillow/6.2.0\n", " 6) 
freetype/2.10.1 27) py-matplotlib/3.1.1\n", " 7) xz/5.2.4 28) py-scipy/1.3.1\n", " 8) libtiff/4.0.10 29) py-seaborn/0.11.1\n", " 9) openjpeg/2.3.1 30) py-pip/19.3\n", " 10) python/3.7.3 31) py-joblib/0.14.0\n", " 11) py-markupsafe/1.0 32) py-scikit-learn/0.22.2.post1\n", " 12) py-babel/2.6.0 33) gmp/6.1.2\n", " 13) py-jinja2/2.10 34) mpfr/4.0.2\n", " 14) py-six/1.12.0 35) mpc/1.1.0\n", " 15) py-jupyter/1.1.4 36) DeapSECURE/2020\n", " 16) py-python-dateutil/2.8.0 37) hdf5/1.10.5\n", " 17) py-pytz/2019.3 38) py-numpy/1.16.3\n", " 18) py-numexpr/2.7.0 39) py-h5py/2.9.0\n", " 19) py-bottleneck/1.2.1 40) cuda/10.2.89\n", " 20) py-pandas/0.25.1 41) py-tensorflow/1.13.1\n", " 21) py-setuptools/41.4.0\n", "\n", " \n", "\n", "\n" ] } ], "source": [ "module(\"load\", \"cuda\")\n", "module(\"load\", \"py-tensorflow\")\n", "module(\"list\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can import all the required modules into Python:" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "\"\"\"Import the necessary Python modules\"\"\";\n", "\n", "import os\n", "import sys\n", "import pandas\n", "import numpy\n", "import seaborn\n", "from matplotlib import pyplot\n", "import sklearn\n", "\n", "# tools for machine learning:\n", "from sklearn import preprocessing\n", "from sklearn.model_selection import train_test_split\n", "# for evaluating model performance\n", "from sklearn.metrics import accuracy_score, confusion_matrix\n", "# classic machine learning models:\n", "from sklearn.linear_model import LogisticRegression\n", "from sklearn.tree import DecisionTreeClassifier\n", "\n", "# TensorFlow\n", "import tensorflow\n", "import tensorflow.keras as keras\n", "\n", "# KERAS objects\n", "from tensorflow.keras.models import Sequential\n", "from tensorflow.keras.layers import Dense\n", "from tensorflow.keras import optimizers\n", "\n", "%matplotlib inline" ] }, { "cell_type": "code", "execution_count": 21, "metadata": 
{}, "outputs": [], "source": [ "# Some advanced learners may like to use shortcuts,\n", "# so we give them here:\n", "pd = pandas\n", "np = numpy\n", "plt = pyplot\n", "sns = seaborn\n", "\n", "import tensorflow as tf" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## 2. Loading Preprocessed SherLock \"2-apps\" dataset\n", "\n", "First, we load the SherLock's \"2-apps\" _preprocessed_ features and labels into DataFrames.\n", "We use the reduced set of features saved at the end of the \"Machine Learning\" module." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "df2_features = pd.read_csv('sherlock/2apps_4f/sherlock_2apps_features.csv')\n", "df2_labels = pd.read_csv('sherlock/2apps_4f/sherlock_2apps_labels.csv')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After preprocessing and feature selection, we only have 4 features, namely: `cutime`,`num_threads`,`otherPrivateDirty`,`priority`. \n", "The label has two values: `0` representing **Facebook**, and `1` **WhatsApp**.\n", "\n", "As we do in the ML module, we first split the data into training and testing sets." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "train_F, test_F, train_L, test_L = train_test_split(df2_features, df2_labels, test_size=0.2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## 3. 
Binary Classification Task in Keras\n", "\n", "Keras is a powerful, high-level framework to develop and deploy neural network models in Python.\n", "Keras is intuitive to use, allowing rapid prototyping, experimentation, as well as deployment of deep learning models for real-world problems.\n", "Keras began as a high-level interface to several lower-level software frameworks such as Theano and TensorFlow; however, starting with version 2.4 it was [built exclusively for TensorFlow](https://github.com/keras-team/keras/releases/tag/2.4.0).\n", "In this notebook, we show how easy it is to define, train, evaluate, and deploy neural networks with Keras.\n", "\n", "The steps involved in deep learning are very similar to the steps of traditional machine learning:\n", "\n", "1. Loading and preprocessing the input data;\n", "2. Defining a neural network model using Keras;\n", "3. Compiling the network (model);\n", "4. Fitting (training) the network using the training data;\n", "5. Evaluating the performance of the network;\n", "6. Improving the model's performance iteratively by adjusting the network's hyperparameters and retraining;\n", "7. Deploying the model to make predictions (i.e. 
\"inference\").\n", "\n", "Of these steps, the second and third rely on Keras-specific objects and functions, chiefly the Keras model object." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.1 Defining a Neural Network the Keras Way\n", "\n", "**There are mainly two ways that we can build models in Keras:**\n", "\n", "* Sequential\n", "* Functional\n", "\n", "As the name suggests, the *sequential* model creates models layer by layer, where the outputs from the previous layer simply connect to the input of the subsequent layer.\n", "Please refer to the [Keras documentation on the Sequential model](https://keras.io/guides/sequential_model/) to learn more.\n", "\n", "Limitations of a sequential model:\n", "\n", "* It cannot create multiple models that share layers;\n", "* It cannot create models where layers have multiple inputs and outputs.\n", "\n", "The [*functional* model](https://keras.io/guides/functional_api/) provides a way to create arbitrarily complicated models that include shared layers, or layers with multiple inputs and/or outputs.\n", "In this series of notebooks, we will focus on the Keras sequential model.\n", "Once we understand how to build a network with the sequential model, it is straightforward to learn the other model." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.2 Constructing a Neural Network Model\n", "\n", "Let us create a function to construct a neural network with Keras.\n", "This model has four inputs defined by the SherLock \"2-apps\" dataset and one output to distinguish between the two applications: Facebook and WhatsApp.\n", "This function will be called `NN_binary_clf` (*clf* is short for \"classifier\"):" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "def NN_binary_clf(learning_rate):\n", "    \"\"\"Create a one-neuron binary classifier using Keras\"\"\"\n", "    # One sigmoid neuron fed by the four input features\n", "    model = Sequential([\n", "        Dense(1, activation='sigmoid', input_shape=(4,))\n", "    ])\n", "    # `lr` is the argument name in TensorFlow 1.x; newer releases call it `learning_rate`\n", "    adam = tf.keras.optimizers.Adam(lr=learning_rate,\n", "                                    beta_1=0.9, beta_2=0.999, amsgrad=False)\n", "    model.compile(optimizer=adam,\n", "                  loss='binary_crossentropy',\n", "                  metrics=['accuracy'])\n", "    return model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This function builds a `Sequential` model (full object name: `tensorflow.keras.models.Sequential`).\n", "The model has only one layer defined by this declaration:\n", "\n", "```python\n", "Dense(1, activation='sigmoid', input_shape=(4,))\n", "```\n", "\n", "The `Dense` class declares a regular fully-connected neural layer, which can be a hidden layer or an output layer.\n", "The arguments have the following meaning:\n", "\n", "* `1`: the number of outputs from this layer, which also defines the number of fully connected neurons in this layer.\n", "\n", "* `activation='sigmoid'` defines the (nonlinear) activation function used to transform the weighted sum of the input values to the output values.\n", "\n", "* `input_shape=(4,)` specifies that this layer connects to an input layer that has four inputs.\n", "\n", "Please see Keras' [documentation for the Dense layer](https://keras.io/api/layers/core_layers/dense/) for more information and additional parameters.\n", "\n", "In the 
`NN_binary_clf` function, this dense layer is both the first and the last layer in the model.\n", "\n", "The next statement in the function above,\n", "\n", "```python\n", "adam = tf.keras.optimizers.Adam(lr=learning_rate,\n", "                                beta_1=0.9, beta_2=0.999, amsgrad=False)\n", "```\n", "\n", "defines the *optimizer* used to train the model, i.e. to minimize the loss function.\n", "We use the Adam optimizer, which is a \"stochastic gradient descent method that is based on adaptive estimation of first-order and second-order moments\" ([ref](https://keras.io/api/optimizers/adam/)).\n", "This is the go-to optimizer for many deep learning practitioners.\n", "The critical parameter here is the *learning rate*, which determines how fast the model \"learns\" based on the feedback from the previous iteration.\n", "\n", "The final statement before `return` *compiles* the model, integrating it with the other key component of a network, which is the *loss function*:\n", "\n", "```python\n", "model.compile(optimizer=adam,\n", "              loss='binary_crossentropy',\n", "              metrics=['accuracy'])\n", "```\n", "\n", "The loss function is one of the key components of a neural network. The loss is the prediction error of the network, and the loss function is the method used to compute it. The loss is used to calculate the gradients, and the gradients are used to update the weights of the network; this is how a neural network is trained. The following are essential loss functions that can be used for most models 
([Towards Data Science](https://towardsdatascience.com/understanding-different-loss-functions-for-neural-networks-dd1ed0274718)):\n", " * Mean Squared Error (MSE)\n", " * Binary Crossentropy (BCE)\n", " * Categorical Crossentropy (CC)\n", " * Sparse Categorical Crossentropy (SCC)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.3 Model Fitting and Validation\n", "\n", "Next, we call the `NN_binary_clf` function to create a model object and start the fitting process. The key settings are:\n", "\n", "* epochs: The number of epochs is a hyperparameter that defines the number of times that the learning algorithm will work through the entire training dataset.\n", "\n", "* batch size: The number of examples from the training dataset used in each estimate of the error gradient is called the batch size; it is an important hyperparameter that influences the dynamics of the learning algorithm.\n", "\n", "* Loss function used: `binary_crossentropy`\n", "\n", "* Optimizer used: Adam optimizer." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Because validation is built into the `fit` call through the `validation_data` argument, there is no need to write separate validation code.**" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Train on 489691 samples, validate on 122423 samples\n", "WARNING:tensorflow:From /shared/apps/auto/py-tensorflow/1.13.1-gcc-7.3.0-j7tz/lib/python3.7/site-packages/tensorflow/python/ops/math_ops.py:3066: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.\n", "Instructions for updating:\n", "Use tf.cast instead.\n", "Epoch 1/5\n", " - 19s - loss: 0.4087 - acc: 0.8427 - val_loss: 0.3593 - val_acc: 0.8478\n", "Epoch 2/5\n", " - 18s - loss: 0.3536 - acc: 0.8487 - val_loss: 0.3512 - val_acc: 0.8488\n", "Epoch 3/5\n", " - 18s - loss: 0.3493 - acc: 0.8493 - val_loss: 0.3490 - val_acc: 0.8491\n", "Epoch 4/5\n", " - 18s - loss: 0.3478 - acc: 0.8497 - val_loss: 
0.3481 - val_acc: 0.8491\n", "Epoch 5/5\n", " - 18s - loss: 0.3471 - acc: 0.8498 - val_loss: 0.3475 - val_acc: 0.8492\n" ] } ], "source": [ "model = NN_binary_clf(0.0003)\n", "model_history = model.fit(train_F, train_L,\n", "                          epochs=5, batch_size=32,\n", "                          validation_data=(test_F, test_L),\n", "                          verbose=2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.4 Explaining the Result\n", "\n", "**Our model has only an input layer and an output layer, with no hidden layer in between; in this configuration it works much like logistic regression. Therefore, we should expect only modest accuracy. The strength of neural networks comes from adding hidden layers, which let the model capture more complex patterns and achieve better results.**\n", "\n", "This output covers 5 iterations, or epochs; in each epoch the model went through the entire training data once. As the output shows, the first epoch took 19 seconds to complete, and after the fifth epoch the training loss is 0.3471 and the training accuracy is 0.8498. What matters most for validation are the `val_loss` and `val_acc` values.\n", "\n", "**Question: Why are `val_loss` and `val_acc` important?**\n", "\n", "The answer is that `val_loss` is the value of the loss on the validation data, whereas `loss` is the value of the loss on the training data.\n", "Likewise, `acc` is the accuracy on the training data and `val_acc` is the accuracy on the validation data.\n", "Comparing the two tells us how well the model generalizes to data it was not trained on.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. Conclusion\n", "\n", "So far, we have imported the necessary libraries, loaded our preprocessed dataset, and fitted our model using Keras. Comparing the validation results with those of the earlier machine learning models, we see that our neural network with no hidden layer did worse than the decision tree and essentially the same as logistic regression. 
Thus, consider the following questions:\n", "\n", "**Which model has performed best so far: decision tree, logistic regression, or neural network?**\n", " \n", "**Which model trained faster?**\n", "\n", "**Why did the neural network perform worse in this example?**\n", " \n", "**How can we improve the performance of neural networks?**\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 4 }