{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "**DeapSECURE module 4: Deap Learning**\n", "\n", "# Session 2: Deep Learning\n", "\n", "Welcome to the DeapSECURE online training program!\n", "This is a Jupyter notebook for the hands-on learning activities of the\n", "[\"Deap Learning\" module](https://deapsecure.gitlab.io/deapsecure-lesson04-nn/),\n", "Please visit the [DeapSECURE](https://deapsecure.gitlab.io/) website to learn more about our training program.\n", "\n", "In this session, We will use this notebook to prepare the Sherlock dataset for the DL lesson\n", "\n", "## Data Preparation\n", "\n", "When preparing data for analytics and machine learning, up to two-thirds of the time is actually spent preparing the data.\n", "This may sound like a waste of time, but that step is absolutely crucial to obtaining trustworthy insight from the data.\n", "The goal of **data preparation** is to achieve a clean, consistent and processable state of data.\n", "\n", "In this session, you will perform data preparation used in the previous ML workshop.\n", "\n", "**QUICK LINKS**\n", "* [Setup](#sec-setup)\n", "* [Loading Sherlock Data](#sec-load_data)\n", "* [Traditional Machine Learning](#sec-ML)\n", "* [Deep Neural Network](#sec-NN)\n", "* [Parallel Computing](#sec-Par)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## 1. Setup Instructions\n", "\n", "If you are opening this notebook from the Wahab OnDemand interface, you're all set.\n", "\n", "If you see this notebook elsewhere, and want to perform the exercises on Wahab cluster, please follow the steps outlined in our setup procedure.\n", "\n", "1. Make sure you have activated your HPC service.\n", "2. Point your web browser to https://ondemand.wahab.hpc.odu.edu/ and sign in with your MIDAS ID and password.\n", "3. Create a new Jupyter session using \"legacy\" Python suite, then create a new \"Python3\" notebook. (See ODU HPC wiki for more detailed help.)\n", "4. Get the necessary files using commands below within Jupyter:\n", "\n", " mkdir -p ~/CItraining/module-nn\n", " cp -pr /shared/DeapSECURE/module-nn/. ~/CItraining/module-nn\n", " cd ~/CItraining/module-nn\n", "\n", "The file name of this notebook is `NN-session-2.ipynb`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.1 Reminder\n", "\n", "* Throughout this notebook, `#TODO` is used as a placeholder where you need to fill in with something appropriate. \n", "\n", "* To run a code in a cell, press `Shift+Enter`.\n", "\n", "* Pandas cheatsheet\n", "\n", "* Summary table of the commonly used indexing syntax from our own lesson.\n", "\n", "* Keras API document\n", "\n", "We recommend you open these on separate tabs or print them;\n", "they are handy help for writing your own codes." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.2 Loading Python Libraries\n", "\n", "Next step, we need to import the required libraries into this Jupyter Notebook:\n", "`pandas`, `numpy`,`matplotlib.pyplot`,`sklearn` and `tensorflow`.\n", "\n", "**For Wahab cluster only**: before importing these libraries, we have to load the `DeapSECURE` environment module:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "\n", "The following have been reloaded with a version change:\n", " 1) libjpeg-turbo/2.0.2 => libjpeg-turbo/2.0.3\n", "\n", "\n" ] } ], "source": [ "# Run to load environment modules on HPC\n", "module(\"load\", \"DeapSECURE\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Few additional modules need to be loaded to access the GPU via CUDA and TensorFlow library.\n", "Keras is now part of TensorFlow:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "\n", "\n", "The following have been reloaded with a version change:\n", " 1) py-numpy/1.17.3 => py-numpy/1.16.3\n", "\n", "\n", "\n", "Currently Loaded Modules:\n", " 1) intel-mkl/2019.4.243 22) py-cycler/0.10.0\n", " 2) texlive/2020 23) py-kiwisolver/1.1.0\n", " 3) zlib/1.2.11 24) py-pyparsing/2.4.2\n", " 4) libpng/1.6.37 25) libjpeg-turbo/2.0.3\n", " 5) bzip2/1.0.8 26) py-pillow/6.2.0\n", " 6) freetype/2.10.1 27) py-matplotlib/3.1.1\n", " 7) xz/5.2.4 28) py-scipy/1.3.1\n", " 8) libtiff/4.0.10 29) py-seaborn/0.11.1\n", " 9) openjpeg/2.3.1 30) py-pip/19.3\n", " 10) python/3.7.3 31) py-joblib/0.14.0\n", " 11) py-markupsafe/1.0 32) py-scikit-learn/0.22.2.post1\n", " 12) py-babel/2.6.0 33) gmp/6.1.2\n", " 13) py-jinja2/2.10 34) mpfr/4.0.2\n", " 14) py-six/1.12.0 35) mpc/1.1.0\n", " 15) py-jupyter/1.1.4 36) DeapSECURE/2020\n", " 16) py-python-dateutil/2.8.0 37) cuda/10.2.89\n", " 17) py-pytz/2019.3 38) hdf5/1.10.5\n", " 18) py-numexpr/2.7.0 39) py-numpy/1.16.3\n", " 19) py-bottleneck/1.2.1 40) py-h5py/2.9.0\n", " 20) py-pandas/0.25.1 41) py-tensorflow/1.13.1\n", " 21) py-setuptools/41.4.0\n", "\n", " \n", "\n", "\n" ] } ], "source": [ "module(\"load\", \"cuda\")\n", "module(\"load\", \"py-tensorflow\")\n", "module(\"list\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can import all the required modules into Python:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "import os\n", "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import sklearn\n", "from sklearn import preprocessing\n", "\n", "import tensorflow as tf\n", "import tensorflow.keras as keras\n", "\n", "%matplotlib inline" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "# tools for machine learning:\n", "from sklearn import preprocessing\n", "from sklearn.model_selection import train_test_split\n", "# for evaluating model performance\n", "from sklearn.metrics import accuracy_score, confusion_matrix\n", "# classic machine learning models:\n", "from sklearn.linear_model import LogisticRegression\n", "from sklearn.tree import DecisionTreeClassifier" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "# Import KERAS objects\n", "from tensorflow.keras.models import Sequential\n", "from tensorflow.keras.layers import Dense\n", "from tensorflow.keras import optimizers" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## 2. Loading Sherlock Application Dataset\n", "\n", "First of all, let us review the data preparation step, data wrangling step and machine learning step one by one in this bigger dataset. In the first step above we are actually using two Python scripts: `Prep_ML.py` and `analysis_sherlock_ML.py`\n", "\n", "The script `Prep_ML.py` contains all the steps necessary to read the data, remove useless data, handle missing data, extract the feature matrix and labels, then do the train/dev split. Load the commands contained in this script into your current Jupyter notebook using the IPython’s `%load` magic. Then you can run this function.\n", "\n", "The script `analysis_sherlock_ML.py` is a library of functions, which contains the steps we described in the earlier lesson. These functions are clearly named such as: `preprocess_sherlock_19F17C`, `step0_label_features`, `step_onehot_encoding`, and `step_feature_scaling`. \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Uncomment and run the magic statement `%load Prep_ML.py` below.\n", "(It will replace the cell with the contents of Prep_ML.py.)\n", "You may have to run this cell twice with `Shift+Enter` to actually run the loaded code." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "#%load Prep_ML.py\n", "\"\"\"^^^ Uncomment and run the magic statement above.\n", " You may have to run the cell twice to actually run this cell!\"\"\";" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After the cell above is executed, you will find the training & test data in the following members of `Rec` object:\n", "\n", "* `Rec.df_features`: DataFrame of the features for the machine learning models\n", "* `Rec.labels`: The labels (expected output of the ML models)\n", "* `Rec.train_features` = training data's features\n", "* `Rec.test_features` = testing data's features\n", "* `Rec.train_labels` = training data's labels\n", "* `Rec.test_labels` = testing data's labels\n", "\n", "We use this approach to manage the complexity of having too many variables (e.g. `train_F`, `train_F2`, `train_F3`, ...)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.1 About the SherLock \"18-apps\" Dataset\n", "\n", "This is a more diverse subset of the SherLock Application dataset, covering significantly more applications and features.\n", "\n", "> Your challenge is to train a similar model (like in the previous notebooks) using the \"18-apps\" dataset to correctly classify running apps on the smartphone with very high accuracy (> 99%)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**EXERCISE**\n", "\n", "Take a peek at the training feature DataFrame." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "\"\"\"Take a peek at the training feature DataFrame.\"\"\";\n", "#TODO" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From above, we know that we are working with a significantly larger data file, `sherlock/sherlock_18apps.csv`.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Question:**\n", "\n", "- Please check out the `analysis_sherlock_ML.py` and see how this function defined.\n", "- How many features for each record?\n", "- How many applications in the total dataset?\n", "- How many records in the seperated training and testing dataset?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This dataset has 19 features for each record and 18 applications in total." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## 3. Traditional Machine Learning \n", "\n", "Now, we first try the traditional machine learning algorithms we learn in the previous session. \n", "Here we test on **Decision Tree** and **Logistic Regression**. \n", "To simplify the code, we will use the `model_evaluate` function to evaluate the performance of a machine learning model (whether traditional ML or neural network model)." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "def model_evaluate(model,test_F,test_L):\n", " test_L_pred = model.predict(test_F)\n", " print(\"Evaluation by using model:\",type(model).__name__)\n", " print(\"accuracy_score:\",accuracy_score(test_L, test_L_pred))\n", " print(\"confusion_matrix:\",\"\\n\",confusion_matrix(test_L, test_L_pred))\n", " return" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 1.29 s, sys: 16.4 ms, total: 1.3 s\n", "Wall time: 1.31 s\n" ] }, { "data": { "text/plain": [ "DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='entropy',\n", " max_depth=6, max_features=None, max_leaf_nodes=None,\n", " min_impurity_decrease=0.0, min_impurity_split=None,\n", " min_samples_leaf=1, min_samples_split=8,\n", " min_weight_fraction_leaf=0.0, presort='deprecated',\n", " random_state=None, splitter='best')" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ML_dtc = DecisionTreeClassifier(criterion='entropy',\n", " max_depth=6,\n", " min_samples_split=8)\n", "%time ML_dtc.fit(Rec.train_features, Rec.train_labels)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Evaluation by using model: DecisionTreeClassifier\n", "accuracy_score: 0.9497216932766954\n", "confusion_matrix: \n", " [[ 1829 1 0 0 0 0 0 0 0 0 0 18 0 0 0 0 1 0]\n", " [ 0 5477 0 0 0 69 0 0 0 0 0 0 0 5 0 0 2 0]\n", " [ 1 610 2753 0 0 25 0 5 0 1 1 1 0 2 0 0 0 0]\n", " [ 0 0 0 4029 0 0 15 0 0 0 0 0 0 0 0 0 0 10]\n", " [ 0 0 0 0 4006 0 0 0 0 0 0 0 0 0 0 0 0 0]\n", " [ 64 28 0 0 0 3183 1 0 0 0 0 1 0 49 0 0 0 0]\n", " [ 0 143 0 0 0 2 10459 0 0 0 15 0 0 0 0 0 1369 0]\n", " [ 0 58 0 0 0 24 4 1408 0 1 0 0 0 1 0 0 11 0]\n", " [ 3 39 0 0 0 0 1 0 935 0 0 0 0 0 1 0 4 0]\n", " [ 0 0 0 0 0 0 1 0 0 486 0 0 0 8 0 0 0 0]\n", " [ 0 0 0 0 0 0 0 0 0 0 4016 0 0 0 0 0 0 0]\n", " [ 0 0 0 0 0 0 0 0 0 0 0 1697 0 0 0 0 0 0]\n", " [ 0 13 0 0 4 0 0 0 0 0 0 0 680 1 0 0 0 0]\n", " [ 0 0 0 0 0 0 0 0 0 0 0 6 0 3473 0 0 0 0]\n", " [ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1003 0 0 0]\n", " [ 0 0 0 0 3 0 0 0 0 0 0 0 0 0 0 1642 0 0]\n", " [ 0 4 0 0 0 0 4 0 0 0 0 0 0 0 0 0 3897 0]\n", " [ 0 0 0 0 0 116 0 0 0 0 0 0 0 0 0 0 0 897]]\n" ] } ], "source": [ "model_evaluate(ML_dtc, Rec.test_features, Rec.test_labels)" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/shared/apps/auto/py-scikit-learn/0.22.2.post1-gcc-7.3.0-wpia/lib/python3.7/site-packages/sklearn/linear_model/_logistic.py:940: ConvergenceWarning: lbfgs failed to converge (status=1):\n", "STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.\n", "\n", "Increase the number of iterations (max_iter) or scale the data as shown in:\n", " https://scikit-learn.org/stable/modules/preprocessing.html\n", "Please also refer to the documentation for alternative solver options:\n", " https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression\n", " extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG)\n" ] }, { "data": { "text/plain": [ "LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,\n", " intercept_scaling=1, l1_ratio=None, max_iter=100,\n", " multi_class='auto', n_jobs=None, penalty='l2',\n", " random_state=None, solver='lbfgs', tol=0.0001, verbose=0,\n", " warm_start=False)" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ML_log = LogisticRegression(solver='lbfgs')\n", "%time ML_log.fit(Rec.train_features, Rec.train_labels)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Evaluation by using model: LogisticRegression\n", "accuracy_score: 0.9197854108686099\n", "confusion_matrix: \n", " [[ 1387 3 63 0 0 319 0 0 0 0 0 0 0 72 5 0 0 0]\n", " [ 0 4590 390 0 0 37 77 10 6 64 0 0 0 72 31 273 3 0]\n", " [ 60 271 2817 0 0 7 13 4 0 0 0 0 0 85 141 1 0 0]\n", " [ 0 1 0 4021 0 2 11 4 0 0 5 0 0 0 0 2 0 8]\n", " [ 0 0 0 0 3999 0 0 0 0 7 0 0 0 0 0 0 0 0]\n", " [ 47 39 14 0 0 3189 24 10 1 0 0 0 0 2 0 0 0 0]\n", " [ 7 93 0 51 0 19 11628 8 0 0 29 58 0 0 0 0 93 2]\n", " [ 0 28 0 2 0 33 1 1442 0 0 0 1 0 0 0 0 0 0]\n", " [ 147 27 673 0 0 1 0 0 113 0 0 0 0 4 7 0 11 0]\n", " [ 0 0 0 0 0 0 0 0 0 433 0 0 0 24 0 38 0 0]\n", " [ 0 0 0 0 0 0 0 0 0 0 4016 0 0 0 0 0 0 0]\n", " [ 0 0 0 0 0 0 3 0 0 0 0 1642 0 52 0 0 0 0]\n", " [ 0 1 0 0 0 0 0 1 0 4 0 0 692 0 0 0 0 0]\n", " [ 17 2 239 0 0 0 0 0 17 31 0 0 0 3080 80 13 0 0]\n", " [ 99 5 172 0 0 45 0 0 7 0 0 0 0 2 673 0 0 0]\n", " [ 0 3 0 2 0 0 0 0 0 0 0 0 0 6 0 1634 0 0]\n", " [ 0 0 0 0 0 0 33 0 0 0 0 0 0 0 0 0 3872 0]\n", " [ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 6 0 0 1007]]\n" ] } ], "source": [ "model_evaluate(ML_log, Rec.test_features, Rec.test_labels)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**QUESTIONS**:\n", "\n", "* Do you notice issues with the training process of any of the models above?\n", "* (Optional) Can you find a way to ensure full convergence of the training?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By now, we have a pretty good background knowledge about this dataset.\n", "And we know the accuracy scores we can get by using the Decision Tree and Logistic Regression methods,\n", "which are reasonably good, but not close to 99%." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Timing the Computation\n", "\n", "Do you notice that the training of logistic regression model takes a while?\n", "Often we want to know *how long* this actually takes place.\n", "We can get this timing easily in Jupyter by prepending `%time` to the Python statement we'd like to measure the execution time.\n", "\n", "**EXERCISE**:\n", "If you haven't already, let's retrain the logistic regression model here and get the timing:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> #### About the Warning Message\n", ">\n", "> The training phase stops with an error:\n", ">\n", "> ```\n", "> ConvergenceWarning: lbfgs failed to converge (status=1):\n", "> STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.\n", "> ```\n", ">\n", "> This happens because the solver fails to reach convergence after the maximum number of iteration (default=100) is reached.\n", "> You may want to investigate by trying different solvers in the `LogisticRegression` object.\n", "> Please Scikit-learn documentation on [Logistic Regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html), the `solver` argument, if you are interested.\n", "> Our internal test showed that with another solver" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## 4. Building Neural Networks to Classify Applications\n", "\n", "Let us now proceed by building some neural network models to classify smartphone apps." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4.1 One-Hot Encoding\n", "\n", "When using neural networks to do a classification task, we need to encode the labels using **one-hot encoding**.\n", "This is necessary because many machine learning algorithms require numeric labels due to implementation efficiency, as such, any categorical data must be converted to numerical data.\n", "\n", "\n", "For more information on why we need one-hot encoding, see these articles:\n", "\n", "* https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/\n", "* https://machinelearningmastery.com/one-hot-encoding-for-categorical-data/\n", "\n", "Comment: We did not have to do one-hot in scikit-learn, because the ML objects such as `DecisionTreeClassifier` does it for us behind the scene.\n", "\n", "Similarly, any input features that are of categorical data type will also have to be encoded using either integer encoding or one-hot encoding.\n" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "Rec.train_L_onehot = pd.get_dummies(Rec.train_labels)\n", "Rec.test_L_onehot = pd.get_dummies(Rec.test_labels)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For one-hot encoding, there is a `1` in a distinct spot for every category and `0` everywhere else.\n", "Below shows the first five rows; notice that there is only a single `1` in each row, with the rest being `0`." ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " | Calendar | \n", "Chrome | \n", "ES File Explorer | \n", "Geo News | \n", "Gmail | \n", "Google App | \n", "Hangouts | \n", "Maps | \n", "Messages | \n", "Messenger | \n", "Moovit | \n", "Moriarty | \n", "Photos | \n", "Skype | \n", "Waze | \n", "YouTube | \n", "||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
247525 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "1 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "
93942 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "1 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "
22691 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "1 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "
202123 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "1 | \n", "
230029 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "1 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "