{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "**DeapSECURE module 3: Machine Learning**\n", "\n", "# Session 3: Tuning the Machine Learning Model\n", "\n", "Welcome to the DeapSECURE online training program!\n", "This is a Jupyter notebook for the hands-on learning activities of the\n", "[\"Machine Learning\" module](https://deapsecure.gitlab.io/deapsecure-lesson03-ml/), Episode 6: [\"Tuning the Machine Learning Model\"](https://deapsecure.gitlab.io/deapsecure-lesson03-ml/40-tuning/index.html) (*new episode to be written, as of 2021--stay tuned!*).\n", "Please visit the [DeapSECURE](https://deapsecure.gitlab.io/) website to learn more about our training program.\n", "\n", "In this session, we will use this notebook to learn how to optimize the predictive performance a model to classify the running applications based on their resource usage signatures.\n", "\n", "\n", "**Quick Links** (sections of this notebook):\n", "\n", "* 1 [Setup](#sec-Setup)\n", "* 2 [Preprocessing Sherlock Dataset](#sec-Load_data)\n", "* 3 [Feature Selection](#sec-Feature_selection)\n", "* 4 [Better Validation in the Training Phase](sec-Better_validation)\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## 1. Setup Instructions\n", "\n", "If you are opening this notebook from Wahab cluster's OnDemand interface, you're all set.\n", "\n", "If you see this notebook elsewhere and want to perform the exercises on Wahab cluster, please follow the steps outlined in our setup procedure.\n", "\n", "1. Make sure you have activated your HPC service.\n", "2. Point your web browser to https://ondemand.wahab.hpc.odu.edu/ and sign in with your MIDAS ID and password.\n", "3. Create a new Jupyter session using \"legacy\" Python suite, then create a new \"Python3\" notebook. (See ODU HPC wiki for more detailed help.)\n", "\n", "4. Get the necessary files using commands below within Jupyter:\n", "\n", " mkdir -p ~/CItraining/module-ml\n", " cp -pr /shared/DeapSECURE/module-ml/. ~/CItraining/module-ml\n", " cd ~/CItraining/module-ml\n", "\n", "The file name of this notebook is `ML-session-3.ipynb`.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.1 Reminder\n", "\n", "* Throughout this notebook, `#TODO` is used as a placeholder where you need to fill in with something appropriate. \n", "* To run a code in a cell, press `Shift+Enter`.\n", "* Use `ls` to view the contents of a directory.\n", "\n", "* Pandas cheatsheet\n", "\n", "* Summary table of the commonly used indexing syntax from our own lesson.\n", "\n", "* Scikit-learn Machine Learning in Python package" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.2 Loading Python Libraries\n", "\n", "Next step, we need to import the required libraries into this Jupyter Notebook:\n", "`pandas`, `matplotlib.pyplot` and `seaborn`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**For Wahab cluster only**: before importing these libraries, we need to load the `DeapSECURE` *environment modules*:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "module(\"load\", \"DeapSECURE\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can import the requisite Python libraries, most notably:\n", "_pandas_, NumPy, Matplotlib, Seaborn, and Scikit-learn." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Import the necessary Python modules\"\"\";\n", "\n", "import os\n", "import sys\n", "import pandas\n", "import numpy\n", "import seaborn\n", "from matplotlib import pyplot\n", "import sklearn\n", "\n", "# also add more tools:\n", "from sklearn import preprocessing\n", "from sklearn.model_selection import train_test_split\n", "# machine learning models:\n", "from sklearn.linear_model import LogisticRegression\n", "from sklearn.tree import DecisionTreeClassifier\n", "# for evaluating model performance\n", "from sklearn.metrics import accuracy_score, confusion_matrix\n", "\n", "%matplotlib inline" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Some advanced learners may like to use shortcuts,\n", "# so we give them here:\n", "pd = pandas\n", "np = numpy\n", "plt = pyplot\n", "sns = seaborn" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We also copy some functions we defined in the previous notebook:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def model_evaluate(model,test_F,test_L):\n", " test_L_pred = model.predict(test_F)\n", " print(\"Evaluation by using model:\",type(model).__name__)\n", " print(\"accuracy_score:\",accuracy_score(test_L, test_L_pred))\n", " print(\"confusion_matrix:\",\"\\n\",confusion_matrix(test_L, test_L_pred))\n", " return" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## 2. Preprocessing Sherlock Dataset\n", "\n", "First, we load and preprocess the SherLock \"2-apps\" dataset as we did in the previous notebook.\n", "Instead of doing them cell-by-cell, let's put all the steps into one cell and execute them in one shot:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df2 = pandas.read_csv('sherlock/sherlock_mystery_2apps.csv')\n", "\n", "# Remove irrelevant feature(s)\n", "df2.drop('Unnamed: 0', axis=1, inplace=True)\n", "\n", "# Remove rows with missing values\n", "df2.dropna(inplace=True)\n", "\n", "# Remove duplicate features\n", "df2.drop('Mem', axis=1, inplace=True)\n", "\n", "# Separate labels from features\n", "df2_labels = df2['ApplicationName']\n", "df2_features = df2.drop('ApplicationName', axis=1)\n", "\n", "# Feature scaling\n", "scaler = preprocessing.StandardScaler()\n", "scaler.fit(df2_features)\n", "df2_features_n = pandas.DataFrame(scaler.transform(df2_features),\n", " columns=df2_features.columns,\n", " index=df2_features.index)\n", "print(\"Normalized features:\")\n", "df2_features_n.head(10)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create a backup\n", "df2_features_n_backup = df2_features_n.copy()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> *HINT*: If you did not finish notebook 2, then the `sherlock_features.csv` file did not exist yet.\n", "> In that case, please use `solutions/sherlock_features.csv`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\"Features:\")\n", "print(df2_features.head(10))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## 3. Feature Selection\n", "\n", "In the previous notebook (`ML-session-2.ipynb`) we have discovered that the performance a machine learning model may be strongly affected by the choices of the features. 
"Even a model that can perform very well may perform poorly when an inappropriate set of features is used for learning.\n", "\n", "In a machine learning project, generally speaking, we want to start with a handful of features (2-4) that have the most predictive power. These are the features that have the strongest influence on the model’s output. How do we select such features?\n", "We need a way to *reason* about why certain columns can be dropped first, so that our model is as compact as possible.\n", "In this section, we will build some ways to reason about the selection of features." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, let's review the existing features in the preprocessed \"2-apps\" SherLock dataset:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df2_features_n.columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are **11** features." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, we want to find features that are very similar, then drop the (near-)duplicate features.\n", "We will use two complementary means to detect such duplicates:\n", "\n", "- Histograms\n", "- Correlation plot" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.1 Histograms\n", "\n", "A histogram is a visualization of the distribution of values in a feature.\n", "Let’s make a panel of histograms for all the normalized features; this makes it easy to spot features that may be duplicates of one another:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# plt stands for matplotlib.pyplot\n", "plt.figure(figsize=(10.0, 8.0))\n", "for (i, col) in enumerate(df2_features_n.columns):\n", "    # Creates a 4-row by 3-column plot matrix\n", "    plt.subplot(4,3,i+1)\n", "    plt.hist(df2_features_n[col], bins=50)\n", "    plt.title(col)\n", "\n", "plt.subplots_adjust(top=0.92, bottom=0.08, left=0.10, right=0.95, hspace=0.75,\n", "                    wspace=0.35)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Visualizing histograms of multiple features in panel form is a powerful way to detect features that are identical or very similar." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**QUESTION:**\n", "From the histograms above, can you spot features that appear to be identical or very similar?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**EXERCISE**:\n", "Repeat the histogram panel above, but color the histograms differently for each category (`ApplicationName`) to verify the identical features.\n", "(A seaborn-based shortcut is sketched right below; the cells after it build the panel with Matplotlib.)"
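] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> *ASIDE*: seaborn can draw per-category histograms directly via its `hue` argument, adding a color legend automatically. This is a minimal sketch for a single feature, assuming seaborn >= 0.11 (for `histplot`); `df_tmp` is a hypothetical throwaway frame that recombines the normalized features with the labels." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"OPTIONAL: per-app histogram of one feature using seaborn's hue argument\"\"\";\n", "# Recombine features and labels into one frame for seaborn\n", "df_tmp = df2_features_n.copy()\n", "df_tmp['ApplicationName'] = df2_labels\n", "# Overlaid histograms, one color per app, with an automatic legend\n", "seaborn.histplot(data=df_tmp, x='CPU_USAGE', hue='ApplicationName', bins=50)\n", "pyplot.show()"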
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df2_labels.unique()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Separate the rows in the feature matrix based on the associated app names\"\"\";\n", "Apps = df2_labels.unique()\n", "indx_app = {}\n", "features_app = {}\n", "# The first loop filters the rows by the app names\n", "# using the df2_labels\n", "for app in Apps:\n", " print(\"\\nApp:\", app)\n", " indx_app[app] = df2_labels[df2_labels == app].index\n", " print(\"Index:\")\n", " print(indx_app[app][:5])\n", " features_app[app] = df2_features_n.loc[indx_app[app]]\n", " print(\"Features:\")\n", " print(features_app[app].head(5))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Draw the multi-app histogram panel\"\"\";\n", "pyplot.figure(figsize=(12.0, 9.0))\n", "for (i, col) in enumerate(df2_features_n.columns):\n", " # Creates a 4 row by 3 cols plot matrix\n", " pyplot.subplot(4,3,i+1)\n", " for app in Apps:\n", " pyplot.hist(features_app[app][col], bins=50)\n", " pyplot.title(col)\n", "\n", "pyplot.subplots_adjust(top=0.92, bottom=0.08, left=0.10, right=0.95, hspace=0.75,\n", " wspace=0.35)\n", "pyplot.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**QUESTIONS**:\n", "\n", "* From this second graph, further confirm that there are *two features* are identical.\n", "\n", "* If you inspect the raw (unnormalized) values are these two features identical?\n", " This shows the value of normalizing the features--it further exposes duplicate features that may be masked by a multiplicative factor." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.2 Correlation\n", "\n", "At this time, we may want to do further feature selection from the correlation between each feature pairs. Feature pairs that are highly correlated can be deemed as duplicate features, thus we can delete one of each pair. The pair correlations can be computed using the `DataFrame.corr()` method." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df2_corr = df2_features_n.corr()\n", "df2_corr" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `.corr()` method returns a matrix of correlation between feature pairs.\n", "The maximum value is 1 (perfectly correlated, i.e. identical), whereas the minimum value is -1 (perfectly anti-correlated).\n", "For a pair with negative correlation, it means that the increase in one feature leads to the decrease in the other." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can use a *heatmap* to visualize the correlation matrix above and find the highly-correlated feature pair(s) by using the `seaborn.heatmap()` function. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pyplot.figure(figsize=(10.0,10.0))\n", "seaborn.heatmap(df2_corr, annot=True, vmax=1, square=True, cmap=\"Blues\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**QUESTION**: From the matrix or heatmap above, please\n", "\n", "* Identify three pairs whose correlation values are the highest (close to +1 or -1);\n", "* Identify additional pairs whose correlation values are beyond 0.5.\n", "\n", "Compare your observation with the similar features discovered by the histogram panel earlier!\n", "Are they the same pairs?" 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*--> (Enter your responses here) <--*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Based on our discussion above, we can definitely delete `vsize`, `queue` and `guest_time` because of their very high correlations with other three features:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df2_features_n.drop(['vsize', 'queue', 'guest_time'], axis=1, inplace=True)\n", "print(df2_features_n.columns)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Eight features remaining!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next pairs that can be considered for dropping would be:\n", "* (`otherPrivateDirty`, `utime`)\n", "* (`cutime`, `cminflt`)\n", "\n", "The first pair also shows similarity in the histogram visuals (see earlier plot).\n", "We can drop `utime` and `cminflt` because of their marked correlations with the other two." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df2_features_n.drop(['utime', 'cminflt'], axis=1, inplace=True)\n", "print(df2_features_n.columns)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.3 Simple Group Analysis\n", "\n", "At this point, we have reduced our feature set to just six for the two applications (\"WhatsApp\" and \"Facebook\").\n", "The next thing we can consider is the distribution of each feature grouped by the application category.\n", "When two features are similar, we may argue that the similarity will be reflected in the value distributions.\n", "Histograms can help uncover some similarities, but descriptive statistics provide a complementary way.\n", "This can be achieved by employing the `.groupby()` method before computing the descriptive statistics.\n", "\n", "We recombine the label temporarily to do this group analysis:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df2_with_label = df2_features_n.copy()\n", "df2_with_label['ApplicationName'] = df2_labels\n", "df2_with_label.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's get the feature values for each app by `.groupby()`, get the information of each feature from same app." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df2_with_label.groupby('ApplicationName')['CPU_USAGE'].describe()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df2_with_label.groupby('ApplicationName')['lru'].describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**QUESTION**: Observe how similar or dissimilar are the statistical quantities (mean, standard deviation, as well as the quartiles)\n", "\n", "1. Do the means of `CPU_USAGE` (for the different applications) overlap within their standard deviations?\n", "2. What about `lru`?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Compare the descriptive statistics of other features as well...\"\"\";\n", "#TODO" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**DECISION**:\n", "After some explorations, we found that the averages of `CPU_USAGE` and `lru` for the two different apps are much closer to each other, compared to the others.\n", "Thus let us remove these two features." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df2_features_n.drop(['CPU_USAGE','lru'],axis=1,inplace=True)\n", "df2_features_n.head(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.4 Feature Selection Summary\n", "\n", "We now have the four features we want: `cutime`, `num_threads`, `otherPrivateDirty`, `priority`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Save this featureset in a new variable:\n", "df2_features_n1 = df2_features_n_backup[['cutime', 'num_threads', 'otherPrivateDirty', 'priority']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Save these featuers into a file for further usage." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "labels_save = df2_labels.replace(['Facebook', 'WhatsApp'], [0, 1])\n", "labels_save.to_csv('sherlock_2apps_labels.csv',header=True,index=False)\n", "\n", "df2_features_n1.to_csv('sherlock_2apps_features.csv',index=False)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "labels_save.head(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.5 Training and Validating Machine Learning Model\n", "\n", "**EXERCISES**:\n", "Now do the same procedure as elaborated in the previous notebook to train the machine learning models (linear regression and decision tree) to train and validate them based on the newly selected features.\n", "Record these accuracy scores and the necessary details (such as the list of features, tweaked hyperparameters) on your notebook/spreadsheet." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Train and validate the LogisticRegression model wih the new feature set\"\"\";\n", "\n", "#train_F1, test_F1, train_L1, test_L1 = train_test_split(#TODO)\n", "model_lr1 = LogisticRegression(solver='lbfgs')\n", "#...TODO" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**QUESTIONS**:\n", "\n", "* Compare the Performance of the two trained models\n", "\n", "* Discuss which model may be better for our dataset and think about the possible reasons.\n", "\n", "* Have we achieved the maximum accuracy of the methods that we see at the previous notebook (`ML-session-2.ipynb`)?\n", " Why--or why not?\n", " \n", "**The last question is very important to ponder.**\n", "If the current featureset is indeed a perfect reduced set of features, then the accuracy should be pretty close to the maximum possible accuracy.\n", "Otherwise there is still something amiss!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df2_features_n_backup.columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## 4. Better Validation in the Training Phase\n", "\n", "In the previous ML modeling, we only use the training dataset to train the model.\n", "The evaluation of a model's performance should not rely on the training dataset, otherwise it would result in a biased performance score.\n", "We have held out a portion of the data as test dataset for validation purposes to give an unbiased estimate of the performance.\n", "One problem is that we do not know the uncertainty of this performance score (e.g. 
accuracy score).\n", "\n", "Here we introduce the *k-fold cross-validation* approach.\n", "In k-fold cross-validation, the data is divided into *k* folds.\n", "The model is trained on k-1 folds, with one fold held back for testing.\n", "This process is repeated k times so that each fold of the dataset gets the chance to be the \"test\" set.\n", "Once the process is completed, we can summarize the evaluation metric using the mean of the k scores and quantify its uncertainty using their standard deviation.\n", "\n", "> *NOTE*: The cell below assumes that `train_F1` and `train_L1` were defined in the `train_test_split` exercise of section 3.5." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn import model_selection\n", "\n", "kfold = model_selection.KFold(n_splits=10)\n", "model_kfold = LogisticRegression(solver='lbfgs')\n", "results_kfold = model_selection.cross_val_score(model_kfold, train_F1, train_L1, cv=kfold)\n", "# Report the mean accuracy and its spread across the 10 folds\n", "print(\"Accuracy: %.2f%% (+/- %.2f%%)\" % (results_kfold.mean()*100.0, results_kfold.std()*100.0))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "results_kfold" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This result is consistent with the previous `train_test_split` approach." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 4 }