{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "**DeapSECURE module 3: Machine Learning**\n", "\n", "# Session 2: Data Preprocessing for Machine Learning\n", "\n", "Welcome to the DeapSECURE online training program!\n", "This is a Jupyter notebook for the hands-on learning activities of the\n", "[\"Machine Learning\" module](https://deapsecure.gitlab.io/deapsecure-lesson03-ml/), Episodes 4 and 5: [\"Data Preprocessing for Machine Learning\"](https://deapsecure.gitlab.io/deapsecure-lesson03-ml/20-preprocessing/index.html), [\"Machine Learning for Smartphone Application Classification\"](https://deapsecure.gitlab.io/deapsecure-lesson03-ml/30-learning/index.html).\n", "\n", "Please visit the [DeapSECURE](https://deapsecure.gitlab.io/) website to learn more about our training program.\n", "\n", "In this session, we will use this notebook to perform data preparation & initial experiment with machine learning, so that learners will see the complete steps of a machine learning workflow.\n", "We will build upon the skill and insight already acquired in the [\"Big Data\"](https://deapsecure.gitlab.io/deapsecure-lesson02-bd/) module.\n", "\n", "\n", "\n", "**Quick Links** (sections of this notebook):\n", "\n", "* 1 [Setup](#sec-Setup)\n", "* 2 [Loading Sherlock dataset](#sec-Load_data)\n", "* 3 [Data Preprocessing](#sec-Data_preprocess)\n", "* 4 [Initial Machine Learning Experiments](#sec-Exp_Machine_Learning)\n", "* 5 [Summary](#sec-Summary)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## 1. Setup Instructions\n", "\n", "If you are opening this notebook from Wahab cluster's OnDemand interface, you're all set.\n", "\n", "If you see this notebook elsewhere and want to perform the exercises on Wahab cluster, please follow the steps outlined in our setup procedure.\n", "\n", "1. Make sure you have activated your HPC service.\n", "2. 
Point your web browser to https://ondemand.wahab.hpc.odu.edu/ and sign in with your MIDAS ID and password.\n", "3. Create a new Jupyter session using the \"legacy\" Python suite, then create a new \"Python3\" notebook. (See the ODU HPC wiki for more detailed help.)\n", "\n", "4. Get the necessary files using the commands below within Jupyter:\n", "\n", " mkdir -p ~/CItraining/module-ml\n", " cp -pr /shared/DeapSECURE/module-ml/. ~/CItraining/module-ml\n", " cd ~/CItraining/module-ml\n", "\n", "The file name of this notebook is `ML-session-2.ipynb`.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.1 Reminder\n", "\n", "* Throughout this notebook, `#TODO` is used as a placeholder where you need to fill in something appropriate. \n", "* To run the code in a cell, press `Shift+Enter`.\n", "* Use `ls` to view the contents of a directory.\n", "\n", "* Pandas cheatsheet\n", "\n", "* Summary table of the commonly used indexing syntax from our own lesson.\n", "\n", "* If you have not done so, we recommend that you review the [Data Wrangling and Visualization](https://deapsecure.gitlab.io/deapsecure-lesson02-bd/30-data-wrangling-viz/index.html) episode of our Big Data lesson module." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.2 Loading Python Libraries\n", "\n", "As the next step, we need to import the required libraries into this Jupyter notebook:\n", "`pandas`, `matplotlib.pyplot` and `seaborn`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**For Wahab cluster only**: before importing these libraries, we need to load the `DeapSECURE` *environment module*:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "module(\"load\", \"DeapSECURE\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can import the requisite Python libraries, most notably:\n", "_pandas_, NumPy, Matplotlib, Seaborn, and Scikit-learn." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Import the necessary Python modules\"\"\";\n", "\n", "import os\n", "import sys\n", "import pandas\n", "import numpy\n", "import seaborn\n", "from matplotlib import pyplot\n", "import sklearn\n", "\n", "# also add more tools:\n", "from sklearn import preprocessing\n", "from sklearn.model_selection import train_test_split\n", "# machine learning models:\n", "from sklearn.linear_model import LogisticRegression\n", "from sklearn.tree import DecisionTreeClassifier\n", "# for evaluating model performance\n", "from sklearn.metrics import accuracy_score, confusion_matrix\n", "\n", "%matplotlib inline" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Some advanced learners may like to use shortcuts,\n", "# so we give them here:\n", "pd = pandas\n", "np = numpy\n", "plt = pyplot\n", "sns = seaborn" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## 2. Loading Sherlock Dataset\n", "\n", "Let us load Sherlock's \"Applications\" dataset into a DataFrame for analysis and machine learning.\n", "The dataset contains measurements of resource utilization from two applications on a smartphone, namely Facebook and WhatsApp.\n", "This is the same dataset used in the *Data Wrangling & Visualization* notebook of DeapSECURE's Big Data lesson, where we familiarized ourselves with this dataset and identified the necessary steps to clean the data.\n", "In the present notebook, we will continue the data preprocessing to make it ready for machine learning." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df2 = pd.read_csv('sherlock/sherlock_mystery_2apps.csv')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df2.head(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Dataset Features\n", "\n", "The `sherlock_mystery_2apps.csv` file actually contains a small subset of a much larger `Application.csv` data file.\n", "There are 14 columns in this subset:\n", "\n", "* `Unnamed: 0` [int]: Record index.\n", "\n", "* `ApplicationName` [str]: Name of the application.\n", "\n", "* `CPU_USAGE` [float]: CPU utilization (100% = completely busy CPU).\n", "\n", "* `cutime` [int]: CPU \"user time\" spent by the spawned (child) processes.\n", "\n", "* `lru` [int]: \"Least Recently Used\"; this is a parameter of the Android application memory management.\n", "\n", "* `num_threads` [int]: Number of threads in this process.\n", "\n", "* `otherPrivateDirty` [int]: The private dirty pages used by everything else\n", " other than Dalvik heap and native heap.\n", "\n", "* `priority` [int]: Process's scheduling priority. \n", "\n", "* `utime` [int]: Measured CPU \"user time\".\n", "\n", "* `vsize` [int]: The size of the virtual memory, in bytes.\n", "\n", "* `cminflt` [int]: Count of minor faults made by the process's child processes.\n", "\n", "* `guest_time` [int]: Running time of \"virtual CPU\".\n", "\n", "* `Mem` [int]: Size of memory, in bytes.\n", "\n", "* `queue` [int]: The waiting order (priority)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## 3. 
Data Preprocessing\n", "\n", "Up to two-thirds of the time of data analysis is spent on **data preparation**, in order to achieve a clean, consistent, and processable state of data.\n", "Data preparation is absolutely crucial to obtaining trustworthy insight from the data.\n", "We have covered this topic in great detail in the lesson on [Data Wrangling and Visualization](https://deapsecure.gitlab.io/deapsecure-lesson02-bd/30-data-wrangling-viz/index.html).\n", "(See also the corresponding notebook, `BigData-session-3.ipynb`.)\n", "\n", "> #### IMPORTANT!\n", "> You must do all the **EXERCISE**s in this section (do not skip any of them), so that you obtain the clean dataset.\n", "\n", "### 3.1 Known Issues in `sherlock_mystery_2apps.csv`\n", "\n", "From the *Data Wrangling & Visualization* notebook, we identified the following issues with the raw data:\n", "\n", "* irrelevant data (column: `Unnamed: 0`),\n", "* missing data (about 22% of the values in the `cminflt` column are undefined),\n", "* a duplicate feature (`Mem`, a duplicate of `vsize`).\n", "\n", "We also identified the necessary course of action to address these defects.\n", "In this notebook, we will simply execute the necessary steps in order to prepare, or *preprocess*, the data for machine learning." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.2 Removing Irrelevant Features" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Dropping `Unnamed: 0` Column\n", "\n", "**EXERCISE**: Remove the `Unnamed: 0` column from `df2` because it is irrelevant for our analysis." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Drop the `Unnamed: 0` column from df2\"\"\";\n", "#df2.drop(#TODO, inplace=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.3 Dealing with Missing Data\n", "\n", "#### Removing Missing Data from `cminflt`\n", "\n", "Missing data can have several causes. 
We can use the `.isna().sum()` operation to identify features with missing data and how many values are missing:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df2.isna().sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**OPTIONAL QUESTION**: What fraction of the data is missing in that one column?\n", "\n", "*Hint*: One way is to use `df2[COLUMN_NAME].size` to get the total number of rows." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Compute the fraction of missing data to the total number of rows in the `cminflt` column\"\"\";\n", "#TODO" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df2['cminflt'].isna().sum() / df2['cminflt'].size" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**OPTIONAL**: It is interesting to plot *where* the `cminflt` is missing, using the following trick." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df2['cminflt'].isna().astype(int).plot()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**DECISION:** we decided to *drop the rows where the `cminflt` values are missing*.\n", "Reason: The number of records missing data in `cminflt` is large,\n", "but we also have a lot of data to begin with (nearly 800k rows in the raw dataset)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**EXERCISE**:\n", "Use DataFrame's `dropna()` method to remove rows that have missing values.\n", "Perform the operation in-place.\n", "Then verify that the new `df2` no longer has any missing data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Remove rows with missing values from df2\"\"\";\n", "#TODO" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "No more missing values should remain." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.4 Removing Duplicate Features\n", "\n", "Now let's remove `Mem`, `guest_time` and `queue` from the dataset because they are duplicates of the other features, or have a very strong linear correlation to those other features." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Drop the following columns from the DataFrame: Mem, guest_time, queue\"\"\";\n", "#df2.drop(#TODO)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df2.drop(['Mem'], axis=1, inplace=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**EXERCISE**: Please verify that the unwanted columns above have been removed from `df2` now!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df2.columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.5 Separating Labels from Features\n", "\n", "Quite frequently, labels (output values) come in the same dataframe as the features.\n", "In this case, we need to separate the label column(s) from the input features.\n", "In our dataset, the `ApplicationName` column contains the labels for classification machine learning. \n", "Let’s extract that into `df2_labels`, whereas the features go to `df2_features`." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df2_labels = df2['ApplicationName']\n", "df2_features = df2.drop('ApplicationName', axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> **Note**: We do *NOT* drop the `ApplicationName` column in-place, so that `df2` remains as a backup of the cleaned data.\n", "> Therefore we assign the feature-only dataframe to a new variable called `df2_features`.\n", "\n", "Let's inspect the cleaned data (features & labels):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Print first few rows of the labels and features.\n", " You can also inspect the descriptive statistics of the features.\"\"\";\n", "print(\"Labels:\")\n", "#TODO\n", "print(\"Features:\")\n", "#TODO\n", "print(\"Statistics and range of values:\")\n", "#TODO" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> If a dataset contains categorical features, these features will need to be converted to numbers using schemes such as integer encoding or one-hot encoding.\n", "> Consult the [Data Preprocessing](https://deapsecure.gitlab.io/deapsecure-lesson03-ml/20-preprocessing/index.html#encoding-categorical) episode to learn more.\n", "\n", "At this point, the `df2_features` object contains only numerical values.\n", "This condition is a prerequisite for using the dataset for machine learning.\n", "The feature-only part of the data is often referred to as the *feature matrix*,\n", "because it is in a matrix form by now.\n", "There are two more steps required before we actually perform the training step in machine learning:\n", "feature scaling and train-test split." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.6 Feature Scaling (Data Normalization)\n", "\n", "Many ML algorithms work best when the typical values of the features are of the same order of magnitude.\n", "The range of a feature is typically the difference between the minimum and maximum values in that feature.\n", "Because in general each feature has its own value range, *feature scaling* is necessary to bring all the features into a similar order of magnitude.\n", "\n", "Scikit-learn contains a number of scalers that can be used, each tailored for certain conditions in the dataset.\n", "We will use the *standard scaler*, which normalizes the features according to their respective means and standard deviations.\n", "(This is usually a reasonable starting point; you may want to try out other scalers, see [Scikit-learn's document on data preprocessing](https://scikit-learn.org/stable/modules/preprocessing.html).)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "scaler = preprocessing.StandardScaler()\n", "scaler.fit(df2_features)\n", "df2_features_n = pd.DataFrame(scaler.transform(df2_features),\n", "                              columns=df2_features.columns,\n", "                              index=df2_features.index)\n", "df2_features_n.head(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The normalized features are stored in a new variable, `df2_features_n`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## 4. 
Initial Machine Learning Experiments\n", "\n", "Our data is now ready and we can try out some machine learning models!\n", "Let us now do our first experiment with machine learning.\n", "Remember our goal?\n", "\n", "> We want to build a machine learning model to predict the name of the running app on the phone, given the observation of the behavior of the app.\n", "> Thus we want to do an *application classification task*, given their measured usage of `CPU_USAGE`, `cutime`, `num_threads`, etc.\n", "\n", "**The main goal of this section is to guide you through all the steps necessary to train and assess the quality of the machine learning model.**\n", "There are many models that we can try out, but all of them follow the same set of steps.\n", "\n", "One important art in machine learning is choosing the best set of features to go into the model to achieve the best predictive ability.\n", "From the observations in the \"Data Wrangling and Visualization\" episode of the Big Data module, we can intuitively guess that `CPU_USAGE` and `vsize` may be two important features for an application detection task.\n", "After all, different applications would differ in their CPU and memory usage.\n", "Let us build our first machine learning model with these two features and observe the outcome of the prediction." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "features = df2_features_n[['CPU_USAGE', 'vsize']]\n", "features.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "labels = df2_labels.copy()\n", "labels.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `features` variable contains the selected features to use in the model, and `labels` contains the associated application labels.\n", "(Scikit-learn classification models know how to handle different classes in the `labels`, so we do not need to perform special encoding for the labels.)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4.1 Train-Test Split\n", "\n", "As the last step before building and training an ML model, we need to split the dataset into \"training\" and \"testing\" sets (both features and labels).\n", "The training set is used to *train* the model, whereas the testing set will be used to *validate* the performance of the trained model." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Uncomment and run\"\"\";\n", "#from sklearn.model_selection import train_test_split\n", "#train_F, test_F, train_L, test_L = train_test_split(features, labels, test_size=0.2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `_F` and `_L` suffixes in the variables above refer to the *features* and *labels*, respectively.\n", "We reserve 80% of the dataset for training, and 20% for testing (`test_size=0.2`)." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\"Training set shapes:\")\n", "print(train_F.shape)\n", "print(train_L.shape)\n", "print(\"Testing set shapes:\")\n", "print(test_F.shape)\n", "print(test_L.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4.2 Building and Testing Machine Learning Models\n", "\n", "In this notebook, we will experiment with two machine learning models: **Decision Tree** and **Logistic Regression**. \n", "We will start with establishing a **Logistic Regression** model.\n", "\n", "#### Training the ML Model: Logistic Regression" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model_lr = LogisticRegression(solver='lbfgs')\n", "model_lr.fit(train_F, train_L)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The first statement above creates a `LogisticRegression` object, named `model_lr`, that will perform the logistic regression classification.\n", "In the second statement, we train the model using the training dataset (features and labels).\n", "\n", "> #### Timing a Python statement\n", ">\n", "> Do you notice that the `model_lr.fit` call does not return immediately?\n", "> Indeed, training an ML model can take a while.\n", "> It is useful to note how long the training takes.\n", "> With Jupyter, you can time the function call easily, like this:\n", ">\n", "> ```python\n", "> %time model_lr.fit(train_F, train_L)\n", "> ```\n", ">\n", "> Please do this from now on so that you will get the timing.\n", "\n", "After the training, `model_lr` is ready to do the classification task.\n", "But we need to first *evaluate* the model using the testing dataset (`test_F` and `test_L`) to measure its ability to make correct predictions.\n", "The *accuracy score* is the most popular metric, defined as the fraction of the number of correct predictions (i.e. 
classification) over the total number of predictions made.\n", "We will introduce two common metrics to evaluate our model's performance: `accuracy_score` and `confusion_matrix`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Evaluating the ML Model\n", "\n", "To evaluate, we use the trained model to predict the applications based on the test features:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "test_pred = model_lr.predict(test_F)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "test_pred[:20]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The prediction result is a NumPy array. These predictions can be compared to the elements of `test_L`.\n", "\n", "**EXERCISE**: Manually compare the first 20 elements of `test_pred` and `test_L`: how many correct answers do you get?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### `accuracy_score`\n", "\n", "Well, Scikit-learn has a lot of tools to make our lives easier!\n", "To measure accuracy (the fraction of the correct answers), we can simply use the `accuracy_score` function:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\"accuracy_score: \", accuracy_score(test_L, test_pred))\n", "print(\"num of correct answers:\", accuracy_score(test_L, test_pred, normalize=False))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**QUESTION**:\n", "\n", "* What accuracy score do you obtain?\n", " Is this a good accuracy?\n", "\n", "* Where do the wrong answers go? What mispredictions does the model make?\n", "\n", "Compare your result with ours: We obtained just below 70%, meaning only about 70% of the answers are correct." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### `confusion_matrix`\n", "\n", "The `confusion_matrix` function computes the confusion matrix, which quantifies the correct and incorrect answers and shows the number of each kind of misprediction: " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\"confusion_matrix:\\n\", confusion_matrix(test_L, test_pred))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is a two-dimensional array where the row position corresponds to the true classes whereas the column position corresponds to the ML-predicted classes.\n", "But what are the classes in this matrix?\n", "We can query the ML object:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model_lr.classes_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So class number `0` is `Facebook` and `1` is `WhatsApp`.\n", "Scikit-learn can even plot the confusion matrix in a nice-looking graph:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Note: `plot_confusion_matrix` was removed in scikit-learn 1.2; on newer versions,\n", "# use `sklearn.metrics.ConfusionMatrixDisplay.from_estimator` with the same arguments.\n", "sklearn.metrics.plot_confusion_matrix(model_lr, test_F, test_L, cmap='Blues_r')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the example above, we have a total of ~75k Facebook records and ~47k WhatsApp records.\n", "\n", "Since we will be evaluating different models, let us define a function to evaluate the accuracy of a model:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def model_evaluate(model, test_F, test_L):\n", "    test_L_pred = model.predict(test_F)\n", "    print(\"Evaluation by using model:\", type(model).__name__)\n", "    print(\"accuracy_score:\", accuracy_score(test_L, test_L_pred))\n", "    print(\"confusion_matrix:\", \"\\n\", confusion_matrix(test_L, test_L_pred))\n", "    return" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can use the `model_evaluate` function to evaluate the 
model, like this:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model_evaluate(model_lr,test_F,test_L)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Alternative Model: Decision Tree\n", "\n", "Next, we can try the **Decision Tree** model. \n", "We are still using the same training and testing sets, just a different model." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Uncomment, complete, and run this code cell to train a decision tree.\n", " Use the same `fit()` function call as before to train the model.\"\"\";\n", "#model_dtc = DecisionTreeClassifier(criterion='entropy', max_depth=3, min_samples_split=8)\n", "#model_dtc.fit(#TODO)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We created a decision tree classifier and adjusted three (hyper)parameters: `criterion`, `max_depth`, and `min_samples_split`.\n", "Check the accuracy of this model:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Evaluate the accuracy of model_dtc\"\"\";\n", "#model_evaluate(#TODO)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**QUESTION**:\n", "\n", "How does the accuracy of the decision tree compare to that of logistic regression? Discuss your findings and compare with the previous model (`model_lr`)!" 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Using the ML Model\n", "\n", "Once a model has been trained and evaluated for accuracy, it is ready to be deployed!\n", "For example, we can use this model as part of a system monitor on the phone:\n", "The software gathers the resource utilization data in real time, preprocesses it to extract the `FEATURE_MATRIX`, and invokes `MODEL.predict(FEATURE_MATRIX)` to classify the running apps.\n", "Is this cool?\n", "\n", "**DISCUSSION**: Consider how this type of machine learning can be used to detect known malware.\n", "\n", "* How does this differ from the traditional approach to malware/virus detection?\n", "* What are the strengths and weaknesses of each approach?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Know Your Model! Read the References\n", "\n", "Scikit-learn makes it convenient to construct, train, and validate machine learning models.\n", "They may appear to be \"magic black boxes\" that will give us the best solution for a given task.\n", "They are NOT.\n", "In fact, you need to become familiar with the basic idea behind the models, their general behavior, their (hyper)parameters, etc.\n", "Do read the Scikit-learn references and gain familiarity with them.\n", "\n", "- [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) documentation;\n", "- [DecisionTree](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) documentation.\n", "\n", "The user's guide contains an excellent overview of these models with plenty of examples." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4.3 Summary: A Complete Pipeline for Machine Learning\n", "\n", "Congratulations!\n", "You have completed all the necessary steps to perform *supervised* machine learning.\n", "\n", "First, we need to define the purpose of the model, i.e. 
*what* the model is supposed to perform or predict.\n", "Once we are clear about this goal, then\n", "we follow a step-by-step procedure to do the machine learning modeling:\n", "\n", "1. Preprocessing the data;\n", "2. Separating the labels from the features, if necessary;\n", "3. Selecting the features to use for the model;\n", "4. Splitting the dataset into training data and testing data;\n", "5. Deciding which machine learning model to use: this largely depends on the nature of the task, and on the characteristics of the input data;\n", "6. Training (fitting) the machine learning model using the training data;\n", "7. Evaluating the model's performance.\n", "\n", "In the upcoming notebook, we will focus on tuning the model in order to improve its performance.\n", "Our goal is to obtain the best model to accomplish the task at hand." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4.4 Additional Experiments\n", "\n", "Next to the data wrangling/preparation step, the tweaking of machine learning models requires a lot of exploration and experimentation.\n", "Without some experimentation, it is difficult to ascertain that we have arrived at the best model.\n", "We encourage you to apply the same procedure summarized above to try out a few variants of ML models and evaluate their performance." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**EXERCISE:**\n", "Establish new ML models by using different sets of features as the input:\n", "\n", "1. Feature set 2: (`CPU_USAGE`, `cutime`)\n", "2. Feature set 3: (`CPU_USAGE`, `queue`)\n", "\n", "Evaluate the accuracy of logistic regression and decision tree models when using these feature sets." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Experiment 2: Using feature set (`CPU_USAGE`, `cutime`)\n", "\n", "Try both logistic regression and decision tree.\n", "\n", "*Hint*: Start with redefining the `features` by choosing the new set of columns:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Select `CPU_USAGE` and `cutime` to try out a new model\"\"\";\n", "#features = df2_features_n[#TODO]\n", "#features.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Split the dataset into training and testing sets\n", " (the 80%-20% split used above is fine)\"\"\";\n", "\n", "#train_F, test_F, train_L, test_L = train_test_split(#TODO...)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Create and train a new Logistic Regression model using the new training dataset\"\"\";\n", "\n", "model_lr2 = LogisticRegression(solver='lbfgs')\n", "#model_lr2.fit(#TODO)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Evaluate the new model_lr2\"\"\";\n", "\n", "#model_evaluate(#TODO)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**DISCUSSIONS**:\n", "\n", "* Compare the performance of `model_lr2` against the previous logistic regression model (`model_lr`). Which one does better?\n", "* Compare not only the accuracy, but also the confusion matrix.\n", "* Which model is more apt at getting the Facebook class correct? 
And which one is more reliable at predicting WhatsApp?\n", "\n", "\n", "\n", "> #### Keep Track of Your Results!\n", "> Your workshop instructor may set up a shared spreadsheet to save your results (at least accuracy); if so, please use that to keep track of all your experiment results!\n", "> You can also store your result in a text file or a spreadsheet of your own.\n", "> This will ease analysis and comparison later on." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> #### WARNING Regarding Jupyter\n", ">\n", "> Jupyter allows you to go back and re-run earlier code cells, but there are potential pitfalls you have to be aware of when doing this.\n", "> Here is one of them:\n", "> In the last few cells we redefined the variables `features`, `test_F`, `test_L`, ...;\n", "> when we do this, we should not go back to the earlier cells where those variables took different values and re-run those cells.\n", "> For example, if we re-train `model_lr` declared earlier with the new `train_F` and `train_L`, then the model will change (it becomes the same as `model_lr2`).\n", "> Moreover, we have lost access to the dataset we used to train and validate `model_lr`.\n", "> One way to get around this problem is to define a new set of variables (e.g. `features2`, `train_F2`, `train_L2`, ...) that correspond to the new model (`model_lr2`).\n", "> We will introduce a different approach to deal with the sprawl of new variables." 
 ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Redo the same with a new Decision Tree model: create, train, evaluate\"\"\";\n", "\n", "model_dtc2 = DecisionTreeClassifier(criterion='entropy',\n", " max_depth=3, min_samples_split=8)\n", "#model_dtc2.fit(#TODO...)\n", "#model_evaluate(#TODO...)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice how things are getting more routine and boring now?\n", "This is where we can start leveraging the old-fashioned way of running Python: **scripting**!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**DISCUSSION**:\n", "Again, carefully examine the results of this model and compare them against those of the previous models.\n", "Which model performs the best so far?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Experiment 3: Using feature set (`CPU_USAGE`, `queue`)\n", "\n", "Now try creating a model using the features `CPU_USAGE` and `queue`.\n", "Again, try both logistic regression and decision tree.\n", "Use exactly the same procedure as we have practiced above."
 ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Use features `CPU_USAGE` and `queue`; create and train `model_lr3` and `model_dtc3`\"\"\";\n", "\n", "#features = #TODO\n", "#features.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Create, train, validate the Logistic Regression model\"\"\";\n", "\n", "#model_lr3 = LogisticRegression(solver='lbfgs')\n", "#TODO" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Create the Decision Tree model\"\"\";\n", "\n", "model_dtc3 = DecisionTreeClassifier(criterion='entropy',\n", " max_depth=3, min_samples_split=8)\n", "#TODO" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**DISCUSSION**\n", "\n", "The scores for the new models that replaced `vsize` with `queue` were relatively low.\n", "Features `CPU_USAGE` and `queue` also happen to be the least correlated pair among all the features.\n", "Is it clear that (`CPU_USAGE`, `vsize`) is the best set of features to use?\n", "Clearly we need a way to select the best features to go into the model.\n", "\n", "**QUESTIONS:**\n", "\n", "* Can you think of any other features to select?\n", "* What would happen if we selected all features?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*--> (Enter your responses here) <--*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Unleashing It All: Using All Features\n", "\n", "**CHALLENGE**:\n", "Use the cell below to test the machine learning results when all the available features are used."
 ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Use all features to train and build LR and DTC models!\"\"\";\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**QUESTIONS**\n", "\n", "If you have reached this point and done the last experiment, then:\n", "\n", "* What is the ultimate accuracy of the two models (LR and DTC) given the original dataset?\n", "\n", "* Why don't we want to use all the features to build a machine learning model in real-life problems?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*--> (Enter your responses here) <--*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## 5. Summary\n", "\n", "(Edit this cell to produce your own summary. To do that, replace the questions with answers based on your experimental results.)\n", "\n", "* The accuracy of the model is `_____` on the input features.\n", "\n", "* Is training an ML model a fast or slow process?\n", "\n", "* What is the purpose of experimenting with different models in machine learning?\n", "\n", "* Write down a summary table / list of the accuracies and confusion matrices from different models.\n", "\n", "* Take heed of the Jupyter warning above regarding re-running earlier cells.\n", "\n", "* Overwriting variables and running *similar* code cells in a strict, sequential order is an annoyance; is there a better way to test these machine learning models outside of a Jupyter Notebook environment?" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 4 }