{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "**DeapSECURE module 2: Dealing with Big Data**\n", "\n", "# Session 3: Data Wrangling and Visualization\n", "\n", "Welcome to the DeapSECURE online training program!\n", "This is a Jupyter notebook for the hands-on learning activities of the\n", "[\"Big Data\" module](https://deapsecure.gitlab.io/deapsecure-lesson02-bd/), Episode 5: [\"Data Wrangling and Visualization\"](https://deapsecure.gitlab.io/deapsecure-lesson02-bd/30-data-wrangling-viz/index.html) .\n", "\n", "## Objective\n", "\n", "The goal of big data analytics is to obtain meaningful and valuable insight from the data, which can be translated into useful actions and/or decisions.\n", "While in the previous notebooks we learn the basic capabilities of using _pandas_, in this notebook we are embarking some activities which can generate the desired insight from data.\n", "We will learn how to use _pandas_ and visualization tools to uncover relationships, patterns, correlations from the Sherlock application dataset.\n", "\n", "This notebook will give you a *taste* of a data scientist's work on **data wrangling** and **exploratory data analysis**.\n", "These two steps are intertwined in practice; both are parts of **data preparation** step before the full data analysis takes place.\n", "The goal of data preparation is to achieve a clean, consistent and processable state of data,\n", "while at the same time familiarizing us with the characteristics of the data.\n", "While many of the principles we learn here still hold, each problem and dataset has its own specific issues that may not generalize.\n", "There is an art to this process, which needs to be learned through much practice and experience.\n", "\n", "\n", "## Data Exploration Questions\n", "\n", "Here are several data exploration questions we want to ask ourselves upon receiving a large Sherlock application dataset.\n", "This dataset will contain resource utilization from two applications: WhatsApp and Facebook.\n", "\n", "1. What are the statistical properties of the individual features of the dataset: mean, spread, distribution\n", "\n", "2. Can we spot issues with the data that will require cleaning?\n", "\n", " - useless features\n", " - missing data\n", " - duplicate data\n", " - outliers\n", "\n", "3. Can we extract some basic statistics about the applications through the data, such as:\n", "\n", " - which application uses more CPU cycles on average?\n", " - which application uses more memory on average?\n", "\n", "4. Are there correlations among the features in the dataset?\n", " If there are, what do they look like?\n", " Correlations can uncover which features are most important to consider in data analysis.\n", "\n", "Some of these questions are best answered by means of visual inspection;\n", "therefore we will cover a few basic visualization techniques in this notebook.\n", "\n", "\n", "**QUICK LINKS**\n", "* [Setup](#sec-setup)\n", "* [Loading Sherlock Data](#sec-load-sherlock)\n", "* [Data Wrangling](#sec-data-wrangling)\n", "* [Types of Data](#sec-types-of-data)\n", "* [Cleaning Data](#sec-cleanData)\n", "* [Visualization](#sec-visualization)\n", "* [Data Distribution](#sec-data-distribution)\n", "* [Feature Correlation](#sec-data-correlation)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## 1. 
"## 1. Setup Instructions\n", "\n", "If you are opening this notebook from Wahab cluster's OnDemand interface, you're all set.\n", "\n", "If you obtained this notebook elsewhere, and you want to perform the exercises on Wahab cluster, please follow the steps outlined in our setup procedure:\n", "\n", "1. Make sure you have activated your HPC service.\n", "2. Point your web browser to https://ondemand.wahab.hpc.odu.edu/ and sign in with your MIDAS ID and password.\n", "3. Create a new Jupyter session using the \"legacy\" Python suite, then create a new \"Python3\" notebook. (See ODU HPC wiki for more detailed help.)\n", "4. Get the necessary files using the commands below within Jupyter's terminal interface:\n", "\n", " mkdir -p ~/CItraining/module-bd\n", " cp -pr /scratch/Workshops/DeapSECURE/module-bd/. ~/CItraining/module-bd\n", " cd ~/CItraining/module-bd\n", "\n", "The file name of this notebook is `BigData-session-3.ipynb`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.1 Reminder\n", "\n", "* Throughout this notebook, `#TODO` is used as a placeholder where you need to fill in with something appropriate.\n", "\n", "* To run the code in a cell, press `Shift+Enter`.\n", "\n", "* The Pandas cheatsheet provides a reminder of the most frequently used _pandas_ syntax and functions.\n", "\n", "* The summary table of commonly used indexing syntax from our own lesson.\n", "\n", "We recommend you open these in separate tabs or print them;\n", "they are handy references for writing your own code." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.2 Loading Python Libraries\n", "\n", "As the next step, we need to import the required libraries into this Jupyter Notebook:\n", "`pandas`, `matplotlib.pyplot` and `seaborn`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**For Wahab cluster only**: before importing these libraries, we need to load the `DeapSECURE` *environment modules*:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "module(\"load\", \"DeapSECURE\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can import all the required modules into Python:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Run to import libraries\"\"\";\n", "import numpy\n", "import pandas\n", "from matplotlib import pyplot\n", "import matplotlib\n", "import seaborn\n", "%matplotlib inline\n", "##^^ This is an ipython magic command to ensure images are rendered inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> #### Optional: Increasing Matplotlib's Figure Sizes\n", ">\n", "> Matplotlib sets the default size of images to (6.4, 4.8) inches (horizontal and vertical dimensions).\n", "> If you prefer larger images by default, you can add this command at the beginning of the notebook.\n", "> This will only apply to *this* notebook--not all other notebooks that you have created or will create.\n", "> ```python \n", "> matplotlib.rcParams['figure.figsize'] = (10.0, 7.0)\n", "> ```\n", ">\n", "> If you want to set the size of only one figure, then use this statement in the same code cell, *just before* the plotting commands, for example:\n", "> ```python\n", "> pyplot.figure(figsize=(10.0, 7.0))\n", "> pyplot.hist(df2['CPU_USAGE'],bins=20) # an example plot\n", "> ```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n",
"## 2. Loading Sherlock Applications Dataset\n", "\n", "We will be working with a significantly larger data file, `sherlock/sherlock_mystery_2apps.csv`, roughly 76MB in size.\n", "Load the data into a DataFrame object named `df2`.\n", "This still has only two applications, WhatsApp and Facebook.\n", "\n", "> Let us give this dataset the nickname \"SherLock 2-apps\", or simply \"**2-apps**\", as it contains information from only two apps." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Uncomment and modify to load sherlock_mystery_2apps.csv into df2\"\"\";\n", "\n", "#df2 = pandas.#TODO(\"#TODO\");" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Save a backup of the raw DataFrame\n", "df2_backup = df2.copy()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.1. Initial Exploration\n", "\n", "Always perform an initial exploration on a new dataset!\n", "Use Pandas methods and attributes to answer the following questions:\n", "\n", "* How many rows and columns are in this dataset?\n", "* What do the numbers look like? (i.e., typical values, fluctuations)\n", "* What does the statistical information look like?\n", "* What do the features look like? (i.e., the data types)\n", "\n", "*Hint:* Use a combination of the DataFrame attributes `shape`, `dtypes`, and/or methods like `head`, `tail`, `describe`, `info`.\n", "\n", "*Reminder*: The lesson page contains the [description of the columns (features)](https://deapsecure.gitlab.io/deapsecure-lesson02-bd/20-analytics-intro/index.html#Applications.csv) contained in the dataset." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Uncomment to perform basic data exploration on df2 DataFrame\"\"\";\n", "\n", "# Example:\n", "#df2.describe().T" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.2 Comparing Datasets\n", "\n", "**QUESTION**: Compare the statistics of this dataset with that of the previous dataset (from file `sherlock/sherlock_mystery.csv`). Are these two datasets statistically the same?\n", "\n", "* Compare the columns of the two tables.\n", "* Compare the sizes of the data.\n", "* How do the mean and std (standard deviation) compare between the two datasets? Are they similar? Do any statistics look significantly different?\n", "* Any difference in the range (max - min) and spread of the data?\n", "\n", "*Hint*: It's more convenient to load that smaller dataset in this notebook and compare the `describe()` outputs." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Uncomment to read in sherlock_mystery.csv and compare with the df2 DataFrame\"\"\";\n", "\n", "#df_mystery = pandas.#TODO(#TODO)\n", "#df_mystery.describe().T" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n",
"## 3. Data Preparation\n", "\n", "When analyzing data, up to two-thirds of the time is actually spent preparing the data.\n", "This may sound like a waste of time, but that step is absolutely crucial to obtaining trustworthy insight from the data.\n", "The goal of **data preparation** is to achieve a clean, consistent and processable state of data.\n", "\n", "**Common issues with data** include the following:\n", "* Missing data\n", "* Bad or inconsistent data\n", "* Duplicate data\n", "* Irrelevant data\n", "* Format mismatch\n", "* Representational issues\n", "\n", "Data preparation is roughly made up of the following steps:\n", "\n", "* **Data wrangling** (data munging)\n", "* **Exploratory data analysis** (EDA)\n", "* **Feature engineering**\n", "\n", "This notebook will cover the first two steps above.\n", "(The third step is very closely related to machine learning and will be covered in detail in the subsequent lesson module.)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.1 Identifying Missing Data\n", "\n", "_pandas_ has several convenient methods to identify missing values in a DataFrame.\n", "One easy method is to check the output of `df2.info()` against the shape of the DataFrame:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df2.shape" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df2.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If any feature has fewer non-null entries than the first number in `df2.shape` (the number of rows), then that feature has missing values.\n", "(There is an alternative way that can single out only rows with missing values. We will cover it later in this notebook.)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## 4. Data Wrangling (Data Munging)\n", "\n", "Data wrangling transforms raw data into an appropriate and valuable format for a variety of downstream purposes including analytics.\n", "Data wrangling addresses issues such as the following:\n", "\n", "* Understanding the nature of each feature;\n", "* Handling missing data;\n", "* Removing duplicate data, bad or irrelevant data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "### 4.1 Types of Data\n", "\n", "_Pandas_ supports many data types, including: discrete numbers (ints), continuous numbers (floats), and strings.\n", "But to work effectively and properly with data, we need to further understand the *nature* and *meaning* (semantics) of our data.\n", "There are different ways to classify data beyond whether they are numbers or words.\n", "\n", "In tabular datasets, each column contains a *variable* or a feature.\n", "We need to consider the nature of each of these variables:\n", "\n", "1. *Numerical* vs *categorical*: Do the values in a feature have numerical significance or just signify categories?\n", "For example, memory usage is a numerical feature, whereas application names are categorical in nature.\n", "\n", "2. *Discrete* vs *continuous*: Can the feature assume any value in the range of the measuring scale, or is it limited to certain discrete choices? Example: CPU temperature vs. number of threads.\n", "\n", "3. *Qualitative* vs *quantitative*: Truly quantitative values have a sense of distance, i.e. you can *quantify* the difference between two values (e.g. the distance between 3.6 and 5.2 is 1.6).\n",
"Some features may have number-like properties (such as user ratings: poor, fair, good, excellent) but they cannot be truly measured numerically. Categorical values do not even have such a sense of comparison.\n", "\n", "It is important that we recognize these distinctions and characterize the nature of the features in our data.\n", "Our lesson page has a more in-depth [discussion on the topic of the nature of data](https://deapsecure.gitlab.io/deapsecure-lesson02-bd/30-data-wrangling-viz/index.html#types-of-data)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**DISCUSSION QUESTIONS**: Consider the features of the Sherlock dataset we use in this notebook (see [the list of features on our lesson page](https://deapsecure.gitlab.io/deapsecure-lesson02-bd/20-analytics-intro/index.html#features) or notebook #2 (`BigData-session-2.ipynb`)):\n", "\n", "* Which feature(s) are categorical in nature?\n", "\n", "* Which feature(s) are quantitative in nature?\n", "\n", "* Of those that are quantitative, which features are continuous in values?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4.2 Visual Data Inspection\n", "\n", "In our earlier notebook we introduced two types of plots that can be very useful in inspecting a new dataset: the raw data plot and the box plot.\n", "We can use a loop in Python as well as matplotlib's more advanced subplot feature to automate the creation of these plots for us.\n", "Of all the columns in `df2`, only numerical (quantitative) data can be plotted in this way.\n", "We ignore `ApplicationName`, which does not contain numerical values." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"\n", "Run this code cell to generate a panel of raw data plots.\n", "Be patient, it will take a few seconds to complete.\n", "Take this code and adapt it for your own analysis.\n", "Feel free to adjust the parameters.\n", "\"\"\";\n", "\n", "fig = pyplot.figure(figsize=(16.0, 10.0))\n", "nx = 3\n", "ny = 5\n", "columns = ( c for c in df2.columns if c != \"ApplicationName\" )\n", "\n", "print(\"Visually inspecting individual values: \", end=\"\")\n", "for i, col in enumerate(columns):\n", " print(\" \", col, sep=\"\", end=\"\")\n", " axes = fig.add_subplot(ny, nx, i+1)\n", " axes.set_ylabel(col)\n", " df2[col].plot(ax=axes)\n", " #if i > 3: break\n", "print()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Visualization is a powerful tool to identify anomalies or problems in the data.\n", "The plots shown above do not yet differentiate between the two applications, but the measured stats are ordered by the time they were taken.\n", "\n", "**DIAGNOSIS QUESTIONS**:\n", "\n", "1. Do you notice **one** feature that behaves strangely (unlike any other features)?\n", "\n", "2. Do you suspect some features that look identical?\n", "\n", "Take note of these potential issues.\n", "They may be signs of bad features---we will confirm whether this is the case, after additional checks.\n", "\n", "*HINTS*: Remember that the dataset contains *measurement* values of CPU usage, memory usage, and other types of observables in the phone's operating system.\n", "(Compare these measurements to temperatures, wind speeds, pressures, rainfall measures, etc. over a period of time.)\n", "It is typical that measurements may have spikes (where events are not regular) or fluctuations (ups and downs)."
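] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A quick numeric cross-check can complement the visual inspection above. Here is a minimal sketch (assuming `df2` is already loaded): a feature with only one distinct value carries no information, and two columns with identical counts deserve a closer look as duplicate suspects." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Optional sketch: count the distinct values in each numeric column.\"\"\";\n", "# A column with a single distinct value is useless;\n", "# identical counts between two columns hint at possible duplicates.\n", "df2.select_dtypes(include='number').nunique()"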
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "-----\n", "\n", "***(Enter your answers in this cell; these are important questions that must be answered.)***\n", "\n", "-----" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**EXERCISE**: Here's another diagnostic tool: Let's create a panel of box plots using seaborn.\n", "Let's use this syntax to make a horizontal boxplot (you can try the other orientation by substituting `x=` with `y=`:\n", "```python\n", "seaborn.boxplot(x=SERIES, ax=axes) # the `ax=` argument is needed for multi-panel plots\n", "```\n", "where `SERIES` is a placeholder for a specific DataFrame column." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Fix up this code to generate a panel of boxplots,\n", "similar to the panel generated above.\n", "Uncomment and edit the commented lines.\"\"\";\n", "fig = pyplot.figure(figsize=(16.0, 10.0))\n", "nx = 3\n", "ny = 5\n", "columns = ( c for c in df2.columns if c != \"ApplicationName\" )\n", "\n", "print(\"Generating boxplots: \", end=\"\")\n", "for i, col in enumerate(columns):\n", " print(\" \", col, sep=\"\", end=\"\")\n", " #axes = #TODO\n", " ## (optional) Add additional code to improve the presentation (e.g. axis label)\n", " #seaborn.boxplot(#TODO, ax=axes)\n", "print()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**DIAGNOSIS QUESTIONS**:\n", "Just as in the previous plot panel, we can ask again:\n", "\n", "1. Do you notice a feature that behaves strangely (unlike any other features)?\n", "\n", "2. Do you suspect that there are some duplicated features?\n", "\n", "**OUTLIERS**: Individual dots shown on the box plots above indicate outliers--i.e. they are outside the (2%, 98%) of the percentiles according to [normal distribution](https://www.mathsisfun.com/data/standard-normal-distribution.html) a.k.a. \"bell curve\".\n", "Sometimes, outliers identified in this way represent bad data.\n", "But not all features follow this idealized distribution, therefore not all outliers are bad data.\n", "Therefore, one must consider carefully the plausible range of values in a feature before judging whether (or which) outliers are bad data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "-----\n", "\n", "***(Enter your answers in this cell; these are important questions that must be answered.)***\n", "\n", "-----" ] }, { "cell_type": "markdown", "metadata": {}, "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## 5. Cleaning Data\n", "\n", "This sections discusses approaches involved in cleaning data.\n", "In practice, your judgment as the data analyst is very important so as not to introduce bias into the data.\n", "\n", "----\n", "\n", "*REVIEW*: If you work through the diagnostic questions above, you will identify:\n", "\n", "* one irrelevant feature;\n", "\n", "* one feature with missing data (missing values);\n", "\n", "* two features that are duplicate of two other features.\n", "\n", "----" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 5.1 Useless or Irrelevant Data\n", "\n", "**EXAMPLE**:\n", "There is one feature in `df2` that is irrelevant because the values are the same as the labels in the index of that DataFrame." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Uncomment and run to identify one irrelevant feature (column).\n", "The one below is just a starting point.\"\"\";\n", "\n", "#df2.head(20)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**OPTIONAL EXERCISE**:\n", "Use _pandas_ functions creatively to prove that this irrelevant column indeed has the same values as the row labels for all rows.\n", "*Hints*: Use `head` and `tail`; use the comparison function--that also works between a Series and an Index object." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#TODO" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Useless or irrelevant columns should be removed.\n", "You can remove column(s) using the\n", "`df.drop([COLUMN1, COLUMN2, ...], axis=1, inplace=True)`\n", "syntax.\n", "\n", "**EXERCISE**:\n", "Remove the irrelevant feature from `df2`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Modify and uncomment to remove the irrelevant column\"\"\";\n", "\n", "#df2.#TODO" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 5.2 Missing Data\n", "\n", "Missing data can be caused by several reasons. We will examine common practices for handling missing values.\n", "We strongly encourage you to read the lesson section on missing data for a deeper understanding." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Missing Data Exercises\n", "\n", "This section will help you become familiar with _pandas_' facilities for handling missing data.\n", "We encourage discussion with other participants to help you learn.\n", "Also consult [_pandas_ documentation on missing data](https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html).\n", "\n", "Undertake the following:\n", " * Create a toy DataFrame named `ex0` below with some missing values\n", " * Explore some _pandas_ methods below for identifying and handling missing values\n", "\n", "_pandas_ uses `nan` (a special not-a-number value defined in the `numpy` library) to denote a missing value." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Execute the following code to create a toy dataset with missing values\"\"\";\n", "\n", "nan = numpy.nan\n", "ex0 = pandas.DataFrame([[1, 2, 3, 0 ],\n", " [3, 4, nan, 1 ],\n", " [nan, nan, nan, nan],\n", " [nan, 3, nan, 4 ]],\n", " columns=['A','B','C','D'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A DataFrame or Series object has the following methods to deal with missing data:\n", "\n", "* `notnull()` and `isna()` methods detect the defined (non-null) or missing (null) values in the object, by returning a DataFrame or Series of boolean values;\n", "* `dropna()` method removes the rows or columns with missing values;\n", "* `fillna()` fills the missing cells with a default value.\n", "\n", "Here are some examples of detecting missing data:\n", "\n", "* `ex0.notnull()`\n", "* `ex0.isna()`\n", "* `ex0.isna().sum()`\n", "* `ex0.isna().sum(axis=0)`\n", "* `ex0.isna().sum(axis=1)`\n", "\n", "Run these commands in the cell(s) below and observe the outcome.\n", "What does each command mean?" 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Experiment with each expression above in this cell.\n", "Create and run more cells as needed.\"\"\";\n", "\n", "#ex0.#TODO()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `isna()` method call followed by `sum()` can be used to count the number of missing values in each column (default, with `axis=0`) or row (`axis=1`).\n", "In Python, for summing purposes, a `True` value counts as numerical 1 and a `False` as 0." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here are some examples of handling missing data: What does each command mean?\n", "\n", "* `ex0.dropna()`\n", "* `ex0.dropna(how='all')`\n", "* `ex0.dropna(axis=1)`\n", "* `ex0.fillna(7)`\n", "* `ex0.fillna(ex0.mean(skipna=True))`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Experiment with each expression above in this cell.\n", "Create and run more cells as needed.\"\"\";\n", "\n", "#ex0.#TODO()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**NOTE**: The `dropna()` and `fillna()` methods can be given an optional `inplace=True` argument to perform the data correction in-place (i.e. modifying the original DataFrame object)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Detecting Columns with Missing Data\n", "\n", "Earlier in Sec 3.1 we described a method to detect columns with missing data using the `info` method.\n", "Based on the *Missing Data Exercises* above, we may have discovered that there is another way to detect columns *or* rows with missing data.\n", "\n", "**EXERCISE:** Using one of the statements already demonstrated just above, identify features (columns) in `df2` that have some missing data.\n", "That same statement also identifies how many values in each column are missing." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Write a code below to identify features with missing values in df2\"\"\";\n", "\n", "#df2.#TODO" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**CHALLENGE**:\n", "What is the fraction of the missing data compared to the total number of records (rows) in the dataset?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Use this cell to find the fraction of the rows with missing data\"\"\";\n", "#TODO" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Addressing Missing Data\n", "\n", "The `cminflt` column has a significant number of missing values (over 20%). We mention several possible options in [the lesson text](https://deapsecure.gitlab.io/deapsecure-lesson02-bd/30-data-wrangling-viz/index.html#missing-data) to deal with the case of the missing data:\n", "\n", "* Drop all the rows that have missing any data;\n", "\n", "* Drop the one column with missing data (cminflt) from the table;\n", "\n", "* *Imputation*: Fill the missing value with something reasonable and plausible.\n", "\n", "**QUESTION:** What do *you* want to do with the missing data?\n", "Discuss the pros and cons of each possible choice!" 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*(Enter your response here)*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "-----\n", "\n", "**OUR COURSE OF ACTION**:\n", "For our learning purposes, in this notebook we will simply drop the rows that have missing `cminflt` values, because we have nearly 800k rows in the original dataset. While 20% of samples with missing data sounds a lot, we still have over 600k rows after removing those rows.\n", "\n", "-----\n", "\n", "> **Data Cleaning in Real World** -- In real data-science projects, one will have to perform a post-analysis to get an additional confirmation that our choice of treatment of missing data is not causing a bias in the analysis." ] }, { "cell_type": "markdown", "metadata": {}, "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Use this cell to fix the missing data in df2\"\"\";\n", "\n", "#TODO" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 5.3 Duplicate Data\n", "\n", "There are many reasons that duplicate features can enter into a dataset.\n", "Whether it happens during the collection or the integration of data, one must watch for duplicate data as they affect the quality of data--and the outcome of the analysis.\n", "Common problems include duplicated rows and duplicated columns.\n", "\n", "> * `DataFrame.duplicated()` checks row after row for duplicates and returns a Series of Boolean `True` per duplicate line\n", "> * `reset_index()` Rearranges indexes\n", "\n", "#### Duplicate Data Exercise (optional)\n", "\n", "In this *optional* exercise we undertake the following:\n", "\n", "* Create a new dataset with duplicate data in several rows\n", "* Identify duplicates in dataset\n", "* Remove the duplicates from the dataset" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Create a synthetic dataframe (df3) from portions of df2 with duplicates\"\"\";\n", "\n", "# this is a set of overlapping samples from df2 but with different index\n", "df2_extras = df2.iloc[11:17]\n", "df2_extras.index = range(181000, 181000+df2_extras.shape[0])\n", "\n", "df3 = pandas.concat([ df2.iloc[0:7],\n", " df2.iloc[5:6],\n", " df2.iloc[7:13],\n", " df2_extras ])\n", "\n", "print(\"Initial df3 shape:\", df3.shape)\n", "df3" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Uncomment to check for duplicates\"\"\";\n", "\n", "#df3_dups = df3.duplicated()\n", "#df3_dups" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Based on df3_dups, print only the labels of the rows where duplicates were found!\"\"\";\n", "#TODO" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Uncomment and edit to remove the duplicate rows\"\"\";\n", "\n", "#df3.drop_duplicates(inplace=True)\n", "#print(\"New shape of df3 (no duplicates):\", df3.shape)\n", "#df3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> **ADVICE**: Before removing rows that appear to be duplicated, always make sure that they are genuine duplicates.\n", "> Please check the context of the data to obtain certainty on this.\n", "> It is *possible* that, in certain cases, the data contain two or more observations that happen to have the same values in all the columns.\n", "> In the SherLock dataset, for example, 
"> measurements from different snapshots *could* fortuitously have yielded exactly the same set of values, although this is an extremely rare occurrence." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Duplicate Columns\n", "\n", "The exercise above demonstrates how one ought to deal with duplicate data appearing in multiple rows.\n", "In an earlier part of this notebook, we visually inspected the raw data, column by column.\n", "We noticed two pairs of features that appear identical:\n", "\n", "* `guest_time` and `utime`;\n", "* `vsize` and `Mem`.\n", "\n", "Are they really identical? If two features are truly identical, one of them should be removed.\n", "We will come back to this issue when we plot the pairwise correlations below." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "-----\n", "\n", "**CHALLENGE**: Using your existing _pandas_ skills, determine whether these pairs are genuine duplicates.\n", "\n", "-----" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## 6. Visualization\n", "\n", "Visualization presents data graphically in many different forms, each uncovering patterns and trends in the data in its own way.\n", "Visualization is indispensable when handling and analyzing massive amounts of data.\n", "In this section we will introduce common visualization techniques that can greatly help exploratory data analysis.\n", "\n", "We will use two major visualization libraries in Python:\n", "Matplotlib and Seaborn:\n", "\n", "* Matplotlib: The de facto standard 2D plotting library for Python, supported in Python scripts, IPython shells and other Python platforms including Jupyter.\n", "The plotting capabilities are provided by the `pyplot` module within this library.\n", "\n", "* Seaborn: Provides a high-level interface for drawing attractive and informative statistical graphics.\n", "Seaborn uses Matplotlib as its backend." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> *HINTS*\n", ">\n", "> * Use `pyplot.figure(figsize=(x_size,y_size))` in a cell to create a blank figure with a custom size.\n", "> This should be specified before calling the plotting functions.\n", "> The default plot size is `x_size=6.4` and `y_size=4.8`.\n", ">\n", "> * The Python code for visualization can become complex rather quickly, given all the elements to specify (axis labels, tick-label tweaks, panels of plots, ...).\n", "> We recommend that you adapt code written by others that produces visualizations similar to what you want, rather than writing yours from scratch.\n", "> For this reason, we provide fairly complex visualization code in this notebook to help you start your own.\n", "> Other good starting places are the galleries of Matplotlib or Seaborn's sample plots (see their websites)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 6.1 Count Plot\n", "\n", "A *count plot* shows the number of occurrences of various values in a categorical variable.\n", "For example, to draw a count plot of the `COLUMN_NAME` column in a DataFrame `DF`, use one of the following syntaxes:\n", "```python\n", "seaborn.countplot(x=SERIES)\n", "seaborn.countplot(x=COLUMN_NAME, data=DF)\n", "```\n", "\n", "**QUESTION:** How many records exist in `df2` for each application? Which application has more records in the dataset?"
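] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you would like to see the pattern on a tiny example first, here is a sketch on toy data (not the SherLock set):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"A toy countplot: expect a bar of height 2 for WhatsApp and 1 for Facebook.\"\"\";\n", "toy = pandas.DataFrame({'app': ['WhatsApp', 'WhatsApp', 'Facebook']})\n", "seaborn.countplot(x='app', data=toy)"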
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Modify and uncomment to generate a countplot of ApplicationName in 'df2' DataFrame\"\"\";\n", "\n", "#seaborn.countplot(x='#TODO', data=#TODO)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*The graph above displays a count plot representing the number of records for each application recorded in the `df2` dataframe.*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**EXERCISE**: Using grouping and aggregation operations described in the previous notebook, cross-check the count plots above." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Modify and uncomment code below to count the number of records\n", "grouped by the individual application names.\"\"\";\n", "\n", "#df2.#TODO('#TODO').size()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 6.2 Histogram\n", "\n", "A histogram displays the distribution of values (shape and spread) in the form of vertical bars.\n", "The range of the values are partitioned equally into multiple bins on the horizontal axis.\n", "The frequency of values within each bin's range is displayed as a vertical bar for each bin. \n", "Taller bars show that more data points fall in those bin ranges.\n", "\n", "In this section, we will experiment with histograms produced using Matplotlib and Seaborn packages to demonstrate their capabilities.\n", "\n", "#### Histogram with Pyplot\n", "\n", "We produce a histogram plot of `COLUMN_NAME` of the `df` DataFrame using the following syntax:\n", "```python\n", "pyplot.hist(df[COLUMN_NAME], bins=BIN_COUNT)\n", "```\n", "The `bins=BIN_COUNT` argument sets the number of bins to display.\n", "\n", "**EXERCISE**:\n", "Using `pyplot`, create a histogram of `CPU_USAGE` column in `df2` using 20 bins." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Modify and uncomment to plot a histogram of CPU_USAGE in df2 DataFrame using 20 bins\"\"\";\n", "\n", "#hist_plot = pyplot.hist(#TODO,bins=#TODO)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*Comment*: The `pyplot.hist` returns a tuple of three components: \n", "\n", " * An array of value counts on the individual bins\n", " * An array of bin edges\n", " * A list of `Patch` objects, each corresponding to a histogram bin\n", "\n", "**QUESTIONS**:\n", "\n", " * How many bars were visibly plotted?\n", " * How many bars are actually displayed?\n", " *Hints*: Look at `hist_plot[0]`; also look at the (x) range of the bins.\n", " * What is the range of values of `CPU_USAGE` in the dataset?\n", "\n", "**OPTIONAL EXERCISES**:\n", "\n", " * Examine the content of `hist_plot` to learn about these values.\n", " The arrays are helpful for close-up analysis.\n", " * Can we alter the appearance of the plot so that the other bars are visible?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Histogram (Distribution) Plot with Seaborn\n", "\n", "Seaborn generates a histogram plot of `df[COLUMN_NAME]` using the following syntax:\n", "```python\n", "seaborn.distplot(df[COLUMN_NAME], kde=False, bins=BIN_COUNT)\n", "```\n", "\n", "**EXERCISE**:\n", "Re-create the histogram of `df2['CPU_USAGE']` using `seaborn` package, also with 20 bins." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Modify and uncomment to plot a histogram of 'CPU_USAGE' in df2 DataFrame using seaborn\"\"\";\n", "\n", "#res_sns = seaborn.distplot(#TODO)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Comments on Plotting Packages\n", "\n", "The plotting packages have a lot of \"knobs\" to adjust the appearance (visual) of the graph.\n", "For example, it can alternatively draw the bars horizontally (flipping the meaning of the axes):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "hist_plot = pyplot.hist(df2['CPU_USAGE'], bins=20, orientation='horizontal')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Of the two histogram plots drawn with the two different packages, which plot has more desirable appearance to you?\n", "\n", "> Seaborn is a newer package and is still under heavy development.\n", "> For example, newer version of Seaborn has `seaborn.displot` function which is a new interface to replace `seaborn.distplot`.\n", "> The new `displot` function produces a graph that has all the necessary axis labels.\n", "> The version of `seaborn` on Wahab is not currently supporting this function." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Plotting Exercises\n", "\n", "**EXERCISE**: Plot a histogram of `priority` in `df2` using `pyplot`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Modify and uncomment to plot a histogram of priority in df2 using pyplot with 20 bins\"\"\";\n", "\n", "# Res = pyplot.hist(#TODO,#TODO)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**QUESTIONS**:\n", "\n", "* How many bars are visible?\n", "* How many bars are actually displayed? *Hints*: Look at `Res[0]`; also look at the (x) range of the bins." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Upon completing exercise, you will observe that `priority` contains integer values with only a few values displayed.\n", "Frequently, data appearing as such after plotting is a tell-tale sign of `categorical` or `ordinal` data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**EXERCISE**:\n", "Plot a histogram of `num_threads` in `df2` using pyplot (plt)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Modify and uncomment to plot a histogram of `num_threads` in df2 using pyplot with 20 bins\"\"\";\n", "\n", "#Res2 = pyplot.#TODO(df2['#TODO'], bins=#TODO)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Upon completing the exercise, observe the number of threads shows a multimodal (two major and smaller peaks).\n", "The histogram plots so far, however, mixes the data from two apps (Facebook and WhatsApp) in one graph.\n", "Let us try to separate them visually." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Drawing Multiple Histograms in One Plot\n", "\n", "We can further plot a histogram of `num_threads` grouped by application type.\n", "Showing multiple graphs for different categories in one plot can shed some light.\n", "\n", "**OPTIONAL EXERCISE**: Replot the histogram above by separating the `num_threads` belonging to Facebook from those of WhatsApp.\n", "Draw the two histogram bars in one plot." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Create a combined histogram plot where we separate the records\n", "belonging to the two different apps.\n", "Modify and uncomment appropriate to run\"\"\";\n", "\n", "# get the num threads for Facebook\n", "#nthrds_FB = df2[#TODO]['num_threads']\n", "# get the num threads for WhatsApp\n", "#nthrds_WA = #TODO\n", "\n", "# (Optional) Get the min and max values for *all* the app categories\n", "# so we have a uniform histogram appearance\n", "min_val = df2['num_threads'].min()\n", "max_val = df2['num_threads'].max()\n", "print('num_threads range:', min_val, '..', max_val)\n", "\n", "# Uncomment and edit to display the histograms\n", "#pyplot.hist(nthrds_FB, label='Facebook', bins=20)\n", "#pyplot.hist(#TODO)\n", "\n", "# Hint: add the `range=(min_val,max_val)` argument to both function calls above\n", "# to make the bars have the same width\n", "\n", "# Uncomment this one to get the legend:\n", "# pyplot.legend()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**QUESTION**:\n", "Upon completion of the plot, can you observe the different characteristics of the two applications?\n", "Discuss the differences based on the visual cues." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 6.3 Box Plot\n", "\n", "Box plot displays data distribution based on a five number summary as follows;\n", "* Minimum\n", "* First quartile (Q1)\n", "* Median\n", "* Third quartile (Q3)\n", "* Maximum \n", "\n", "**Note**\n", "> * Interquartile range (IQR): 25th to the 75th percentile.\n", "> * “maximum”: Q3 + 1.5*IQR\n", "> * “minimum”: Q1 -1.5*IQR \n", "\n", "We have covered box plot earlier in this notebook; please see that part for code examples.\n", "\n", "#### Exercise\n", "\n", "Use `seaborn.boxplot(DataFrame[COLUMN_NAME])` to create A box plot with Seaborn.\n", "In this exercise, create a boxplot of first 2000 records of `guess_time` in `df2`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Modify and uncomment to select and plot all records of 'guest_time' in 'df2' Dataframe\"\"\";\n", "\n", "##TODO(df2[#TODO].guest_time)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 6.4 Bar Plot\n", "\n", "A bar plot displays an estimate of central tendency of numerical variables, with the height of each rectangle being its mean value and the errorbar providing some indication of the uncertainty around that mean value. The syntax is as follows:\n", "\n", "```python\n", "sns.barplot(x=CATEGORY_COLUMN, y=COLUMN, data=DF)\n", "```\n", "\n", "This will plot the bar plot of `DF[COLUMN]` values, grouped by the different categories according to `DF[CATEGORY_COLUMN]`.\n", "\n", "**EXERCISE**:\n", "Create a bar plot of `CPU_USAGE` to show the average CPU usage of the two different apps in `df2`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Modify and uncomment to plot a barplot\"\"\";\n", "\n", "#seaborn.barplot(x=#TODO, y=#TODO, data=df2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**QUESTION**: Which application uses more CPU on average?\n", "\n", "*Notes*: Standard deviation of the samples can be plotted (instead of the uncertainty of the mean) by adding `ci='std'` argument on the function call." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Questions and Discussions\n", "\n", "* Which type of data can use a barplot to present?\n", "* What are black lines in barplot?\n", "* Can you try other parameters and infer some interesting result?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## 7. Correlations Among Features\n", "\n", "\"In statistics, *correlation* or *dependence* is any statistical relationship, whether causal or not, between two random variables or bivariate data.\n", "In the broadest sense, correlation is any statistical association, though it commonly refers to the degree to which a pair of variables are linearly related.\"\n", "([Wikipedia](https://en.wikipedia.org/wiki/Correlation_and_dependence))\n", "See the [lesson for further explanation](https://deapsecure.gitlab.io/deapsecure-lesson02-bd/30-data-wrangling-viz/index.html#correlations).\n", "\n", "While previous sections focused on individual features within a dataset, many times correlations exist amongst these features which could affect the quality of the dataset.\n", "Two features (variables) are said to be *correlated* when the value changes in one feature is accompanied by the changes in the other feature in a systematic fashion.\n", "Visualization is a great tool to help identify correlations." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 7.1 Scatter Plot and Joint Plot\n", "\n", "*Scatter plots* display a variable pair as points in a two-dimensional plot. The value of one variable is shown on one axis whereas the value from the other is on the other axis.\n", "With this kind of plot, we can visually identify how much one variable affects the other.\n", "Related to scatter plot, there is a variation of the plot called *joint plot*, where the two-dimensional scatter plot is superimposed with the one-dimensional distributions (i.e. histograms) on the two axes.\n", "Use the following syntaxes to create a scatter plot and/or a joint plot:\n", "\n", "```python\n", "seaborn.scatterplot(x=COLUMN_NAME_1, y=COLUMN_NAME_2, data=DF)\n", "seaborn.jointplot(x=COLUMN_NAME_2, y=COLUMN_NAME_2, data=DF)\n", "```\n", "\n", "where `COLUMN_NAME_1` and `COLUMN_NAME_2` are the names of the columns of the variable pair in the `DF` DataFrame.\n", "\n", "**EXERCISE**:\n", "Create a scatter plot of the following column pairs in `df2` DataFrame:\n", "\n", "* X-axis: `utime` Y-axis: `vsize`\n", "* X-axis: `Mem` Y-axis: `vsize`\n", "\n", "Optionally, create a jointplot using `utime` on x-axis and `vsize` on y-axis." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Plot a scatter plot of vsize against utime in df2 DataFrame and explain the output\"\"\";\n", "\n", "#TODO" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Plot a scatter plot of vsize against Mem in df2 DataFrame and explain the output\"\"\";\n", "\n", "#TODO" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Correlations among two features would appear as patterns in the scatter plot.\n", "When two features are correlated, then there is a strong interdependence of the value of one feature to the other.\n", "In such a case, we can *guess* what the value of the second feature, knowing the value of the first feature.\n", "When the two features are not correlated, the scatter plot contain randomly distributed points with no clear trend.\n", "\n", "**QUESTIONS**:\n", "\n", "1. 
"1. From the scatter plots above, which variable pairs exhibit a strong correlation, and which do not?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Plot a jointplot of vsize against Mem in the df2 DataFrame and explain the output\"\"\";\n", "\n", "#TODO" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 7.2 Pair Plot\n", "\n", "To plot multiple pairwise bivariate distributions in a dataset, you can use the `seaborn.pairplot()` function. This creates a matrix of\n", "axes and shows the relationship for each pair of columns in a DataFrame. By default, it also draws the univariate distribution\n", "of each variable on the diagonal axes.\n", "\n", "\n", "#### 7.2.1 Exercise\n", "\n", "Create a smaller DataFrame and plot a pairplot of it using the code below:\n", "```python\n", "df2_demo = df2.iloc[:, 5:9]\n", "seaborn.pairplot(df2_demo)\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Enter the code above to create a DataFrame and plot a pairplot\"\"\";\n", "\n", "#TODO" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"An alternative selection: pick the columns by name instead of by position\"\"\";\n", "df2_demo = df2[['CPU_USAGE', 'num_threads', 'otherPrivateDirty', 'priority', 'vsize']]\n", "seaborn.pairplot(df2_demo)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Confirm which columns were selected\n", "df2_demo.columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 7.3 Correlation Heat Map\n", "\n", "A *correlation heat map* indicates the correlations that exist between pairs of variables (\"pairwise correlations\") in a color-coded image plot.\n", "The color indicates the magnitude of the correlation. \n", "\n", "**EXERCISE**\n", "\n", "Let us compute and plot the pairwise correlations among the variables in the dataset.\n", "\n", "We will use the following functions/methods:\n", "\n", "* `DataFrame.corr()` -- Computes pairwise correlation of columns, excluding NA/null values. By default the Pearson correlation function is used.\n", "* `seaborn.heatmap(DataFrame, [options...])` -- Plots a heat map of a DataFrame, e.g. the computed correlation matrix\n", "\n", "This exercise entails the following steps:\n", "\n", "* Compute the pairwise correlations of `df2`, save them in a new dataframe called `df_corr`.\n", "* Plot a heat map of the correlation dataframe." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Modify and uncomment to create a heat map of df2 pairwise correlations\"\"\";\n", "\n", "#df_corr = #TODO\n", "#pyplot.figure(figsize=(12, 12)) \n", "#seaborn.#TODO(#TODO, annot=True, vmax=1, square=True, cmap=\"Blues\")\n", "#pyplot.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**EXPLANATION**\n", "\n", "Two variables have a linear relationship if changes in one affect the other by a proportional constant. Mathematically,\n", "\n", "```\n", "var2 = constant * var1\n", "```\n", "\n", "The Pearson correlation function returns 1.0 if the correlation is perfect with a positive constant factor.\n", "It returns -1.0 if the correlation is perfect with a negative constant factor.\n", "\n", "For this exercise, observe and discuss the heatmap plotted just above.\n", "Identify a few pairs that have very strong correlations (+1 or -1).\n", "\n", " * By definition, the diagonal elements of the heat map are +1 because a variable is perfectly correlated with itself.\n", "\n", " * It appears that `vsize` has a perfect correlation with `Mem`.\n", " In fact, you can prove with _pandas_ that `Mem` is identical to `vsize`.\n", " On the other hand, `utime` and `vsize` don’t have this kind of relationship.\n", "\n", " * Identify two other pairs that have strong correlations (+1 or -1).\n", " \n", " * Identify another pair that is highly correlated (the absolute value is well above 0.5)."
] }, { "cell_type": "markdown", "metadata": {}, "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Conclusion\n", "\n", "To conclude the activities in this notebook, please write down all your observation on the following issues:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Issues with Data in `sherlock_mystery_2apps.csv`\n", "\n", "#### Missing data\n", "\n", "Which column(s) (feature[s]) have missing data?\n", "What course of action do we choose to deal with this issue?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*--> (enter your response here)*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Irrelevant Data\n", "\n", "Which column(s) contain irrelevant data for our analysis?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*--> (enter your response here)*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Duplicate Data\n", "\n", "Which column(s) contain duplicate of the other column(s)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*--> (enter your response here, e.g. Column \"X\" is a duplicate feature of column \"Y\")*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Observed Patterns in Data\n", "\n", "* Which application uses more CPU cycles on average? Facebook or WhatsApp?\n", "\n", "* Which application uses more memory on average?\n", "\n", "* Which pair of features have high correlations?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*--> (enter your response here)*" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 4 }