{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**DeapSECURE module 2: Dealing with Big Data**\n",
"\n",
"# Session 3: Data Wrangling and Visualization\n",
"\n",
"Welcome to the DeapSECURE online training program!\n",
"This is a Jupyter notebook for the hands-on learning activities of the\n",
"[\"Big Data\" module](https://deapsecure.gitlab.io/deapsecure-lesson02-bd/), Episode 5: [\"Data Wrangling and Visualization\"](https://deapsecure.gitlab.io/deapsecure-lesson02-bd/30-data-wrangling-viz/index.html) .\n",
"\n",
"\n",
"## Data Preparation\n",
"\n",
"When analyzing data, up to two-thirds of the time is actually spent preparing the data.\n",
"This may sound like a waste of time, but that step is absolutely crucial to obtaining trustworthy insight from the data.\n",
"The goal of **data preparation** is to achieve a clean, consistent and processable state of data.\n",
"\n",
"**Common issues with data** include the following:\n",
"* Missing data\n",
"* Bad or inconsistent data\n",
"* Duplicate data\n",
"* Irrelevant data\n",
"* Format mismatch\n",
"* Representational issues\n",
"\n",
"Data preparation is roughly made up of the following steps:\n",
"\n",
"* **Data wrangling** (or **data munging**)\n",
"* **Exploratory data analysis** (EDA)\n",
"* **Feature engineering**\n",
"\n",
"This session will cover the first two steps above.\n",
"We will give you a *taste* of a data scientist's work on **data wrangling** and **exploratory data analysis**.\n",
"These two steps are interconnected and often performed together in an iterative fashion.\n",
"While many of the principles we learn here still hold, each problem and dataset has its own specific issues that may not generalize.\n",
"There is an art to this process, which needs to be learned through much practice and experience.\n",
"\n",
"**QUICK LINKS**\n",
"* [Setup](#sec-setup)\n",
"* [Loading Sherlock Data](#sec-load-sherlock)\n",
"* [Data Wrangling](#sec-data-wrangling)\n",
"* [Types of Data](#sec-types-of-data)\n",
"* [Cleaning Data](#sec-cleanData)\n",
"* [Visualization](#sec-visualization)\n",
"* [Data Distribution](#sec-data-distribution)\n",
"* [Feature Correlation](#sec-data-correlation)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"## 1. Setup Instructions\n",
"\n",
"If you are opening this notebook from the Wahab OnDemand interface, you're all set.\n",
"\n",
"If you see this notebook elsewhere, and want to perform the exercises on Wahab cluster, please follow the steps outlined in our setup procedure.\n",
"\n",
"1. Make sure you have activated your HPC service.\n",
"2. Point your web browser to https://ondemand.wahab.hpc.odu.edu/ and sign in with your MIDAS ID and password.\n",
"3. Create a new Jupyter session using \"legacy\" Python suite, then create a new \"Python3\" notebook. (See ODU HPC wiki for more detailed help.)\n",
"4. Get the necessary files using commands below within Jupyter:\n",
"\n",
" mkdir -p ~/CItraining/module-bd\n",
" cp -pr /scratch/Workshops/DeapSECURE/module-bd/. ~/CItraining/module-bd\n",
" cd ~/CItraining/module-bd\n",
"\n",
"The file name of this notebook is `BigData-session-3.ipynb`."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1.1 Reminder\n",
"\n",
"* Throughout this notebook, `#TODO` is used as a placeholder where you need to fill in with something appropriate. \n",
"\n",
"* To run a code in a cell, press `Shift+Enter`.\n",
"\n",
"* Pandas cheatsheet\n",
"\n",
"* Summary table of the commonly used indexing syntax from our own lesson.\n",
"\n",
"We recommend you open these in separate tabs or print them;\n",
"they are handy references when writing your own code."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1.2 Loading Python Libraries\n",
"\n",
"Next, we need to import the required libraries into this Jupyter Notebook:\n",
"`pandas`, `matplotlib.pyplot` and `seaborn`.\n",
"\n",
"**For Wahab cluster only**: before importing these libraries, we have to load the following *environment modules*: `py-pandas`, `py-matplotlib`, `py-seaborn`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"\"\"\"Run to load modules\"\"\";\n",
"module(\"load\", \"py-pandas\", \"py-matplotlib\", \"py-seaborn\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we can import all the required modules into Python:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"\"\"\"Run to import libraries\"\"\";\n",
"import pandas\n",
"import matplotlib\n",
"from matplotlib import pyplot\n",
"import numpy\n",
"import seaborn\n",
"%matplotlib inline\n",
"##^^ This is an ipython magic command to ensure images are rendered inline"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"## Optional: Increase figure sizes globally.\n",
"## The default size is (6.4, 4.8)\n",
"#matplotlib.rcParams['figure.figsize'] = (10.0, 7.0)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"## 2. Loading Sherlock Applications Dataset\n",
"\n",
"We will be working with a significantly larger data file, `sherlock/sherlock_mystery_2apps.csv`, roughly 76MB in size.\n",
"Load the data into a DataFrame object named `df2`.\n",
"This still has only two applications, WhatsApp and Facebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"\"\"\"Uncomment and modify to load sherlock_mystery_2apps.csv into df2\"\"\";\n",
"\n",
"#df2 = pandas.#TODO(\"#TODO\");"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.1. Initial Exploration\n",
"\n",
"Always perform initial exploration of the new dataset; use Pandas methods and attributes to answer the following questions:\n",
"\n",
"* How many rows and columns are in this dataset?\n",
"* What do the numbers look like?\n",
"* What does the statistical summary look like?\n",
"* What is the data type of each feature?\n",
"\n",
"*Hint:* use a combination of the DataFrame attributes `shape`, `dtypes`, and/or methods like `head`, `tail`, `describe`, `info`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"\"\"\"Uncomment to perform basic data exploration on df2 DataFrame\"\"\";\n",
"\n",
"#df2.describe().T"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**QUESTION**: Compare the numbers and statistics with those of the previous dataset (`df_mystery`, from file `sherlock/sherlock_mystery.csv`):\n",
"\n",
"* Compare the columns of the two tables.\n",
"* Compare the sizes of the data.\n",
"* How do the mean and std (standard deviation) compare between the two datasets? Are they similar? Do any statistics look significantly different?\n",
"* Any difference in the range (min-max) and spread of the data?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"\"\"\"Uncomment to read in sherlock_mystery.csv and compare with df2 Dataframe\"\"\";\n",
"\n",
"#df_mystery = pandas.#TODO(#TODO)\n",
"#df_mystery.describe().T"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"## 3. Data Preparation\n",
"\n",
"When analyzing data, up to two-thirds of the time is actually spent preparing the data.\n",
"This may sound like a waste of time, but that step is absolutely crucial to obtaining trustworthy insight from the data.\n",
"The goal of **data preparation** is to achieve a clean, consistent and processable state of data.\n",
"\n",
"**Common issues with data** include the following:\n",
"* Missing data\n",
"* Bad or inconsistent data\n",
"* Duplicate data\n",
"* Irrelevant data\n",
"* Format mismatch\n",
"* Representational issues\n",
"\n",
"Data preparation is roughly made up of the following steps:\n",
"\n",
"* **Data wrangling** (data munging)\n",
"* **Exploratory data analysis** (EDA)\n",
"* **Feature engineering**\n",
"\n",
"This notebook will cover the first two steps above."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"## 4. Data Wrangling (Data Munging)\n",
"\n",
"Data wrangling transforms raw data into an appropriate and valuable format for a variety of downstream purposes including analytics.\n",
"Data wrangling addresses issues such as the following:\n",
"\n",
"* Understanding the nature of each feature;\n",
"* Handling missing data;\n",
"* Removing duplicate data, bad or irrelevant data."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"### 4.1 Types of Data\n",
"\n",
"_Pandas_ supports many data types, including: discrete numbers (ints), continuous numbers (floats), and strings.\n",
"But to work effectively and properly with data, we need to further understand the nature and meaning of our data.\n",
"There are different ways to classify data beyond whether they are numbers or words.\n",
"\n",
"In tabular datasets, each column contains a *variable* or a feature.\n",
"We need to consider the nature of each of these variables:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Numerical vs. Categorical\n",
"\n",
"Generally, a variable or feature can be classified as either _numerical_ or _categorical_:\n",
"\n",
"* *Numerical variable* has a specific quantitative value assigned. Examples: age, weight, memory usage, number of threads.\n",
"* *Categorical variable* is defined by the classes (categories) into which the variable may fall. Examples: eye color, ethnicity, application name, application type.\n",
"\n",
"#### Discrete vs. Continuous\n",
"\n",
"Data variables can further be described as continuous or discrete.\n",
"\n",
"* *Continuous variable* is represented by a real number that can assume any value in the range of the measuring scale. Examples: weight, speed, probability.\n",
"* *Discrete variable* takes on only discrete values which can be numerical or categorical. The possible values can be finite or infinite.\n",
" \n",
"#### Qualitative vs. Quantitative\n",
"\n",
"* *Qualitative variable* describes data as verbal groupings; the values may or may not have the notion of ranking, but their numerical difference cannot be defined. Categorical data is qualitative. An example of a qualitative variable that can be ranked is a user rating (e.g. poor, fair, good, excellent).\n",
"* *Quantitative data* describes data using numerical quantities and differences in values have definite meanings.\n",
"Examples include price, temperature, network bandwidth."
]
},
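{
"cell_type": "markdown",
"metadata": {},
"source": [
"The distinctions above can be checked in code. Below is a minimal sketch, using a made-up toy table (not the Sherlock data), of how _pandas_ records each column's type and how a column of strings can be declared categorical:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"\"\"\"Run this illustrative example (toy data, not the Sherlock dataset)\"\"\";\n",
"toy = pandas.DataFrame({\n",
"    'age': [21, 35, 42],                      # numerical, discrete\n",
"    'weight': [55.2, 80.1, 67.9],             # numerical, continuous\n",
"    'eye_color': ['blue', 'brown', 'blue']})  # categorical, qualitative\n",
"print(toy.dtypes)\n",
"# Declaring a column as categorical lets pandas treat it as classes:\n",
"toy['eye_color'] = toy['eye_color'].astype('category')\n",
"print(toy['eye_color'].cat.categories)"
]
},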
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"## 5. Cleaning Data\n",
"\n",
"This section discusses common approaches to cleaning data.\n",
"In practice, your judgment as the data analyst is very important, so as not to introduce bias into the data.\n",
"\n",
"### 5.1 Useless or Irrelevant Data\n",
"\n",
"Useless or irrelevant columns should be removed from the data.\n",
"*Reminder:* you can remove irrelevant columns using\n",
"`df.drop([COLUMN1, COLUMN2, ...], axis=1, inplace=True)` \n",
"\n",
"**EXAMPLE:**\n",
"Examine the features (columns) of `df2` DataFrame.\n",
"There is one column that is obviously irrelevant.\n",
"Remove that one irrelevant feature."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"\"\"\"Uncomment and run to identify one irrelevant feature (column).\n",
"The one below is just a starting point.\"\"\";\n",
"\n",
"#df2.head(20)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"\"\"\"Modify and uncomment to remove the irrelevant column\"\"\";\n",
"\n",
"#df2.#TODO"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 5.2 Missing Data\n",
"\n",
"Missing data is caused by several reasons. We will examine best practices for handling missing values.\n",
"Read the main lesson for a deeper understanding.\n",
"\n",
"#### Missing Data Exercise\n",
"Undertake the following:\n",
" * Create a toy DataFrame below with varied types of missing values\n",
" * Explore some _pandas_ functions below for identifying and handling missing values"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"\"\"\"Execute the following code to create a toy dataset with missing values\"\"\";\n",
"\n",
"nan = numpy.nan\n",
"ex0 = pandas.DataFrame([[1, 2, 3, 0],\n",
" [3, 4, nan, 1],\n",
" [nan, nan, nan, nan],\n",
" [nan, 3, nan, 4]],\n",
" columns=['A','B','C','D'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Use one or more of the following expressions to practice detection and handling of missing data:\n",
"\n",
"* `notnull()` and `isna()` methods detect defined (non-null) or undefined (null, missing) data, respectively;\n",
"* `dropna()` removes records or columns with missing values;\n",
"* `fillna()` fills the missing cells with a value.\n",
"\n",
"Here are some examples of detecting missing data to try below:\n",
"\n",
"* `ex0.notnull()`\n",
"* `ex0.isna()`\n",
"* `ex0.isna().sum()`\n",
"* `ex0.isna().sum(axis=0)`\n",
"* `ex0.isna().sum(axis=1)`\n",
"\n",
"What does each command mean?\n",
"\n",
"Here are some examples of handling missing data. Again, what does each command mean?\n",
"\n",
"* `ex0.dropna()`\n",
"* `ex0.dropna(how='all')`\n",
"* `ex0.dropna(axis=1)`\n",
"* `ex0.fillna(7)`\n",
"* `ex0.fillna(ex0.mean(skipna=True))`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"\"\"\"Experiment with each expression above in this cell\"\"\";\n",
"\n",
"#ex0.#TODO()\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#RUNIT\n",
"display(ex0.dropna())\n",
"display(ex0.dropna(how='all'))\n",
"display(ex0.dropna(axis=1))"
]
},
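{
"cell_type": "markdown",
"metadata": {},
"source": [
"A further option worth knowing (an aside, not required by the lesson) is the `thresh` parameter of `dropna()`: it keeps only rows having at least a given number of non-null values, a middle ground between `dropna()` and `dropna(how='all')`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"\"\"\"Optional: experiment with the thresh parameter of dropna()\"\"\";\n",
"# Keep only rows with at least 2 non-null values; on ex0 this drops\n",
"# the all-NaN row but keeps the partially filled ones.\n",
"ex0.dropna(thresh=2)"
]
},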
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Detecting Columns with Missing Data\n",
"\n",
"**EXAMPLE:** Identify feature(s) in `df2` that have some missing data.\n",
"*Hint:* Use one of the commands already demonstrated just above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"\"\"\"Write a code below to identify features with missing values in df2\"\"\";\n",
"\n",
"#df2.#TODO"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Hint:* Some of the exploratory functions we learned earlier can also unveil missing data.\n",
"Which one(s) is that?\n",
"\n",
"**QUESTION:** What do you want to do with the missing data?\n",
"Discuss the pros and cons of each of the possible choices!"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"\"\"\"Use this cell to fix the missing data\"\"\";\n",
"\n",
"#TODO"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 5.3 Duplicate Data\n",
"\n",
"There are many reasons that duplicate features can enter into a dataset.\n",
"Whether it happens during the data collection or in the integration of data, one must watch for duplicate data as they affect the quality of data--and the outcome of the analysis.\n",
"\n",
"> * `DataFrame.duplicated()` checks row by row for duplicates and returns `True` for each row that repeats an earlier one\n",
"> * `reset_index()` rearranges the row index\n",
"\n",
"#### Exercise\n",
"\n",
"In this exercise we undertake the following:\n",
"* Create a new dataset with duplicate data\n",
"* Identify duplicates in the dataset"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"\"\"\"Uncomment to create a dataset (df3) from df2 and create duplicates\"\"\";\n",
"\n",
"#df3 = df2.iloc[0:2].copy()\n",
"#df3.rename(index={0: 'a'},inplace=True)\n",
"#df3.rename(index={1: 'b'},inplace=True)\n",
"#df2 = pandas.concat([df3, df2])  # on older pandas, df3.append(df2) also works"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"\"\"\"Uncomment to check for duplicates\"\"\";\n",
"\n",
"#df2.duplicated()\n",
"#numpy.asarray(df2.duplicated()).nonzero()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"\"\"\"Modify and uncomment to remove duplicate data\"\"\";\n",
"#df2.drop(#TODO, axis=0, inplace=True)\n",
"#df2.reset_index(drop=True)"
]
},
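{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a self-contained illustration on toy data (separate from `df2`): `duplicated()` flags the second and later occurrences of a row, and the related `drop_duplicates()` method removes them in one step."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"\"\"\"Illustrative example on toy data: duplicated() and drop_duplicates()\"\"\";\n",
"toy = pandas.DataFrame({'A': [1, 2, 1, 3],\n",
"                        'B': ['x', 'y', 'x', 'z']})\n",
"print(toy.duplicated())      # True only for row 2, which repeats row 0\n",
"# drop_duplicates() removes the flagged rows; reset_index() renumbers them\n",
"print(toy.drop_duplicates().reset_index(drop=True))"
]
},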
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"## 6. Visualization\n",
"\n",
"Visualization is a method of presenting data visually in many different ways, each uncovering patterns and trends existing in data.\n",
"Visualization is indispensable when handling and analyzing massive amounts of data.\n",
"In this and the next few sections we will introduce some common visualizations that can greatly help the process of exploratory data analysis.\n",
"\n",
"We will explore many visualization capabilities provided by\n",
"Matplotlib and Seaborn libraries:\n",
"\n",
"* Matplotlib: the de facto standard Python 2D plotting library, supported in Python scripts, IPython shells, and other Python platforms including Jupyter.\n",
"The plotting capabilities are provided by the `pyplot` module in this library.\n",
"\n",
"* Seaborn: Provides a high-level interface for drawing attractive and informative statistical graphics.\n",
"By default, Seaborn uses Matplotlib as its backend.\n",
" \n",
"**Note** \n",
"\n",
"Use `pyplot.figure(figsize=(x_size, y_size))` in a cell to modify the display size of a single plot. This must be called before the plotting function."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 6.1 Count Plot\n",
"A count plot shows the number of occurrences of various values in categorical data.\n",
"\n",
"`seaborn.countplot(x=COLUMN_NAME, data=DataFrame)` plots a count plot of the `COLUMN_NAME` data within the DataFrame.\n",
"\n",
"**QUESTION:** How many records exist in `df2` for each application?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"\"\"\"Modify and uncomment to generate a Countplot of ApplicationName in 'df2' DataFrame\"\"\";\n",
"\n",
"#seaborn.countplot(x='#TODO', data=#TODO)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**The graph above is a count plot showing the number of records for each application in the `df2` DataFrame.**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"\"\"\"Using grouping and aggregation operations as studied earlier, cross-check the count plots above. \n",
" Modify and uncomment code below to complete task\"\"\";\n",
"\n",
"#df2.#TODO('#TODO').size()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 6.2 Histogram\n",
"\n",
"A histogram displays the distribution of values (shape and spread) in the form of vertical bars. The range of the values is split\n",
"into multiple bins on the horizontal axis. The frequency of values within each bin's range is displayed as a vertical bar for each bin.\n",
"\n",
"Taller bars show that more data points fall in those bin ranges. Horizontal and vertical bars can be exchanged as a variation.\n",
"\n",
"We will experiment with histograms using both matplotlib's pyplot and seaborn.\n",
"\n",
"#### 6.2.1 Histogram with Pyplot\n",
"\n",
"`pyplot.hist(df[COLUMN_NAME], bins=BIN_COUNT)` plots a histogram of `COLUMN_NAME` within `df`, with the number of bins equal to `BIN_COUNT`.\n",
"\n",
"##### Exercise\n",
"Using `pyplot`, create a histogram of `CPU_USAGE` in `df2` using 20 bins.\n",
"\n",
"Upon successful completion, observe the output.\n",
"A tuple of three components should be displayed, consisting of:\n",
"\n",
" * an array of counts for the individual bins;\n",
" * an array of bin edges;\n",
" * a list of 20 matplotlib `Patch` objects.\n",
"\n",
"The arrays are helpful for close-up analysis."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"\"\"\"Modify and uncomment to plot a histogram of CPU_USAGE in df2 DataFrame using 20 bins\"\"\";\n",
"\n",
"#hist_plot = pyplot.hist(#TODO,bins=#TODO)"
]
},
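{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see where the counts and bin edges come from without drawing anything, one can call `numpy.histogram` directly (the same computation `pyplot.hist` performs internally). A small sketch on toy numbers:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"\"\"\"Optional sketch: compute histogram counts without plotting (toy data)\"\"\";\n",
"data = [1, 1, 2, 3, 3, 3, 4, 9]\n",
"counts, edges = numpy.histogram(data, bins=4)\n",
"print(counts)   # how many values fall into each of the 4 bins\n",
"print(edges)    # 5 edges delimit 4 equal-width bins over [1, 9]"
]
},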
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 6.2.2 Histogram (Distribution) Plot with Seaborn\n",
"\n",
"`seaborn.distplot(df[COLUMN_NAME])` plots a histogram of `COLUMN_NAME` within `df`.\n",
"\n",
"##### Exercise\n",
"Using `seaborn`, create a histogram of `CPU_USAGE` in `df2`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"\"\"\"Modify and uncomment to plot a histogram of 'CPU_USAGE' in df2 DataFrame using seaborn\"\"\";\n",
"\n",
"#res_sns = seaborn.distplot(#TODO)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Exercise Plot 1\n",
"Plot a histogram of `priority` in `df2` using `pyplot`.\n",
"\n",
"Upon completing the exercise, you will observe that `priority` contains integer values with only a few distinct values.\n",
"Frequently, data appearing this way after plotting is a tell-tale sign of *categorical* or *ordinal* data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"\"\"\"Modify and uncomment to plot a histogram of priority in df2 using pyplot with 20 bins\"\"\";\n",
"\n",
"# Res = pyplot.hist(#TODO,#TODO)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Exercise Plot 2\n",
"Plot a histogram of `num_threads` in `df2` using pyplot.\n",
"\n",
"Upon completing the exercise, observe that the number of threads shows a multimodal distribution (two major peaks and several smaller ones)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"\"\"\"Modify and uncomment to plot a histogram of `num_threads` in df2 using pyplot with 20 bins\"\"\";\n",
"\n",
"\n",
"#Res_2 = pyplot.#TODO(df2['#TODO'],bins=#TODO)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Exercise Plot 3\n",
"We can further plot a histogram of `num_threads` grouped by application type.\n",
"\n",
"**TASK**\n",
"\n",
"Using `seaborn.distplot(DataFrame, kde=False, label='Application_Name')` once per application, create overlaid histograms of `num_threads` grouped by application name.\n",
"\n",
"**Observation**: Upon completion, can you observe the differences between the two applications?\n",
"Discuss the differences based on visual cues."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"\"\"\"Following the pattern of nthrds_FB for Facebook, define nthrds_WA for WhatsApp.\n",
"   Modify and uncomment the code below to run\"\"\";\n",
"\n",
"\n",
"nthrds_FB = df2[df2['ApplicationName'] == 'Facebook']['num_threads']\n",
"#nthrds_WA = #TODO\n",
"\n",
"#seaborn.distplot(nthrds_FB,kde=False,label='Facebook')\n",
"#seaborn.distplot(#TODO)\n",
"#pyplot.legend()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Discussion Questions\n",
"\n",
"* Can we plot a histogram of categorical data?\n",
"* Why does `seaborn.distplot` produce histogram bars with\n",
"  radically different heights than `pyplot.hist`?\n",
"  _Can we make them the same?_\n",
"* Do bins have to be of the same width?\n",
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* A histogram implies the data are numerically ordered, therefore it cannot display the distribution of categorical data.\n",
"  A count plot is the appropriate tool for categorical data, analogous to a histogram for numerical data.\n",
"\n",
"* The y (vertical) axis usually represents the frequency count. However, this count can be normalized (divided by the total\n",
"  count) to create a normalized histogram, i.e. a density function. `seaborn.distplot` produces a density function by\n",
"  default, whereas `pyplot.hist` produces a raw-count histogram.\n",
"\n",
"> To make `pyplot.hist` display like `seaborn.distplot`:\n",
"  `pyplot.hist(df2['CPU_USAGE'], bins=50, density=True)`\n",
"\n",
"> To make `seaborn.distplot` display raw counts:\n",
"  `seaborn.distplot(df2['CPU_USAGE'], norm_hist=False, kde=False)`\n",
"  Leaving `bins` unspecified causes seaborn to guesstimate a good number of bins for the given data.\n",
"\n",
"* Although histogram bins usually have the same width, one can manually specify bin edges using the `bins` argument.\n",
"  However, the interpretation of the bar height then varies depending on whether raw counts or a normalized density is displayed.\n",
"  With unequal bins, displaying a normalized density (so that the product of a bar's height and width corresponds to the fraction of values in that bin) is more logical."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 6.3 Box Plot\n",
"\n",
"A box plot displays the data distribution based on a five-number summary:\n",
"* Minimum\n",
"* First quartile (Q1)\n",
"* Median\n",
"* Third quartile (Q3)\n",
"* Maximum \n",
"\n",
"**Note**\n",
"> * Interquartile range (IQR): the 25th to the 75th percentile, i.e. Q3 - Q1.\n",
"> * “maximum”: Q3 + 1.5*IQR\n",
"> * “minimum”: Q1 - 1.5*IQR\n",
"\n",
"\n",
"Use `seaborn.boxplot(DataFrame['Column_Name'])` to create a box plot with seaborn.\n",
"\n",
"\n",
"#### Exercise\n",
"\n",
"In this exercise, create a box plot of the first 2000 records of `guest_time` in `df2`."
]
},
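{
"cell_type": "markdown",
"metadata": {},
"source": [
"The five-number summary and the whisker limits can also be computed directly with pandas, which helps interpret what the box plot draws. A sketch on toy numbers:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"\"\"\"Optional sketch: five-number summary and whisker limits (toy data)\"\"\";\n",
"s = pandas.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])\n",
"q1, med, q3 = s.quantile([0.25, 0.50, 0.75])\n",
"iqr = q3 - q1\n",
"print(q1, med, q3, iqr)\n",
"# Points beyond Q3 + 1.5*IQR or below Q1 - 1.5*IQR are drawn as outliers:\n",
"print(s[(s > q3 + 1.5*iqr) | (s < q1 - 1.5*iqr)].tolist())"
]
},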
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"\"\"\"Modify and uncomment to select the first 2000 records of 'guest_time' in the df2 DataFrame and plot them\"\"\";\n",
"\n",
"##TODO(df2[#TODO].guest_time)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 6.4 Bar Plot\n",
"\n",
"A bar plot displays an estimate of central tendency of a numerical variable, with the height of each rectangle being its mean value\n",
"and the error bar providing some indication of the uncertainty around that mean value.\n",
"\n",
"#### Exercise\n",
"Create a bar plot using seaborn:\n",
"`seaborn.barplot(x='ApplicationName', y='CPU_USAGE', data=df2)`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"\"\"\"Modify and Uncomment to plot a barplot\"\"\";\n",
"\n",
"#seaborn.barplot(x='ApplicationName', y='CPU_USAGE', data=df2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Questions and Discussions\n",
"\n",
"* Which types of data can a bar plot present?\n",
"* What are the black lines in the bar plot?\n",
"* Can you try other parameters and infer some interesting results?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#RUNIT\n",
"* The data are numerical.\n",
"* In a bar plot, the black lines are error bars; by default seaborn draws a confidence interval around the mean."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"## 7. Correlations Among Features\n",
"\n",
"In statistics, correlation or dependence is any statistical relationship, whether causal or not, between two random variables or bivariate data. In the broadest sense correlation is any statistical association, though it commonly refers to the degree to which a pair of variables are linearly related. [The lesson document provides further explanation](https://deapsecure.gitlab.io/deapsecure-lesson02-bd/30-data-wrangling-viz/index.html).\n",
"\n",
"While previous sections focused on individual features within a dataset, correlations often exist among features, and these can affect the quality of the dataset."
]
},
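{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before computing correlations on the Sherlock data, here is a minimal sketch on toy data of what `DataFrame.corr()` reports: the Pearson coefficient is +1 for a perfectly linear relation and near 0 when the variables are unrelated."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"\"\"\"Optional sketch: Pearson correlation on toy data\"\"\";\n",
"toy = pandas.DataFrame({'x': [1, 2, 3, 4, 5],\n",
"                        'y': [2, 4, 6, 8, 10],   # y = 2*x, perfectly linear\n",
"                        'z': [5, 1, 4, 2, 3]})   # no systematic relation\n",
"corr = toy.corr()\n",
"print(corr)\n",
"print(corr.loc['x', 'y'])   # ~1.0 for the perfect linear relation"
]
},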
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 7.1 Scatter Plot and Joint Plot\n",
"\n",
"A scatter plot places one variable on the horizontal axis and another on the vertical axis, revealing how much one variable is affected by the other.\n",
"\n",
"#### 7.1.1 Exercise\n",
"1. Create a scatter plot of the following columns in `df2` DataFrame:\n",
"* X-axis: `utime` Y-axis: `vsize`\n",
"* X-axis: `Mem` Y-axis: `vsize`\n",
"\n",
"> Code syntax: `seaborn.scatterplot(x=\"Column_A\", y=\"Column_B\", data=DataFrame)`\n",
"\n",
"\n",
"2. Create a joint plot of `Mem` on the x-axis and `vsize` on the y-axis\n",
" \n",
"> Code syntax: `seaborn.jointplot(x=\"Mem\", y=\"vsize\", data=df2)`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"\"\"\"plot a scatter plot of vsize against utime in df2 DataFrame and explain output\"\"\";\n",
"\n",
"#TODO"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"\"\"\"plot a scatter plot of vsize against Mem in df2 DataFrame and explain output\"\"\";\n",
"\n",
"#TODO"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"\"\"\"plot a jointplot of vsize against Mem in df2 DataFrame and explain output\"\"\";\n",
"\n",
"#TODO"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 7.2 Pair Plot\n",
"\n",
"To plot multiple pairwise bivariate distributions in a dataset, you can use the pairplot() function. This creates a matrix of
\n",
"axes and shows the relationship for each pair of columns in a DataFrame. By default, it also draws the univariate distribution
\n",
"of each variable on the diagonal Axes:\n",
"\n",
"\n",
"#### 7.2.1 Exercise\n",
"\n",
"Create a dataframe with plot a pairplot of it using codes below\n",
" * df2_demo=df2.iloc[:,5:9]\n",
" * seaborn.pairplot(df2_demo)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"\"\"\"Enter the code above to create a DataFrame and plot a pair plot\"\"\";\n",
"\n",
"#TODO"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 7.3 Heat Map\n",
"\n",
"A heat map displays data that depend on two independent variables as a color-coded image plot. The color indicates the\n",
"magnitude of the correlation.\n",
"\n",
"#### 7.3.1 Exercise\n",
"\n",
"Let us compute and plot the pairwise correlations among the variables in the dataset.\n",
"\n",
"We will use the following:\n",
"* `DataFrame.corr()` computes pairwise correlation of columns, excluding NA/null values.\n",
"* `pyplot.subplots()` creates a grid of plot figures.\n",
"* `seaborn.heatmap(DataFrame)` plots a heat map of a precomputed correlation DataFrame.\n",
"\n",
"\n",
"The exercise entails the following:\n",
"* Create a DataFrame of the computed correlations of `df2`\n",
"* Plot a heat map of the correlation DataFrame created.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"\"\"\"Modify and uncomment to create a heat map of df2 correlations\"\"\";\n",
"\n",
"#df_corr = #TODO\n",
"#pyplot.subplots(figsize=(12, 12)) \n",
"#seaborn.#TODO(#TODO, annot=True, vmax=1, square=True, cmap=\"Blues\")\n",
"#pyplot.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 7.3.2 Exercise\n",
"\n",
"Two variables have a linear relationship if changes in one affect the other proportionally by a constant:\n",
" \n",
" > Mathematically,\n",
" `var2 = constant * var1`\n",
" \n",
"For this exercise,\n",
" > Observe and discuss the heatmap plotted earlier\n",
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> * It appears that `vsize` has a perfect correlation with `Mem`. In fact, if you examine the data, `Mem` is identical to `vsize`.\n",
">   On the other hand, `utime` and `vsize` do not have this kind of relationship.\n",
"> \n",
"> * A linear relationship is one where increasing or decreasing one variable n times causes a corresponding increase\n",
">   or decrease of n times in the other variable. Observe that `vsize` grows whenever `Mem` grows, so that is a linear\n",
">   relationship; `utime` and `vsize` do not have this kind of relationship.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}