DeapSECURE module 2: Dealing with Big Data
Welcome to the DeapSECURE online training program! This is a Jupyter notebook for the hands-on learning activities of the "Big Data" module, Episode 5: "Data Wrangling and Visualization".
When analyzing data, up to two-thirds of the time is actually spent preparing the data. This may sound like a waste of time, but that step is absolutely crucial to obtaining trustworthy insight from the data. The goal of data preparation is to achieve a clean, consistent and processable state of data.
Common issues with data include the following:
Data preparation is roughly made up of the following steps:
This session will cover the first two steps above. We will give you a taste of a data scientist's work on data wrangling and exploratory data analysis; these two steps are interconnected and often performed together. While many of the principles we learn here hold in general, each problem and dataset has its own specific issues that may not generalize. There is an art to this process, which needs to be learned through much practice and experience.
QUICK LINKS
If you are opening this notebook from the Wahab OnDemand interface, you're all set.
If you see this notebook elsewhere, and want to perform the exercises on Wahab cluster, please follow the steps outlined in our setup procedure.
Get the necessary files using the following commands within Jupyter:
mkdir -p ~/CItraining/module-bd
cp -pr /scratch/Workshops/DeapSECURE/module-bd/. ~/CItraining/module-bd
cd ~/CItraining/module-bd
The file name of this notebook is BigData-session-3.ipynb.
Throughout this notebook, #TODO is used as a placeholder where you need to fill in something appropriate.
To run the code in a cell, press Shift+Enter.
Summary table of the commonly used indexing syntax from our own lesson.
We recommend you open these in separate tabs or print them; they are handy references for writing your own code.
Next, we need to import the required libraries into this Jupyter Notebook: pandas, matplotlib.pyplot, and seaborn.
For the Wahab cluster only: before importing these libraries, we have to load the following environment modules: py-pandas, py-matplotlib, py-seaborn:
"""Run to load modules""";
module("load", "py-pandas", "py-matplotlib", "py-seaborn")
Now we can import all the required modules into Python:
"""Run to import libraries""";
import pandas
import matplotlib
from matplotlib import pyplot
import numpy
import seaborn
%matplotlib inline
##^^ This is an ipython magic command to ensure images are rendered inline
## Optional: Increase figure sizes globally.
## The default size is (6.4, 4.8)
#matplotlib.rcParams['figure.figsize'] = (10.0, 7.0)
We will be working with a significantly larger data file, sherlock/sherlock_mystery_2apps.csv, roughly 76 MB in size. Load the data into a DataFrame object named df2. This dataset still contains only two applications, WhatsApp and Facebook.
"""Uncomment and modify to load sherlock_mystery_2apps.csv into df2""";
#df2 = pandas.#TODO("#TODO");
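If you get stuck, here is a minimal sketch of one possible completion (assuming the CSV path is relative to this notebook):
"""Possible completion (sketch): load the CSV into a DataFrame""";
df2 = pandas.read_csv("sherlock/sherlock_mystery_2apps.csv")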
Always perform an initial exploration of a new dataset; use Pandas methods and attributes to answer the following questions:
Hint: use a combination of the DataFrame attributes shape, dtypes, and/or methods like head, tail, describe, info.
"""Uncomment to perform basic data exploration on df2 DataFrame""";
#df2.describe().T
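If you are unsure where to start, here is a sketch of typical first-look calls (all standard Pandas; try them one at a time in a cell):
"""Sketch: common first-look inspection calls (run one per cell)""";
df2.shape       # dimensions: (number of rows, number of columns)
df2.dtypes      # data type of each column
df2.info()      # column names, non-null counts, dtypes, memory usage
df2.head()      # peek at the first five records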
QUESTION: Compare the numbers and statistics with those of the previous dataset (df_mystery, from file sherlock/sherlock_mystery.csv):
"""Uncomment to read in sherlock_mystery.csv and compare with df2 Dataframe""";
#df_mystery = pandas.#TODO(#TODO)
#df_mystery.describe().T
Data wrangling transforms raw data into an appropriate and valuable format for a variety of downstream purposes including analytics. Data wrangling addresses issues such as the following:
Pandas supports many data types, including: discrete numbers (ints), continuous numbers (floats), and strings. But to work effectively and properly with data, we need to further understand the nature and meaning of our data. There are different ways to classify data beyond whether they are numbers or words.
In tabular datasets, each column contains a variable or a feature. We need to consider the nature of each of these variables:
Generally, a variable or feature can be classified as either numerical or categorical:
Data variables can further be described as continuous or discrete.
This section discusses approaches involved in cleaning data. In practice, your judgment as the data analyst is very important so as not to introduce bias into the data.
Useless or irrelevant columns should be removed from the data.
Reminder: you can remove irrelevant columns using
df.drop([COLUMN1, COLUMN2, ...], axis=1, inplace=True)
EXAMPLE:
Examine the features (columns) of the df2 DataFrame.
There is one column that is obviously irrelevant.
Remove that one irrelevant feature.
"""Uncomment and run to identify one irrelevant feature (column).
The one below is just a starting point.""";
#df2.head(20)
"""Modify and uncomment to remove the irrelevant column""";
#df2.#TODO
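For illustration only, a sketch of the removal step; 'Unnamed: 0' is a hypothetical column name (a leftover index column that CSV exports often produce), so replace it with the column you actually identified:
"""Sketch: drop an irrelevant column; 'Unnamed: 0' is a hypothetical name""";
df2.drop(['Unnamed: 0'], axis=1, inplace=True)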
Missing data arises for several reasons. We will examine best practices for handling missing values. Read the main lesson for a deeper understanding.
Undertake the following:
"""Execute the following code to create a toy dataset with missing values""";
nan = numpy.nan
ex0 = pandas.DataFrame([[1, 2, 3, 0],
[3, 4, nan, 1],
[nan, nan, nan, nan],
[nan, 3, nan, 4]],
columns=['A','B','C','D'])
Use one or more of the following expressions to practice detection and handling of missing data:
- notnull() and isna() detect defined (non-null) or undefined (null, missing) data, respectively;
- dropna() removes records or columns with missing values;
- fillna() fills the missing cells with a value.
Here are some examples of detecting missing data to try below:
ex0.notnull()
ex0.isna()
ex0.isna().sum()
ex0.isna().sum(axis=0)
ex0.isna().sum(axis=1)
What does each command mean?
Here are some examples of handling missing data: What does each command mean?
ex0.dropna()
ex0.dropna(how='all')
ex0.dropna(axis=1)
ex0.fillna(7)
ex0.fillna(ex0.mean(skipna=True))
"""Experiment with each expression above in this cell""";
#ex0.#TODO()
#RUNIT
display(ex0.dropna())
display(ex0.dropna(how='all'))
display(ex0.dropna(axis=1))
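For instance, one way to combine detection and handling on the toy dataset (a sketch; filling with column means is only one of several defensible choices):
"""Sketch: fill missing cells with each column's mean, then verify""";
filled = ex0.fillna(ex0.mean(skipna=True))   # per-column means, passed as a Series
print(filled.isna().sum())                   # all zeros: no missing cells remain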
EXAMPLE: Identify feature(s) in df2 that have some missing data.
Hint: Use one of the commands already demonstrated just above.
"""Write a code below to identify features with missing values in df2""";
#df2.#TODO
Hint: Some of the exploratory functions we learned earlier can also unveil missing data. Which ones?
QUESTION: What do you want to do with the missing data? Discuss the pros and cons of each of the possible choices!
"""Use this cell to fix the missing data""";
#TODO
There are many reasons that duplicate records can enter a dataset. Whether it happens during data collection or during the integration of data, one must watch for duplicate data, as duplicates affect the quality of the data--and the outcome of the analysis.
- DataFrame.duplicated() checks row after row for duplicates, returning True for each duplicated row;
- reset_index() renumbers the index.
In this exercise we undertake the following:
"""Uncomment to create a dataset (df3) from df2 and create duplicates""";
#df3 = df2.iloc[0:2].copy()
#df3.rename(index={0: 'a'},inplace=True)
#df3.rename(index={1: 'b'},inplace=True)
#df2 = pandas.concat([df3, df2])   # DataFrame.append was removed in pandas 2.0
"""Uncomment to check for duplicates""";
#df2.duplicated()
#numpy.asarray(df2.duplicated()).nonzero()
"""Modify and uncomment to remove duplicate data"""
#df2.drop(#RUNIT, axis=0,inplace=True)
#df2.reset_index(drop=True)
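Pandas also offers a one-step alternative, DataFrame.drop_duplicates(), which removes duplicated rows directly; a sketch:
"""Sketch: remove duplicate rows in one call, then renumber the index""";
df2 = df2.drop_duplicates().reset_index(drop=True)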
Visualization is a method of presenting data visually in many different ways, each uncovering patterns and trends existing in the data. Visualization is indispensable when handling and analyzing massive amounts of data. In this and the next few sections we will introduce some common visualizations that can greatly help the process of exploratory data analysis.
We will explore many visualization capabilities provided by Matplotlib and Seaborn libraries:
Matplotlib: A de facto Python 2D plotting library supported in Python scripts, IPython shells and other Python platforms including Jupyter.
The plotting capabilities are provided by the pyplot module in this library.
Seaborn: Provides a high-level interface for drawing attractive and informative statistical graphics. By default, Seaborn uses Matplotlib as its backend.
Note
Use pyplot.figure(figsize=(x_size, y_size)) in a cell to modify the display size of a single plot. This should be specified before calling the plotting function.
A count plot shows the number of occurrences of various values in categorical data.
seaborn.countplot(x=COLUMN_NAME, data=DataFrame) plots a count plot of the COLUMN_NAME data within the DataFrame.
QUESTION: How many records exist in df2 for each application?
"""Modify and uncomment to generate a Countplot of ApplicationName in 'df2' DataFrame""";
#seaborn.countplot(x='#TODO', data=#TODO)
The graph above displays a count plot of the number of records of each application in the df2 DataFrame.
"""Using grouping and aggregation operations as studied earlier, cross-check the count plots above.
Modify and uncomment code below to complete task""";
#df2.#TODO('#TODO').size()
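Another quick cross-check (a sketch) is value_counts(), which tallies the occurrences of each distinct value in a column:
"""Sketch: count records per application with value_counts""";
df2['ApplicationName'].value_counts()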
A histogram displays the distribution of values (shape and spread) in the form of vertical bars. The range of the values is split into multiple bins on the horizontal axis. The frequency of values within each bin's range is displayed as a vertical bar for each bin.
Taller bars show that more data points fall in those bin ranges. As a variation, horizontal bars can be used in place of vertical ones.
We will experiment with histograms using both matplotlib's pyplot and seaborn.
Use pyplot.hist(df[COLUMN_NAME], bins=BIN_COUNT) to plot a histogram of COLUMN_NAME within the DataFrame df, with the number of bins equal to BIN_COUNT.
Using pyplot, create a histogram of CPU_USAGE in df2 using 20 bins.
Upon successful completion, observe the output. A tuple of three components should be displayed; these include:
- an array of counts in the individual bins;
- an array of bin edges;
- a list of 20 matplotlib Patch objects.
The arrays are helpful for close-up analysis.
"""Modify and uncomment to plot a histogram of CPU_USAGE in df2 DataFrame using 20 bins""";
#hist_plot = pyplot.hist(#TODO,bins=#TODO)
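For close-up analysis, the returned tuple can be unpacked; a sketch of one possible completion:
"""Sketch: unpack the histogram return value for inspection""";
counts, edges, patches = pyplot.hist(df2['CPU_USAGE'], bins=20)
print(counts[:5])    # record counts in the first five bins
print(edges[:6])     # the bin edges bounding those five bins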
"""Modify and uncomment to plot a histogram of 'CPU_USAGE' in df2 DataFrame using seaborn""";
#res_sns = seaborn.distplot(#TODO)
Plot a histogram of priority in df2 using pyplot.
Upon completing the exercise, you will observe that priority contains integer values, with only a few distinct values displayed. Frequently, data appearing as such after plotting is a tell-tale sign of categorical or ordinal data.
"""Modify and uncomment to plot a histogram of priority in df2 using pyplot with 20 bins""";
# Res = pyplot.hist(#TODO,#TODO)
Plot a histogram of num_threads in df2 using pyplot.
Upon completing the exercise, observe that the number of threads shows a multimodal distribution (two major peaks and some smaller ones).
"""Modify and uncomment to plot a histogram of `num_threads` in df2 using pyplot with 20 bins""";
#Res_2 = pyplot.#TODO(df2['#TODO'],bins=#TODO)
We can further plot histograms of num_threads grouped by application type.
TASK
Using seaborn.distplot(DataFrame, kde=False, label='Application_Name'), create overlaid histograms, one per application type.
Observation: Upon completion, can you observe the differences between the two applications? Discuss the differences based on visual cues.
"""Similar to nthrds_FB for Facebook, complete same with nthrds_WA for WhatsApp.
Modify and uncomment appropriate to run""";
nthrds_FB = df2[df2['ApplicationName'] == 'Facebook']['num_threads']
#nthrds_WA = #TODO
#seaborn.distplot(nthrds_FB,kde=False,label='Facebook')
#seaborn.distplot(#TODO)
#pyplot.legend()
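Note: seaborn.distplot is deprecated in seaborn 0.11 and later. If it is unavailable in your seaborn version, seaborn.histplot is the closest replacement; a sketch:
"""Sketch (only if distplot is unavailable): overlaid histograms with histplot""";
seaborn.histplot(nthrds_FB, label='Facebook')
seaborn.histplot(nthrds_WA, label='WhatsApp')   # assumes nthrds_WA was defined above
pyplot.legend()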
QUESTION: Why does seaborn.distplot produce histogram bars with radically different heights from pyplot.hist? Can we make them the same?
Note: A histogram implies that the data is numerically ordered, so it cannot display the distribution of categorical data. A count plot is the appropriate tool for categorical data, analogous to a histogram for numerical data.
The y (vertical) axis usually represents the frequency count. However, this count can be normalized by dividing by the total count to create a normalized histogram (i.e. a density function). seaborn.distplot produces a density function by default, whereas pyplot.hist produces a raw-count histogram.
To make pyplot.hist display like seaborn.distplot:
pyplot.hist(df2['CPU_USAGE'], bins=50, density=True)
(density was formerly called normed; the old name has been removed from recent matplotlib releases.)
To make seaborn.distplot display raw counts:
seaborn.distplot(df2['CPU_USAGE'], norm_hist=False, kde=False)
Leaving the bins argument unspecified causes seaborn to estimate a good number of bins for the given data; you can override this with the bins argument. However, the interpretation of the bar height varies based on whether a raw count or a normalized density of values is required. Displaying a normalized density, where the product of the height and the width of a bar corresponds to the fraction of values in that bin, is often the more logical choice.
A box plot displays the data distribution based on a five-number summary, as follows:
Note
- Interquartile range (IQR): 25th to the 75th percentile.
- “maximum”: Q3 + 1.5*IQR
- “minimum”: Q1 - 1.5*IQR
Use seaborn.boxplot(DataFrame['Column_Name']) to create a box plot with seaborn.
In this exercise, create a box plot of the first 2000 records of guest_time in df2.
"""Modify and uncomment to select and plot all records of 'guest_time' in 'df2' Dataframe""";
##TODO(df2[#TODO].guest_time)
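One possible completion as a sketch (the slice selects the first 2000 records by position):
"""Sketch: box plot of the first 2000 'guest_time' records""";
seaborn.boxplot(df2[0:2000].guest_time)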
A bar plot displays an estimate of the central tendency of a numerical variable, with the height of each rectangle being its mean value and the error bar providing some indication of the uncertainty around that mean value.
Create a bar plot using seaborn:
seaborn.barplot(x='ApplicationName', y='CPU_USAGE', data=df2)
"""Modify and Uncomment to plot a barplot""";
#seaborn.barplot(x='ApplicationName', y='CPU_USAGE', data=df2)
In statistics, correlation or dependence is any statistical relationship, whether causal or not, between two random variables or bivariate data. In the broadest sense correlation is any statistical association, though it commonly refers to the degree to which a pair of variables are linearly related. The lesson document provides further explanation.
While the previous sections focused on individual features within a dataset, correlations often exist among these features, which could affect the quality of the dataset.
A scatter plot represents data points with one variable on the horizontal axis and another on the vertical axis, showing how much one variable is affected by the other.
TASK: Using the df2 DataFrame, create scatter plots of the following pairs:
- X-axis: utime, Y-axis: vsize
- X-axis: Mem, Y-axis: vsize
Code syntax:
seaborn.scatterplot(x="Column_A", y="Column_B", data=DataFrame)
Also create a joint plot with Mem on the x-axis and vsize on the y-axis.
Code syntax:
seaborn.jointplot(x="Mem", y="vsize", data=df2)
"""plot a scatter plot of vsize against utime in df2 DataFrame and explain output""";
#TODO
"""plot a scatter plot of vsize against Mem in df2 DataFrame and explain output""";
#TODO
"""plot a jointplot of vsize against Mem in df2 DataFrame and explain output""";
#TODO
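Possible completions as a sketch; run each plot in its own cell so every figure renders:
"""Sketch: possible completions for the three plots above (one per cell)""";
seaborn.scatterplot(x='utime', y='vsize', data=df2)
#seaborn.scatterplot(x='Mem', y='vsize', data=df2)
#seaborn.jointplot(x='Mem', y='vsize', data=df2)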
To plot multiple pairwise bivariate distributions in a dataset, you can use the pairplot() function. This creates a matrix of axes and shows the relationship for each pair of columns in a DataFrame. By default, it also draws the univariate distribution of each variable on the diagonal axes.
Create a DataFrame and plot a pairplot of it using the code below.
"""Enter code to create a DataFrame and plot a pairplot of it""";
#TODO
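A sketch of one approach, selecting a few numerical columns so the plot matrix stays readable (the column list is illustrative; pick columns that exist in df2):
"""Sketch: pairplot of a small, illustrative subset of columns""";
df_small = df2[['CPU_USAGE', 'num_threads', 'vsize']]
seaborn.pairplot(df_small)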
A heat map displays data that depends on two independent variables as a color-coded image plot. The color indicates the magnitude of the correlation.
Let us compute and plot the pairwise correlation among pairs of variables in the dataset.
We will use the following:
- DataFrame.corr(): computes pairwise correlation of columns, excluding NA/null values;
- pyplot.subplots(): creates a figure and a grid of plot axes;
- seaborn.heatmap(DataFrame): plots a heat map from a computed correlation DataFrame.
The exercise entails the following: compute the pairwise correlations of the columns of df2, then plot them as a heat map.
"""Modify and uncomment to create a heat map of df2 correlations""";
#df_corr = #TODO
#pyplot.subplots(figsize=(12, 12))
#seaborn.#TODO(#TODO, annot=True, vmax=1, square=True, cmap="Blues")
#pyplot.show()
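A sketch of one possible completion (on pandas 1.5 or later, pass numeric_only=True to corr() so that non-numeric columns such as ApplicationName are skipped):
"""Sketch: compute the correlation matrix and render it as a heat map""";
df_corr = df2.corr(numeric_only=True)
pyplot.subplots(figsize=(12, 12))
seaborn.heatmap(df_corr, annot=True, vmax=1, square=True, cmap="Blues")
pyplot.show()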
Two variables have a linear relationship if changes in one affect the other proportionally by a constant factor. Mathematically:
var2 = constant * var1
For this exercise, observe and discuss the heat map plotted earlier.
- It appears that vsize has a perfect correlation with Mem. In fact, if you examine the data, Mem is identical to vsize. On the other hand, utime and vsize do not have this kind of relationship.
- A linear relationship is one where increasing or decreasing one variable n times will cause a corresponding increase or decrease of n times in the other variable too. Observe that vsize grows as Mem grows, so that is a linear relationship; utime and vsize do not have this kind of relationship.