DeapSECURE module 2: Dealing with Big Data

Session 3: Data Wrangling and Visualization

Welcome to the DeapSECURE online training program! This is a Jupyter notebook for the hands-on learning activities of the "Big Data" module, Episode 5: "Data Wrangling and Visualization".

Data Preparation

When analyzing data, up to two-thirds of the time is actually spent preparing the data. This may sound like a waste of time, but that step is absolutely crucial to obtaining trustworthy insight from the data. The goal of data preparation is to achieve a clean, consistent and processable state of data.

Common issues with data include the following:

  • Missing data
  • Bad or inconsistent data
  • Duplicate data
  • Irrelevant data
  • Format mismatch
  • Representational issues

Data preparation is roughly made up of the following steps:

  • Data wrangling (or data munging)
  • Exploratory data analysis (EDA)
  • Feature engineering

This session will cover the first two steps above. We will give you a taste of a data scientist's work on data wrangling and exploratory data analysis. These two steps are interconnected. While many of the principles we learn here hold generally, each problem and dataset has its own specific issues that may not generalize. There is an art to this process, which must be learned through much practice and experience.

1. Setup Instructions

If you are opening this notebook from the Wahab OnDemand interface, you're all set.

If you see this notebook elsewhere, and want to perform the exercises on Wahab cluster, please follow the steps outlined in our setup procedure.

  1. Make sure you have activated your HPC service.
  2. Point your web browser to https://ondemand.wahab.hpc.odu.edu/ and sign in with your MIDAS ID and password.
  3. Create a new Jupyter session using "legacy" Python suite, then create a new "Python3" notebook. (See ODU HPC wiki for more detailed help.)
  4. Get the necessary files using commands below within Jupyter:

    mkdir -p ~/CItraining/module-bd
    cp -pr /scratch/Workshops/DeapSECURE/module-bd/. ~/CItraining/module-bd
    cd ~/CItraining/module-bd

The file name of this notebook is BigData-session-3.ipynb.

1.1 Reminder

We recommend you open the reference documents from the previous sessions in separate tabs or print them; they are a handy help when writing your own code.

1.2 Loading Python Libraries

Next, we need to import the required libraries into this Jupyter notebook: pandas, matplotlib.pyplot, and seaborn.

For Wahab cluster only: before importing these libraries, we have to load the following environment modules: py-pandas, py-matplotlib, py-seaborn:

In [ ]:
"""Run to load modules""";
module("load", "py-pandas", "py-matplotlib", "py-seaborn")

Now we can import all the required modules into Python:

In [ ]:
"""Run to import libraries""";
import pandas
import matplotlib
from matplotlib import pyplot
import numpy
import seaborn
%matplotlib inline
##^^ This is an ipython magic command to ensure images are rendered inline
In [ ]:
## Optional: Increase figure sizes globally.
## The default size is (6.4, 4.8)
#matplotlib.rcParams['figure.figsize'] = (10.0, 7.0)

2. Loading Sherlock Applications Dataset

We will be working with a significantly larger data file, sherlock/sherlock_mystery_2apps.csv, roughly 76 MB in size. Load the data into a DataFrame object named df2. This dataset still contains only two applications: WhatsApp and Facebook.

In [ ]:
"""Uncomment and modify to load sherlock_mystery_2apps.csv into df2""";

#df2 = pandas.#TODO("#TODO");
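If you get stuck, a minimal sketch looks like this (assuming your working directory is the module-bd directory from the setup step, so the relative path resolves):

    # Load the dataset into a DataFrame named df2
    df2 = pandas.read_csv("sherlock/sherlock_mystery_2apps.csv")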

2.1. Initial Exploration

Always perform initial exploration of the new dataset; use Pandas methods and attributes to answer the following questions:

  • How many rows and columns are in this dataset?
  • What do the numbers look like?
  • What does the statistical information look like?
  • What are the features' data types?

Hint: use a combination of the DataFrame attributes shape, dtypes, and/or methods like head, tail, describe, info.

In [ ]:
"""Uncomment to perform basic data exploration on df2 DataFrame""";

#df2.describe().T
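One possible exploration sequence, assuming df2 was loaded above:

    print(df2.shape)     # (number of rows, number of columns)
    print(df2.dtypes)    # data type of each feature
    df2.head()           # peek at the first five rows
    df2.info()           # column names, non-null counts, dtypes, memory usage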

QUESTION: Compare the numbers and statistics with those of the previous dataset (df_mystery, from file sherlock/sherlock_mystery.csv):

  • Compare the columns of the two tables.
  • Compare the sizes of the data.
  • How do the mean and std (standard deviation) compare between the two datasets? Are they similar? Do any statistics look significantly different?
  • Any difference in the range (min-max) and spread of the data?
In [ ]:
"""Uncomment to read in sherlock_mystery.csv and compare with df2 Dataframe""";

#df_mystery = pandas.#TODO(#TODO)
#df_mystery.describe().T

3. Data Preparation

As outlined in the introduction, data preparation must address issues such as missing, bad, duplicate, or irrelevant data, and roughly consists of three steps: data wrangling (data munging), exploratory data analysis (EDA), and feature engineering. This notebook covers the first two steps.

4. Data Wrangling (Data Munging)

Data wrangling transforms raw data into an appropriate and valuable format for a variety of downstream purposes including analytics. Data wrangling addresses issues such as the following:

  • Understanding the nature of each feature;
  • Handling missing data;
  • Removing duplicate, bad, or irrelevant data.

4.1 Types of Data

Pandas supports many data types, including: discrete numbers (ints), continuous numbers (floats), and strings. But to work effectively and properly with data, we need to further understand the nature and meaning of our data. There are different ways to classify data beyond whether they are numbers or words.

In tabular datasets, each column contains a variable or a feature. We need to consider the nature of each of these variables:

Numerical vs. Categorical

Generally, a variable or feature can be classified as either numerical or categorical:

  • A numerical variable has a specific quantitative value. Examples: age, weight, memory usage, number of threads.
  • A categorical variable is defined by the classes (categories) into which the variable may fall. Examples: eye color, ethnicity, application name, application type.

Discrete vs. Continuous

Data variables can further be described as continuous or discrete.

  • A continuous variable is represented by a real number that can assume any value in the range of the measuring scale. Examples: weight, speed, probability.
  • A discrete variable takes on only distinct values, which can be numerical or categorical. The set of possible values can be finite or infinite.

Qualitative vs. Quantitative

  • A qualitative variable describes data as verbal groupings; the values may or may not have a notion of ranking, but their numerical difference cannot be defined. Categorical data is qualitative. An example of a qualitative variable that can be ranked is a user rating (e.g. poor, fair, good, excellent).
  • A quantitative variable describes data using numerical quantities, where differences in values have definite meanings. Examples include price, temperature, network bandwidth.
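As an illustration, here is a small hypothetical DataFrame mixing the variable types above; pandas' "category" dtype explicitly marks a column as categorical (pandas is already imported earlier in this notebook):

    toy = pandas.DataFrame({
        'age': [23, 35, 41],                      # numerical, discrete
        'weight': [61.5, 80.2, 72.0],             # numerical, continuous
        'rating': ['poor', 'good', 'excellent'],  # qualitative, can be ranked
    })
    toy['rating'] = toy['rating'].astype('category')
    print(toy.dtypes)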

5. Cleaning Data

This section discusses approaches involved in cleaning data. In practice, your judgment as the data analyst is very important so as not to introduce bias into the data.

5.1 Useless or Irrelevant Data

Useless or irrelevant columns should be removed from the data. Reminder: you can remove irrelevant columns using df.drop([COLUMN1, COLUMN2, ...], axis=1, inplace=True)

EXAMPLE: Examine the features (columns) of df2 DataFrame. There is one column that is obviously irrelevant. Remove that one irrelevant feature.

In [ ]:
"""Uncomment and run to identify one irrelevant feature (column).
The one below is just a starting point.""";

#df2.head(20)
In [ ]:
"""Modify and uncomment to remove the irrelevant column""";

#df2.#TODO
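A hedged sketch of the removal step; the column name 'Unnamed: 0' is only a guess at a typical leftover index column, so substitute the irrelevant feature you actually found:

    # 'Unnamed: 0' is hypothetical -- replace with the column you identified
    df2.drop(['Unnamed: 0'], axis=1, inplace=True)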

5.2 Missing Data

Missing data can arise for several reasons. We will examine best practices for handling missing values; read the main lesson for a deeper understanding.

Missing Data Exercise

Undertake the following:

  • Create a toy DataFrame below with varied types of missing values
  • Explore some pandas functions below for identifying and handling missing values
In [ ]:
"""Execute the following code to create a toy dataset with missing values""";

nan = numpy.nan
ex0 = pandas.DataFrame([[1, 2, 3, 0],
                        [3, 4, nan, 1],
                        [nan, nan, nan, nan],
                        [nan, 3, nan, 4]],
                       columns=['A','B','C','D'])

Use one or more of the following expressions to practice detection and handling of missing data:

  • notnull() and isna() methods detect defined (non-null) or undefined (null, missing) data, respectively;
  • dropna() removes records or columns with missing values;
  • fillna() fills the missing cells with a value.

Here are some examples of detecting missing data to try below:

  • ex0.notnull()
  • ex0.isna()
  • ex0.isna().sum()
  • ex0.isna().sum(axis=0)
  • ex0.isna().sum(axis=1)

What does each command mean?

Here are some examples of handling missing data. Again, what does each command mean?

  • ex0.dropna()
  • ex0.dropna(how='all')
  • ex0.dropna(axis=1)
  • ex0.fillna(7)
  • ex0.fillna(ex0.mean(skipna=True))
In [ ]:
"""Experiment with each expression above in this cell""";

#ex0.#TODO()
In [ ]:
#RUNIT
display(ex0.dropna())
display(ex0.dropna(how='all'))
display(ex0.dropna(axis=1))

Detecting Columns with Missing Data

EXAMPLE: Identify feature(s) in df2 that have some missing data. Hint: Use one of the commands already demonstrated just above.

In [ ]:
"""Write a code below to identify features with missing values in df2""";

#df2.#TODO

Hint: Some of the exploratory functions we learned earlier can also unveil missing data. Which one(s)?

QUESTION: What do you want to do with the missing data? Discuss the pros and cons of each of the possible choices!

In [ ]:
"""Use this cell to fix the missing data""";

#TODO
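If you are unsure where to start, here is one possible approach (a judgment call, not the only valid choice): fill missing numeric values with the column mean, or drop the affected rows. COLUMN_NAME is a placeholder for the feature you identified above:

    df2.isna().sum()    # count missing values per column
    # Option 1: fill with the column mean
    #df2['COLUMN_NAME'] = df2['COLUMN_NAME'].fillna(df2['COLUMN_NAME'].mean())
    # Option 2: drop rows that contain missing values
    #df2.dropna(inplace=True)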

5.3 Duplicate Data

There are many reasons that duplicate features can enter into a dataset. Whether it happens during the data collection or in the integration of data, one must watch for duplicate data as they affect the quality of data--and the outcome of the analysis.

  • DataFrame.duplicated() checks row by row for duplicates and returns True for each duplicated row;
  • reset_index() rearranges the index.

Exercise

In this exercise we will:

  • Create a new dataset with duplicate data
  • Identify duplicates in dataset
In [ ]:
"""Uncomment to create a dataset (df3) from df2 and create duplicates""";

#df3 = df2.iloc[0:2].copy()
#df3.rename(index={0: 'a'},inplace=True)
#df3.rename(index={1: 'b'},inplace=True)
#df2 = df3.append(df2)
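Note: DataFrame.append was deprecated and removed in pandas 2.0. If the append call above fails on a newer pandas, an equivalent concatenation is:

    df2 = pandas.concat([df3, df2])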
In [ ]:
"""Uncomment to check for duplicates""";

#df2.duplicated()
#numpy.asarray(df2.duplicated()).nonzero()
In [ ]:
"""Modify and uncomment to remove duplicate data"""
#df2.drop(#RUNIT, axis=0,inplace=True)
#df2.reset_index(drop=True)

6. Visualization

Visualization presents data visually in many different ways, each uncovering patterns and trends in the data. Visualization is indispensable when handling and analyzing massive amounts of data. In this and the next few sections we will introduce some common visualizations that can greatly help the process of exploratory data analysis.

We will explore many visualization capabilities provided by Matplotlib and Seaborn libraries:

  • Matplotlib: The de facto 2D plotting library for Python, supported in Python scripts, IPython shells, and other Python platforms including Jupyter. The plotting capabilities are provided by the pyplot module in this library.

  • Seaborn: Provides a high-level interface for drawing attractive and informative statistical graphics. By default, Seaborn uses Matplotlib as its backend.

Note

Use pyplot.figure(figsize=(x_size, y_size)) in a cell to modify the display size of a single plot. This must be called before the plotting function.

6.1 Count Plot

A count plot shows the number of occurrences of various values in categorical data.

seaborn.countplot(x=COLUMN_NAME, data=DataFrame) plots a count plot of the COLUMN_NAME data within the DataFrame.

QUESTION: How many records exist in df2 for each application?

In [ ]:
"""Modify and uncomment to generate a Countplot of ApplicationName in 'df2' DataFrame""";

#seaborn.countplot(x='#TODO', data=#TODO)
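A possible solution sketch, using the ApplicationName column that labels each record's application:

    seaborn.countplot(x='ApplicationName', data=df2)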

The graph above is a count plot showing the number of records for each application in the df2 DataFrame.

In [ ]:
"""Using grouping and aggregation operations as studied earlier, cross-check the count plots above. 
    Modify and uncomment code below to complete task""";

#df2.#TODO('#TODO').size()
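A sketch of the cross-check with grouping and aggregation:

    df2.groupby('ApplicationName').size()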

6.2 Histogram

A histogram displays the distribution of values (shape and spread) in the form of vertical bars. The range of values is split into multiple bins on the horizontal axis; the frequency of values falling within each bin's range is displayed as a vertical bar.

Taller bars show that more data points fall in those bin ranges. As a variation, horizontal bars may be used instead of vertical ones.

We will experiment with histograms using both Matplotlib's pyplot and Seaborn.

6.2.1 Histogram with Pyplot

pyplot.hist(df[COLUMN_NAME], bins=BIN_COUNT) plots a histogram of COLUMN_NAME within the DataFrame df, using a number of bins equal to BIN_COUNT.

Exercise

Using pyplot, create a histogram of CPU_USAGE in df2 using 20 bins.

Upon successful completion, observe the output: a tuple of three components should be displayed:

  • An array of counts for the individual bins
  • An array of bin edges
  • A list of 20 matplotlib Patch objects

The arrays are helpful for close-up analysis.

In [ ]:
"""Modify and uncomment to plot a histogram of CPU_USAGE in df2 DataFrame using 20 bins""";

#hist_plot = pyplot.hist(#TODO,bins=#TODO)
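A possible solution (assuming df2 still contains the CPU_USAGE column):

    hist_plot = pyplot.hist(df2['CPU_USAGE'], bins=20)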

6.2.2 Histogram (Distribution) Plot with Seaborn

seaborn.distplot(df[COLUMN_NAME]) plots a histogram of COLUMN_NAME within df.

Exercise

Using seaborn create a histogram of CPU_USAGE in df2.

In [ ]:
"""Modify and uncomment to plot a histogram of 'CPU_USAGE' in df2 DataFrame using seaborn""";

#res_sns = seaborn.distplot(#TODO)
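A possible solution sketch. Note that distplot is deprecated in recent seaborn releases; seaborn.histplot(df2['CPU_USAGE'], kde=True) is the modern equivalent:

    res_sns = seaborn.distplot(df2['CPU_USAGE'])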

Exercise Plot 1

Plot a histogram of priority in df2 using pyplot.

Upon completing the exercise, you will observe that priority takes on only a few distinct integer values. Data that appears this way in a plot is frequently a tell-tale sign of categorical or ordinal data.

In [ ]:
"""Modify and uncomment to plot a histogram of priority in df2 using pyplot with 20 bins""";

# Res = pyplot.hist(#TODO,#TODO)

Exercise Plot 2

Plot a histogram of num_threads in df2 using pyplot.

Upon completing the exercise, observe that the number of threads shows a multimodal distribution (two major peaks and several smaller ones).

In [ ]:
"""Modify and uncomment to plot a histogram of `num_threads` in df2 using pyplot with 20 bins""";


#Res_2 = pyplot.#TODO(df2['#TODO'],bins=#TODO)

Exercise Plot 3

We can further plot a histogram of num_threads grouped by application type.

TASK

Using seaborn.distplot(DataFrame, kde=False, label='Application_Name'), create a plot of the num_threads distribution for each application.

Observation: Upon completion, can you observe the differences between the two applications? Discuss the differences based on visual cues.

In [ ]:
"""Similar to nthrds_FB for Facebook, complete same with nthrds_WA for WhatsApp.
    Modify and uncomment appropriate to run""";


nthrds_FB = df2[df2['ApplicationName'] == 'Facebook']['num_threads']
#nthrds_WA = #TODO

#seaborn.distplot(nthrds_FB,kde=False,label='Facebook')
#seaborn.distplot(#TODO)
#pyplot.legend()
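A hedged completion of the WhatsApp half, mirroring the Facebook lines above:

    nthrds_WA = df2[df2['ApplicationName'] == 'WhatsApp']['num_threads']
    seaborn.distplot(nthrds_FB, kde=False, label='Facebook')
    seaborn.distplot(nthrds_WA, kde=False, label='WhatsApp')
    pyplot.legend()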

Discussion Questions

  • Can we plot a histogram of categorical data?
  • Why does seaborn.distplot produce histogram bars with radically different heights than pyplot.hist? Can we make them the same?
  • Do bins have to be of the same width?

RUNIT

  • A histogram implies the data is numerically ordered, so it cannot display the distribution of categorical data. The count plot is the appropriate tool here; it is the analog of the histogram for categorical data.

  • The y (vertical) axis usually represents a frequency count. However, this count can be normalized by dividing by the total count, yielding a normalized histogram (i.e. a density function). seaborn.distplot produces a density function by default, whereas pyplot.hist produces a raw-count histogram.

    To make pyplot.hist display a density like seaborn.distplot: pyplot.hist(df2['CPU_USAGE'], bins=50, density=True). (Very old Matplotlib versions used normed=True instead of density=True.)

    To make seaborn.distplot display raw counts: seaborn.distplot(df2['CPU_USAGE'], norm_hist=False, kde=False). Leaving bins unspecified causes seaborn to estimate a good number of bins for the given data.

  • Although histogram bins usually have the same width, one can manually specify bin edges using the bins argument. The interpretation of the bar heights then depends on whether raw counts or a normalized density is displayed; with unequal bin widths, a normalized density (where the product of a bar's height and width corresponds to the count of values in that bin) is the more logical choice.

6.3 Box Plot

A box plot displays the data distribution based on a five-number summary:

  • Minimum
  • First quartile (Q1)
  • Median
  • Third quartile (Q3)
  • Maximum

Note

  • Interquartile range (IQR): the range from the 25th percentile (Q1) to the 75th percentile (Q3).
  • "Maximum" whisker: Q3 + 1.5*IQR
  • "Minimum" whisker: Q1 - 1.5*IQR

Use seaborn.boxplot(DataFrame['Column_Name']) to create a box plot with seaborn.

Exercise

In this exercise, create a box plot of the first 2000 records of guest_time in df2.

In [ ]:
"""Modify and uncomment to select and plot all records of 'guest_time' in 'df2'  Dataframe""";

##TODO(df2[#TODO].guest_time)
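A possible solution, slicing the first 2000 rows before plotting:

    seaborn.boxplot(df2.iloc[0:2000].guest_time)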

6.4 Bar Plot

A bar plot displays an estimate of the central tendency of a numerical variable, with the height of each rectangle being its mean value and the error bar providing some indication of the uncertainty around that mean value.

Exercise

Create a bar plot using seaborn: seaborn.barplot(x='ApplicationName', y='CPU_USAGE', data=df2)

In [ ]:
"""Modify and Uncomment to plot a barplot""";

#seaborn.barplot(x='ApplicationName', y='CPU_USAGE', data=df2)

Questions and Discussions

  • Which types of data can a bar plot present?
  • What are the black lines in the bar plot?
  • Can you try other parameters and infer some interesting results?

RUNIT

  • The data are numerical.
  • In a bar plot, the black lines are error bars indicating the uncertainty around the estimated mean (by default, seaborn shows a bootstrapped 95% confidence interval).

7. Correlations Among Features

In statistics, correlation or dependence is any statistical relationship, whether causal or not, between two random variables or bivariate data. In the broadest sense correlation is any statistical association, though it commonly refers to the degree to which a pair of variables are linearly related. Lesson document provides further explanation.

While the previous sections focused on individual features within a dataset, correlations often exist among these features, which could affect the quality of the dataset.

7.1 Scatter Plot and Joint Plot

A scatter plot displays data points with one variable on the horizontal axis and another on the vertical axis, showing how much one variable is affected by the other.

7.1.1 Exercise

  1. Create a scatter plot for each of the following column pairs in the df2 DataFrame:

     • X-axis: utime, Y-axis: vsize
     • X-axis: Mem, Y-axis: vsize

     Code syntax: seaborn.scatterplot(x="Column_A", y="Column_B", data=DataFrame)

  2. Create a joint plot using Mem on the x-axis and vsize on the y-axis.

     Code syntax: seaborn.jointplot(x="Mem", y="vsize", data=df2)

In [ ]:
"""plot a scatter plot of vsize against utime in df2 DataFrame and explain output""";

#TODO
In [ ]:
"""plot a scatter plot of vsize against Mem in df2 DataFrame and explain output""";

#TODO
In [ ]:
"""plot a jointplot of vsize against Mem in df2 DataFrame and explain output""";

#TODO
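Possible solution sketches; run each plot in its own cell so the figures do not draw on top of each other:

    seaborn.scatterplot(x='utime', y='vsize', data=df2)

    seaborn.scatterplot(x='Mem', y='vsize', data=df2)

    seaborn.jointplot(x='Mem', y='vsize', data=df2)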

7.2 Pair Plot

To plot multiple pairwise bivariate distributions in a dataset, you can use the pairplot() function. This creates a matrix of axes and shows the relationship for each pair of columns in a DataFrame. By default, it also draws the univariate distribution of each variable on the diagonal axes.

7.2.1 Exercise

Create a DataFrame and plot a pairplot of it using the code below:

  • df2_demo=df2.iloc[:,5:9]
  • seaborn.pairplot(df2_demo)
In [ ]:
"""Enter codes above to create a dataframe and plot a pairplot""";

#TODO

7.3 Heat Map

A heat map displays data that depend on two independent variables as a color-coded image plot. Here, the color indicates the magnitude of the correlation.

7.3.1 Exercise

Let us compute and plot the pairwise correlations among the variables in the dataset.

We will use the following:

  • DataFrame.corr() computes the pairwise correlation of columns, excluding NA/null values.
  • pyplot.subplots() creates a figure and a grid of subplots.
  • seaborn.heatmap(DataFrame) plots a heat map of the computed correlation DataFrame.

The exercise entails the following:

  • Create a DataFrame of the computed correlations of df2.
  • Plot a heat map of the correlation DataFrame created.
In [ ]:
"""Modify and uncomment to create a heat map of df2 correlations""";

#df_corr = #TODO
#pyplot.subplots(figsize=(12, 12)) 
#seaborn.#TODO(#TODO, annot=True, vmax=1, square=True, cmap="Blues")
#pyplot.show()
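A possible solution sketch; on newer pandas versions you may need df2.corr(numeric_only=True) if any non-numeric columns remain in df2:

    df_corr = df2.corr()
    pyplot.subplots(figsize=(12, 12))
    seaborn.heatmap(df_corr, annot=True, vmax=1, square=True, cmap="Blues")
    pyplot.show()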

7.3.2 Exercise

Two variables have a linear relationship if changes in one affect the other proportionally by a constant factor.

Mathematically var2 = constant * var1

For this exercise, observe and discuss the heat map plotted earlier.

  • It appears that vsize has a perfect correlation with Mem. In fact, if you examine the data, Mem is identical to vsize. On the other hand, utime and vsize do not have this kind of relationship.
  • A linear relationship is one where increasing or decreasing one variable n times causes a corresponding n-fold increase or decrease in the other variable. Observe that vsize grows whenever Mem grows, so that is a linear relationship; utime and vsize do not have this kind of relationship.