DeapSECURE module 2: Dealing with Big Data
Welcome to the DeapSECURE online training program! This is a Jupyter notebook for the hands-on learning activities of the "Big Data" module, Episode 4: "Analytics of Sherlock Data with Pandas".
In this notebook we will learn how to manipulate data in pandas to extract knowledge from our data. To do this, we will need to know several basic building blocks of data manipulation shown below:
Quick Links (sections of this notebook)
These building blocks can be combined to create analytic pipelines to provide answers to detailed questions such as:
If you are opening this notebook from Wahab cluster's OnDemand interface, you're all set.
If you see this notebook elsewhere and want to perform the exercises on Wahab cluster, please follow the steps outlined in our setup procedure.
Get the necessary files using commands below within Jupyter:
mkdir -p ~/CItraining/module-bd
cp -pr /scratch/Workshops/DeapSECURE/module-bd/. ~/CItraining/module-bd
cd ~/CItraining/module-bd
The file name of this notebook is BigData-session-2.ipynb.
Throughout this notebook, #TODO is used as a placeholder where you need to fill in something appropriate.
To run the code in a cell, press Shift+Enter.
Summary table of the commonly used indexing syntax from our own lesson.
We recommend you open these in separate tabs or print them; they are a handy reference when writing your own code.
Next, we need to import the required libraries into this Jupyter Notebook: pandas, matplotlib.pyplot, and seaborn.
For Wahab cluster only: before importing these libraries, we have to load the following environment modules: py-pandas, py-matplotlib, py-seaborn:
module("load", "py-pandas", "py-matplotlib", "py-seaborn")
Now we can import all the required modules into Python:
"""Uncomment and edit code below to import libraries.
Replace #TODO with the appropriate entity""";
#import #TODO
#import #TODO
#import #TODO
#from #TODO import pyplot
%matplotlib inline
# ^^ The last line is to make plots to appear automatically
# after the cells are executed
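For reference, one possible way to fill in the import cell above (a solution sketch; the %matplotlib inline line stays as it is):

import pandas
import matplotlib
import seaborn
from matplotlib import pyplot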
(Advanced) You can use display in lieu of print to show multiple pandas objects in a nice format in one code cell:
from IPython.display import display
As in the preceding session, we continue to use a small table extracted from the full Sherlock dataset.
Using pandas.read_csv, load the data contained in sherlock/sherlock_mystery.csv into a variable named df_mystery.
NOTE: The required files should have been copied into the current directory tree. The data file sherlock_mystery.csv is located inside the sherlock subdirectory. (Use the ls sherlock/ command to verify that the file exists.)
"""Uncomment and replace #TODO with appropriate filename""";
#df_mystery = pandas.read_csv('sherlock/#TODO')
"""Also reate a copy of the dataframe as a backup""";
#df_backup = df_mystery.copy()
A copy of the dataset was made; in case you make a mistake with df_mystery at a later time, you can simply re-initialize the data by re-copying df_backup:
df_mystery = df_backup.copy()
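For reference, the filled-in loading cell could look like this sketch (assuming the data file sits at sherlock/sherlock_mystery.csv, as described above):

df_mystery = pandas.read_csv('sherlock/sherlock_mystery.csv')
df_backup = df_mystery.copy()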
QUESTIONS: If you have not already, perform initial exploration of the dataset now; use Pandas methods and attributes to answer the following questions:
Hint: use a combination of the DataFrame attributes shape, dtypes, and/or methods like head, tail, describe, info.
"""Experiment with functions and attributes listed above to
obtain initial knowledge into df_mystery.""";
#print(df_mystery.#TODO)
#...
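If you are unsure where to start, here is a sketch of some exploration statements (run them one at a time and study each output):

print(df_mystery.shape)       # (number of rows, number of columns)
print(df_mystery.dtypes)      # data type of each column
print(df_mystery.head())      # the first five records
print(df_mystery.describe())  # summary statistics of the numeric columns
df_mystery.info()             # column names, non-null counts, memory usage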
The sherlock_mystery DataFrame contains a tiny extract of Sherlock's Application.csv data file, which contains snapshots of a running app's resource utilization (CPU, memory, network).
sherlock_mystery contains 14 columns (also known as features or variables) and 200 rows (records).
The columns and their corresponding data types are described below:
| Column | Datatype | Description |
|---|---|---|
| Unnamed: 0 | int | Record index |
| CPU_USAGE | float | CPU utilization (100% = completely busy CPU) |
| cutime | int | CPU "user time" spent by the spawned (child) processes |
| lru | int | "Least Recently Used"; a parameter of the Android application memory management |
| num_threads | int | Number of threads in this process |
| otherPrivateDirty | int | The private dirty pages used by everything else other than the Dalvik heap and native heap |
| priority | int | Process's scheduling priority |
| utime | int | Measured CPU "user time" |
| vsize | int | The size of the virtual memory, in bytes |
| cminflt | int | Count of minor faults made by the process's child processes |
| guest_time | int | Running time of the "virtual CPU" |
| Mem | int | Size of memory, in bytes |
| queue | int | The waiting order (priority) |
Pandas provides convenient and intuitive ways to operate on Series
and columns of DataFrame
objects.
In this section we will cover the following basic operations:
When working with data to answer a specific question,
we often want to work with only a subset of columns from the full table.
In the previous session we learned how to create a DataFrame that contains only selected columns using the []
indexing notation.
QUESTION:
Using the indexing notation we learned in the previous session, create a new DataFrame that contains only ApplicationName
and memory-related features: 'vsize', 'Mem', 'otherPrivateDirty'.
"""Create a DataFrame containing only memory features:
ApplicationName, 'vsize', 'Mem', 'otherPrivateDirty'""";
#df_memstats = df_mystery[#TODO]
#RUNIT
# Select only processor statistics & the application name
df_cpustats = df_mystery[['ApplicationName',
'CPU_USAGE', 'utime', 'cutime', 'guest_time']]
df_cpustats.head()
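By analogy with the CPU statistics above, the memory-feature selection could be filled in like this (a sketch using the column names given in the question):

# Select only memory statistics & the application name
df_memstats = df_mystery[['ApplicationName', 'vsize', 'Mem', 'otherPrivateDirty']]
df_memstats.head()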
In pandas, rows or columns that are irrelevant or contain bad data can be deleted using the drop()
method.
We have to specify the type of deletion by supplying the axis=1 argument to delete column(s), or axis=0 to delete row(s).
To return a new DataFrame with one or more columns removed:
# one column only:
df_dropped = df.drop(COLUMN1, axis=1)
# one or more columns:
df_dropped = df.drop([COLUMN1, COLUMN2, ...], axis=1)
where COLUMN1
, COLUMN2
... are the names of the columns we want to drop.
We can also perform an in-place drop operation by adding the inplace=True
argument:
# in-place deletion, it returns nothing
df.drop([COLUMN1, COLUMN2, ...], axis=1, inplace=True)
This is an alternative way to select columns, i.e. by dropping the unwanted column(s).
EXAMPLE 1: Dropping a useless feature.
Create df_mystery2
that does not have the Unnamed: 0
field, which is not a feature at all.
"""Create new DataFrame which does not contain 'Unnamed: 0'.
Make sure to verify the result.""";
#df_mystery2 = df_mystery.drop(#TODO, axis=1)
df_mystery2 = df_mystery.drop(['Unnamed: 0'], axis=1)
df_mystery2.head()
EXAMPLE 2: Dropping all non-features.
ApplicationName
and Unnamed: 0
are not features.
Create df_features_only
that has only features.
"""Create new DataFrame which does not contain 'Unnamed: 0' and 'ApplicationName'.
Make sure to verify the result.""";
#df_features_only = df_mystery.#TODO
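One way to complete Example 2 (a sketch; both non-feature columns are dropped in a single call):

# Drop both non-feature columns at once, keeping only the features
df_features_only = df_mystery.drop(['Unnamed: 0', 'ApplicationName'], axis=1)
df_features_only.head()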
EXAMPLE 3: Dropping a useless feature forever.
Now remove column Unnamed: 0
from df_mystery
for good: we don't need to see it anymore.
Hint: This is an in-place operation.
"""Write a code to remove 'Unnamed: 0' column from the original DataFrame""";
To delete a row, simply change axis=1
to axis=0
and specify the row labels instead of column labels.
EXAMPLE 4: Dropping rows labeled 0 through 9. Create a new DataFrame with rows labeled 0 through 9 removed.
"""Write a code to create a new DataFrame where rows labeled 0..9 are removed""";
#df_mystery3 = #TODO
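A sketch of one possible answer for Example 4 (the row labels 0 through 9 can be spelled out with range):

# axis=0 (the default) means we drop rows rather than columns
df_mystery3 = df_mystery.drop(list(range(10)), axis=0)
df_mystery3.head()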
Arithmetic manipulations on Series and DataFrame objects are as intuitive as we would expect.
EXAMPLE 1: The vsize
column has values in the billions (because today's typical memory sizes are in the gigabytes).
It is better for us to work with units of gigabytes.
Let us create a new Series by removing the "billion" prefactor:
"""Modify and uncomment code below to convert 'vsize' from bytes to gigabytes""";
#vsize_gb = df_mystery['#TODO'] * 1e-9
print("Original data:")
#print(#TODO.head())
print("Same data in units of GB:")
#print(#TODO)
We can create a new column in the DataFrame that contains the vsize
data in units of gigabytes:
df_mystery['vsize_gb'] = df_mystery['vsize'] * 1e-9
df_mystery.head()
NOTE: We can also update the existing column to contain the transformed values by reusing the same column name for the assignment, e.g.
df_mystery['vsize'] = df_mystery['vsize'] * 1e-9
Arithmetic involving multiple columns:
Simply add, subtract, multiply or divide two or more columns as if they were ordinary variables: e.g.
df[COL1] - df[COL2]
. The operation will be done element-wise, matched by the index.
EXAMPLE 2:
Subtract guest_time
from utime
and assign it to a new column called utime_self
.
"""Uncomment and modify to compute `utime_self`:""";
#df_mystery['utime_self'] = #TODO
BONUS QUESTION: Print the three columns side-by-side: utime_self
, utime
, guest_time
.
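A sketch covering both the example and the bonus question (assuming the column names above):

# Self CPU time = total user time minus time spent in the "virtual CPU"
df_mystery['utime_self'] = df_mystery['utime'] - df_mystery['guest_time']

# Bonus: show the three columns side-by-side
print(df_mystery[['utime_self', 'utime', 'guest_time']].head(10))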
A Series
object has a str
attribute, which contains an extensive collection of string functions, such as:
letter case manipulation, substring search-and/or-replace, etc.
EXAMPLE: Raw data tends to be messy with inconsistent character cases, e.g.
Facebook
, facebook
, FACEBOOK
and faceBOOK
all refer to the same thing.
To clean this data, we need to make these values more consistent,
e.g. by capitalizing all the letters.
Let us illustrate such a cleaning process on the ApplicationName
column by converting the values to upper-case letters.
"""CAPITALIZE all letters in 'ApplicationName' column:""";
#df_mystery['ApplicationName'] = df_mystery['#TODO'].str.upper()
#print(df_mystery['ApplicationName'].head(10))
The str.upper()
method converts the string values to UPPER CASE, whereas str.lower()
converts to lower case.
To learn more about pandas string capabilities, refer to
Working with Text Data (method summary),
part of pandas user guide.
A more complex transformation can be applied to each element of a DataFrame or Series by defining a standard Python function, and passing that function to the apply()
method of the DataFrame or Series.
Here are some examples:
EXAMPLE 1: We can reimplement the unit conversion (bytes-to-gigabytes) using a user-defined function:
# Define a function to convert to unit of gigabytes.
# It must take a single argument and return a single value:
def to_giga(val):
return val * 1e-9
df_mystery['vsize_gb2'] = df_mystery['vsize'].apply(to_giga)
# Check the result:
print(df_mystery[['vsize', 'vsize_gb2']].head(10))
This does the same thing as the previous one (vsize_gb
).
NOTES ON PERFORMANCE (advanced)
Whenever possible, rely on the tools provided by pandas to perform your data manipulations. Pandas' user guide on Computational Tools is a good starting point. Functions or operations that involve hand-written loops in Python carry a very high performance penalty. This performance issue is discussed briefly in our lesson.
EXERCISE 2:
The column guest_time
contains negative values (please check!), even though no time value can be negative. Fix this by replacing all negative values with zero, using the following function:
# Function to convert negative values to zero
def non_negative(val):
if val > 0:
return val
else:
return 0
"""Uncomment and use the `apply` method to replace negative values with zeros:""";
#df_mystery['guest_time_fix'] = #FIXME
"""Observe the effect of the function:""";
#print(#TODO)
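One way to complete Exercise 2 (a sketch using the non_negative function defined above):

# Apply the element-wise fix to the `guest_time` column
df_mystery['guest_time_fix'] = df_mystery['guest_time'].apply(non_negative)

# Compare the original and the fixed columns
print(df_mystery[['guest_time', 'guest_time_fix']].describe())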
A student named Lee wants to compute the range of the values for each column, but he ran into a problem. (The range of a given set of numbers is simply the difference between the maximum and the minimum of the values.) Lee implements the range calculation in this way:
def value_range(col):
return max(col) - min(col)
print(df_mystery.apply(value_range))
Copy the code above to the cell below and run it. Observe the error and help Lee troubleshoot his problem:
<ipython-input-35-5fffcff8afda> in value_range(col)
2
3 def value_range(col):
----> 4 return max(col) - min(col)
5
6
TypeError: ("unsupported operand type(s) for -: 'str' and 'str'", 'occurred at index ApplicationName')
Hint: Examine the types and the name of the column mentioned in the error.
"""Troubleshoot and fix Lee's code in this cell here""";
Actually, the problem above is a perfect example of how we can use pandas' built-in capabilities to compute the range.
We have learned a number of key methods in the previous session---and one of them did return the min and max values of each column.
Use the result of that method to compute the range!
Hint: A correct implementation can be had with only two lines of code!
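One such two-line implementation (a sketch based on describe(), which reports the min and max of every numeric column; the stats name is just an arbitrary choice):

stats = df_mystery.describe()
print(stats.loc['max'] - stats.loc['min'])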
Now we come to the next important building block in data analytics. We often need to filter DataFrame rows based on certain values or conditions satisfied in each row. This alone often gives answer to many questions asked in data analytics.
A boolean expression evaluates whether one or more variables satisfied a prescribed condition. For example,
"""Run this an observe the result""";
df_mystery['CPU_USAGE'] > 50
returns a Series of boolean values:
True
where the corresponding CPU_USAGE
value is greater than 50,
False
otherwise.
QUESTION: Do you see any True
value? Why (or why not)? Try tweaking the 50
to another value until you see some True
.
Filtering Rows--We can combine the boolean expression above with the []
operator to return only rows that satisfy the prescribed condition:
"""Return rows where `CPU_USAGE` value is greater than two:""";
# df_mystery[#TODO]
Operators supported include: ==
(equal to), !=
(not equal to), >
, >=
, <
, <=
. There are a few other frequently used Series methods that also yield boolean values, such as isna()
(returns True when a value is not defined) and notna()
(True when a value is defined).
To learn more, please check out the
list of common boolean expressions and methods
in our lesson page.
Boolean expressions can be combined with AND (&
), OR (|
) and NOT (~
) operators to achieve complex data operations.
However, comparison expressions such as df_mystery['CPU_USAGE'] > 2 must be enclosed in parentheses because of Python's operator precedence rules.
EXERCISE 1: Display records where ApplicationName
is "Facebook" and CPU_USAGE
is greater than 0.5
.
Hint: The expressions before and after the combine operator need to be enclosed in parentheses to evaluate correctly. You may need to reset the contents of df_mystery
before running this exercise.
"""Uncomment and modify to display records where ApplicationName is Facebook
and CPU_USAGE is greater than 0.5""";
## you may also need: df_mystery = df_backup.copy()
#df_mystery[(df_mystery['#TODO'] == 'Facebook') & (#TODO)]
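A filled-in version of Exercise 1 might look like this sketch (note the parentheses around each comparison):

df_mystery = df_backup.copy()   # reset, in case ApplicationName was upper-cased earlier
df_mystery[(df_mystery['ApplicationName'] == 'Facebook') & (df_mystery['CPU_USAGE'] > 0.5)]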
EXERCISE 2: Display records where ApplicationName
is "Facebook" or CPU_USAGE
is greater than 0.5
.
How many are there?
Hint: Consider which operator the wording of the question calls for.
"""Uncomment and modify display records where ApplicationName is Facebook
OR CPU_USAGE is greater than 0.5""";
#df_mystery[(#TODO) #TODO (#TODO)]
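A sketch for Exercise 2; wrapping the result with len() answers the "how many" part (the df_or name is just for illustration):

df_or = df_mystery[(df_mystery['ApplicationName'] == 'Facebook') | (df_mystery['CPU_USAGE'] > 0.5)]
print(len(df_or))   # number of matching records
df_or.head()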
Sorting is a procedure to reorder data according to certain rules. Most often, rows are ordered according to one or more columns in numerical or alphabetical order.
Sorting is helpful to discover events where applications are using the most CPU cycles, the most memory, the most network bandwidth, etc. Let's check our data for these "top hogs".
Preparation: Let's define the following df to focus our attention on only a subset of relevant columns (and we copy it from the backup):
"""Make a copy of `df_mystery` with fewer columns for practice.""";
df = df_backup[['ApplicationName', 'CPU_USAGE', 'num_threads',
'vsize', 'utime', 'cutime']].copy()
EXAMPLE 1: CPU hogs -- Sort the records based on CPU utilization, from the highest to the lowest.
"""Uncomment and run this to sort the rows in descending order
based on `CPU_USAGE`""";
#df_CPU_hog = df.sort_values(by='CPU_USAGE', ascending=False)
#df_CPU_hog.head(20)
EXAMPLE 2: Memory hogs -- Repeat the analysis above by sorting the data by memory usage (column: vsize
). Call the result df_mem_hog
. Which application uses more memory: Facebook
or WhatsApp
?
"""Produce df_mem_hog and print the top 20 records""";
#df_mem_hog = #TODO
EXAMPLE 3: Thread and CPU hogs -- Repeat the analysis above by sorting the data using two keys: first by num_threads
, then by CPU_USAGE
.
Pass on the sorting keys as a list to the by=
argument.
Call the result df_thread_hog
.
"""Uncomment and modify to sort the data by `num_threads` and `CPU_USAGE`""";
#df_thread_hog = df.sort_values(by=[#TODO], ascending=False)
#df_thread_hog.head(20)
Precise sorting order can be achieved by feeding lists to both the by=
and ascending=
arguments, for example:
df.sort_values(by=['num_threads', 'utime'], ascending=[False, True])
QUESTION: What does the statement above give us? Experiment below and find out:
"""Use this cell to try out the sort statement above""";
EXAMPLE 4: Idle thread hogs -- Let us combine filtering and sorting:
For all records where CPU_USAGE
is zero, sort them in descending order by num_threads
:
"""Modify and run this cell to complete Example 4""";
#df_thread_hog2 = df[ #TODO ].sort_values(by=#TODO, ascending=False)
#df_thread_hog2.head(10)
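One possible way to fill in Example 4 (a sketch: the boolean filter keeps the zero-CPU records, which are then sorted):

df_thread_hog2 = df[ df['CPU_USAGE'] == 0 ].sort_values(by='num_threads', ascending=False)
df_thread_hog2.head(10)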
EXAMPLE 5: For this exercise, consider only records with nonzero CPU_USAGE
.
Sort them with two keys: first in ascending order by CPU_USAGE
, then in descending order by num_threads
.
"""Use this cell to answer the question above""";
#df_thread_hog3 = df[ #TODO ].sort_values(by=#TODO, ascending=False)
#df_thread_hog3.head(20)
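A sketch for Example 5. Note that ascending= must receive a list here, because the two keys are sorted in opposite directions (the template's single ascending=False will not do):

# Nonzero CPU usage only; sort by CPU_USAGE (ascending), then num_threads (descending)
df_thread_hog3 = df[ df['CPU_USAGE'] > 0 ].sort_values(by=['CPU_USAGE', 'num_threads'],
                                                       ascending=[True, False])
df_thread_hog3.head(20)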
A major goal of data analytics is to obtain valuable insights from the data. We often seek statistical knowledge about the data by performing aggregation on the data. For example:
Pandas provides
a rich set of aggregation (statistical) methods
that are applicable to both DataFrame
and Series
objects:
df.max() --- the maximum value
df.min() --- the minimum value
df.mean() --- the mean (average) of the values
df.sum() --- the sum of the values
df.mode() --- the most frequently occurring values (there can be multiple such values)
df.median() --- the median of the values
df.describe() on the DataFrame includes many of these.
EXPERIMENTS: Run these expressions in the cell(s) below and observe the output. Also try other statistical functions from the list above.
"""Uncomment and run individual statements below and observe the output""";
#df_mystery.max()
#df_mystery['lru'].mean()
#df_mystery['ApplicationName'].mode()
#df_mystery['cutime'].median()
#df_mystery['cutime'].sum()
#df_mystery.max(axis=1, numeric_only=True)
More exercises will be found after we introduce the grouping operation below.
In the previous section, we computed aggregate quantities for the entire data. But sometimes we want to know detailed statistical information such as:
These questions require us to group the observations based on the application name before applying the aggregate functions. Group aggregation, often known as group by or split-apply-combine, involves one or more of the following steps: (1) splitting the data into groups based on certain criteria; (2) applying an aggregate function to each group independently; (3) combining the results into a new data structure.
Let's continue using the df
DataFrame.
The first step, splitting, is performed by the groupby
method:
df.groupby('ApplicationName')
By itself, this method call does not produce any result, because we have not yet specified the next two steps.
Immediately following groupby
, we can apply an aggregate method which will perform the apply and combine steps and yield the result.
EXAMPLE 1: The following example computes the maximum value of memory usage for each application:
df.groupby('ApplicationName')['vsize'].max()
We can obtain the same aggregation for all the columns:
df.groupby('ApplicationName').max()
EXAMPLE 2: Compute the total CPU time spent by each application.
Hint: Sum over the utime
field to get the total CPU time.
"""Uncomment and modify to calculate the total CPU time for each application""";
#print(df.groupby('ApplicationName')[#TODO].#TODO)
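One way to complete Example 2 (a sketch: group by the application name, then sum the utime column):

print(df.groupby('ApplicationName')['utime'].sum())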
EXAMPLE 3: Count the number of records for each application name.
Hint: Use the size()
method.
"""Count the number of records per application name""";
#TODO
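A sketch for Example 3, using the size() method suggested in the hint:

print(df.groupby('ApplicationName').size())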
EXAMPLE 4: The describe()
method is actually an aggregate function.
Please apply this function following the groupby
:
"""Group data by application name, then compute descriptive statistics""";
#TODO
Hint: Transposing the DataFrame makes reading easier.
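A possible answer for Example 4 (a sketch; the trailing .T transposes the result, as the hint suggests):

df.groupby('ApplicationName').describe().T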
Note: Combining datasets is more advanced than the building blocks above, but it is included here for completeness.
Real-world big data analytics will involve integrating datasets in various forms and from various sources.
In our case study, Sherlock's Applications data may need to be augmented with additional measurements or related information bits from other tables.
There are two primary ways to combine datasets:
Adding more measurements of the same kind, which means to add rows into our existing table. This is done by concatenating the two or more tables together.
Adding different but related features from other source(s), which technically inserts additional columns into our existing table. This is done by merging or joining two tables.
Pandas has the concat
function to concatenate rows from two or more tables:
df_all = pandas.concat((df1, df2, ...))
.
This is straightforward if the source DataFrames have the same columns.
We will cover below a case that is slightly less trivial.
EXAMPLE (advanced):
In your hands-on package there is another application stat table, similar but not identical to sherlock_mystery.csv
.
Our goal is to concatenate the two to make a big table.
Let's load that other table and examine it:
"""Read in the following data file..."""
df_head2k = pandas.read_csv("sherlock/sherlock_head2k.csv")
"""... and explore the columns, data types, number of records, etc."""
#TODO
#RUNIT
df_head2k.info()
Did you notice that the second table has significantly more rows and columns?
If you peek into the data, you will also notice that it contains records from far more diverse applications than
has!
If we go ahead and concatenate them, what will happen?
## make sure df_mystery is reset to the original file:
df_mystery = df_backup.copy()
df_concat = pandas.concat((df_mystery, df_head2k))
There is a warning which makes us a little uncomfortable: Sorting because non-concatenation axis is not aligned. ...
"""Always check the resulting table before proceeding""";
df_concat.info()
df_concat.head(10)
df_concat.tail(10)
Clearly, the tables are concatenated because the beginning records contain data from sherlock_mystery.csv
and the tail records are from sherlock_head2k.csv
. pandas is smart enough to align the columns based on their names;
it also fills the columns that are missing in the source table with NaN
(not-a-number), indicating missing data.
But the result is not perfect, because not all columns are present in both tables.
We have to make a judgment call regarding the missing data (which will be discussed in detail in the subsequent episode of the lesson).
It is reasonable to limit ourselves to the columns that exist on both tables.
# Use Python set operation to get the overlapping columns
columns_both = sorted(list(set(df_backup.columns) & set(df_head2k.columns)))
print(columns_both)
Now it's time for action: re-read the tables and concatenate them:
"""Uncomment and modify codes below to load the two tables then merge""";
#df_set1 = pandas.read_csv("sherlock/sherlock_mystery.csv", usecols=columns_both)
#df_set2 = pandas.read_csv("sherlock/sherlock_head2k.csv", usecols=columns_both)
#df_combined = #TODO
## Always check the result using info() and describe()
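One way to fill in the cell above (a sketch; with usecols=columns_both the two tables share the same columns, so a plain concatenation works without the earlier warning):

df_set1 = pandas.read_csv("sherlock/sherlock_mystery.csv", usecols=columns_both)
df_set2 = pandas.read_csv("sherlock/sherlock_head2k.csv", usecols=columns_both)
df_combined = pandas.concat((df_set1, df_set2))
df_combined.info()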
In the previous example, tables were stacked vertically.
We can extend tables horizontally using the DataFrame's merge() method, also known as the JOIN operator in SQL (Structured Query Language).
In the JOIN operation, rows in the two tables are matched on one or more columns (also known as the join key(s)).
Where there is a match, the two rows are placed side-by-side to become a row in the output table.
Let's consider this example, which we will do later:
Table 1 ("left table"):
ApplicationName CPU_USAGE num_threads
WhatsApp 10.0 100
Facebook 0.9 72
Facebook 0.0 68
WhatsApp 5.0 120
Table 2 ("right table"):
ApplicationName ApplicationType
WhatsApp Messaging
Facebook Social Media
When we join the two tables above using ApplicationName
as the join key, then we will obtain
Output table:
ApplicationName CPU_USAGE num_threads ApplicationType
WhatsApp 10.0 100 Messaging
Facebook 0.9 72 Social Media
Facebook 0.0 68 Social Media
WhatsApp 5.0 120 Messaging
The joined table will contain only a single copy of the join key column (i.e. no duplicate ApplicationName column).
In this notebook we will limit ourselves to the "LEFT OUTER JOIN" operation.
All rows on the left table will be included in the resulting table.
If there are row(s) on the right table that match, they will be joined to the left-table row.
Otherwise, the columns will be marked with NaN
to indicate missing data.
The subsequent exercises merge two tables, forming a new dataset with more columns, based on matching values of a common column (the join key) in both tables.
EXAMPLE:
Let us practice the outer join operation by appending extra information about the application in every row.
The left table is given by df_combined
from the previous section;
the right table is an additional table that we load here:
"""Uncomment to load `sherlock/app_info.csv`""";
#df_appinfo = pandas.read_csv('sherlock/app_info.csv')
#print(df_appinfo.head(10))
QUESTION: What are the additional information bits to be attached to df_combined
?
Now let's do the "LEFT JOIN" merge and observe the output:
"""Uncomment and modify to merge df_combined and df_appinfo DataFrames""";
#df_master = df_combined.merge(right=df_appinfo, how='left', on='ApplicationName')
#print("Merged data (head):")
#display(df_master.head(10))
#print("Merged data (tail):")
#display(df_master.tail(10))
A quick summary of data manipulation building blocks introduced in this session:
Exploring a DataFrame: df.shape, df.dtypes, df.head(), df.tail(), df.describe(), df.info()
Transposing a DataFrame: df.T
Filtering rows with a boolean expression: df[BOOLEAN_EXPRESSION]
Sorting rows: df.sort_values()
Aggregation: df.max(), df.min(), df.mean(), df.sum(), df.mode(), df.median()
Element-wise transformation: df.apply()
Per-group aggregation: df.groupby(['COLUMN1','COLUMN2',...]).FUNC()
Merging tables: df.merge()