DeapSECURE module 2: Dealing with Big Data

Session 2: Analytics of Sherlock Data with Pandas

Welcome to the DeapSECURE online training program! This is a Jupyter notebook for the hands-on learning activities of the "Big Data" module, Episode 4: "Analytics of Sherlock Data with Pandas".

In this notebook we will learn how to manipulate data in pandas to extract knowledge from our data. To do this, we will need to know several basic building blocks of data manipulation shown below:

Quick Links (sections of this notebook)

  1. Setup
  2. Column Operations
  3. Filtering
  4. Sorting
  5. Aggregation
  6. Grouping
  7. Combining Multiple Datasets
  8. Key Points

These building blocks can be combined to create analytic pipelines to provide answers to detailed questions such as:

  • "What's the average memory usage for each application?"
  • "What's the memory usage across all applications for a range of time instances?"
  • "Which application consumes the most memory on May 16?"
  • "Which application consumes the most CPU cycles?"
  • "What is the memory usage pattern (over time) for different kinds of messaging applications?"

1. Setup Instructions

If you are opening this notebook from Wahab cluster's OnDemand interface, you're all set.

If you obtained this notebook elsewhere, and you want to perform the exercises on Wahab cluster, please follow the steps outlined in our setup procedure:

  1. Make sure you have activated your HPC service.
  2. Point your web browser to https://ondemand.wahab.hpc.odu.edu/ and sign in with your MIDAS ID and password.
  3. Create a new Jupyter session using "legacy" Python suite, then create a new "Python3" notebook. (See ODU HPC wiki for more detailed help.)
  4. Get the necessary files using commands below within Jupyter notebook or terminal interface:

    mkdir -p ~/CItraining/module-bd
    cp -pr /scratch/Workshops/DeapSECURE/module-bd/. ~/CItraining/module-bd
    cd ~/CItraining/module-bd

The file name of this notebook is BigData-session-2.ipynb.

1.1 Reminder

We recommend opening the lesson's reference pages in separate tabs (or printing them); they are a handy help when writing your own code.

1.2 Loading Python Libraries

Next, we need to import the required libraries into this Jupyter Notebook: pandas, matplotlib.pyplot, and seaborn.

For Wahab cluster only: before importing these libraries, we need to load the DeapSECURE environment modules:

In [ ]:
module("load", "DeapSECURE")

Now we can import all the required modules into Python:

In [ ]:
import numpy
import pandas
import seaborn
from matplotlib import pyplot as plt

%matplotlib inline
# ^^ The last line makes plots appear automatically
#    after the cells are executed

(Advanced) You can use display in lieu of print to show multiple pandas objects in a nice format in one code cell:

In [ ]:
from IPython.display import display

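As a quick, self-contained illustration (two throwaway Series, created only for this demonstration), display renders several objects in one cell:

# Both objects below are rendered nicely in a single cell's output:
display(pandas.Series([1, 2, 3], name="first"))
display(pandas.Series([4, 5, 6], name="second"))
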
1.3 Loading Sherlock Data

As in the preceding session, we continue to use a small table extracted from the full Sherlock dataset. Using pandas.read_csv, load the data contained in sherlock/sherlock_mystery.csv into a variable named df_mystery.

NOTE: The required files should have been copied into the current directory tree. The data file sherlock_mystery.csv is located inside the sherlock subdirectory. (Use ls sherlock/ to verify that the file exists.)

In [ ]:
"""Uncomment and replace #TODO with appropriate filename""";

#df_mystery = pandas.read_csv('sherlock/#TODO')

"""Also create a copy of the dataframe as a backup""";
#df_backup = df_mystery.copy()

A copy of the dataset was made; if you make a mistake with df_mystery at a later time, you can re-initialize the data by re-copying df_backup:

df_mystery = df_backup.copy()

QUESTIONS: If you have not already, perform initial exploration of the dataset now; use Pandas methods and attributes to answer the following questions:

  1. How many rows and columns are in this dataset?
  2. What do the numbers look like?
  3. What does the statistical summary look like?
  4. What are the data types of the features?

HINT: use a combination of the DataFrame attributes shape, dtypes, and/or methods like head, tail, describe, info.

In [ ]:
"""Experiment with functions and attributes listed above to
obtain initial knowledge into df_mystery.""";

#print(df_mystery.#TODO)
#...

1.4 Dataset Features

The sherlock_mystery DataFrame contains a tiny extract of Sherlock's Application.csv data file, which contains snapshots of running apps' resource utilization (CPU, memory, network).

sherlock_mystery contains 14 columns (also known as features or variables) and 200 rows (records).

The columns and their corresponding data types are described below:

Column              Datatype  Description
Unnamed: 0          int       Record index
ApplicationName     object    Name of the application
CPU_USAGE           float     CPU utilization (100% = completely busy CPU)
cutime              int       CPU "user time" spent by the spawned (child) processes
lru                 int       "Least Recently Used"; a parameter of the Android application memory management
num_threads         int       Number of threads in this process
otherPrivateDirty   int       Private dirty pages used by everything other than the Dalvik heap and native heap
priority            int       Process's scheduling priority
utime               int       Measured CPU "user time"
vsize               int       Size of the virtual memory, in bytes
cminflt             int       Count of minor page faults incurred by the process's child processes
guest_time          int       Running time of the "virtual CPU"
Mem                 int       Size of memory, in bytes
queue               int       The waiting order (priority)

2. Column Operations

Pandas provides convenient and intuitive ways to operate on Series and on columns of DataFrame objects. In this section we will cover the following basic operations:

  • DataFrame column selection
  • Arithmetic operations
  • String operations
  • Custom transformations

2.1 Column Selection

When working with data to answer a specific question, we often want to work with only a subset of columns from the full table. In the previous session we learned how to create a DataFrame that contains only selected columns using the [] subscript operator.
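
As a quick refresher (a minimal sketch using an arbitrary pair of columns), passing a list of column names inside the [] operator returns a new DataFrame with just those columns:

# Select two columns by name (an arbitrary choice, purely for illustration):
df_small = df_mystery[['ApplicationName', 'CPU_USAGE']]
df_small.head()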

EXERCISE: Using the indexing notation we learned in the previous session, create a new DataFrame that contains only ApplicationName and memory-related features: vsize, Mem, otherPrivateDirty.

In [ ]:
"""Create a DataFrame containing only memory-related features:
ApplicationName, vsize, Mem, otherPrivateDirty""";
#df_memstats = df_mystery[#TODO]

EXERCISE (optional): create another DataFrame that contains only ApplicationName and CPU-related features: CPU_USAGE, utime, cutime, guest_time.

2.2 Removing Column(s) or Row(s)

In pandas, rows or columns that are irrelevant or contain bad data can be deleted using the drop() method. We have to specify the type of deletion by supplying the axis=0 argument to delete row(s), or axis=1 to delete column(s).

Removing Column(s)

To return a new DataFrame with one or more columns removed:

# one column only:
df_dropped = df.drop(COLUMN1, axis=1)

# one or more columns:
df_dropped = df.drop([COLUMN1, COLUMN2, ...], axis=1)

where COLUMN1, COLUMN2 ... are the names of the columns we want to drop. We can also perform an in-place drop operation by adding the inplace=True argument:

# in-place deletion, it returns nothing
df.drop([COLUMN1, COLUMN2, ...], axis=1, inplace=True)

This is an alternative way to select columns, i.e. by dropping the unwanted column(s).

EXAMPLE 1: Dropping a useless feature. Create df_mystery2 that does not have the Unnamed: 0 field, which is not a feature at all.

In [ ]:
"""Create new DataFrame which does not contain 'Unnamed: 0'.
Make sure to verify the result.""";

#df_mystery2 = df_mystery.drop(#TODO, axis=1)
In [ ]:
df_mystery2 = df_mystery.drop(['Unnamed: 0'], axis=1)
df_mystery2.head()

EXAMPLE 2: Dropping all non-features. ApplicationName and Unnamed: 0 are not features. Create df_features_only that has only features.

In [ ]:
"""Create new DataFrame which does not contain 'Unnamed: 0' and 'ApplicationName'.
Make sure to verify the result.""";

#df_features_only = df_mystery.#TODO

EXAMPLE 3: Dropping a useless feature forever. Now remove column Unnamed: 0 from df_mystery for good: we don't need to see it anymore.

Hint: This is an in-place operation which alters df_mystery.

In [ ]:
"""Write a code to remove 'Unnamed: 0' column from the original DataFrame""";

Removing Row(s)

To delete a row, simply change axis=1 to axis=0 and specify the row labels instead of column labels.
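
For instance (a small sketch, separate from the exercise below), dropping the first two row labels would look like this:

# Return a new DataFrame without the rows labeled 0 and 1:
df_no_first_two = df_mystery.drop([0, 1], axis=0)
df_no_first_two.head()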

EXAMPLE 4: Dropping rows labeled 0 through 9. Create a new DataFrame with those rows removed.

In [ ]:
"""Write a code to create a new DataFrame where rows labeled 0..9 are removed""";
#df_mystery3 = #TODO

2.3 Arithmetic Operations

Arithmetic manipulations on Series and DataFrame objects are as intuitive as we would expect.

EXAMPLE 1: The vsize column has values in the billions (because today's typical memory sizes are in the gigabytes). It is better for us to work with units of gigabytes. Let us create a new Series by removing the "billion" prefactor:

In [ ]:
"""Modify and uncomment code below to convert 'vsize' from bytes to gigabytes""";

#vsize_gb = df_mystery['#TODO'] * 1e-9

print("Original data:")
#print(#TODO.head())
print("Same data in units of GB:")
#print(#TODO)

We can create a new column in the DataFrame that contains the vsize data in units of gigabytes:

In [ ]:
df_mystery['vsize_gb'] = df_mystery['vsize'] * 1e-9
df_mystery.head()

NOTE: We can also update the existing column to contain the transformed values by reusing the same column name for the assignment, e.g.

df_mystery['vsize'] = df_mystery['vsize'] * 1e-9

Arithmetic involving multiple columns: Simply add, subtract, multiply or divide two or more columns as if they were ordinary variables: e.g. df[COL1] - df[COL2]. The operation will be done element-wise, matched according to the index.
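
For instance (a small sketch with an arbitrary combination, not one of the exercises), the total "user time" of a process and its child processes could be computed as:

# Element-wise sum of two columns, stored in a new column;
# the name 'utime_total' is chosen only for this illustration:
df_mystery['utime_total'] = df_mystery['utime'] + df_mystery['cutime']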

EXAMPLE 2: Subtract guest_time from utime and assign it to a new column called utime_self.

In [ ]:
"""Uncomment and modify to compute `utime_self`:""";
#df_mystery['utime_self'] = #TODO

BONUS QUESTION: Print the three columns side-by-side: utime_self, utime, guest_time.

2.4 String Operations

A Series object has an attribute named str, which contains an extensive collection of string functions, such as: letter case manipulation, substring search-and/or-replace, etc.

EXAMPLE: Raw data tends to be messy with inconsistent character cases, e.g. Facebook, facebook, FACEBOOK and faceBOOK. These all refer to the same thing. To clean this data, we need to make these values more consistent, for example, by capitalizing all the letters. Let us illustrate such a cleaning process on the ApplicationName column by converting the values to upper-case letters.

In [ ]:
"""CAPITALIZE all letters in 'ApplicationName' column:""";

#df_mystery['ApplicationName'] = df_mystery['#TODO'].str.upper()
#print(df_mystery['ApplicationName'].head(10))

The str.upper() method converts the string values to UPPER CASE whereas str.lower() converts to lower case. To learn more about pandas string capabilities, refer to Working with Text Data (method summary), part of pandas user guide.
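
Other str methods work the same way. As a small sketch (the substring 'book' is an arbitrary choice), the following flags rows whose application name contains that substring, regardless of letter case:

# Lower-case first, then do a substring search; the result is a boolean Series
has_book = df_mystery['ApplicationName'].str.lower().str.contains('book')
print(has_book.head(10))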

2.5 Custom Transformations

A more complex transformation can be applied to each element of a DataFrame or Series by defining a standard Python function, and pass that function to the apply() method of the DataFrame or Series. Here are some examples:

EXAMPLE 1: We can reimplement the unit conversion (bytes-to-gigabytes) using a user-defined function:

In [ ]:
# Define a function to convert to unit of gigabytes.
# It must take a single argument and return a single value:
def to_giga(val):
    return val * 1e-9

df_mystery['vsize_gb2'] = df_mystery['vsize'].apply(to_giga)

# Check the result:
print(df_mystery[['vsize', 'vsize_gb2']].head(10))

This does the same thing as the previous one (vsize_gb). Note that the to_giga function was called multiple times: once for each individual element in the Series or DataFrame.

NOTES ON PERFORMANCE (advanced)

Whenever possible, rely on the tools provided by pandas to perform your data manipulations. Pandas' user guide on Computational Tools is a good starting point. Functions or operations that involve hand-written Python loops carry a very high performance penalty. This performance issue is discussed briefly in our lesson.
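
As a rough sketch of the difference, the bytes-to-gigabytes conversion can be timed in both forms; the exact numbers depend on your data and machine, but the vectorized version avoids calling a Python function once per element:

# Vectorized arithmetic: a single pandas/NumPy operation over the whole column
%timeit df_mystery['vsize'] * 1e-9

# Element-wise apply: calls the Python function to_giga once per element
%timeit df_mystery['vsize'].apply(to_giga)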

EXERCISE 2: The column guest_time contains negative values (please check!) whereas no time value can be negative. Fix this by replacing all negative values with zeros using the following function:

In [ ]:
# Function to convert negative values to zero
def non_negative(val):
    if val > 0:
        return val
    else:
        return 0
In [ ]:
"""Uncomment and use the `apply` method to replace negative values with zeros:""";
#df_mystery['guest_time_fix'] = #FIXME

"""Observe the effect of the function:""";
#print(#TODO)

Challenge Problem: Help Needed!

A student named Lee wants to compute the range of the values for each column, but he ran into a problem. (The range of a set of numbers is simply the difference between the maximum and the minimum of the values.) Lee implements the range calculation in this way:

def value_range(col):
    return max(col) - min(col)

print(df_mystery.apply(value_range))

Copy the code above to the cell below and run it. Observe the error and help Lee troubleshoot his problem:

<ipython-input-35-5fffcff8afda> in value_range(col)
      2 
      3 def value_range(col):
----> 4     return max(col) - min(col)
      5 
      6 

TypeError: ("unsupported operand type(s) for -: 'str' and 'str'", 'occurred at index ApplicationName')

Hint: Examine the types and the name of the column mentioned in the error.

In [ ]:
"""Troubleshoot and fix Lee's code in this cell here""";

Actually, the problem above is a perfect example of how we can apply pandas' built-in capabilities to compute the range. We learned a number of key methods in the previous session---and one of them returns the min and max values of each column. Use the result of that method to compute the range!

Hint: A correct implementation can be had with only two lines of code!

(end challenge)


3. Filtering

Now we come to the next important building block in data analytics. We often need to filter DataFrame rows based on certain values or conditions satisfied in each row. This alone often provides answers to many questions asked in data analytics.

A boolean expression evaluates whether one or more variables satisfy a prescribed condition. For example,

In [ ]:
"""Run this an observe the result""";
df_mystery['CPU_USAGE'] > 50

returns a Series of boolean values: True where the corresponding CPU_USAGE value is greater than 50, False otherwise.

QUESTION: Do you see any True value? Why (or why not)? Try tweaking the 50 to a lower value until you see some True.

Filtering Rows--We can combine the boolean expression above with the [] operator to return only rows that satisfy the prescribed condition:

In [ ]:
"""Return rows where `CPU_USAGE` value is greater than two:""";
# df_mystery[#TODO]

Operators supported include: == (equal to), != (not equal to), >, >=, <, <=. There are a few other frequently used Series methods that also yield boolean values, such as isna() (returns True where the corresponding value is not defined) and notna() (True when a value is defined). To learn more, please check out the list of common boolean expressions and methods in our lesson page.

Boolean expressions can be combined with AND (&), OR (|) and NOT (~) operators to achieve complex data operations. However, comparison expressions such as df_mystery['CPU_USAGE'] > 2 must be enclosed in () because of Python's operator precedence.
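
For illustration (a sketch with arbitrary thresholds, different from the exercises below), note how each comparison is wrapped in parentheses before being combined:

# Rows where CPU_USAGE exceeds 2 AND the process has more than 50 threads
# (both thresholds are arbitrary, chosen only for this illustration):
df_mystery[(df_mystery['CPU_USAGE'] > 2) & (df_mystery['num_threads'] > 50)]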

EXERCISE 1: Display records in df_mystery where ApplicationName is "Facebook" and CPU_USAGE is greater than 0.5.

HINTS: The expressions before and after the combine operator need to be enclosed in parentheses to evaluate correctly. You may need to reset the contents of df_mystery before running this exercise.

In [ ]:
"""Uncomment and modify to display records where ApplicationName is Facebook
and CPU_USAGE is greater than 0.5""";

## you may also need: df_mystery = df_backup.copy()

#df_mystery[(#TODO == 'Facebook') & (#TODO)]

EXERCISE 2: Display records where ApplicationName is "Facebook" or CPU_USAGE is greater than 0.5. How many are there?

Hint: Consider from the sentence what operator to use.

In [ ]:
"""Uncomment and modify  display records where ApplicationName is Facebook
OR CPU_USAGE is greater than 0.5""";

#df_mystery[(#TODO) #TODO (#TODO)]

4. Sorting

Sorting is a procedure to reorder data according to certain rules. Most often, rows are ordered according to one or more columns in numerical or alphabetical order.

Sorting is helpful to discover events where applications are using the most CPU cycles, the most memory, the most network bandwidth, etc. Let's check our data for these "top hogs".

PREPARATION: Let's define the following df to focus our attention on only a subset of relevant columns (we copy it from the backup):

In [ ]:
"""Make a copy of `df_mystery` with fewer columns for practice.""";

df = df_backup[['ApplicationName', 'CPU_USAGE', 'num_threads',
                'vsize', 'utime', 'cutime']].copy()

EXAMPLE 1: CPU hogs -- Sort the records based on CPU utilization, from the highest to the lowest. Can you tell which app uses the most CPU?

In [ ]:
"""Uncomment and run this to sort the rows in descending order
based on `CPU_USAGE`""";

#df_CPU_hog = df.sort_values(by='CPU_USAGE', ascending=False)
#df_CPU_hog.head(20)

EXAMPLE 2: Memory hogs -- Repeat the analysis above by sorting the data by memory usage (column: vsize). Call the result df_mem_hog. Which application uses more memory: Facebook or WhatsApp?

In [ ]:
"""Produce df_mem_hog and print the top 20 records""";
#df_mem_hog = #TODO

EXAMPLE 3: Thread and CPU hogs -- Repeat the analysis above by sorting the data using two keys: first by num_threads, then by CPU_USAGE. Pass on the sorting keys as a list to the by= argument. Call the result df_thread_hog.

In [ ]:
"""Uncomment and modify to sort the data by `num_threads` and `CPU_USAGE`""";
#df_thread_hog = df.sort_values(by=[#TODO], ascending=False)
#df_thread_hog.head(20)

More complex sorting order can be achieved by feeding lists to both the by= and ascending= arguments, for example:

df.sort_values(by=['num_threads', 'utime'], ascending=[False, True])

QUESTION: What does the statement above give us? Experiment below and find out:

In [ ]:
"""Use this cell to try out the sort statement above""";

EXAMPLE 4: Idle thread hogs -- Let us combine filtering and sorting: For all records where CPU_USAGE is zero, sort them in descending order by num_threads:

In [ ]:
"""Modify and run this cell to complete Example 4""";
#df_thread_hog2 = df[ #TODO ].sort_values(by=#TODO, ascending=False)
#df_thread_hog2.head(10)

EXAMPLE 5: For this exercise, consider only records with nonzero CPU_USAGE. Sort them with two keys: first in ascending order by CPU_USAGE, then in descending order by num_threads.

In [ ]:
"""Use this cell to answer the question above""";
#df_thread_hog3 = df[ #TODO ].sort_values(by=#TODO, ascending=#TODO)
#df_thread_hog3.head(20)

5. Aggregation

A major goal of data analytics is to obtain valuable insights from the data. We often seek statistical knowledge about the data by performing aggregation on the data. For example:

  • top 10 most used applications;
  • average memory usage over time;
  • total CPU utilization over time.

Pandas provides a rich set of aggregation (statistical) methods that are applicable to both DataFrame and Series objects:

  • df.max() --- the maximum value
  • df.min() --- the minimum value
  • df.mean() --- the mean (average) of the values
  • df.sum() --- the sum of the values
  • df.mode() --- the most frequently occurring value(s) (there can be more than one)
  • df.median() --- the median of the values

For a DataFrame, df.describe() includes many of these.

EXPERIMENTS: Run these expressions in the cell(s) below and observe the output. Also try other statistical functions mentioned in the list above.

  • df.max()
  • df['lru'].mean()
  • df['ApplicationName'].mode()
  • df['cutime'].median()
  • df['cutime'].sum()
  • df.max(axis=1, numeric_only=True)
In [ ]:
"""Uncomment and run individual statements below and observe the output""";

#df_mystery.max()
#df_mystery['lru'].mean()
#df_mystery['ApplicationName'].mode()
#df_mystery['cutime'].median()
#df_mystery['cutime'].sum()
#df_mystery.max(axis=1, numeric_only=True)

More exercises appear after we introduce the grouping operation below.

6. Grouping

In the previous section, we computed aggregate quantities for the entire data. But sometimes we want to know detailed statistical information such as:

  • Maximum memory usage for each application
  • Total CPU usage for each application
  • CPU usage over time for each application

These questions require us to group the observations based on the application name before applying the aggregate functions. Group aggregation, often known as group by or split-apply-combine, involves one or more of the following steps:

  • Splitting the data into groups based on defined criteria;
  • Applying a function to each group independently;
  • Combining the results into a data structure.

Let's continue using the df DataFrame. The first step, splitting, is performed by the groupby method:

In [ ]:
df.groupby('ApplicationName')

This statement alone does not yield a result, because we have not yet specified the next two steps. Immediately following groupby, we can apply an aggregate method, which performs the apply and combine steps and yields the result.

EXAMPLE 1: The following example computes the maximum value of memory usage for each application:

In [ ]:
df.groupby('ApplicationName')['vsize'].max()

We can obtain the same aggregation for all the columns:

In [ ]:
df.groupby('ApplicationName').max()

EXAMPLE 2: Compute the total CPU time spent by each application. Hint: Sum over the utime field to get the total CPU time.

In [ ]:
"""Uncomment and modify to calculate the total CPU time for each application""";
#print(df.groupby('ApplicationName')[#TODO].#TODO)

EXAMPLE 3: Count the number of records for each application name. Hint: Use the size() method.

In [ ]:
"""Count the number of records per application name""";
#TODO

EXAMPLE 4: The describe() method is actually an aggregate function. Please apply this function following the groupby:

In [ ]:
"""Group data by application name, then compute descriptive statistics""";
#TODO

Hint: Transposing the DataFrame makes reading easier.

7. Combining Multiple Datasets

Note: Combining datasets is more advanced than the building blocks above, but is included here for completeness.

Real-world big data analytics will involve integrating datasets in various forms and from various sources. In our case study, the Sherlock Applications data may need to be augmented with additional measurements or related bits of information from other tables (e.g., sensor information, screen on/off status, user present/absent, ...). There are two primary ways to combine datasets:

  1. Adding more measurements of the same kind, which means adding rows to our existing table. This is done by concatenating two or more tables together.

  2. Adding different but related features from other source(s), which technically inserts additional columns into our existing table. This is done by merging or joining two tables.

7.1 Concatenating Datasets

Pandas has the concat function to concatenate rows from two or more tables: df_all = pandas.concat([df1, df2, ...]). This is straightforward if the source DataFrames have the same columns. We will cover below a case that is slightly less trivial.
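
To see the mechanics on a toy example first (a sketch unrelated to the Sherlock data), two tiny DataFrames with partially overlapping columns are concatenated below; values for columns missing from either table are filled with NaN (depending on your pandas version, you may also see the sort-related warning that we will meet again shortly):

# Two toy tables: the second lacks column 'b' and adds column 'c'
df1 = pandas.DataFrame({'a': [1, 2], 'b': [3, 4]})
df2 = pandas.DataFrame({'a': [5, 6], 'c': [7, 8]})

# Rows are stacked; columns are aligned by name, missing values become NaN
print(pandas.concat([df1, df2]))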

EXAMPLE (advanced): In your hands-on package there is another application stat table, similar but not identical to sherlock_mystery.csv. Our goal is to concatenate the two to make a big table. Let's load that other table and examine it:

In [ ]:
"""Read in the following data file..."""
df_head2k = pandas.read_csv("sherlock/sherlock_head2k.csv")
In [ ]:
"""... and explore the columns, data types, number of records, etc."""
#TODO

QUESTION: Did you notice that the second table has significantly more rows and columns? Peek into the data using the functions introduced earlier; you will also notice that it contains records from far more diverse applications than sherlock_mystery does!

If we go ahead and concatenate them, what will happen?

In [ ]:
## make sure df_mystery is reset to the original file:
df_mystery = df_backup.copy()
df_concat = pandas.concat([df_mystery, df_head2k])

There is a warning which makes us a little uncomfortable: Sorting because non-concatenation axis is not aligned. ...

In [ ]:
"""Always check the resulting table before proceeding""";
df_concat.info()

QUESTION: Do you notice any anomaly from the report of the info() method above?

EXERCISE: Review the contents of df_concat--do you observe any issue? (Hints: head, tail)

In [ ]:
 

Clearly, the tables were concatenated: the beginning records contain data from sherlock_mystery.csv and the tail records come from sherlock_head2k.csv. pandas is smart enough to align the columns based on their names; it also fills the columns that are missing in a source table with NaN (not-a-number), indicating missing data. But the result is not perfect, because not all columns are present in both tables. We have to make a judgment call regarding the missing data (which will be discussed in detail in a subsequent episode of the lesson).

It is reasonable to limit ourselves to the columns that exist in both tables. Let's use only these six columns:

In [ ]:
# Columns present in both tables (they could also be derived with a set intersection; see below)
columns_both = ['ApplicationName', 'CPU_USAGE', 'utime', 'cutime',
                'vsize', 'num_threads']
print(columns_both)
In [ ]:
print(len(columns_both))
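
If you would rather not type the names by hand, the same kind of list can be derived with a Python set intersection (a sketch, assuming df_mystery and df_head2k are still loaded). Note that the result may also include shared non-feature columns such as Unnamed: 0, which you would still want to exclude:

# Columns common to both tables; sorted() gives the set a definite order
columns_common = sorted(set(df_mystery.columns) & set(df_head2k.columns))
print(columns_common)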

Now it's time for action... re-read the tables and concatenate them:

In [ ]:
"""Uncomment and modify codes below to load the two tables then merge""";

#df_set1 = pandas.read_csv("sherlock/sherlock_mystery.csv", usecols=columns_both)
#df_set2 = pandas.read_csv("sherlock/sherlock_head2k.csv", usecols=columns_both)
#df_combined = #TODO
In [ ]:
## Always check the result using info() and describe()

7.2 SQL-like JOIN Operations

In the previous example, tables were stacked vertically. We can extend tables horizontally using the DataFrame's merge() method, which corresponds to the JOIN operator in SQL (Structured Query Language). In a JOIN, rows of the two tables are matched on one or more columns (known as the join key(s)). Where there is a match, the two rows are placed side-by-side to become a single row in the output table. Let's consider this example, which we will do later:

Table 1 ("left table"):

ApplicationName   CPU_USAGE   num_threads
WhatsApp          10.0        100
Facebook           0.9         72
Facebook           0.0         68
WhatsApp           5.0        120

Table 2 ("right table"):

ApplicationName   ApplicationType
WhatsApp          Messaging
Facebook          Social Media

When we join the two tables above using ApplicationName as the join key, then we will obtain

Output table:

ApplicationName   CPU_USAGE   num_threads   ApplicationType
WhatsApp          10.0        100           Messaging
Facebook           0.9         72           Social Media
Facebook           0.0         68           Social Media
WhatsApp           5.0        120           Messaging

The joined table contains only one copy of the join key column (i.e. there is no duplicate ApplicationName column).

In this notebook we will limit ourselves to the "LEFT OUTER JOIN" operation: all rows of the left table will be included in the resulting table. If there are row(s) in the right table that match, they are joined to the left-table row; otherwise, the right table's columns are filled with NaN to indicate missing data.
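
As a sketch, the toy tables above can be reproduced and joined like this (the values are simply copied from the illustration):

# Left table: per-record measurements
df_left = pandas.DataFrame({
    'ApplicationName': ['WhatsApp', 'Facebook', 'Facebook', 'WhatsApp'],
    'CPU_USAGE': [10.0, 0.9, 0.0, 5.0],
    'num_threads': [100, 72, 68, 120],
})

# Right table: one row of extra information per application
df_right = pandas.DataFrame({
    'ApplicationName': ['WhatsApp', 'Facebook'],
    'ApplicationType': ['Messaging', 'Social Media'],
})

# LEFT OUTER JOIN on the 'ApplicationName' key
print(df_left.merge(right=df_right, how='left', on='ApplicationName'))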

The subsequent exercises merge two tables, forming a new dataset with more columns, based on matching values of a common column (the join key) in both tables.

EXAMPLE: Let us practice the outer join operation by appending extra information about the application in every row. The left table is given by df_combined from the previous section; the right table is an additional table that we load here:

In [ ]:
"""Uncomment to load `sherlock/app_info.csv`""";

#df_appinfo = pandas.read_csv('sherlock/app_info.csv')
#print(df_appinfo.head(10))

QUESTION: What are the additional information bits to be attached to df_combined?

Now let's do the "LEFT JOIN" merge and observe the output:

In [ ]:
"""Uncomment and modify to merge df_combined and df_appinfo DataFrames""";

#df_master = df_combined.merge(right=df_appinfo, how='left', on='ApplicationName')
#print("Merged data (head):")
#display(df_master.head(10))
#print("Merged data (tail):")
#display(df_master.tail(10))

8. Key Points

A quick summary of data manipulation building blocks introduced in this notebook:

  • Initial exploration: df.shape, df.dtypes, df.head(), df.tail(), df.describe(), and df.info()
  • Transpose table for readability: df.T
  • Filtering rows: df[BOOLEAN_EXPRESSION]
  • Sort rows/columns: df.sort_values()
  • Data aggregation: df.max(), df.min(), df.mean(), df.sum(), df.mode(), df.median()
  • Execute custom function: df.apply()
  • Group data by column and apply an aggregation function: df.groupby(['COLUMN1','COLUMN2',...]).FUNC()
  • Merge two tables: df.merge()

In this notebook, we demonstrated these building blocks with practical examples.

In [ ]: