{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "**DeapSECURE module 2: Dealing with Big Data**\n", "\n", "# Session 2: Analytics of Sherlock Data with Pandas\n", "\n", "Welcome to the DeapSECURE online training program!\n", "This is a Jupyter notebook for the hands-on learning activities of the\n", "[\"Big Data\" module](https://deapsecure.gitlab.io/deapsecure-lesson02-bd/), Episode 4:\n", "[\"Analytics of Sherlock Data with Pandas\"](https://deapsecure.gitlab.io/deapsecure-lesson02-bd/20-analytics-intro/index.html).\n", "\n", "In this notebook we will learn how to manipulate data in _pandas_ to extract knowledge from our data.\n", "To do this, we will need to know several basic building blocks of data manipulation shown below:\n", "\n", "**Quick Links** (sections of this notebook)\n", "1. [Setup](#sec-setup)\n", "2. [Column Operations](#sec-Column-ops)\n", " - [Arithmetric Operations](#sec-Arithmetic)\n", " - [String Operations](#sec-String)\n", " - [Custom Transformations](#sec-Custom)\n", "3. [Filtering](#sec-Filtering)\n", "4. [Sorting](#sec-Sorting)\n", "5. [Aggregation](#sec-Aggregation)\n", "6. [Grouping](#sec-Grouping)\n", "7. [Combining Multiple Datasets](#sec-Combining)\n", "8. [Key Points](#sec-key-points)\n", "\n", "These building blocks can be combined to create _analytic pipelines_ to provide answers to detailed questions such as:\n", "\n", "* \"What's the average memory usage for each application?\"\n", "* \"What's the memory usage across all applications for a range of time instances?\"\n", "* \"Which application consumes the most memory on May 16?\"\n", "* \"Which application consumes the most CPU cycles?\"\n", "* \"What is the memory usage pattern (over time) for different kinds of messaging applications?\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## 1. Setup Instructions\n", "\n", "If you are opening this notebook from Wahab cluster's OnDemand interface, you're all set.\n", "\n", "If you obtained this notebook elsewhere, and you want to perform the exercises on Wahab cluster, please follow the steps outlined in our setup procedure:\n", "\n", "1. Make sure you have activated your HPC service.\n", "2. Point your web browser to https://ondemand.wahab.hpc.odu.edu/ and sign in with your MIDAS ID and password.\n", "3. Create a new Jupyter session using \"legacy\" Python suite, then create a new \"Python3\" notebook. (See ODU HPC wiki for more detailed help.)\n", "4. Get the necessary files using commands below within Jupyter notebook or terminal interface:\n", "\n", " mkdir -p ~/CItraining/module-bd\n", " cp -pr /scratch/Workshops/DeapSECURE/module-bd/. ~/CItraining/module-bd\n", " cd ~/CItraining/module-bd\n", "\n", "The file name of this notebook is `BigData-session-2.ipynb`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.1 Reminder\n", "\n", "* Throughout this notebook, `#TODO` is used as a placeholder where you need to fill in with something appropriate. \n", "\n", "* To run a code in a cell, press `Shift+Enter`.\n", "\n", "* Pandas cheatsheet provides a reminder of most frequently used _pandas_ syntax and functions.\n", "\n", "* Summary table of the commonly used indexing syntax from our own lesson.\n", "\n", "We recommend you open these on separate tabs or print them;\n", "they are handy help for writing your own codes." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.2 Loading Python Libraries\n", "\n", "Next step, we need to import the required libraries into this Jupyter Notebook:\n", "`pandas`, `matplotlib.pyplot` and `seaborn`.\n", "\n", "**For Wahab cluster only**: before importing these libraries, we need to load the `DeapSECURE` *environment modules*:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "module(\"load\", \"DeapSECURE\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can import all the required modules into Python:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy\n", "import pandas\n", "import seaborn\n", "from matplotlib import pyplot as plt\n", "\n", "%matplotlib inline\n", "# ^^ The last line is to make plots to appear automatically\n", "# after the cells are executed" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(Advanced) You can use `display` in lieu of `print` to show multiple _pandas_ objects in nice format in one code cell:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from IPython.display import display" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.3 Loading Sherlock Data\n", "\n", "As in the preceding session, we continue to use a small table extracted from the full Sherlock dataset.\n", "Using `pandas.read_csv`, load the data contained in `sherlock/sherlock_mystery.csv`\n", "into a variable named `df_mystery`.\n", "\n", "> **NOTE:** The required files should have been copied into current directory tree.\n", "> The data file `sherlock_mystery.csv` is located inside the `sherlock` subdirectory.\n", "> (Use `ls sherlock/` statement to verify if that file exists.)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Uncomment and replace #TODO with appropriate filename\"\"\";\n", "\n", "#df_mystery = pandas.read_csv('sherlock/#TODO')\n", "\n", "\"\"\"Also create a copy of the dataframe as a backup\"\"\";\n", "#df_backup = df_mystery.copy()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A copy of the dataset was made; in case you make a mistake with `df_mystery` at a later time, you can just re-initialize by the data by re-copying `df_backup`:\n", "~~~\n", "df_mystery = df_backup.copy()\n", "~~~" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**QUESTIONS**:\n", "If you have not already, perform initial exploration of the dataset now;\n", "use Pandas methods and attributes to answer the following questions:\n", "\n", "1. How many rows and columns are in this dataset?\n", "2. How do the numbers look like?\n", "3. How does the statistical information look like?\n", "4. What does the feature look like? (i.e. the data types)\n", "\n", "*HINT*: use a combination of the DataFrame attributes `shape`, `dtypes`, and/or methods like `head`, `tail`, `describe`, `info`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Experiment with functions and attributes listed above to\n", "obtain initial knowledge into df_mystery.\"\"\";\n", "\n", "#print(df_mystery.#TODO)\n", "#..." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.4 Dataset Features\n", "\n", "The `sherlock_mystery` DataFrame contains a tiny extract of Sherlock's `Application.csv` data file, which contains snapshots of running app's resource utilization (CPU, memory, network).\n", "\n", "`sherlock_mystery` contains 14 columns (also known as *features* or *variables*) and 200 rows (*records*).\n", "\n", "Columns and corressponding datatypes described below\n", "\n", "| Columns | Datatype| Description\n", "|:-------------------|-----|:-----------|\n", "| `Unnamed: 0` |int |Record index|\n", "| `CPU_USAGE` |float| CPU utilization (100% = completely busy CPU) |\n", "| `cutime` |int | CPU \"user time\" spent the spawned (child) processes |\n", "| `lru` |int | \"Least Recently Used\"; This is a parameter of the Android application memory management|\n", "| `num_threads` |int | Number of threads in this process |\n", "| `otherPrivateDirty`|int | The private dirty pages used by everything else other than Dalvik heap and native heap|\n", "| `priority` |int | Process's scheduling priority|\n", "| `utime` |int | Measured CPU \"user time\"|\n", "| `vsize` |int | The size of the virtual memory, in bytes|\n", "| `cminflt` |int | Count of minor faults that the process's child processes|\n", "| `guest_time` |int | Running time of \"virtual CPU\"|\n", "| `Mem` |int | Size of memory, in bytes|\n", "| `queue` |int | The waiting order (priority)|" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## 2. Column Operations\n", "\n", "Pandas provides convenient and intuitive ways to operate on `Series` and columns of `Dataframe` objects.\n", "In this section we will cover the following basic operations:\n", "\n", "* DataFrame column selection\n", "* Arithmetic operations\n", "* String operations\n", "* Custom transformations\n", "\n", "\n", "### 2.1 Column Selection\n", "\n", "When working with data to answer a specific question,\n", "we often want to work with only a subset of columns from the full table.\n", "In the previous session we learn how to create a DataFrame that contains only selected columns using the `[]` subscript operator.\n", "\n", "\n", "\n", "**EXERCISE**:\n", "Using the indexing notation we learned in the previous session, create a new DataFrame that contains only `ApplicationName` and memory-related features: `vsize`, `Mem`, `otherPrivateDirty`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Create a DataFrame containing only memory-related features:\n", "ApplicationName, vsize, Mem, otherPrivateDirty\"\"\";\n", "#df_memstats = df_mystery[#TODO]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**EXERCISE (optional)**:\n", "create another DataFrame that contains only `ApplicationName` and CPU-related features: `CPU_USAGE`, `utime`, `cutime`, `guest_time`." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.2 Removing Column(s) or Row(s)\n", "\n", "In _pandas_, rows or columns that are irrelevant or contain bad data can be deleted using the `drop()` method.\n", "We have to specify the type of deletion by supplying the `axis=0` argument to delete row(s), or `axis=1` to delete column(s).\n", "\n", "#### Removing Column(s)\n", "\n", "To return a new DataFrame with one or more columns removed:\n", "~~~python\n", "# one column only:\n", "df_dropped = df.drop(COLUMN1, axis=1)\n", "\n", "# one or more columns:\n", "df_dropped = df.drop([COLUMN1, COLUMN2, ...], axis=1)\n", "~~~\n", "\n", "where `COLUMN1`, `COLUMN2` ... are the names of the columns we want to drop.\n", "We can also perform an in-place drop operation by adding the `inplace=True` argument:\n", "\n", "~~~python\n", "# in-place deletion, it returns nothing\n", "df.drop([COLUMN1, COLUMN2, ...], axis=1, inplace=True)\n", "~~~\n", "This is an alternative way to select columns, i.e. by dropping the unwanted column(s)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**EXAMPLE 1: Dropping a useless feature.**\n", "Create `df_mystery2` that does not have the `Unnamed: 0` field, which is not a feature at all." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Create new DataFrame which does not contain 'Unnamed: 0'.\n", "Make sure to verify the result.\"\"\";\n", "\n", "#df_mystery2 = df_mystery.drop(#TODO, axis=1)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_mystery2 = df_mystery.drop(['Unnamed: 0'], axis=1)\n", "df_mystery2.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**EXAMPLE 2: Dropping all non-features.**\n", "`ApplicationName` and `Unnamed: 0` are not features.\n", "Create `df_features_only` that has only features." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Create new DataFrame which does not contain 'Unnamed: 0' and 'ApplicationName'.\n", "Make sure to verify the result.\"\"\";\n", "\n", "#df_features_only = df_mystery.#TODO" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**EXAMPLE 3: Dropping a useless feature forever.**\n", "Now remove column `Unnamed: 0` from `df_mystery` for good: we don't need to see it anymore.\n", "\n", "*Hint:* This is an in-place operation which alters `df_mystery`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Write a code to remove 'Unnamed: 0' column from the original DataFrame\"\"\";\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Removing Row(s)\n", "\n", "To delete a row, simply change `axis=1` to `axis=0` and specify the row labels instead of column labels.\n", "\n", "**EXAMPLE 4: Dropping rows labeled 0 through 9.** Create a new DataFrame which remove rows labeled 0 through 9." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Write a code to create a new DataFrame where rows labeled 0..9 are removed\"\"\";\n", "#df_mystery3 = #TODO" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "### 2.3 Arithmetic Operations\n", "\n", "Arithmetic manipulations on `Series` and `DataFrame` objects is as intuitive as we expect it would be.\n", "\n", "**EXAMPLE 1:** The `vsize` column has values in the billions (because today's typical memory sizes are in the gigabytes).\n", "It is better for us to work with units of gigabytes.\n", "Let us create a new Series by removing the \"billion\" prefactor:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Modify and uncomment code below to convert 'vsize' from bytes to gigabytes\"\"\";\n", "\n", "#vsize_gb = df_mystery['#TODO'] * 1e-9\n", "\n", "print(\"Original data:\")\n", "#print(#TODO.head())\n", "print(\"Same data in units of GB:\")\n", "#print(#TODO)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can create a new column in the DataFrame that contains the `vsize` data in units of gigabytes:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_mystery['vsize_gb'] = df_mystery['vsize'] * 1e-9\n", "df_mystery.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*NOTE:* We can also update the existing column to contain the transformed values by reusing the same column name for the assignment, e.g.\n", "~~~\n", "df_mystery['vsize'] = df_mystery['vsize'] * 1e-9\n", "~~~" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Arithmetic involving multiple columns:**\n", "Simply add, subtract, multiply or divide two or more columns as if they were ordinary variables: e.g.\n", "`df[COL1] - df[COL2]`. The operation will be done *element-wise*, matched according to the index.\n", "\n", "**EXAMPLE 2:**\n", "Subtract `guest_time` from `utime` and assign it to a new column called `utime_self`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Uncomment and modify to compute `utime_self`:\"\"\";\n", "#df_mystery['utime_self'] = #TODO" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**BONUS QUESTION:** Print the three columns side-by-side: `utime_self`, `utime`, `guest_time`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "### 2.4 String Operations\n", "\n", "A `Series` object has an attribute named `str`, which contains an extensive collection of string functions, such as:\n", "letter case manipulation, substring search-and/or-replace, etc.\n", "\n", "**EXAMPLE:** Raw data tends to be messy with inconsistent character cases, e.g.\n", "`Facebook`, `facebook`, `FACEBOOK` and `faceBOOK`.\n", "These all refer to the same thing.\n", "To clean this data, we need to make these values more consistent,\n", "for example, by capitalizing all the letters.\n", "Let us illustrate such a cleaning process on the `ApplicationName` column by converting the values to upper-case letters." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"CAPITALIZE all letters in 'ApplicationName' column:\"\"\";\n", "\n", "#df_mystery['ApplicationName'] = df_mystery['#TODO'].str.upper()\n", "#print(df_mystery['ApplicationName'].head(10))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `str.upper()` method converts the string values to _UPPER CASE_ whereas `str.lower()` converts to _lower case_.\n", "To learn more about _pandas_ string capabilities, refer to\n", "Working with Text Data (method summary),\n", "part of _pandas_ user guide." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "### 2.4 Custom Transformations\n", "\n", "A more complex transformation can be applied to each element of a DataFrame or Series by defining a standard Python function, and pass that function to the `apply()` method of the DataFrame or Series.\n", "Here are some examples:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**EXAMPLE 1:**\n", "We can reimplement the unit conversion (bytes-to-gigabytes) using a user-defined function:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Define a function to convert to unit of gigabytes.\n", "# It must take a single argument and return a single value:\n", "def to_giga(val):\n", " return val * 1e-9\n", "\n", "df_mystery['vsize_gb2'] = df_mystery['vsize'].apply(to_giga)\n", "\n", "# Check the result:\n", "print(df_mystery[['vsize', 'vsize_gb2']].head(10))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This does the same thing as the previous one (`vsize_gb`).\n", "Note that the `to_giga` function was called multiple times: *once for each individual element* in the Series or DataFrame.\n", "\n", "> #### NOTES ON PERFORMANCE (*advanced*)\n", "> Whenever possible, rely on the tools provided by _pandas_ to perform your data manipulations.\n", "> _Pandas_' user guide on Computational Tools\n", "> is a good starting point.\n", "> Functions or operations that involve hand-written loops written in Python have very high performance penalty.\n", "> This performance issue is\n", "> discussed briefly in our lesson." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**EXERCISE 2:**\n", "The column `guest_time` contains negative values (please check!) whereas no time value can be negative. 
Fix this by replacing all negative values with zeros using the following function:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Function to convert negative values to zero\n", "def non_negative(val):\n", "    if val > 0:\n", "        return val\n", "    else:\n", "        return 0" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Uncomment and use the `apply` method to replace negative values with zeros:\"\"\";\n", "#df_mystery['guest_time_fix'] = #TODO\n", "\n", "\"\"\"Observe the effect of the function:\"\"\";\n", "#print(#TODO)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "-----\n", "#### Challenge Problem: Help Needed!\n", "\n", "A student named Lee wants to compute the range of the values for each column, but he ran into a problem.\n", "(The *range* of a set of numbers is simply the difference between the maximum and minimum values.)\n", "Lee implements the range calculation in this way:\n", "~~~\n", "def value_range(col):\n", "    return max(col) - min(col)\n", "\n", "print(df_mystery.apply(value_range))\n", "~~~\n", "\n", "Copy the code above to the cell below and run it.\n", "Observe the error and help Lee troubleshoot his problem:\n", "\n", "~~~\n", " in value_range(col)\n", " 2 \n", " 3 def value_range(col):\n", "----> 4 return max(col) - min(col)\n", " 5 \n", " 6 \n", "\n", "TypeError: (\"unsupported operand type(s) for -: 'str' and 'str'\", 'occurred at index ApplicationName')\n", "~~~\n", "*Hint:* Examine the types and the name of the column mentioned in the error." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Troubleshoot and fix Lee's code in this cell\"\"\";\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Actually, the problem above is a perfect example of how we can use _pandas_' built-in capabilities to compute the range.\n", "We learned a number of key methods in the previous session---and one of them returns the `min` and `max` values of each column.\n", "Use the result of that method to compute the range!\n", "\n", "*Hint:* A correct implementation needs only two lines of code!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*(end challenge)*\n", "\n", "---------" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## 3. Filtering\n", "\n", "Now we come to the next important building block in data analytics.\n", "We often need to filter DataFrame rows based on certain values or conditions satisfied in each row.\n", "This alone often answers many of the questions asked in data analytics.\n", "\n", "A **boolean expression** evaluates whether one or more variables satisfy a prescribed condition.\n", "For example," ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Run this and observe the result\"\"\";\n", "df_mystery['CPU_USAGE'] > 50" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "returns a Series of boolean values:\n", "`True` where the corresponding `CPU_USAGE` value is greater than 50,\n", "`False` otherwise." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**QUESTION:** Do you see any `True` value? Why (or why not)? Try tweaking the `50` to a lower value until you see some `True`."
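, "\n", "\n", "*Hint:* because `True` counts as 1 in a sum, the matching rows can be counted directly; a minimal sketch:\n", "\n", "~~~python\n", "print((df_mystery['CPU_USAGE'] > 50).sum())   # number of rows satisfying the condition\n", "~~~"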
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Filtering Rows**--We can combine the boolean expression above with the `[]` operator to return only rows that satisfy the prescribed condition:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Return rows where `CPU_USAGE` value is greater than two:\"\"\";\n", "# df_mystery[#TODO]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Operators supported include: `==` (equal to), `!=` (not equal to), `>`, `>=`, `<`, `<=`. There are a few other frequently used Series methods that also yield boolean values, such as `isna()` (returns True where the corresponding value is *not* defined) and `notna()` (True when a value *is* defined).\n", "To learn more, please check out the\n", "list of common boolean expressions and methods\n", "in our lesson page.\n", "\n", "Boolean expressions can be combined with AND (`&`), OR (`|`) and NOT (`~`) operators to achieve complex data operations.\n", "However, comparison expressions such as `df_mystery['CPU_USAGE'] > 2` must be enclosed with () because of Python's orrder of operator priority." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**EXERCISE 1:** Display records in `df_mystery` where `ApplicationName` is \"Facebook\" **and** `CPU_USAGE` is greater than `0.5`.\n", "\n", "*HINTS*: The expressions before and after the combine operator need to be enclosed in parentheses to evaluate correctly. You may need to reset the contents of `df_mystery` before running this exercise." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Uncomment and modify to display records where ApplicationName is Facebook\n", "and CPU_USAGE is greater than 0.5\"\"\";\n", "\n", "## you may also need: df_mystery = df_backup.copy()\n", "\n", "#df_mystery[(#TODO == 'Facebook') & (#TODO)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**EXERCISE 2**: Display records where `ApplicationName` is \"Facebook\" **or** `CPU_USAGE` is greater than `0.5`.\n", "How many are there?\n", "\n", "*Hint*: Consider from the sentence what operator to use." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Uncomment and modify display records where ApplicationName is Facebook\n", "OR CPU_USAGE is greater than 0.5\"\"\";\n", "\n", "#df_mystery[(#TODO) #TODO (#TODO)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## 4. Sorting\n", "\n", "Sorting is a procedure to reorder data according to certain rules.\n", "Most often, rows are ordered according to one or more columns in numerical or alphabetical order.\n", "\n", "Sorting is helpful to discover events where applications are using the most CPU cycles, the most memory, the most network bandwidth, etc.\n", "Let's check our data for these \"top hogs\"." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*PREPARATION*: Let's define the following `df` to focus our attention only to a subset of relevant columns (and we copy it from the backup):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Make a copy of `df_mystery` with fewer columns for practice.\"\"\";\n", "\n", "df = df_backup[['ApplicationName', 'CPU_USAGE', 'num_threads',\n", " 'vsize', 'utime', 'cutime']].copy()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**EXAMPLE 1: CPU hogs** -- Sort the records based on CPU utilization, from the highest to the lowest. Can you claim which app uses more CPU?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Uncomment and run this to sort the rows in descending order\n", "based on `CPU_USAGE`\"\"\";\n", "\n", "#df_CPU_hog = df.sort_values(by='CPU_USAGE', ascending=False)\n", "#df_CPU_hog.head(20)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**EXAMPLE 2: Memory hogs** -- Repeat the analysis above by sorting the data by memory usage (column: `vsize`). Call the result `df_mem_hog`. Which application uses more memory: `Facebook` or `WhatsApp`?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Produce df_mem_hog and print the top 20 records\"\"\";\n", "#df_mem_hog = #TODO" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**EXAMPLE 3: Thread and CPU hogs** -- Repeat the analysis above by sorting the data using two keys: first by `num_threads`, then by `CPU_USAGE`.\n", "Pass on the sorting keys as a list to the `by=` argument.\n", "Call the result `df_thread_hog`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Uncomment and modify to sort the data by `num_threads` and `CPU_USAGE`\"\"\";\n", "#df_thread_hog = df.sort_values(by=[#TODO], ascending=False)\n", "#df_thread_hog.head(20)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "More complex sorting order can be achieved by feeding lists to both the `by=` and `ascending=` arguments, for example:\n", "\n", "~~~\n", "df.sort_values(by=['num_threads', 'utime'], ascending=[False, True])\n", "~~~\n", "\n", "**QUESTION:** What does the statement above give us? Experiment below and find out:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Use this cell to try out the sort statement above\"\"\";\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**EXAMPLE 4: Idle thread hogs** -- Let us combine filtering and sorting:\n", "For all records where `CPU_USAGE` is zero, sort them in descending order by `num_threads`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Modify and run this cell to complete Example 4\"\"\";\n", "#df_thread_hog2 = df[ #TODO ].sort_values(by=#TODO, ascending=False)\n", "#df_thread_hog2.head(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**EXAMPLE 5:** For this exercise, consider **only** records with nonzero `CPU_USAGE`.\n", "Sort them with two keys: first in ascending order by `CPU_USAGE`, then in descending order by `num_threads`." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Use this cell to answer the question above\"\"\";\n", "#df_thread_hog3 = df[ #TODO ].sort_values(by=#TODO, ascending=False)\n", "#df_thread_hog3.head(20)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## 5. Aggregation\n", "\n", "A major goal of data analytics is to obtain valuable insights from the data. We often seek statistical knowledge about the data by performing **aggregation** on the data.\n", "For example:\n", "\n", " * **top** 10 most used applications;\n", " * **average** memory usage over time;\n", " * **total** CPU utilization over time.\n", "\n", "_Pandas_ provides\n", "a rich set of aggregation (statistical) methods\n", "that are applicable to both `DataFrame` and `Series` objects:\n", "\n", "* `df.max()` --- the maximum value\n", "* `df.min()` --- the minimum value\n", "* `df.mean()` --- the mean (average) of the values\n", "* `df.sum()` --- the sum of the values\n", "* `df.mode()` --- the most frequently occuring values (there can be multiple values of such)\n", "* `df.median()` --- the median of the values\n", "\n", "For a DataFrame, `df.describe()` include many of these.\n", "\n", "**EXPERIMENTS:**\n", "Run these expressions in the cell(s) below and observe the output.\n", "Also try other statistical functions mentioned in the list above.\n", "\n", "* `df.max()`\n", "* `df['lru'].mean()`\n", "* `df['ApplicationName'].mode()`\n", "* `df['cutime'].median()`\n", "* `df['cutime'].sum()`\n", "* `df.max(axis=1, numeric_only=True)`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Uncomment and run individual statements below and observe the output\"\"\";\n", "\n", "#df_mystery.max()\n", "#df_mystery['lru'].mean()\n", "#df_mystery['ApplicationName'].mode()\n", "#df_mystery['cutime'].median()\n", "#df_mystery['cutime'].sum()\n", "#df_mystery.max(axis=1, numeric_only=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "More exercises will be found after we introduce the grouping operation below." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## 6. Grouping\n", "\n", "In the previous section, we computed aggregate quantities for the entire data.\n", "But sometimes we want to know detailed statistical information such as:\n", "\n", " * Maximum memory usage for each application\n", " * Total CPU usage for each application\n", " * CPU usage over time for each application\n", "\n", "These questions require us to *group* the observations based on the application name before applying the aggregate functions.\n", "Group aggregation, often known as **group by** or **split-apply-combine**, involves one or more of the following steps:\n", "\n", " * Splitting the data into groups based on defined criteria;\n", " * Applying a function to each group independently;\n", " * Combining the results into a data structure." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's continue using the `df` DataFrame.\n", "The first step, *splitting*, is performed by the `groupby` method:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.groupby('ApplicationName')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This method has not resulted in anything, because we have not defined the next two steps.\n", "Immediately following `groupby`, we can apply an aggregate method which will perform the *apply* and *combine* steps and yield the result.\n", "\n", "**EXAMPLE 1:** The following example computes in the maximum value of memory usage for each application:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.groupby('ApplicationName')['vsize'].max()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can obtain the same aggregation for all the columns:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.groupby('ApplicationName').max()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**EXAMPLE 2:** Compute the total CPU time spent by each application.\n", "*Hint:* Sum over the `utime` field to get the total CPU time." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Uncomment and modify to calculate the total CPU time for each application\"\"\";\n", "#print(df.groupby('ApplicationName')[#TODO].#TODO)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**EXAMPLE 3:** Count the number of records for each application name.\n", "*Hint:* Use the `size()` method." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Count the number of records per application name\"\"\";\n", "#TODO" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**EXAMPLE 4:** The `describe()` method is actually an aggregate function.\n", "Please apply this function following the `groupby`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Group data by application name, then compute descriptive statistics\"\"\";\n", "#TODO" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*Hint:* Transposing the DataFrame makes reading easier." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## 7. Combining Multiple Datasets\n", "\n", "*Note: Combining datasets is more advanced than the building blocks above, but is included here for completeness.*\n", "\n", "Real-world big data analytics will involve integrating datasets in various forms and from various sources.\n", "In our study case, the Sherlock's `Applications` data may need to be augmented with additional measurements or related bits of information from other tables (e.g., sensor information, screen on/off status, user present/absent, ...).\n", "There are two primary ways to combine datasets:\n", "\n", "1. Adding more measurements of the same kind, which means adding rows into our existing table. This is done by **concatenating** the two or more tables together.\n", "\n", "2. Adding different but related features from other source(s), which technically inserts additional columns into our existing table. 
This is done by **merging** or **joining** two tables.\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 7.1 Concatenating Datasets\n", "\n", "_Pandas_ has the `concat` function to concatenate rows from two or more tables:\n", "`df_all = pandas.concat([df1, df2, ...])`.\n", "This is straightforward if the source DataFrames have the same columns.\n", "Below we cover a case that is slightly less trivial." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**EXAMPLE (*advanced*):**\n", "In your hands-on package there is another application stat table, similar but not identical to `sherlock_mystery.csv`.\n", "Our goal is to concatenate the two to make a big table.\n", "Let's load that other table and examine it:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Read in the following data file...\"\"\";\n", "df_head2k = pandas.read_csv(\"sherlock/sherlock_head2k.csv\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"... and explore the columns, data types, number of records, etc.\"\"\";\n", "#TODO" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**QUESTION**: Did you notice that the second table has significantly more rows *and* columns?\n", "Peek into the data using the functions introduced earlier; you will also notice that it contains records from far more diverse applications than `sherlock_mystery` does!\n", "\n", "If we go ahead and concatenate them, what will happen?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "## make sure df_mystery is reset to the original file:\n", "df_mystery = df_backup.copy()\n", "df_concat = pandas.concat([df_mystery, df_head2k])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There is a warning that makes us a little uncomfortable: `Sorting because non-concatenation axis is not aligned. ...`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Always check the resulting table before proceeding\"\"\";\n", "df_concat.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**QUESTION**: Do you notice any anomaly in the report of the `info()` method above?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**EXERCISE**: Review the contents of `df_concat`--do you observe any issue? (Hints: `head`, `tail`)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Clearly, the tables were concatenated: the beginning records contain data from `sherlock_mystery.csv` while the tail records come from `sherlock_head2k.csv`. _pandas_ is smart enough to align the columns based on their names;\n", "it also fills the columns that are missing in a source table with `NaN` (not-a-number), indicating missing data.\n", "But the result is not perfect, because not all columns are present in both tables.\n", "We have to make a judgment call regarding the missing data (which will be discussed in detail in a subsequent episode of the lesson).\n", "\n", "It is reasonable to limit ourselves to the columns that exist in both tables.
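\n", "\n", "For example, the overlapping columns could also be computed with a Python set intersection rather than typed by hand; a sketch (the intersection may still include non-feature columns, which the hand-picked list below excludes):\n", "\n", "~~~python\n", "columns_overlap = sorted(set(df_mystery.columns) & set(df_head2k.columns))\n", "~~~\n", "\n", "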
Let's use only six columns:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Hand-pick the columns that are present in both tables:\n", "columns_both = ['ApplicationName', 'CPU_USAGE', 'utime', 'cutime',\n", "                'vsize', 'num_threads']\n", "print(columns_both)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(len(columns_both))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now it's time for action: re-read the tables, then concatenate them:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Uncomment and modify the code below to load the two tables, then concatenate them\"\"\";\n", "\n", "#df_set1 = pandas.read_csv(\"sherlock/sherlock_mystery.csv\", usecols=columns_both)\n", "#df_set2 = pandas.read_csv(\"sherlock/sherlock_head2k.csv\", usecols=columns_both)\n", "#df_combined = #TODO" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "## Always check the result using info() and describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 7.2 SQL-like JOIN Operations\n", "\n", "In the previous example, tables were *stacked* vertically.\n", "We can also extend tables horizontally using the DataFrame's `merge()` method, an operation known as JOIN in SQL (Structured Query Language).\n", "In a JOIN operation, rows of the two tables are matched on one or more columns (the *join key(s)*).\n", "Where there is a match, the two rows are placed side-by-side to become a single row in the output table.\n", "Consider this example, which we will carry out shortly:\n", "\n", "*Table 1 (\"left table\"):*\n", "~~~\n", "ApplicationName   CPU_USAGE   num_threads\n", "WhatsApp          10.0        100\n", "Facebook           0.9         72\n", "Facebook           0.0         68\n", "WhatsApp           5.0        120\n", "~~~\n", "\n", "*Table 2 (\"right table\"):*\n", "~~~\n", "ApplicationName   ApplicationType\n", "WhatsApp          Messaging\n", "Facebook          Social Media\n", "~~~\n", "\n", "When we join the two tables above using `ApplicationName` as the join key, we obtain:\n", "\n", "***Output table:***\n", "\n", "~~~\n", "ApplicationName   CPU_USAGE   num_threads   ApplicationType\n", "WhatsApp          10.0        100           Messaging\n", "Facebook           0.9         72           Social Media\n", "Facebook           0.0         68           Social Media\n", "WhatsApp           5.0        120           Messaging\n", "~~~\n", "\n", "The joined table contains only a single copy of the join key column (i.e. no duplicate `ApplicationName` columns).\n", "\n", "In this notebook we will limit ourselves to the \"LEFT OUTER JOIN\" operation.\n", "All rows of the left table will be included in the resulting table.\n", "Row(s) of the right table that match are joined to the corresponding left-table row;\n", "otherwise, the right table's columns are filled with `NaN` to indicate missing data.\n", "\n", "The subsequent exercises merge two tables into a new dataset with more columns, matching rows on a common column (the **join key**) present in both tables.
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**EXAMPLE:**\n", "Let us practice the **outer join** operation by appending extra information about the application in every row.\n", "The left table is given by `df_combined` from the previous section;\n", "the right table is an additional table that we load here:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Uncomment to load `sherlock/app_info.csv`\"\"\";\n", "\n", "#df_appinfo = pandas.read_csv('sherlock/app_info.csv')\n", "#print(df_appinfo.head(10))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**QUESTION:** What are the additional information bits to be attached to `df_combined`?\n", "\n", "Now let's do the \"LEFT JOIN\" merge and observe the output:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Uncomment and modify to merge df_combined and df_appinfo DataFrames\"\"\";\n", "\n", "#df_master = df_combined.merge(right=df_appinfo, how='left', on='ApplicationName')\n", "#print(\"Merged data (head):\")\n", "#display(df_master.head(10))\n", "#print(\"Merged data (tail):\")\n", "#display(df_master.tail(10))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## Key Points\n", "\n", "A quick summary of data manipulation building blocks introduced in this notebook:\n", "\n", "- Initial exploration: `df.shape`, `df.dtypes`, `df.head()`, `df.tail()`, `df.describe()`, and `df.info()`\n", "- Transpose table for readability: `df.T`\n", "- Filtering rows: `df[BOOLEAN_EXPRESSION]`\n", "- Sort rows/columns: `df.sort_values()`\n", "- Data aggregation: `df.max()`, `df.min()`, `df.mean()`, `df.sum()`, `df.mode()`, `df.median()`\n", "- Execute custom function: `df.apply()`\n", "- Group data by column and apply an aggregation function: `df.groupby(['COLUMN1','COLUMN2',...]).FUNC()`\n", "- Merge two tables: `df.merge()`\n", "\n", "In this notebook, we demonstrate these building blocks by practical examples." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 4 }