{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "**DeapSECURE module 2: Dealing with Big Data**\n", "\n", "# Session 1: Fundamentals of Pandas\n", "\n", "Welcome to the DeapSECURE online training program!\n", "This is a Jupyter notebook for the hands-on learning activities of the\n", "[\"Big Data\" module](https://deapsecure.gitlab.io/deapsecure-lesson02-bd/),\n", "Episode 3: [\"Fundamentals of Pandas\"](https://deapsecure.gitlab.io/deapsecure-lesson02-bd/10-pandas-intro/index.html).\n", "Please visit the [DeapSECURE](https://deapsecure.gitlab.io/) website to learn more about our training program.\n", "\n", "\n", "**Quick Links** (sections of this notebook):\n", "\n", "* 1 [Setup](#sec-Setup)\n", "* 2 [Series](#sec-Series)\n", "* 3 [DataFrame](#sec-DataFrame)\n", " - [Loading Sherlock dataset](#sec-Sherlock-load-tiny)\n", " - [Accessing Elements (Indexing)](#sec-DataFrame-indexing)\n", " - [Exercises](#sec-DataFrame-exercises)\n", "* 4 [Visualization](#sec-Visualization)\n", "* 5 [Summary & Further Resources](#sec-Summary)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## 1. Setup Instructions\n", "\n", "If you are opening this notebook from Wahab cluster's OnDemand interface, you're all set.\n", "\n", "If you see this notebook elsewhere and want to perform the exercises on Wahab cluster, please follow the steps outlined in our setup procedure.\n", "\n", "1. Make sure you have activated your HPC service.\n", "2. Point your web browser to https://ondemand.wahab.hpc.odu.edu/ and sign in with your MIDAS ID and password.\n", "3. Create a new Jupyter session using \"legacy\" Python suite, then create a new \"Python3\" notebook. (See ODU HPC wiki for more detailed help.)\n", "4. Get the necessary files using commands below within Jupyter:\n", "\n", " mkdir -p ~/CItraining/module-bd\n", " cp -pr /scratch/Workshops/DeapSECURE/module-bd/. ~/CItraining/module-bd\n", " cd ~/CItraining/module-bd\n", "\n", "The file name of this notebook is `BigData-session-1.ipynb`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.1 Reminder\n", "\n", "* Throughout this notebook, `#TODO` is used as a placeholder where you need to fill in with something appropriate. \n", "* To run a code in a cell, press `Shift+Enter`.\n", "* Use `ls` to view the contents of a directory." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.2 Loading Python Libraries\n", "\n", "Now we need to **import** the required libraries into this Jupyter notebook:\n", "`pandas`, `numpy`, `matplotlib` and `seaborn`.\n", "\n", "**Important**: On Wahab HPC, software packages, including Python libraries, are managed and deployed via *environment modules*:\n", "\n", "| Python library | Environment module name |\n", "|--------------------|-------------------------|\n", "| `pandas` | `py-pandas` |\n", "| `numpy` | `py-numpy` |\n", "| `matplotlib` | `py-matplotlib` |\n", "| `seaborn` | `py-seaborn` |\n", "\n", "In practice, before we can import the Python libraries in our current notebook, we have to load the corresponding environment modules.\n", "\n", "* Load the modules above using the `module(\"load\", \"MODULE\")` or `module(\"load\", \"MODULE1\", \"MODULE2\", \"MODULE n\")` statement.\n", "* Next, invoke `module(\"list\")` to confirm that these modules are loaded." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"(OPTIONAL) Modify and uncomment statements below to load the required environment modules\"\"\";\n", "\n", "#module(\"load\", \"#TODO\")\n", "#module(\"load\", \"#TODO\")\n", "#module(\"load\", \"#TODO\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For convenience, we have prepared an environment module called `DeapSECURE` which includes most of the modules needed for the DeapSECURE training.\n", "Please run the following code cell to make the required Python libraries accessible from this notebook:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "module(\"load\", \"DeapSECURE\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For the notebooks provided for DeapSECURE training, we recommend that you use this approach to get to the core of the exercises quicker." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Confirm the loaded modules\"\"\";\n", "module(\"list\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can import the following Python libraries:\n", "`pandas`, `numpy`, `pyplot` (a submodule of `matplotlib`), and `seaborn`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Uncomment, edit, and run code below to import libraries listed above.\"\"\";\n", "#import #TODO\n", "#import #TODO\n", "#from matplotlib import pyplot\n", "#import #TODO\n", "#%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The last line is an ipython magic command to ensure that plots are rendered inline." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data in Pandas: `Series` and `DataFrame`\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
SeriesDataFrame
1-D labeled array of values2-D tabular data with row and column labels
\n", " Properties: labels, values, data type\n", " \n", " Properties: (row) labels, column names,\n", " values, data type\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## 2. Working with `Series`\n", "\n", "Pandas stores data in the form of `Series` and `DataFrame` objects.\n", "We will learn about both in this section." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A Series object is a one-dimensional array of values of same datatype. A Series is capable of holding any datatype (integers, strings, float, and many others).\n", "\n", "#### 2.1 Creating a `Series` Object\n", "\n", "A Series object can be created by feeding an input list to `pandas.Series` function.\n", "Let's create a series containing ten measurements of CPU loads:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Create a pandas series object named CPU_USAGE:\"\"\";\n", "\n", "#CPU_USAGE = pandas.#TODO([0.16, 0.07, 0.23, 0.24, 0.14, 4.99, 0.23, 0.47, 0.46, 0.17])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Print the contents of CPU_USAGE\"\"\";\n", "\n", "print(CPU_USAGE)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The printout of `CPU_USAGE` shows the three key elements of a Series:\n", "**values**, **labels** (*index*), and **data type**. \n", "\n", "**QUESTION:** Can you identify these elements from the output above?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By default, the index of a `Series` is a sequence of integers starting from 0.\n", "This follows the standard convention in Python for numbering array elements." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Hint: In a Jupyter notebook, you can also display the value of\n", "an object by simply typing its name and pressing Shift+Enter:\"\"\"\n", "\n", "CPU_USAGE" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `len()` function returns the number of elements in the `Series` object, similiar to `list` or `dict` in Python." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Print the length of the CPU_USAGE Series:\"\"\";\n", "\n", "#print(len(#TODO))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(There are many other ways to create series, such as from a Numpy array. Visit [lesson website](https://deapsecure.gitlab.io/deapsecure-lesson02-bd/10-pandas-intro/index.html) for more detail)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.2 Accessing Elements in a `Series`\n", "\n", "#### 2.2.1 Accessing a Single Element\n", "\n", "An element can be accessed by specifying its *label* in a square bracket notation, for example: `CPU_USAGE[3]`,\n", "where `3` is the *label*." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Also try several values of the label instead of 3:\"\"\";\n", "\n", "#CPU_USAGE[#TODO]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Series with an irregular index**: We will encounter many series with index that is not a regular sequence like `0, 1, 2, 3, ...`. As an example, let's create a `Series` named `MEM_USAGE` which has an irregular index:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Run to create a Series object named MEM_USAGE:\"\"\"\n", "\n", "MEM_USAGE = pandas.Series([0.16, 0.07, 0.23, 0.24, 0.14],\n", " index=[3, 1, 4, 2, 7])\n", "print(MEM_USAGE)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**QUESTIONS**:\n", "\n", "* Anything different from the previously defined `CPU_USAGE`?\n", "* What is `MEM_USAGE[3]`?\n", "* What is `MEM_USAGE[4]`?\n", "* Based on the observations above, what is the meaning of the subscript in the indexing operator, `[]`, for a Series object?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Use this cell to answer those questions\"\"\";\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 2.2.1 Accessing a Subset of a Series\n", "\n", "An array of labels (like `[1,3]`) can be fed to the indexing operator to retrieve multiple values from a `Series`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Uncomment and modify to get the values of MEM_USAGE corresponding to labels 1 and 3\"\"\";\n", "\n", "#print(MEM_USAGE[[#TODO,#TODO]])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**QUESTION**: Try out other combinations of labels and observe the returned value." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#print(MEM_USAGE[[#TODO]])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**BOOLEAN-VALUED INDEXING**\n", "\n", "Accessing a `Series` with an array of boolean values (`True` or `False`) is a special case: The indexing operation acts as a \"row selector\", where the True values will select the corresponding elements. *(Note: generally, the length of the boolean array is the same as the length of the `Series` object)*" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Select elements of MEM_USAGE using boolean arrays \"\"\";\n", "\n", "print(MEM_USAGE[[True, False, False, True, True]])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.3 Updating Elements in a `Series`\n", "\n", "At times we need to modify certain elements in a `Series` or a `DataFrame`.\n", "This is accomplished by the use of `.loc[]` subscript operator, which can read or update one or more elements corresponding to the specified labels.\n", "\n", "Here is an example for a single-element update:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\"Before update:\")\n", "print(MEM_USAGE[7])\n", "print(MEM_USAGE.loc[7])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Updating an element with the .loc[] operator\"\"\"\n", "MEM_USAGE.loc[7] = 0.33\n", "print(\"After update:\")\n", "print(MEM_USAGE.loc[7])\n", "print(MEM_USAGE)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**EXERCISE**: Update the element with label `1` to a new value, `1.10`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here is an example for updating multiple elements:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "MEM_USAGE.loc[[1,3]] = [4, 2]\n", "print(MEM_USAGE)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Boolean-array subscripting can be used to selectively update a subset of values.\n", "\n", "**EXERCISE**: Use the Boolean-array indexing to update the contents of `MEM_USAGE` according to this instruction:\n", "\n", "```\n", "Before update After update\n", " 3 2.00 3 1.00\n", " 1 4.00 1 4.00\n", " 4 0.23 ==> 4 0.23\n", " 2 0.24 2 2.00\n", " 7 0.33 3 3.00\n", "```\n", "\n", "(*HINT*: Note which rows are to be changed, and build the Boolean array accordingly as the subscript to the `.loc[]` operator. The contents before update above will be valid if you had executed every cells involving `MEM_USAGE` earlier.)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Modify multiple elements using boolean-array indexing\"\"\";\n", "print(\"Before update:\")\n", "print(MEM_USAGE)\n", "\n", "#MEM_USAGE.loc[#TODO] = #TODO\n", "\n", "print(\"After update:\")\n", "print(MEM_USAGE)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.4 Creating a Copy" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "_pandas_ allows us to duplicate a Series (or later on, a DataFrame) so that we can alter the copy without messing the original, or vice versa.\n", "Use the `copy` method to create a duplicate." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Create a copy of MEM_USAGE into a new variable called COPY_USAGE:\"\"\";\n", "\n", "## Start with the original MEM_USAGE \n", "MEM_USAGE = pandas.Series([0.16, 0.07, 0.23, 0.24, 0.14],\n", " index=[3, 1, 4, 2, 7])\n", "\n", "#COPY_USAGE = MEM_USAGE.#TODO() # use the copy method" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Now alter COPY_USAGE at index label `3` to a new value, 1.23\"\"\";\n", "#COPY_USAGE[#TODO] = #TODO\n", "#print(\"Value in the copy:\", COPY_USAGE[3])\n", "#print(\"Value in the original:\", MEM_USAGE[3])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Can't we create a copy using an assignment operator? Let us try that to see what happens:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"The following code creates a copy of MEM_USAGE Series,\n", "then modifies the copy. Uncomment the lines, run this cell,\n", "and observe what happens to both the original and the copy.\"\"\";\n", "\n", "## Start with the original MEM_USAGE again\n", "MEM_USAGE = pandas.Series([0.16, 0.07, 0.23, 0.24, 0.14],\n", " index=[3, 1, 4, 2, 7])\n", "\n", "#COPY_USAGE = MEM_USAGE # Isn't this sufficient?\n", "#COPY_USAGE[3] = 1.23 # Alter value in index 3\n", "#print(\"Value in the copy:\", COPY_USAGE[3])\n", "#print(\"Value in the original:\", MEM_USAGE[3])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Anything surprising to you?\n", "Can you explain the outcome?\n", "(Hint: Please read https://realpython.com/python-variables/#variable-assignment to understand why.)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[(back to top)](#TOC)\n", "## 3. Working with `DataFrame`\n", "\n", "Things are much more interesting when we work with table-like `DataFrame` data structures.\n", "Similar to the `Series` hands-on above,\n", "we will begin by creating `DataFrame` objects, then learn to access and modify elements of a `DataFrame`.\n", "In the next section (the next notebook) we will learn how we can automate data processing using _pandas_.\n", "\n", "### 3.1 Creating a `DataFrame` Object\n", "\n", "A `DataFrame` object can be created from a variety of inputs:\n", "\n", "* a nested list (i.e., a list of lists),\n", "* a dict of lists,\n", "* a JSON (Javascript Object Notation) file,\n", "* CSV (Comma Separated Values) file,\n", "\n", "and many other ways.\n", "For this notebook, we will limit ourselves to two ways: (1) a nested list, and (2) a CSV file.\n", "\n", "#### 3.1.1 `DataFrame` from a Nested List\n", "\n", "Let's create a `DataFrame` object called `avatars` with the contents shown below (the same data in the figure earlier):\n", "\n", "\n", "\n", "\n", " | Name | Weight | Height |\n", " |----------|--------|--------|\n", " | Apple | 50 | 155 |\n", " | Berry | 46 | 154 |\n", " | Coco | 56 | 156 |\n", " | Dunkin | 44 | 167 |\n", " | Ella | 45 | 150 |\n", "\n", "For a small table like this, we can create the object by calling `pandas.DataFrame()` with a nested list as its first argument.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Creates a DataFrame object from a list of lists (named `source_table` here).\n", "Feed this as the first argument of `pandas.DataFrame` below.\n", "Uncomment the code lines and run them\"\"\";\n", "\n", "source_table = [ ['Apple', 50, 155],\n", " ['Berry', 46, 154],\n", " ['Coco', 56, 156],\n", " ['Dunkin', 44, 167],\n", " ['Ella', 45, 150] ] \n", "\n", "#avatars = pandas.DataFrame(#TODO, columns=['Name', 'Weight', 'Height'])\n", "#print(avatars)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**NOTES**:\n", "\n", "* By default, an integer sequence (`0`, `1`, `2`, ...) is used as the labels (index).\n", " This can be overriden by the `index=` argument.\n", "* Specifying column names is optional;\n", " if not given, the default of an integer sequence will also be used.\n", "* Jupyter can display a `DataFrame` variable in a nice format\n", " by simply mentioning the variable name as the last line of a code cell. For example, type `avatars` in the cell below and run it:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Type `avatars` (no quotes) and run the cell\"\"\";\n", "#TODO" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 3.1.2 How Big is My Data?\n", "\n", "Let's learn how to know the sizes, columns, index, and data types of a `DataFrame` object.\n", "Use the `len` function to get the number of rows (records) in a `DataFrame`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Inquire the number of rows in `avatars`\"\"\";\n", "#print(#TODO)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Get the number of rows and columns using the `shape` attribute:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(avatars.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `shape` attribute yields a *tuple* of two numbers:\n", "the number rows and the number of columns.\n", "Use the subscript operator `[]` to get the individual values:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Uncomment and modify to obtain the number of rows and columns\"\"\";\n", "#print(\"The number of rows =\", avatars.shape[0])\n", "#print(\"The number of columns =\", avatars.#TODO)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `index` attribute yields the index, i.e. the collection of row labels:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Print the index of avatars:\"\"\";\n", "#print(avatars.#TODO)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The names of the columns are given by the `columns` attribute:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(avatars.columns)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Like `shape`, the individual elements in `column` can be accessed using the `[]` operator.\n", "\n", "**QUESTION**: Print the name of the second column." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Make the following code print the name of the second column in df_mystery:\"\"\";\n", "#print(\"The name of the second column is:\", #TODO)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `dtypes` attribute returns the data types associated with the columns:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Print the data types of the columns in avatars:\"\"\";\n", "#print(avatars.#TODO)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will discuss data types more later." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#BEGIN-OPTIONAL-SECTION" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> #### OPTIONAL: Is it a Series or a DataFrame?\n", ">\n", "> At times we may want to know whether an object is a `Series` or a `DataFrame`.\n", "> The `type()` function tells the type of a *Python object*.\n", "> \n", "> **QUESTION**: Please find out the object type of `CPU_USAGE` and `avatars`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Uncomment and modify to print the type of CPU_USAGE variable\"\"\";\n", "#print(#TODO(CPU_USAGE))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Now print the type of avatars\"\"\";\n", "#TODO" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#END-OPTIONAL-SECTION" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 3.1.3 Reading Data from a CSV File\n", "\n", "When performing data analytics with _pandas_, data is usually read from an external file, instead of embedded in the notebook or the script.\n", "This becomes an absolute necessity when handling large sizes of data.\n", "The CSV format is frequently used with _pandas_ because it is straightforward to make and comprehend.\n", "It is a plain text file where each adjacent field is separated by a comma character.\n", "_pandas_ provides the `pandas.read_csv()` function to load data from a CSV file to a `DataFrame`.\n", "\n", "As an example, in your current directory there is a file called `avatars.csv`.\n", "Let's load this to a variable called `avatars_read`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Uncomment and run the code below to load data from 'avatars.csv'.\n", " Replace #TODO with name of the data file.\"\"\";\n", "\n", "#avatars_read = pandas.read_csv(\"#TODO\")\n", "#print(avatars_read)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**QUESTION**: Compare `avatars_read` with the previous DataFrame. Do they have the same contents?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How does _pandas_ know the column names?\n", "To answer that, let's inspect `avatars.csv`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Use the UNIX `cat` command to print the contents of avatars.csv:\"\"\";\n", "! cat avatars.csv" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`pandas.read_csv` does a lot of work for you behind the scenes!\n", "By default, it detects the names of the column from the first row of the table;\n", "it also detects the data type of each column (numbers, strings, etc.)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "### 3.2 Loading Sherlock Data\n", "\n", "In the `sherlock` subdirectory, we have prepared a tiny subset of the Sherlock's \"Application\" dataset in a file named `sherlock_mystery.csv`.\n", "Let us load that data into an object named `df_mystery` and print the contents.\n", "\n", "> **IMPORTANT**: Make sure that you read this data file at this point in order to do the subsequent exercises!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Edit and uncomment to load the Sherlock data file,\n", "'sherlock/sherlock_mystery.csv' \"\"\";\n", "#df_mystery = pandas.#TODO(\"#TODO\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Display the contents of df_mystery\"\"\";\n", "#TODO" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 3.2.1 Initial `DataFrame` Exploration\n", "\n", "When working with a new dataset, we always ask a lot of questions to familiarize ourselves with it. For example:\n", "\n", "* How many columns and rows exist in this dataset?\n", "* What columns are available in the dataset?\n", "* What does the data look like? Can we learn some characteristics about the data?\n", "\n", "**QUESTIONS:** Use the functions you've learned so far to answer the questions above." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Answer the questions above about the `df_mystery` dataset\"\"\";\n", "# How many columns and rows?\n", "#print(#TODO)\n", "# What are the columns?\n", "#print(#TODO)\n", "# Examine the data; what are the data types?\n", "#print(#TODO)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A `DataFrame` has a lot of handy methods and attributes which help us know our new dataset. Use attributes like `shape`, and methods like `info`, `size`, `describe`, `head`, `tail`.\n", "\n", "The `head` and `tail` functions provide a handy way to print only a few records at the beginning and end of a `DataFrame`, respectively:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Uncomment and run to apply head() to df_mystery.\n", " How many rows get printed?\"\"\";\n", "#df_mystery.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Print the first 10 rows of the DataFrame:\"\"\";\n", "#print(df_mystery.#TODO(10))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now experiment with the `tail()` function:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Modify the code to apply tail() to df_mystery. Optionally set the number of rows to display.\"\"\";\n", "#df_mystery.#TODO(#TODO)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `describe()` function provides the statistical information about all the numerical columns:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Apply describe() to df_mystery and observe the output:\"\"\";\n", "#df_mystery.#TODO()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**QUESTION**: How many columns are in the `describe()` output above?\n", "How many columns are in the original dataset?\n", "\n", "**NOTE**: The non-numerical columns will be quietly ignored." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> #### QUIZ: What is the mean value of num_threads?\n", ">\n", "> 1. `71.365`\n", "> 2. `62.000`\n", "> 3. `0.075`\n", "> 4. `none of the above`\n", ">\n", "> " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `.T` transpose operator would rotate the table by 90 degrees by swapping the rows and columns.\n", "It can help us view a long horizontal table better:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Transpose the output of df_mystery.describe() use the .T operator:\"\"\";\n", "#df_mystery.describe().#TODO" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pay attention to the mean, minimum, maximum values. These statistical numbers are worth taking time to digest, because they tell us a lot about the variation (distribution) of the values in each column." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `info()` method prints a lot of useful information about a `DataFrame` object:\n", "\n", "* the information about the index\n", "* the information about every column (the name, the number of non-null elements, and the data type)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Apply the info() function and understand the result:\"\"\";\n", "#df_mystery.#TODO()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The data types are quite important!\n", "`int64` refers to an integer (whole-number);\n", "`float64` refers to a real number;\n", "`object` in the example above is used to contain text data (`WhatsApp`, `Facebook`).\n", "\n", "We have learned that many of these information are given by the `shape`, `columns`, `index`, and `dtypes` attributes of the `DataFrame` object.\n", "\n", "> **EXERCISE**: If you haven't already, print these attributes and compare them with the `info()` output above." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\" Do some exploration using `shape`, `columns`, etc.\"\"\";\n", "#print(#TODO)\n", "#print(#TODO)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[(back to top)](#TOC)\n", "### 3.3 Accessing Elements in a `DataFrame`\n", "\n", "Similar to the `Series` object, the indexing operators `[]` and `.loc[]` can be used to access specific elements in a DataFrame object.\n", "However, there are several forms, each accomplishing different purposes, as we shall learn now.\n", "\n", "Let us create a small subset (6 rows) of the data so we don't see a deluge of data:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df6 = df_mystery.head(6).copy()\n", "df6" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 3.3.1 Individual Column\n", "\n", "The `[]` subscripting operator can provide access to an individual column:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Uncomment code and run, and observe the result:\"\"\";\n", "#df6['ApplicationName']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**QUESTION**: From the output above, can you tell what type of object is returned?\n", "Is it a `DataFrame` or a `Series`?\n", "Explain the reason.\n", "\n", "*Hint*: Use the `type` function to confirm:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Find out the object type of `df6[ApplicationName`]\"\"\";\n", "#TODO" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 3.3.2 Multiple Columns\n", "\n", "We can feed a _list_ of column names into the `[]` operator to select multiple columns.\n", "\n", "**QUESTION**: Select the following columns from the `df6` object: `ApplicationName`, `CPU_USAGE`, `num_threads`. What type of object do you anticipate?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Uncomment and put in the appropriate column names per instruction above\"\"\";\n", "#df6[[#TODO]]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 3.3.3 Filtering Multiple Rows by Boolean\n", "\n", "We can also feed an array of Boolean to the `[]` operator; but unlike the two previous uses, this will select ***rows*** (instead of columns) based on the Boolean values." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Uncomment and run the following codes; explain your observation\"\"\";\n", "\n", "#df6_filt = df6[[True, False, False, True, True, False]]\n", "#df6_filt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 3.3.4 Selecting Row(s) by Labels\n", " \n", "The `.loc[]` operator with a single argument is used to select DataFrame *rows* by the specified label(s). For example:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df6.loc[1]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Like before, since we specify only one row label as a scalar, this operation returns a `Series` object.\n", "\n", "You can also return multiple rows by feeding it with the list of row labels, e.g." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Uncomment and run to select rows labeled 1, 3, 5\"\"\";\n", "#df6.loc[[#TODO]]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 3.3.5 Selecting Individual or Multiple Cell(s)\n", "\n", "The `.loc[]` operator also supports a two-argument syntax to access one or more specific cells in the DataFrame, by specifying both the row and column labels--in that order.\n", "In this fashion, this operator is akin to making cell references in a spreadsheet (such as `A6`, `B20`, etc.). \n", "\n", "An example of selecting an individual cell at row label 1 and column named `num_threads`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Uncomment and run to select an individual cell:\"\"\";\n", "#print(df6.loc[1, 'num_threads'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As in the `[]` operator, the row and/or column specification can be a list also to allow us to create a complex inquiry into the data, for example:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Uncomment the code below, select cells at the rows labeled [1,2,5]\n", "and the following columns: ['num_threads', 'CPU_USAGE', 'ApplicationName']\"\"\";\n", "#print(df6.loc[#TODO,#TODO])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A list of Boolean values can also be given to select the desired row(s) and/or column(s):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Uncomment and modify:\n", "Pass on [False,True,True,False,True] as the row selector,\n", "and select only two columns: 'num_threads' and 'CPU_USAGE'\"\"\";\n", "#print(df6.loc[#TODO,#TODO])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**QUESTION**: The row selector is definitely too short (only 6 elements), whereas `df_mystery` has 200 rows.\n", "What would happen to the rest of the rows if the length of the row selector is less than the number of rows of `df_mystery`?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Similar to the case of `Series`, the `.loc[]` operator can be used to update values in a `DataFrame` object, but the values returned by `[]` should generally be used as read-only.\n", "\n", "**QUESTION**: Please update the value of `CPU_USAGE` for row labeled `2` to `0.01`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Uncomment and modify to update the cell labeled [2, 'CPU_USAGE'] to a new value, 0.01\"\"\";\n", "#df6.loc#TODO" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 3.3.6 Slicing Operator\n", "\n", "In the indexing of rows and columns, the colon character (`:`) can be used in lieu of row or column label(s), and it will mean \"everything\" in that dimension.\n", "\n", "**QUESTION:** *Think first before doing!* What does each expression below mean, and what will be the output?\n", "\n", "~~~python\n", "df6.loc[1, :]\n", "df6.loc[:, 'CPU_USAGE']\n", "~~~" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Use this cell to experiment and observe the output\"\"\";\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The slicing operator also supports one or two endpoints.\n", "Here are several expressions to test:\n", "~~~python\n", "df6.loc[:3]\n", "df6.loc[3:]\n", "df6.loc[2:5]\n", "df6.loc[:3, 'CPU_USAGE']\n", "df6.loc[:3, 'CPU_USAGE':'num_threads']\n", "~~~" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Uncomment code to run and observe the output\"\"\";\n", "#df6.loc[:3]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df6.loc[3:]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df6.loc[:3, 'CPU_USAGE':'num_threads']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[(back to top)](#TOC)\n", "### 3.4 Data Access Exercises\n", "\n", "**EXERCISE 1**:\n", "For each pair of commands below, what are the difference(s) within each pair?\n", "\n", "(pair 1)\n", "~~~\n", "df_mystery['CPU_USAGE']\n", "df_mystery[['CPU_USAGE']]\n", "~~~\n", "\n", "(pair 2)\n", "~~~\n", "df_mystery.loc[1]\n", "df_mystery.loc[[1]]\n", "~~~" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Use this code cell to experiment\"\"\";\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**EXERCISE 2**: Extract a subset of `df_mystery` for rows labeled from 5 through 15 (inclusive endpoints), and include only columns from `ApplicationName` through `priority`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Use this code cell to make your solution\"\"\";\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**EXERCISE 3**: The first column named `Unnamed: 0` seems to contain absurd data.\n", "Using the indexing syntax we've learned here, create a new DataFrame that does not contain this column.\n", "\n", "*Hint:* There are a few ways of doing this, but there is one syntax that is particularly compact. Be mindful of the position of that column." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Use this code cell to make your answer\"\"\";\n", "# df_no_Unnamed = df_mystery#TODO" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**EXERCISE 4**: Let's examine the data in `df6` again.\n", "The `cminflt` column does not contain any valid values (`NaN` means \"not a number\").\n", "Suppose we know that zero is a safe default for `cminflt`.\n", "Write the statement to set all the values in this column to zero.\n", "\n", "*Hint:* In _pandas_, you can use a scalar assignment to assign one value to all the selected cells (columns, rows, or other selections): `df.loc[...] = 0.0`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Use this cell to make your solution\"\"\";\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Please print `df6` below here to confirm that you get the desired effect.\n", "In case you mess up, re-initialize `df6` using:\n", "~~~\n", "df6 = df_mystery.head(6).copy()\n", "~~~\n", "before doing further testings." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df6" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(back to top)\n", "## 4 Making Sense of Data with Visualization\n", "\n", "Visualization is a powerful tool to let us comprehend the data,\n", "because it readily gives us a broader perspective of the data compared to just staring at the numbers.\n", "In this section, we will introduce some basic visualization techniques to make sense of our mystery Sherlock dataset.\n", "A latter notebook will be dedicated to more in-depth visualization techniques." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4.1 Plotting Raw Data\n", "\n", "Let us focus only on the `vsize` column of the `df_mystery` dataset.\n", "_pandas_ makes it very easy to plot the raw data, which we will compare to the descriptive statistics we've encountered earlier. The syntax to plot the values in a Series `S` (remember that a DataFrame column is a Series!) is: `S.plot(kind='line')`.\n", " \n", "**QUESTION**: Plot the values of `vsize` below:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Modify and uncomment to plot 'vsize'\"\"\";\n", "#df_mystery[#TODO].plot(kind='line')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**NOTES**:\n", "\n", "* The vertical tick labels are in the billions (that's the meaning of `1e9` on the top left of the graph);\n", "* The horizontal tick labels are the row labels." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now compare these values with descriptive statistics of the `vsize` column:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Modify and uncomment to print the statistics of 'vsize' column\"\"\";\n", "#print(df_mystery['#TODO'].describe())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**QUESTIONS**:\n", "\n", "* Take a look at the graph of `vsize` above: what is the typical app's memory usage recorded in the df_mystery dataset?\n", "* Is that the typical value consistent with the output of the `describe()` method above?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**QUESTION**: Create another plot for the `CPU_USAGE` field. What is the typical CPU usage of the apps? (Note: 100 means 100% busy CPU)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Create a line plot for `CPU_USAGE`\"\"\";\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4.3 Box-and-Whisker Plot \n", "\n", "Visualizing the descriptive statistics comes in handy when analyzing data.\n", "A box-and-whisker plot (often shortened \"*box plot*\") provides a concise, graphical depiction of the minimum, maximum, median, and quartiles of the values from a set of values.\n", "Khan academy has\n", "a tutorial video that explains\n", "all the parts of a box-and-whisker plot.\n", "\n", "How to do this for the `vsize` column? Simply change the `kind='line'` to `kind='box'` in the plot statement above, and you're all set!\n", "\n", "**QUESTION**: Draw the box-and-whisker plot for the `vsize` column and compare the result with the descriptive statistics above.\n", "\n", "* Where is the median (`50%`) in this graph?\n", "* Where are the `25%` and `75%` percentiles in this graph?\n", "* Where are the `min` and `max` in this graph?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Uncomment and modify to draw a box plot of the 'vsize' data\"\"\";\n", "\n", "#df_mystery['#TODO'].plot(kind='box')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**QUESTION**: Compare the box plot above with the raw data plot produced earlier.\n", "\n", "* Do the values look consistent?\n", "* Does the descriptive statistics match the raw data?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4.4 Visualization with `Seaborn`\n", "\n", "**Seaborn** visualization package can produce similar plots to what _pandas_ alone can make, but with a much more robust set of capabilities and better aesthetics.\n", "Let's try an example here for the box plot:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Uncomment and modify to draw a box plot of 'vsize'\"\"\";\n", "\n", "#seaborn.boxplot(x=df_mystery['#TODO'], color='cyan')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(back to top)\n", "## 5. Summary & Further Resources\n", "\n", "### Pandas\n", "\n", "#### Important Notes on DataFrame Indexing\n", "\n", "In a DataFrame, the `.loc[]` operator is the only way to select specific row(s) by the labels, as the `[]` operator is reserved to select only column(s).\n", "The `.loc[]` operator can also be used to write new values to particular location(s) in the DataFrame.\n", "\n", "#### References & Cheatsheets\n", "\n", "* Our lesson page has a [summary table of the commonly used indexing syntax](https://deapsecure.gitlab.io/deapsecure-lesson02-bd/10-pandas-intro/index.html#summary-indexing-syntax).\n", "\n", "* _pandas_ user's guide has a [comprehensive documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#)\n", "on selecting and indexing data.\n", "\n", "* _pandas_ cheatsheet: https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf\n", "\n", "Please study these resources and keep them within easy reach.\n", "These are handy help when you are writing your own analysis pipeline using _pandas_.\n", "\n", "\n", "### Seaborn\n", "\n", "To learn more about `seaborn`, please visit\n", "Seaborn Tutorial.\n", "Seaborn also has a \n", "gallery of examples\n", "to demonstrate what can be done with this package.\n", "There are many sample codes to get your own visualizations started.\n", "\n", "\n", "### Common Conventions\n", "\n", "It is a common practice for Python programmers to shorten module names:\n", "\n", "* `pd` for `pandas`\n", "* `np` for `numpy`\n", "* `plt` for `matplotlib.pyplot`\n", "* `sns` for `seaborn`\n", "\n", "At the beginning of a script or a notebook, they will declare:\n", "\n", "~~~python\n", "import pandas as pd\n", "import numpy as np\n", "from matplotlib import pyplot as plt\n", "import seaborn as sns\n", "~~~\n", "\n", "DataFrame variables often have `df` in its name---whether\n", "`df`, `df2`, `df_mystery`, `mystery_df`, ....\n", "The `df` part gives people a visual cue that the variable is a DataFrame object." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 4 }