{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "**DeapSECURE module 2: Dealing with Big Data**\n", "\n", "# Session 1: Fundamentals of Pandas\n", "\n", "Welcome to the DeapSECURE online training program!\n", "This is a Jupyter notebook for the hands-on learning activities of the\n", "[\"Big Data\" module](https://deapsecure.gitlab.io/deapsecure-lesson02-bd/),\n", "Episode 3: [\"Fundamentals of Pandas\"](https://deapsecure.gitlab.io/deapsecure-lesson02-bd/10-pandas-intro/index.html).\n", "Please visit the [DeapSECURE](https://deapsecure.gitlab.io/) website to learn more about our training program.\n", "\n", "\n", "**Quick Links** (sections of this notebook):\n", "\n", "* 1 [Setup](#sec-Setup)\n", "* 2 [Series](#sec-Series)\n", "* 3 [DataFrame](#sec-DataFrame)\n", " - [Loading Sherlock dataset](#sec-Sherlock-load-tiny)\n", " - [Accessing Elements (Indexing)](#sec-DataFrame-indexing)\n", " - [Exercises](#sec-DataFrame-exercises)\n", "* 4 [Visualization](#sec-Visualization)\n", "* 5 [Summary & Further Resources](#sec-Summary)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## 1. Setup Instructions\n", "\n", "If you are opening this notebook from the Wahab cluster's OnDemand interface, you're all set.\n", "\n", "If you see this notebook elsewhere and want to perform the exercises on the Wahab cluster, please follow the steps outlined in our setup procedure.\n", "\n", "1. Make sure you have activated your HPC service.\n", "2. Point your web browser to https://ondemand.wahab.hpc.odu.edu/ and sign in with your MIDAS ID and password.\n", "3. Create a new Jupyter session using the \"legacy\" Python suite, then create a new \"Python3\" notebook. (See the ODU HPC wiki for more detailed help.)\n", "4. Get the necessary files using the commands below within Jupyter:\n", "\n", " mkdir -p ~/CItraining/module-bd\n", " cp -pr /scratch/Workshops/DeapSECURE/module-bd/. ~/CItraining/module-bd\n", " cd ~/CItraining/module-bd\n", "\n", "The file name of this notebook is `BigData-session-1.ipynb`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.1 Reminder\n", "\n", "* Throughout this notebook, `#TODO` is a placeholder that you need to replace with something appropriate.\n", "* To run the code in a cell, press `Shift+Enter`.\n", "* Use `ls` to view the contents of a directory." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.2 Loading Python Libraries\n", "\n", "Now we need to **import** the required libraries into this Jupyter notebook:\n", "`pandas`, `numpy`, `matplotlib` and `seaborn`.\n", "\n", "**Important**: On Wahab HPC, software packages, including Python libraries, are managed and deployed via *environment modules*:\n", "\n", "| Python library | Environment module name |\n", "|--------------------|-------------------------|\n", "| `pandas` | `py-pandas` |\n", "| `numpy` | `py-numpy` |\n", "| `matplotlib` | `py-matplotlib` |\n", "| `seaborn` | `py-seaborn` |\n", "\n", "In practice, before we can import the Python libraries in our current notebook, we have to load the corresponding environment modules.\n", "\n", "* Load the modules above using the `module(\"load\", \"MODULE\")` or `module(\"load\", \"MODULE1\", \"MODULE2\", \"MODULE n\")` statement.\n", "* Next, invoke `module(\"list\")` to confirm that these modules are loaded." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"(OPTIONAL) Modify and uncomment the statements below to load the required environment modules\"\"\";\n", "\n", "#module(\"load\", \"#TODO\")\n", "#module(\"load\", \"#TODO\")\n", "#module(\"load\", \"#TODO\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For convenience, we have prepared an environment module called `DeapSECURE`, which includes most of the modules needed for the DeapSECURE training.\n", "Please run the following code cell to make the required Python libraries accessible from this notebook:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "module(\"load\", \"DeapSECURE\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For the notebooks provided for DeapSECURE training, we recommend that you use this approach to get to the core of the exercises more quickly." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Confirm the loaded modules\"\"\";\n", "module(\"list\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can import the following Python libraries:\n", "`pandas`, `numpy`, `pyplot` (a submodule of `matplotlib`), and `seaborn`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"Uncomment, edit, and run the code below to import the libraries listed above.\"\"\";\n", "#import #TODO\n", "#import #TODO\n", "#from matplotlib import pyplot\n", "#import #TODO\n", "#%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The last line is an IPython magic command to ensure that plots are rendered inline." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data in Pandas: `Series` and `DataFrame`\n", "\n", "
| Series | DataFrame |\n", "|---|---|\n", "| 1-D labeled array of values | 2-D tabular data with row and column labels |\n", "| Properties: labels, values, data type | Properties: (row) labels, column names, values, data type |\n", "