This lesson is in the early stages of development (Alpha version)

DeapSECURE module 2: Dealing with Big Data: Setup Hands-on Files and Jupyter Notebook

Workshop Resources

Obtaining Hands-on Materials

If you are taking this training using ODU’s Wahab cluster, please read through the instructions on launching a Jupyter session via Open OnDemand and copying the hands-on files in order to set up your own copy of the files in your own home directory on the cluster.

The downloadable resources below are made available here for the general public to use on their own computers. These were taken from the online workshop series in the Summer of 2021 (a.k.a. “WS-2020-2021”).

Older versions of the hands-on materials can be found in Hands-on Materials in the Extras section of this site.

To download the notebooks and the hands-on files, please right-click on the links below and select “Save Link As…” or a similar menu.

Resources: Jupyter Notebooks

(The HTML files were provided for convenient web viewing.)

Resources: Hands-on Package

The hands-on files are packed in ZIP format. The first two ZIP files above are mandatory. To reconstitute: Unzip all the files, preserving the paths, into the same destination directory.
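For learners who prefer to script the unpacking step, the following is a minimal sketch using Python's standard `zipfile` module. The function name `unpack_all` and the archive names in the usage comment are hypothetical placeholders, not the actual file names of the hands-on package.

```python
import zipfile
from pathlib import Path


def unpack_all(zip_paths, dest_dir):
    """Extract every archive into the same destination, keeping internal paths."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    extracted = []
    for zip_path in zip_paths:
        with zipfile.ZipFile(zip_path) as archive:
            # extractall() recreates the directory layout stored in the archive
            archive.extractall(dest)
            extracted.extend(archive.namelist())
    return extracted


# Hypothetical usage (replace with the ZIP files you actually downloaded):
# unpack_all(["first.zip", "second.zip"], "module-bd")
```

Because every archive is extracted into the same destination directory, files from different ZIP files end up side by side in one directory tree.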

About Jupyter Notebook

Jupyter Notebook is open-source software that provides an interactive web-based platform for code development and execution. Does the UNIX terminal put you off? If so, Jupyter Notebook will be of great comfort. With Jupyter, Python code and UNIX shell commands can be executed interactively from within a single notebook document. Output from these commands is saved in the same notebook alongside the commands, resulting in a convenient record of the computational work. For this reason, many scientists and analysts have chosen Jupyter Notebook to make their research reproducible by others. Here is an example of a Jupyter session:

A screenshot of a Jupyter session

In the world of Jupyter, the word “notebook” refers to two closely related things. Jupyter Notebook (in title case) is the software that provides the rich, web-based capability to interface with an interactive Python session. A notebook (lowercase) is the document and file format that embodies the work done using the Jupyter Notebook software.

As shown in the figure below, Jupyter Notebook actually consists of several interconnected components: the (Python) kernel, the Notebook server, and the web interface that users run in their own web browser. The Notebook server is the front-end component that loads and saves the code+output combination in a notebook document format. A notebook document is actually a JavaScript Object Notation (JSON) file that has the .ipynb extension in its filename. The web interface running in the user’s browser provides the “window” through which the user interacts with the Notebook server. The kernel in the backend runs the actual Python interpreter, which accepts user inputs (received from the web interface) and passes the resulting outputs to the Notebook server to be displayed on the user’s web interface. Jupyter supports more than just the Python kernel: kernels are available for many other languages as well. Further information about Jupyter Notebook can be obtained from the official documentation.

Jupyter Notebooks components

Figure: Components of Jupyter Notebook program. (Source: Jupyter Notebook documentation)
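Since a notebook document is a JSON file, it can be inspected with any JSON parser. The sketch below builds a small, illustrative notebook in the nbformat 4 layout and reads it back with Python's `json` module. The cell contents and the helper name `summarize_notebook` are made up for this example.

```python
import json

# A minimal notebook document in the nbformat 4 layout (contents are illustrative).
notebook_text = json.dumps({
    "cells": [
        {
            "cell_type": "code",
            "execution_count": 1,
            "metadata": {},
            "outputs": [
                {"name": "stdout", "output_type": "stream", "text": ["Python 3\n"]}
            ],
            "source": ["print('Python 3')"],
        },
        {"cell_type": "markdown", "metadata": {}, "source": ["Notes go here."]},
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 4,
})


def summarize_notebook(text):
    """Return the format version and the type of each cell in a notebook."""
    nb = json.loads(text)
    kinds = [cell["cell_type"] for cell in nb["cells"]]
    return {"nbformat": nb["nbformat"], "cell_types": kinds}


summary = summarize_notebook(notebook_text)
print(summary)
```

The same function could be applied to the text of a real .ipynb file, which stores both the code and its saved outputs in this structure.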

Jupyter Notebook can be installed on a personal computer, in which case both the Notebook server and the kernel run in the background on the same computer. Users who want to run Jupyter Notebook on their own computers are encouraged to install the Anaconda distribution, which includes the Python interpreter, popular libraries, and tools, including Jupyter Notebook.

For the subsequent training modules, where we primarily use the Python programming language, you are encouraged to use Jupyter Notebook. In this lesson module, we will create and access an individual Jupyter Notebook server on an HPC cluster. Using an appropriate networking setup, we will be able to interact with this Notebook from the web browser on our own laptop, while harnessing the power of a supercomputer to process big data and/or massive computation. The setup instructions below pertain specifically to running Jupyter Notebook on ODU’s Turing cluster (or, more generally, a cluster running the SLURM batch scheduler). In this approach, both the Notebook server and the kernel run on an allocated compute node, and SSH tunneling is used to provide secure access from the user’s browser to the Notebook server.

Setting Up Jupyter Notebook on HPC Cluster

Using Open OnDemand on ODU Wahab Cluster

On ODU’s Wahab cluster, we have set up Open OnDemand, an easy-to-use web-based access method to its HPC capabilities. Going forward, this is the recommended way to perform the hands-on activities in this module. Please refer to ODU’s documentation page for instructions on how to launch Jupyter on the Wahab cluster. Important: Please select the Legacy Python suite to run the hands-on activities in this module.

Instructions for ODU Turing & Wahab Cluster via SSH (Older)

The following instructions will help you launch Jupyter on Turing or Wahab to support our workshop. At the heart is a bash script called launch_jupyter_turing or launch_jupyter_wahab, which starts the Jupyter Notebook server on a Turing or Wahab compute node and then creates an appropriate SSH tunnel to provide access.

Module environment

(Only needed if you will use Jupyter on Turing.) Edit your ~/.tcshrc_turing file and add the following lines at the end of the file:

enable_lmod
module use /scratch-lustre/DeapSECURE/lmod
module load DeapSECURE

(All users) To provide the module command within the Jupyter environment, create the following directory:

$ mkdir -p -m 0700 ~/.ipython/profile_default/startup

then create a text file named ~/.ipython/profile_default/startup/000-odurc-lmod.py with the following contents:

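import sys
import os

# Make the Lmod "module" command available inside IPython/Jupyter sessions.
try:
    from lmod_python import module
except ModuleNotFoundError:
    # Fall back to the site-wide Lmod init directory if it is not yet on sys.path.
    sys.path.append('/shared/apps/common/lmod/scripts/init')
    from lmod_python import module

# If a Python virtual environment is active, put its site-packages first.
vdir = os.getenv('VIRTUAL_ENV')

if vdir:
    sys.path.insert(0, '{}/lib/python{}.{}/site-packages'.format(vdir, sys.version_info.major, sys.version_info.minor))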

Step 1 (Windows Users Only)

We need a basic UNIX-like environment to run the Jupyter launch script. Windows users need additional software to provide this functionality. We recommend that Windows users install MobaXterm, which consists of an SSH client, a UNIX-like local environment, and numerous other features. The Home edition is sufficient for most people. With the newer versions of MobaXterm (versions 12 and higher), you will also need to download CygUtils.plugin and place it where the MobaXterm executable resides (usually C:\Program Files (x86)\MobaXterm Home Edition\). CygUtils.plugin provides a local UNIX shell environment. This UNIX environment will be used briefly to run a script that sets up the Jupyter Notebook server on the cluster and makes it accessible from your computer. (Alternatively, download the portable version of MobaXterm, which already includes CygUtils.plugin. Unpack all the files in the zip file to a location you can remember.)

Step 2 (All Users)

Do these steps on your own computer—not on Turing or Wahab or other cluster! You only need to do this step once.

Linux and Mac Users (Optional)

You have the option of saving the launch script in a different location that is accessible via the PATH environment variable. We can use the ~/bin directory to store the script and set its executable bit:

$ mkdir -p ~/bin
$ mv ~/Desktop/launch_jupyter_wahab ~/bin
$ chmod a+x ~/bin/launch_jupyter_wahab

For bash shell users, add the following line at the end of your ~/.bashrc file:

export PATH="$PATH:$HOME/bin"

Most Linux and Mac systems use the bash shell by default, so follow the instruction above unless you know that you are using a different shell. Exception: newer versions of macOS (Catalina and higher) use zsh as the default shell. Users of these systems should add the export statement above to the ~/.zshenv file instead.
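To confirm that the script can be found through PATH after opening a new terminal, a quick check along the following lines may help. This is only a sketch: the helper name `find_on_path` is made up here, and it relies on Python's standard `shutil.which`.

```python
import shutil


def find_on_path(script_name, path=None):
    """Return the full path of an executable found on PATH, or None."""
    # path=None makes shutil.which() use the current PATH environment variable
    return shutil.which(script_name, path=path)


location = find_on_path("launch_jupyter_wahab")
if location is None:
    print("Not found on PATH; check the export line and open a new terminal.")
else:
    print("Found:", location)
```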

(end optional steps)

Step 3: Launching Jupyter Notebook on Wahab (All Users)

Do these steps on your own computer—not on Wahab or other cluster! Now we are ready to launch and connect to Jupyter on Wahab. Launch the script we downloaded in Step 2 above using bash.

$ bash ~/Desktop/launch_jupyter_wahab --user XXXX --local-port YYYY -t 1-0

Important: Replace XXXX with your own MIDAS ID (something like xjohn099—not xjohn099@odu.edu), and YYYY with a port number greater than 1023.
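If you are unsure whether a port number is already in use on your own computer, it can be tested before launching. The following is a minimal sketch using Python's standard `socket` module. The function name `is_local_port_free` is hypothetical, and the check only covers the local machine at the moment it runs.

```python
import socket


def is_local_port_free(port, host="127.0.0.1"):
    """Return True if nothing on this computer is currently bound to the port."""
    if not 1024 <= port <= 65535:
        raise ValueError("choose a port between 1024 and 65535")
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            # Binding fails with OSError when another process holds the port
            sock.bind((host, port))
        except OSError:
            return False
    return True


# Example: check the port used in the sample output of this lesson
print(is_local_port_free(8780))
```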

Wait until a message such as the one below is displayed:

OK: Jupyter server launched successfully
Setting up pipeline to connect to the Jupyter server...
Pseudo-terminal will not be allocated because stdin is not a terminal.
xjohn099@turing.hpc.odu.edu's password:
Now please launch your browser to connect to Jupyter:

    http://localhost:8780/?token=0295648a1647dc8eeda92746732918d4410046b61db84412

Do not kill this SSH process as it will kill the pipeline!

In case the pipeline dies, you can reconnect using:

    ssh -L 8780:coreV2-22-036:8425  xjohn099@turing.hpc.odu.edu

remove /Users/xjohn099/.jupyter_remote.CeFzjI?

After seeing the message above, open the URL shown in it in your web browser. Jupyter will appear similar to the image below, which displays the contents of your current directory.

Jupyter Displayed in Browser

Well done, you have successfully launched Jupyter on the cluster and made it accessible on your computer!

Warning

Do not close or interfere with the current state of the open terminal above, or else the connection to the Jupyter notebook will be terminated.

The launch command above will retain the Jupyter session for 1 day.
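The sample output in this step contains two related pieces of information: the browser URL and the `ssh -L` forwarding argument. As an illustration of how they fit together, the sketch below takes both apart using values copied from that sample output. The helper names are made up for this example.

```python
from urllib.parse import urlsplit, parse_qs


def parse_jupyter_url(url):
    """Return (host, port, token) from a Jupyter URL with a ?token= query."""
    parts = urlsplit(url)
    token = parse_qs(parts.query).get("token", [None])[0]
    return parts.hostname, parts.port, token


def parse_forward_spec(spec):
    """Split an `ssh -L` argument of the form local_port:remote_host:remote_port."""
    local_port, remote_host, remote_port = spec.split(":")
    return int(local_port), remote_host, int(remote_port)


host, port, token = parse_jupyter_url(
    "http://localhost:8780/?token=0295648a1647dc8eeda92746732918d4410046b61db84412"
)
forward = parse_forward_spec("8780:coreV2-22-036:8425")

print(host, port)
print(forward)
```

The port in the browser URL is the local end of the tunnel, while the host name and second port in the forwarding argument identify the compute node and the port where the Notebook server listens.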


Setting Up Jupyter on a Personal Computer (Laptop or Desktop)

The core hands-on materials, as they are presented in the (pandas-based) lesson, can be performed using typical computers with at least 8 GB RAM and about 1 GB disk space. Learners interested in working with the entire sample “SherLock” dataset will require significantly more computing resources (both RAM and disk storage); this is where HPC and appropriate cloud resources come in. Learners can set up a Jupyter computing environment on their own personal computer by installing Anaconda Distribution (formerly called Anaconda Individual Edition or Anaconda Personal Edition). Next, download the workshop hands-on files listed in the Workshop Resources section and you have the learning environment ready.
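After installation, a quick sanity check of the local environment can be done from Python. The sketch below is only an assumption about what is worth checking: it looks for roughly 1 GB of free disk space in the current directory and tests whether the `pandas` and `notebook` packages can be found. The function name `check_environment` is hypothetical, and RAM is not checked.

```python
import importlib.util
import shutil


def check_environment(packages=("pandas", "notebook"), min_free_gb=1.0, location="."):
    """Report free disk space at `location` and any packages that cannot be found."""
    free_gb = shutil.disk_usage(location).free / 1e9
    # find_spec() returns None when a top-level package is not installed
    missing = [name for name in packages if importlib.util.find_spec(name) is None]
    return {
        "free_gb": round(free_gb, 1),
        "enough_disk": free_gb >= min_free_gb,
        "missing": missing,
    }


print(check_environment())
```

An empty `missing` list and `enough_disk` set to True suggest that the basic requirements for the core hands-on materials are in place.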

Using Jupyter

Jupyter has a very user-friendly interface, which is discussed in this section.

Classic Jupyter Notebook vs Jupyter Lab

On Wahab, we are deploying the newer interface called Jupyter Lab. The instructions in this section were written for the classic Jupyter Notebook interface, but they apply to both the classic and Lab interfaces with slight variations. Please refer to their respective documentation to learn more.

Creating a New Notebook

Click on New in the upper right corner, as shown in the image below. Selecting Python 3 creates a new session with a blank notebook (named Untitled.ipynb by default) in your current working directory.

Clicking on New

Working with Notebooks

A Jupyter notebook consists of a sequence of cells, as shown in the following illustration:

Jupyter created

In the first cell, a Python statement prints the text Python 3. In the second cell, a shell command named ls was executed, and the output was displayed right after the command. The third (and current) cell is empty. Indeed, Jupyter allows Python code to coexist with shell commands, making the notebook a convenient one-stop interface for both worlds.
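The first two cells of the illustration can be approximated in plain Python as follows. This is a sketch rather than the exact cell contents: in a notebook, the shell command is usually typed as `!ls`, which only works inside Jupyter/IPython, so the standard `subprocess` module is used here to stay runnable as ordinary Python on a UNIX-like system.

```python
import subprocess

# First cell: a Python statement that prints a text
print("Python 3")

# Second cell: list the current directory.
# In a notebook cell the same listing is usually written as `!ls`.
listing = subprocess.run(["ls"], capture_output=True, text=True, check=True)
print(listing.stdout)
```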

The focused cell is displayed with a green or blue bounding box. A green boundary denotes that you are in Edit mode, in which you are editing the current cell. In contrast, a thick blue left margin with gray boundary lines indicates Command mode. Press Esc to switch to Command mode and Enter to switch to Edit mode. (In Jupyter Lab, the color and indicator may look different, but a small text on the status bar at the bottom of the screen indicates the mode: Edit or Command.)

Edit Mode

While in Edit mode, you can type your code or command into a cell. Pressing Enter creates a new line within the cell. To execute the cell, press Shift + Enter.

Jupyter Edit Mode

Command Mode

In this mode, the keyboard maps to a set of shortcuts. Some of the actions may not affect just a single cell, but the entire notebook. For example, you can create cells, copy cells, move cells, or restart the kernel; many other actions are available.

jupyter Command Mode

Below are a few shortcuts in the command mode.

To learn more about the keyboard shortcuts available in Edit and Command modes, go to the Help menu and choose Keyboard Shortcuts.

Mouse Navigation

The menubar within the Jupyter interface has icons which execute specific actions. Hover over any icon. Do you observe the tooltip? It explains the specific command of the icon.

menubar icons

Working with Documents, Files, and Terminals

Jupyter can also create and edit Python or other text files, manage files and folders, and launch a terminal running a shell.

For further learning on the notebook’s interface, please go to Jupyter documentation website.

Hands-on Files

The DeapSECURE hands-on exercises can be run on many platforms. They were initially created and tested for ODU Wahab cluster, but can also be run on a sufficiently powerful local computer with a standalone Python distribution such as Anaconda. Please find below the instructions for the platform you will be using. Your instructor or mentor should have informed you concerning which platform you should be using.

Preparing Hands-on Files on ODU Wahab Cluster

To prepare for the exercises on Wahab, please run the following commands in the shell. (This can be done using a terminal session under SSH, or a terminal session within Jupyter.)

On Wahab, the hands-on files are located in this directory:

/shared/DeapSECURE/module-bd/

(For Turing, the location is /scratch-lustre/DeapSECURE/module-bd/Exercises).

Create a directory ~/CItraining/module-bd:

$ mkdir -p ~/CItraining/module-bd

Copy the entire directory tree to your ~/CItraining/module-bd:

$ cp -pr /shared/DeapSECURE/module-bd/. ~/CItraining/module-bd/

Be careful! Every character matters (even a period must not be missed). Do NOT insert whitespace where there is none in the command above!

Now change directory to ~/CItraining/module-bd,

$ cd ~/CItraining/module-bd

and you are ready to learn! If you are using the Jupyter notebooks (see the resources near the top of this page), navigate your Jupyter’s file browser to this directory and select the appropriate notebook to open.
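For those working from a notebook cell rather than a shell, the copy step above can be approximated in Python with `shutil.copytree`. This is a rough sketch, not an exact equivalent of `cp -pr`: it requires Python 3.8 or newer for `dirs_exist_ok`, and symbolic links are handled differently. The function name `copy_hands_on_files` is made up for this example.

```python
import shutil
from pathlib import Path


def copy_hands_on_files(source, destination):
    """Copy a whole directory tree (hidden files included) and list what arrived."""
    dest = Path(destination).expanduser()
    # copytree() uses copy2() by default, which also keeps timestamps and permissions
    shutil.copytree(Path(source), dest, dirs_exist_ok=True)
    return sorted(str(p.relative_to(dest)) for p in dest.rglob("*") if p.is_file())


# Usage on Wahab, with the paths given in this section:
# copy_hands_on_files("/shared/DeapSECURE/module-bd", "~/CItraining/module-bd")
```

The returned list of relative paths gives a quick way to confirm that the files landed in the expected directory.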