This lesson is in the early stages of development (Alpha version)

DeapSECURE module 1: Introduction to HPC

Key Points

Introduction to High-Performance Computing
  • A supercomputer is a collection of smaller computers combined into one big unit.

  • HPC is used in many domains, essentially anywhere a computer can be used to solve a problem.

  • HPC helps get results faster than traditional computing.

Spam: Everyone's Cybersecurity Issue
  • Spam is an unsolicited email that contains unwanted advertisements, requests, or enticements.

  • Types of spam email include unsolicited advertisements, scams, phishing, and emails with malicious payloads.

  • Spam poses cybersecurity risks through theft of personal information, malicious software, and system break-ins.

  • Powerful supercomputers can tremendously reduce the time to process massive amounts of data through parallel processing.

Accessing HPC
  • An HPC cluster is accessed remotely using secure shell (SSH).

Basic Shell Interaction
  • The UNIX shell provides a basic means to interact with HPC systems.

  • pwd, cd, and ls provide the essential means to navigate around files and directories.

  • Directories and files are addressed by their paths, which can be relative or absolute

  • Basic file management tools: mkdir, cp, mv, rm, rmdir

  • Basic text viewing and editing tools: cat, less, nano
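
The commands in this section can be tried together in a short session. The following is a minimal sketch that works inside a throwaway directory; the names workdir, results, notes.txt, and copy.txt are made up for illustration and are not part of the lesson's data.

```shell
# Work inside a temporary directory so nothing important is touched.
workdir=$(mktemp -d)
cd "$workdir"
pwd                                  # print the absolute path of the current directory

mkdir results                        # create a directory
echo "hello HPC" > notes.txt         # create a small text file
cp notes.txt results/copy.txt        # copy a file (relative paths)
mv notes.txt results/notes.txt       # move a file into the directory
ls results                           # list the directory contents
cat results/copy.txt                 # print the contents of a file

cd results                           # relative path: a subdirectory of where we are
cd ..                                # relative path: the parent directory
cd "$workdir/results"                # absolute path: works from anywhere
cd "$workdir"

rm results/copy.txt results/notes.txt   # delete files
rmdir results                           # rmdir only removes empty directories
```

For interactive viewing and editing, less and nano would be run on a file in the same way as cat, for example less results/notes.txt before the file is removed.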

Text Processing Tools & Pipeline
  • echo prints a message.

  • wc counts the number of lines, words, and bytes in a file.

  • head prints the first few lines of a text file.

  • tail prints the last few lines of a text file.

  • cut selects a particular column or columns of text data from a text file.

  • sort sorts lines of text.

  • uniq prints the unique lines of text; it only collapses adjacent duplicate lines, so it is usually used after sort.

  • grep filters lines of text matching a given text pattern.
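
As an illustration of how these tools behave individually and when chained with pipes, here is a small sketch. The data file and its "address,country" layout are assumptions made for this example: the addresses come from ranges reserved for documentation and the country codes are invented.

```shell
# Hypothetical sample data: one "address,country" record per line.
data=$(mktemp)
printf '%s\n' \
    '192.0.2.10,US' \
    '198.51.100.7,CN' \
    '203.0.113.45,US' \
    '192.0.2.99,BR' \
    '198.51.100.23,US' > "$data"

echo "Number of records:"
wc -l < "$data"                 # count lines

head -n 2 "$data"               # first two lines
tail -n 1 "$data"               # last line
grep ',US$' "$data"             # only lines ending in ",US"
cut -d , -f 2 "$data"           # second comma-separated column

# A pipeline: extract the column, sort it, then drop repeated lines.
cut -d , -f 2 "$data" | sort | uniq

# The same pipeline, counting the distinct values instead of printing them.
n_unique=$(cut -d , -f 2 "$data" | sort | uniq | wc -l)
echo "Distinct country codes: $n_unique"
```

Sorting before uniq matters here because the repeated "US" values are not next to each other in the input.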

Task Automation with Scripts
  • A script is a text file containing a sequence of commands

  • The for statement iterates over a list and runs commands once for each item in the list.

  • The if statement executes commands based on given conditions.
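
A short script combining both statements might look like the sketch below. The list of years and the variable name early are invented for illustration only.

```shell
#!/bin/bash
# Loop over a list of items and branch on a condition for each one.
early=0
for year in 1998 1999 2000; do
    if [ "$year" -lt 2000 ]; then
        echo "$year: before 2000"
        early=$((early + 1))
    else
        echo "$year: 2000 or later"
    fi
done
echo "Years before 2000: $early"
```

Saved in a text file, these commands could be run with bash followed by the file name.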

Investigating the Origin of Spam Emails
  • A spam database is a collection of spam emails that have been gathered over many years to provide a representation of spam circulating on the Internet.

  • A spam database such as the SPAM Archive is helpful to study the characteristics of spam emails, including their origins.

  • The origin of an email can be determined from the IP address recorded in the tracking information in the email’s header.

  • An IP address can be mapped to a geographic location using an appropriate database.
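
To make the header-based approach concrete, the sketch below pulls an IP address out of a Received header using grep. It is not the lesson's spam analyzer: the email text is invented, the address comes from a range reserved for documentation, and the -o option of grep is a common extension rather than a POSIX feature. Real messages usually carry several Received headers, and deciding which one to trust is outside the scope of this sketch.

```shell
# Hypothetical email with a single Received header.
email=$(mktemp)
cat > "$email" <<'EOF'
Received: from mail.example.com (unknown [203.0.113.45])
    by mx.example.org with ESMTP id ABC123
    for <user@example.org>; Mon, 1 Jan 2018 00:00:00 +0000
From: "Sender" <sender@example.com>
Subject: You have won!

Body text.
EOF

# Keep the Received line, then print only the part that looks like an IPv4 address.
origin_ip=$(grep -i '^Received:' "$email" \
    | grep -oE '[0-9]{1,3}(\.[0-9]{1,3}){3}' \
    | head -n 1)
echo "Recorded IP address: $origin_ip"
```

Mapping the extracted address to a location would then require a separate lookup in a geolocation database, which is not shown here.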

Running Spam Analyzer on a High-Performance Computer System
  • A job script is used to launch a computation job on an HPC system.
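
This page does not name the job scheduler, so the following sketch assumes Slurm. The job name, time limit, and output file name are placeholders chosen for illustration, and a real job would run the analysis program where the echo commands are.

```shell
#!/bin/bash
# Assumption: the cluster uses Slurm, which reads the "#SBATCH" lines as job options.
#SBATCH --job-name=spam-demo
#SBATCH --ntasks=1
#SBATCH --time=00:05:00
#SBATCH --output=spam-demo-%j.out

# Everything below runs on the compute node assigned to the job.
node=$(uname -n)
echo "Job running on node: $node"
echo "Started at: $(date)"
```

Under these assumptions the script would be submitted with sbatch followed by the script's file name. Because the directives are ordinary comments to the shell, the same file also runs unchanged with bash, which is convenient for a quick check before submitting.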

Using HPC for Parallel Processing
  • A large job can be split into smaller jobs to reduce the time to solution.
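
The idea can be sketched on a single machine: split a list of inputs into chunks, process the chunks at the same time, and combine the partial results. In this sketch the input names are invented, counting lines stands in for the real work, and background processes stand in for the separate jobs that a cluster would run.

```shell
workdir=$(mktemp -d)

# Hypothetical list of 10 inputs to process.
for i in 1 2 3 4 5 6 7 8 9 10; do
    echo "email_$i.txt"
done > "$workdir/all_inputs.txt"

# Split the list into chunks of at most 4 lines: chunk_aa, chunk_ab, chunk_ac.
split -l 4 "$workdir/all_inputs.txt" "$workdir/chunk_"
n_chunks=$(ls "$workdir"/chunk_* | wc -l)

# Process every chunk in the background so the chunks run concurrently.
for chunk in "$workdir"/chunk_*; do
    wc -l < "$chunk" > "$workdir/count_$(basename "$chunk")" &
done
wait    # block until all background processes have finished

# Combine the partial results.
total=$(cat "$workdir"/count_* | awk '{ s += $1 } END { print s }')
echo "Processed $total inputs in $n_chunks chunks"
```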

Using GNU Parallel on HPC
  • GNU parallel is suitable for executing many single-node jobs that are independent of each other.

  • GNU parallel automates parallel execution of multiple jobs.
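
A minimal sketch of the GNU parallel command line follows. The input names part1.txt to part3.txt are placeholders, and echo stands in for a real processing command. The check at the top is there because GNU parallel may not be installed on every system.

```shell
# Only run the demonstration if GNU parallel is available.
if parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
    # Run one "echo" per argument after ":::", at most 2 at a time (-j 2).
    # {} is replaced by the argument; -k keeps the output in input order.
    parallel_out=$(parallel -k -j 2 echo "processing {}" ::: part1.txt part2.txt part3.txt)
    echo "$parallel_out"
else
    parallel_out=""
    echo "GNU parallel not found; skipping the demonstration"
fi
```

Each argument becomes an independent job, which matches the case described above: many tasks that do not depend on each other.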

Analyzing and Summarizing the Distribution of Spam Origins
  • Simple UNIX tools such as cut, sort, head, tail, and uniq are helpful for analyzing text results.
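
One common way to summarize such results is a frequency ranking. The sketch below assumes a results file with one "message,country" record per line; both the layout and the values are invented for illustration.

```shell
# Hypothetical analysis results: message identifier and country of origin.
results=$(mktemp)
printf '%s\n' \
    'msg001,US' \
    'msg002,CN' \
    'msg003,US' \
    'msg004,BR' \
    'msg005,US' \
    'msg006,CN' > "$results"

# Take the country column, group identical values, count each group,
# order by count (largest first), and keep the top two.
cut -d , -f 2 "$results" | sort | uniq -c | sort -nr | head -n 2

# The same ranking, keeping only the name of the most frequent country.
top=$(cut -d , -f 2 "$results" | sort | uniq -c | sort -nr | head -n 1 | awk '{ print $2 }')
echo "Most frequent origin: $top"
```

Replacing head with tail at the end of the pipeline would show the least frequent origins instead.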

Glossary

FIXME

UNIX Commands

External References

Internet information database

Further Reading

Parallel Computing