
Outro to Real-World Parallel Computing

Overview

Teaching: 30 min
Exercises: 0 min
Questions
  • How is parallelism used in real-world computing?

  • What are the different parallelization approaches?

Objectives
  • Describe how parallelism is used in real-world scientific and engineering applications.

  • Distinguish between the main parallelization approaches: task vs. data parallelism, SIMD vectorization, shared-memory (multicore), accelerator-based, and distributed-memory parallelism.

Various Parallelization Approaches (CPU-Centric)

There are two main types of parallelism: task parallelism and data parallelism.

Problem parallelization: the making of an omelet

Imagine you have been tasked with making an omelet composed of eggs, onion, garlic, tomatoes, salt, and pepper. Consider the preparation stage, just before frying the omelet. It consists of beating the eggs, dicing the onion, mincing the garlic, dicing the tomatoes, combining everything, and finishing up by seasoning with salt and pepper.

An illustration of task parallelism would be giving the different tasks (beating eggs, dicing onion, mincing garlic, dicing tomatoes) to different workers at the same time, then having one worker finish up by combining everything and seasoning with salt and pepper. Here the workers work on their tasks independently and at the same time, and the tasks are all different.

An illustration of data parallelism in your omelet making would be doing each task one at a time but dividing it among the workers. Beating the eggs would be done concurrently by the workers, with the eggs divided among them, and the rest of the tasks would be handled in the same way, before bringing it all together at the end into one final pre-omelet batter.
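To make the distinction concrete in code, the following is a minimal sketch in C with OpenMP (not part of the lesson's own code; the beat_egg, dice_onion, mince_garlic, and dice_tomato functions are hypothetical stand-ins for the omelet steps). The sections construct expresses task parallelism, while the parallel for loop expresses data parallelism.

/* Sketch only: contrasting task parallelism and data parallelism with OpenMP.
   The prep functions are hypothetical placeholders for the omelet steps. */
#include <omp.h>
#include <stdio.h>

#define N_EGGS 12

void beat_egg(int i)    { (void)i; /* beat egg i */ }
void dice_onion(void)   { /* dice the onion */ }
void mince_garlic(void) { /* mince the garlic */ }
void dice_tomato(void)  { /* dice the tomatoes */ }

int main(void) {
    /* Task parallelism: different workers do different tasks at the same time. */
    #pragma omp parallel sections
    {
        #pragma omp section
        { for (int i = 0; i < N_EGGS; i++) beat_egg(i); }
        #pragma omp section
        dice_onion();
        #pragma omp section
        mince_garlic();
        #pragma omp section
        dice_tomato();
    }

    /* Data parallelism: one task (beating eggs) is split across the workers. */
    #pragma omp parallel for
    for (int i = 0; i < N_EGGS; i++)
        beat_egg(i);

    printf("Prep finished; combine and season in a final sequential step.\n");
    return 0;
}

Compiled with an OpenMP-capable compiler (for example, gcc -fopenmp), the first region assigns each different task to a thread, while the second region divides the iterations of a single task among the threads.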

Vectorization (SIMD)

SIMD stands for Single Instruction, Multiple Data, and allows a single instruction to process multiple data elements in parallel. This concept is a component of data parallelism. The goal of SIMD vectorization is to optimize performance by dividing the data among multiple computing resources that simultaneously execute the same single instruction. Domain decomposition can be applied here to help distribute the workload evenly among the computing resources.
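As a small illustration (again a sketch, not the lesson's own code), the loop below in C is SIMD-friendly: the same multiply-add is applied element-wise across an array, so each vector instruction can process several elements at once.

/* Sketch only: a vectorizable loop. With optimization enabled
   (e.g. gcc -O3 -fopenmp-simd) the compiler can turn the loop body
   into SIMD instructions that handle several array elements at a time. */
#include <stdio.h>

#define N 1024

int main(void) {
    float a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { a[i] = (float)i; b[i] = 2.0f * i; }

    #pragma omp simd
    for (int i = 0; i < N; i++)
        c[i] = a[i] + 3.0f * b[i];

    printf("c[N-1] = %f\n", c[N - 1]);
    return 0;
}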

Multicore (Shared Memory Model) Parallelization

Multicore parallelization uses a shared-memory model, although computing resources can also implement a hybrid model (a mix of shared and distributed models). In a shared-memory model, any worker has access to the memory space allocated to the problem.

Shared-memory (SM) parallelism uses multicore threading within a single node. On a cluster, the degree of parallelism is typically requested through the scheduler's cpus-per-task setting. POSIX Threads (Pthreads) and OpenMP are the standard approaches for implementing multi-threaded (multicore) parallel programming. This model is limited by the number of cores sharing the same memory.
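A minimal shared-memory sketch in C with OpenMP (illustrative only): all threads read the same array directly, because they share one address space, and the thread count is set externally (for example via OMP_NUM_THREADS).

/* Sketch only: shared-memory parallel sum with OpenMP.
   Every thread accesses the same array x; the reduction clause
   combines each thread's partial sum safely. */
#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void) {
    static double x[N];
    for (int i = 0; i < N; i++) x[i] = 1.0;

    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += x[i];

    printf("sum = %.1f (up to %d threads)\n", sum, omp_get_max_threads());
    return 0;
}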

Accelerator-Based Parallelism

Demands for parallelism can lead to new processors or processing units that specialize in accelerating a certain type of operation. GPUs, or graphics processing units, as the name implies, specialize in graphics processing and, more generally, in highly data-parallel computation. NPUs, or neural processing units, specialize in accelerating neural-network operations and AI tasks.
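One portable way to target an accelerator from C is OpenMP's target directives; the sketch below (an illustration, not the lesson's own code, and it requires an offload-capable compiler and device, otherwise it falls back to the host CPU) offloads a simple loop to an accelerator.

/* Sketch only: accelerator offload with OpenMP target directives.
   map() copies x and y to device memory, the loop runs on the
   accelerator, and y is copied back to the host afterwards. */
#include <stdio.h>

#define N 1000000

int main(void) {
    static float x[N], y[N];
    for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    #pragma omp target teams distribute parallel for map(to: x) map(tofrom: y)
    for (int i = 0; i < N; i++)
        y[i] = y[i] + 2.0f * x[i];

    printf("y[0] = %f\n", y[0]);
    return 0;
}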

Distributed-Memory Parallelism

Each processor has its own local memory/data. Inter-processor communication is required for a process to access another processor’s memory/data (i.e. to exchange data). Therefore, care must be taken by the programmer to handle synchronization and sharing of data among processors.
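MPI is the standard way to program this model. The sketch below (illustrative only) shows the key idea: each process owns its own local value in its own memory, and data is combined only through an explicit communication call.

/* Sketch only: distributed-memory parallelism with MPI.
   Each process holds a local value; MPI_Reduce explicitly
   communicates and combines the values on rank 0. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Data local to this process only. */
    double local = rank + 1.0;

    /* Explicit message passing is the only way to share it. */
    double total = 0.0;
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum over %d processes = %.1f\n", size, total);

    MPI_Finalize();
    return 0;
}

Such a program is compiled with an MPI wrapper (for example mpicc) and launched with a runner such as mpirun, which starts one independent process per rank.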

Example Applications Requiring Parallel Programming

There are several problems that naturally require parallel programming solutions. These problems take too long to compute using sequential programming. Some are time sensitive, like weather modeling; it would be useless to obtain the weather prediction after it occurred. Others are less time sensitive. Some problems cannot be solved on a single computer due to memory limitations. For example, the first image of a black hole which was released in April 2019 was constructed from a massive amount of data from a number of telescopes around the world, accounting for around five petabytes (five million gigabytes), which could not possibly be held in the memory of a single computer!

Parallel programming allows more accurate and complex computations, as well as improved visualization, in the following areas, as shown in the figure below. From top left to bottom right, the areas are computational fluid dynamics, astrophysics, medical, engineering, climate and environmental science, computer science, and geoscience (geology/seismology). See Credits for External Materials for citations.

Figure: Examples of science and engineering simulation problems requiring parallel computation. Further explanations are provided below. Image courtesy of Lawrence Livermore National Laboratory.

Computational fluid dynamics simulation: simulating a jet engine’s entire turbulent flow path involves multiple-component effects that require several computationally expensive flow solvers. Reported simulation results used 480 processors for the fan/compressor, 80 for the combustor, and 140 for the turbine. A finer, more detailed simulation required approximately 4,000 processors.

Astrophysics: simulations of helium detonations on the surface of neutron stars (density at 90 microseconds) used FLASH. FLASH is a modular, adaptive, parallel simulation code that targets distributed-memory architectures and parallelizes with MPI. One of its parallelization strategies uses a single program, multiple data (SPMD) approach.

Medical: brain mapping algorithms used to detect accelerated gray matter loss in very early-onset schizophrenia. Parallel computing is necessary for creating 3-D MRI scans. The proposed method transfers all of the imaging data (from coil data) to GPU memory, then applies parallel (GPU) CS solvers, followed by a sum-of-squares combination. It also uses a shared-memory model and domain decomposition to partition the data.

Drilling application? (To do)

Climate and environmental science: simulations of ocean models, such as Los Alamos National Laboratory’s MPAS-Ocean model. It uses a hybrid framework that parallelizes with both OpenMP and MPI. A high-resolution simulation used 3,600 cores for the ocean, 3,200 for the sea ice, and 3,600 cores for the coupler components.

Computer science: Internet visualization from The Opte Project, showing the routing paths of the Internet around 2003. More specifically, it depicts the Internet using traceroute vs. BGP data, created in 2003 by Barrett Lyon as part of The Opte Project.

Geoscience (geology/seismology): Harvard’s seismological group’s velocity perturbation data were used to create 3-D iso-surface imaging of the Earth’s mantle. The image shows the (spherical) iso-surface map of shear wave velocity in the mantle using a spherical harmonic expansion to degree 4.

Key Points

  • Parallelism can be expressed as task parallelism (different tasks at the same time) or data parallelism (the same task divided over the data).

  • Common CPU-centric approaches include SIMD vectorization, shared-memory (multicore) threading, and distributed-memory programming; accelerators such as GPUs and NPUs extend these.

  • Many real-world problems, from weather modeling to imaging a black hole, are too large or too time-sensitive to solve without parallel computing.