Parallel Computation with Homomorphic Encryption
Overview
Teaching: 0 min
Exercises: 0 min
Questions
Hands-on 3: How can we use parallel computing to speed up HE computations?
Objectives
Use MPI to parallelize the Paillier encryption of an image across many processes.
Computational Cost of Homomorphic Encryption
Encrypted numbers are much more expensive to create and compute with than regular (unencrypted) numbers. Our experiments on Turing (Intel Xeon CPUs with frequencies between 2.1–2.4 GHz) showed that encrypting a number takes more than 40 times as long as performing a single operation on encrypted numbers. In one experiment summing 100 encrypted numbers, encrypting them all took 1.8 seconds, but summing them took only 0.04 seconds. Encryption is therefore the most expensive part, exceeding everything else by a large factor.
Parallelization to Speed Up Homomorphic Operations
In this section, we will demonstrate how one can use parallel computing resources to speed up homomorphic operations. Because we have identified that the most expensive part is the encryption, we will simply demonstrate the parallelization of the encryption.
Task: Encrypting ODU Crown Image
In this hands-on session, we want to encrypt Old Dominion University’s Crown image so that we can securely transfer it to a potentially hostile cloud environment for further processing. Here’s the image in question:
The image is in PNG format and measures 182 by 146 pixels, for a total of 26,572 pixels. We want a faster way to encrypt it using the Paillier scheme and then save the encrypted image to a file.
How long will it take?
Knowing that it takes 1.8 seconds to encrypt 100 numbers, how long should it take to encrypt 26,572 pixels? Assume one pixel is represented by one number (which is true here, because we converted the image to grayscale).
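One way to estimate (this is essentially the solution), assuming the rate measured above scales linearly with the number of pixels:

```python
# Back-of-the-envelope estimate, assuming the measured rate of
# 1.8 s per 100 encryptions scales linearly with the pixel count.
n_pixels = 182 * 146            # 26,572 pixels
seconds_per_number = 1.8 / 100  # measured on Turing
estimate = n_pixels * seconds_per_number
print(f"{estimate:.0f} seconds, i.e. about {estimate / 60:.0f} minutes")
# → 478 seconds, i.e. about 8 minutes
```

Eight minutes to encrypt a tiny 182-by-146 image is a long time, which is exactly why we turn to parallel computing.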
Message Passing Interface for parallel computing
We will be using the industry-standard approach to parallelizing our calculation, called the “Message Passing Interface” (MPI). We will use MPI to divide the task of encrypting the numbers among many processors, which is analogous to a restaurant employing five people to manually wash dishes at five separate sinks. MPI is implemented as a software library and is available in many programming languages, including C, C++, Fortran, Java, and even Python. Python’s interface to MPI is called mpi4py. On Turing, mpi4py leverages Intel’s high-performance MPI library.
MPI is a library that allows a programmer to split a single problem among a number of processes (call this N) running on N CPU cores (or more, as we will learn in the next module).
What is a process?
A process is an independently running program on a computer: your UNIX shell is a process. If you execute `python encrypt_phe.py`, it launches yet another process by running the Python interpreter (which in turn runs `encrypt_phe.py`). Each process is essentially a “worker” that carries out a specified task. Traditional programming is centered on having a single worker do all the tasks specified in the computer program (or script).
We will reserve more details about MPI for the upcoming module;
it is sufficient for now to know that MPI is used to employ one or more
“workers” in order to solve a problem in a concerted way.
With MPI, there will be N independent processes that can communicate with one another through the functionality of the MPI library. Each process has a *rank*, a unique ID that distinguishes it from the other processes. The rank IDs run 0, 1, 2, … N-1. Each process has its own computing capability, and processes can exchange information by sending and receiving messages.
Usually an MPI program in Python will initialize itself in this way:
```python
import mpi4py.MPI as MPI
# ... import other libraries as well

# MPI stuff
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()
```
The `comm` variable is a communicator, essentially an object used to communicate with all the other processes in this MPI “world” (i.e., the set of processes launched together as a single MPI run). `COMM_WORLD` is the largest universe of all these processes, which is sufficient for most MPI programs. `Get_rank` returns the rank ID, as described above. `Get_size` returns the number of processes in this MPI world.
Before we get too far, we need to be aware of a few things:

- Each process has to work on its own chunk of the image, not overlapping with any other process.
- All processes have to use the exact same key so that the image can be recovered by one process.
How will we encrypt the image shown at the beginning of this episode?

1. Each process reads the image to work on. (Because it is such a small image, we let every process read the whole image. In other cases, the input is so large that it must be read in pieces, i.e., not all of the information is available in every process.)
2. Each process identifies what chunk of the image to work on. This is called *domain decomposition*.
3. Each process encrypts the pixels contained in its own chunk of the image.
4. All the processes return the encrypted form of their respective chunks to the master process, which is the process with rank 0.
Step number three above is what yields the time savings, because now we have many workers encrypting the same image.
Domain decomposition
Let us delve deeper into the domain decomposition of this problem. To do this, we need to know how the image is represented. First, we convert the image to grayscale in order to reduce all the color channels per pixel to a single value. This gives us a 146-by-182 matrix. (Remember that the image is a two-dimensional object with 182 by 146 pixels. Image sizes are usually expressed in x-by-y format, whereas matrices are in row-by-column format, i.e., y-by-x. So there are 146 rows in the matrix.)
Next, we divide the image by rows, giving each process an (almost) equal number of rows. If we have 5 processes, 146 / 5 = 29.2 rows per process. We will not work on partial rows; we need full rows. Therefore, we give some processes 30 rows to process. We could do this:
| Process | proc 0 | proc 1 | proc 2 | proc 3 | proc 4 |
|---------|--------|--------|--------|--------|--------|
| # rows  | 29 + 1 | 29     | 29     | 29     | 29     |
The rows must be assigned in order so that the chunks do not overlap: process 0 works on rows 0 through 29 (in 0-based array indexing); process 1 works on rows 30 through 58; and so on.
```python
# Domain decomposition
chunk = n_row // size
if n_row % size > 0:
    if rank < (n_row % size):
        chunk = chunk + 1
        offset = rank * chunk
    else:
        offset = rank * chunk + (n_row % size)
else:
    offset = rank * chunk
```
(Later on, convince yourself that the domain decomposition is right by printing out values such as `chunk` and `offset` for every process.)
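You can do this check without MPI at all: the plain-Python loop below stands in for the 5 processes, runs the same decomposition logic for each rank, and verifies that the chunks cover all 146 rows exactly once with no gaps or overlaps.

```python
n_row, size = 146, 5  # image rows, number of processes

rows_covered = []
for rank in range(size):
    # Same decomposition logic as in the lesson's code
    chunk = n_row // size
    if rank < n_row % size:
        chunk = chunk + 1
        offset = rank * chunk
    else:
        offset = rank * chunk + (n_row % size)
    print(f"rank {rank}: chunk={chunk}, offset={offset}")
    rows_covered.extend(range(offset, offset + chunk))

# Every row appears exactly once, in rank order
assert rows_covered == list(range(n_row))
```

Running this shows rank 0 taking 30 rows starting at row 0, and ranks 1–4 taking 29 rows each, matching the table above.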
To ensure all processes use the same public key for encryption, only process 0 generates the key pair and then broadcasts the public key (`comm.bcast`) to all the other processes, as in the example below.
```python
# Public and private keys
if rank == 0:
    pub_key, pri_key = paillier.generate_paillier_keypair()
else:
    pub_key = None

# Broadcast the public key to all processes
pub_key = comm.bcast(pub_key, root=0)
```
Each process works on its own share of the task and then sends its result to process 0, which combines everything and writes the final result to a single file.
Please turn to the subdirectory `secure-image` in your hands-on directory, then inspect the program called `encrypt_image_mpi.py`. There is no need to understand everything, but please take time to get a sense of what is being done in the different parts of the program.
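To see how the pieces fit together, here is a pure-Python sketch of the whole flow, with no MPI and no `phe` library: the loop over “ranks” stands in for the MPI processes, and `fake_encrypt()` is a placeholder for the real `pub_key.encrypt()` call. (The actual `encrypt_image_mpi.py` does this with real MPI processes and real Paillier encryption.)

```python
def decompose(n_row, size, rank):
    """Row decomposition, same logic as in the lesson."""
    chunk = n_row // size
    if rank < n_row % size:
        chunk += 1
        offset = rank * chunk
    else:
        offset = rank * chunk + (n_row % size)
    return chunk, offset

def fake_encrypt(row):
    # Placeholder: the real program calls pub_key.encrypt(pixel)
    return [pixel + 1 for pixel in row]

n_row, n_col, size = 146, 182, 5
image = [[0] * n_col for _ in range(n_row)]  # stand-in grayscale image

# Steps 1-3: each "rank" reads the image, finds its chunk, encrypts it
pieces = []
for rank in range(size):
    chunk, offset = decompose(n_row, size, rank)
    pieces.append([fake_encrypt(r) for r in image[offset:offset + chunk]])

# Step 4: "rank 0" combines the gathered pieces, in rank order
encrypted = [row for piece in pieces for row in piece]
assert len(encrypted) == n_row and len(encrypted[0]) == n_col
```

In the real program the final combining step uses MPI communication (the chunks are sent to rank 0) rather than a Python list, but the row bookkeeping is the same.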
Executing the MPI program
Now you can run the program in parallel mode (here we choose 24 processes) or in serial mode (1 process). We use a SLURM job script to define the resources needed for this run:
```bash
#!/bin/bash -l
#SBATCH --job-name=MPI_encrypt_24core
#SBATCH --output=MPI_encrypt_24core
#SBATCH --ntasks=24
#SBATCH --nodes=1

source crypto-env-py3

mpirun -n 24 python3 encrypt_image_mpi.py
```
A SLURM script is a shell script (in this case, a bash script). The four lines near the top of the script are special comments with the `#SBATCH` prefix, which are parsed by the job scheduler (SLURM):
- `--job-name` sets the name of the job
- `--output` sets the name of the file containing the output of the job
- `--ntasks` sets the number of MPI tasks to be launched
- `--nodes` sets the number of compute nodes requested
The actual “program” in the script is simple: it sources the environment definition we saw in the setup phase, then launches the 24 processes using the `mpirun` command. The `-n` argument sets the number of processes. With `mpirun`, these processes are coordinated and able to communicate with each other through the communication channels established by the MPI library when the program starts.
To run in 24-process mode, invoke on the shell:

```bash
$ sbatch job_24core.slurm
```
To run in serial mode, invoke on the shell:

```bash
$ sbatch job_01core.slurm
```
Notice that at the end of the output file you will see how much (wallclock) time was spent running the program.
Parallel decryption
Modify the encryption program so that it decrypts the encrypted image (`ODU_Crown_enc.json`) and reconstructs the original image (in grayscale).

Hints

- Use the `mpimg.imsave` function to save the reconstructed image.
Key Points
Encryption is by far the most expensive step in homomorphic computation, so it is the natural target for parallelization.
MPI lets many processes share a task; each process has a unique rank and communicates through a communicator such as `COMM_WORLD`.
Domain decomposition assigns each process its own non-overlapping chunk of the data (here, rows of the image).