In the previous episode, we learned that Spark can use multiple cores and multiple computers to process massive amounts of data. In this episode, we introduce the key technical concepts needed to work with Spark.
Resilient Distributed Dataset
In Spark, data is represented as a “Resilient Distributed Dataset” (RDD). There can be more than one RDD in a single Spark environment; RDDs can be merged or joined in many different ways in order to accomplish the desired analysis. Spark assumes that an RDD contains a countable collection of “elements”; this collection can be extremely large. Here are some striking features of RDDs:
- An RDD can contain elements of arbitrary types.
- Spark automatically distributes the elements among the workers in order to increase processing efficiency.
- A Spark RDD is intended to be resilient to failures: should there be a failure in one of the compute nodes (whether a network connectivity failure or a computer crash), Spark will be able to recover by performing the necessary recomputation.
- RDDs are stored and processed in memory; therefore processing large amounts of data will require a commensurately large amount of memory. (This is in contrast to Hadoop’s original “MapReduce” approach, which relies on disk storage for intermediate results. As a result, for the same processing workflow, Spark can be up to 100x faster than disk-based MapReduce.)
RDD Element Datatype
An element in an RDD is loosely defined to allow the flexibility to process arbitrary kinds of data, but a typical RDD contains elements that all have the same type. Here are some common datatypes of RDD elements:
- numerical value (integers or real numbers)
- text value (strings of characters)
- logical value (true or false)
- date/time value
- image
- audio
- video
- and many more.
Complex Datatype
It is very common for an RDD to have a complex datatype, built by combining two or more subelements of the datatypes mentioned above. Our previous spam analysis produces datasets with such a complex datatype. Consider the analysis result of the January 1999 spam emails (only the first five rows are shown below):
1999/01/915202639.14137.txt|193.68.153.2|BG|Bulgaria
1999/01/915258711.14416.txt|139.175.250.157|TW|Taiwan, Province of China
1999/01/915338036.14886.txt|204.126.205.203|US|United States
1999/01/915338371.14888.txt|12.74.105.130|US|United States
1999/01/915338372.14888.txt|153.37.71.59|CN|China
In this example, the content of the entire file is the dataset, whereas each row of the file corresponds to an element of the dataset. Closer inspection shows that each element consists of four subelements, delimited by vertical bar (“pipe”) characters. For the first row, they are:
- 1999/01/915202639.14137.txt – the email’s filename (relative path)
- 193.68.153.2 – the IP address of the email’s origin
- BG – the two-letter country code of the email’s origin
- Bulgaria – the full name of the country
In computer science lingo, a complex datatype is often called a derived datatype, a structure, or a record, whereas a subelement of the datatype is often called a field. There are four fields above, and all of them are of the text datatype.
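As a small illustration (plain Python, not Spark-specific; the variable names here are ours), a single record like the first row above can be split into its four fields using the pipe character as the delimiter:

# Split one record of the spam analysis output into its four fields (plain Python).
record = "1999/01/915202639.14137.txt|193.68.153.2|BG|Bulgaria"
filename, origin_ip, cc2, country = record.split("|")
print(country)   # prints: Bulgaria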
A complex datatype can be well-structured like the example above; however, in Spark it does not necessarily have to be so.
A classic database system usually contains many regularly-structured datasets (a dataset is often called a table in database terminology). This is where Spark differs from a database: Spark has the flexibility to deal with datasets that lack a regular structure, often simply called unstructured data.
In this training module, we will experiment with RDDs that have different element datatypes (but a uniform datatype for a given RDD): integers, text strings, table-like records from a text file (i.e. the spam analysis output above), as well as whole text files.
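For reference, here is a minimal sketch of how RDDs with these element datatypes can be created with the standard PySpark API (the application name and file paths are hypothetical placeholders):

from pyspark import SparkContext

sc = SparkContext(appName="rdd-examples")

RDD_i = sc.parallelize([1, 2, 3, 4, 5])                    # an RDD of integers
RDD_lines = sc.textFile("spam-analysis/1999-01.txt")       # one element per line of the text file
RDD_files = sc.wholeTextFiles("spam-emails/1999/01/")      # one (filename, content) pair per whole file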
DataFrame
When a dataset has a well-defined structure, such as the spam analysis result above, Spark offers an even better representation called a DataFrame. Under the hood, a DataFrame is actually an RDD that assumes a specific structure for the data it holds. Data stored in a DataFrame has a tabular form, where every row contains a complete record and the columns contain the different fields of every record. A DataFrame makes it convenient to process table-like data through a few expressive, simple-to-understand commands. We will introduce both RDDs and DataFrames in this training, and will use DataFrames whenever possible.
Illustration of a DataFrame
The analysis result of the January 1999 spam emails above is a perfect use case for a Spark DataFrame. Here is a tabular representation of the data, loaded into a DataFrame:
                                     a Column
                                        |
                                        V
+-----------------------------+-----------------+-----+---------------------------+
| filename                    | origin_ip       | CC2 | country                   | <- column names
+-----------------------------+-----------------+-----+---------------------------+
| 1999/01/915202639.14137.txt | 193.68.153.2    | BG  | Bulgaria                  | <- a Row
| 1999/01/915258711.14416.txt | 139.175.250.157 | TW  | Taiwan, Province of China |
| 1999/01/915338036.14886.txt | 204.126.205.203 | US  | United States             |
| 1999/01/915338371.14888.txt | 12.74.105.130   | US  | United States             |
| 1999/01/915338372.14888.txt | 153.37.71.59    | CN  | China                     |
| ...                         | ...             | ... | ...                       |
+-----------------------------+-----------------+-----+---------------------------+
In this example, we appropriately named the columns (as we will do later). “CC2” is a shorthand for “two-letter country code”.
In real-world use cases, DataFrames can contain many columns and an enormously large number of rows. Spark is well suited to handle that kind of big data.
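As a preview of how such a DataFrame can be constructed, here is a minimal sketch using the standard PySpark API (the input path and application name are hypothetical, and we assume the pipe-delimited file shown earlier has no header row):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spam-dataframe").getOrCreate()

# Read the pipe-delimited analysis output and supply our own column names.
df = (spark.read
      .option("sep", "|")
      .csv("spam-analysis/1999-01.txt")
      .toDF("filename", "origin_ip", "CC2", "country"))

df.show(5)         # display the first five rows in tabular form
df.printSchema()   # list the columns and their datatypes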
Transformations and Actions
Once an RDD is constructed, we can manipulate the data using two types of operations: transformations and actions.
Transformation
A transformation, as the name suggests, refers to an operation that is applied to every element of the RDD. Some examples include the following (a complete runnable sketch follows the list):
- a mathematical square function applied to every integer x in an RDD:
  RDD_xx = RDD_x.map(lambda x: x*x)
- a string operation to turn all letters into capital letters, for an RDD containing text strings:
  RDD_upper = RDD_s.map(lambda s: s.upper())
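The two snippets above assume that the input RDDs RDD_x and RDD_s already exist. For completeness, here is a minimal runnable sketch, reusing the SparkContext sc from the earlier example; the input values are made up for illustration:

RDD_x = sc.parallelize([1, 2, 3, 4])
RDD_xx = RDD_x.map(lambda x: x*x)
print(RDD_xx.collect())     # [1, 4, 9, 16]  (collect() is an action; see below)

RDD_s = sc.parallelize(["sshd login failed", "cron job started"])
RDD_upper = RDD_s.map(lambda s: s.upper())
print(RDD_upper.collect())  # ['SSHD LOGIN FAILED', 'CRON JOB STARTED']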
We will discuss these using hands-on examples in the following episode. An advanced example in this training module shows how one can directly ingest the spam emails using Spark and extract their IP addresses, while performing the required analysis.
There are a few key concepts that we need to keep in mind about transformations:
- A transformation on an RDD returns another RDD.
- Transformations can be chained: for example, RDD_s can be transformed into RDD_upper, which can then be transformed into RDD_sshd, and so on:
  RDD_upper = RDD_s.map(lambda s: s.upper())
  RDD_sshd = RDD_upper.filter(lambda s: "SSHD" in s)
  This means that RDD_sshd ultimately depends on RDD_s.
- Creating a transformation as in the examples above does NOT immediately execute the requested operations. This is unlike a usual programming expression, such as
  A = B + 5
  where the expression B + 5 is immediately executed and the result is stored in the variable A. This delayed execution is an intended feature of Spark, because it allows optimization of operations as well as recomputation (should that be necessary in the case of faults).
Action
An action is an operation that results in the immediate execution of the command. An action triggers the entire computation of the expression, including the transformations defined by the RDDs that the action depends on.
Coming back to the earlier example where RDD_upper and RDD_sshd were defined: suppose we want to peek into the first five elements returned by RDD_sshd. We execute:
peek = RDD_sshd.take(5)
This triggers the cascading computation of RDD_s, RDD_upper, and RDD_sshd; eventually it returns the first five elements of RDD_sshd.
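Putting the pieces together, here is a minimal end-to-end sketch, again reusing the SparkContext sc from earlier; the sample log lines are made up for illustration:

RDD_s = sc.parallelize([
    "Jan  1 00:00:01 host sshd[123]: Failed password",
    "Jan  1 00:00:02 host cron[456]: job started",
    "Jan  1 00:00:03 host sshd[789]: Accepted publickey",
])

# These two lines only define the transformations; nothing is computed yet.
RDD_upper = RDD_s.map(lambda s: s.upper())
RDD_sshd = RDD_upper.filter(lambda s: "SSHD" in s)

# The action triggers the entire chain of computation.
peek = RDD_sshd.take(5)
print(peek)   # at most five uppercased lines containing "SSHD"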
Comparison with a Database
If you are familiar with databases, you will find many similarities between the concepts of data and processing in Spark and in a database. For example, from the user's perspective, a Spark DataFrame is basically equivalent to a table in a database. In a coming episode you will also see many database-like operations on DataFrames. But there are some stark differences as well:
- Spark does not have the ACID (Atomicity, Consistency, Isolation, Durability) guarantees that are highly demanded of a database system.
- Spark is inherently parallel and scalable, whereas a database computation may not be able to scale beyond a single server.
- Spark is primarily used for analytics; the original input data is not meant to be modified constantly. A database, on the other hand, is meant to store and process data; the stored data may change dynamically to reflect the actual physical or business states it represents. (Think about the use of databases to store and process your financial transactions: your recorded balance will be changing all the time.)
- A database has a rigid structure (i.e. tables) and on-disk format. Spark is tied neither to a particular data structure nor to a certain data format; it supports a wide variety of data formats.