This lesson is in the early stages of development (Alpha version)

DeapSECURE module 2: Dealing with Big Data: Introduction to Big Data Analytics and Spark


What Is Big Data?

The term “Big Data” is quite popular today.

“Big data” is a field of computing concerned with ways to analyze, systematically extract information from, or otherwise deal with data sets that are too large or complex to be handled by traditional data-processing software.

Beginning early in the 21st century, the amount and variety of data available for processing have grown exponentially, driven by factors such as the spread of internet-connected devices and sensors, the rise of social media, and the falling cost of storage and computing.

When Is “Big Data” Really Big?

The terms “too large” and “too complex” carry both an absolute and a relative sense. It is generally accepted today that a terabyte of data qualifies as “big data”. However, what counts as “too large” depends on the entity dealing with the data: for a company such as Google, a terabyte is far from “too large”, whereas for an average 10-employee company, that amount of data is already too big to handle. What counts as “big” also shifts over time as technology advances. For example, twenty years ago a gigabyte of data was considered enormous, because most hard drives held only tens or hundreds of megabytes; today, a photo collector can easily amass terabytes of data on their own hard drives.

For a long time, tables containing highly structured pieces of information (text, numbers, and other types of data) have been the prevailing way of representing, storing, and processing data in computer systems. Two popular kinds of software have facilitated the widespread adoption of the tabular data format: spreadsheets and databases. Spreadsheet software is highly popular for personal computing, while most businesses use databases of some sort. Many researchers begin processing their data with a spreadsheet because it is visual and data manipulation is quite intuitive. However, when a table has many rows or columns (for example, tens of thousands of rows), working in a spreadsheet becomes cumbersome; programmatic processing, as sketched below, scales much further.
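As a rough illustration, the following Python sketch streams through a large CSV file one row at a time and computes a simple summary without ever loading the whole table into memory, a pattern that keeps working long after a spreadsheet becomes sluggish. The file name "measurements.csv" and the "value" column are hypothetical:

    import csv

    # Stream through the table one row at a time; memory use stays
    # constant no matter how many rows the file contains.
    total = 0.0
    count = 0
    with open("measurements.csv", newline="") as f:
        for row in csv.DictReader(f):
            total += float(row["value"])
            count += 1

    print("rows:", count, "average value:", total / count)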

The Many V’s of Big Data

There are many ways of characterizing big data.

The three most common V’s of Big Data are:

Volume: the sheer amount of data to store and process.
Velocity: the speed at which new data arrives and must be handled.
Variety: the many forms and formats the data can take.

Data-driven business, data-driven economy, data-driven society

Spark

Spark is a parallel computing framework created to facilitate the processing of very large amounts of data. Spark can also be configured to ingest data in real time, making it capable of handling data arriving at high velocity. Spark makes it easy to process and analyze large amounts of data to produce insights.

Spark supports many programming languages: Scala, Java, Python, and R. Spark itself is written in Scala, which compiles to Java bytecode, so Spark requires a Java Runtime Environment (JRE). In this training we will focus on Spark with Python, often called “PySpark”, because Python is simple and intuitive for beginners. We focus on Spark version 2.3, but most of our material applies to Spark versions 2.0 and up. A minimal PySpark session is sketched below.
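To give a feel for what PySpark code looks like, here is a minimal sketch that starts a Spark session, builds a tiny DataFrame, and runs a simple query. The application name and the example data are illustrative; real workloads would read data from files or other sources:

    from pyspark.sql import SparkSession

    # The SparkSession is the entry point to PySpark (Spark 2.0 and up).
    spark = SparkSession.builder.appName("BigDataIntro").getOrCreate()

    # A tiny in-memory DataFrame for illustration; real workloads would
    # typically read from files, e.g. spark.read.csv("data.csv", header=True).
    df = spark.createDataFrame(
        [("alice", 34), ("bob", 29), ("carol", 41)],
        ["name", "age"])

    df.show()                               # print the table
    print(df.filter(df.age > 30).count())   # count rows with age > 30

    spark.stop()                            # shut down the session

The same DataFrame operations run unchanged whether Spark executes on a laptop or on a cluster of many machines, which is what makes Spark attractive for big data work.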