DeapSECURE module 2: Dealing with Big Data

Key Points

Introduction to Big Data Analytics and Pandas	Big data refers to data sets that are too large or complex for conventional tools to handle, as well as the methodologies used to tackle such data sets. Pandas is a powerful data framework for ingesting and processing large amounts of data.
Big Data Challenge: Detecting Malicious Activities on Smartphones	Smartphones are a prime target of cybersecurity attacks due to their ubiquity. Researchers use large amounts of data to develop methods to detect and thwart cyber attacks.
Fundamental of Pandas	Pandas stores data as `Series` and `DataFrame` objects. A `Series` is a one-dimensional array of indexed data. Pandas makes it easy to access and manipulate individual elements and groups of elements. Useful DataFrame attributes/methods for initial exploration: `df.shape`, `df.dtypes`, `df.head()`, `df.tail()`, `df.describe()`, and `df.info()`.
Analytics of Sherlock Data with Pandas	Initial exploration: `df.shape`, `df.dtypes`, `df.head()`, `df.tail()`, `df.describe()`, and `df.info()`. Transpose table for readability: `df.T`. Filtering rows: `df[BOOLEAN_EXPRESSION]`. Sort rows/columns: `df.sort_values()`. Data aggregation: `df.max()`, `df.min()`, `df.mean()`, `df.sum()`, `df.mode()`, `df.median()`. Execute custom function: `df.apply()`. Group data by column and apply an aggregation function: `df.groupby(['COLUMN1','COLUMN2',...]).FUNC()`. Merge two tables: `df.merge()`.
Data Wrangling and Visualization	Visualization is a powerful tool to produce insight into data. A histogram is useful for displaying the distribution of values.
Outro: Big Data Analytics in Real-World Applications	Other big data processing frameworks include R and Spark. Pandas is a powerful data framework for ingesting and processing large amounts of data.
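
The pandas operations listed in the key points can be tried on a small table. The following is a minimal sketch only: the table, its column names (`app`, `cpu_usage`, `num_threads`), and the `categories` lookup table are hypothetical stand-ins made up for illustration, not the actual Sherlock data.

```python
import pandas as pd

# Hypothetical miniature table; the column names are invented for this
# sketch and are NOT the real Sherlock dataset schema.
df = pd.DataFrame({
    "app": ["Chrome", "WhatsApp", "Chrome", "Facebook", "WhatsApp", "Chrome"],
    "cpu_usage": [10.0, 5.0, 30.0, 20.0, 15.0, 20.0],
    "num_threads": [40, 20, 44, 60, 22, 42],
})

# Initial exploration
n_rows, n_cols = df.shape        # attribute: no parentheses
first_rows = df.head(3)          # method: needs parentheses

# Filtering rows with a Boolean expression
busy = df[df["cpu_usage"] >= 20.0]

# Sorting rows by the values in one column, largest first
by_cpu = df.sort_values("cpu_usage", ascending=False)

# Aggregation over one column
mean_cpu = df["cpu_usage"].mean()

# Custom function applied to every element of a column
cpu_fraction = df["cpu_usage"].apply(lambda pct: pct / 100.0)

# Group rows by a column, then aggregate each group
max_cpu_per_app = df.groupby("app")["cpu_usage"].max()

# Merge with a second (also hypothetical) table on the shared "app" column
categories = pd.DataFrame({
    "app": ["Chrome", "WhatsApp", "Facebook"],
    "category": ["browser", "messaging", "social"],
})
merged = df.merge(categories, on="app")
```

Each result (`busy`, `by_cpu`, `max_cpu_per_app`, `merged`) is a new object; none of these calls modifies `df` itself.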

References

Pandas

Cheatsheets

These are handy reminders to help you write your own analysis pipeline using pandas. Please study these resources and keep them within easy reach.

Seaborn

Spark & PySpark

PySpark overview and programming guides

PySpark API reference

On RDD and DataFrame

A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets – When to use them and why

Note that a Dataset is a generalization of a DataFrame; however, the Dataset API is supported only in Scala and Java.

Spark running modes

This is a very technical aspect of Spark, which may be needed by people who set up their own Spark cluster.

Spark on YARN: This is the “traditional” way of deploying Spark on a Hadoop cluster, coupled with HDFS as the filesystem backend.
Spark standalone mode: In this mode, Spark master and worker processes must be set up manually (possibly with the help of some setup scripts).
It is also possible to run Spark with Mesos or Kubernetes, but these are outside the scope of our training.

Computer Notes

Networking

List of Popular TCP port numbers

Glossary

action (Spark): A method of a Spark RDD that triggers the computation and returns the computed results.
attribute (object): In object-oriented programming, an attribute can be thought of as a variable, or a value, that belongs to an object. For example, a DataFrame object called `df` has an attribute called `shape`, which describes the dimensions of the tabular dataset. An attribute has to be retrieved along with its owning object, e.g. `df.shape`. An attribute should not be called with the function call operator `()`.
descriptive statistics: TODO
Resilient Distributed Dataset (Spark): A Resilient Distributed Dataset (RDD) is a representation of a dataset in Spark that can be distributed across multiple machines and is resilient against network or computer failures.
nested list: TODO
network flow: A network flow, or a traffic flow, or a packet flow, is a sequence of packets from a source computer to a destination, which may be another host, a multicast group, or a broadcast domain. (Wikipedia definition)
transformation (Spark): A method of a Spark RDD which transforms the data into another form; it returns another RDD.
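
To illustrate the "attribute (object)" entry above, here is a small sketch contrasting an attribute with a method on a pandas DataFrame. The two-column table is made up for this example.

```python
import pandas as pd

# Tiny made-up table, used only to show attribute vs. method access
df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

dims = df.shape          # attribute: retrieved without parentheses
summary = df.describe()  # method: must be called with parentheses

# Calling an attribute as if it were a function raises a TypeError,
# because df.shape is a plain tuple and a tuple cannot be called.
try:
    df.shape()
except TypeError as err:
    message = str(err)
```

Here `dims` holds the number of rows and columns as a tuple, while `summary` is a new DataFrame of per-column statistics.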