This lesson is in the early stages of development (Alpha version)

DeapSECURE module 2: Dealing with Big Data

Key Points

Introduction to Big Data Analytics and Pandas
  • Big data refers to data sets that are too large or complex, as well as methodologies to tackle such data sets.

  • Pandas is a powerful Python data-analysis library for ingesting and processing large amounts of data.

Big Data Challenge: Detecting Malicious Activities on Smartphones
  • Smartphones are a prime target of cybersecurity attacks due to their ubiquity.

  • Researchers use large amounts of data to develop methods to detect and thwart cyber attacks.

Fundamentals of Pandas
  • Pandas stores data as Series and DataFrame objects.

  • A Series is a one-dimensional array of indexed data.

  • Pandas makes it easy to access and manipulate individual elements and groups of elements.

  • Useful DataFrame attributes/methods for initial exploration: df.shape, df.dtypes, df.head(), df.tail(), df.describe(), and df.info().
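The objects and exploration methods above can be sketched with a tiny example; the column names and values here are made up for illustration, not taken from the lesson's dataset:

```python
import pandas as pd

# A Series is a one-dimensional array of indexed data.
s = pd.Series([3, 7, 5], index=["a", "b", "c"])

# A DataFrame is a two-dimensional table of labeled columns.
df = pd.DataFrame({
    "app": ["Chrome", "Maps", "Chrome", "Mail"],
    "cpu_pct": [12.5, 3.1, 20.0, 1.4],
})

# Initial exploration:
print(df.shape)       # (rows, columns) -- an attribute, so no () call
print(df.dtypes)      # data type of each column
print(df.head(2))     # first two rows
print(df.describe())  # summary statistics of numeric columns
df.info()             # column names, non-null counts, memory usage
```

Note that `shape` and `dtypes` are attributes (no parentheses), while `head()`, `describe()`, and `info()` are methods.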

Analytics of Sherlock Data with Pandas
  • Initial exploration: df.shape, df.dtypes, df.head(), df.tail(), df.describe(), and df.info()

  • Transpose table for readability: df.T

  • Filtering rows: df[BOOLEAN_EXPRESSION]

  • Sort rows/columns: df.sort_values()

  • Data aggregation: df.max(), df.min(), df.mean(), df.sum(), df.mode(), df.median()

  • Execute custom function: df.apply()

  • Group data by column and apply an aggregation function: df.groupby(['COLUMN1','COLUMN2',...]).FUNC()

  • Merge two tables: df.merge()
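The operations listed above can be sketched on a tiny made-up table (the SherLock dataset itself is far larger; the `app`, `cpu_pct`, and `mem_mb` columns here are illustrative assumptions, not the real schema):

```python
import pandas as pd

# Miniature stand-in for a per-process measurement table.
df = pd.DataFrame({
    "app": ["Chrome", "Maps", "Chrome", "Mail"],
    "cpu_pct": [12.5, 3.1, 20.0, 1.4],
    "mem_mb": [310, 120, 450, 80],
})

# Filtering rows with a boolean expression:
busy = df[df["cpu_pct"] > 5.0]

# Sorting rows by a column:
by_mem = df.sort_values("mem_mb", ascending=False)

# Aggregation over a column:
avg_cpu = df["cpu_pct"].mean()

# Group by a column, then apply an aggregation function per group:
cpu_by_app = df.groupby("app")["cpu_pct"].mean()

# Merge with a second table on a shared key column:
vendors = pd.DataFrame({"app": ["Chrome", "Maps", "Mail"],
                        "vendor": ["Google", "Google", "Apple"]})
merged = df.merge(vendors, on="app")
```

`df.apply()` follows the same pattern: pass it a custom function, and it is applied along rows or columns of the DataFrame.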

Data Wrangling and Visualization
  • Visualization is a powerful tool for gaining insight into data.

  • A histogram is useful for displaying the distribution of values.
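A minimal sketch of plotting a histogram from a Pandas Series; the data values are made up, and Matplotlib's non-interactive Agg backend is assumed so the script runs headless:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # render to a file instead of a window
import matplotlib.pyplot as plt

# Made-up numeric data for illustration.
s = pd.Series([1.2, 1.5, 1.4, 2.8, 3.0, 2.9, 1.3, 2.7])

# A histogram shows how values are distributed across bins.
ax = s.plot.hist(bins=4)
ax.set_xlabel("value")
plt.savefig("histogram.png")
```

For more polished statistical plots, the Seaborn library (referenced below) builds on the same Matplotlib foundation.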

Outro: Big Data Analytics in Real-World Applications
  • Other big data processing frameworks include R and Spark.

  • Pandas is a powerful data framework for ingesting and processing large amounts of data.

References

Pandas

Cheatsheets

These are handy reminders to help you write your own analysis pipeline using pandas. Please study these resources and keep them within easy reach.

Seaborn

Spark & PySpark

PySpark overview and programming guides

PySpark API reference

On RDD and DataFrame

Note that Dataset is a generalization of DataFrame; however, the Dataset API is supported only in Scala and Java.

Spark running modes

This is a very technical aspect of Spark, which may be needed by people who set up their own Spark cluster.

Computer Notes

Networking

Glossary

action (Spark)
A method of a Spark RDD that invokes the computation and returns the computed results.
attribute (object)
In object-oriented programming, an attribute can be thought of as a variable, or a value, that belongs to an object. For example, a DataFrame object called df has an attribute called shape which describes the dimensions of the tabular dataset. An attribute has to be retrieved along with its owning object, e.g. df.shape. An attribute should not be called with the function call () operator.
descriptive statistics
Summary quantities that describe a dataset, such as the count, mean, median, minimum, maximum, and standard deviation; in Pandas, df.describe() computes these for the numeric columns of a DataFrame.
Resilient Distributed Dataset (Spark)
Resilient Distributed Dataset (RDD) is a representation of a dataset in Spark that can be distributed across multiple machines and is resilient against network or computer failure.
nested list
A list whose elements are themselves lists, e.g. [[1, 2], [3, 4]]; nested lists are one way to represent tabular data in plain Python.
network flow
A network flow, or a traffic flow, or a packet flow, is a sequence of packets from a source computer to a destination, which may be another host, a multicast group, or a broadcast domain. (Wikipedia definition)
transformation (Spark)
A method of a Spark RDD that transforms the data into another form and returns another RDD.