This document contains additional introductory materials on Apache Spark, an alternative big data analytics framework.
Which One to Choose? Pandas vs. Spark
pandas can be compared to Spark in many ways: each library offers a `DataFrame`, an object that represents a dataset in tabular format.
Here are some similarities:
- Both Spark and pandas can read from and write to popular data formats:
  - pandas IO tools can read/write CSV, JSON, Excel, HDF5, and even Apache's Parquet tabular format; database tables are supported via SQL queries.
  - Spark's `DataFrameReader`, which is customarily accessed via the `spark.read` attribute, also has rich support for formats such as CSV, JSON, database tables (via JDBC), and Apache Parquet. See also the Data Sources page in Spark's user guide.
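To make the parallel concrete, here is a minimal sketch of reading a CSV with both libraries. The pandas part is runnable as-is; the Spark part is shown in comments because it assumes an active `SparkSession` named `spark` and a real file on disk (the file name `scores.csv` is a hypothetical placeholder).

```python
import io

import pandas as pd

csv_text = "name,score\nada,90\nlin,85\n"

# pandas: read CSV from any file-like object into an in-memory DataFrame.
pdf = pd.read_csv(io.StringIO(csv_text))
print(pdf.shape)  # -> (2, 2)

# Spark equivalent (sketch; assumes an active SparkSession named `spark`,
# and a path to an actual file, since spark.read works on files, not strings):
# sdf = spark.read.csv("scores.csv", header=True, inferSchema=True)
# sdf = spark.read.parquet("scores.parquet")  # same reader, other formats
```

Note that both readers expose a similar surface (one call per format), even though pandas loads everything into local memory while Spark only records where and how to read the data.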
But there are several important differences:

- In pandas, a `DataFrame` stores the dataset in memory; therefore the computer must have sufficient memory to hold the entire dataset. Furthermore, the dataset can only reside on a single computer; it is not distributed.
- In Spark, a `DataFrame` represents a (potentially) enormous amount of data that can reside in a distributed fashion across many computers. The dataset does not have to fit in memory: Spark can ingest and process data that is larger than the aggregate memory of the computers on which the Spark subtasks execute.
In terms of data operations, there is also an important difference:
- In pandas, any operation (select, filter, join, etc.) executes immediately and returns a new `DataFrame` with the computed results; in other words, the results are available as soon as the command finishes.
- In Spark, only actions return results that can be fetched, saved, or visualized. The other kind of operation, transformations, only builds up the task pipeline without executing it right away.