Key Points
- Introduction to Big Data Analytics and Pandas
- Big Data Challenge: Detecting Malicious Activities on Smartphones
- Fundamental of Pandas
- Analytics of Sherlock Data with Pandas
- Data Wrangling and Visualization
- Outro: Big Data Analytics in Real-World Applications
References
Pandas
Cheatsheets
These are handy reminders to help you write your own analysis pipeline using pandas. Please study these resources and keep them within easy reach.
Seaborn
Spark & PySpark
PySpark overview and programming guides
PySpark API reference
On RDD and DataFrame
Note that Dataset is a generalization of DataFrame; however, the Dataset API is supported only in Scala and Java.
Spark running modes
This is a very technical aspect of Spark, which may be needed by people who set up their own Spark cluster.
- Spark on YARN: This is the “traditional” way of deploying Spark on a Hadoop cluster, coupled with HDFS as the filesystem backend.
- Spark standalone mode: In this mode, Spark master and worker processes must be set up manually (possibly with the help of some setup scripts).
- It is also possible to run Spark with Mesos and Kubernetes, but these options are outside the scope of our training.
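As a rough illustration of how the two running modes above differ from the application's point of view, the sketch below builds the master URL string that a Spark application is pointed at. The helper name `spark_master_url` and the host name `node01` are hypothetical and used only for illustration; the URL forms (`yarn` and `spark://HOST:PORT`, with 7077 as the usual default port of a standalone master) follow the Spark documentation.

```python
def spark_master_url(mode, host=None, port=7077):
    """Return the Spark master URL for a given running mode.

    This is a hypothetical helper for illustration, not part of PySpark.
    """
    if mode == "yarn":
        # On YARN, the cluster location is read from the Hadoop configuration,
        # so the master URL is simply the string "yarn".
        return "yarn"
    if mode == "standalone":
        # In standalone mode, the application connects directly to the
        # manually started Spark master process.
        if host is None:
            raise ValueError("standalone mode requires the master host name")
        return f"spark://{host}:{port}"
    raise ValueError(f"unsupported mode: {mode!r}")


yarn_url = spark_master_url("yarn")
standalone_url = spark_master_url("standalone", host="node01")

# With PySpark installed, the URL would typically be passed like this:
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.master(standalone_url).getOrCreate()
```

In practice the master URL is often supplied outside the program (for example with the `--master` option of `spark-submit`), so that the same analysis code can run unchanged on either kind of cluster.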
Computer Notes
Networking
Glossary
- action (Spark)
- A method of a Spark RDD that triggers the computation and returns the computed results.
- attribute (object)
- In object-oriented programming, an attribute can be thought of as a variable, or a value, that belongs to an object. For example, a DataFrame object called df has an attribute called shape, which describes the dimensions of the tabular dataset. An attribute has to be retrieved along with its owning object, e.g. df.shape. An attribute should not be called with the function call operator, ().
- descriptive statistics
- TODO
- Resilient Distributed Dataset (Spark)
- Resilient Distributed Dataset (RDD) is a representation of a dataset in Spark that can be distributed across multiple machines and is resilient against network or computer failures.
- nested list
- TODO
- network flow
- A network flow, or a traffic flow, or a packet flow, is a sequence of packets from a source computer to a destination, which may be another host, a multicast group, or a broadcast domain. (Wikipedia definition)
- transformation (Spark)
- A method of a Spark RDD which transforms the data into another form; it returns another RDD.
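The "attribute (object)" entry in the glossary above can be illustrated with a short pandas sketch. The column names and values here are made up for illustration and are not taken from the Sherlock dataset. It shows that `shape` is retrieved through its owning object, and what happens if it is mistakenly called like a function.

```python
import pandas as pd

# A tiny, made-up table: 3 rows and 2 columns.
df = pd.DataFrame({
    "app": ["Chrome", "WhatsApp", "Gmail"],
    "cpu_usage": [12.5, 3.1, 0.8],
})

# An attribute is read through its owning object, without parentheses.
dims = df.shape            # a tuple: (number of rows, number of columns)
n_rows, n_cols = dims

# Calling the attribute with () fails, because the tuple it holds
# is not a function.
try:
    df.shape()
    call_error = None
except TypeError as err:
    call_error = str(err)

# By contrast, a method such as head() does need the parentheses.
first_rows = df.head(2)

print(dims)
print(call_error)
```

A common rule of thumb: if the name describes a property of the data (such as `shape`, `columns`, or `dtypes`), it is likely an attribute; if it describes something to do with the data (such as `head()` or `describe()`), it is likely a method.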