Key Points
| Introduction to Big Data Analytics and Pandas |
| Big Data Challenge: Detecting Malicious Activities on Smartphones |
| Fundamentals of Pandas |
| Analytics of Sherlock Data with Pandas |
| Data Wrangling and Visualization |
| Outro: Big Data Analytics in Real-World Applications |
References
Pandas
Cheatsheets
These are handy reminders to help you write your own analysis pipeline using pandas. Please study these resources and keep them within easy reach.
Seaborn
Spark & PySpark
PySpark overview and programming guides
PySpark API reference
On RDD and DataFrame
Note that Dataset is a generalization of DataFrame; however, the Dataset API is supported only in Scala and Java.
Spark running modes
This is a technical aspect of Spark that matters mainly to people who set up their own Spark cluster.
- Spark on YARN: This is the “traditional” way of deploying Spark on a Hadoop cluster, coupled with HDFS as the filesystem backend.
- Spark standalone mode: In this mode, Spark master and worker processes must be set up manually (possibly with the help of some setup scripts).
- It is also possible to run Spark with Mesos and Kubernetes, but these are outside the scope of our training.
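As a rough illustration of how the running mode surfaces in user code, the sketch below builds the master URL string that selects a mode. The helper name `master_url` and the host name `head-node` are hypothetical and exist only for this example; the `local` mode (running Spark on a single machine) is not described above and is included here as an assumption, since it is convenient for testing. Port 7077 is the usual default for a standalone master, but your cluster may differ.

```python
def master_url(mode, host=None, port=7077):
    """Return a Spark master URL for a running mode (hypothetical helper)."""
    if mode == "local":
        # Single machine, using all available cores (assumption: for testing only).
        return "local[*]"
    if mode == "yarn":
        # Cluster location is read from the Hadoop configuration, not the URL.
        return "yarn"
    if mode == "standalone":
        if host is None:
            raise ValueError("standalone mode needs the host of the Spark master")
        return f"spark://{host}:{port}"
    raise ValueError(f"unsupported mode: {mode}")


yarn_url = master_url("yarn")
standalone_url = master_url("standalone", host="head-node")
```

With PySpark installed, a string like this would typically be passed to `SparkSession.builder.master(...)` or to the `--master` option of `spark-submit`.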
Computer Notes
Networking
Glossary
- action (Spark)
- A method of a Spark RDD that triggers the computation and returns the computed results.
- attribute (object)
- In object-oriented programming, an attribute can be thought of as a variable, or a value, that belongs to an object. For example, a DataFrame object called df has an attribute called shape which describes the dimensions of the tabular dataset. An attribute has to be retrieved along with its owning object, e.g. df.shape. An attribute should not be called with the function call () operator.
- descriptive statistics
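- Summary measures that describe the main features of a dataset, such as the count, mean, standard deviation, minimum, maximum, and quartiles.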
- Resilient Distributed Dataset (Spark)
- Resilient Distributed Dataset (RDD) is a representation of a dataset in Spark that can be distributed across multiple machines and is resilient against network or computer failure.
- nested list
- A list whose elements are themselves lists, e.g. [[1, 2], [3, 4]].
- network flow
- A network flow, or a traffic flow, or a packet flow, is a sequence of packets from a source computer to a destination, which may be another host, a multicast group, or a broadcast domain. (Wikipedia definition)
- transformation (Spark)
- A method of a Spark RDD which transforms the data into another form; it returns another RDD.
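
The attribute entry in this glossary can be illustrated with a small pandas sketch. The column names and values below are made up for the example; only `shape` (an attribute) and `describe()` (a method) come from pandas itself.

```python
import pandas as pd

# A tiny made-up table: 3 rows, 2 columns.
df = pd.DataFrame({
    "app": ["browser", "mail", "maps"],
    "cpu_usage": [1.5, 0.2, 3.1],
})

# shape is an attribute: retrieve it through its owning object, without ().
n_rows, n_cols = df.shape

# describe() is a method, so it IS called with the () operator.
summary = df.describe()

# Calling an attribute like a function fails, because df.shape is a tuple.
try:
    df.shape()
    call_failed = False
except TypeError:
    call_failed = True
```

Here `n_rows` and `n_cols` hold the dimensions of the table, while `summary` holds descriptive statistics for the numeric column.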