This lesson is in the early stages of development (Alpha version)

DeapSECURE module 2: Dealing with Big Data: Outro to Big Data Computing

On DataFrame

The DataFrame approach we learned in this lesson is very popular in the analytics world. In this section we mention a number of other frameworks similar to Spark DataFrame. The purpose is to show you that the approach you learned in this lesson is transferable to other tools and languages, because the underlying concepts are universal.

Other Frameworks Using DataFrame-like Approaches

Pandas

Website: https://pandas.pydata.org/

Pandas is a data analysis library written in Python. It builds on NumPy and works well alongside other well-established Python libraries such as SciPy and Matplotlib. DataFrame is Pandas’ primary representation of structured data. Many operations such as selecting, sorting, filtering, joining, and aggregating are supported. The main difference from Spark is that Pandas keeps the entire dataset in the memory of a single computer, so the amount of data it can handle is limited by that computer's memory. Nevertheless, Pandas is a very popular framework among data scientists.
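
As a minimal sketch of how the operations listed above look in Pandas, consider the example below. The two small tables, their column names, and their values are made up for illustration and are not part of this lesson's dataset. The comments name the Spark DataFrame method that plays roughly the same role.

```python
import pandas as pd

# Hypothetical toy data, made up for illustration only.
emails = pd.DataFrame({
    "sender": ["alice", "bob", "alice", "carol", "bob"],
    "size_kb": [12, 250, 48, 7, 90],
    "is_spam": [False, True, False, False, True],
})
senders = pd.DataFrame({
    "sender": ["alice", "bob", "carol"],
    "domain": ["example.org", "example.com", "example.net"],
})

# Select columns (Spark: df.select)
selected = emails[["sender", "size_kb"]]

# Filter rows (Spark: df.filter or df.where)
large = emails[emails["size_kb"] > 40]

# Sort rows (Spark: df.orderBy)
by_size = emails.sort_values("size_kb", ascending=False)

# Join two tables on a shared column (Spark: df.join)
joined = emails.merge(senders, on="sender", how="left")

# Group and aggregate (Spark: df.groupBy(...).agg(...))
totals = joined.groupby("domain", as_index=False).agg(
    n_messages=("size_kb", "count"),
    total_kb=("size_kb", "sum"),
)

print(totals)
```

Every step here runs immediately and entirely in the memory of one machine, which is why this style works well for small and medium-sized data but does not scale the way Spark does.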

R

Website: data.frame reference

R has built-in support for dataframes. The main function to construct a dataframe is aptly called data.frame. A very popular package called dplyr popularized the %>% pipeline notation (which originates from the magrittr package) for convenient manipulation of dataframes: selecting, sorting, filtering, joining, aggregating, etc.

There is also another package called data.table which boasts fast aggregation of large amounts of data. You can learn more from this introductory article.

RAPIDS

Website: https://rapids.ai/

RAPIDS is a new project of NVIDIA to develop “open source software libraries [that] gives [users] the freedom to execute end-to-end data science and analytics pipelines entirely on GPUs (graphics processing units).” It is intended to integrate with other tools already developed by NVIDIA for machine learning (including deep learning). At the heart of this software is a dataframe, fashioned after Pandas’ DataFrame API. Integration with Spark is planned.

Links: