On DataFrame
The DataFrame approach we learned in this lesson is very popular in the analytics world. In this section we mention a number of other frameworks similar to the Spark DataFrame. The purpose is to show you that the approach you learned in this lesson is transferable to other tools and languages, because the underlying concepts are universal.
Other Frameworks Using DataFrame-like Approaches
Pandas
Website: https://pandas.pydata.org/
Pandas is a data analysis library for Python.
It builds on NumPy and works well with other well-established Python libraries such as SciPy and Matplotlib.
DataFrame (spelled with an uppercase F) is Pandas’ primary representation of structured data.
Many operations such as selecting, sorting, filtering, joining, and aggregating are supported.
The difference from Spark is that Pandas keeps its data in the memory of a single machine, so the amount of data it can handle is limited by the amount of memory that computer has.
Nevertheless, Pandas is a very popular framework with many data scientists.
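To make the operations listed above concrete, here is a minimal sketch in Pandas. The two tables, their column names (name, dept_id, salary, dept_name), and all values are hypothetical and invented purely for illustration; only the Pandas calls themselves are real.

```python
import pandas as pd

# Hypothetical sample data, invented for illustration only.
employees = pd.DataFrame({
    "name": ["Ann", "Bob", "Cid", "Dee"],
    "dept_id": [1, 2, 1, 2],
    "salary": [50000, 42000, 61000, 58000],
})
departments = pd.DataFrame({
    "dept_id": [1, 2],
    "dept_name": ["Engineering", "Sales"],
})

# Select a subset of columns.
selected = employees[["name", "salary"]]

# Filter rows with a boolean mask.
high_paid = employees[employees["salary"] > 45000]

# Sort by a column, highest salary first.
by_salary = employees.sort_values("salary", ascending=False)

# Join the two tables on their shared key column.
joined = employees.merge(departments, on="dept_id", how="inner")

# Aggregate: mean salary per department.
mean_salary = joined.groupby("dept_name")["salary"].mean()

print(mean_salary)
```

Each step returns a new object rather than modifying the original DataFrame, which is broadly the same style of working you have seen with Spark DataFrames.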
R
Website: data.frame reference
R has built-in support for dataframes.
The main function to construct a dataframe is aptly called data.frame.
A very popular package called dplyr provides the %>% pipeline notation for convenient manipulation of dataframes: selecting, sorting, filtering, joining, aggregating, etc.
There is also another package called data.table, which boasts fast aggregation of large amounts of data.
You can learn more from this introductory article.
RAPIDS
Website: https://rapids.ai/
RAPIDS is a new project from NVIDIA to develop “open source software libraries [that] gives [users] the freedom to execute end-to-end data science and analytics pipelines entirely on GPUs (graphical processing units).” It is intended to integrate with other tools already developed by NVIDIA for machine learning (including deep learning). At the heart of this software is a dataframe, fashioned after Pandas’ DataFrame API. Integration with Spark is planned.
Links: