Introduction
Visualization is an important tool when it comes to analyzing large amounts of data. Python provides an excellent visualization library named Matplotlib, which can be used to visualize data and analytics results from PySpark. To load this plotting package, run the following in your Python or PySpark session:
>>> from matplotlib import pyplot
The plotting functions are located in the pyplot submodule.
Bar chart visualization
For this example, we go back to the count of spam emails sent out in 1999, using the original 1999.ip_alg1 data file.
After data cleaning, aggregation, and sorting, we get the final result in the countries_top10 variable.
Here is the last part of the PySpark commands from the previous session on email analytics:
>>> countries_top10 = df_email_count.take(10)
>>> countries_top10
[Row(country=u'United States', count=256),
Row(country=u'China', count=37),
Row(country=u'Germany', count=17),
Row(country=u'Canada', count=13),
Row(country=u'Korea, Republic of', count=10),
Row(country=u'Japan', count=9),
Row(country=u'France', count=7),
Row(country=u'Colombia', count=7),
Row(country=u'Australia', count=6),
Row(country=u'United Kingdom', count=5)]
The Row datatype contains named fields that can be read using a dict-like indexing scheme.
Here is an example session that gets data out of the result above:
>>> top01 = countries_top10[0]
>>> top01
Row(country=u'United States', count=256)
>>> top01['country']
u'United States'
>>> top01['count']
256
Now we extract the list of countries and counts for visualization:
>>> countries = [ R['country'] for R in countries_top10 ]
>>> counts = [ R['count'] for R in countries_top10 ]
>>> countries
[u'United States',
u'China',
u'Germany',
u'Canada',
u'Korea, Republic of',
u'Japan',
u'France',
u'Colombia',
u'Australia',
u'United Kingdom']
The first two lines above use Python’s list comprehension to create new lists out of the existing list (countries_top10).
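The list comprehension pattern can be tried without a Spark session. The following is a minimal sketch that uses plain dictionaries as a hypothetical stand-in for Row objects (an assumption made only so the example runs anywhere), filled with the first three rows of the result above. It also shows the equivalent explicit loop for comparison.

```python
# Plain dicts stand in for pyspark Row objects here (an assumption for
# illustration); both support the record['field'] indexing used above.
rows = [
    {"country": "United States", "count": 256},
    {"country": "China", "count": 37},
    {"country": "Germany", "count": 17},
]

# List comprehension: evaluate one expression per element of the input list.
countries = [r["country"] for r in rows]

# Equivalent explicit loop, shown for comparison.
counts = []
for r in rows:
    counts.append(r["count"])

print(countries)
print(counts)
```

Both forms walk the input list once and keep the original order, so the i-th country still lines up with the i-th count.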
The following commands produce the plot:
# Import the visualization module, which is matplotlib.pyplot
>>> from matplotlib import pyplot
>>> fig = pyplot.figure()
>>> pyplot.bar(range(len(countries)), counts, tick_label=countries)
<BarContainer object of 10 artists>
# get the first plot (the only one)
>>> axes0 = fig.axes[0]
# set the xaxis label rotation to 25 degrees
>>> axes0.xaxis.set_tick_params(rotation=25)
>>> pyplot.savefig("country_top10_1999.png")
Here we use the pyplot.bar function (see its reference page) to plot the values as a vertical bar chart.
The pyplot.savefig command saves the plot to a PNG file, which you can transfer out of Turing to your own computer for viewing.
See the instructions on the ODU HPC manual page for transferring data between Turing and your computer.
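Here is the resulting picture:
[Figure: bar chart of the top 10 countries by spam email count in 1999, saved as country_top10_1999.png]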
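As an alternative, the same chart could be drawn as a standalone script using Matplotlib's object-oriented interface. This is only a sketch, not the method used in the session above: the Agg backend selection, the y-axis label, the tight_layout call, and the output file name are assumptions added for illustration. The data values are copied from the result shown earlier.

```python
from pathlib import Path

import matplotlib
matplotlib.use("Agg")  # assumption: render to files only, no display needed
from matplotlib import pyplot

# Values copied from the countries_top10 result shown earlier.
countries = ["United States", "China", "Germany", "Canada",
             "Korea, Republic of", "Japan", "France", "Colombia",
             "Australia", "United Kingdom"]
counts = [256, 37, 17, 13, 10, 9, 7, 7, 6, 5]

# Create the figure and its single axes in one call.
fig, ax = pyplot.subplots()
ax.bar(range(len(countries)), counts, tick_label=countries)

# Rotate the x tick labels by 25 degrees so long country names overlap less.
ax.tick_params(axis="x", labelrotation=25)
ax.set_ylabel("Number of spam emails")  # label text is an illustrative choice

# Adjust margins so the rotated labels are less likely to be cut off.
fig.tight_layout()

out_file = Path("country_top10_1999_oo.png")  # hypothetical file name
fig.savefig(out_file)
```

Holding the axes object directly avoids the fig.axes[0] lookup and makes it clear which plot each call modifies.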
Using X11 for interactive visualization
For interactive visualization, it is best to set up a connection with an X11 display or to use Remote Desktop. The links point to the instructions for setting this up. For Windows, the easiest option is to download and install MobaXterm, because it already includes an X11 server. For macOS, the XQuartz display server (or a similar X11 server) is needed.
With an X11 display, one can simply show the plot on the screen. Instead of calling the pyplot.savefig function, one can invoke pyplot.show() and the graph will be displayed on the screen.
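A script that has to run both with and without an X11 connection could choose between the two calls at run time. The sketch below is one possible approach, not part of the original lesson: the helper name show_or_save is hypothetical, and checking the DISPLAY environment variable is a heuristic assumed here for Linux/X11 sessions.

```python
import os

from matplotlib import pyplot


def show_or_save(fig, filename, environ=None):
    """Show the plot if an X11 display is advertised, else save it to a file.

    Hypothetical helper. The DISPLAY check is a heuristic for X11 sessions;
    it does not verify that the display is actually reachable.
    """
    env = os.environ if environ is None else environ
    if env.get("DISPLAY"):
        # Opens windows for all current figures and blocks until closed.
        pyplot.show()
        return "shown"
    # No display advertised: fall back to writing an image file.
    fig.savefig(filename)
    return "saved"
```

For example, after building a figure with fig = pyplot.figure() and pyplot.bar(...), calling show_or_save(fig, "plot.png") would display the chart in an X11 session and write plot.png otherwise.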
Plotting and Big Data
Care must be taken when plotting extremely large data sets. Spark is designed to work with data sizes beyond a single computer’s memory capacity, but Matplotlib is not. Memory use can explode if we simply pull data out of a Spark RDD or DataFrame without considering how big it is; one example is plotting 100 billion data points as a scatter plot. Possible approaches are to use an alternative form of plot, or to aggregate the data to reduce the number of data points before plotting.
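To illustrate the aggregation idea, the sketch below bins one million points into a 100 x 100 grid and plots the bin counts instead of the individual points. The data are synthetic random numbers generated only for this example, and NumPy stands in for the aggregation step, which for truly large data would have to be done in Spark before anything is brought back to the driver. The bin count and file name are arbitrary choices.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # assumption: render to a file, no display needed
from matplotlib import pyplot

# Synthetic data for illustration only (not from the spam email data set).
rng = np.random.default_rng(seed=0)
n_points = 1_000_000
x = rng.normal(size=n_points)
y = 0.5 * x + rng.normal(size=n_points)

# Aggregate: count the points falling in each cell of a 100 x 100 grid.
# Only 10,000 numbers are passed to the plot instead of one million points.
bin_counts, x_edges, y_edges = np.histogram2d(x, y, bins=100)

fig, ax = pyplot.subplots()
# histogram2d puts x along the first axis, while pcolormesh expects
# rows to correspond to y, hence the transpose.
mesh = ax.pcolormesh(x_edges, y_edges, bin_counts.T)
fig.colorbar(mesh, ax=ax, label="points per bin")
fig.savefig("density_binned.png")  # hypothetical file name
```

The plotted array has a fixed size set by the number of bins, so the memory needed for plotting no longer grows with the number of data points.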
Our exercises in this training have moderate amounts of data, so it may still be acceptable to dump the data from a Spark DataFrame to a NumPy or Python array for visualization purposes.