DeapSECURE module 2: Dealing with Big Data: Visualization (1)

Introduction

Visualization is an important tool for analyzing large amounts of data. Python provides an excellent visualization library named Matplotlib, which can be used to visualize data and analytics results from PySpark. To load this plotting package, invoke the following in your Python or PySpark session:

>>> from matplotlib import pyplot

The plotting functions are located in the pyplot submodule.

Bar chart visualization

For this purpose, we return to the count of spam emails sent out in 1999, computed from the original 1999.ip_alg1 data file. After data cleaning, aggregation, and sorting, the final result is stored in the countries_top10 variable. Here is the last part of the PySpark work from the previous session on email analytics:

>>> countries_top10 = df_email_count.take(10)

>>> countries_top10
[Row(country=u'United States', count=256),
 Row(country=u'China', count=37),
 Row(country=u'Germany', count=17),
 Row(country=u'Canada', count=13),
 Row(country=u'Korea, Republic of', count=10),
 Row(country=u'Japan', count=9),
 Row(country=u'France', count=7),
 Row(country=u'Colombia', count=7),
 Row(country=u'Australia', count=6),
 Row(country=u'United Kingdom', count=5)]

A Row object holds named fields, which can be read using dict-like indexing. Here is an example session that extracts data from the result above:

>>> top01 = countries_top10[0]

>>> top01
Row(country=u'United States', count=256)

>>> top01['country']
u'United States'

>>> top01['count']
256

Now we extract the list of countries and counts for visualization:

>>> countries = [ R['country'] for R in countries_top10 ]

>>> counts = [ R['count'] for R in countries_top10 ]

>>> countries
[u'United States',
 u'China',
 u'Germany',
 u'Canada',
 u'Korea, Republic of',
 u'Japan',
 u'France',
 u'Colombia',
 u'Australia',
 u'United Kingdom']

The first two lines above use Python’s list comprehension to create a new list from an existing list (countries_top10).
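A list comprehension is equivalent to an explicit loop that appends one element at a time. The sketch below shows both forms side by side. Plain dictionaries stand in for PySpark Row objects (an assumption made so the example runs without Spark), and the three sample records are copied from the output above.

```python
# Plain dicts stand in for PySpark Row objects (assumption for illustration);
# both support R['country'] style indexing.
rows = [
    {"country": "United States", "count": 256},
    {"country": "China", "count": 37},
    {"country": "Germany", "count": 17},
]

# Explicit loop: build the list one element at a time.
countries_loop = []
for R in rows:
    countries_loop.append(R["country"])

# List comprehension: the same result in a single expression.
countries = [R["country"] for R in rows]
counts = [R["count"] for R in rows]

print(countries)
print(counts)
```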

The following commands produce the plot:

# Import the visualization module, which is matplotlib.pyplot
>>> from matplotlib import pyplot

>>> fig = pyplot.figure()
>>> pyplot.bar(range(len(countries)), counts, tick_label=countries)
<BarContainer object of 10 artists>

# get the first plot (the only one)
>>> axes0 = fig.axes[0]
# set the xaxis label rotation to 25 degrees
>>> axes0.xaxis.set_tick_params(rotation=25)
>>> pyplot.savefig("country_top10_1999.png")

Here we use the pyplot.bar function to plot the values as a vertical bar chart. The pyplot.savefig command saves the plot to a PNG file, which you can transfer from Turing to your own computer for viewing. See the ODU HPC manual page for instructions on transferring data between Turing and your computer.
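On a cluster node without a display, it can help to select a non-interactive backend before importing pyplot. The following is a minimal sketch of the same kind of bar chart, written as a standalone script with Matplotlib's object-oriented interface. The Agg backend choice, the y-axis label, the shortened three-country data (taken from the output above), and the file name country_top3_sketch.png are assumptions for illustration, not part of the original session.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend: renders to files, needs no display
from matplotlib import pyplot

# Shortened sample data (first three entries of the top-10 result).
countries = ["United States", "China", "Germany"]
counts = [256, 37, 17]

# Create the figure and its single set of axes in one call.
fig, ax = pyplot.subplots()
ax.bar(range(len(countries)), counts, tick_label=countries)

# Rotate the x tick labels so long country names do not overlap.
ax.tick_params(axis="x", labelrotation=25)
ax.set_ylabel("Number of spam emails")  # label text is an assumption

# Adjust margins so the rotated labels fit, then write the PNG file.
fig.tight_layout()
fig.savefig("country_top3_sketch.png")  # hypothetical file name
pyplot.close(fig)
```

Working through fig and ax directly avoids having to look up fig.axes[0] after the fact, which may be convenient once a figure holds more than one plot.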

Here is the resulting picture:

Figure: Bar chart of the top 10 spamming countries in 1999

Using X11 for interactive visualization

For visualization, it is best to set up a connection with an X11 display or to use Remote Desktop. The links point to instructions for setting these up. On Windows, it is easiest to download and install MobaXterm, because it already includes an X11 server. On macOS, the XQuartz display server (or a similar X11 server) is needed.

With an X11 display, one can simply show the plot on the screen: instead of calling the pyplot.savefig function, invoke pyplot.show() and the graph will be displayed in a window.
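A script that may run both with and without X11 can decide at run time which of the two calls to use. The sketch below is one rough heuristic, assuming a Linux session where X11 forwarding sets the DISPLAY environment variable. The helper names can_show_plots and output_mode are hypothetical and not part of Matplotlib.

```python
import os

def can_show_plots(environ=os.environ):
    """Return True if an X11 display appears to be available (DISPLAY is set)."""
    return bool(environ.get("DISPLAY"))

def output_mode(environ=os.environ):
    """Pick 'show' when a display seems available, otherwise 'savefig'."""
    return "show" if can_show_plots(environ) else "savefig"

# Intended use inside a plotting script (not executed here):
#   if output_mode() == "show":
#       pyplot.show()
#   else:
#       pyplot.savefig("country_top10_1999.png")
print(output_mode())
```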

Plotting and Big Data

Care must be taken when plotting extremely large data sets. Spark is designed to work with data sizes beyond a single computer’s memory capacity, but Matplotlib is not. Memory use can explode if we simply pull data out of a Spark RDD or DataFrame without considering how big the data is; for example, plotting 100 billion data points as a scatter plot. One possible approach is to use an alternative form of plot, or to apply some sort of aggregation to reduce the number of data points before plotting.
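As one illustration of aggregating before plotting, the sketch below bins a stream of (x, y) points into a fixed grid of counts, so only the grid is kept in memory, never the full list of points. It uses plain Python rather than Spark. The function names bin_points and random_points, the synthetic random data, and the 20 x 20 grid size are all assumptions chosen for the example.

```python
import random
from collections import Counter

def bin_points(points, x_range, y_range, nbins):
    """Count points per cell of an nbins x nbins grid, one point at a time.

    Assumes every point lies inside x_range and y_range.
    """
    (xmin, xmax), (ymin, ymax) = x_range, y_range
    grid = Counter()
    for x, y in points:
        # Map each coordinate to a bin index; clamp the upper edge into the last bin.
        i = min(int((x - xmin) / (xmax - xmin) * nbins), nbins - 1)
        j = min(int((y - ymin) / (ymax - ymin) * nbins), nbins - 1)
        grid[(i, j)] += 1
    return grid

def random_points(n, seed=0):
    """Hypothetical data source: lazily yields n points in the unit square."""
    rng = random.Random(seed)
    for _ in range(n):
        yield rng.random(), rng.random()

n_points = 100_000
grid = bin_points(random_points(n_points), (0.0, 1.0), (0.0, 1.0), nbins=20)
print(len(grid), "cells summarize", sum(grid.values()), "points")
```

The grid has at most nbins x nbins entries regardless of how many points are read, so it could be drawn as a heat map (for example with pyplot.imshow) instead of a scatter plot. In Spark, a comparable reduction would typically be done with a groupBy and count on the DataFrame before collecting the much smaller result.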

Our exercises in this training involve moderate amounts of data, so it is still acceptable to dump the data from a Spark DataFrame into a NumPy or Python array for visualization purposes.