This lesson is in the early stages of development (Alpha version)

Outro: Big Data Analytics in Real-World Applications

Overview

Teaching: 5 min
Exercises: 0 min
Questions
  • How does big data processing look in the real world?

  • What other tools and frameworks are available for big data processing?

Objectives
  • Understand big data processing in real-world applications.

This lesson has taught you basic data analytics techniques that apply at the scales of both “small data” and “big data”. As we have seen through the hands-on activities, big data poses additional challenges in the process of converting data into knowledge and insight; this goal can be achieved with appropriate tools. In this outro, we will expose you to big data analytics in real-world settings.

Other Ways of Doing Big Data

So far, we have limited ourselves to one “V” of big data, namely Volume. Real-world big data problems also have to deal with the sheer speed at which data arrives (Velocity) and with the wide range of data types and sources (Variety).

Other Platforms for Big Data Analytics

R

R is an open-source programming environment dedicated to statistical analysis. R has many data analytics capabilities that overlap with pandas, but its unique strength lies in its wealth of statistics-related tools and libraries. Often, the choice between R and pandas is driven by which capabilities each tool provides and by the community you work in. Like pandas, R runs on a single computer, although there are add-on libraries that allow R to process data in a distributed manner.

If you are interested in trying R, we recommend starting with [RStudio](https://rstudio.com/products/rstudio/), which comes with a graphical user interface and can run directly on your laptop or workstation. When your analysis becomes too large to handle on your own computer, it is time to migrate your work to HPC or to the cloud.

Spark

Apache Spark is a scalable, parallel computing framework created to facilitate the processing of extremely large volumes and varieties of data. Spark can also be configured to ingest data in real time, making it capable of handling the rapid velocity of data. Spark makes it straightforward to process and analyze large amounts of data to produce insight. It is an open-source project that has seen wide adoption in industry. Compared to pandas, Spark has a steeper learning curve: its abstraction of the data makes it less convenient for new learners to “touch” and “see” the data. But Spark is scalable, because under the hood it uses parallel computing to handle data that is too big for a single computer. When your data processing needs exceed the capacity of pandas or R, it may be time to strongly consider Spark. Spark has interfaces to many languages (Java, Scala, Python, R).
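As a small illustration (not part of this lesson's exercises), the sketch below shows what a simple aggregation might look like in PySpark, Spark's Python interface. The file name measurements.csv and the column sensor are hypothetical placeholders, and the snippet assumes the pyspark package is installed:

```python
# A minimal PySpark sketch (illustrative only): read a CSV file into a
# distributed DataFrame and count rows per group, similar to a pandas groupby.
from pyspark.sql import SparkSession

# Start (or reuse) a local Spark session.
spark = SparkSession.builder.appName("outro-demo").getOrCreate()

# "measurements.csv" and the "sensor" column are hypothetical placeholders.
df = spark.read.csv("measurements.csv", header=True, inferSchema=True)

# Group by sensor and count the rows in each group, then print the result.
df.groupBy("sensor").count().show()

spark.stop()
```

Notice how similar the high-level operations are to pandas; the difference is that Spark can execute them in parallel across many machines.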

This is just a short list of open-source tools that are useful for working with big data. There are also many commercial, proprietary solutions offered by various companies, and more recently these solutions have become available in the cloud.

Big Data in the Cloud

“Cloud computing” is on the rise: Amazon, Google, and Microsoft all have their own lines of products that can be leveraged to process big data.

Big Data and Data Science: The Indispensable Human Dimension

In the first episode of this lesson, we briefly discussed the impact of data on our modern society. Big data analytics exists in the context of supporting the needs of society, business, government, health, education, and other types of services. Today, leaders and executives frequently rely on data analytics to provide a basis for the tough decisions they have to make. These decisions often touch on issues such as business profitability and sustainability, equality in education and healthcare, challenging scientific questions, and environmental concerns. From a cybersecurity standpoint, a business or institution needs to defend its operations, infrastructure, and data from ever-changing attacks in cyberspace.

Answering big questions requires more than the programming and computing skills needed to process large amounts of data. This “Big Data” lesson provides fundamental skills for working with large amounts of data. However, it is equally critical for us, the data analysts, to understand what our data means and what the data tells us. As discussed briefly in the section on data wrangling, judgment calls have to be made to address issues in the data such as missing values or outliers. These choices have ramifications for the results of the analysis and, consequently, for the decisions made based on those results. They are decisions that have to be made by us, human beings: only we have the necessary high-level understanding to make the right choices based on the information presented to us. We should therefore not treat big data analytics tools as a magical “black box” that renders the correct answer every time, nor should we blindly trust the results returned by computers.
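To make this concrete, here is a small, purely illustrative pandas example (the numbers are made up, not taken from the lesson's datasets) showing how two defensible choices for a missing value, and a judgment about an outlier, lead to different summary statistics:

```python
import pandas as pd
import numpy as np

# A toy measurement series with one missing value and one suspicious outlier.
speeds = pd.Series([10.2, 11.1, np.nan, 9.8, 250.0])

# Choice 1: drop the missing value entirely.
print(speeds.dropna().mean())                  # ~70.3

# Choice 2: impute the missing value with the median instead.
print(speeds.fillna(speeds.median()).mean())   # ~58.4

# Deciding whether 250.0 is a genuine observation or an error is a human
# judgment call; excluding it changes the picture yet again.
print(speeds[speeds < 100].mean())             # ~10.4
```

None of these choices is automatically “correct”; which one is appropriate depends on what the data represent, and that is a decision the analyst must make and document.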

Given the complexity of real-world data analytics problems, designing a reliable solution based on big data analytics often requires a team with a diverse set of expertise, creativity, curiosity, insight, and analytical ability. It is important for those of us who want to master the technical skills of big data not to lose sight of the big picture in which those skills play an important role.

Key Points

  • Other big data processing frameworks include R and Spark.

  • Pandas is a powerful data analysis library for ingesting and processing large amounts of data on a single computer.