Introduction to Big Data Analytics and Pandas
Overview
Teaching: 0 min
Exercises: 0 min
Questions
What is big data?
What is big data analytics?
What are the uses of big data analytics in cybersecurity?
What is Pandas?
What are the appropriate use cases of Pandas?
Objectives
Identify the three V’s of big data: volume, velocity, and variety.
Understand the challenges and impacts of big data.
Briefly introduce pandas, the Python data analysis library.
What Is Big Data?
The term “Big Data” is quite popular today. Big Data is a field of computing concerned with ways to analyze, systematically extract information from, or otherwise deal with data sets that are too large or complex to be handled by traditional data-processing application software.
Beginning early in the 21st century, the amount and variety of data available for processing have grown exponentially due to several factors:
- The rise of the Internet, which facilitates the generation and collection of data;
- The proliferation of mobile computing devices and the “Internet of Things,” which collectively generate massive amounts of data;
- The rapid advancement of storage, processing, and networking hardware, which allows vast amounts of data to be stored, processed, and transmitted.
When Is “Big Data” Really Big?
The terms “too large” and “too complex” carry both an absolute and a relative sense. It is generally accepted today that a terabyte of data qualifies as “big data”. However, what counts as “too large” depends on the entity dealing with the data. For a company such as Google, a terabyte of data is far from “too large”; but for an average 10-employee company, that amount of data is already too big to handle. Over time, what was considered “big” is no longer so, due to advances in technology. For example, over 20 years ago a gigabyte of data would have been considered enormous, because most hard drives still held only tens or hundreds of megabytes. Today, however, a photographer can easily amass terabytes of data on their own hard drives.
For a long time, tables containing highly structured bits of information (text, numbers, and other types of information) have been the prevailing way of representing, storing, and processing data in computer systems. Two popular types of software have facilitated the widespread adoption of the tabular data format: spreadsheets and databases. Spreadsheet software is highly popular for personal computing, while most businesses use databases of some sort. Many researchers begin processing their data with spreadsheets because they are visual and the data manipulation is quite intuitive. However, when the number of rows or columns in a table is large (for example, tens of thousands or even millions of rows), processing with a spreadsheet becomes cumbersome, if not downright impossible.
The advent of new technologies gives rise to different varieties of data. Today, a vast wealth of unstructured data such as images, videos, sounds, and text is generated and stored routinely across various media and platforms. There is also semi-structured data, which includes log messages, XML data, and the like. These types of data are more complex to process with computers, but they are highly valuable and complement traditional structured data.
The Many V’s of Big Data
There are many ways of characterizing big data. The most common are the three V’s of Big Data:
- Volume: the sheer amount of data;
- Velocity: the speed at which data is generated and must be processed;
- Variety: the many forms data takes, from structured tables to free text, images, and video.
Challenges of Big Data
As mentioned earlier, big data poses enormous challenges to conventional ways of ingesting, processing, and analyzing data. What makes big data difficult is that it is often big in two or more of the V’s above. When we say it is “too big” compared to data in the previous era, we are not talking about something only 50% larger or twice as large, but potentially orders of magnitude larger: 100 or even 100,000 times more data than what people handled previously.
The sheer volume, variety, and speed of data often put tremendous stress on computing infrastructure:
- Demand for more storage to house the sheer volume of the data;
- Demand for more processing power to handle the massive amounts of, and/or rapidly incoming, data in a timely manner;
- Demand for more bandwidth so that data can be quickly transferred for processing, storage, and/or analysis;
- Data often comes from many different sources, or exists in different geographic locations and systems, and must therefore be integrated before it can be processed and analyzed together.
Processing and analyzing big data is a complex and intricate process. Massive and/or rapidly changing data is simply impractical for human beings to process and make sense of without appropriate tools. In practice, visualization and machine learning are routinely used to gain insight and value from big data. But beyond powerful technologies and techniques, human skills and insights remain indispensable to reaching the goals of big data analytics. We will revisit this subject at the end of our lesson.
Impact of Data
From the beginning of human civilization, data has played an important role in the decisions people make. Science is a prime example, where data is obtained from observations or deliberate experimentation. After collecting data, people seek the relationships, laws, and rules that explain the observations. Based on this knowledge, they can derive useful applications, or make decisions to improve, change, adapt, mitigate, and so on.
Businesses today employ data more than ever. For example, they constantly review whether, and by how much, a product is profitable based on inventory data, product sell rate, return rate, etc. But today, one would also want to mine insight from customer reviews and buying habits. Businesses also want to understand the demographics of the customers who would generate the most demand. For example: how many product lines to carry, and how to distribute and/or promote these products based on demand across a state or country? Which age group(s) of customers to target? Such detailed insights allow businesses not just to carry on the same old practices but to constantly improve and innovate based on the data they can gather. Therefore, data drives business decisions, and business decisions are constantly evaluated in light of further data. This cycle applies not only to for-profit companies. In education and health, for example, similar approaches have been adopted: decision makers want to know how to maximize the impact of their limited resources, and which population groups have outstanding needs or challenges that must be addressed first.
How can we turn the massive pile of data into valuable insight, which provides a basis for data-driven decisions? This is where we turn to big data analytics. Coupled with powerful machine learning methods, big data analytics provides a way to extract insight with greater precision than previously possible.
What Is Big Data Analytics?
According to the Oxford dictionary, analytics is “the systematic computational analysis of data or statistics”. Broadly speaking, data analytics is the process of extracting valuable insight from data. Analytics is “used for the discovery, interpretation, and communication of meaningful patterns in data. It also entails applying data patterns towards effective decision-making” (Wikipedia).
Big data analytics takes analytics to the next level: by leveraging the sheer amounts and varieties of data, we can obtain unprecedented accuracy and insight, as well as the ability to make reliable predictions about the future based on the available data. This is made possible by leveraging all kinds of data that are now generated and collected: not only structured data such as tables of numbers, but also unstructured (text, images, videos, sounds) and semi-structured data.
Data analytics is all-inclusive in its scope: It includes the logistics of data collection, storage and management, information extraction and analysis, as well as the presentation of the analysis results in a way that provides insight and basis for decision making. In other words, it includes data analysis, but it also encompasses the other aspects which are necessary for the data analysis to take place.
Big Data Analytics in Cybersecurity
The field of cybersecurity is also a data-intensive area. With the rise of computing technologies comes the risk of cyber attacks such as data breaches, denial of service, botnets, etc. The malice and sophistication of these attacks constantly evolve to stay ahead of detection and defense measures. These attacks are frequently motivated by financial gain and, to a lesser extent, political reasons. Their targets generally fall into two large groups: (1) gaining access to compute resources to perpetrate malicious acts (e.g. creating more botnets), or (2) gaining access to data, either to steal it (e.g. stealing credit card and social security numbers) or to destroy it (e.g. hacking by enemy states, encryption by ransomware). On the flip side, computing devices generate massive amounts of data: router logs, server logs, application logs. Some of these are captured and stored for a period of time; others must be analyzed on the fly. These massive piles of data are invaluable in cybersecurity because they help us diagnose problems and detect threats. Given the right approach, they can even help us prevent or mitigate threats as close as possible to the time of the events.
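To make the log-analysis idea concrete, here is a minimal sketch of the kind of processing this lesson works toward. All of the data is hypothetical (a handful of hand-written records standing in for real server logs), and the column names (`timestamp`, `src_ip`, `status`) and threshold are illustrative choices, not a standard log schema:

```python
import pandas as pd

# Hypothetical server-log records; in practice these would be parsed
# from log files containing millions of lines.
log = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2021-03-01 10:00:01", "2021-03-01 10:00:02",
        "2021-03-01 10:00:02", "2021-03-01 10:00:03",
        "2021-03-01 10:00:03", "2021-03-01 10:00:04",
    ]),
    "src_ip": ["10.0.0.5", "10.0.0.9", "10.0.0.5",
               "10.0.0.5", "10.0.0.7", "10.0.0.5"],
    "status": [200, 200, 401, 401, 200, 401],
})

# Count events per source IP; an unusually chatty host may merit a closer look.
counts = log.groupby("src_ip").size().sort_values(ascending=False)
print(counts)

# Flag IPs with repeated failed (401) requests -- a crude brute-force signal.
failed = log[log["status"] == 401].groupby("src_ip").size()
suspects = failed[failed >= 3].index.tolist()
print(suspects)  # ['10.0.0.5']
```

The same two-line groupby pattern scales from this toy table to millions of log records, which is precisely the regime where spreadsheet-style manual inspection breaks down.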
Here are some cases of big data uses in cybersecurity:
- “How AI, IoT, and big data will shape the future of cybersecurity”. In this interview, IBM Security Vice President Caleb Barlow mentioned that IBM’s datacenter receives and processes 30-40 billion logged events per day.
- “Challenges to Cyber Security & How Big Data Analytics Can Help”. Big data can help cybersecurity by identifying anomalies in device behavior; identifying anomalies in employee and contractor behavior; detecting anomalies in the network; and assessing network vulnerabilities and risks.
In summary, the rise of big data, the Internet, and mobile computing has led to a data-driven business, a data-driven economy, and, on an even greater scale, a data-driven society. Certainly, data-driven decision making was not invented in the 21st century; that wisdom is as old as humanity itself. But the explosion in computing technologies and the ability to leverage vast amounts of data have revolutionized many aspects of society, including cybersecurity. However, converting data into insights and decisions is a long process that involves many moving parts: the collection of data, validation, storage, management, transmission, computation, analysis, presentation (including visualization), and interpretation. On the human side, it involves policy, the different roles of the people who work with the data, how they work together, and so on. Then there is the technology side: the nuts and bolts that enable big data computing and analysis.
Pandas: Big Data Analytics in Python
This lesson is aimed at introducing practical tools that are needed to process and analyze data to produce results that can then be used to drive actions and decisions. We assume that data has been collected, then transmitted to the appropriate place for computation and analysis.
Pandas (properly styled pandas) is an open-source software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables, time series, text data, and other kinds of data (Wikipedia). pandas enables automated processing of large amounts of data. In this lesson we focus on pandas because it is relatively easy for newcomers to learn, it is capable of handling large amounts of data (given the right hardware), and it is widely deployed in academia and industry. pandas also integrates well with visualization and machine learning tools, which are indispensable when working with big data.
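As a first taste, the sketch below shows pandas’ central data structure, the DataFrame, holding a small table. The data and column names (`host`, `bytes_sent`) are invented for illustration; the operations shown are the ordinary pandas API:

```python
import pandas as pd

# A small table of hypothetical network-traffic measurements.
df = pd.DataFrame({
    "host": ["alpha", "beta", "alpha", "gamma"],
    "bytes_sent": [1200, 3400, 560, 7800],
})

# Tabular operations that would be tedious in a very large spreadsheet:
print(df["bytes_sent"].sum())                  # total traffic: 12960
print(df.groupby("host")["bytes_sent"].sum())  # per-host totals
```

Each of these operations runs unchanged whether the table has four rows or four million, which is what makes pandas suitable for the data sizes this lesson is concerned with.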
In this lesson module, we will focus on structured tabular data, the primary type of data pandas was designed for. (Since pandas is used in conjunction with the Python programming language, it is flexible enough to also store and process unstructured and semi-structured data by leveraging the capabilities of other Python libraries.) At the end of this lesson, we will review additional tools and frameworks for truly big data analytics.
Key Points
Big data refers to data sets that are too large or complex for traditional data-processing tools, as well as the methodologies for tackling such data sets.
Pandas is a powerful data framework for ingesting and processing large amounts of data.