Module 2: Dealing with Big Data
“Big data” is a term that generally describes data whose characteristics defy “traditional” data processing approaches. Have a table with 10 million rows and 500 columns? That is too big for spreadsheet software to handle. Have a total of 100 GB of log files coming from over 1000 different computers in a huge data center? These would take too long to analyze on a single computer using a system engineer’s basic tools (such as UNIX commands chained together). These are a few examples of “big data” challenges. In this lesson module, we introduce an efficient way of handling, processing, and analyzing large amounts of data using pandas. pandas is the de facto data analysis and manipulation tool for the Python programming language. The data handling skills introduced in this lesson form the foundation for the subsequent two lessons on machine learning and neural networks.
Please look at our Big Data lesson at the following link:
Workshop Resources (Workshop Series 2020-2021)
Presentation Slides
Presentation slides (Google sheets)
Jupyter Notebooks
(To download the notebook and the hands-on files, please right-click on the links below and select “Save Link As…” or a similar menu)
- Session 1: Fundamentals of Pandas - (html)
- Session 2: Analytics of Sherlock Data with Pandas - (html)
- Session 3: Data Wrangling and Visualization - (html)
Hands-on Files
- Sherlock hands-on files, except the large files (table of contents)
- Sherlock large data file: “sherlock_mystery_2apps.csv” (table of contents)
- Spam-ip based hands-on (legacy, optional) (table of contents)
The hands-on files are packed in ZIP format. The first two ZIP files above are mandatory. To reconstitute the hands-on directory, unzip all the files into the same destination directory, preserving the paths.
Video Recordings
Key points of the video:
- What is Pandas?
- Series and DataFrame
- Data types
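As a quick illustration of the key points above, here is a minimal sketch of a Series, a DataFrame, and their data types. The column names and values (`app`, `cpu_usage`, `num_threads`) are made up for illustration and are not taken from the workshop's hands-on files.

```python
import pandas as pd

# A Series is a one-dimensional labeled array of values.
# (Hypothetical values, for illustration only.)
cpu_usage = pd.Series([0.12, 0.58, 0.33], name="cpu_usage")

# A DataFrame is a two-dimensional table; each column is a Series.
df = pd.DataFrame({
    "app": ["Mail", "Browser", "Mail"],
    "cpu_usage": [0.12, 0.58, 0.33],
    "num_threads": [4, 12, 5],
})

# Every Series has a single data type (dtype).
print(cpu_usage.dtype)

# A DataFrame can hold a different dtype in each column.
print(df.dtypes)

# shape reports (number of rows, number of columns).
print(df.shape)
```

Selecting a single column, as in `df["cpu_usage"]`, gives back a Series, which is why the two objects are usually introduced together.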
Key points of the video:
- Operation on Data
- Filtering
- Sorting
- Chaining operations
- Grouping & Aggregation
- Column operations
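The operations listed above can be sketched on a small, made-up table. This is only an illustrative example under assumed column names (`app`, `cpu_usage`, `mem_kb`); the actual Sherlock data used in the sessions may be organized differently.

```python
import pandas as pd

# Hypothetical sample data, for illustration only.
df = pd.DataFrame({
    "app": ["Mail", "Browser", "Mail", "Browser", "Mail"],
    "cpu_usage": [0.12, 0.58, 0.33, 0.41, 0.05],
    "mem_kb": [2048, 8192, 3072, 6144, 1024],
})

# Column operation: derive a new column from an existing one.
df["mem_mb"] = df["mem_kb"] / 1024

# Filtering: keep only the rows that satisfy a Boolean condition.
busy = df[df["cpu_usage"] > 0.1]

# Sorting: order the rows by a column, largest value first.
by_cpu = busy.sort_values("cpu_usage", ascending=False)

# Chaining: filter, sort, and take the top rows in one expression.
top_two = (
    df[df["cpu_usage"] > 0.1]
    .sort_values("cpu_usage", ascending=False)
    .head(2)
)

# Grouping & aggregation: average CPU usage for each application.
mean_by_app = df.groupby("app")["cpu_usage"].mean()

print(top_two)
print(mean_by_app)
```

Each step returns a new object rather than modifying `df` in place (the column assignment is the exception), which is what makes chaining possible.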