DeapSECURE module 2: Dealing with Big Data: Dataset: SherLock (Android Smartphone Security)

In this lesson, we are using the SherLock Android smartphone dataset produced by security researchers from Ben-Gurion University. This page contains information about this dataset, as well as modifications and reductions done by the DeapSECURE team for the purpose of teaching big data and machine learning topics.

Reference Paper

Yisroel Mirsky, Asaf Shabtai, Lior Rokach, Bracha Shapira, and Yuval Elovici, “SherLock vs Moriarty: A Smartphone Dataset for Cybersecurity Research”, 9th ACM Workshop on Artificial Intelligence and Security (AISec) with the 23rd ACM Conference on Computer and Communications Security (CCS), 2016.

Reference Sources on the Web

Original project website: http://bigdata.ise.bgu.ac.il/sherlock/index.html#/ . (The website has been down since about 2021. Instructors and learners may be able to review the text of the website on the Internet Archive, retrieved 2019-08-29.)

Alternative website to learn about the SherLock dataset: https://www.kaggle.com/BGU-CSRC/sherlock .

A detailed description of all the tables in the SherLock dataset is available online (PDF).

SherLock’s Sample Dataset

The full SherLock dataset is available only to researchers who sign a user agreement license with the dataset authors, which restricts the disclosure and dissemination of this data. However, the authors made a small sample dataset freely available to anyone who wants to experiment with it. The hands-on activities in the DeapSECURE lessons (Big Data, Machine Learning, and Neural Networks) use only SherLock’s sample dataset. The sample was further modified and reduced, as described below, to produce the greatly simplified datasets that learners use in this lesson module.

The original sample SherLock dataset can be downloaded from the authors’ Kaggle page:

https://www.kaggle.com/datasets/BGU-CSRC/sherlock/data

The Applications.csv file is now hosted by DeapSECURE in xz-compressed format here:

https://drive.google.com/file/d/1GfLOohkjIMQeOxQb8fIcTRZlHT46_f7M/view

(Be advised that this file is large even in compressed form [122 MB], and it grows to nearly 4.6 GB when decompressed!)
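Because the decompressed file is so large, it can be convenient to read it directly from the xz archive as a stream instead of unpacking it first. The following is a minimal sketch using only the Python standard library; the function name count_records_xz is made up for illustration, and the sketch assumes the file has a single header line.

```python
import csv
import lzma


def count_records_xz(path):
    """Stream an xz-compressed CSV and count its data rows (header excluded)."""
    # lzma.open in text mode decompresses on the fly, so the uncompressed
    # file never has to exist on disk or fit in memory.
    with lzma.open(path, mode="rt", newline="") as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the single header row (assumed)
        return sum(1 for _ in reader)
```

For example, count_records_xz("Applications.csv.xz") would scan the whole archive, where the file name is an assumption about how the download was saved. Learners who prefer pandas can get a similar streaming effect by passing compression="xz" and a chunksize to pandas.read_csv.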

This sample dataset is also hosted locally for ODU’s HPC users on the Wahab supercomputer at this path:

/scratch/Workshops/DeapSECURE/datasets/SherLock/User-97bb95f55a-20160430--20160518

Other learners can download the data from Kaggle and Google Drive. Be advised that some of the data files are too large to handle with an ordinary spreadsheet program or text editor!

The authors used to host the dataset on Google Drive: https://drive.google.com/drive/folders/0B_A1qX1kf7R9a282dHU4bWpqM1E, but this link no longer works.

Some Statistics

The sample SherLock dataset contains the following tables, listed with their file sizes and numbers of records:

| File Name | File Size (Bytes) | Number of Records | Brief Description |
|---|---|---|---|
| AllBroadcasts.csv | 22,178,362 | 173,651 | “Broadcast” from Android OS to the apps, such as password change, network change, button presses, change in power state, etc. |
| Applications.csv | 4,571,297,916 | 14,801,899 | Records of resource usage (CPU, memory, threads, network and VM statistics) for each application, taken every 5 seconds |
| AppPackages.csv | 126,177 | 299 | Status update of apps: install, removal, upgrade, etc. |
| Bluetooth.csv | 649,965 | 4,399 | Information on visible (scanned) Bluetooth devices |
| Calls.csv | 159,624 | 1,749 | Log of phone calls: source/destination phone number, call time & duration |
| Moriarty.csv | 18,096 | 187 | “Hints” left behind by the malicious Moriarty app |
| ScreenOn.csv | 106,879 | 1,930 | Records of screen-on and screen-off events |
| SherLock Volunteer Survey.csv | 7,986 | 111 | Survey data collected from SherLock experiment volunteers |
| SMS.csv | 57,061 | 606 | Records of SMS messages sent and received: sender/receiver phone number, timestamp |
| T0.csv | 29,174 | 43 | Hardware and system information |
| T1.csv | 7,273,825 | 24,763 | Location, connected cell tower, device status |
| T2.csv | 278,306,490 | 78,647 | Hardware sensor data (accelerometer, gyroscope, barometer, etc.) |
| T3.csv | 64,745,896 | 144,823 | Audio and display device (LCD) information |
| T4.csv | 106,760,194 | 156,018 | Snapshots of system-wide resource usage (CPU, memory, network, battery, etc.) |
| UserPresent.csv | 23,326 | 468 | Timestamps when the user begins interacting with the device |
| Wifi.csv | 10,410,473 | 110,478 | Information on visible (scanned) Wi-Fi access points |

(The number of records is usually one less than the number of text lines because of the CSV header, except for the volunteer survey, which has a three-line header.)
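The relation between text lines and records can be checked with a short script. This is a sketch only: the helper name count_records is hypothetical, and it assumes every record occupies exactly one text line (no quoted fields containing line breaks), which has not been verified for every table.

```python
def count_records(path, header_lines=1):
    """Return the number of data records: text lines minus header lines."""
    # Reading in binary mode avoids decoding the file; the file object
    # yields one item per newline-terminated line.
    with open(path, "rb") as f:
        n_lines = sum(1 for _ in f)
    return max(n_lines - header_lines, 0)
```

Under these assumptions, AllBroadcasts.csv with its 173,651 records would have 173,652 text lines, and the volunteer survey would be counted with count_records("SherLock Volunteer Survey.csv", header_lines=3).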

Applications.csv Table

Most, if not all, of the hands-on activities in this lesson module use Applications.csv.

Data in Applications.csv were collected by the SherLock agent from the participants’ smartphones. The SherLock agent made frequent, periodic probes (“polls”) of the statistics of the apps running on the phone at a high time resolution, with adjacent probes about 5 seconds apart. Such frequent probes lead to a detailed view of the history of the apps running on these phones.
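One way to check the polling interval is to look at the gaps between consecutive distinct probe times. The sketch below uses synthetic values, not SherLock data, and it assumes the timestamps are integers in milliseconds; the actual timestamp column name and unit should be taken from the table specification.

```python
from statistics import median


def median_poll_interval(timestamps_ms):
    """Median gap, in seconds, between consecutive distinct poll times."""
    # Several app records share one poll time, so collapse duplicates first.
    times = sorted(set(timestamps_ms))
    gaps = [(b - a) / 1000.0 for a, b in zip(times, times[1:])]
    return median(gaps)


# Synthetic example: polls roughly 5 s apart, two app records per poll.
sample = [0, 0, 5000, 5000, 10010, 10010, 14990, 14990]
print(median_poll_interval(sample))
```

The median is used rather than the mean so that a few long gaps (for example, when the phone was switched off) do not dominate the estimate.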

The full Applications.csv table has a total of 57 features. Detailed specifications of all the tables in the entire SherLock dataset are available online in PDF format. The features of this table were collected from various sources: the Linux kernel (through Linux procfs), the Android API or operating system, and SherLock’s custom measurements. All of these data were obtainable without rooting the phone. As can be seen in the specification document, these parameters provide very intimate, machine-level information about the apps running on the smartphone.

SherLock’s Reduced Datasets

The DeapSECURE team used only Applications.csv to create the educational datasets that are used throughout several DeapSECURE lessons. Only a subset of the columns of the original table was selected for these datasets.
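A reduction of this kind (keeping a few columns and a few apps) can be sketched with pandas as follows. This is an illustration, not the DeapSECURE team's actual procedure: the function reduce_table and the column names app_name, cpu_usage, num_threads, and extra are made up, and the demo table is synthetic.

```python
import io

import pandas as pd


def reduce_table(source, columns, app_column, apps, chunksize=100_000):
    """Keep only the selected columns and apps, reading the CSV in chunks."""
    usecols = list(dict.fromkeys([app_column, *columns]))  # dedupe, keep order
    parts = []
    # Chunked reading keeps memory use bounded for a multi-gigabyte file.
    for chunk in pd.read_csv(source, usecols=usecols, chunksize=chunksize):
        parts.append(chunk[chunk[app_column].isin(apps)])
    return pd.concat(parts, ignore_index=True)


# Tiny synthetic stand-in for Applications.csv (column names are hypothetical).
demo = io.StringIO(
    "app_name,cpu_usage,num_threads,extra\n"
    "Facebook,0.5,40,x\n"
    "Maps,0.1,12,y\n"
    "WhatsApp,0.2,25,z\n"
)
reduced = reduce_table(demo, ["cpu_usage", "num_threads"], "app_name",
                       ["Facebook", "WhatsApp"])
print(reduced)
```

Filtering each chunk before concatenating means only the rows that survive the selection are ever held in memory together.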

sherlock_mystery_2apps.csv

This dataset contains only 9 real features and two apps: Facebook and WhatsApp. We “fluffed” the dataset with common issues (noise, missing data, duplicate features) in order to teach important lessons on data cleaning. This dataset is used in the Big Data and Machine Learning modules and in the early part of the Neural Networks module.
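Two of the issues mentioned above, duplicate features and missing data, can be handled generically in pandas along the following lines. This is a sketch on a synthetic table with made-up column names; it does not reproduce the specific defects planted in sherlock_mystery_2apps.csv, and it does not address noise.

```python
import pandas as pd


def basic_clean(df):
    """Drop columns that duplicate an earlier column, then rows with missing values."""
    # After transposing, duplicated() flags columns whose values repeat
    # those of an earlier column.
    deduped = df.loc[:, ~df.T.duplicated()]
    return deduped.dropna().reset_index(drop=True)


# Synthetic example (hypothetical column names).
demo = pd.DataFrame({
    "cpu_usage": [0.5, 0.2, float("nan"), 0.1],
    "num_threads": [40, 25, 30, 12],
    "num_threads_copy": [40, 25, 30, 12],  # duplicate feature
})
cleaned = basic_clean(demo)
print(cleaned)
```

Dropping rows is only one way to treat missing values; whether to drop or to fill them in depends on the analysis, so this choice is an assumption of the sketch.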

sherlock_18apps.csv

This dataset contains 18 real features and 18 apps. It is primarily used in the Neural Networks module to compare neural network models against other machine learning approaches.

SherLock Data Throughout the DeapSECURE Lessons

Big Data lesson, episode 2, “Big Data Challenge: Detecting Malicious Activities on Smartphones”: The SherLock dataset is first described at a high level, including the motivation and experimental setup.

Big Data lesson, episode 4, “Analytics of Sherlock Data with Pandas”: A small snippet of sherlock_mystery_2apps.csv is used, and its features are described. Be aware that, for pedagogical purposes, this table is not 100% clean.

Neural Networks lesson, episode 5, “Classifying Smartphone Apps with Keras”: The sherlock_18apps.csv dataset is first introduced, and its features are explained.