In this lesson, we are using the SherLock Android smartphone dataset produced by security researchers from Ben-Gurion University. This page contains information about this dataset, as well as modifications and reductions done by the DeapSECURE team for the purpose of teaching big data and machine learning topics.
Reference Paper
Yisroel Mirsky, Asaf Shabtai, Lior Rokach, Bracha Shapira, and Yuval Elovici, “SherLock vs Moriarty: A Smartphone Dataset for Cybersecurity Research”, 9th ACM Workshop on Artificial Intelligence and Security (AISec) with the 23rd ACM Conference on Computer and Communications Security (CCS), 2016.
- Published version: https://dl.acm.org/doi/10.1145/2996758.2996764
- DOI: 10.1145/2996758.2996764
- Unofficial copies are available on Cyber@BGU or on author’s ResearchGate profile.
Reference Sources on the Web
Original project website: http://bigdata.ise.bgu.ac.il/sherlock/index.html#/ . (The website has been down since 2021 or so. Instructors and learners may be able to review texts from the website on the Internet Archive, retrieved 2019-08-29.)
Alternative website to learn about the SherLock dataset: https://www.kaggle.com/BGU-CSRC/sherlock .
A detailed description of all the tables in the SherLock dataset is available online (PDF).
SherLock’s Sample Dataset
The full SherLock dataset is available only to researchers who sign a user agreement license with the dataset authors, which restricts the disclosure and dissemination of this data. However, they made a small sample dataset freely available for anyone interested in this dataset to experiment with. The hands-on activities in DeapSECURE lessons (Big Data, Machine Learning, and Neural Networks) are made only with SherLock’s sample dataset. Further modification and reduction of the dataset are performed, as described below, to produce greatly simplified datasets that are used by learners in this lesson module.
The original sample SherLock dataset can be downloaded from the authors’ Kaggle page:
https://www.kaggle.com/datasets/BGU-CSRC/sherlock/data
The Applications.csv file is now hosted by DeapSECURE in xz-compressed format here:
https://drive.google.com/file/d/1GfLOohkjIMQeOxQb8fIcTRZlHT46_f7M/view
(Be advised that this is a very large file in compressed format [122 MB], and it grows to nearly 4.6 GB when not compressed!)
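Because the compressed file expands to several gigabytes, it can be inspected without unpacking it to disk first. The following is a minimal sketch using only the Python standard library; the function name `peek_xz_csv`, the demo column names, and the UTF-8 encoding are assumptions made for illustration, not details taken from the dataset documentation.

```python
import csv
import io
import itertools
import lzma


def peek_xz_csv(xz_source, n_rows=5):
    """Return the header and the first n_rows records of an xz-compressed CSV.

    xz_source may be a file path or a binary file object.
    """
    # Text mode ("rt") decompresses incrementally as the file is read,
    # so the whole file never has to be unpacked at once.
    # Assumption: the CSV is UTF-8 encoded and starts with one header line.
    with lzma.open(xz_source, mode="rt", encoding="utf-8", newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader, None)
        rows = list(itertools.islice(reader, n_rows))
    return header, rows


# Tiny in-memory stand-in; the column names are made up for this demo.
demo_xz = lzma.compress(b"col_a,col_b\n1,2\n3,4\n5,6\n")
demo_header, demo_rows = peek_xz_csv(io.BytesIO(demo_xz), n_rows=2)
print(demo_header, demo_rows)
```

For the real file, the same function can be called with the path of the downloaded `.xz` file in place of the in-memory buffer.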
This sample dataset is also hosted locally for ODU’s HPC users on the Wahab supercomputer at this path:
/scratch/Workshops/DeapSECURE/datasets/SherLock/User-97bb95f55a-20160430--20160518
Other learners can download the files from Kaggle and Google Drive. Be advised that some of the data files are too large to handle with an ordinary spreadsheet program or text editor!
The authors used to host the dataset on Google Drive: https://drive.google.com/drive/folders/0B_A1qX1kf7R9a282dHU4bWpqM1E, but this link no longer works.
Some Statistics
The sample SherLock dataset contains the following tables, listed with their file sizes and numbers of records:
File Name | File Size (Bytes) | Number of records | Brief Description |
---|---|---|---|
AllBroadcasts.csv | 22,178,362 | 173,651 | “Broadcast” from Android OS to the apps, such as password change, network change, button presses, change in power state, etc. |
Applications.csv | 4,571,297,916 | 14,801,899 | Records of resource usage (CPU, memory, threads, network and VM statistics) for each application, taken every 5 seconds |
AppPackages.csv | 126,177 | 299 | Status update of apps: install, removal, upgrade, etc. |
Bluetooth.csv | 649,965 | 4,399 | Information on visible (scanned) Bluetooth devices |
Calls.csv | 159,624 | 1,749 | Log of phone calls: source/destination phone number, call time & duration |
Moriarty.csv | 18,096 | 187 | “Hints” left behind by the malicious Moriarty app |
ScreenOn.csv | 106,879 | 1,930 | Records of screen-on and screen-off events |
SherLock Volunteer Survey.csv | 7,986 | 111 | Survey data collected from SherLock experiment volunteers |
SMS.csv | 57,061 | 606 | Records of SMS messages sent and received: sender/receiver phone number, timestamp |
T0.csv | 29,174 | 43 | Hardware and system information |
T1.csv | 7,273,825 | 24,763 | Location, connected cell tower, device status |
T2.csv | 278,306,490 | 78,647 | Hardware sensor data (accelerometer, gyroscope, barometer, etc.) |
T3.csv | 64,745,896 | 144,823 | Audio and display device (LCD) information |
T4.csv | 106,760,194 | 156,018 | Snapshots of system-wide resource usage (CPU, memory, network, battery, etc.) |
UserPresent.csv | 23,326 | 468 | Timestamps when the user begins interacting with the device |
Wifi.csv | 10,410,473 | 110,478 | Information on visible (scanned) Wi-Fi access points |
(The number of records is usually one less than the number of text lines because of the CSV header, except for the volunteer survey, which has a three-line header.)
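The relation between text lines and records can be checked with a few lines of Python. This is a sketch under the assumption that every record occupies exactly one text line (a quoted field containing a line break would violate it); the helper name `count_records` and the demo contents are made up for illustration.

```python
import io


def count_records(text_stream, header_lines=1):
    """Count records as (number of text lines) minus (number of header lines)."""
    n_lines = sum(1 for _ in text_stream)
    return max(n_lines - header_lines, 0)


# Made-up stand-ins, not real SherLock data.
one_line_header = io.StringIO("name,value\na,1\nb,2\n")
three_line_header = io.StringIO("title\nnotes\nname,value\na,1\nb,2\n")

demo_regular = count_records(one_line_header)
demo_survey_like = count_records(three_line_header, header_lines=3)
print(demo_regular, demo_survey_like)
```

Any open text file can be passed in place of the `StringIO` objects, since the function only iterates over lines and never holds the whole file in memory.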
Applications.csv Table
The bulk (if not all) of hands-on activities in this lesson module use Applications.csv.
Data in Applications.csv were collected by the SherLock agent
from the participants’ smartphones.
The SherLock agent made frequent, periodic probes (“polls”) of
the statistics of the apps running on the phone
at a high resolution, about 5 seconds apart between adjacent probes.
Such frequent probes lead to a detailed view of the history
of the apps running on these phones.
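To make the idea of probe spacing concrete, the gaps between adjacent probes can be computed from their timestamps. The sketch below uses made-up timestamps in milliseconds and a hypothetical helper named `probe_intervals`; it assumes that all apps recorded in one poll share a timestamp, so repeated values are collapsed first. None of these details are taken from the dataset specification.

```python
from statistics import median


def probe_intervals(timestamps_ms):
    """Return the gaps (in ms) between adjacent distinct probe timestamps."""
    # Assumption: one poll yields many rows with the same timestamp,
    # so duplicates are removed before measuring the spacing.
    ordered = sorted(set(timestamps_ms))
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


# Made-up timestamps in milliseconds, NOT taken from the dataset.
demo_timestamps = [0, 0, 5100, 5100, 10000, 15200, 20100]
demo_gaps = probe_intervals(demo_timestamps)
demo_median_gap = median(demo_gaps)
print(demo_gaps, demo_median_gap)
```

The median is used rather than the mean so that an occasional long pause (for example, a phone that was switched off) does not distort the typical spacing.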
The full Applications.csv table has a total of 57 features.
Detailed specifications of all the tables available in the entire
SherLock dataset are available online in PDF format.
The features of this table were collected from various sources:
the Linux kernel (through Linux procfs),
the Android API or operating system,
as well as SherLock’s custom measurements.
All these data were obtainable without rooting the phone.
As can be seen in the specification document linked above,
these parameters provide very intimate,
machine-level information about the apps running on the smartphone.
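As an illustration of the kind of per-process information that Linux procfs exposes, the sketch below parses one line in the format of `/proc/<pid>/stat`. It is not the SherLock agent's code, and the source does not say which procfs files the agent read; the sample line and the helper name `parse_proc_stat` are made up. Field positions follow the Linux `proc(5)` manual page.

```python
def parse_proc_stat(line):
    """Extract a few fields from a line formatted like /proc/<pid>/stat."""
    # The command name sits in parentheses and may itself contain spaces
    # or parentheses, so split at the LAST closing parenthesis.
    head, _, tail = line.rpartition(")")
    pid_text, _, comm = head.partition("(")
    rest = tail.split()  # rest[0] is field 3 (state) in proc(5) numbering

    def field(number):
        # Convert a 1-based proc(5) field number (>= 3) to an index in rest.
        return rest[number - 3]

    return {
        "pid": int(pid_text),
        "comm": comm,
        "state": field(3),
        "utime_ticks": int(field(14)),   # CPU time in user mode
        "stime_ticks": int(field(15)),   # CPU time in kernel mode
        "num_threads": int(field(20)),
        "vsize_bytes": int(field(23)),   # virtual memory size
        "rss_pages": int(field(24)),     # resident set size
    }


# Made-up sample line for illustration only.
demo_line = (
    "1234 (com.example app) S 1 1234 1234 0 -1 4194304 100 0 0 0 "
    "250 75 0 0 20 0 12 0 5000 104857600 2560"
)
demo_stat = parse_proc_stat(demo_line)
print(demo_stat)
```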
SherLock’s Reduced Datasets
The DeapSECURE team used only Applications.csv
and selected only subsets of columns from the original table
in order to create
educational datasets that are used throughout several DeapSECURE lessons.
sherlock_mystery_2apps.csv
This dataset contains only 9 real features and two apps: Facebook and WhatsApp. We “fluffed” the dataset with common issues (noise, missing data, duplicate features) in order to teach important lessons on data cleaning. This dataset is used in the Big Data, Machine Learning, and early part of the Neural Networks modules.
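Two of the issues named above, missing data and duplicate features, can be detected with a short check. The sketch below works on a plain column-oriented dictionary so that it has no dependencies; the column names and values are hypothetical and are not the actual columns of sherlock_mystery_2apps.csv.

```python
def find_missing_and_duplicates(columns):
    """Inspect a table given as {column name: list of values}.

    Returns (missing, duplicates), where missing maps each column to its
    count of None values and duplicates lists pairs of columns whose
    values are identical.
    """
    missing = {
        name: sum(value is None for value in values)
        for name, values in columns.items()
    }
    first_seen = {}
    duplicates = []
    for name, values in columns.items():
        key = tuple(values)
        if key in first_seen:
            duplicates.append((first_seen[key], name))
        else:
            first_seen[key] = name
    return missing, duplicates


# Hypothetical columns and values, for illustration only.
demo_table = {
    "cpu_usage": [0.1, None, 0.3],
    "threads": [10, 12, 11],
    "threads_copy": [10, 12, 11],
}
demo_missing, demo_duplicates = find_missing_and_duplicates(demo_table)
print(demo_missing, demo_duplicates)
```

With pandas, which the lessons use, the per-column count of missing values is available as `df.isna().sum()`.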
sherlock_18apps.csv
This dataset contains 18 real features and 18 apps. It is primarily used in the Neural Networks module to compare neural network models against other machine learning approaches.
SherLock Data Throughout the DeapSECURE Lessons
Big Data lesson, episode 2, “Big Data Challenge: Detecting Malicious Activities on Smartphones”: The SherLock dataset was first described at a high level, including the motivation and experimental setup.
Big Data lesson, episode 4, “Analytics of Sherlock Data with Pandas”: A small snippet of sherlock_mystery_2apps.csv was used, and its features described.
Be aware that, for pedagogical purposes, this table is not 100% clean.
Neural Networks lesson, episode 5, “Classifying Smartphone Apps with Keras”: The sherlock_18apps.csv dataset was first introduced, and its features explained.