MalDroid 2020 | Datasets | Research | Canadian Institute for Cybersecurity | UNB

Canadian Institute for Cybersecurity

Android malware dataset (CICMalDroid 2020)

We are providing a new Android malware dataset, namely CICMalDroid 2020, that has the following four properties:

  1. Big. It has more than 17,341 Android samples.
  2. Recent. It includes recent and sophisticated Android samples collected up to 2018.
  3. Diverse. It has samples spanning five distinct categories: Adware, Banking malware, SMS malware, Riskware, and Benign.
  4. Comprehensive. It includes the most complete captured static and dynamic features compared with publicly available datasets.

Data collection

We collected more than 17,341 Android samples from several sources, including the VirusTotal service, the Contagio security blog, AMD, MalDozer, and other datasets used by recent research contributions (the sources are cited in the paper).

The samples were collected from December 2017 to December 2018. Classifying Android apps by malware category is important for cybersecurity researchers, because it informs the choice of countermeasures and mitigation strategies. Hence, our dataset intentionally spans five distinct categories: Adware, Banking malware, SMS malware, Riskware, and Benign. Each malware category is briefly described as follows:

Adware

Mobile Adware refers to the advertising material (i.e., ads) that typically hides inside the legitimate apps which have been infected by malware (available on the third-party market). Because the ad library used by the malware repeats a series of steps to keep the ads running, Adware continuously pops up ads (even if the victim tries to force-close the app). Adware can infect and root-infect a device, forcing it to download specific Adware types and allowing attackers to steal personal information.

Banking Malware

Mobile Banking malware is specialized malware designed to gain access to the user’s online banking accounts by mimicking the original banking applications or banking web interface. Most mobile Banking malware is Trojan-based: it is designed to infiltrate devices, steal sensitive details (e.g., bank login and password), and send the stolen information to a command and control (C&C) server.

SMS Malware

SMS malware exploits the SMS service as its medium of operation, intercepting SMS payloads to conduct attacks. The attackers first upload malware to their hosting sites to be linked with the SMS. They then use the C&C server to issue attack instructions, e.g., sending malicious SMS, intercepting SMS, and stealing data.

Mobile Riskware

Riskware refers to legitimate programs that can cause damage if malicious users exploit them. Consequently, it can turn into any other form of malware such as Adware or Ransomware, which extends functionalities by installing newly infected applications. Uniquely, this category only has a single variant, mostly labeled as "Riskware" by VirusTotal.

Benign

All other applications that do not fall into the categories above are considered benign, meaning that they are not malicious. To verify that the benign samples are not malicious, we scanned all of them with VirusTotal.

Data analysis

We analyzed our collected data dynamically using CopperDroid, a VMI-based dynamic analysis system, to automatically reconstruct low-level OS-specific and high-level Android-specific behaviors of Android samples. Out of 17,341 samples, 13,077 samples ran successfully while the rest failed due to errors such as time-out, invalid APK files, and memory allocation failures.

All the APK files are first executed in CopperDroid, and the run-time behaviors are recorded in log files. CopperDroid's analysis results are available in JSON format, which is easy to parse and carries additional auxiliary information. The analysis results are classified into three broad groups:

  1. Statically extracted information, e.g., intents, permissions and services, frequency counts for different file types, incidents of obfuscation, and sensitive API invocations;
  2. Dynamically observed behaviors, which are broadly divided into three categories: system calls, binder calls, and composite behaviors;
  3. PCAP of all the network traffic captured during the analysis.

Dataset

We loaded all 13,077 analysis results; about 12% of the JSON files failed to open, mostly due to “unterminated string” errors. The remaining Android samples in each category are as follows:

  • Adware: 1,253
  • Banking: 2,100
  • SMS malware: 3,904
  • Riskware: 2,546
  • Benign: 1,795
  • Total: 11,598
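
The loading step described above, including skipping log files that fail to parse, could be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name load_reports, the one-file-per-sample layout, and the *.json naming pattern are all assumptions.

```python
import json
from pathlib import Path


def load_reports(log_dir):
    """Parse every *.json file in log_dir, skipping files that fail to parse.

    Hypothetical helper: the directory layout and file pattern are assumed.
    """
    reports = {}  # file name -> parsed JSON content
    failed = {}   # file name -> parser error message
    for path in sorted(Path(log_dir).glob("*.json")):
        try:
            # errors="replace" keeps stray non-UTF-8 bytes from aborting the read.
            with open(path, encoding="utf-8", errors="replace") as handle:
                reports[path.name] = json.load(handle)
        except json.JSONDecodeError as err:
            # Truncated logs surface here, for example with the message
            # "Unterminated string starting at".
            failed[path.name] = err.msg
    return reports, failed
```

Calling load_reports on a directory of capturing logs returns the successfully parsed reports together with a record of which files were dropped and why, so the retained sample count can be audited per category.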

Since the sizes of the categories are not equal, we balance the number of samples in each category before splitting them into training and test sets for analysis with AI techniques. So that every sample is equally likely to be selected, we randomly shuffle the samples within each category before balancing.
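
One possible way to implement this shuffle, balance, and split procedure is sketched below. The page does not say how balancing is performed or what split ratio is used, so downsampling every category to the size of the smallest one and the 20% test fraction are assumptions, as is the function name balance_and_split.

```python
import random


def balance_and_split(samples_by_category, test_fraction=0.2, seed=0):
    """Shuffle each category, downsample to the smallest one, then split.

    Assumptions (not stated in the source): balancing is done by
    downsampling, and test_fraction of each category goes to the test set.
    """
    rng = random.Random(seed)
    # Every category is cut down to the size of the smallest category.
    target = min(len(items) for items in samples_by_category.values())
    n_test = int(round(target * test_fraction))
    train, test = [], []
    for category, items in samples_by_category.items():
        shuffled = list(items)
        rng.shuffle(shuffled)      # every sample is equally likely to be kept
        kept = shuffled[:target]   # balance by truncating the shuffled list
        test += [(sample, category) for sample in kept[:n_test]]
        train += [(sample, category) for sample in kept[n_test:]]
    # Mix the categories so neither set is ordered by label.
    rng.shuffle(train)
    rng.shuffle(test)
    return train, test
```

With the category sizes listed above, the smallest category (Adware, 1,253 samples) would set the per-category size under this downsampling assumption.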

License

The CICMalDroid2020 dataset consists of the following items and is publicly available for researchers.

  1. APK files: 17,341 Android samples spanning five distinct categories: Adware, Banking malware, SMS malware, Riskware, and Benign.
  2. Capturing-logs: The output analysis results of 13,077 samples in five categories: Adware, Banking malware, SMS malware, Riskware, and Benign.
  3. CSV files:
    1. 470 extracted features for 11,598 APK files comprising frequencies of system calls, binders, and composite behaviors
    2. 139 extracted features for 11,598 APK files comprising frequencies of system calls
    3. 50,621 extracted features for 11,598 APK files comprising static information, such as intent actions, permissions, intent consts, files, method tags, sensitive APIs, services, packages, and receivers.
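
To illustrate what a frequency feature looks like, the sketch below turns a sequence of observed call names into a fixed-length count vector. It only shows the general idea: the feature names and the trace are hypothetical, and the actual column names and extraction code behind the CSV files are not described here.

```python
from collections import Counter


def frequency_vector(observed_calls, vocabulary):
    """Count how often each vocabulary entry occurs in observed_calls."""
    counts = Counter(observed_calls)
    # Calls outside the vocabulary are ignored; absent entries count as 0.
    return [counts[name] for name in vocabulary]


# Hypothetical feature names and trace, for illustration only.
vocabulary = ["open", "read", "write", "sendto"]
trace = ["open", "read", "read", "write", "read", "close"]
print(frequency_vector(trace, vocabulary))  # prints [1, 3, 1, 0]
```

Each APK then corresponds to one row of such counts, with one column per feature.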

If you are using our dataset, please cite the following research papers, which outline the details of the dataset and its underlying principles:

Samaneh Mahdavifar, Andi Fitriah Abdul Kadir, Rasool Fatemi, Dima Alhadidi, Ali A. Ghorbani; Dynamic Android Malware Category Classification using Semi-Supervised Deep Learning, The 18th IEEE International Conference on Dependable, Autonomic, and Secure Computing (DASC), Aug. 17-24, 2020.

Samaneh Mahdavifar, Dima Alhadidi, and Ali A. Ghorbani (2022). Effective and Efficient Hybrid Android Malware Classification Using Pseudo-Label Stacked Auto-Encoder. Journal of Network and Systems Management 30 (1), 1-34.

Acknowledgement

The authors would like to express their gratitude toward Dr. Lorenzo Cavallaro and Feargus Pendlebury (Systems Security Research Lab, King’s College London) for generously analyzing a large number of Android APKs in CopperDroid.

Download the dataset