CCCS-CIC-AndMal-2020

A Canadian Institute for Cybersecurity (CIC) project in collaboration with the Canadian Centre for Cyber Security (CCCS)

Android malware is a major source of security problems on the internet. The Android malware industry is becoming increasingly disruptive, with almost 12,000 new Android malware instances appearing every day. Detecting Android malware on smartphones is therefore an essential goal for the cybersecurity community.

Android malware is one of the most serious threats on the internet and has seen an unprecedented upsurge in recent years, which makes it an open challenge for cybersecurity experts. Many machine learning techniques are available to identify and classify Android malware, and deep learning has recently emerged as a prominent classification method for such samples.

This research work proposes a new, comprehensive, and large Android malware dataset named CCCS-CIC-AndMal-2020. The dataset includes 200K benign and 200K malware samples, totalling 400K Android apps, with 14 prominent malware categories and 191 eminent malware families.

1. Introduction

To generate a representative dataset, we collaborated with CCCS to capture 200K Android malware apps, which are labeled and characterized by their corresponding family. Benign Android apps (200K) are collected from the AndroZoo dataset to balance the dataset. We collected 14 malware categories: adware, backdoor, file infector, no category, Potentially Unwanted Apps (PUA), ransomware, riskware, scareware, trojan, trojan-banker, trojan-dropper, trojan-sms, trojan-spy, and zero-day.

A complete taxonomy of all the malware families of the captured malware apps is created by dividing them into eight categories: sensitive data collection, media, hardware, actions/activities, internet connection, C&C, antivirus, and storage & settings. The taxonomy is presented in the research paper cited under License (Section 4).

2. Capturing data and final dataset

CCCS helped us capture real-world Android malware apps for analysis. We used VirusTotal to determine the malware family and label each sample, requiring a consensus of 70% of antivirus engines to make the labels reliable. We then searched for similar malware samples so that samples with similar characteristics are categorized together. Table 1 presents the 14 Android malware categories along with the number of families and samples for each in the dataset.
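The 70% consensus rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' labeling pipeline: the function name `consensus_family` and the `engine_labels` mapping are hypothetical, the input is assumed to be already-parsed family names per antivirus engine (no VirusTotal API call is shown), and the threshold is assumed to apply to the engines that actually reported a label.

```python
from collections import Counter
from typing import Mapping, Optional

CONSENSUS_THRESHOLD = 0.70  # the 70% consensus mentioned in the text


def consensus_family(engine_labels: Mapping[str, Optional[str]],
                     threshold: float = CONSENSUS_THRESHOLD) -> Optional[str]:
    """Return the family named by at least `threshold` of the engines that
    reported a label, or None when no family reaches the threshold.

    `engine_labels` is a hypothetical mapping: engine name -> family or None.
    """
    # Normalise case/whitespace and drop engines that reported nothing.
    votes = [label.strip().lower()
             for label in engine_labels.values()
             if label and label.strip()]
    if not votes:
        return None
    family, count = Counter(votes).most_common(1)[0]
    return family if count / len(votes) >= threshold else None
```

For example, if three of four reporting engines name `smsreg`, the agreement is 75% and the sample is labeled `smsreg`; a 50/50 split between two families yields no label.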

Table 1: Dataset details

Category          Number of families    Number of samples
Adware            48                    47,210
Backdoor          11                    1,538
File Infector     5                     669
No Category       -                     2,296
PUA               8                     2,051
Ransomware        8                     6,202
Riskware          21                    97,349
Scareware         3                     1,556
Trojan            45                    13,559
Trojan-Banker     11                    887
Trojan-Dropper    9                     2,302
Trojan-SMS        11                    3,125
Trojan-Spy        11                    3,540
Zero-day          -                     13,340
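
As a quick arithmetic check on Table 1, the following Python sketch sums the per-category counts and computes each category's share of the malware samples. The numbers are copied from the table; the variable names are ours, and `None` stands in for the "-" entries.

```python
# (category, number of families, number of samples), copied from Table 1.
# None marks the "-" entries (No Category and Zero-day have no family count).
TABLE_1 = [
    ("Adware", 48, 47210),
    ("Backdoor", 11, 1538),
    ("File Infector", 5, 669),
    ("No Category", None, 2296),
    ("PUA", 8, 2051),
    ("Ransomware", 8, 6202),
    ("Riskware", 21, 97349),
    ("Scareware", 3, 1556),
    ("Trojan", 45, 13559),
    ("Trojan-Banker", 11, 887),
    ("Trojan-Dropper", 9, 2302),
    ("Trojan-SMS", 11, 3125),
    ("Trojan-Spy", 11, 3540),
    ("Zero-day", None, 13340),
]

total_samples = sum(samples for _, _, samples in TABLE_1)
total_families = sum(families or 0 for _, families, _ in TABLE_1)

# Fraction of all malware samples that falls in each category.
shares = {name: samples / total_samples for name, _, samples in TABLE_1}
largest = max(shares, key=shares.get)

print(f"samples: {total_samples}, families: {total_families}")
print(f"largest category: {largest} ({shares[largest]:.1%})")
```

The rows add up to 195,624 malware samples (which the text rounds to 200K) and 191 families. Riskware alone accounts for just under half of the samples, so the malware side of the dataset is strongly imbalanced across categories.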

The families in each malware category of Table 1, along with the number of captured samples per family, are listed below:


Adware

Sr. No. Family Number of captured samples
1 dowgin 2679
2 adflex 418
3 admogo 79
4 adviator 77
5 adwo 188
6 airpush 2242
7 appad 92
8 appsgeyser 60
9 baiduprotect 984
10 batmobi 458
11 dianjin 45
12 dianle 19
13 domob 103
14 ewind 1047
15 feiwo 108
16 fictus 349
17 ganlet 28
18 adend 301
19 gmobi 17
20 hiddenad 61
21 hummingbad 28
22 igexin 82
23 inmobi 330
24 inoco 5649
25 kalfere 113
26 kuguo 1015
27 leadbolt 233
28 mobclick 41
29 mobidash 1033
30 mobisec 117
31 mulad 171
32 oimobi 913
33 shedun 19036
34 sprovider 227
35 viser 31
36 wooboo 16
37 xynyin 44
38 zdtad 5694
39 frupi 43
40 kyhub 28
41 stopsms 26
42 loki 46
43 kyview 127
44 pandaad 50
45 plague 14
46 accutrack 7
47 adcolony 17
48 gexin 3

Backdoor

Sr. No. Family Number of captured samples
1 kapuser 15
2 kmin 24
3 fobus 171
4 mobby 119
5 hiddad 664
6 moavt 166
7 androrat 129
8 dendroid 48
9 levida 51
10 pyls 24
11 droidkungfu 50

File Infector

Sr. No. Family Number of captured samples
1 commplat 77
2 leech 99
3 tachi 45
4 gudex 14
5 aqplay 407

PUA

Sr. No. Family Number of captured samples
1 apptrack 92
2 cauly 27
3 secapk 1004
4 umpay 67
5 wiyun 11
6 youmi 529
7 utchi 139
8 scamapp 99

Ransomware

Sr. No. Family Number of captured samples
1 masnu 35
2 congur 252
3 fusob 67
4 jisut 820
5 koler 79
6 lockscreen 356
7 slocker 998
8 smsspy 3319

Riskware

Sr. No. Family Number of captured samples
1 skymobi 10229
2 anydown 57
3 badpac 45
4 deng 58
5 dnotua 36
6 jiagu 721
7 metasploit 28
8 mobilepay 1197
9 remotecode 36
10 revmob 806
11 secneo 27
12 smspay 28512
13 smsreg 50073
14 talkw 49
15 tencentprotect 144
16 tordow 7
17 triada 493
18 wapron 93
19 nqshield 46
20 kingroot 24
21 wificrack 15

Scareware

Sr. No. Family Number of captured samples
1 avpass 126
2 mobwin 23
3 fakeapp 1332

Trojan

Sr. No. Family Number of captured samples
1 Autosms 239
2 coinge 16
3 droiddreamlight 15
4 gluper 680
5 hiddenapp 157
6 iconosys 33
7 lotoor 661
8 mobtes 343
9 mseg 148
10 qysly 94
11 rootnik 474
12 syringe 99
13 wkload 143
14 zbot 85
15 hyspu 112
16 basebridge 63
17 boogr 218
18 lovetrap 48
19 oveead 30
20 rusms 27
21 systemmonitor 61
22 uupay 27
23 wintertiger 24
24 typstu 28
25 blouns 652
26 autoins 479
27 cnsms 3413
28 gappusin 766
29 gedma 11
30 ginmaster 130
31 hypay 360
32 mytrackp 1054
33 subspod 11
34 walkfree 15
35 xinyinhe 59
36 drosel 59
37 uapush 11
38 uten 9
39 smsagent 1166
40 styricka 833
41 autoinst 12
42 noicondl 33
43 obtes 5
44 droiddream 3
45 hiddenap 3

Trojan-Banker

Sr. No. Family Number of captured samples
1 asacub 260
2 fakebank 17
3 faketoken 52
4 marcher 87
5 minimob 56
6 guerrilla 256
7 bankbot 4
8 gugi 8
9 svpeng 68
10 wroba 9
11 zitmo 40

Trojan-Dropper

Sr. No. Family Number of captured samples
1 locker 1296
2 rooter 51
3 xiny 31
4 boqx 106
5 hqwar 118
6 ramnit 84
7 ztorg 500
8 gorpo 16

Trojan-SMS

Sr. No. Family Number of captured samples
1 opfake 368
2 hipposms 20
3 podec 13
4 feejar 56
5 smsdel 40
6 plankton 186
7 jsmshider 21
8 smsbot 42
9 boxer 87
10 fakeinst 2148
11 vietsms 13

Trojan-Spy

Sr. No. Family Number of captured samples
1 spynote 21
2 kasandra 29
3 spyagent 48
4 spyoo 13
5 tekwon 19
6 sandr 208
7 qqspy 27
8 smforw 1873
9 smsthief 1058
10 smszombie 52
11 spydealer 1

For benign Android apps, we used the AndroZoo dataset, which currently contains more than eight million unique Android apps and is still growing. AndroZoo's architecture collects apps from several sources, including the official Android market (Google Play), Anzhi, AppChina, 1mobile, and the Genome project dataset. A weekly updated list containing detailed information about all the apps is maintained, and an HTTP API allows the unaltered APKs to be downloaded in full from AndroZoo.
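
A minimal sketch of fetching one APK over the HTTP API might look like the following. The endpoint URL and the `apikey`/`sha256` query parameters are assumptions based on AndroZoo's public documentation and should be verified against it before use; the function names are hypothetical, and an API key must be requested from the AndroZoo maintainers.

```python
import shutil
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint; check the current AndroZoo documentation before relying on it.
ANDROZOO_DOWNLOAD_URL = "https://androzoo.uni.lu/api/download"


def build_download_url(api_key: str, sha256: str) -> str:
    """Build the download URL for the APK identified by its SHA-256 digest."""
    digest = sha256.strip().upper()
    if len(digest) != 64 or any(c not in "0123456789ABCDEF" for c in digest):
        raise ValueError("expected a 64-character hexadecimal SHA-256 digest")
    query = urlencode({"apikey": api_key, "sha256": digest})
    return f"{ANDROZOO_DOWNLOAD_URL}?{query}"


def download_apk(api_key: str, sha256: str, dest_path: str) -> None:
    """Stream the unaltered APK to `dest_path` (performs a network request)."""
    with urlopen(build_download_url(api_key, sha256)) as response, \
            open(dest_path, "wb") as out:
        shutil.copyfileobj(response, out)
```

The digests to request would come from the weekly app list mentioned above; validating the digest before building the URL avoids spending a request on a malformed identifier.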

3. Feature extraction and selection

AndroidManifest.xml contains many features that can be used for static analysis. The main extracted features include:

  • Activities: an Android activity is one screen of the app's user interface
  • Broadcast receivers and providers
  • Metadata: an additional option for storing information that can be accessed throughout the entire project
  • Permissions requested by the application: permissions protect the user's privacy and are needed to access sensitive user data (such as contacts and SMS)
  • System features (such as camera and internet)
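
The feature list above can be illustrated with a short Python sketch that reads these items from a manifest. It is not the extraction tool used for the dataset: it assumes the manifest has already been decoded to plain-text XML (manifests inside an APK are stored in a binary format, so a decoder such as apktool is needed first), and the function name, dictionary keys, and sample manifest are our own.

```python
import xml.etree.ElementTree as ET

ANDROID_NS = "http://schemas.android.com/apk/res/android"
NAME_ATTR = f"{{{ANDROID_NS}}}name"  # fully qualified android:name attribute


def extract_manifest_features(manifest_xml: str) -> dict:
    """Collect the static features listed above from decoded manifest XML."""
    root = ET.fromstring(manifest_xml)

    def names(path: str) -> list:
        # android:name of every element matching `path`, relative to <manifest>.
        return [el.get(NAME_ATTR) for el in root.iterfind(path)
                if el.get(NAME_ATTR)]

    return {
        "package": root.get("package"),
        "activities": names("application/activity"),
        "receivers": names("application/receiver"),
        "providers": names("application/provider"),
        "meta_data": names(".//meta-data"),
        "permissions": names("uses-permission"),
        "system_features": names("uses-feature"),
    }


# Small hand-written manifest for illustration only.
SAMPLE_MANIFEST = """
<manifest xmlns:android="http://schemas.android.com/apk/res/android"
          package="com.fb.iwidget">
  <uses-permission android:name="android.permission.INTERNET"/>
  <uses-feature android:name="android.hardware.camera"/>
  <application>
    <activity android:name="com.fb.iwidget.MainActivity"/>
    <receiver android:name="com.fb.iwidget.ActionReceiver">
      <meta-data android:name="android.appwidget.provider"/>
    </receiver>
  </application>
</manifest>
"""

features = extract_manifest_features(SAMPLE_MANIFEST)
print(features["permissions"])
```

Each value is a plain list of names, which keeps the output close to the per-feature value lists shown in Table 2.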

Table 2 presents examples of static features extracted from the captured dataset.

Table 2: List of static features

Feature                Values
Package Name           "com.fb.iwidget"
Activities             "com.fb.iwidget.OverlayActivity"
                       "org.acra.CrashReportDialog"
                       "com.batch.android.BatchActionActivity"
                       "com.fb.iwidget.MainActivity"
                       "com.fb.iwidget.PreferencesActivity"
                       "com.fb.iwidget.PickerActivity"
                       "com.fb.iwidget.IntroActivity"
Services               "com.batch.android.BatchActionService"
                       "com.fb.iwidget.MainService"
                       "com.fb.iwidget.SnapAccessService"
Receivers/Providers    "com.fb.iwidget.ExpandWidgetProvider"
                       "com.fb.iwidget.ActionReceiver"
Intents Actions        "android.accessibilityservice.AccessibilityService"
                       "android.appwidget.action.APPWIDGET_UPDATE"
                       "android.intent.action.BOOT_COMPLETED"
                       "android.intent.action.CREATE_SHORTCUT"
                       "android.intent.action.MAIN"
                       "android.intent.action.MY_PACKAGE_REPLACED"
                       "android.intent.action.USER_PRESENT"
                       "android.intent.action.VIEW"
                       "com.fb.iwidget.action.SHOULD_REVIVE"
Intents Categories     "android.intent.category.BROWSABLE"
                       "android.intent.category.DEFAULT"
                       "android.intent.category.LAUNCHER"
Permissions            "android.permission.ACCESS_NETWORK_STATE"
                       "android.permission.CALL_PHONE"
                       "android.permission.INTERNET"
                       "android.permission.RECEIVE_BOOT_COMPLETED"
                       "android.permission.SYSTEM_ALERT_WINDOW"
                       "com.android.vending.BILLING"
                       "android.permission.BIND_ACCESSIBILITY_SERVICE"
Meta-Data              "android.accessibilityservice"
                       "android.appwidget.provider"
#Icons                 331
#Pictures              0
#Videos                0
Audio files            0
Videos                 0
Size of the App        4.2M

4. License

You may redistribute, republish, and mirror the CCCS-CIC-AndMal-2020 dataset in any form. However, any use or redistribution of the data must include a citation to the CCCS-CIC-AndMal-2020 dataset and the following paper.

Abir Rahali, Arash Habibi Lashkari, Gurdip Kaur, Laya Taheri, Francois Gagnon, and Frédéric Massicotte, “DIDroid: Android Malware Classification and Characterization Using Deep Image Learning”, 10th International Conference on Communication and Network Security, Tokyo, Japan, November 2020

Acknowledgements

We thank the Mitacs Globalink Program for providing the Research Internship (GRI) opportunity and the Harrison McCain Young Scholar Foundation funds from the University of New Brunswick (UNB) for supporting this project. We also thank CCCS for sharing the malware samples of this dataset with us.

Download the dataset