Canadian Institute for Cybersecurity

Botnet dataset

Assessing the performance of any detection approach requires experimentation with data that is heterogeneous enough to simulate real traffic to an acceptable level. The lack of such datasets for evaluating botnet detection approaches is well known in the field, mostly due to a number of challenges that have been repeatedly emphasized in the literature [1], [2]. We constructed such a dataset paying close attention to the following challenges:

Generality: Unfortunately, most existing botnet datasets suffer from a generality issue, i.e., they include data from only a few botnets (usually two or three samples). Detectors developed in these limited environments reflect only a small number of characteristics describing a very specific botnet behavior, which makes such approaches impractical and ineffective in the face of novel threats.

Realism: The effectiveness of a developed approach in practice depends heavily on how realistic the botnet traffic traces used for its evaluation are. Botnet traffic is usually generated or captured in a controlled environment. Providing a resilient environment (one not detectable by the botnet) in which a botnet performs all of its intended malicious functionality is not trivial. In addition to resiliency, the collection period must be long enough to allow dormant bots to exhibit their functionality.

Representativeness: Another problem with generating botnet data is the ability of the collected network traffic traces to reflect the real environment a detector will face during deployment. Due to privacy concerns, gathering background data in a real production environment is not feasible in most cases; as a result, traffic is either simulated or gathered in a controlled environment. To overcome these challenges, we created an evaluation set combining non-overlapping subsets of the following data:

  • ISOT dataset [3], created by merging different available datasets: the French chapter of the Honeynet Project [4], Ericsson Research in Hungary [5], and the Lawrence Berkeley National Laboratory [6]. It contains both malicious traffic (traces of the Storm and Zeus botnets) and non-malicious traffic (gaming packets, HTTP traffic, and P2P applications such as BitTorrent). We used 15% and 25% of the ISOT dataset in our training and test datasets, respectively.
  • ISCX 2012 IDS dataset [7], generated in a physical testbed implementation using real devices that produce real traffic (e.g., SSH, HTTP, and SMTP) mimicking users’ behavior. We included a subset of its normal traces in our training dataset, and a subset of its normal and IRC botnet traffic in our test dataset.
  • Botnet traffic generated by the Malware Capture Facility Project [8], a research project whose purpose is generating and capturing botnet traces over the long term. From this data we extracted four botnet traces (Neris, Rbot, Virut, and NSIS) for our training dataset and seven botnet traces (Neris, Rbot, Virut, NSIS, Menti, Sogou, and Murlo) for our test dataset.

To merge these data traces into one unified dataset we employed the so-called overlay methodology [1], one of the most popular methods for creating synthetic datasets. Malicious data is usually captured by honeypots or by infecting computers with a given bot binary in a controlled environment [9].

Botnet traces can be merged with benign data by mapping malicious data either to machines existing in the home network or to machines outside of the current network [1]. Considering the wide range of IP addresses in the traces, we mapped botnet IPs to hosts outside of the current network using the Bit-Twist packet generator [10]. Malicious and benign traffic were then replayed using Tcpreplay [11] and captured with tcpdump [12] as a single dataset.
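The IP-remapping step of the overlay methodology can be illustrated in miniature. The snippet below is a simplified Python sketch, not the tooling actually used (the dataset was built by rewriting IPs in raw pcaps with Bit-Twist, replaying with Tcpreplay, and capturing with tcpdump): it remaps botnet endpoints in flow records to addresses outside the home network before merging them with benign flows. All addresses, field names, and helpers here are hypothetical.

```python
# Illustrative sketch of the overlay merge on simplified flow records.
# The real pipeline operated on pcap files with Bit-Twist/Tcpreplay/tcpdump;
# this mirrors only the IP-remapping idea. All values are hypothetical.

# Hypothetical mapping: botnet IPs -> hosts outside the home network.
IP_MAP = {
    "10.0.0.5": "203.0.113.10",  # example bot
    "10.0.0.6": "203.0.113.11",  # example C&C server
}

def remap_flow(flow, ip_map):
    """Return a copy of the flow with src/dst IPs remapped."""
    out = dict(flow)
    out["src"] = ip_map.get(flow["src"], flow["src"])
    out["dst"] = ip_map.get(flow["dst"], flow["dst"])
    return out

def overlay(benign_flows, botnet_flows, ip_map):
    """Merge benign and remapped botnet flows, ordered by timestamp."""
    merged = benign_flows + [remap_flow(f, ip_map) for f in botnet_flows]
    return sorted(merged, key=lambda f: f["ts"])

benign = [{"ts": 1.0, "src": "192.168.1.2", "dst": "8.8.8.8"}]
botnet = [{"ts": 0.5, "src": "10.0.0.5", "dst": "10.0.0.6"}]
merged = overlay(benign, botnet, IP_MAP)
print(merged[0]["src"])  # remapped botnet flow comes first: 203.0.113.10
```

In the actual dataset the equivalent rewriting happened at the packet level, so the merged capture replays as ordinary traffic in which the botnet hosts appear to sit outside the monitored network.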

Distribution of botnet types in the training dataset

Botnet name | Type | Number of flows (portion of dataset)

  • Neris | IRC | 21159 (12%)
  • Rbot | IRC | 39316 (22%)
  • Virut | HTTP | 1638 (0.94 %)
  • NSIS | P2P | 4336 (2.48%)
  • SMTP Spam | P2P | 11296 (6.48%)
  • Zeus | P2P | 31 (0.01%)
  • Zeus control (C&C) | P2P | 20 (0.01%)

The resulting set was divided into training and test datasets containing 7 and 16 types of botnets, respectively. Tables 1 and 2 detail the distribution and type of botnets in each dataset. Our training dataset is 5.3 GB in size, of which 43.92% is malicious and the remainder contains normal flows. The test dataset is 8.5 GB, of which 44.97% is malicious flows. We included a greater diversity of botnet traces in the test dataset than in the training dataset in order to evaluate the novelty detection a feature subset can provide.

Distribution of botnet types in the test dataset

Botnet name | Type | Number of flows (portion of dataset)

  • Neris | IRC | 25967 (5.67%)
  • Rbot | IRC | 83 (0.018%)
  • Menti | IRC | 2878 (0.62%)
  • Sogou | HTTP | 89 (0.019%)
  • Murlo | IRC | 4881 (1.06%)
  • Virut | HTTP | 58576 (12.80%)
  • NSIS | P2P | 757 (0.165%)
  • Zeus | P2P | 502 (0.109%)
  • SMTP Spam | P2P | 21633 (4.72%)
  • UDP Storm | P2P | 44062 (9.63%)
  • Tbot | IRC | 1296 (0.283%)
  • Zero Access | P2P | 1011 (0.221%)
  • Weasel | P2P | 42313 (9.25%)
  • Smoke Bot | P2P | 78 (0.017%)
  • Zeus Control (C&C) | P2P | 31 (0.006%)
  • ISCX IRC bot | P2P | 1816 (0.387%)
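As a quick consistency check, the per-botnet percentages in the two tables should sum to the stated malicious proportions of each dataset (43.92% for training, 44.97% for test). A short Python snippet confirms this:

```python
# Per-botnet percentages copied from the two tables above.
train_pct = [12, 22, 0.94, 2.48, 6.48, 0.01, 0.01]
test_pct = [5.67, 0.018, 0.62, 0.019, 1.06, 12.80, 0.165, 0.109,
            4.72, 9.63, 0.283, 0.221, 9.25, 0.017, 0.006, 0.387]

# The sums match the malicious portions stated in the text
# (up to rounding of the individual table entries).
print(f"train malicious: {sum(train_pct):.2f}%")  # 43.92%
print(f"test malicious:  {sum(test_pct):.2f}%")
```

The training percentages sum to 43.92% exactly; the test percentages sum to approximately 44.97%, with tiny discrepancies attributable to rounding of the per-botnet entries.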

List of malicious IPs

  • IRC
    • 192.168.2.112 -> 131.202.243.84
    • 192.168.5.122 -> 198.164.30.2
    • 192.168.2.110 -> 192.168.5.122
    • 192.168.4.118 -> 192.168.5.122
    • 192.168.2.113 -> 192.168.5.122
    • 192.168.1.103 -> 192.168.5.122
    • 192.168.4.120 -> 192.168.5.122
    • 192.168.2.112 -> 192.168.2.110
    • 192.168.2.112 -> 192.168.4.120
    • 192.168.2.112 -> 192.168.1.103
    • 192.168.2.112 -> 192.168.2.113
    • 192.168.2.112 -> 192.168.4.118
    • 192.168.2.112 -> 192.168.2.109
    • 192.168.2.112 -> 192.168.2.105
    • 192.168.1.105 -> 192.168.5.122
  • Neris: 147.32.84.180
  • RBot: 147.32.84.170
  • Menti: 147.32.84.150
  • Sogou: 147.32.84.140
  • Murlo: 147.32.84.130
  • Virut: 147.32.84.160
  • IRCbot and black hole1: 10.0.2.15
  • Black hole 2: 192.168.106.141
  • Black hole 3: 192.168.106.131
  • TBot: 172.16.253.130, 172.16.253.131, 172.16.253.129, 172.16.253.240
  • Weasel: Botmaster IP: 74.78.117.238; Bot IP: 158.65.110.24
  • Zeus (zeus sample 1 and 2 and 3, bin_zeus): 192.168.3.35, 192.168.3.25, 192.168.3.65, 172.29.0.116
  • Osx_trojan: 172.29.0.109
  • Zero access (zero access 1 and 2): 172.16.253.132, 192.168.248.165
  • Smoke bot: 10.37.130.4
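The IP list above is what makes the merged capture labelable: a flow is malicious exactly when its endpoints match one of the listed addresses or directed pairs. The following Python sketch shows one way a user of the dataset might apply such labels; the helper is hypothetical (not distributed with the dataset), and only a few of the IPs above are included for brevity.

```python
# Label flows as malicious using (a subset of) the published IP list.
# Hypothetical helper for dataset users; not part of the dataset tooling.

# Directed IRC pairs from the list above: malicious if (src, dst) matches.
IRC_PAIRS = {
    ("192.168.2.112", "131.202.243.84"),
    ("192.168.5.122", "198.164.30.2"),
}
# Single bot IPs from the list above: any flow touching them is malicious.
BOT_IPS = {
    "147.32.84.180",  # Neris
    "147.32.84.170",  # Rbot
    "147.32.84.160",  # Virut
}

def is_malicious(src, dst):
    """True if the flow matches an IRC pair or touches a known bot IP."""
    return (src, dst) in IRC_PAIRS or src in BOT_IPS or dst in BOT_IPS

print(is_malicious("147.32.84.180", "8.8.8.8"))         # True (Neris)
print(is_malicious("192.168.2.112", "131.202.243.84"))  # True (IRC pair)
print(is_malicious("192.168.1.2", "8.8.8.8"))           # False
```

A complete labeler would include every address and pair from the list, and would typically be applied per flow record after exporting flows from the pcap.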

The full research paper outlining the details of the dataset and its underlying principles:

E. B. Beigi et al., “Towards effective feature selection in machine learning-based botnet detection approaches,” in IEEE Conference on Communications and Network Security (CNS), 2014.

References

[1] A. J. Aviv and A. Haeberlen, “Challenges in experimenting with botnet detection systems,” in USENIX 4th CSET Workshop, San Francisco, CA, 2011.

[2] M. Tavallaee, N. Stakhanova, and A. A. Ghorbani, “Toward credible evaluation of anomaly-based intrusion-detection methods,” Systems, Man, and Cybernetics, Part C: Applications and Reviews, IEEE Transactions on, vol. 40, no. 5, pp. 516–524, 2010.

[3] D. Zhao, I. Traore, B. Sayed, W. Lu, S. Saad, A. Ghorbani, and D. Garant, “Botnet detection based on traffic behavior analysis and flow intervals,” Computers & Security, 2013.

[4] “The Honeynet Project, French chapter.”

[5] G. Szabó, D. Orincsay, S. Malomsoky, and I. Szabó, “On the validation of traffic classification algorithms,” in Passive and Active Network Measurement. Springer, 2008, pp. 72–81.

[6] Lawrence Berkeley National Laboratory and ICSI, “LBNL/ICSI Enterprise Tracing Project, LBNL enterprise trace repository,” 2005.

[7] A. Shiravi, H. Shiravi, M. Tavallaee, and A. A. Ghorbani, “Toward developing a systematic approach to generate benchmark datasets for intrusion detection,” Computers & Security, vol. 31, no. 3, pp. 357–374, 2012.

[8] S. Garcia, “Malware Capture Facility Project,” retrieved July 03, 2013.

[9] M. Stevanovic and J. M. Pedersen, “Machine learning for identifying botnet network traffic,” Networking and Security Section, Department of Electronic Systems, Aalborg University, Tech. Rep., 2013.

[10] Bit-Twist, “Libpcap-based Ethernet packet generator,” retrieved July 10, 2013.

[11] A. Turner and M. Bing, “Tcpreplay: Pcap editing and replay tools for *NIX,” sourceforge.net, 2005.

[12] “Tcpdump and libpcap,” retrieved July 23, 2013.

For more information contact a.habibi.l@unb.ca.