CIC-YNU-IoTMal Dataset 2026 | Datasets | Research | Canadian Institute for Cybersecurity | UNB


Canadian Institute for Cybersecurity

CIC-YNU-IoTMal Dataset 2026

A Comprehensive Multilayer Dataset for Static and Dynamic Analysis of IoT Malware Behavior

This research proposes a comprehensive multi-architecture IoT malware dataset to advance malware analysis that integrates both static and dynamic techniques.

To accomplish this, 10,000 malware samples were collected from the IoTPOT dataset and executed in controlled virtual environments, capturing network traffic (PCAP), system call traces (STRACE), and system activity reports (SAR) across four architectures: ARM, MIPS, MIPSEL, and x86.

The malware families include Mirai, Bashlite (Gafgyt), DarkNexus, Rudedevil, Agent, Generic, and Tsunami. Benign samples were generated using large language models (LLMs).

Main contributions:

  • Comprehensive Dataset: A large-scale IoT malware corpus spanning multiple architectures (ARM, MIPS, MIPSEL, x86), enabling broad applicability across heterogeneous IoT environments.

  • Systematic Collection Methodology: Rigorous data acquisition using controlled sandbox environments, IoTPOT honeypot systems, and AI-enriched benign samples generated with large language models (LLMs), ensuring realism and diversity.

  • Scalable Behavioral Coverage: A dataset encompassing diverse malware behaviors—including network traffic profiles, system call traces, and system activity logs—facilitating cross-architectural and multi-dimensional domain adaptation for malware detectors.

  • Rich Feature Extraction: A hybrid analysis pipeline combining static features (opcodes, API calls, control flow graphs) with dynamic attributes (system calls, network traffic, behavioral patterns) to capture the intrinsic dynamics of malware operations.

  • Standardized Labeling Scheme: Unified malware family classification with detailed metadata across eight malware types, including rare families such as Tsunami and Rudedevil, supporting fine-grained comparative studies.

  • Comparative Benchmarking: In-depth analysis against existing IoT malware datasets, highlighting unique characteristics, strengths, and potential research applications of CIC-YNU-IoTMal.

  • Baseline ML Evaluation: Empirical validation through multiclass machine learning experiments, demonstrating the dataset’s utility for training and benchmarking IoT malware detection models.

Dataset directories

The main CIC-YNU-IoTMal dataset directory contains four subdirectories, one per architecture (ARM, MIPS, MIPSEL, x86), plus supplementary material containing the code for generating the malware samples. Each architecture subdirectory contains the following files:

ARM

  1. PCAP: contains the processed network traffic captured during malware execution, stored as a .parquet file
  2. SAR.parquet: contains the processed system activity reports collected using the SAR tool in Linux
  3. STRACE.parquet: contains the processed system call traces collected using the STRACE tool
  4. Readme.txt: contains descriptive statistics for each file

MIPS

  1. PCAP: contains the processed network traffic captured during malware execution, stored as a .parquet file
  2. SAR.parquet: contains the processed system activity reports collected using the SAR tool in Linux
  3. STRACE.parquet: contains the processed system call traces collected using the STRACE tool
  4. Readme.txt: contains descriptive statistics for each file

MIPSEL

  1. PCAP: contains the processed network traffic captured during malware execution, stored as a .parquet file
  2. SAR.parquet: contains the processed system activity reports collected using the SAR tool in Linux
  3. STRACE.parquet: contains the processed system call traces collected using the STRACE tool
  4. Readme.txt: contains descriptive statistics for each file

X86

  1. PCAP: contains the processed network traffic captured during malware execution, stored as a .parquet file
  2. SAR.parquet: contains the processed system activity reports collected using the SAR tool in Linux
  3. STRACE.parquet: contains the processed system call traces collected using the STRACE tool
  4. Readme.txt: contains descriptive statistics for each file
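
The directory layout described above could be addressed programmatically along the following lines. This is only a sketch: the file names SAR.parquet and STRACE.parquet come from this page, but PCAP.parquet is an assumed name (the page only says the PCAP data is stored as a .parquet file), and the helper names dataset_paths and load_behaviour are hypothetical. The folder names follow the section headings above.

```python
from pathlib import Path

ARCHITECTURES = ("ARM", "MIPS", "MIPSEL", "X86")

# SAR.parquet and STRACE.parquet are named on this page.
# PCAP.parquet is an ASSUMED file name; check the downloaded directory.
BEHAVIOUR_FILES = {
    "PCAP": "PCAP.parquet",
    "SAR": "SAR.parquet",
    "STRACE": "STRACE.parquet",
}


def dataset_paths(root):
    """Map (architecture, behaviour) to the expected parquet path under root."""
    root = Path(root)
    return {
        (arch, behaviour): root / arch / filename
        for arch in ARCHITECTURES
        for behaviour, filename in BEHAVIOUR_FILES.items()
    }


def load_behaviour(root, arch, behaviour):
    """Load one processed file as a DataFrame.

    Requires pandas plus a parquet engine such as pyarrow.
    """
    import pandas as pd  # imported lazily so path handling works without pandas

    return pd.read_parquet(dataset_paths(root)[(arch, behaviour)])
```

For example, load_behaviour("CIC-YNU-IoTMal", "MIPS", "SAR") would read the MIPS system activity file, provided the root folder is named that way on disk.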


Table 1: Summary of dataset distribution

Architecture  Behaviour  Total samples  Number of features
ARM           PCAP       737651         40
ARM           SAR        645518         461
ARM           STRACE     645518         461
MIPS          PCAP       870017         40
MIPS          SAR        430540         392
MIPS          STRACE     430540         392
MIPSEL        PCAP       1104016        40
MIPSEL        SAR        516679         392
MIPSEL        STRACE     516679         392
X86           PCAP       455641         40
X86           SAR        529212         409
X86           STRACE     529212         409

Note: The sample totals shown here may differ from those reported in the paper because the dataset contains samples labelled “Unknown”, which can be dropped before applying ML algorithms.
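
Dropping the “Unknown” rows mentioned in the note might look like the sketch below. The column name label is an assumption (this page does not name the label column), and the small frame is made-up data standing in for one of the parquet files.

```python
import pandas as pd


def drop_unknown(df, label_col="label"):
    """Return a copy of df without rows whose label is 'Unknown'.

    label_col is a hypothetical column name; adjust it to the real schema.
    """
    return df[df[label_col] != "Unknown"].reset_index(drop=True)


# Tiny made-up frame standing in for one of the parquet files
demo = pd.DataFrame(
    {
        "feature_a": [0.1, 0.5, 0.9],
        "label": ["Mirai", "Unknown", "Tsunami"],
    }
)
clean = drop_unknown(demo)
print(len(demo), "rows before,", len(clean), "rows after")
```

Filtering before any train/test split keeps the class counts comparable with those reported after the “Unknown” rows are removed.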


Figure 1: Multi-tier sandbox architecture for IoT malware dynamic analysis

 




Figure 2: Preprocessing Flowchart


Descriptive statistics

The following table summarizes the distribution of the SAR features in the MIPS data. The cv column is the coefficient of variation (std divided by mean); it is NaN where the mean is zero.

features mean std min max cv
interval 1.224941 2.605152 0.00 117.00  2.126758
cpu-load[0].usr 18.819596 23.855737 0.00 100.00 1.267601
cpu-load[0].nice 0.000000 0.000000 0.00 0.00 NaN
cpu-load[0].sys 20.180509 24.555765 0.00 99.11 1.216806
cpu-load[0].iowait 0.454946 2.494017 0.00 95.92 5.482001
cpu-load[0].steal 0.000000 0.000000 0.00 0.00 NaN
cpu-load[0].irq 0.000000 0.000000 0.00 0.00 NaN
cpu-load[0].soft 0.543592 3.595822 0.00 97.14 6.614926
cpu-load[0].guest 0.000000 0.000000 0.00 0.00 NaN
cpu-load[0].gnice 0.000000 0.000000 0.00 0.00 NaN
cpu-load[0].idle 60.001124 44.701928 0.00 100.00 0.745018
cpu-load[1].cpu 0.000000 0.000000 0.00 0.00 NaN
cpu-load[1].usr 18.819596 23.855737 0.00 100.00 1.267601
cpu-load[1].nice 0.000000 0.000000 0.00 0.00 NaN
cpu-load[1].sys 20.180509 24.555765 0.00 99.11 1.216806
cpu-load[1].iowait 0.454946 2.494017 0.00 95.92 5.482001
cpu-load[1].steal 0.000000 0.000000 0.00 0.00 NaN
cpu-load[1].irq 0.000000 0.000000 0.00 0.00 NaN
cpu-load[1].soft 0.543592 3.595822 0.00 97.14 6.614926
cpu-load[1].guest 0.000000 0.000000 0.00 0.00 NaN
cpu-load[1].gnice 0.000000 0.000000 0.00 0.00 NaN
cpu-load[1].idle 60.001124 44.701928 0.00 100.00 0.745018
process-and-context-switch.proc 9.989325 17.479854 0.00 169.00 1.749853
process-and-context-switch.cswch 566.493796 1133.868486 0.28 11562.00 2.001555
interrupts[0].all 81.133433 59.262541 0.60 1886.00 0.730433
interrupts[0].CPU0 81.133433 59.262541 0.60 1886.00 0.730433
interrupts[1].intr 0.000000 0.000000 0.00 0.00 NaN
interrupts[1].all 0.000000 0.000000 0.00 0.00 NaN
interrupts[1].CPU0 0.000000 0.000000 0.00 0.00 NaN
interrupts[2].intr 2.000000 0.000000 2.00 2.00 0.000000
interrupts[2].all 0.000000 0.000000 0.00 0.00 NaN
interrupts[2].CPU0 0.000000 0.000000 0.00 0.00 NaN
interrupts[3].intr 3.000000 0.000000 3.00 3.00 0.000000
interrupts[3].all 0.001813 0.126386 14.00 14.00 69.713754
interrupts[3].CPU0 0.001813 0.126386 14.00 14.00 69.713754
interrupts[4].intr 4.000000 0.000000 4.00 4.00 0.000000
interrupts[4].all 0.875768 9.614894 0.00 1768.00 10.978813
interrupts[4].CPU0 0.875768 9.614894 0.00 1768.00 10.978813
interrupts[5].intr 8.000000 0.000000 8.00 8.00 0.000000
interrupts[5].all 0.000000 0.000000 0.00 0.00 NaN
interrupts[5].CPU0 0.000000 0.000000 0.00 0.00 NaN
interrupts[6].intr 10.000000 0.000000 10.00 10.00 0.000000
interrupts[6].all 16.209603 29.648234 0.00 905.00 1.829054
interrupts[6].CPU0 16.209603 29.648234 0.00 905.00 1.829054
interrupts[7].intr 14.000000 0.000000 14.00 14.00 0.000000
interrupts[7].all 0.982883 3.487567 0.00 940.59 3.548303
interrupts[7].CPU0 0.982883 3.487567 0.00 940.59 3.548303
interrupts[8].intr 15.000000 0.000000 15.00 15.00 0.000000
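
A table with these columns could be reproduced roughly as follows. This is a sketch, not the authors' script: it assumes the sample standard deviation (the pandas default, ddof=1), and the helper name describe_with_cv and the demo columns are hypothetical.

```python
import pandas as pd


def describe_with_cv(df):
    """Per-column mean, std, min, max and coefficient of variation (std / mean)."""
    numeric = df.select_dtypes("number")
    stats = numeric.agg(["mean", "std", "min", "max"]).T
    # A zero mean with zero std gives 0/0, which pandas reports as NaN,
    # consistent with the NaN entries for all-zero features in the table.
    stats["cv"] = stats["std"] / stats["mean"]
    return stats


# Made-up values standing in for two SAR columns
demo = pd.DataFrame(
    {
        "cpu_idle": [0.0, 50.0, 100.0],
        "cpu_nice": [0.0, 0.0, 0.0],
    }
)
stats = describe_with_cv(demo)
print(stats)
```

As a check against the table, the interval row gives 2.605152 / 1.224941 ≈ 2.1268, which matches its cv value.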

Acknowledgments

The authors would like to thank the Canadian Institute for Cybersecurity (CIC) for its financial and educational support.

Citation

S. Dadkhah, O. D. Okey, S. A. Maret, Y. Lo, A. Firouzia, R. Kuki, T. Sasaki, K. Yoshioka, T. Ban, S. Ozawa, A. A. Ghorbani, “CIC-YNU-IoTMal: A Comprehensive Multilayer Dataset for Static and Dynamic Analysis of IoT Malware Behavior,” submitted to Expert Systems with Applications, 2026.


Common questions

The dataset includes a Network taxonomy group (eval_test_network) in which we intentionally injected high jitter (up to 500 ms) into valid PQC sessions.

This is to train models to distinguish between "Slow Network" (Normal) and "DoS/Resource Exhaustion" (Anomaly).


The labels in ml_features.csv are designed for a two-stage pipeline.

  1. L1/L2 (Rules): Catch deterministic failures (e.g., e9_conn_outcome = Failure or e1_alg_suite = Unknown).

  2. L3 (ML): Detect behavioural deviations in e6b_flow_duration and e5_entropy_c that pass the rules but represent adversarial activity.
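
The two-stage idea above can be sketched as follows. The rule conditions and field names are taken from the list; everything else is an assumption: the z-score check is only a simple stand-in for whatever ML model is actually used at L3, both L3 fields are assumed to be numeric, and the function names and output labels are hypothetical.

```python
from statistics import mean, pstdev


def rule_stage(session):
    """L1/L2: flag deterministic failures using the two rules quoted in the text."""
    return (
        session["e9_conn_outcome"] == "Failure"
        or session["e1_alg_suite"] == "Unknown"
    )


def zscore_stage(sessions, fields=("e6b_flow_duration", "e5_entropy_c"), threshold=3.0):
    """L3 stand-in: indices of sessions far from the batch mean on any field."""
    flagged = set()
    if not sessions:
        return flagged
    for field in fields:
        values = [s[field] for s in sessions]
        mu, sigma = mean(values), pstdev(values)
        if sigma == 0:
            continue  # constant field, nothing can deviate
        for i, value in enumerate(values):
            if abs(value - mu) / sigma > threshold:
                flagged.add(i)
    return flagged


def classify(sessions, threshold=3.0):
    """Run the rules first, then score only the sessions that passed them."""
    labels = ["rule_anomaly" if rule_stage(s) else "normal" for s in sessions]
    passed = [i for i, label in enumerate(labels) if label == "normal"]
    subset = [sessions[i] for i in passed]
    for j in zscore_stage(subset, threshold=threshold):
        labels[passed[j]] = "ml_anomaly"
    return labels
```

Running the rules first keeps deterministic failures out of the statistics used by the second stage, so obviously broken sessions do not skew the behavioural baseline.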

Because of size constraints and privacy considerations (even though the traffic is synthetic), we provide raw_sessions.jsonl, which contains the fully dissected TShark output and preserves all necessary metadata without the binary payload overhead.