CIC DGG Dataset 2025

CIC Dynamically Generated Graphs for Malware Analysis (CIC-DGG-2025)

Control flow graphs (CFGs) generated through a dynamic or emulated approach offer several benefits over statically generated CFGs: they tend to be smaller, better connected, and less noisy. These properties benefit both performance and explainability, especially in applications involving Graph Neural Networks (GNNs).

Extending our previous work (CIC-SGG-2024), we generate dynamic CFGs for the BODMAS, DikeDataset, and pe-machine-learning-dataset datasets using the angr Python library. We also provide graph embeddings and multiple explanations for use in machine learning tasks. Below is an example of the pipeline used to generate the graphs from our work.

This work shares the core motivation of CIC-SGG-2024 (graph classification on large graphs with many samples). In addition, dynamic generation is computationally intensive for certain samples, so releasing the generated graphs lowers the barrier to entry for analysis that might otherwise be too expensive to perform.

Dataset directories

We recognize two main audiences for this work: one, researchers in the field of malware detection and analysis, and two, researchers in the field of graph-based machine learning. The former may be interested in the Attribute Graphs whereas the latter may be interested in the Embedded and Explained Graphs, all of which are described below.

Attribute graphs (cfgs): This directory contains the output objects from the angr Python library. Each sample is saved as a pickle file containing the CFG of a given binary sample as well as other information output by angr.

Samples are grouped into sub-directories based on the dataset they were generated from. A CSV file (cfgs_map.csv) is included that contains the label (0 benign and 1 malicious), dataset (DikeDataset, BODMAS, or pe-machine-learning-dataset), hash (unique sha256 hash of the original binary file), number_nodes, number_edges, number_weakly_connected_components, and file_size (bytes).
Embedded graphs (ebds): This directory contains the embedded versions of the graphs in the Attribute Graphs directory. All CFGs in this work are embedded using the Assembly Embedding (AE) approach. This directory also contains a CSV file (ebds_map.csv) that maps the same attributes listed in the cfgs_map.csv file.
Explained graphs (exps): This directory contains node- and edge-based explanations of the ebd graphs, generated from the models we train in our work, that highlight the areas contributing most to a particular prediction. In this dataset we include additional explainers such as parameterized Explainer, Integrated Gradients, Guided Backpropagation, and Saliency.

These graphs have the extension ".exp" instead of ".pkl" to differentiate them from the Attribute Graphs with the same name. However, they are in fact pickle files. This directory also contains a CSV file (exps_map.csv) that maps the same set of attributes listed in ebds_map.csv and adds a predicted attribute (0 benign and 1 malicious). See the src/index.py file for an example of accessing these and other attributes.
Examples (src): This directory contains several example files as well as additional mapping information. We include a simple example in index.py that demonstrates how to load the various sample types, cfg, fcg, ebd and exp, as well as how to access some of their attributes.

We also include an example that trains a simple GCN. The example references a dataset.py file, also included, that contains a PyTorch Geometric Dataset class that can be used to work with large datasets like ours.

We also include an additional CSV file (original_to_map.csv) that maps the original raw source binary paths in their respective datasets to the hash values we use in our dataset. We found that installing angr version 9.2.89 is especially important for working with our samples.
Null and inaccessible samples (.null): During specific phases in the pipeline (i.e., generation and embedding), a given sample may create conditions where the generation or embedding operation runs out of memory and is subsequently killed by the operating system (OS).

When this occurs, data is not written to the file even though a handle was opened by the OS, so the file is created with 0 bytes. In a second case, the file contains data, but loading it raises the exception "EOFError: Ran out of input."; it is still unclear exactly why this occurs.

Additionally, loading may occasionally fail depending on whether a sample is loaded onto a cpu or a cuda device. We include these samples only for completeness. However, we verify that all samples in the main dataset load successfully using a cpu device, mainly to reduce problems for others during training.

Importantly, all samples with the same filename, excluding the file extension, originate from the same binary file. The filename is derived from the sha256 hash, depending on whether the original file required arming or not.
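The two failure cases described for null and inaccessible samples can be guarded against at load time. The following is a minimal sketch and not the code in src/index.py: the helper name load_sample is hypothetical, and catching pickle.UnpicklingError alongside EOFError is an added precaution for truncated files rather than something documented for this dataset. Unpickling real samples presumably also requires angr to be installed, since the saved objects originate from it.

```python
import pickle
from pathlib import Path


def load_sample(path):
    """Return the unpickled object, or None if the file is empty or truncated.

    Hypothetical helper for illustration; it works on any pickle file,
    including those saved with a ".exp" extension.
    """
    path = Path(path)

    # First case: the file was created but no data was written (0 bytes).
    if path.stat().st_size == 0:
        return None

    try:
        with path.open("rb") as handle:
            return pickle.load(handle)
    except (EOFError, pickle.UnpicklingError):
        # Second case: the file has data but the pickle stream ends early.
        # UnpicklingError is caught as an extra precaution (assumption).
        return None
```

Returning None instead of raising lets a caller skip unusable files in a loop and keep a count of how many were dropped.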

Using the dataset

Please refer to index.py in the src directory for an example of how to start working with the dataset. We refer users to angr, PyTorch Geometric, and NetworkX for further information on working with the underlying dataset objects and libraries.
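As a complement to index.py, the mapping CSV files can be used to select samples before loading anything. The sketch below makes two assumptions that are not confirmed by the description above: that each map file has a header row using the attribute names as column names (label, dataset, hash, and so on), and that a sample is stored at <root>/<dataset>/<hash><extension>. The function names select_samples and sample_path are hypothetical.

```python
import csv
from pathlib import Path


def select_samples(map_csv, label=None, dataset=None):
    """Yield rows of a *_map.csv file, optionally filtered by label and dataset.

    ASSUMPTION: the CSV has a header row with "label", "dataset", and "hash".
    """
    with open(map_csv, newline="") as handle:
        for row in csv.DictReader(handle):
            if label is not None and int(row["label"]) != label:
                continue
            if dataset is not None and row["dataset"] != dataset:
                continue
            yield row


def sample_path(root, row, extension=".pkl"):
    """Build the path of a sample from a CSV row.

    ASSUMPTION: files live at <root>/<dataset>/<hash><extension>;
    adjust this if the actual directory layout differs.
    """
    return Path(root) / row["dataset"] / (row["hash"] + extension)
```

For example, list(select_samples("cfgs_map.csv", label=1, dataset="BODMAS")) would collect the malicious BODMAS rows, and sample_path("exps", row, ".exp") would point at the corresponding explained graph under the assumed layout.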

Isomorphic samples

We are fully aware of the presence of isomorphic samples within the dataset. We knowingly include these samples, not only for completeness, but importantly because the sample graphs, while isomorphic, do not originate from "true" duplicates with respect to the original binary samples.

We leave it to the end user to decide how to handle such samples. We understand that removing isomorphic graphs may be of particular interest for graph-based machine learning whereas in malware analysis it may not be a concern.

One general approach to identifying samples that are likely isomorphic is to compare the number of nodes, edges, and weakly connected components recorded in the provided CSV files, and then test the matching candidates for isomorphism.
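This two-step approach can be sketched as follows. The grouping step assumes the map CSV has a header row with the column names listed for cfgs_map.csv; the function names are hypothetical. The confirmation step uses NetworkX's is_isomorphic, which compares structure only, and expects NetworkX graphs as input; how to obtain such a graph from a loaded sample is left to index.py and is not assumed here.

```python
import csv
from collections import defaultdict


def isomorphism_candidates(map_csv):
    """Group sample hashes that share node, edge, and component counts.

    Equal counts are necessary but not sufficient for isomorphism,
    so each returned group is only a set of candidates.
    """
    groups = defaultdict(list)
    with open(map_csv, newline="") as handle:
        for row in csv.DictReader(handle):
            key = (
                int(row["number_nodes"]),
                int(row["number_edges"]),
                int(row["number_weakly_connected_components"]),
            )
            groups[key].append(row["hash"])
    # Keep only keys shared by at least two samples.
    return {key: hashes for key, hashes in groups.items() if len(hashes) > 1}


def confirm_isomorphic(graph_a, graph_b):
    """Test two NetworkX graphs for structural isomorphism."""
    # Imported lazily so the grouping step needs only the standard library.
    import networkx as nx

    return nx.is_isomorphic(graph_a, graph_b)
```

Filtering on the three counts first keeps the number of pairwise tests small, which matters because exact isomorphism testing can be slow on large graphs.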

There are many speculative reasons for the presence of isomorphic samples. Presumably, malware authors may alter source code, or the binary itself, to perturb its signature and evade detection while also leaving the underlying CFG intact. Additionally, some samples may originate from the same malware family and thus have the same CFG.

Acknowledgments

L. Yang, A. Ciptadi, I. Laziuk, A. Ahmadzadeh, and G. Wang, “Bodmas: An open dataset for learning based temporal analysis of pe malware,” in 2021 IEEE Security and Privacy Workshops (SPW), pp. 78–84, IEEE, 2021.

G.-A. Iosif, “Dikedataset,” 2021. Accessed: 2024-02-27.

Practical Security Analytics LLC, “Pe malware machine learning dataset,” 2024. Accessed: 2024-08-06.

License

The CIC-DGG-2025 dataset is publicly available for researchers. If you are using our dataset, you must cite our related research paper that covers important details related to its usage and application.

Citation

More details and information on the dataset descriptions, graph generation, and graph learning models used for evaluation and comparison are available in the following paper. Researchers using this dataset are requested to cite the associated research publication.

H. Shokouhinejad, G. Higgins, R. Razavi-Far, H. Mohammadian, A. Ghorbani. "On the Consistency of GNN Explanations for Malware Detection," Information Sciences, Dec 2025.

Download this dataset

Curated by Griffin Higgins. Please direct questions to griffin.higgins@unb.ca. Only questions with the subject heading "CIC-DGG-2025" are guaranteed to receive a reply.
