Control flow graphs (CFGs) generated through a dynamic or emulated approach offer several benefits over statically generated CFGs: they tend to be smaller, better connected, and less noisy. These properties improve performance and explainability, particularly in applications involving Graph Neural Networks (GNNs).
Extending our previous work (CIC-SGG-2024), we generate dynamic CFGs for the BODMAS, DikeDataset, and pe-machine-learning-dataset datasets using the angr Python library. We also provide graph embeddings and multiple explanations for use in machine learning tasks. Below is an example of the pipeline used to generate the graphs from our work.

This work shares the core motivation of CIC-SGG-2024 (graph classification on large graphs with many samples) and adds a second one: dynamic generation is a computationally intensive process for certain samples. Releasing the generated graphs therefore lowers the barrier to entry for analysis of data that might otherwise be too expensive to produce.
We recognize two main audiences for this work: one, researchers in the field of malware detection and analysis, and two, researchers in the field of graph-based machine learning. The former may be interested in the Attribute Graphs whereas the latter may be interested in the Embedded and Explained Graphs, all of which are described below.
Importantly, all samples with the same filename, excluding the file extension, originate from the same binary file. The filename is derived from the sha256 hash, depending on whether the original file required arming.
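The filename convention above can be used to collect every artifact that belongs to one binary. The following is a minimal sketch, not part of the dataset tooling: the directory names, file extensions, and the group_by_stem helper are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from pathlib import PurePath


def group_by_stem(paths):
    """Map each filename (without its extension) to the files that share it."""
    groups = defaultdict(list)
    for path in paths:
        # PurePath.stem drops only the final extension, e.g. "0a1b.pt" -> "0a1b".
        groups[PurePath(path).stem].append(path)
    return dict(groups)


# Hypothetical file listing; real directory names and extensions may differ.
files = [
    "attribute/0a1b.graphml",
    "embedded/0a1b.pt",
    "attribute/9f3c.graphml",
]
grouped = group_by_stem(files)
```

Each key in grouped then stands for one original binary, and its value lists the graph files derived from it.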
For an example of how to start working with the dataset, please refer to index.py in the src directory. We refer users to angr, PyTorch Geometric, and NetworkX for further information on working with the underlying dataset objects and libraries.
We are fully aware of the presence of isomorphic samples within the dataset. We knowingly include these samples, not only for completeness, but importantly because the sample graphs, while isomorphic, do not originate from "true" duplicates with respect to the original binary samples.
We leave it to the end user to decide how to handle such samples. We understand that removing isomorphic graphs may be of particular interest for graph-based machine learning whereas in malware analysis it may not be a concern.
One general approach to identifying isomorphic samples is to first compare the number of nodes, edges, and components reported in the provided CSV files, which flags likely matches with high probability, and then to test those candidates for isomorphism.
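The two-step approach above can be sketched as follows. This is an illustrative sketch under stated assumptions: the column names (name, nodes, edges, components) are hypothetical and may not match the provided CSV files, and the helper functions are not part of the dataset code. The exact check relies on NetworkX's is_isomorphic and assumes the graphs have already been loaded as NetworkX graphs.

```python
import csv
import io
from collections import defaultdict


def candidate_groups(csv_file, name_col="name",
                     key_cols=("nodes", "edges", "components")):
    """Bucket samples by their count signature.

    Graphs with different node, edge, or component counts cannot be
    isomorphic, so only samples sharing a bucket need an exact test.
    Column names are assumptions; adjust them to the actual CSV header.
    """
    buckets = defaultdict(list)
    for row in csv.DictReader(csv_file):
        signature = tuple(int(row[col]) for col in key_cols)
        buckets[signature].append(row[name_col])
    # Keep only signatures shared by two or more samples.
    return {sig: names for sig, names in buckets.items() if len(names) > 1}


def confirm_isomorphic(graph_a, graph_b):
    """Exact isomorphism test for one candidate pair of NetworkX graphs."""
    import networkx as nx  # imported lazily so the pre-filter runs without it
    return nx.is_isomorphic(graph_a, graph_b)


# Tiny made-up table standing in for one of the provided CSV files.
example_csv = io.StringIO(
    "name,nodes,edges,components\n"
    "a,10,12,1\n"
    "b,10,12,1\n"
    "c,7,6,2\n"
)
groups = candidate_groups(example_csv)
```

Here only samples a and b share a signature, so only that pair would be passed to confirm_isomorphic. Matching counts are a necessary condition, not a sufficient one, which is why the exact test is kept as a separate second step.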
Several plausible, though speculative, reasons could explain the presence of isomorphic samples. Malware authors may alter the source code, or the binary itself, to perturb its signature and evade detection while leaving the underlying CFG intact. Additionally, some samples may originate from the same malware family and thus share the same CFG.
L. Yang, A. Ciptadi, I. Laziuk, A. Ahmadzadeh, and G. Wang, “BODMAS: An open dataset for learning based temporal analysis of PE malware,” in 2021 IEEE Security and Privacy Workshops (SPW), pp. 78–84, IEEE, 2021.
G.-A. Iosif, “DikeDataset,” 2021. Accessed: 2024-02-27.
Practical Security Analytics LLC, “PE malware machine learning dataset,” 2024. Accessed: 2024-08-06.
The CIC-DGG-2025 dataset is publicly available for researchers. Details on the dataset descriptions, graph generation, and graph learning models used for evaluation and comparison are available in the following paper.
If you use the dataset, you must cite this paper, which covers important details related to its usage and application.
H. Shokouhinejad, G. Higgins, R. Razavi-Far, H. Mohammadian, A. Ghorbani. "On the Consistency of GNN Explanations for Malware Detection," Information Sciences, Dec 2025.
Curated by Griffin Higgins. Please direct questions to griffin.higgins@unb.ca. Only questions with the subject line "CIC-DGG-2025" are guaranteed to receive a reply.