CIC-PQC_OAV v1 Dataset | Hybrid Post-Quantum TLS 1.3 Anomaly Detection

Advancing operational assurance for the Post-Quantum Cryptography (PQC) migration in TLS 1.3

The main goal of this research is to propose a realistic benchmark dataset to support the development of "Hybrid" (Rule-based + Machine Learning) anomaly detection solutions for the transition to Post-Quantum Cryptography. To accomplish this, a controlled testbed was orchestrated to generate 40,010 TLS 1.3 sessions, covering both standard classical exchanges (X25519, RSA) and PQC algorithms (ML-KEM/Kyber, ML-DSA/Dilithium).

Unlike traditional intrusion detection datasets, CIC-PQC_OAV v1 focuses specifically on the "migration gap": scenarios where PQC traffic might be mistaken for an anomaly, or where subtle attacks (downgrades, high-latency jitter, side-channel mimicry) hide within the larger PQC payloads. The dataset organizes its scenarios into a four-group taxonomy: Protocol Violations, Network Conditions, Adversarial Threats, and Data Integrity Fuzzing.

The main contributions are:

PQC-Specific Feature Set: It introduces the PQC-OAV Tuple, a set of 32 features extracted from encrypted metadata (without decryption) specifically designed to capture the structural "fingerprint" of post-quantum handshakes (e.g., Payload Entropy, Record Size Ratios).
Hybrid Architecture Support: The dataset is labeled to validate multi-stage detection systems, distinguishing between deterministic "Rule Violations" (L1/L2) and probabilistic "Behavioural Anomalies" (L3).
Operational Realism: It includes "Gray" scenarios such as high-jitter networks and valid-but-unusual PQC parameter sets to test model robustness against false positives in live deployments.

Dataset description

Files and directory structure

The dataset consists of three primary files that provide different levels of granularity for researchers:

raw_sessions.jsonl: The ground truth. Contains 40,010 JSON objects, each representing a full TShark-dissected TLS session with nested metadata.
ml_features_and_labels.csv: The "ML-Ready" matrix. Contains the 32 engineered numerical features, one-hot encoded context vectors, and ground-truth labels used for training the LSTM/Autoencoder models.
scenario_manifest.csv: The experiment metadata. Maps every session ID to its specific testbed scenario (e.g., net_high_jitter_classic_run1), allowing for stratified performance analysis.
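
A minimal sketch of how the three files might be read and joined, using only the Python standard library. The column names session_id and scenario are assumptions made for illustration; check the actual CSV headers before relying on them.

```python
import csv
import json


def iter_sessions(jsonl_path):
    """Yield one dissected session (a dict) per non-empty line of the JSONL file."""
    with open(jsonl_path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def load_csv_rows(csv_path):
    """Read a CSV file into a list of dicts keyed by the header row."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def attach_scenarios(feature_rows, manifest_rows,
                     id_col="session_id", scenario_col="scenario"):
    """Add a scenario name to each feature row by joining on the assumed ID column.

    Rows whose ID is missing from the manifest get None as their scenario.
    """
    scenario_by_id = {row[id_col]: row[scenario_col] for row in manifest_rows}
    return [
        {**row, scenario_col: scenario_by_id.get(row[id_col])}
        for row in feature_rows
    ]
```

Streaming raw_sessions.jsonl line by line avoids holding all 40,010 nested objects in memory at once, while the two CSV files are small enough to load whole.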

Features extracted

The dataset utilizes 32 features divided into four logical categories: Traffic Volume, Entropy, Timing, and Cryptographic Context.

Feature Name	Description
e2c_total_bytes	Total bytes transferred during the handshake (Client - Server).
e4_entropy_h	Shannon entropy of the Client Hello / Server Hello headers.
e5_entropy_c	Shannon entropy of the encrypted ciphertext payload (first 4KB).
e6b_flow_duration_ms	Total handshake duration in milliseconds (critical for PQC latency profiling).
e2_client_size	Size of the Client Key Share (indicates KEM group, e.g., X25519 vs ML-KEM-768).
e2_server_record_len	Length of the Server Hello record (proxy for Certificate size + KEM Ciphertext).
e1_alg_suite_*	One-hot vector indicating the negotiated Key Exchange Method (e.g., mlkem768, x25519).
e8_proto_context	TLS version context (TLS 1.2 vs TLS 1.3).
e9_conn_outcome	Connection state (Success, Failure, Incomplete).
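
The entropy features (e4_entropy_h, e5_entropy_c) can be illustrated with a short sketch. The dataset's own extraction code is not shown here, so this assumes byte-level Shannon entropy measured in bits per byte, and it assumes the "first 4KB" window means 4096 bytes.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes, limit: int = 4096) -> float:
    """Return the Shannon entropy, in bits per byte, of the first `limit` bytes."""
    window = data[:limit]
    if not window:
        return 0.0
    counts = Counter(window)
    total = len(window)
    # H = -sum(p * log2(p)) over the observed byte values.
    return -sum((n / total) * math.log2(n / total) for n in counts.values())
```

Under this definition the value ranges from 0 (a single repeated byte) to 8 (all 256 byte values equally likely), so well-encrypted payloads are expected to sit near the upper end.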

Dataset statistics

The dataset is class-imbalanced to reflect realistic operational conditions, with a prevalence of normal traffic over anomalies.

General statistics

Total sessions: 40,010
Normal sessions: ~35,000 (87.5%)
- Includes: Standard X25519, ML-KEM-768, ML-KEM-1024, ML-DSA-65.
Anomalous sessions: ~5,010 (12.5%)
- Includes: Downgrade attacks, Fuzzing (corrupt keys), High-Latency Jitter, Side-Channel Mimicry.
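
Because of this imbalance, training code will often reweight the classes. The sketch below applies a common inverse-frequency heuristic (the same formula scikit-learn documents for class_weight="balanced") to the approximate counts listed above. It is one possible choice, not a weighting prescribed by the dataset.

```python
def inverse_frequency_weights(class_counts):
    """Weight each class by total / (n_classes * count), so rarer classes weigh more."""
    total = sum(class_counts.values())
    n_classes = len(class_counts)
    return {
        label: total / (n_classes * count)
        for label, count in class_counts.items()
    }


# Approximate counts from the general statistics.
counts = {"normal": 35_000, "anomalous": 5_010}
weights = inverse_frequency_weights(counts)

# Fraction of sessions that are anomalous.
prevalence = counts["anomalous"] / sum(counts.values())
```

With these counts the anomalous class receives a weight of roughly 3.99 and the normal class roughly 0.57.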

Feature distribution (decimal representation)

Feature	Mean	Std	Min	50% (Median)	Max
e6b_flow_duration_ms	210.45	185.32	0.42	148.2	12,050.1
e2c_total_bytes	4,102.3	2,840.1	285.0	5,316.0	18,450.0
e5_entropy_c	5.82	0.45	0.00	5.91	7.99
e2_client_record_len	245.1	120.5	64.0	180.0	1,024.0

Using the dataset

Citation: If you use this dataset in your research, please cite the associated paper:

Michael O. Mills and Ali A. Ghorbani, "Beyond Rules: Behavioral Anomaly Detection for Post-Quantum Cryptography Operational Assurance in TLS," Journal of Information Security and Applications (under review), 2025.

Common questions

Why are there "Normal" sessions with high latency?

The dataset includes a Network taxonomy group (eval_test_network) where we intentionally injected high jitter (up to 500 ms) into valid PQC sessions.

This is to train models to distinguish between "Slow Network" (Normal) and "DoS/Resource Exhaustion" (Anomaly).
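
One way to check whether a model confuses slow networks with attacks is to compute the false positive rate per scenario. The sketch below assumes predictions have already been joined with the scenario manifest into (scenario, is_anomaly, predicted_anomaly) tuples. That tuple layout is hypothetical and not a format shipped with the dataset.

```python
from collections import defaultdict


def false_positive_rate_by_scenario(records):
    """Compute FP / (FP + TN) per scenario.

    `records` is an iterable of (scenario, is_anomaly, predicted_anomaly) tuples.
    Scenarios with no normal sessions are omitted, since the rate is undefined there.
    """
    false_pos = defaultdict(int)
    normals = defaultdict(int)
    for scenario, is_anomaly, predicted_anomaly in records:
        if not is_anomaly:
            normals[scenario] += 1
            if predicted_anomaly:
                false_pos[scenario] += 1
    return {scenario: false_pos[scenario] / n for scenario, n in normals.items()}
```

A noticeably higher rate on the high-jitter scenarios than on the other normal scenarios would suggest the model is keying on latency alone.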

What is the "Hybrid" aspect?

The labels in ml_features_and_labels.csv are designed for a two-stage pipeline.

L1/L2 (Rules): Catch deterministic failures (e.g., e9_conn_outcome = Failure or e1_alg_suite = Unknown).
L3 (ML): Detect behavioural deviations in e6b_flow_duration_ms and e5_entropy_c that pass the rules but represent adversarial activity.
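
The two stages can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: a simple z-score check stands in for the LSTM/Autoencoder models, the threshold of 3.0 is hypothetical, the reference means and standard deviations are copied from the feature distribution table, and e1_alg_suite is treated as a single string column even though the CSV stores it as a one-hot vector.

```python
# Illustrative (mean, std) pairs taken from the feature distribution table.
REFERENCE = {
    "e6b_flow_duration_ms": (210.45, 185.32),
    "e5_entropy_c": (5.82, 0.45),
}


def rule_stage(row):
    """L1/L2: return a reason string for deterministic failures, else None."""
    if row.get("e9_conn_outcome") == "Failure":
        return "connection_failure"
    if row.get("e1_alg_suite") == "Unknown":
        return "unknown_alg_suite"
    return None


def behaviour_stage(row, z_threshold=3.0):
    """L3 stand-in: flag rows whose largest absolute z-score exceeds the threshold."""
    z_scores = [
        abs((float(row[name]) - mean) / std)
        for name, (mean, std) in REFERENCE.items()
    ]
    return max(z_scores) > z_threshold


def classify(row):
    """Apply the rules first; only rows that pass them reach the behavioural check."""
    reason = rule_stage(row)
    if reason is not None:
        return "rule_violation:" + reason
    return "behavioural_anomaly" if behaviour_stage(row) else "normal"
```

Running the rules first keeps the deterministic failures out of the statistical stage, so the behavioural model only has to score sessions that are protocol-valid.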

Do you have raw PCAPs?

Due to size constraints and privacy considerations (even though the traffic is synthetic), we provide raw_sessions.jsonl instead. It contains the fully dissected TShark output and preserves all necessary metadata without the binary payload overhead.

Download the dataset
