ARDA ranked #1 on the PerturBench Norman19 benchmark. Frozen test split. Black-box evaluation. Artifact-complete reproducibility. Causal discovery guiding perturbation response prediction.
0.9109 Cosine LogFC (rank #1, same-split)
0.9684 Pearson DE (top-20 differentially expressed genes)
0.9281 on the hardest subgroup, combo_seen0 (zero prior)
Vareon Inc. · Vareon Limited · March 2026
Gene perturbation experiments are expensive, slow, and combinatorially explosive. A single CRISPR screen can cost tens of thousands of dollars and take weeks. Combination perturbations, in which two or more genes are perturbed simultaneously, multiply the cost: the number of pairs grows quadratically with the number of genes, and higher-order combinations grow faster still. For the roughly 20,000 human protein-coding genes, testing all pairwise combinations would require ~200 million experiments.
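The pairwise figure follows directly from the binomial coefficient. A quick check, assuming exactly 20,000 genes (the text's round number) and counting unordered combinations of distinct genes:

```python
import math

N_GENES = 20_000  # round figure for human protein-coding genes, taken from the text

# Unordered pairs of distinct genes: C(n, 2) = n * (n - 1) / 2
pairwise = math.comb(N_GENES, 2)

# Unordered triples, to show how quickly higher-order combinations grow
triples = math.comb(N_GENES, 3)

print(f"pairs:   {pairwise:,}")   # 199,990,000, i.e. ~200 million
print(f"triples: {triples:,}")
```

Moving from pairs to triples raises the count from about 200 million to over a trillion, which is why exhaustive screening is not a realistic option.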
The PerturBench Norman19 benchmark poses the central question directly: given a perturbation combination that has never been run in the lab, what cellular response should be expected? This is not pattern-matching — it is prediction under genuine novelty, evaluated on a frozen test split with no data leakage.
The established state of the art is GEARS (Roohani et al. 2023), a graph-enhanced gene activation and repression simulator that uses gene-gene graph structure (Gene Ontology and co-expression relationships) to predict perturbation outcomes. GEARS and similar supervised approaches train a model on observed perturbation-response pairs and rely on it to generalize to unseen combinations. Simple additive baselines already achieve strong results (linear baseline: 0.9022 Cosine LogFC). But neither GEARS nor additive models explicitly discover the causal interaction structure: the synergies and antagonisms between genes that drive the most scientifically important phenotypes.
ARDA is not a perturbation prediction model. It is the Universal Discovery Engine — a governed scientific discovery platform that discovers causal structure from data and uses that structure to make predictions. The PerturBench benchmark tests whether causal discovery improves prediction quality on genuinely unseen perturbation combinations.
CDE is ARDA's causal discovery mode: a Bayesian graph neural network (GNN) with SIREN dynamics that recovers directed causal graphs from observational time-series data. For PerturBench, CDE discovers the gene-gene interaction structure that drives perturbation responses: mechanism, not correlation.
The primary ARDA contract (arda-predict + arda-cde) uses CDE-discovered causal structure as evidence to guide predictions. The causal graph tells the prediction engine which gene interactions matter, producing more accurate predictions on truly unseen combinations.
ARDA's discovery modes cover different targets: symbolic regression for governing equations, neural ODEs for latent dynamics, Neuro-Symbolic for interpretable decomposition, and CDE for causal graphs. Each mode produces typed scientific claims, not token predictions. For PerturBench, the CDE mode is primary.
Every claim is typed and machine-readable. Negative controls (time shuffle, phase randomization, label permutation, noise robustness) check that discovered structure is genuine rather than an artifact of the data. Results come with full provenance, deterministic replay, and an evidence ledger.
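To illustrate the idea behind one of these negative controls, label permutation, here is a minimal generic sketch. It is not ARDA's implementation: the function name `permutation_null` is hypothetical, and absolute Pearson correlation stands in for whatever structure score the platform actually uses. The logic is that a genuine association should score far above the scores obtained once the labels are shuffled.

```python
import numpy as np

def permutation_null(x, y, n_perm=1000, seed=0):
    """Compare an observed association against a label-permuted null.

    Returns the observed statistic and an empirical p-value.
    The statistic (|Pearson r|) is an illustrative stand-in.
    """
    rng = np.random.default_rng(seed)
    observed = abs(np.corrcoef(x, y)[0, 1])
    # Shuffling y breaks any real x-y link while keeping both marginals intact
    null = np.array([
        abs(np.corrcoef(x, rng.permutation(y))[0, 1])
        for _ in range(n_perm)
    ])
    # Add-one correction keeps the p-value strictly positive
    p_value = (1 + np.sum(null >= observed)) / (1 + n_perm)
    return observed, p_value

# Synthetic example: y depends on x, so the control should reject the null
rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 0.8 * x + rng.normal(scale=0.5, size=200)
stat, p = permutation_null(x, y)
print(f"|r| = {stat:.3f}, p = {p:.4f}")
```

Time shuffle and phase randomization follow the same pattern with a different way of destroying structure: the first permutes time points, the second randomizes Fourier phases while preserving the power spectrum.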
The key insight
Prediction models ask “what will happen?” ARDA asks “why does it happen?” first, then predicts. By discovering the causal structure underlying perturbation responses, ARDA's predictions generalize better to unseen combinations because they are grounded in mechanism, not memorized patterns.
On the frozen PerturBench Norman19 test split, ARDA with causal evidence achieves the highest Cosine LogFC score across all models, including baselines and ablations.
| # | Model | Cosine LogFC | RMSE | Pearson DE |
|---|---|---|---|---|
| 1 | arda_predict_plus_cde | 0.9109 | 0.0425 | 0.9684 |
| 2 | linear_baseline | 0.9022 | 0.0491 | 0.9571 |
| 3 | arda_predict_plus_cde_gnn | 0.9003 | 0.0450 | 0.9637 |
| 4 | arda_predict_only | 0.8968 | 0.0478 | 0.9650 |
| 5 | nearest_neighbor_baseline | 0.8236 | 0.0595 | 0.8590 |
| 6 | GEARS (Roohani et al. 2023) (same-split, 3 seeds) | 0.7158 | 0.0725 | 0.7929 |
| 7 | control_baseline | 0.0000 | 0.0996 | 0.0000 |
GEARS (Roohani et al. 2023) was evaluated on the same frozen split with GO + co-expression features, averaged over 3 random seeds.
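For readers unfamiliar with the leaderboard columns, the sketch below shows one plausible way to compute Cosine LogFC and Pearson DE. It rests on assumptions: the function names are hypothetical, inputs are taken to be per-gene mean log-normalized expression, the top-20 DE genes are picked by largest observed |logFC|, and a zero vector is scored as 0.0. The benchmark's own definitions may differ in these details.

```python
import numpy as np

def log_fold_change(perturbed_mean, control_mean):
    """Per-gene change relative to control (inputs assumed log-normalized)."""
    return np.asarray(perturbed_mean, dtype=float) - np.asarray(control_mean, dtype=float)

def cosine_logfc(pred_lfc, true_lfc):
    """Cosine similarity between predicted and observed logFC vectors.

    Returns 0.0 if either vector is all zeros (assumed convention).
    """
    pred = np.asarray(pred_lfc, dtype=float)
    true = np.asarray(true_lfc, dtype=float)
    denom = np.linalg.norm(pred) * np.linalg.norm(true)
    return float(pred @ true / denom) if denom > 0 else 0.0

def pearson_top_de(pred_lfc, true_lfc, k=20):
    """Pearson correlation over the k genes with the largest observed |logFC|."""
    pred = np.asarray(pred_lfc, dtype=float)
    true = np.asarray(true_lfc, dtype=float)
    top = np.argsort(-np.abs(true))[:k]  # assumed DE selection rule
    if np.std(pred[top]) == 0 or np.std(true[top]) == 0:
        return 0.0  # correlation is undefined for a constant vector
    return float(np.corrcoef(pred[top], true[top])[0, 1])

# Toy example with 5 genes
control = np.zeros(5)
observed = np.array([1.0, -0.5, 0.0, 2.0, 0.3])
print(cosine_logfc(observed, observed))  # ~1.0 for a perfect prediction
print(cosine_logfc(log_fold_change(control, control), observed))  # 0.0: "no change" prediction
```

Under the zero-vector convention, a model that always predicts the control state scores 0.0 on both metrics, which is consistent with the control_baseline row in the table.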
The benchmark splits test combinations by how many constituent genes appeared in training. combo_seen0, where neither gene was seen, is the hardest case: the combinations are genuinely novel. ARDA with CDE scores 0.9281 on this subgroup, although with only 7 conditions the estimate is noisier than the overall score.
| Subgroup | Genes seen in training | Conditions | Score |
|---|---|---|---|
| combo_seen0 | Neither | n=7 | 0.9281 |
| combo_seen1 | One | n=20 | 0.8753 |
| combo_seen2 | Both | n=19 | 0.9163 |
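The subgroup assignment described above can be sketched in a few lines. The helper name, the `"A+B"` label format, and the gene names are hypothetical, and what exactly counts as "seen" (for instance, seen as a single perturbation only) is an assumption that the benchmark's split definition would settle.

```python
def combo_seen_subgroup(combo, train_genes, sep="+"):
    """Label a two-gene combination by how many of its genes appear in training.

    combo:       condition label such as "GENE_A+GENE_B" (assumed format)
    train_genes: set of genes perturbed in at least one training condition
    """
    genes = combo.split(sep)
    n_seen = sum(gene in train_genes for gene in genes)
    return f"combo_seen{n_seen}"

train_genes = {"GENE_A", "GENE_B"}  # placeholder gene names
for combo in ["GENE_C+GENE_D", "GENE_A+GENE_C", "GENE_A+GENE_B"]:
    print(combo, "->", combo_seen_subgroup(combo, train_genes))
```

A combination lands in combo_seen0 only when none of its genes overlap with the training set, which is what makes that subgroup a test of generalization rather than interpolation.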
Removing CDE causal evidence (arda_predict_only) drops Cosine LogFC from 0.9109 to 0.8968, a difference of 0.0141 that moves the model from first place to below the linear baseline (0.9022). The causal graph gives the prediction engine structural information about gene-gene interactions that, in this ablation, it did not recover from the training data alone.
| Configuration | Cosine LogFC | Note |
|---|---|---|
| With CDE | 0.9109 | Rank #1 |
| Without CDE | 0.8968 | -0.0141 |
| Linear baseline | 0.9022 | Reference |
The linear baseline, the strongest of the baselines, achieves 0.9022 through simple additive effects. Beating it requires understanding interaction structure: which genes amplify or suppress each other's effects. CDE discovers this structure, and the prediction engine uses it.
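To make "simple additive effects" concrete, the sketch below predicts a combination's log fold-change as the sum of its single-gene effects and reports what is left over as the interaction term. This is an illustration of the additive idea only, not the benchmark's actual linear baseline; the function names, gene labels, and numbers are all hypothetical.

```python
import numpy as np

def additive_prediction(combo, single_lfc, sep="+"):
    """Predict a combo's logFC as the sum of its single-gene logFC vectors."""
    genes = combo.split(sep)
    return np.sum([single_lfc[gene] for gene in genes], axis=0)

def interaction_residual(observed_combo_lfc, combo, single_lfc):
    """Observed minus additive prediction: nonzero entries mark non-additive genes."""
    return np.asarray(observed_combo_lfc, dtype=float) - additive_prediction(combo, single_lfc)

# Toy data: logFC over 3 readout genes for two single perturbations
single_lfc = {
    "GENE_A": np.array([1.0, 0.0, -0.5]),
    "GENE_B": np.array([0.5, -1.0, 0.0]),
}
observed = np.array([2.5, -1.0, -0.5])  # made-up combo measurement

print(additive_prediction("GENE_A+GENE_B", single_lfc))             # sum of single effects
print(interaction_residual(observed, "GENE_A+GENE_B", single_lfc))  # departure from additivity
```

An additive model sets the residual to zero by construction, so any gain over it has to come from predicting that residual. Note also that for combo_seen0 conditions the single-gene effects are not observed in training, so an additive model would need estimated single effects there.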
CDE was not built for gene perturbation. It is ARDA's general-purpose causal inference mode — the same engine that achieves 0.959 path fidelity on double-pendulum mechanics, 0.817 on gene regulatory networks, and 0.789 on clinical pharmacokinetics. The PerturBench result demonstrates that causal discovery improves downstream prediction in biology, just as it does in physics.
ARDA was evaluated through its production API surfaces — the same REST API, SDK, and MCP tools that every customer uses. No special research mode. No hand-tuning. The benchmark ran against the same deployment that serves commercial customers.
The complete peer-reviewed manuscript covers the benchmark methodology, unified leaderboard, subgroup analysis, and reproducibility protocol.