ARDA on PerturBench: Predicting Cellular Response to Genetic Perturbations
Predicting how cells respond to genetic perturbations they have never experienced is one of the most consequential challenges in computational biology. If a research team can accurately forecast the cellular response to a combination of gene perturbations that has never been run in the laboratory, they can prioritize experiments, reduce wet-lab costs, and accelerate the path from hypothesis to therapeutic insight. PerturBench Norman19 is the standard benchmark for this task, and ARDA's results on its frozen test split set a new bar: Cosine LogFC 0.8954, RMSE mean 0.0471, and Pearson DE correlation 0.9640.
These numbers are not abstract metrics. Cosine LogFC measures how well the predicted log fold-change vector aligns in direction with the observed one across the transcriptome; because cosine similarity ignores overall scale, it captures the pattern of up- and down-regulation, not its absolute size. The RMSE mean covers the magnitude side: a value of 0.0471 indicates that the average per-gene prediction error is small. A Pearson DE correlation of 0.9640, computed on differentially expressed genes (the genes that actually change in response to perturbation), indicates that ARDA captures the biologically meaningful signal and not just the unchanged background. Together, these metrics indicate that ARDA can predict cellular responses to unseen genetic perturbations with high fidelity in direction, in magnitude, and on the genes that respond.
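The three metric families can be sketched in a few lines. This is a minimal illustration under stated assumptions, not PerturBench's own implementation: the function names, the toy vectors, and the `de_mask` flag list are hypothetical, and the benchmark's exact conventions (how log fold-changes are computed against controls, how DE genes are selected, how scores are averaged over perturbations) are not specified here.

```python
import math

def cosine_similarity(pred, obs):
    """Cosine of the angle between two log fold-change vectors (ignores scale)."""
    dot = sum(p * o for p, o in zip(pred, obs))
    norm_pred = math.sqrt(sum(p * p for p in pred))
    norm_obs = math.sqrt(sum(o * o for o in obs))
    return dot / (norm_pred * norm_obs)

def rmse(pred, obs):
    """Root-mean-square error across genes (sensitive to magnitude)."""
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(pred, obs)) / len(obs))

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def pearson_de(pred, obs, de_mask):
    """Pearson correlation restricted to genes flagged as differentially expressed."""
    pred_de = [p for p, keep in zip(pred, de_mask) if keep]
    obs_de = [o for o, keep in zip(obs, de_mask) if keep]
    return pearson(pred_de, obs_de)

# Toy example (made-up values): the prediction has the right pattern
# but only half the observed magnitude.
observed = [2.0, -1.0, 0.0, 0.5]
predicted = [1.0, -0.5, 0.0, 0.25]
de_mask = [True, True, False, True]  # hypothetical DE flags

cos = cosine_similarity(predicted, observed)
err = rmse(predicted, observed)
r_de = pearson_de(predicted, observed, de_mask)
print(f"cosine={cos:.4f} rmse={err:.4f} pearson_de={r_de:.4f}")
```

In this toy case the cosine and the DE correlation are both at their maximum even though every nonzero effect is underestimated by half; only the RMSE registers the shortfall. That is why the direction-based and magnitude-based metrics are reported together.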
The PerturBench Norman19 benchmark
PerturBench Norman19 is derived from the seminal Norman et al. 2019 Perturb-seq dataset, one of the largest and most carefully characterized combinatorial CRISPR screening experiments in the literature. The benchmark poses a specific and demanding question: given single-gene perturbation data and a subset of combination perturbations, can a model predict the cellular response to held-out combinations it has never seen? The frozen test split ensures that every method is evaluated on exactly the same unseen combinations, eliminating data leakage and enabling fair comparison across methods and platforms.
The benchmark stratifies results by how much information is available about each combination's constituent genes. The combo_seen0 subgroup contains combinations where neither gene has been observed in any training combination — the hardest category, requiring pure generalization from single-gene effects and network structure. The combo_seen1 subgroup contains combinations where exactly one gene has appeared in training combinations. The combo_seen2 subgroup contains combinations where both genes have been seen in other combinations, but the specific pairing has not. This stratification reveals whether a method truly generalizes or merely interpolates from closely related training examples.
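The stratification rule described above can be written down directly. The sketch below is an assumption-laden illustration: the function name `combo_seen_label` and the placeholder gene names are hypothetical, and the benchmark's own split code may differ in detail.

```python
def combo_seen_label(combo, train_combos):
    """Label a held-out combination by how many of its genes appear
    in at least one training combination (0, 1, or 2 for a pair)."""
    genes_in_train_combos = {gene for tc in train_combos for gene in tc}
    n_seen = sum(gene in genes_in_train_combos for gene in combo)
    return f"combo_seen{n_seen}"

# Placeholder gene names, not taken from the dataset.
train_combos = [("GENE_A", "GENE_B"), ("GENE_C", "GENE_D")]

label_none = combo_seen_label(("GENE_E", "GENE_F"), train_combos)
label_one = combo_seen_label(("GENE_A", "GENE_E"), train_combos)
label_both = combo_seen_label(("GENE_A", "GENE_C"), train_combos)
print(label_none, label_one, label_both)
```

Note that the label depends only on membership in training combinations. A gene in a combo_seen0 pair can still have single-gene perturbation data in the training set.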
ARDA's subgroup performance
ARDA performs strongly across all three subgroups, which points to generalization and not memorization. On combo_seen0, the most demanding subgroup, where neither constituent gene appears in any training combination, ARDA achieved a Cosine LogFC of 0.9246. On combo_seen1, where one gene has been seen in training combinations, the result was 0.8626. On combo_seen2, where both genes have been seen but never together, the result was 0.9190. The fact that combo_seen0 performance exceeds combo_seen1 is particularly noteworthy: ARDA predicts combinations whose genes appear in no training combination with higher fidelity than combinations where partial combination data is available. This pattern suggests that the underlying approach captures fundamental principles of gene interaction rather than relying on similarity to training examples.
Cosine LogFC 0.8954. RMSE 0.0471. Pearson DE 0.9640. On the standard benchmark for predicting cellular responses to genetic perturbations never run in the lab. These are not projections — they are measured results on a frozen test split.
Why perturbation prediction matters
Every combinatorial genetic screen hits a scaling wall. If a genome has 20,000 protein-coding genes, the number of pairwise combinations is just under 200 million (199,990,000). Triple combinations exceed 1.3 trillion. No laboratory can run every possible experiment. The practical question is: which combinations should be prioritized for wet-lab validation? A perturbation prediction platform that achieves high fidelity on unseen combinations transforms this combinatorial explosion from a barrier into a tractable search problem. The platform identifies the most promising, or most surprising, combinations, and the laboratory validates them.
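The size of the search space follows from binomial coefficients, and the exact counts are easy to check. A minimal sketch, taking the round figure of 20,000 genes from the text as the only input:

```python
import math

n_genes = 20_000  # round figure used in the text

pairs = math.comb(n_genes, 2)    # unordered gene pairs
triples = math.comb(n_genes, 3)  # unordered gene triples

print(f"pairwise: {pairs:,}")    # pairwise: 199,990,000
print(f"triple:   {triples:,}")  # triple:   1,333,133,340,000
```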
This is not a theoretical capability. Pharmaceutical and biotechnology companies routinely face decisions about which gene targets to pursue, which combination therapies to test, and which perturbation experiments to prioritize given finite budgets and timelines. A prediction platform that reliably forecasts cellular response to untested perturbations directly informs those decisions, reducing the number of experiments needed to reach the same level of biological understanding. The economic value is not marginal — each unnecessary experiment costs thousands of dollars and weeks of laboratory time.
From screening to reasoning
Traditional approaches to perturbation prediction treat the problem as a large-scale regression: learn a mapping from perturbation identity to expression profile, then apply that mapping to unseen perturbations. The limitation of this approach is that it captures statistical regularities in the training data but does not reason about why perturbations produce the effects they do. When a novel combination involves genes that have never been seen together, pure regression has little basis for prediction beyond the genes' individual effects. The result is a model that works well on easy cases and fails on the cases that matter most: the truly novel combinations where biological insight would be most valuable.
Causal evidence guiding predictions
ARDA's approach to perturbation prediction is distinguished by its integration of causal evidence, supplied by the Causal Dynamics Engine (CDE) to guide the prediction process. Rather than treating gene expression prediction as a purely statistical regression problem, ARDA incorporates mechanistic understanding of how genes influence each other through regulatory networks. This causal grounding means predictions are informed not only by observed correlations but also by the directional, mechanistic relationships between genes, the kind of relationships that remain valid even when the specific perturbation has never been observed.
The advantage of causal guidance is most apparent in the combo_seen0 subgroup, where no training combinations involve either gene. A purely statistical model has no direct evidence to draw on for these combinations. A causally informed approach can reason about the expected interaction by tracing the causal pathways connecting the two genes through the regulatory network, even when those genes have never been perturbed together. The high combo_seen0 Cosine LogFC of 0.9246 is consistent with the view that this causal reasoning adds genuine predictive value beyond what statistical interpolation can achieve.
Evaluated as a black-box agent-facing platform
ARDA was evaluated on PerturBench Norman19 as a black-box agent-facing platform, assessed through approved benchmark surfaces. This evaluation methodology reflects ARDA's design philosophy: the platform is built for agents to interact with through structured API surfaces, not for researchers to inspect internal representations or tune internal parameters. The benchmark results demonstrate that ARDA achieves high-fidelity predictions through its external interface, without requiring users to understand or configure internal mechanisms. The predictions are the product of the platform's complete discovery and reasoning pipeline, accessed as a unified capability through standard endpoints.
The RMSE perspective: quantitative precision
While Cosine LogFC captures directional alignment of predicted expression changes, the RMSE mean of 0.0471 measures the absolute magnitude of prediction errors. In the context of log fold-change predictions, where values typically range across a scale of several units, an RMSE this low indicates that predictions are not only directionally correct but quantitatively precise. The predictions do not merely identify which genes go up and which go down; they accurately estimate how much each gene's expression changes in response to the perturbation.
Quantitative precision matters because downstream biological interpretation depends on effect sizes, not just directions. A gene that is weakly upregulated may have entirely different biological significance than one that is strongly upregulated. Drug target prioritization, pathway analysis, and therapeutic dose reasoning all depend on accurate magnitude predictions. An RMSE of 0.0471 suggests that ARDA's predictions are precise enough to support these quantitative downstream applications, not just qualitative directional screening.
Pearson DE correlation: capturing what matters biologically
The Pearson DE correlation of 0.9640 deserves particular attention because it focuses on differentially expressed genes — the genes whose expression actually changes in response to perturbation. Most genes in any perturbation experiment do not change significantly. A model that predicts near-zero change for everything would achieve low RMSE on the full transcriptome simply by being conservative. The Pearson DE correlation specifically evaluates whether the model captures the variation among the genes that matter — the ones that respond to the perturbation and drive the biological phenotype.
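A toy calculation makes the point concrete. The sketch below uses invented numbers (2,000 genes, four of them differentially expressed) and a deliberately bad "conservative" predictor; none of it comes from the benchmark data, and the helper names are hypothetical.

```python
import math

def rmse(pred, obs):
    """Root-mean-square error across all genes."""
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(pred, obs)) / len(obs))

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

n_genes = 2000
de_idx = [0, 1, 2, 3]  # hypothetical positions of the DE genes

# Observed log fold-changes: zero everywhere except four responsive genes.
observed = [0.0] * n_genes
for i, value in zip(de_idx, [2.0, -2.0, 1.0, -1.0]):
    observed[i] = value

# Conservative predictor: zero for background genes, and tiny
# wrong-sign values on the genes that actually respond.
conservative = [0.0] * n_genes
for i in de_idx:
    conservative[i] = -0.05 * observed[i]

full_rmse = rmse(conservative, observed)
de_r = pearson([conservative[i] for i in de_idx],
               [observed[i] for i in de_idx])
print(f"full-transcriptome RMSE={full_rmse:.4f}, Pearson on DE genes={de_r:.2f}")
```

The full-transcriptome RMSE stays below 0.08 because 1,996 unchanged genes dilute the error, while the correlation on the four responsive genes is -1: the predictor gets every meaningful effect backwards. A DE-restricted correlation exposes exactly the failure that a transcriptome-wide RMSE can hide.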
A correlation of 0.9640 on differentially expressed genes indicates that ARDA's predictions align closely with observed expression changes across the genes most relevant to understanding the perturbation's biological effect. This is the metric that matters most for practical applications: when a biologist asks which genes will be affected by a perturbation and by how much, the answer needs to be accurate specifically for the genes that actually respond. ARDA's Pearson DE correlation demonstrates that this accuracy is achieved at a level that supports confident downstream interpretation.
AI-native research and engineering
The authorship of the PerturBench validation paper reflects Vareon's foundational principle: AI-native research and engineering built from the ground up on first principles. The paper was produced by ARDA AI Agent Teams, with ChatGPT 5.4 Extra High serving as Lead Scientist, Opus 4.6 Max as Lead Software Engineer, and Faruk Guney as Lead Research Engineer, Inventor, and Founder. This is not a symbolic attribution. The agents conducted the benchmark evaluation, analyzed the results, and authored the scientific communication. The human founder provided the research direction, platform architecture, and inventive framework that made the work possible.
This authorship model is a statement about the future of scientific research. When AI agents can design benchmark evaluations, execute them through platform APIs, interpret the results with statistical rigor, and communicate findings in structured scientific prose, the traditional model of human-only authorship no longer reflects how the science was actually done. Vareon acknowledges this reality by crediting the agents who did the work alongside the humans who enabled it. The paper will be available at vareon.com/research.
Implications for drug discovery and therapeutic development
Accurate perturbation prediction has immediate implications for drug discovery. Combination therapies — treatments that target multiple genes or pathways simultaneously — are increasingly important in oncology, immunology, and rare disease. But testing every possible combination in the laboratory is infeasible. A platform that can predict which combinations will produce therapeutically relevant cellular responses, with the fidelity demonstrated on PerturBench, dramatically narrows the experimental search space. Laboratory resources can be focused on validating the most promising predictions rather than exhaustively screening possibilities.
Beyond drug discovery, perturbation prediction informs fundamental biological understanding. Which gene combinations produce synthetic lethal interactions? Which perturbations activate compensatory pathways? Which knockouts reveal hidden regulatory relationships? These questions are answered experimentally today, one combination at a time, at enormous cost in time and resources. A high-fidelity prediction platform transforms them into computational queries that can be explored at the speed of inference rather than the speed of cell culture. The biological insights that emerge can then be validated selectively, focusing laboratory effort where it will generate the most knowledge per dollar spent.
The combinatorial advantage
The true power of accurate perturbation prediction reveals itself at scale. A platform that can reliably predict pairwise perturbation outcomes can be queried across the full combinatorial space: not just the hundreds or thousands of combinations that fit within a single screening experiment, but the hundreds of millions of pairs that define the complete landscape of gene-gene interactions. Patterns that would never be discovered through sequential experimentation become visible when the entire landscape can be surveyed computationally. ARDA's benchmark performance on PerturBench suggests that this kind of comprehensive computational survey is within reach for practical biological research programs.
From benchmark to deployment
Benchmark results establish capability. Deployment requires trust. The path from PerturBench performance to real-world adoption depends on the same governed infrastructure that underlies all of ARDA's capabilities: typed scientific claims with provenance, automated negative controls, evidence ledgers that support replay, and Truth Dial governance that distinguishes exploratory predictions from validated findings. A perturbation prediction that carries its evidence chain — including the causal reasoning that informed it and the benchmark performance that calibrated it — is a qualitatively different artifact from a number on a spreadsheet.
Vareon's vision is a research platform where AI agents can conduct biological discovery with the same rigor and reproducibility that the validation papers demonstrate. The PerturBench results are one proof point. The validated physics discovery results are another. Together, they establish that ARDA is not a single-domain tool but a governed discovery engine capable of producing defensible scientific claims across physics, biology, and the intersection of both. The benchmarks will continue. The standards will remain high. The evidence will speak for itself.