Vareon Technical Report · March 2026

Transparent Benchmarking of a Hosted ARDA Contract
on PerturBench Norman19

Faruk Guney, Inventor of CDE, GFF, and DM-GFF

OpenAI ChatGPT 5.4 (reasoning=xhigh), Lead Scientist

Anthropic Opus 4.6 (reasoning=max), Lead Software Engineer

Vareon Inc., Irvine, California, USA · March 2026

Abstract

This report presents a benchmark-facing, artifact-complete evaluation of ARDA on the frozen PerturBench Norman19 contract. ARDA is treated as a black-box, agent-facing platform and exercised only through approved benchmark surfaces. The primary ARDA contract is arda-predict + arda-cde. On the frozen test split it scores Cosine LogFC 0.9003 and RMSE mean 0.0450, values that match the arda_predict_plus_cde_gnn row of the unified leaderboard. The strongest local baseline scores Cosine LogFC 0.9022, a delta of −0.0019. On the unified same-split leaderboard, arda_predict_plus_cde ranks #1 with Cosine LogFC 0.9109. All claims are limited to saved benchmark artifacts.

1. Problem Statement

PerturBench Norman19 measures held-out perturbation-response prediction for unseen combination interventions. Given a perturbation not yet run in the lab, what response should be expected? How accurate are predictions on a frozen evaluation contract? How does ARDA compare with same-split baselines under the same scoring rules?

2. Benchmark Setup

2.1 Data and Frozen Split

The evaluation uses the Norman et al. (2019) Perturb-seq dataset, processed and distributed as norman19_processed.h5ad.gz.

Dataset SHA-256

be9e1155bbe10c69674449090820f7c5a3716176bae11f5f2258d58c8877f540

Frozen split hash

1290660b1f3d66334df978291309968747de100261e5b9f4543b2304c8be7cc0

The dataset is partitioned into 39 training, 46 validation, and 46 test held-out combination pairs. The split is deterministic and frozen, and it is identical across all models evaluated in this report.
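A minimal sketch of how the dataset digest above could be checked locally, assuming the file has already been downloaded. The helper names `sha256_of_file` and `matches_expected` are illustrative and not part of the benchmark code. The report does not say how the frozen split hash is derived, so only the dataset file is covered here.

```python
import hashlib
from pathlib import Path

# Digest quoted in this report for norman19_processed.h5ad.gz.
EXPECTED_DATASET_SHA256 = (
    "be9e1155bbe10c69674449090820f7c5a3716176bae11f5f2258d58c8877f540"
)


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_DATASET_SHA256):
    """True if the file digest equals the expected hex string."""
    return sha256_of_file(path) == expected.lower()
```

Calling `matches_expected("norman19_processed.h5ad.gz")` on a local copy would return `True` only if the bytes are identical to the file hashed for this report.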

2.2 ARDA System Description

ARDA was treated as a black-box, agent-facing scientific platform and exercised only through the approved benchmark surfaces: arda-predict, arda-cde, ARDAOfficialAdapter, and campaign scripts. Broader surfaces (REST, SDK, CLI, MCP) exist, but all claims in this report are limited to benchmark-facing evidence.

2.3 ARDA Benchmark Execution Path

Primary contract: arda-predict + arda-cde

Guidance mode: arda-predict with arda-cde evidence

Adapter gate status: PASS

Official ARDA run status: completed

arda-cde discovery run ID

8611a303-87af-4332-afc3-55da4a94aa66

2.4 GEARS Baseline Description

GEARS (Roohani et al. 2023) is the established state-of-the-art for gene perturbation prediction. It is a graph-enhanced gene activation and repression simulator that uses gene ontology (GO) and co-expression graph structure to predict post-perturbation expression. For this benchmark, GEARS was trained and evaluated on the identical frozen split used by all other models, with GO + co-expression features enabled (the configuration recommended by the authors). Results are averaged over 3 random seeds (seed std = 0.0141).

3. Unified Metrics

3.1 Unified Same-Split Leaderboard

All models were evaluated on the identical frozen test split. Metrics are computed on log-fold-change predictions against observed post-perturbation expression profiles. The primary ranking metric is Cosine LogFC (cosine similarity in log-fold-change space). Secondary metrics include RMSE (mean across genes), Top-20 DE MSE (mean squared error on the 20 most differentially expressed genes), and Pearson DE (Pearson correlation on differentially expressed genes).
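The four metrics named above can be sketched in plain Python for a single perturbation, given predicted and observed log-fold-change vectors over the same genes. This is an illustration under stated assumptions, not the PerturBench scoring code: selecting differentially expressed genes by largest absolute observed log-fold-change is an assumption, as is returning 0.0 when a vector has zero norm or zero variance. All function names are hypothetical.

```python
import math


def cosine_similarity(pred, obs):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(p * o for p, o in zip(pred, obs))
    norm_p = math.sqrt(sum(p * p for p in pred))
    norm_o = math.sqrt(sum(o * o for o in obs))
    if norm_p == 0.0 or norm_o == 0.0:
        return 0.0  # assumed convention for an all-zero vector
    return dot / (norm_p * norm_o)


def rmse(pred, obs):
    """Root mean squared error across genes."""
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(pred, obs)) / len(obs))


def top_k_de_indices(obs, k=20):
    """Indices of the k genes with the largest |observed logFC| (assumed DE proxy)."""
    order = sorted(range(len(obs)), key=lambda i: abs(obs[i]), reverse=True)
    return order[:k]


def top_k_de_mse(pred, obs, k=20):
    """Mean squared error restricted to the top-k DE genes."""
    idx = top_k_de_indices(obs, k)
    return sum((pred[i] - obs[i]) ** 2 for i in idx) / len(idx)


def pearson(pred, obs):
    """Pearson correlation between two equal-length vectors."""
    n = len(obs)
    mean_p = sum(pred) / n
    mean_o = sum(obs) / n
    cov = sum((p - mean_p) * (o - mean_o) for p, o in zip(pred, obs))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_o = sum((o - mean_o) ** 2 for o in obs)
    if var_p == 0.0 or var_o == 0.0:
        return 0.0  # assumed convention for a constant vector
    return cov / math.sqrt(var_p * var_o)


def pearson_de(pred, obs, k=20):
    """Pearson correlation restricted to the top-k DE genes."""
    idx = top_k_de_indices(obs, k)
    return pearson([pred[i] for i in idx], [obs[i] for i in idx])
```

Under the zero-norm convention assumed here, a model that predicts no change from control gets a cosine of 0.0, which would be consistent with the 0.0000 reported for control_baseline, although the report does not state the convention used.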

Model	Cosine LogFC	RMSE mean	Top-20 DE MSE	Pearson DE	Seeds	Seed Std	Notes
arda_predict_plus_cde	0.9109	0.0425	0.054874	0.9684	1	0.0000	ARDA ablation (cde_run_id, n=1)
linear_baseline	0.9022	0.0491	0.080359	0.9571	1	0.0000	additive single-perturbation effect
arda_predict_plus_cde_gnn	0.9003	0.0450	0.066749	0.9637	1	0.0000	ARDA ablation (cde_run_id, n=1)
arda_predict_only	0.8968	0.0478	0.058686	0.9650	1	0.0000	ARDA ablation (predict_only, n=1)
nearest_neighbor_baseline	0.8236	0.0595	0.190519	0.8590	1	0.0000	nearest-single perturbation proxy
GEARS (Roohani et al. 2023)	0.7158	0.0725	0.100037	0.7929	3	0.0141	same frozen split (GO + co-expression)
control_baseline	0.0000	0.0996	0.716381	0.0000	1	0.0000	trivial lower-bound baseline

arda_predict_plus_cde achieves the highest Cosine LogFC (0.9109) and Pearson DE (0.9684), and the lowest RMSE (0.0425) and Top-20 DE MSE (0.054874), across all evaluated models. Its margin over the strongest baseline (linear_baseline at 0.9022) is +0.0087 Cosine LogFC. GEARS (Roohani et al. 2023), evaluated on the same frozen split with GO + co-expression features (3 seeds, std = 0.0141), achieves Cosine LogFC 0.7158, RMSE 0.0725, and Pearson DE 0.7929. On the primary metric it ranks below all ARDA variants, the linear baseline, and the nearest-neighbor baseline.
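The ranking and margin arithmetic can be reproduced from the leaderboard rows with a short sketch. The Cosine LogFC and RMSE values below are copied from the table above. The helper names, and the rule that any model whose name starts with `arda_` counts as an ARDA variant, are assumptions made for illustration.

```python
# (model, cosine_logfc, rmse_mean) copied from the unified leaderboard.
ROWS = [
    ("arda_predict_plus_cde", 0.9109, 0.0425),
    ("linear_baseline", 0.9022, 0.0491),
    ("arda_predict_plus_cde_gnn", 0.9003, 0.0450),
    ("arda_predict_only", 0.8968, 0.0478),
    ("nearest_neighbor_baseline", 0.8236, 0.0595),
    ("GEARS", 0.7158, 0.0725),
    ("control_baseline", 0.0000, 0.0996),
]


def is_arda(name):
    """Assumed naming rule: ARDA variants start with 'arda_'."""
    return name.startswith("arda_")


def rank_by_cosine(rows):
    """Model names ordered from highest to lowest Cosine LogFC."""
    return [r[0] for r in sorted(rows, key=lambda r: r[1], reverse=True)]


def margin_over_best_baseline(rows, model):
    """Return (best non-ARDA model, Cosine LogFC margin of `model` over it)."""
    best = max((r for r in rows if not is_arda(r[0])), key=lambda r: r[1])
    score = next(r[1] for r in rows if r[0] == model)
    return best[0], round(score - best[1], 4)
```

With these rows, `margin_over_best_baseline(ROWS, "arda_predict_plus_cde")` gives the +0.0087 margin quoted above, and the same call for `arda_predict_plus_cde_gnn` gives the −0.0019 delta quoted in the abstract.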

3.2 Same-Split Subgroup Analysis

Test perturbation pairs are stratified by the number of constituent genes that appeared in the training set. combo_seen0 (n=7) contains pairs where neither gene was seen in training; it is the hardest subgroup and requires the most generalization. combo_seen1 (n=20) contains pairs where one gene was seen. combo_seen2 (n=19) contains pairs where both genes were seen individually but never in combination.

Model	Subgroup	n	Cosine LogFC	RMSE mean	Top-20 DE MSE	Pearson DE
control_baseline	combo_seen0	7	0.0000	0.1204	0.903982	0.0000
control_baseline	combo_seen1	20	0.0000	0.1012	0.707092	0.0000
control_baseline	combo_seen2	19	0.0000	0.0903	0.657043	0.0000
linear_baseline	combo_seen0	7	0.9328	0.0485	0.085344	0.9619
linear_baseline	combo_seen1	20	0.8891	0.0480	0.066411	0.9475
linear_baseline	combo_seen2	19	0.9048	0.0505	0.093203	0.9655
nearest_neighbor_baseline	combo_seen0	7	0.8304	0.0698	0.307024	0.8333
nearest_neighbor_baseline	combo_seen1	20	0.8417	0.0546	0.145413	0.8906
nearest_neighbor_baseline	combo_seen2	19	0.8022	0.0609	0.195076	0.8351
arda_predict_only	combo_seen0	7	0.9274	0.0583	0.113022	0.9591
arda_predict_only	combo_seen1	20	0.8655	0.0511	0.062680	0.9558
arda_predict_only	combo_seen2	19	0.9185	0.0403	0.034464	0.9769
arda_predict_plus_cde	combo_seen0	7	0.9358	0.0479	0.083571	0.9648
arda_predict_plus_cde	combo_seen1	20	0.8907	0.0442	0.055106	0.9574
arda_predict_plus_cde	combo_seen2	19	0.9230	0.0388	0.044057	0.9814
arda_predict_plus_cde_gnn	combo_seen0	7	0.9281	0.0508	0.100148	0.9573
arda_predict_plus_cde_gnn	combo_seen1	20	0.8753	0.0481	0.068669	0.9507
arda_predict_plus_cde_gnn	combo_seen2	19	0.9163	0.0395	0.052424	0.9798

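arda_predict_plus_cde has the highest Cosine LogFC in all three subgroups. Its margin over the linear baseline is +0.0030 on combo_seen0 (0.9358 vs. 0.9328), +0.0016 on combo_seen1 (0.8907 vs. 0.8891), and +0.0182 on combo_seen2 (0.9230 vs. 0.9048), so the advantage over that baseline is largest when both genes were seen individually. Relative to arda_predict_only, adding arda-cde evidence improves Cosine LogFC by +0.0084, +0.0252, and +0.0045 on the three subgroups respectively. The lead does not hold on every secondary metric: on combo_seen2, arda_predict_only has the lower Top-20 DE MSE (0.034464 vs. 0.044057). The combo_seen0 result rests on 7 pairs and a single seed, so it is suggestive rather than conclusive evidence that causal evidence helps on gene combinations with no prior training exposure.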
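The subgroup rows can be cross-checked against the leaderboard. The sketch below assumes that each overall Cosine LogFC is the pair-count-weighted mean of the three subgroup values. The report does not state this, but the figures are consistent with it to within rounding. Values are copied from the two tables, and the function names are illustrative.

```python
# Number of test pairs per subgroup, from the subgroup table.
SUBGROUP_N = {"combo_seen0": 7, "combo_seen1": 20, "combo_seen2": 19}

# Cosine LogFC per subgroup, from the subgroup table.
SUBGROUP_COSINE = {
    "arda_predict_plus_cde": {
        "combo_seen0": 0.9358, "combo_seen1": 0.8907, "combo_seen2": 0.9230,
    },
    "arda_predict_only": {
        "combo_seen0": 0.9274, "combo_seen1": 0.8655, "combo_seen2": 0.9185,
    },
    "linear_baseline": {
        "combo_seen0": 0.9328, "combo_seen1": 0.8891, "combo_seen2": 0.9048,
    },
}

# Overall Cosine LogFC, from the unified leaderboard.
OVERALL_COSINE = {
    "arda_predict_plus_cde": 0.9109,
    "arda_predict_only": 0.8968,
    "linear_baseline": 0.9022,
}


def weighted_mean(by_subgroup, counts=SUBGROUP_N):
    """Mean of subgroup scores weighted by the number of pairs in each."""
    total = sum(counts.values())
    return sum(by_subgroup[g] * counts[g] for g in counts) / total


def subgroup_margins(model, reference):
    """Per-subgroup Cosine LogFC difference, model minus reference."""
    return {
        g: round(SUBGROUP_COSINE[model][g] - SUBGROUP_COSINE[reference][g], 4)
        for g in SUBGROUP_N
    }
```

For the three models listed, `weighted_mean(SUBGROUP_COSINE[name])` agrees with `OVERALL_COSINE[name]` to within 1e-4, and `subgroup_margins(model, reference)` returns the per-subgroup differences between any two of them.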

4. Reproducibility

4.1 Environment

Python: 3.10.13

Platform: Linux-6.12.35-55.103.amzn2023.x86_64-x86_64-with-glibc2.35

HF GPU runtime: requested=a100-large

4.2 Key Artifact Paths

Frozen split manifest: data/locks/official_split_manifest.json

Runtime manifest: data/locks/runtime_manifest.json

Download manifest: data/locks/download_manifest.json

Unified leaderboard: results/unified_sota_leaderboard.csv

Unified metric payload: results/unified_sota_metrics.json

Subgroup metric payload: results/unified_sota_subgroup_metrics.json

ARDA run manifest: artifacts/arda_official_runs.json

ARDA primary run summary: artifacts/arda_official_run.json

Publish bundle manifest: artifacts/publish_bundle_manifest.json
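Before rerunning any analysis, it may help to confirm that a local copy of the report bundle contains every artifact listed above. The sketch below checks only that the files exist. It assumes the paths are relative to a bundle root directory, and the function name `missing_artifacts` is illustrative. The internal structure of the JSON manifests is not described in this report, so nothing is parsed.

```python
from pathlib import Path

# Relative paths copied from the artifact list in this report.
ARTIFACT_PATHS = [
    "data/locks/official_split_manifest.json",
    "data/locks/runtime_manifest.json",
    "data/locks/download_manifest.json",
    "results/unified_sota_leaderboard.csv",
    "results/unified_sota_metrics.json",
    "results/unified_sota_subgroup_metrics.json",
    "artifacts/arda_official_runs.json",
    "artifacts/arda_official_run.json",
    "artifacts/publish_bundle_manifest.json",
]


def missing_artifacts(root="."):
    """Return the listed paths that are not regular files under `root`."""
    base = Path(root)
    return [rel for rel in ARTIFACT_PATHS if not (base / rel).is_file()]
```

An empty list from `missing_artifacts("path/to/bundle")` means every listed file is present. It says nothing about whether the contents match the published results.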

4.3 Public Data Downloads

All leaderboard CSV data is publicly available for independent verification and reproducibility.

Unified SOTA leaderboard (ARDA runs): /data/perturbench/unified_sota_leaderboard.csv

Unified SOTA leaderboard (with GEARS same-split): /data/perturbench/unified_sota_leaderboard_with_gears.csv

Official metrics: /data/perturbench/official_metrics.csv

GEARS seed 42: /data/perturbench/gears-seeds/gears_seed42.csv

GEARS seed 123: /data/perturbench/gears-seeds/gears_seed123.csv

GEARS seed 456: /data/perturbench/gears-seeds/gears_seed456.csv

5. Limitations

1. Claims are limited to the frozen Norman19 benchmark contract. All performance figures in this report are specific to the PerturBench Norman19 frozen test split and should not be generalized beyond that scope.

2. Intentionally avoids internal implementation disclosure. ARDA was evaluated as a black-box platform. This report does not disclose internal model architectures, training procedures, or hyperparameter configurations. It reports only benchmark-facing evidence.

3. Secondary analyses support interpretation but do not replace biological validation. Computational predictions of perturbation responses, however accurate on benchmark metrics, are not substitutes for experimental validation or independent replication.

6. Data, Code, and Artifact Availability

Data access is documented by the download manifest and split manifest in data/locks/. Code paths: campaign scripts in perturbench_campaign/scripts/. Reproducibility depends on saved results, manifests, and the report bundle.

Authorship

ARDA AI Agent Teams

Faruk Guney — Inventor of CDE, GFF, and DM-GFF

OpenAI ChatGPT 5.4 (reasoning=xhigh) — Lead Scientist

Anthropic Opus 4.6 (reasoning=max) — Lead Software Engineer

Vareon Inc., Irvine, California, USA

March 2026
