Experimental validation

How rigorously has each benchmark been validated experimentally?
Clinical = sourced from real clinical-trial outcomes · Wet-lab confirmed = top predictions tested in-lab · Prospective = designed as a forward-looking test set · Retrospective = historical hold-out only · None = no experimental grounding.

Clinical: 7

Wet-lab confirmed: 29

Prospective: 7

Retrospective: 79

None: 9
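
The five counts above can be turned into shares of the whole catalog. The sketch below is a minimal illustration of that arithmetic; the names `TIER_COUNTS` and `tier_shares` are hypothetical and not part of the catalog itself.

```python
# Counts copied from the summary above; all names here are illustrative only.
TIER_COUNTS = {
    "Clinical": 7,
    "Wet-lab confirmed": 29,
    "Prospective": 7,
    "Retrospective": 79,
    "None": 9,
}


def tier_shares(counts):
    """Return each tier's share of the total as a percentage, to one decimal."""
    total = sum(counts.values())
    return {tier: round(100 * n / total, 1) for tier, n in counts.items()}


if __name__ == "__main__":
    print(f"Total benchmarks: {sum(TIER_COUNTS.values())}")
    for tier, pct in tier_shares(TIER_COUNTS).items():
        print(f"{tier}: {pct}%")
```

On these counts the catalog holds 131 benchmarks, of which retrospective hold-outs make up roughly 60%, while the clinical and prospective tiers are each about 5%.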

Clinical (7)

Benchmark	Stages	Score	Notes
MIMIC-IV Benchmark Tasks	Phase III, Clinical Development, Post-market / RWE	89.4	Canonical clinical ML benchmark. Credentialed access limits casual use.
ClinBench Quarterly — Q2 2026	Phase II, Phase III, Clinical Development	87.6	New track in Q2 2026 for endpoint adjudication.
ClinBench Quarterly (Insilico)	Phase II, Phase III, Clinical Development	81.5	Benchmark refresh cadence beats all academic trial outcome benchmarks. Leaderboards test frontier LLMs against quarterly-updated splits.
CPTAC Proteogenomic Benchmarks	Disease Modeling, Target ID, Phase II	80.8	Deep integrative oncology data.
HINT / TrialBench	Phase II, Phase III, Clinical Development	76.5	Limited by ClinicalTrials.gov quality.
Trial Outcome Prediction (TOP)	Phase III, Clinical Development	76.5	Often reported alongside HINT.
CT-Outcome (TrialBench v2)	Phase II, Phase III	73.4	Temporal splits are the key improvement.

Wet-lab confirmed (29)

Benchmark	Stages	Score	Notes
Open Targets Platform	Disease Modeling, Target ID	100.0	Industry gold standard for target prioritization. Quarterly versioned releases.
DepMap (Cancer Dependency Map)	Target ID, Disease Modeling	100.0	Quarterly release cadence.
Protein Language Model Eval 2026	Virtual Cell, Hit ID	100.0	Meta FAIR + EvolutionaryScale collaboration; includes held-out targets with wet-lab fitness.
ChEMBL	Hit ID, Lead ID / ADMET	97.5	Underlies ~80% of public bioactivity ML benchmarks.
ProteinGym	Target ID, Lead ID / ADMET, IND-enabling	97.5	Field standard. Clinical track enables fair ESM/EVE/AlphaMissense comparison.
CAFA 6 (Critical Assessment of Function Annotation 6)	Target ID	97.5	Continuation of CAFA series (since 2010). Time-delayed evaluation prevents data leakage. Final evaluation May 2026 using annotations from UniProt Dec 2025/Jan 2
Therapeutic Antibody Design Benchmark 2026	Hit ID, Lead ID / ADMET	97.0	Top-ranked submissions had wet-lab binding measured (Kd + aggregation) by independent labs.
Protein Design Benchmark 2026	Hit ID	97.0	All submitted designs characterized in IPD / external wet labs.
RxRx3 Phenomics Benchmark	Hit ID, Lead ID / ADMET	94.9	Real phenomics data from Recursion's lab; public subsets only. Full dataset is proprietary (see private_benchmarks).
X-Atlas/Pisces (25.6M Cell Multi-Context Perturb-seq)	Target ID	94.4	Successor to X-Atlas/Orion. 16 diverse biological contexts enable robust cross-context generalization. Underlies X-Cell foundation model. Industry-scale data re
FLAb2 (Fitness Landscape for Antibodies 2)	IND-enabling, Lead ID / ADMET	91.9	Key finding: current protein AI models cannot consistently predict antibody developability. Critical for biologics pipeline. Covers therapeutically relevant pro
X-Atlas/Orion (Xaira Genome-wide Perturb-seq)	Target ID	91.4	Generated by Xaira Therapeutics. Unprecedented sequencing depth enables detection of subtle perturbation effects. Superseded in scale by X-Atlas/Pisces (25.6M c
LINCS L1000 / CMap	Virtual Cell, Disease Modeling, Target ID	89.9	Foundational pharma resource for MoA work. Batch effects require careful handling.
canSAR	Target ID, Hit ID	89.4	Deep oncology focus; widely-used druggability predictor.
PubChem BioAssay	Hit ID	88.6	Broadest HTS repository; quality heterogeneous.
BenchBB (Bench-tested Binder Benchmark)	Hit ID, Lead ID / ADMET	88.4	Unique in providing actual wet-lab validation infrastructure. Adaptyv runs cloud lab for protein designers. EGFR competition attracted diverse computational met
Cell Line Sensitivity Benchmark (CLSB)	Target ID, Lead ID / ADMET	88.1	DepMap-adjacent but adds new splits and PRISM v4.
OpenBind EV-A71 Structure-Affinity Dataset	Hit ID, Lead ID / ADMET	84.8	One of the largest public single-target structure-affinity datasets. High-throughput crystallography at Diamond Light Source. Plans for more targets and blind c
TargetBench (Insilico)	Target ID, Disease Modeling	84.6	Disease-organized target ID benchmark, a unique axis. Frontier LLM leaderboard.
ISM Benchmarks: ADMET (Insilico)	Lead ID / ADMET, IND-enabling	84.6	Broader endpoint coverage than TDC ADMET. Side-by-side with TDC mirror on DDB.
Longevity Compound Benchmark	Hit ID, Lead ID / ADMET	84.6	Insilico-hosted; unique in bridging cheminformatics and aging biology.
LSD Large-Scale Docking Database	Hit ID	82.5	Unprecedented scale for public docking data. Includes experimental in vitro validation for subset. From UCSF Shoichet Lab. Critical for training ML scoring func
CycPeptMPDB (Cyclic Peptide Membrane Permeability Database)	IND-enabling, Lead ID / ADMET	82.5	De facto standard benchmark for cyclic peptide permeability prediction. Multiple 2025-2026 papers benchmark 13+ ML methods against this dataset. Critical for be
CRISPR Outcome Prediction Benchmark	Hit ID	79.5	Prospective track added in Q1 2026.
IgLM / AntiBERTa benchmarks	Hit ID, Developmental Candidate	77.5	Moves toward true developability benchmarks.
Geneformer Eval	Virtual Cell	77.0	Author-led eval; still widely re-run on OpenProblems tasks.
TDC DrugSyn (OncoPolyPharm + DrugComb_NCI60)	Developmental Candidate, Lead ID / ADMET	77.0	Important for combination therapy design.
scGPT Evaluation Suite	Virtual Cell	73.7	Evaluation dominated by authors' own model; flagged self-referential. Pair with OpenProblems for fair comparison.
AWS-JHU Antibody Developability Benchmark	Developmental Candidate, Lead ID / ADMET	72.3	Groundbreaking for antibody developability; fills gap where most benchmarks focus on binding only. Wet-lab validated ground truth across diverse formats. Zero-

Prospective (7)

Benchmark	Stages	Score	Notes
Virtual Cell Benchmark Suite 2026	Virtual Cell	97.0	Successor to Open Problems perturbation benchmark. Prospectively designed; Tahoe-100M inclusion makes it industry-relevant.
ASAP Discovery Antiviral 2025	Hit ID, Lead ID / ADMET	93.9	Top predictions are synthesized and tested; a rare prospective public benchmark.
Longevity Benchmark (Insilico)	Disease Modeling, Target ID, Post-market / RWE	90.6	Unique, broad longevity/aging benchmark slice; nothing else in the field covers aging comparably. Leaderboard features frontier LLMs.
Polaris ADMET	Lead ID / ADMET	88.4	Industry splits enforce blinded eval; highest industry relevance among ADMET benchmarks.
CZ Virtual Cell Challenge	Virtual Cell, Target ID	88.1	Gold standard-in-the-making for foundation-model era perturbation prediction. Hidden test → strong against leakage.
mRNA Design Benchmark (CodonBench 2026)	Hit ID, Lead ID / ADMET	82.0	Designed with Moderna and Deep Genomics; includes held-out wet-lab validation track.
Polaris Biologics (Polyreactivity / SEC / Tm)	Developmental Candidate	79.0	Industry-donated; growing.
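
Every table in this section shares the same four tab-separated columns (Benchmark, Stages, Score, Notes). As a rough sketch of how such rows could be loaded and filtered by score, assuming the table text is available as a tab-delimited string: `BenchmarkRow`, `parse_rows`, `SAMPLE`, and the `min_score` cutoff are hypothetical names chosen for illustration, and the three sample rows are taken from the Prospective table with notes abbreviated.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class BenchmarkRow:
    name: str
    stages: str   # kept as the raw cell text, not split further
    score: float
    notes: str


# Three rows from the Prospective table; notes are shortened for the example.
SAMPLE = (
    "Benchmark\tStages\tScore\tNotes\n"
    "Virtual Cell Benchmark Suite 2026\tVirtual Cell\t97.0\tProspectively designed.\n"
    "Polaris ADMET\tLead ID / ADMET\t88.4\tIndustry splits enforce blinded eval.\n"
    "Polaris Biologics (Polyreactivity / SEC / Tm)\tDevelopmental Candidate\t79.0\tIndustry-donated; growing.\n"
)


def parse_rows(text):
    """Parse tab-separated table text (with a header line) into BenchmarkRow records."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    return [
        BenchmarkRow(r["Benchmark"], r["Stages"], float(r["Score"]), r["Notes"])
        for r in reader
    ]


def at_least(rows, min_score):
    """Keep rows whose score is at or above an arbitrary, illustrative cutoff."""
    return [r for r in rows if r.score >= min_score]


if __name__ == "__main__":
    for row in at_least(parse_rows(SAMPLE), min_score=85.0):
        print(f"{row.name}: {row.score}")
```

The cutoff of 85.0 is arbitrary; the catalog does not define a pass mark for the Score column.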

Retrospective (79)

Benchmark	Stages	Score	Notes
TDC ADMET Group	Lead ID / ADMET	100.0	Most-adopted ADMET benchmark. 100+ leaderboard submissions.
SAbDab	Hit ID, Lead ID / ADMET, Developmental Candidate	100.0	Canonical antibody structure resource. Weekly updates.
Observed Antibody Space (OAS)	Hit ID, Lead ID / ADMET	97.5	Underlies AbLang, IgLM, AntiBERTa; industry-adopted.
PoseBusters	Hit ID	97.0	Exposed major failure modes in AlphaFold-Multimer/DiffDock/RFAA. Default pharma filter.
PLINDER	Hit ID	97.0	Replaces PDBbind as the modern leakage-controlled docking standard.
PLINDER v2 Protein-Ligand Benchmark	Hit ID	97.0	PLINDER is consistently cited as the go-to replacement for PDBbind in modern docking evaluation.
STRING	Target ID, Disease Modeling	94.9	Workhorse for network-based target ID. Distinguish functional vs physical edges.
CASP15	Hit ID, Target ID	94.9	Biennial. Introduced ligand prediction category.
CASP16	Hit ID	94.4	First full multimer+ligand+RNA joint eval.
CAMEO weekly targets	Hit ID	94.4	Weekly cadence complements biennial CASP.
Boltz-1 Structure Prediction Benchmark	Hit ID	94.4	Open-source companion to commercial structure predictors; benchmark splits audited against AlphaFold 3 leakage.
ORD Reaction Benchmark	Developmental Candidate	93.9	Modern open reaction corpus; industry-scale.
Open Problems: Perturbation Prediction	Virtual Cell	91.9	Best-in-class rigor (Viash workflow, hidden test, NeurIPS track).
PrimeKG	Disease Modeling, Target ID	91.9	Modern, well-engineered KG; strong for GNN drug repurposing.
scPerturBench	Target ID	91.9	Published in Nature Methods (Vol 23, Issue 2). Most comprehensive evaluation of perturbation prediction methods. Covers both genetic and chemical perturbations.
PoseX (Protein-Ligand Docking Benchmark)	Hit ID, Lead ID / ADMET	91.4	Key findings: AI surpasses physics-based docking overall; relaxation crucial for AI-generated poses; pocket specification boosts performance; some co-folding me
FAERS (raw)	Post-market / RWE	91.1	Known under-/over-reporting biases.
scPerturb	Virtual Cell, Target ID	88.9	Canonical harmonized resource. Strong Perturb-seq coverage; weaker for chemical perturbations.
PINDER	Hit ID	88.9	Expected PPI docking standard.
Practical Molecular Optimization (PMO)	Lead ID / ADMET, Developmental Candidate	88.9	Sample-efficiency focus exposed shortcomings of reward-maxing methods.
CoV-AbDab	Hit ID	88.9	Narrow modality but critical for pandemic-preparedness ML.
NucleoBench	Hit ID, Lead ID / ADMET	88.9	First comprehensive benchmark for nucleic acid design. Google Research + Move37 Labs collaboration. Critical for CRISPR therapies, mRNA vaccines, and gene thera
ISM Benchmarks: GPCRs (Insilico)	Hit ID, Lead ID / ADMET	87.6	Largest open GPCR affinity benchmark. Leaderboards test external frontier LLMs; not self-referential.
CAPRI Rounds	Hit ID	86.3	Oldest PPI prediction benchmark.
mRNABench (mRNA Property Prediction Benchmark)	Hit ID, Lead ID / ADMET	86.3	Morris Lab (University of Toronto). First standardized benchmark for mRNA biology predictions. New Mamba-based model achieves SOTA with fewer parameters. Python
ToxCast	Lead ID / ADMET, IND-enabling	85.6	Regulatory-grade broad tox dataset.
GNNBench-Drug 2026	Hit ID, Lead ID / ADMET	85.6	IBM-led; overlaps with MoleculeNet but adds modern splits.
CT-Open (Live Clinical Trial Outcome Benchmark)	Phase III, Phase I, Phase II	84.8	Presented at ICLR 2026. Addresses critical data contamination problem in clinical trial prediction. Quarterly cadence: Winter/Spring/Summer/Fall challenges. Ful
CAFA5	Target ID	84.3	CAFA5 broke attendance records.
MoleculeACE	Lead ID / ADMET	83.3	Critical stress-test for generalization; exposed GNN weaknesses.
MatBench	Developmental Candidate	83.3	Materials-science benchmark; relevant for formulation / co-crystal work.
OffSides / TWOSIDES	Post-market / RWE	83.0	Key benchmark for DDI + adverse event ML.
DrugComb 2.0 Synergy Benchmark	Lead ID / ADMET, Developmental Candidate	83.0	Industry-relevant for combination oncology.
DMPK Integrated Benchmark	Lead ID / ADMET, Developmental Candidate	82.5	AZ/Merck/Pfizer contributed held-out test molecules.
BELKA (Big Encoded Library for Chemical Assessment)	Hit ID	82.5	NeurIPS 2024 competition. Unprecedented scale for public binding data. Library split tests true OOD generalization. DEL technology enables massive chemical spac
DOCKSTRING	Hit ID	81.3	Vina scores are a proxy; not a replacement for wet assays.
DisGeNET	Disease Modeling, Target ID	81.0	Commercial license required for industry. Text-mining noise limits quality.
LIT-PCBA	Hit ID	80.8	Much fairer than DUD-E; small target count limits coverage.
FLIP	Target ID, Developmental Candidate	80.8	Complements ProteinGym (smaller but carefully designed splits).
AbBiBench (Antibody Binding Benchmark)	Lead ID / ADMET	80.8	Key finding: structure-conditioned inverse folding models outperform others for affinity prediction and generation. Treats complex holistically rather than anti
GuacaMol	Lead ID / ADMET, Developmental Candidate	80.5	First-generation generative benchmark; largely superseded by PMO for goal-directed.
Open Systems Pharmacology / PK-Sim	Phase I, IND-enabling	80.3	Open alternative to Simcyp.
pepADMET	IND-enabling, Lead ID / ADMET	80.0	Fills critical gap in peptide ADMET prediction (previous tools focused on small molecules). Covers Caco-2, PAMPA, BBB, half-life, toxicity. Supports modified pe
ADMET-AI	Lead ID / ADMET	79.5	Strong baselines + web tool; builds on TDC.
AMES (mutagenicity)	IND-enabling, Lead ID / ADMET	79.5	Core gentox endpoint.
scImmuneBench	Virtual Cell, Disease Modeling	79.5	Useful for cell-therapy companies evaluating immune foundation models.
MolGenBench	Hit ID, Lead ID / ADMET	78.2	Reveals significant gap between current generative model capabilities and real-world H2L demands. Novel metrics for target-specific active compound rediscovery
MoleculeNet	Lead ID / ADMET, Hit ID	78.0	Widely cited (3600+); aging splits with known scaffold leakage.
USPTO-50K / USPTO-MIT (Retrosynthesis)	Lead ID / ADMET, Developmental Candidate	78.0	Known leakage across canonical splits; use time-split or ORD for fairer eval.
BioDesignBench	Hit ID, Lead ID / ADMET	77.7	Key finding: LLM agents select appropriate tools but evaluate designs superficially, rarely comparing alternatives. Strongest agents surpass hardcoded pipelines
Tox21	Lead ID / ADMET, IND-enabling	77.5	Field-standard tox benchmark; endpoint count small vs modern suites.
Obach PK Dataset	Phase I, IND-enabling, Lead ID / ADMET	77.0	Small but highest-quality human-PK dataset.
CASF-2016	Hit ID	76.2	Authoritative scoring-power eval; update cadence slow.
PDBbind	Hit ID, Lead ID / ADMET	75.9	Scaffold/temporal leakage well-documented. Pair with CASF + LeakyPDB.
SIDER	Post-market / RWE, IND-enabling	74.9	Aging but still widely used. TWOSIDES/OffSides offer newer signals.
TAPE	Target ID, Developmental Candidate	74.9	Historically important; largely superseded by ProteinGym/FLIP for fitness and by PEER for broader tasks.
Simcyp Validation Sets	Phase I, Phase II, IND-enabling	74.4	Industry gold standard but proprietary. Open benchmarks exist via OSP Suite.
PEER	Target ID, Developmental Candidate	74.4	Broader than TAPE, tighter than ProteinGym; good middle ground.
ClawBio Skill Correctness Bench	Disease Modeling, Target ID, Clinical Development	74.2	Independent third-party bench structurally precludes self-reference. Coverage narrow but rigor exemplary.
hERG (cardio-tox) TDC	IND-enabling, Lead ID / ADMET	73.9	Small but widely benchmarked. Industry pairs with SafetyPanel-5.
DILI / LD50 Zhu	IND-enabling, Lead ID / ADMET	73.9	Essential IND-enabling endpoints.
DUD-E	Hit ID	72.9	Well-known analog bias in decoy selection; use LIT-PCBA / PLINDER for fair VS.
MOSES	Lead ID / ADMET, Developmental Candidate	72.4	Distribution-learning metrics known to saturate.
PerturbBench	Virtual Cell	71.4	Pharma-led (Genentech); well-specified eval.
ScaleBench: Molecular Property Prediction	Lead ID / ADMET, Hit ID	69.5	Timely study showing compact specialized models remain competitive vs. large foundation models for molecular property prediction. Key finding: performance depen
DrugPlayGround	Hit ID, Target ID, Lead ID / ADMET	66.8	First unified platform to benchmark both LLMs and molecular embeddings for drug discovery. Includes Head-Gordon lab (Berkeley), reputable. Early days for adopt
ClinTox	Lead ID / ADMET, IND-enabling	65.6	Small, binary; saturated. Useful only as sanity check.
CellBench-LS	Virtual Cell, Disease Modeling	65.2	Addresses critical gap: most scFM benchmarks use full supervision. Low-supervision evaluation is more realistic for clinical/translational settings. Finds scFMs
DEKOIS 2.0	Hit ID	57.5	Historical reference; use LIT-PCBA / PLINDER for modern VS.
FoldBench	Hit ID	55.8	Published Nat Comms 2026. Covers 9 task types (monomer, multimer, nucleic acid, ligand, ion, antibody-antigen, etc). Revealed that ligand docking accuracy decre
OpenADMET / Avoid-ome	Lead ID / ADMET, IND-enabling	53.6	Nat Comms perspective (May 2026) by top structural biology/comp chem leaders (Fraser, Chodera, Murcko, Walters). Proposes mechanistic ADMET datasets grounded in
MetaboNet-Bench	Clinical Development (cross-phase), Post-market / RWE	53.0	Fills a genuine gap: standardized, multimodal (glucose+insulin+carbs) evaluation for T1D glucose forecasting where prior work was CGM-only and incomparable (rig
TxBench-PP	Lead ID / ADMET, Target ID, IND-enabling	52.0	One of the first agentic drug-discovery benchmarks grounded in realistic program-decision workflows with deterministic grading over real assay data rather than
SAIR	Hit ID	51.2	ICLR 2026 paper from SandboxAQ. Synthetic data approach avoids PDB biases but raises generalizability questions. Self-referential flag: SandboxAQ models dominat
MPP Foundation Model Benchmark	Lead ID / ADMET, Hit ID	50.3	Valuable meta-benchmark tracing evolution from descriptors to foundation models. Includes industry perspective and highlights evaluation protocol challenges. Mu
CA-DEL	Hit ID	50.0	arXiv preprint (8 May 2026). Addresses a genuinely under-benchmarked modality (DEL screens) with explicit sim-to-real grounding via ChEMBL Ki validation across
InteractBind	Hit ID	47.5	arXiv preprint (21 May 2026). Valuable diagnostic framing: separates true binding-site localization from affinity/likelihood shortcuts, with ligand-similarity-c
PMO-Dock	Hit ID, Lead ID / ADMET	43.5	ICLR 2026 GEM workshop paper (published Mar 2026, last revised 26 May 2026). Useful upgrade to PMO by adding docking-based oracles and explicit specificity/gene
CompGen-MLIP: Compositional Generalisation for ML Interatomic Potentials	Hit ID, Lead ID / ADMET	39.8	Addresses important gap in MLIP evaluation: OOD generalization to unseen molecular compositions. Shows current models struggle (10x error on OOD). More computa

None (9)

Benchmark	Stages	Score	Notes
DMPKBench (DMPK LLM Evaluation Benchmark)	IND-enabling, Lead ID / ADMET	83.8	From GHDDI (Gates Foundation China). LLM accuracy ranges 11-89% across tasks. Models excel at knowledge tasks but struggle with multi-modal reasoning (PK curves
BOOM (Benchmarking Out-Of-Distribution Molecular Predictions)	Hit ID, Lead ID / ADMET	83.3	NeurIPS 2025. Critical benchmark showing the gap between in-distribution and OOD performance in molecular ML. Highly relevant for real-world drug discovery wher
FGBench (Functional Group Molecular Property Reasoning)	Hit ID, Lead ID / ADMET	77.0	NeurIPS 2025 Datasets & Benchmarks Track. Reveals LLMs struggle with FG-level property reasoning. Addresses gap between molecular-level and substructure-level u
PerturbArena	Target ID	74.4	Complementary to scPerturBench. Emphasizes metric divergence analysis and practical method selection guidelines. Shows limited robustness to shifts across cellu
DO Challenge 2025 (DeepOrigin Autonomous Drug Discovery)	Hit ID	70.9	First benchmark specifically for AI agents (not just models) in drug discovery. Multi-agent system 'Deep Thought' outperformed most human teams but underperform
PDFBench (De Novo Protein Design from Function)	Hit ID	68.9	First unified benchmark for function-guided protein generation. Addresses comparison challenges from proprietary datasets and inconsistent metrics. Inter-metric
VSDS-vd (Virtual Screening Decoy Set for Docking)	Hit ID	65.8	Chinese research benchmark (Zhejiang University). Finds AI methods show deficiencies in physical soundness of docked structures despite good VS performance. Pro
AssayBench	Target ID, Hit ID	63.3	Novel framing: gene rank prediction from natural language experiment descriptions. Part of broader virtual cell revolution. Tests whether LLMs can replace actua
VibeProteinBench (VPD-Bench)	Target ID, Developmental Candidate	43.2	Novel concept: first benchmark for 'vibe protein design' using natural language interfaces. Very new (May 2026), no community adoption yet. Evaluates LLM+protei
