Experimental validation

How rigorously has each benchmark been validated experimentally?
Clinical = sourced from real clinical-trial outcomes · Wet-lab confirmed = top predictions tested in-lab · Prospective = designed as a forward-looking test set · Retrospective = historical hold-out only · None = no experimental grounding.

Clinical: 7
Wet-lab confirmed: 29
Prospective: 7
Retrospective: 74
None: 9
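
The counts above can be tallied programmatically. The following is a minimal sketch, not part of the original page: the `Validation` enum, the `share` helper, and the weakest-to-strongest ordering of tiers are illustrative assumptions inferred from the order in which the legend lists them.

```python
from enum import IntEnum


class Validation(IntEnum):
    """Validation tiers from the legend.

    The numeric ordering (weakest to strongest) is an assumption,
    inferred from the order in which the page lists the tiers.
    """
    NONE = 0
    RETROSPECTIVE = 1
    PROSPECTIVE = 2
    WET_LAB_CONFIRMED = 3
    CLINICAL = 4


# Counts copied from the summary above.
COUNTS = {
    Validation.CLINICAL: 7,
    Validation.WET_LAB_CONFIRMED: 29,
    Validation.PROSPECTIVE: 7,
    Validation.RETROSPECTIVE: 74,
    Validation.NONE: 9,
}


def share(counts, at_least):
    """Fraction of benchmarks whose tier is `at_least` or stronger."""
    total = sum(counts.values())
    hit = sum(n for tier, n in counts.items() if tier >= at_least)
    return hit / total


total = sum(COUNTS.values())
beyond_retrospective = share(COUNTS, Validation.PROSPECTIVE)

print(f"benchmarks listed: {total}")
print(f"prospective or stronger: {beyond_retrospective:.1%}")
```

Under the assumed ordering, `share(COUNTS, Validation.PROSPECTIVE)` groups the Prospective, Wet-lab confirmed, and Clinical tiers together; a different ordering would change which tiers are included.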

Clinical (7)

Benchmark | Stages | Score | Notes
MIMIC-IV Benchmark Tasks | phase-iii, Clinical Development, Post-market / RWE | 89.4 | Canonical clinical ML benchmark. Credentialed access limits casual use.
ClinBench Quarterly — Q2 2026 | phase-ii, phase-iii, Clinical Development | 87.6 | New track in Q2 2026 for endpoint adjudication.
ClinBench Quarterly (Insilico) | phase-ii, phase-iii, Clinical Development | 81.5 | Benchmark refresh cadence beats all academic trial outcome benchmarks. Leaderboards test frontier LLMs against quarterly-updated splits.
CPTAC Proteogenomic Benchmarks | Disease Modeling, Target ID, phase-ii | 80.8 | Deep integrative oncology data.
HINT / TrialBench | phase-ii, phase-iii, Clinical Development | 76.5 | Limited by ClinicalTrials.gov quality.
Trial Outcome Prediction (TOP) | phase-iii, Clinical Development | 76.5 | Often reported alongside HINT.
CT-Outcome (TrialBench v2) | phase-ii, phase-iii | 73.4 | Temporal splits are key improvement.

Wet-lab confirmed (29)

Benchmark | Stages | Score | Notes
Open Targets Platform | Disease Modeling, Target ID | 100.0 | Industry gold standard for target prioritization. Quarterly versioned releases.
DepMap (Cancer Dependency Map) | Target ID, Disease Modeling | 100.0 | Quarterly release cadence.
Protein Language Model Eval 2026 | Virtual Cell, Hit ID | 100.0 | Meta FAIR + EvolutionaryScale collaboration; includes held-out targets with wet-lab fitness.
ChEMBL | Hit ID, Lead ID / ADMET | 97.5 | Underlies ~80% of public bioactivity ML benchmarks.
ProteinGym | Target ID, Lead ID / ADMET, IND-enabling | 97.5 | Field standard. Clinical track enables fair ESM/EVE/AlphaMissense comparison.
CAFA 6 (Critical Assessment of Function Annotation 6) | Target ID | 97.5 | Continuation of CAFA series (since 2010). Time-delayed evaluation prevents data leakage. Final evaluation May 2026 using annotations from UniProt Dec 2025/Jan 2
Therapeutic Antibody Design Benchmark 2026 | Hit ID, Lead ID / ADMET | 97.0 | Top-ranked submissions had wet-lab binding measured (Kd + aggregation) by independent labs.
Protein Design Benchmark 2026 | Hit ID | 97.0 | All submitted designs characterized in IPD / external wet labs.
RxRx3 Phenomics Benchmark | Hit ID, Lead ID / ADMET | 94.9 | Real phenomics data from Recursion's lab; public subsets only. Full dataset is proprietary (see private_benchmarks).
X-Atlas/Pisces (25.6M Cell Multi-Context Perturb-seq) | Target ID | 94.4 | Successor to X-Atlas/Orion. 16 diverse biological contexts enable robust cross-context generalization. Underlies X-Cell foundation model. Industry-scale data re
FLAb2 (Fitness Landscape for Antibodies 2) | IND-enabling, Lead ID / ADMET | 91.9 | Key finding: current protein AI models cannot consistently predict antibody developability. Critical for biologics pipeline. Covers therapeutically relevant pro
X-Atlas/Orion (Xaira Genome-wide Perturb-seq) | Target ID | 91.4 | Generated by Xaira Therapeutics. Unprecedented sequencing depth enables detection of subtle perturbation effects. Superseded in scale by X-Atlas/Pisces (25.6M c
LINCS L1000 / CMap | Virtual Cell, Disease Modeling, Target ID | 89.9 | Foundational pharma resource for MoA work. Batch effects require careful handling.
canSAR | Target ID, Hit ID | 89.4 | Deep oncology focus; widely-used druggability predictor.
PubChem BioAssay | Hit ID | 88.6 | Broadest HTS repository; quality heterogeneous.
BenchBB (Bench-tested Binder Benchmark) | Hit ID, Lead ID / ADMET | 88.4 | Unique in providing actual wet-lab validation infrastructure. Adaptyv runs cloud lab for protein designers. EGFR competition attracted diverse computational met
Cell Line Sensitivity Benchmark (CLSB) | Target ID, Lead ID / ADMET | 88.1 | DepMap-adjacent but adds new splits and PRISM v4.
OpenBind EV-A71 Structure-Affinity Dataset | Hit ID, Lead ID / ADMET | 84.8 | One of the largest public single-target structure-affinity datasets. High-throughput crystallography at Diamond Light Source. Plans for more targets and blind c
TargetBench (Insilico) | Target ID, Disease Modeling | 84.6 | Disease-organized target ID benchmark — unique axis. Frontier LLM leaderboard.
ISM Benchmarks: ADMET (Insilico) | Lead ID / ADMET, IND-enabling | 84.6 | Broader endpoint coverage than TDC ADMET. Side-by-side with TDC mirror on DDB.
Longevity Compound Benchmark | Hit ID, Lead ID / ADMET | 84.6 | Insilico-hosted; unique in bridging cheminformatics and aging biology.
LSD Large-Scale Docking Database | Hit ID | 82.5 | Unprecedented scale for public docking data. Includes experimental in vitro validation for subset. From UCSF Shoichet Lab. Critical for training ML scoring func
CycPeptMPDB (Cyclic Peptide Membrane Permeability Database) | IND-enabling, Lead ID / ADMET | 82.5 | De facto standard benchmark for cyclic peptide permeability prediction. Multiple 2025-2026 papers benchmark 13+ ML methods against this dataset. Critical for be
CRISPR Outcome Prediction Benchmark | Hit ID | 79.5 | Prospective track added in Q1 2026.
IgLM / AntiBERTa benchmarks | Hit ID, Developmental Candidate | 77.5 | Moves toward true developability benchmarks.
Geneformer Eval | Virtual Cell | 77.0 | Author-led eval; still widely re-run on OpenProblems tasks.
TDC DrugSyn (OncoPolyPharm + DrugComb_NCI60) | Developmental Candidate, Lead ID / ADMET | 77.0 | Important for combination therapy design.
scGPT Evaluation Suite | Virtual Cell | 73.7 | Evaluation dominated by authors' own model — flagged self-referential. Pair with OpenProblems for fair comparison.
AWS-JHU Antibody Developability Benchmark | Developmental Candidate, Lead ID / ADMET | 72.3 | Groundbreaking for antibody developability — fills gap where most benchmarks focus on binding only. Wet-lab validated ground truth across diverse formats. Zero-

Prospective (7)

Benchmark | Stages | Score | Notes
Virtual Cell Benchmark Suite 2026 | Virtual Cell | 97.0 | Successor to Open Problems perturbation benchmark. Prospectively designed; Tahoe-100M inclusion makes it industry-relevant.
ASAP Discovery Antiviral 2025 | Hit ID, Lead ID / ADMET | 93.9 | Top predictions are synthesized and tested; a rare prospective public benchmark.
Longevity Benchmark (Insilico) | Disease Modeling, Target ID, Post-market / RWE | 90.6 | Unique, broad longevity/aging benchmark slice — nothing else in the field covers aging comparably. Leaderboard features frontier LLMs.
Polaris ADMET | Lead ID / ADMET | 88.4 | Industry splits enforce blinded eval; highest industry relevance among ADMET benchmarks.
CZ Virtual Cell Challenge | Virtual Cell, Target ID | 88.1 | Gold standard-in-the-making for foundation-model era perturbation prediction. Hidden test → strong against leakage.
mRNA Design Benchmark (CodonBench 2026) | Hit ID, Lead ID / ADMET | 82.0 | Designed with Moderna and Deep Genomics; includes held-out wet-lab validation track.
Polaris Biologics (Polyreactivity / SEC / Tm) | Developmental Candidate | 79.0 | Industry-donated; growing.

Retrospective (74)

Benchmark | Stages | Score | Notes
TDC ADMET Group | Lead ID / ADMET | 100.0 | Most-adopted ADMET benchmark. 100+ leaderboard submissions.
SAbDab | Hit ID, Lead ID / ADMET, Developmental Candidate | 100.0 | Canonical antibody structure resource. Weekly updates.
Observed Antibody Space (OAS) | Hit ID, Lead ID / ADMET | 97.5 | Underlies AbLang, IgLM, AntiBERTa — industry-adopted.
PoseBusters | Hit ID | 97.0 | Exposed major failure modes in AlphaFold-Multimer/DiffDock/RFAA. Default pharma filter.
PLINDER | Hit ID | 97.0 | Replaces PDBbind as the modern leakage-controlled docking standard.
PLINDER v2 Protein-Ligand Benchmark | Hit ID | 97.0 | PLINDER is consistently cited as the go-to replacement for PDBbind in modern docking evaluation.
STRING | Target ID, Disease Modeling | 94.9 | Workhorse for network-based target ID. Distinguish functional vs physical edges.
CASP15 | Hit ID, Target ID | 94.9 | Biennial. Introduced ligand prediction category.
CASP16 | Hit ID | 94.4 | First full multimer+ligand+RNA joint eval.
CAMEO weekly targets | Hit ID | 94.4 | Weekly cadence complements biennial CASP.
Boltz-1 Structure Prediction Benchmark | Hit ID | 94.4 | Open-source companion to commercial structure predictors; benchmark splits audited against AlphaFold 3 leakage.
ORD Reaction Benchmark | Developmental Candidate | 93.9 | Modern open reaction corpus; industry-scale.
Open Problems: Perturbation Prediction | Virtual Cell | 91.9 | Best-in-class rigor (Viash workflow, hidden test, NeurIPS track).
PrimeKG | Disease Modeling, Target ID | 91.9 | Modern, well-engineered KG; strong for GNN drug repurposing.
scPerturBench | Target ID | 91.9 | Published in Nature Methods (Vol 23, Issue 2). Most comprehensive evaluation of perturbation prediction methods. Covers both genetic and chemical perturbations.
PoseX (Protein-Ligand Docking Benchmark) | Hit ID, Lead ID / ADMET | 91.4 | Key findings: AI surpasses physics-based docking overall; relaxation crucial for AI-generated poses; pocket specification boosts performance; some co-folding me
FAERS (raw) | Post-market / RWE | 91.1 | Known under-/over-reporting biases.
scPerturb | Virtual Cell, Target ID | 88.9 | Canonical harmonized resource. Strong Perturb-seq coverage; weaker for chemical perturbations.
PINDER | Hit ID | 88.9 | Expected PPI docking standard.
Practical Molecular Optimization (PMO) | Lead ID / ADMET, Developmental Candidate | 88.9 | Sample-efficiency focus exposed shortcomings of reward-maxing methods.
CoV-AbDab | Hit ID | 88.9 | Narrow modality but critical for pandemic-preparedness ML.
NucleoBench | Hit ID, Lead ID / ADMET | 88.9 | First comprehensive benchmark for nucleic acid design. Google Research + Move37 Labs collaboration. Critical for CRISPR therapies, mRNA vaccines, and gene thera
ISM Benchmarks: GPCRs (Insilico) | Hit ID, Lead ID / ADMET | 87.6 | Largest open GPCR affinity benchmark. Leaderboards test external frontier LLMs — not self-referential.
CAPRI Rounds | Hit ID | 86.3 | Oldest PPI prediction benchmark.
mRNABench (mRNA Property Prediction Benchmark) | Hit ID, Lead ID / ADMET | 86.3 | Morris Lab (University of Toronto). First standardized benchmark for mRNA biology predictions. New Mamba-based model achieves SOTA with fewer parameters. Python
ToxCast | Lead ID / ADMET, IND-enabling | 85.6 | Regulatory-grade broad tox dataset.
GNNBench-Drug 2026 | Hit ID, Lead ID / ADMET | 85.6 | IBM-led; overlaps with MoleculeNet but adds modern splits.
CT-Open (Live Clinical Trial Outcome Benchmark) | phase-iii, phase-i, phase-ii | 84.8 | Presented at ICLR 2026. Addresses critical data contamination problem in clinical trial prediction. Quarterly cadence: Winter/Spring/Summer/Fall challenges. Ful
CAFA5 | Target ID | 84.3 | CAFA5 broke attendance records.
MoleculeACE | Lead ID / ADMET | 83.3 | Critical stress-test for generalization; exposed GNN weaknesses.
MatBench | Developmental Candidate | 83.3 | Materials-science benchmark; relevant for formulation / co-crystal work.
OffSides / TWOSIDES | Post-market / RWE | 83.0 | Key benchmark for DDI + adverse event ML.
DrugComb 2.0 Synergy Benchmark | Lead ID / ADMET, Developmental Candidate | 83.0 | Industry-relevant for combination oncology.
DMPK Integrated Benchmark | Lead ID / ADMET, Developmental Candidate | 82.5 | AZ/Merck/Pfizer contributed held-out test molecules.
BELKA (Big Encoded Library for Chemical Assessment) | Hit ID | 82.5 | NeurIPS 2024 competition. Unprecedented scale for public binding data. Library split tests true OOD generalization. DEL technology enables massive chemical spac
DOCKSTRING | Hit ID | 81.3 | Vina scores are a proxy; not a replacement for wet assays.
DisGeNET | Disease Modeling, Target ID | 81.0 | Commercial license required for industry. Text-mining noise limits quality.
LIT-PCBA | Hit ID | 80.8 | Much fairer than DUD-E; small target count limits coverage.
FLIP | Target ID, Developmental Candidate | 80.8 | Complements ProteinGym (smaller but carefully designed splits).
AbBiBench (Antibody Binding Benchmark) | Lead ID / ADMET | 80.8 | Key finding: structure-conditioned inverse folding models outperform others for affinity prediction and generation. Treats complex holistically rather than anti
GuacaMol | Lead ID / ADMET, Developmental Candidate | 80.5 | First-generation generative benchmark; largely superseded by PMO for goal-directed.
Open Systems Pharmacology / PK-Sim | phase-i, IND-enabling | 80.3 | Open alternative to Simcyp.
pepADMET | IND-enabling, Lead ID / ADMET | 80.0 | Fills critical gap in peptide ADMET prediction (previous tools focused on small molecules). Covers Caco-2, PAMPA, BBB, half-life, toxicity. Supports modified pe
ADMET-AI | Lead ID / ADMET | 79.5 | Strong baselines + web tool; builds on TDC.
AMES (mutagenicity) | IND-enabling, Lead ID / ADMET | 79.5 | Core gentox endpoint.
scImmuneBench | Virtual Cell, Disease Modeling | 79.5 | Useful for cell-therapy companies evaluating immune foundation models.
MolGenBench | Hit ID, Lead ID / ADMET | 78.2 | Reveals significant gap between current generative model capabilities and real-world H2L demands. Novel metrics for target-specific active compound rediscovery
MoleculeNet | Lead ID / ADMET, Hit ID | 78.0 | Widely cited (3600+); aging splits with known scaffold leakage.
USPTO-50K / USPTO-MIT (Retrosynthesis) | Lead ID / ADMET, Developmental Candidate | 78.0 | Known leakage across canonical splits; use time-split or ORD for fairer eval.
BioDesignBench | Hit ID, Lead ID / ADMET | 77.7 | Key finding: LLM agents select appropriate tools but evaluate designs superficially, rarely comparing alternatives. Strongest agents surpass hardcoded pipelines
Tox21 | Lead ID / ADMET, IND-enabling | 77.5 | Field-standard tox benchmark; endpoint count small vs modern suites.
Obach PK Dataset | phase-i, IND-enabling, Lead ID / ADMET | 77.0 | Small but highest-quality human-PK dataset.
CASF-2016 | Hit ID | 76.2 | Authoritative scoring-power eval; update cadence slow.
PDBbind | Hit ID, Lead ID / ADMET | 75.9 | Scaffold/temporal leakage well-documented. Pair with CASF + LeakyPDB.
SIDER | Post-market / RWE, IND-enabling | 74.9 | Aging but still widely used. TWOSIDES/OffSides offer newer signals.
TAPE | Target ID, Developmental Candidate | 74.9 | Historically important; largely superseded by ProteinGym/FLIP for fitness and by PEER for broader tasks.
Simcyp Validation Sets | phase-i, phase-ii, IND-enabling | 74.4 | Industry gold standard but proprietary. Open benchmarks exist via OSP Suite.
PEER | Target ID, Developmental Candidate | 74.4 | Broader than TAPE, tighter than ProteinGym; good middle ground.
ClawBio Skill Correctness Bench | Disease Modeling, Target ID, Clinical Development | 74.2 | Independent third-party bench structurally precludes self-reference. Coverage narrow but rigor exemplary.
hERG (cardio-tox) TDC | IND-enabling, Lead ID / ADMET | 73.9 | Small but widely benchmarked. Industry pairs with SafetyPanel-5.
DILI / LD50 Zhu | IND-enabling, Lead ID / ADMET | 73.9 | Essential IND-enabling endpoints.
DUD-E | Hit ID | 72.9 | Well-known analog bias in decoy selection; use LIT-PCBA / PLINDER for fair VS.
MOSES | Lead ID / ADMET, Developmental Candidate | 72.4 | Distribution-learning metrics known to saturate.
PerturbBench | Virtual Cell | 71.4 | Pharma-led (Genentech); well-specified eval.
ScaleBench: Molecular Property Prediction | Lead ID / ADMET, Hit ID | 69.5 | Timely study showing compact specialized models remain competitive vs. large foundation models for molecular property prediction. Key finding: performance depen
DrugPlayGround | Hit ID, Target ID, Lead ID / ADMET | 66.8 | First unified platform to benchmark both LLMs and molecular embeddings for drug discovery. Includes Head-Gordon lab (Berkeley) — reputable. Early days for adopt
ClinTox | Lead ID / ADMET, IND-enabling | 65.6 | Small, binary; saturated. Useful only as sanity check.
CellBench-LS | Virtual Cell, Disease Modeling | 65.2 | Addresses critical gap: most scFM benchmarks use full supervision. Low-supervision evaluation is more realistic for clinical/translational settings. Finds scFMs
DEKOIS 2.0 | Hit ID | 57.5 | Historical reference; use LIT-PCBA / PLINDER for modern VS.
FoldBench | Hit ID | 55.8 | Published Nat Comms 2026. Covers 9 task types (monomer, multimer, nucleic acid, ligand, ion, antibody-antigen, etc). Revealed that ligand docking accuracy decre
OpenADMET / Avoid-ome | Lead ID / ADMET, IND-enabling | 53.6 | Nat Comms perspective (May 2026) by top structural biology/comp chem leaders (Fraser, Chodera, Murcko, Walters). Proposes mechanistic ADMET datasets grounded in
SAIR | Hit ID | 51.2 | ICLR 2026 paper from SandboxAQ. Synthetic data approach avoids PDB biases but raises generalizability questions. Self-referential flag: SandboxAQ models dominat
MPP Foundation Model Benchmark | Lead ID / ADMET, Hit ID | 50.3 | Valuable meta-benchmark tracing evolution from descriptors to foundation models. Includes industry perspective and highlights evaluation protocol challenges. Mu
CompGen-MLIP: Compositional Generalisation for ML Interatomic Potentials | Hit ID, Lead ID / ADMET | 39.8 | Addresses important gap in MLIP evaluation — OOD generalization to unseen molecular compositions. Shows current models struggle (10x error on OOD). More computa

None (9)

Benchmark | Stages | Score | Notes
DMPKBench (DMPK LLM Evaluation Benchmark) | IND-enabling, Lead ID / ADMET | 83.8 | From GHDDI (Gates Foundation China). LLM accuracy ranges 11-89% across tasks. Models excel at knowledge tasks but struggle with multi-modal reasoning (PK curves
BOOM (Benchmarking Out-Of-Distribution Molecular Predictions) | Hit ID, Lead ID / ADMET | 83.3 | NeurIPS 2025. Critical benchmark showing the gap between in-distribution and OOD performance in molecular ML. Highly relevant for real-world drug discovery wher
FGBench (Functional Group Molecular Property Reasoning) | Hit ID, Lead ID / ADMET | 77.0 | NeurIPS 2025 Datasets & Benchmarks Track. Reveals LLMs struggle with FG-level property reasoning. Addresses gap between molecular-level and substructure-level u
PerturbArena | Target ID | 74.4 | Complementary to scPerturBench. Emphasizes metric divergence analysis and practical method selection guidelines. Shows limited robustness to shifts across cellu
DO Challenge 2025 (DeepOrigin Autonomous Drug Discovery) | Hit ID | 70.9 | First benchmark specifically for AI agents (not just models) in drug discovery. Multi-agent system 'Deep Thought' outperformed most human teams but underperform
PDFBench (De Novo Protein Design from Function) | Hit ID | 68.9 | First unified benchmark for function-guided protein generation. Addresses comparison challenges from proprietary datasets and inconsistent metrics. Inter-metric
VSDS-vd (Virtual Screening Decoy Set for Docking) | Hit ID | 65.8 | Chinese research benchmark (Zhejiang University). Finds AI methods show deficiencies in physical soundness of docked structures despite good VS performance. Pro
AssayBench | Target ID, Hit ID | 63.3 | Novel framing: gene rank prediction from natural language experiment descriptions. Part of broader virtual cell revolution. Tests whether LLMs can replace actua
VibeProteinBench (VPD-Bench) | Target ID, Developmental Candidate | 43.2 | Novel concept: first benchmark for 'vibe protein design' using natural language interfaces. Very new (May 2026), no community adoption yet. Evaluates LLM+protei