| TDC ADMET Group | Lead ID / ADMET | 100.0 | Most-adopted ADMET benchmark. 100+ leaderboard submissions. |
| SAbDab | Hit ID; Lead ID / ADMET; Developmental Candidate | 100.0 | Canonical antibody structure resource. Weekly updates. |
| Observed Antibody Space (OAS) | Hit ID; Lead ID / ADMET | 97.5 | Underlies AbLang, IgLM, AntiBERTa; industry-adopted. |
| PoseBusters | Hit ID | 97.0 | Exposed major failure modes in AlphaFold-Multimer/DiffDock/RFAA. Default pharma filter. |
| PLINDER | Hit ID | 97.0 | Replaces PDBbind as the modern leakage-controlled docking standard. |
| PLINDER v2 Protein-Ligand Benchmark | Hit ID | 97.0 | PLINDER is consistently cited as the go-to replacement for PDBbind in modern docking evaluation. |
| STRING | Target ID; Disease Modeling | 94.9 | Workhorse for network-based target ID. Distinguish functional vs physical edges. |
| CASP15 | Hit ID; Target ID | 94.9 | Biennial. Introduced ligand prediction category. |
| CASP16 | Hit ID | 94.4 | First full multimer+ligand+RNA joint eval. |
| CAMEO weekly targets | Hit ID | 94.4 | Weekly cadence complements biennial CASP. |
| Boltz-1 Structure Prediction Benchmark | Hit ID | 94.4 | Open-source companion to commercial structure predictors; benchmark splits audited against AlphaFold 3 leakage. |
| ORD Reaction Benchmark | Developmental Candidate | 93.9 | Modern open reaction corpus; industry-scale. |
| Open Problems: Perturbation Prediction | Virtual Cell | 91.9 | Best-in-class rigor (Viash workflow, hidden test, NeurIPS track). |
| PrimeKG | Disease Modeling; Target ID | 91.9 | Modern, well-engineered KG; strong for GNN drug repurposing. |
| scPerturBench | Target ID | 91.9 | Published in Nature Methods (Vol 23, Issue 2). Most comprehensive evaluation of perturbation prediction methods. Covers both genetic and chemical perturbations. |
| PoseX (Protein-Ligand Docking Benchmark) | Hit ID; Lead ID / ADMET | 91.4 | Key findings: AI surpasses physics-based docking overall; relaxation crucial for AI-generated poses; pocket specification boosts performance; some co-folding me |
| FAERS (raw) | Post-market / RWE | 91.1 | Known under-/over-reporting biases. |
| scPerturb | Virtual Cell; Target ID | 88.9 | Canonical harmonized resource. Strong Perturb-seq coverage; weaker for chemical perturbations. |
| PINDER | Hit ID | 88.9 | Expected PPI docking standard. |
| Practical Molecular Optimization (PMO) | Lead ID / ADMET; Developmental Candidate | 88.9 | Sample-efficiency focus exposed shortcomings of reward-maxing methods. |
| CoV-AbDab | Hit ID | 88.9 | Narrow modality but critical for pandemic-preparedness ML. |
| NucleoBench | Hit ID; Lead ID / ADMET | 88.9 | First comprehensive benchmark for nucleic acid design. Google Research + Move37 Labs collaboration. Critical for CRISPR therapies, mRNA vaccines, and gene thera |
| ISM Benchmarks: GPCRs (Insilico) | Hit ID; Lead ID / ADMET | 87.6 | Largest open GPCR affinity benchmark. Leaderboards test external frontier LLMs (not self-referential). |
| CAPRI Rounds | Hit ID | 86.3 | Oldest PPI prediction benchmark. |
| mRNABench (mRNA Property Prediction Benchmark) | Hit ID; Lead ID / ADMET | 86.3 | Morris Lab (University of Toronto). First standardized benchmark for mRNA biology predictions. New Mamba-based model achieves SOTA with fewer parameters. Python |
| ToxCast | Lead ID / ADMET; IND-enabling | 85.6 | Regulatory-grade broad tox dataset. |
| GNNBench-Drug 2026 | Hit ID; Lead ID / ADMET | 85.6 | IBM-led; overlaps with MoleculeNet but adds modern splits. |
| CT-Open (Live Clinical Trial Outcome Benchmark) | phase-iii; phase-i; phase-ii | 84.8 | Presented at ICLR 2026. Addresses critical data contamination problem in clinical trial prediction. Quarterly cadence: Winter/Spring/Summer/Fall challenges. Ful |
| CAFA5 | Target ID | 84.3 | CAFA5 broke attendance records. |
| MoleculeACE | Lead ID / ADMET | 83.3 | Critical stress-test for generalization; exposed GNN weaknesses. |
| MatBench | Developmental Candidate | 83.3 | Materials-science benchmark; relevant for formulation / co-crystal work. |
| OffSides / TWOSIDES | Post-market / RWE | 83.0 | Key benchmark for DDI + adverse event ML. |
| DrugComb 2.0 Synergy Benchmark | Lead ID / ADMET; Developmental Candidate | 83.0 | Industry-relevant for combination oncology. |
| DMPK Integrated Benchmark | Lead ID / ADMET; Developmental Candidate | 82.5 | AZ/Merck/Pfizer contributed held-out test molecules. |
| BELKA (Big Encoded Library for Chemical Assessment) | Hit ID | 82.5 | NeurIPS 2024 competition. Unprecedented scale for public binding data. Library split tests true OOD generalization. DEL technology enables massive chemical spac |
| DOCKSTRING | Hit ID | 81.3 | Vina scores are a proxy; not a replacement for wet assays. |
| DisGeNET | Disease Modeling; Target ID | 81.0 | Commercial license required for industry. Text-mining noise limits quality. |
| LIT-PCBA | Hit ID | 80.8 | Much fairer than DUD-E; small target count limits coverage. |
| FLIP | Target ID; Developmental Candidate | 80.8 | Complements ProteinGym (smaller but carefully designed splits). |
| AbBiBench (Antibody Binding Benchmark) | Lead ID / ADMET | 80.8 | Key finding: structure-conditioned inverse folding models outperform others for affinity prediction and generation. Treats complex holistically rather than anti |
| GuacaMol | Lead ID / ADMET; Developmental Candidate | 80.5 | First-generation generative benchmark; largely superseded by PMO for goal-directed optimization. |
| Open Systems Pharmacology / PK-Sim | phase-i; IND-enabling | 80.3 | Open alternative to Simcyp. |
| pepADMET | IND-enabling; Lead ID / ADMET | 80.0 | Fills critical gap in peptide ADMET prediction (previous tools focused on small molecules). Covers Caco-2, PAMPA, BBB, half-life, toxicity. Supports modified pe |
| ADMET-AI | Lead ID / ADMET | 79.5 | Strong baselines + web tool; builds on TDC. |
| AMES (mutagenicity) | IND-enabling; Lead ID / ADMET | 79.5 | Core gentox endpoint. |
| scImmuneBench | Virtual Cell; Disease Modeling | 79.5 | Useful for cell-therapy companies evaluating immune foundation models. |
| MolGenBench | Hit ID; Lead ID / ADMET | 78.2 | Reveals significant gap between current generative model capabilities and real-world H2L demands. Novel metrics for target-specific active compound rediscovery |
| MoleculeNet | Lead ID / ADMET; Hit ID | 78.0 | Widely cited (3,600+ citations); aging splits with known scaffold leakage. |
| USPTO-50K / USPTO-MIT (Retrosynthesis) | Lead ID / ADMET; Developmental Candidate | 78.0 | Known leakage across canonical splits; use time-split or ORD for fairer eval. |
| BioDesignBench | Hit ID; Lead ID / ADMET | 77.7 | Key finding: LLM agents select appropriate tools but evaluate designs superficially, rarely comparing alternatives. Strongest agents surpass hardcoded pipelines |
| Tox21 | Lead ID / ADMET; IND-enabling | 77.5 | Field-standard tox benchmark; endpoint count small vs modern suites. |
| Obach PK Dataset | phase-i; IND-enabling; Lead ID / ADMET | 77.0 | Small but highest-quality human-PK dataset. |
| CASF-2016 | Hit ID | 76.2 | Authoritative scoring-power eval; update cadence slow. |
| PDBbind | Hit ID; Lead ID / ADMET | 75.9 | Scaffold/temporal leakage well-documented. Pair with CASF + LeakyPDB. |
| SIDER | Post-market / RWE; IND-enabling | 74.9 | Aging but still widely used. TWOSIDES/OffSides offer newer signals. |
| TAPE | Target ID; Developmental Candidate | 74.9 | Historically important; largely superseded by ProteinGym/FLIP for fitness and by PEER for broader tasks. |
| Simcyp Validation Sets | phase-i; phase-ii; IND-enabling | 74.4 | Industry gold standard but proprietary. Open benchmarks exist via OSP Suite. |
| PEER | Target ID; Developmental Candidate | 74.4 | Broader than TAPE, tighter than ProteinGym; good middle ground. |
| ClawBio Skill Correctness Bench | Disease Modeling; Target ID; Clinical Development | 74.2 | Independent third-party bench structurally precludes self-reference. Coverage narrow but rigor exemplary. |
| hERG (cardio-tox) TDC | IND-enabling; Lead ID / ADMET | 73.9 | Small but widely benchmarked. Industry pairs with SafetyPanel-5. |
| DILI / LD50 Zhu | IND-enabling; Lead ID / ADMET | 73.9 | Essential IND-enabling endpoints. |
| DUD-E | Hit ID | 72.9 | Well-known analog bias in decoy selection; use LIT-PCBA / PLINDER for fair VS. |
| MOSES | Lead ID / ADMET; Developmental Candidate | 72.4 | Distribution-learning metrics known to saturate. |
| PerturbBench | Virtual Cell | 71.4 | Pharma-led (Genentech); well-specified eval. |
| ScaleBench: Molecular Property Prediction | Lead ID / ADMET; Hit ID | 69.5 | Timely study showing compact specialized models remain competitive vs. large foundation models for molecular property prediction. Key finding: performance depen |
| DrugPlayGround | Hit ID; Target ID; Lead ID / ADMET | 66.8 | First unified platform to benchmark both LLMs and molecular embeddings for drug discovery. Includes Head-Gordon lab (Berkeley) — reputable. Early days for adopt |
| ClinTox | Lead ID / ADMET; IND-enabling | 65.6 | Small, binary; saturated. Useful only as sanity check. |
| CellBench-LS | Virtual Cell; Disease Modeling | 65.2 | Addresses critical gap: most scFM benchmarks use full supervision. Low-supervision evaluation is more realistic for clinical/translational settings. Finds scFMs |
| DEKOIS 2.0 | Hit ID | 57.5 | Historical reference; use LIT-PCBA / PLINDER for modern VS. |
| FoldBench | Hit ID | 55.8 | Published Nat Comms 2026. Covers 9 task types (monomer, multimer, nucleic acid, ligand, ion, antibody-antigen, etc). Revealed that ligand docking accuracy decre |
| OpenADMET / Avoid-ome | Lead ID / ADMET; IND-enabling | 53.6 | Nat Comms perspective (May 2026) by top structural biology/comp chem leaders (Fraser, Chodera, Murcko, Walters). Proposes mechanistic ADMET datasets grounded in |
| SAIR | Hit ID | 51.2 | ICLR 2026 paper from SandboxAQ. Synthetic data approach avoids PDB biases but raises generalizability questions. Self-referential flag: SandboxAQ models dominat |
| MPP Foundation Model Benchmark | Lead ID / ADMET; Hit ID | 50.3 | Valuable meta-benchmark tracing evolution from descriptors to foundation models. Includes industry perspective and highlights evaluation protocol challenges. Mu |
| CompGen-MLIP: Compositional Generalisation for ML Interatomic Potentials | Hit ID; Lead ID / ADMET | 39.8 | Addresses important gap in MLIP evaluation: OOD generalization to unseen molecular compositions. Shows current models struggle (10x error on OOD). More computa |