SAIR
Large-scale synthetic structural dataset for protein-ligand interactions, intended to let deep learning models learn binding interactions without the biases of experimentally determined structures. Published at ICLR 2026.
Composite
51.2
Experimental validation
Retrospective
Stages
Hit ID
Modalities
small molecule
Task types
docking, classification, regression
Size
complexes: 500,000
proteins: 15,000
molecules: 200,000
splits: train 400,000 · val 50,000 · test 50,000
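As a quick consistency check on the Size figures, the split counts listed above can be summed and converted to fractions. This is only a sketch over the numbers shown on this card; the names `splits`, `total_complexes`, and `split_fractions` are illustrative and not part of any SAIR tooling.

```python
# Split sizes and complex count copied from the Size entry of this card.
splits = {"train": 400_000, "val": 50_000, "test": 50_000}
total_complexes = 500_000


def split_fractions(split_sizes):
    """Return each split's share of the total number of examples."""
    total = sum(split_sizes.values())
    return {name: count / total for name, count in split_sizes.items()}


# The three splits should account for every listed complex.
assert sum(splits.values()) == total_complexes

for name, fraction in split_fractions(splits).items():
    print(f"{name}: {fraction:.0%}")
```

With the listed counts this corresponds to an 80/10/10 train/val/test partition of the 500,000 complexes.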
License
CC-BY-NC
First release
2025-06-17
Last updated
2026-04-25
Official site
Leaderboard
→ leaderboard
Dataset
→ dataset
Code / GitHub
→ repository
HuggingFace
Paper
SAIR: Enabling Deep Learning for Protein-Ligand Interactions with a Synthetic Structural Dataset · Pablo Lemos, Zane Beckwith, Sasaank Bandi, Maarten van Damme, Jordan Crivelli-Decker, Benjamin J. Shields, Thomas Merth · 2026 · paper · doi:10.1101/2025.06.17.660168 · 8 citations
Flags
self_referential
Experts
—
Groups
—
Hosted by
—
Related benchmarks
Rubric (7-criterion)
rigor
4
coverage
4
maintenance
3
adoption
2
quality
4
accessibility
4
industry_relevance
4
Notes
ICLR 2026 paper from SandboxAQ. The synthetic-data approach avoids PDB biases but raises questions about generalizability. Self-referential flag: SandboxAQ models dominate the evaluations. The dataset is available on HuggingFace. Key insight: synthetic data can match or exceed experimental data for binding pose prediction.