SAIR

Large-scale synthetic structural dataset for protein-ligand interactions, enabling deep learning models to learn binding interactions without experimental structure biases. Published at ICLR 2026.

Composite

51.2

Experimental validation

Retrospective

Stages

Hit ID

Modalities

small molecule

Task types

docking, classification, regression

Size

complexes: 500,000
proteins: 15,000
molecules: 200,000
splits: train 400,000 / val 50,000 / test 50,000
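
As a quick sanity check on the sizes listed above, the following minimal sketch confirms that the three splits add up to the stated number of complexes and reports each split's share. The variable names are illustrative only and are not taken from the SAIR codebase; the counts are copied from this entry.

```python
# Counts copied from the Size field of this entry.
total_complexes = 500_000
splits = {"train": 400_000, "val": 50_000, "test": 50_000}

# The splits should partition the full set of complexes.
assert sum(splits.values()) == total_complexes

# Fraction of the dataset assigned to each split.
fractions = {name: count / total_complexes for name, count in splits.items()}
print(fractions)  # {'train': 0.8, 'val': 0.1, 'test': 0.1}
```

The listed sizes correspond to an 80/10/10 train/val/test partition.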

License

CC-BY-NC

First release

2025-06-17

Last updated

2026-04-25

Official site

→ project page

Leaderboard

→ leaderboard

Dataset

→ dataset

Code / GitHub

→ repository

HuggingFace

→ HF

Paper

SAIR: Enabling Deep Learning for Protein-Ligand Interactions with a Synthetic Structural Dataset · Pablo Lemos, Zane Beckwith, Sasaank Bandi, Maarten van Damme, Jordan Crivelli-Decker, Benjamin J. Shields, Thomas Merth · 2026 · paper · doi:10.1101/2025.06.17.660168 · 15 citations

Flags

self_referential

Experts

—

Groups

—

Hosted by

—

Related benchmarks

PoseBusters, PLINDER, FoldBench

Rubric (7-criterion)

rigor

coverage

maintenance

adoption

quality

accessibility

industry_relevance

Notes

ICLR 2026 paper from SandboxAQ. The synthetic-data approach avoids PDB biases but raises questions about generalizability. Flagged self_referential because SandboxAQ models dominate the evaluations. The dataset is available on HuggingFace. Key insight: synthetic data can match or exceed experimental data for binding pose prediction.
