Evaluation
A critical analysis of how current benchmarks inflate reported performance. Based on Section 7 of the review, these interactive tables demonstrate that inventory scope and metric choice are at least as important as algorithmic innovation.
Inventory Controls Difficulty
92% vs 48%
Average Solv-1 for virtual vs. physical inventory tiers. Expanding the stock set inflates success rates without improving the planner.
STR ≠ Chemical Validity
99.7% → 11.9%
Chemformer achieves a near-perfect stock-termination rate (STR) but recovers only 11.9% of ground-truth routes. High Solv-1 does not imply correct chemistry.
The Complexity Cliff
81% → 9%
AiZynthFinder's reconstruction accuracy drops from 81% on 2-step routes to 9% on 6-step routes. Performance collapses at depth.
Stock Inflation
Impact of Stock Set Scope on Reported Solvability. Models evaluated against extended virtual libraries (~10⁷–10⁹ compounds) frequently report near-perfect stock-termination (Solv-1), because the expanded termination criteria shorten the search depth required to reach a stocked compound. Evaluations constrained to physically in-stock inventories (~10⁵–10⁶ compounds) yield substantially lower termination rates.
| Reference | Model | Benchmark | Stock Source | Inventory Size (Tier) | Solv-1 |
|---|---|---|---|---|---|
| Virtual Tier: make-on-demand / extended (~10⁷–10⁹ compounds) (10) | | | | | |
| Chen et al. 2020 | Retro* | USPTO-190ᵇ | eMolecules | ~231 Mᵃ (Virtual) | 86.84% |
| Zhao et al. 2024 | MEEA* | USPTO-190ᵇ | eMolecules | ~231 Mᵃ (Virtual) | 100.0% |
| Xie et al. 2022 | RetroGraph | USPTO-190ᵇ | eMolecules | ~231 Mᵃ (Virtual) | 99.47% |
| Wang & Montana 2025 | InterRetro | Retro*-190ᵇ | eMolecules | >100 M (Virtual) | 100.0% |
| Blackshaw et al. 2025 | Enh-MCTS | ChEMBL (Rand) | ZINC + eMolecules | ~35 M (Virtual) | 99.20% |
| Liu et al. 2023 | PDVN | USPTO-190ᵇ | eMolecules | ~23.1 M (Virtual) | 99.47% |
| Shee et al. 2025 | DirectMultiStep | ChEMBL-5000 | eMolecules | ~23.1 M (Virtual) | 75.58% |
| Wang et al. 2025 | LLM-Syn-Planner | USPTO-190ᵇ | eMolecules | ~23 M (Virtual) | 92.60% |
| Guo et al. 2024 | ReSynZ | Retro*-190ᵇ | Sigma + eMol | ~18 M (Virtual) | 73.54% |
| Torren-Peraire et al. 2024 | Chemformer (Search) | Caspyrus10k | ZINC Full | ~17.4 M (Virtual) | 94.10% |
| Physical Tier: off-the-shelf / in-stock (~10⁵–10⁶ compounds) (6) | | | | | |
| Shee et al. 2025 | DirectMultiStep | ChEMBL-5000 | ASKCOS Buyables | ~330 k (Physical) | 68.66% |
| Sun et al. 2025 | SynLlama | ChEMBL | Enamine BB | ~230 k (Physical) | 19.70% |
| Akhmetshin et al. 2025 | SynPlanner (MCTS) | PaRoutes (n=5) | ASKCOS Buyables | ~186 k (Physical) | 56.24% |
| Akhmetshin et al. 2025 | SynPlanner (MCTS) | SAScore >5 | ASKCOS Buyables | ~186 k (Physical) | 18.71% |
| Wang et al. 2020 | µMCT-dc | Reaxys (Rand) | Sigma + eMol | ~107 k (Physical) | 76.20% |
| Guo et al. 2024 | ReSynZ | Retro*-190ᵇ | Sigma-Aldrich | ~85 k (Physical) | 50.26% |
ᵃ Reported inventory size. The data file in the original repository contains ~23.1M compounds, a number consistent with more recent literature.
ᵇ The USPTO-190 and Retro*-190 benchmarks refer to the same set of 190 molecules, pre-filtered for high single-step model performance, introduced in Chen et al. (2020).
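The headline figures of 92% and 48% can be reproduced from the table above as unweighted means of the reported Solv-1 values in each tier. The sketch below is illustrative only: `ENTRIES` and `tier_means` are names introduced here, and the values are transcribed by hand from the table.

```python
from collections import defaultdict
from statistics import mean

# (model, tier, reported Solv-1 in %), transcribed from the table above.
ENTRIES = [
    ("Retro*", "virtual", 86.84),
    ("MEEA*", "virtual", 100.0),
    ("RetroGraph", "virtual", 99.47),
    ("InterRetro", "virtual", 100.0),
    ("Enh-MCTS", "virtual", 99.20),
    ("PDVN", "virtual", 99.47),
    ("DirectMultiStep", "virtual", 75.58),
    ("LLM-Syn-Planner", "virtual", 92.60),
    ("ReSynZ", "virtual", 73.54),
    ("Chemformer (Search)", "virtual", 94.10),
    ("DirectMultiStep", "physical", 68.66),
    ("SynLlama", "physical", 19.70),
    ("SynPlanner (MCTS), PaRoutes", "physical", 56.24),
    ("SynPlanner (MCTS), SAScore >5", "physical", 18.71),
    ("µMCT-dc", "physical", 76.20),
    ("ReSynZ", "physical", 50.26),
]


def tier_means(entries):
    """Return the unweighted mean Solv-1 (in %) for each inventory tier."""
    by_tier = defaultdict(list)
    for _model, tier, solv in entries:
        by_tier[tier].append(solv)
    return {tier: mean(values) for tier, values in by_tier.items()}


if __name__ == "__main__":
    for tier, avg in tier_means(ENTRIES).items():
        print(f"{tier}: {avg:.1f}%")
```

These are simple averages over different models and benchmarks, so the gap is indicative rather than a controlled comparison. The rows that hold model and benchmark fixed give the cleaner contrast: DirectMultiStep on ChEMBL-5000 (75.58% vs 68.66%) and ReSynZ on Retro*-190 (73.54% vs 50.26%).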
Where's the 231M?
Several entries in the stock inflation table above report an eMolecules inventory of 231 million compounds. Where does this number come from? The figure originates in Retro* (ICML 2020) and has since been reprinted across multiple peer-reviewed publications. There's just one small problem with it.
The Citation Chain
Six papers report using the ~231M eMolecules inventory. Of these, three publish the actual stock file. The remaining three cite one of the others without publishing their own copy.
| Paper | Venue | Year | Stock Source | eMolecules Version | Publishes File | Claimed Size |
|---|---|---|---|---|---|---|
| Retro* (Chen et al.) | ICML | 2020 | eMolecules direct | 2019-11-01 | | ~231 M |
| Self-Improved Retro (Kim et al.) | ICML | 2021 | eMolecules direct | 2019-11-01 | | ~231 M |
| RetroGraph (Xie et al.) | KDD | 2022 | Retro*, Self-Improved Retro | 2019-11-01 | | ~231 M |
| GRASP (Yu et al.) | NeurIPS | 2022 | eMolecules direct | 2019-11-01 | | ~231 M |
| EG-MCTS (Hong et al.) | Commun Chem | 2023 | eMolecules direct | 2019-11-01 | | ~231 M |
| DreamRetroEr (Zhang et al.) | Nat Commun | 2025 | Retro* | eMolecules (no date) | | ~231 M |
All 3 published stock files contain exactly 23,081,629 entries. They share an identical SHA-256 hash:
92b490c76e741212c82e177b22a89687aab0e53c4c78772aed333e540d0e38d8
In Print
The ~231M figure as it appears in each of the six publications, all peer-reviewed:

Retro*. Chen et al. (ICML 2020), origin of the ~231M figure
Self-Improved Retro. Kim et al. (ICML 2021)
RetroGraph. Xie et al. (KDD 2022)
GRASP. Yu et al. (NeurIPS 2022)
EG-MCTS. Hong et al. (Commun Chem 2023)
DreamRetroEr. Zhang et al. (Nat Commun 2025)
Whether an order-of-magnitude discrepancy in the single most consequential experimental variable—one whose value, as shown above, can swing reported success rates by over 50 percentage points—could persist through peer review at ICML, KDD, NeurIPS, and Nature Communications is left as an exercise for the reader.
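The file-level check behind these numbers, an entry count plus a SHA-256 digest, can be sketched with the Python standard library. The function names `fingerprint_stock_file` and `inflation_factor` are introduced here for illustration, and the one-entry-per-line layout is an assumption: the published stock files may use a different format (a header row or a binary container, for example), in which case the counting step would need adapting.

```python
import hashlib

CLAIMED_SIZE = 231_000_000  # "~231 M" as printed in the papers
ACTUAL_SIZE = 23_081_629    # entry count of the published stock files


def fingerprint_stock_file(path):
    """Return (SHA-256 hex digest, number of non-empty lines) for a file.

    Assumes one inventory entry per line (hypothetical layout).
    """
    digest = hashlib.sha256()
    n_entries = 0
    with open(path, "rb") as fh:
        for line in fh:
            # Feeding the file line by line yields the same digest as
            # hashing the whole byte stream at once.
            digest.update(line)
            if line.strip():
                n_entries += 1
    return digest.hexdigest(), n_entries


def inflation_factor(claimed, actual):
    """Ratio of the claimed inventory size to the counted one."""
    return claimed / actual


if __name__ == "__main__":
    print(f"claimed / actual = {inflation_factor(CLAIMED_SIZE, ACTUAL_SIZE):.2f}")
```

Two files with the same digest are byte-identical, which is why a shared hash across the three published copies indicates a single underlying stock file rather than three independent snapshots. The ratio of the claimed to the counted size comes out at almost exactly ten.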
The Validity Gap and the Complexity Cliff
This section examines the disconnect between Tier 1 success (topological stock-termination) and the higher-tier criteria related to chemical plausibility (Tiers 2-3), using two complementary views from recent audits.
Panel A: The Inverse Correlation of Solvability, Accuracy, and Speed
Models Matter (Torren-Peraire et al. 2024)
Chemformer achieves near-perfect stock-termination (99.7%) but recovers only 11.9% of ground-truth routes and requires ~8 hours per target. AiZynthFinder shows the opposite trade-off: the lowest termination rate (66.3%) but the highest route accuracy (61.8%), at roughly 160 seconds per target.
| Model | Policy Type | Solvability (Stock-Termination) | Route Accuracy (Ground Truth, Top-50) | Search Time (Per Target) |
|---|---|---|---|---|
| Chemformer | Direct Sequence | 99.70% | 11.90% | ~7.8 h |
| LocalRetro | Explicit Graph | 86.00% | 36.10% | ~160 s |
| AiZynthFinder | Explicit Graph | 66.30% | 61.80% | ~160 s |
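As a rough check on the inverse relationship in Panel A, the sketch below computes the Pearson coefficient between stock-termination and top-50 route accuracy for the three rows of the table. With only three models this is descriptive rather than statistically meaningful, and the `pearson` helper and `PANEL_A` dictionary are names written out here for illustration.

```python
from math import sqrt

# model -> (stock-termination %, top-50 route accuracy %), from the table above.
PANEL_A = {
    "Chemformer": (99.70, 11.90),
    "LocalRetro": (86.00, 36.10),
    "AiZynthFinder": (66.30, 61.80),
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


if __name__ == "__main__":
    solvability = [s for s, _ in PANEL_A.values()]
    accuracy = [a for _, a in PANEL_A.values()]
    print(f"Pearson r = {pearson(solvability, accuracy):.3f}")
```

The coefficient comes out close to -1: across these three models, the ranking by solvability is exactly the reverse of the ranking by route accuracy.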