Key Ideas

10 core arguments from the review, distilled for the web. Each idea is self-contained and directly linkable — share any section by copying the URL after clicking its title.

The section references at the end of each idea link back to the relevant part of the full text.

Domain Ideas

What the review argues about chemistry and machine learning.

The field is solving regression before mastering generation. Synthesis planning — not property prediction — is the foundational pre-training objective most likely to yield robust, generalizable chemical reasoning.

AI in chemistry has historically focused on predicting scalar properties from static molecular graphs. Graph neural networks and transformers have improved benchmark accuracy on tasks like solubility and binding affinity, but the efficiency of small-molecule drug discovery has not seen a commensurate improvement. Models propose inaccessible candidates and mispredict activity cliffs—symptoms of optimizing for the semantics of function (what a molecule does) before learning the syntax of construction (how a molecule is made).

The parallels to NLP research are instructive. Early NLP systems relied on supervised learning for specific scalar tasks such as sentiment analysis or entailment. A fundamental shift occurred when the field adopted a generative objective: predicting the next token in a sequence. By optimizing for the grammar of text rather than a specific label, models acquired internal representations capable of generalized reasoning. Chemical AI today occupies a developmental stage analogous to pre-generative NLP.

But identifying the chemical equivalent of next-token prediction requires distinguishing between the syntax of notation and the syntax of matter. Models pre-trained with masked language modeling on SMILES strings frequently fail to outperform simple regression baselines—scaling to 1.1 billion molecules yields diminishing returns, consistent with models learning statistical regularities of the notation rather than internalized chemical rules. The true syntax of chemistry is the transformation of matter through reactivity. Synthesis planning is the chemical analogue of next-token prediction.

Key Evidence

  • SMILES-based pre-training at 1.1B molecules yields diminishing returns on downstream property tasks — models learn notation statistics, not chemistry
  • In NLP and CV, the shift from label prediction to structural/generative objectives (next-token, image-text correspondence) produced representations far more durable under distribution shift
  • Reactivity-aware pre-training (REMO, HiCLR) captures functional group nuances better than static baselines
  • Framing activity prediction as conditional structure generation improved performance on activity cliffs

So What?

If the field is to build a true chemical foundation model with emergent reasoning, it must shift from correlating static graphs with property labels to training on the causal logic of molecular transformation.


The critique that AI learns "spurious correlations" misses that science has always run on phenomenology. The problem is not correlation—it's brittleness.

The critique of AI in science often centers on "spurious correlations"—the idea that models learn cheap statistical tricks rather than causal mechanisms. This critique misses the history of science itself. Science has always run on empirical laws that work; when they lack a micro-reductionist explanation, we simply dignify them with a respectable name: phenomenology.

Newton's law of universal gravitation predicted planetary motion with breathtaking accuracy, yet Newton could not explain how gravity acted at a distance. His famous defense — "Hypotheses non fingo" ("I feign no hypotheses") — was an admission that a perfect phenomenological model does not require a known mechanism to be scientific. Thermodynamics let us master the steam engine and phase transitions long before we accepted the existence of atoms. Pauling's electronegativity has no single quantum mechanical operator—it is a heuristic, a scalar summary of complex vector fields, a "spurious correlation" by strict physics standards, yet one of the most powerful predictive concepts in chemistry.

Deep learning is the ultimate engine for automated phenomenology. The crisis we face is not that these models rely on correlations, but that their correlations are brittle. They fail at activity cliffs—where a tiny structural change causes a massive property shift—because they map static graphs directly to scalar labels, skipping the causal layer of physical interaction. A standard model sees a functional group as a fixed feature vector; it does not inherently understand that the same group acts as a hydrogen-bond donor in one context, a nucleophile in another, and a steric clash in a third.

So What?

We should not demand that every model learn "true" mechanisms. We should demand that its phenomenology be robust — that the correlations it discovers generalize under distribution shift.

This idea is not in the formal review: it's a web bonus!

Models trained on static correlations systematically fail at activity cliffs, revealing a fundamental flaw in the learning paradigm. We posit this failure stems from learning superficial topology, and that a model pre-trained on the causal logic of synthesis will build more robust physical representations.

Despite the proliferation of increasingly sophisticated deep learning architectures, rigorous empirical surveys indicate that improvements in in-distribution performance have not consistently translated to out-of-distribution settings. Gaussian processes and support vector machines using fixed fingerprints often match or exceed the predictive accuracy of more complex neural architectures. The disparity is most acute at activity cliffs: instances where minor structural modifications lead to disproportionate shifts in biological potency.

This failure has a mechanistic origin. Standard optimization objectives (e.g., MSE) encourage models to smooth the structure-activity landscape, effectively treating the sharp discontinuities characteristic of specific molecular recognition events as noise rather than signal. The subtle electronic or steric syntax that distinguishes a therapeutic from a toxic analogue is lost in the learned representation.
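To make the smoothing argument concrete, here is a toy sketch (not from the review): a Gaussian kernel smoother stands in for any smoothness-biased, MSE-minimizing regressor, fit to a one-dimensional landscape with a single cliff. The landscape, bandwidth, and smoother are all illustrative assumptions.

```python
import numpy as np

# Toy 1D structure-activity landscape with one "activity cliff" at x = 0.5:
# nearby structures on either side have very different potency.
x = np.linspace(0, 1, 101)
y = np.where(x < 0.5, 0.0, 1.0)

def kernel_smooth(x_train, y_train, x_query, bandwidth=0.1):
    """Gaussian kernel smoother: a stand-in for a smoothness-biased,
    MSE-minimizing regressor."""
    w = np.exp(-((x_query[:, None] - x_train[None, :]) ** 2)
               / (2 * bandwidth ** 2))
    w /= w.sum(axis=1, keepdims=True)
    return w @ y_train

y_hat = kernel_smooth(x, y, x)
err = np.abs(y_hat - y)

# The fit is accurate on the flat regions and worst exactly at the cliff.
print(f"max error near the cliff:     {err[45:56].max():.2f}")
print(f"max error far from the cliff: {max(err[:30].max(), err[71:].max()):.2f}")
```

The smoother treats the discontinuity as noise to be averaged away, which is precisely the failure mode described above for MSE-trained property predictors.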

The fragility extends to structure-based drug design, where models explicitly incorporate the three-dimensional geometry of the target protein. Complex 3D convolutional networks frequently fail to enrich active binders in realistic decoy scenarios. Self-supervised pre-training (motif prediction, masked atom reconstruction) can worsen activity cliff prediction by encouraging scaffold memorization. When data leakage is removed via strict structural splitting, the performance of many deep learning models drops to levels approaching simple nearest-neighbor heuristics.

Key Evidence

  • GNNs and transformers frequently fail to outperform classical ML baselines (SVMs, Gaussian processes) under rigorous scaffold splitting
  • MSE-based training smooths the structure-activity landscape, treating activity cliffs as noise rather than signal
  • The problem persists in 3D structure-based drug design — even with protein geometry, models fail in realistic decoy scenarios
  • Self-supervised pre-training (motif prediction, masked atom reconstruction) can worsen activity cliff prediction by encouraging scaffold memorization

So What?

The failure at activity cliffs is a diagnostic that suggests the insufficiency of learning through static correlations. The hypothesis is that pre-training on synthesis planning—forcing a model to internalize the rules of molecular transformation—will produce representations that are inherently more sensitive to the local electronic and steric features that govern both reactivity and specific biological interactions.



The Solv-N framework deconstructs the ambiguous term "solvability" into four distinct tiers: Syntactic (Solv-0), Topological (Solv-1), Selectivity (Solv-2), and Executability (Solv-3). This clarifies that most published success rates measure only topological connectivity (Solv-1), not experimental feasibility.

The term "validity" in retrosynthetic planning often obscures the distinction between graph-theoretic connectivity and experimental feasibility. A route that terminates at commercially available starting materials (a high stock-termination rate, STR) may still propose chemically implausible transformations. To address this ambiguity, we define four levels of constraints that a proposed transformation must satisfy.

Syntactic (Solv-0) and Topological (Solv-1) validity refer to the construction of a well-formed molecular graph and a legal reaction center modification. Selectivity (Solv-2) requires that the transformation be chemically plausible, satisfying constraints of chemoselectivity, regioselectivity, and stereochemistry. Executability (Solv-3) demands experimental viability under realistic lab conditions.

This framework allows for more precise reporting. A planner with a high stock-termination rate would be accurately described as achieving a high Solv-1 rate, making explicit that higher-order chemical constraints remain unverified. While template-based methods guarantee Solv-0/1 by construction, they offer no formal control over selectivity. Sequence-based methods lack formal guarantees at any tier. Critical constraints like enantioselectivity and stoichiometry are largely ignored by most current systems.
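One way to operationalize the tiers, sketched here in minimal form: the `Solv` names follow the framework, but the verifier outcomes are hypothetical placeholders, not a real checking pipeline.

```python
from enum import IntEnum
from typing import Optional

class Solv(IntEnum):
    """Validity tiers for a proposed retrosynthetic transformation."""
    SYNTACTIC = 0      # Solv-0: well-formed molecular graph
    TOPOLOGICAL = 1    # Solv-1: legal reaction-center modification
    SELECTIVITY = 2    # Solv-2: chemo-/regio-/stereo-selectivity satisfied
    EXECUTABILITY = 3  # Solv-3: viable under realistic lab conditions

def highest_verified_tier(checks: dict) -> Optional[Solv]:
    """Return the highest Solv-N a route achieves, requiring every lower
    tier to pass as well. `checks` maps tiers to (hypothetical) verifier
    outcomes; missing tiers count as unverified."""
    best = None
    for tier in sorted(Solv):
        if not checks.get(tier, False):
            break
        best = tier
    return best

# A typical published result: graph-valid route, selectivity never verified.
route = {Solv.SYNTACTIC: True, Solv.TOPOLOGICAL: True}
print(highest_verified_tier(route))  # the route is Solv-1, not "solved"
```

Reporting the tier, rather than a binary "solved," makes explicit which constraints remain unverified.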

Key Evidence

  • Solv-0 and Solv-1 are largely solved — template-based methods enforce them by construction, and STR routinely exceeds 99%
  • Solv-2 requires satisfying 5 sub-constraints (C, R, D, E, S) simultaneously for every step in a route
  • Template-based methods guarantee Tier 0-1 but provide "no formal control over selectivity"
  • Enantioselectivity and stoichiometry are "largely ignored" by current planners

So What?

The Solv-N framework provides a more granular vocabulary for "solvability," moving beyond a simple binary to a multi-faceted diagnostic. It enables the field to make more precise claims about model capabilities and focuses attention on the unsolved frontier of chemical AI: selectivity and experimental executability.


Reported success rates can be inflated from ~19% to 100% simply by expanding the starting material inventory from a physical tier (~100K) to a virtual one (~230M). High STR against make-on-demand libraries often reflects inventory breadth, not algorithmic depth.

Because the stock-termination rate measures only whether a route terminates at available starting materials, it is strictly dependent on the inventory definition. The size of the stock set functions as a "difficulty dial": expanding the inventory increases the density of termination points, statistically shortening the required search depth and increasing the probability of success.

The virtual tier (~10M–1B compounds)—essentially a list of make-on-demand targets—relaxes the planning problem by allowing termination at complex intermediates. This can transform a deep route planning task into a shallower intermediate retrieval problem. Empirical comparisons confirm that inventory choice can alter apparent performance more significantly than the choice of search algorithm.

The quantitative effect is significant. Expanding the inventory from a physical subset to a large virtual set increased one planner's STR from 73.5% to 87.3%. State-of-the-art results like MEEA*'s 100.0% STR rely on virtual catalogs exceeding 230 million entries. In contrast, when restricted to physically deliverable buyables, standard MCTS planning achieves only 18.7% on high-difficulty targets. The definition of even nominally identical stock sources (e.g., "eMolecules") can vary by an order of magnitude across studies, rendering direct comparison of STR values unreliable.
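The inventory dependence of STR is easy to state computationally. A minimal sketch with hypothetical routes and stock sets (no real catalogs):

```python
def stock_termination_rate(routes, stock):
    """Fraction of routes whose every leaf node (starting material)
    is present in the given stock set."""
    solved = sum(all(leaf in stock for leaf in route) for route in routes)
    return solved / len(routes)

# Hypothetical routes, each represented only by its set of leaf compounds.
routes = [{"A", "B"}, {"A", "C"}, {"D", "E"}, {"F"}]
physical_stock = {"A", "B", "F"}                   # small, deliverable tier
virtual_stock = physical_stock | {"C", "D", "E"}   # make-on-demand tier

print(stock_termination_rate(routes, physical_stock))  # 0.5
print(stock_termination_rate(routes, virtual_stock))   # 1.0
```

Nothing about the planner changed between the two lines; only the stock set did. This is the "difficulty dial" in code.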

Key Evidence

  • A single planner's STR drops from 73.24% to 50.26% when moving from a large virtual inventory to a smaller physical one
  • Performance on hard targets varies from 100% STR (MEEA*, 231M stock) to 18.7% STR (SynPlanner, 186K stock)
  • The term "eMolecules inventory" has been used to refer to sets ranging from 23M to 231M compounds, making STR values non-portable across studies
  • Inventory variation can alter apparent performance more than the choice of search algorithm

So What?

A headline solvability number is incomplete without specifying the inventory against which it was measured. To enable meaningful comparison, the community needs standardized stock definitions that prevent the conflation of inventory breadth with algorithmic capability.


The tension between explicit search and direct sequence generation is transient. A plausible trajectory, mirroring Sutton's "Bitter Lesson," involves using physics-constrained search to generate high-fidelity training data, which is then distilled by sequence models into fast, generalizable policies.

The apparent dichotomy between explicit graph search and direct sequence generation is likely a transient phase. A recurring observation in computationally intensive sciences, articulated in Rich Sutton's Bitter Lesson, is that general-purpose architectures that scale with computation eventually outperform systems relying on complex, hand-engineered heuristics.

This suggests a symbiotic architecture. Explicit search, guided by rigorous physical constraints (e.g., automated Solv-2 filters), acts as the "teacher." It can explore the vast combinatorial space of synthesis to generate large, high-fidelity corpora of valid routes. High-capacity sequence models then act as the "student," distilling this complex physical and strategic logic into a fast, generalizable policy. This approach amortizes the computational expense of search into the model's weights, combining the rigor of symbolic methods with the speed and generalization of deep learning.
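The teacher-student loop can be sketched in a few lines. Everything here is a placeholder (the planner, the Solv-2 filter, and the training step are hypothetical callables), but the control flow is the point: search produces and filters data; the sequence model consumes it.

```python
from typing import Callable, Iterable, List

def distill_search_into_policy(
    targets: Iterable[str],
    search: Callable[[str], List[list]],      # teacher: physics-constrained planner
    passes_solv2: Callable[[list], bool],     # automated selectivity filter
    train_step: Callable[[str, list], None],  # student: sequence-model update
) -> int:
    """Amortize expensive, constraint-checked search into a fast policy by
    training only on routes that pass the filter. Returns the number of
    training examples distilled."""
    n_examples = 0
    for target in targets:
        for route in search(target):          # slow, rigorous exploration
            if passes_solv2(route):           # keep only plausible chemistry
                train_step(target, route)     # fold it into the student's weights
                n_examples += 1
    return n_examples

# Dummy components, just to exercise the loop.
corpus = []
n = distill_search_into_policy(
    targets=["T1", "T2"],
    search=lambda t: [["a", "b"], ["a", "x"]],    # two candidate routes each
    passes_solv2=lambda route: "x" not in route,  # reject one of them
    train_step=lambda t, r: corpus.append((t, r)),
)
print(n)  # training examples kept after filtering
```

The search cost is paid once, offline; the student's inference cost at deployment is a single forward pass.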

The complementarity is already visible in benchmarks. Explicit search excels at navigability on shallow routes, while sequence models show greater robustness at depth. On the RetroCast benchmark, for example, the reconstruction accuracy of MCTS planners drops sharply from 81% on short routes (length 2) to 9% on deeper ones (length 6), while sequence models degrade more gracefully, from 67% to 50%.

Key Evidence

  • Sutton's Bitter Lesson: General-purpose, scalable architectures eventually outperform specialized, hand-engineered systems—a likely trajectory for chemical AI.
  • Search accuracy collapses with route depth (81% -> 9%), whereas sequence models show more robust performance on deeper routes (67% -> 50%)
  • Search provides formal guarantees and rigor; sequence models provide inference speed and can learn global, route-level conditioning.
  • The synthesis: Search acts as a "teacher" generating high-fidelity training data; sequence models act as a "student" distilling it into a fast inference policy.

So What?

The future architecture is likely not search or generation, but search for generation. Physics-constrained search becomes the data engine; scalable sequence models become the deployment layer.


Meta Ideas

How the field should organize its science.

The architects of our software instruments define the scientific agenda. In an era where progress is gated by software capability, computational chemists must become proficient software architects, or risk ceding the direction of the field to those who are.

The field's early progress came from a productive abstraction: framing retrosynthesis as a graph search problem. This choice, made by those who could build the tools, shaped a decade of research and defined the metrics for success.

This reveals a critical dynamic: the software we build is the scientific instrument that dictates the boundaries of possible inquiry. Its architecture determines the questions we can ask. An instrument built for graph traversal will yield graph-based answers. To ask deeper questions of chemical validity, one must first build an instrument where those principles are foundational.

This dynamic upends the traditional model of a domain expert directing a technical specialist. The architectural decisions that enable new science are too deeply intertwined with chemical intuition to be effectively delegated. In this domain, software engineering has become a form of scientific literacy. Just as 20th-century science demanded statistical fluency to critically evaluate experimental data, 21st-century computational science demands architectural fluency to critically evaluate—and create—its instruments of inquiry.

Researchers who cannot architect their own tools are effectively confined to the scientific questions that others have already framed in software. They become expert users of yesterday's instruments, limited to incremental advances within established paradigms. The capacity to architect new systems is no longer a peripheral skill; it is the core competency that enables paradigm-shifting research.

So What?

The disciplinary boundary between computational chemist and software architect has become a primary bottleneck to scientific progress. The researchers who can build robust, chemically-principled software will not just be more productive; they will be the ones who define the questions the field is capable of answering.


The practice of coupling method development with benchmark definition creates a structural conflict of interest, where evaluation criteria inevitably drift to favor the proposed method. The solution requires both independent evaluation infrastructure and recognizing validation itself as a first-class scientific contribution.

In the navigability era, it was common for papers to introduce a new search algorithm and a new custom benchmark simultaneously. This practice of coupled method-and-metric development introduces structural confounding: apparent algorithmic gains cannot be isolated from relaxed boundary conditions or favorable test set selection. The issue is not one of intent; it is that the incentive structure makes objective self-evaluation nearly impossible. When you both design and take the exam, the outcome is predictable.

A revealing instance is the widespread use of "round-trip" accuracy, where a forward reaction predictor validates proposed retrosynthetic steps. While useful for filtering syntactic errors, this does not provide an independent measure of validity. When the forward model shares the same training distribution and architectural biases as the planner, a successful round-trip primarily confirms internal consistency, not objective chemical correctness. The model is grading its own homework.
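A sketch of why round-trip agreement is a weak signal. The names are hypothetical, and the toy forward model is deliberately given the same blind spot as the planner, so a chemically wrong step still "round-trips."

```python
def round_trip_rate(steps, forward_predict):
    """Fraction of retrosynthetic steps (product, precursors) that a
    forward model maps back to the original product. This measures
    consistency with that forward model, not chemical correctness.
    `forward_predict` is a hypothetical forward reaction predictor."""
    ok = sum(forward_predict(precursors) == product
             for product, precursors in steps)
    return ok / len(steps)

# Toy forward model that shares the planner's bias: it (wrongly, by
# stipulation) believes C alone can be converted to P.
shared_bias_model = {("A", "B"): "P", ("C",): "P"}
steps = [("P", ("A", "B")),   # genuinely valid step
         ("P", ("C",))]       # invalid step the shared bias endorses

print(round_trip_rate(steps, lambda pre: shared_bias_model.get(pre)))  # 1.0
```

Both steps pass, so the round-trip score is perfect even though the second step is wrong by construction.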

The problem extends to metric fragmentation. New methods are often introduced with bespoke evaluation criteria—novel diversity scores, composite "feasibility scores"—that highlight the strengths of the new architecture. While useful for internal ablation studies, this practice creates a literature where direct, cross-study comparison is difficult, if not impossible.

Key Evidence

  • The existence of independent evaluators (e.g., Syntheseus, RetroCast) demonstrates that standardized measurement is both feasible and necessary.

So What?

Progress requires decoupling method development from metric definition. Rigorous evaluation must be conducted within independent, community-standardized frameworks. Crucially, the work of building and validating these measurement instruments must be valued as a first-class scientific contribution, not relegated to the status of service work.


The field is deadlocked by an overly rigid definition of "fair comparison," which is rarely met in practice and often invoked to justify a lack of head-to-head evaluation. The dual-track framework resolves this by creating two distinct but equally valid modes: a Developer Track for isolating algorithmic novelty under controlled conditions, and a Chemist Track for pragmatically comparing off-the-shelf tools. The latter legitimizes the crucial practical question: which available tool gives the best route right now?

A truly rigorous comparison of two planning algorithms requires retraining both on identical datasets, templates, and stock definitions. In practice, the lack of public training scripts and the high computational cost often make such controlled comparisons infeasible. This results in a literature with strikingly few direct, controlled studies of planner performance.

This insistence on perfect experimental control has created a paradoxical situation where the difficulty of achieving a "fair" comparison becomes a justification for avoiding comparison altogether. If any comparison can be dismissed over minor differences in training data, then no model is ever meaningfully evaluated against another, creating a gap between theoretical rigor and practical accountability.

The dual-track evaluation framework resolves this by distinguishing between two complementary and legitimate evaluation goals. The Developer Track is for assessing algorithmic novelty; here, method creators must demonstrate advantages through retrained comparisons under fixed boundary conditions. The Chemist Track, in contrast, addresses practical application by evaluating pre-trained, off-the-shelf models as-is. It asserts that comparing tools trained on different datasets is a valid scientific inquiry when the goal is utilitarian.

So What?

By formalizing the distinction between algorithmic novelty and practical utility, the dual-track framework provides a clear standard for when different types of comparisons are appropriate. It removes the justification for avoiding head-to-head evaluation and helps build a more robust and accountable ecosystem for both researchers and tool developers.