Structure Precedes Quantity

We are attempting to solve chemistry's quantitative problems before we have mastered its grammar.

I. The Two Classes of Problems

Scientific problems fall into two distinct ontological categories: Quantitative and Structural.

Quantitative problems seek to map an input to a scalar value. Predicting toxicity, solubility, and binding affinity are quantitative tasks. They are the domain of regression. Structural problems seek to generate a complex object governed by an underlying syntax. Protein folding, language modeling, and image generation are structural tasks. They are the domain of generation.

The history of modern AI reveals a rigid hierarchy: mastery of structure is a prerequisite for the mastery of quantity.

We did not solve sentiment analysis by training on sentiment labels; we solved it by training on the structure of language (Next Token Prediction). We did not solve protein-ligand docking by training on binding assays; we solved it by training on the structure of evolution (Multiple Sequence Alignment). In every instance, the quantitative breakthrough was an emergent property of a structural foundation.


II. In Defense of Phenomenology

The critique of AI in science often centers on "spurious correlations"—the idea that models learn cheap statistical tricks rather than causal mechanisms. This critique misses the history of science itself.

Science has always run on empirical laws that work; when they lack a micro-reductionist explanation, we simply dignify them with a respectable name: phenomenology.

  • Newton’s Universal Gravitation: It predicted planetary motion with breathtaking accuracy, yet Newton could not explain how gravity acted at a distance. His famous defense—"Hypotheses non fingo" (I feign no hypotheses)—was an admission that a perfect phenomenological model does not require a known mechanism to be scientific.
  • Thermodynamics: We mastered the steam engine and phase transitions long before we accepted the existence of atoms. We built a rigorous predictive framework on macroscopic averages, treating the underlying reality as a useful black box.
  • Pauling’s Electronegativity: There is no single quantum mechanical operator for "electronegativity." It is a heuristic—a scalar summary of complex vector fields. It is a "spurious correlation" by strict physics standards, yet it remains one of the most powerful predictive concepts in chemistry.

Deep Learning is the ultimate engine for automated phenomenology. The problem is not that these models rely on correlations, but that their correlations are brittle. They fail at "activity cliffs"—where a tiny structural change causes a massive property shift—because they map static graphs directly to scalar labels, skipping the causal layer of physical interaction. A standard model sees a functional group as a fixed feature vector; it does not inherently understand that the same group acts as a hydrogen-bond donor in one context, a nucleophile in another, and a source of steric clash in a third. It learns the what, but remains blind to the how.
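
A minimal sketch of this blindness, assuming RDKit and a hypothetical pair of molecules that differ by a single halogen swap (no activity data is implied): to a standard Morgan-fingerprint featurizer the two structures occupy nearly the same point in feature space, so any regressor built on top of it must hand them nearly the same prediction, however differently they behave at the cliff.

```python
# Sketch: why a static featurizer is blind to activity cliffs.
# Assumes RDKit is installed; the molecule pair is hypothetical and chosen
# only to differ by a single halogen swap. No activity values are claimed.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

pair = {
    "parent":   "Cc1ccc(cc1)C(=O)Nc1ccc(F)cc1",   # aryl amide, para-fluoro
    "analogue": "Cc1ccc(cc1)C(=O)Nc1ccc(Cl)cc1",  # same scaffold, para-chloro
}

fps = {
    name: AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(smi), 2, nBits=2048)
    for name, smi in pair.items()
}

similarity = DataStructs.TanimotoSimilarity(fps["parent"], fps["analogue"])
print(f"Tanimoto similarity: {similarity:.2f}")
# A fingerprint-to-scalar regressor sees two highly overlapping representations
# and returns two nearly identical predictions, regardless of how differently
# the molecules actually behave.
```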


III. Reactivity Is the Grammar

If static property prediction encourages brittle correlation, we must change the objective function to force an understanding of the underlying physics. We must teach the model Synthesis.

Reactivity provides the only rigorous test of structural understanding. To correctly predict that an ester survives mildly acidic conditions but is cleaved by base requires the model to implicitly deduce electronic character and resonance stability. This process grounds the model's representation in physical behavior rather than arbitrary labels.

However, training on single-step reactions is insufficient. Single-step prediction can be solved via local pattern matching—the model learns to recognize the "reaction center" (e.g., an amide coupling) and ignores the rest of the molecule. This creates false confidence: the model achieves high accuracy by memorizing templates while remaining blind to global incompatibilities.
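
As a toy illustration (not a model, and not any particular production pipeline), the RDKit snippet below applies a generic amide-coupling reaction SMARTS, assumed here purely as an example template. It fires on the local reaction center and is silent about the unprotected carboxylic acid sitting elsewhere on the amine, exactly the kind of global context a template never sees.

```python
# Toy illustration of template-level "understanding": a single-step amide
# coupling template applied with RDKit. The template only sees the local
# reaction center; it is blind to the free acid on the amine substrate.
from rdkit import Chem
from rdkit.Chem import rdChemReactions

# Generic amide coupling: carboxylic acid + primary amine -> amide.
amide_coupling = rdChemReactions.ReactionFromSmarts(
    "[C:1](=[O:2])[OX2H1].[NX3;H2:3]>>[C:1](=[O:2])[N:3]"
)

acid = Chem.MolFromSmiles("CC(=O)O")             # acetic acid
amine = Chem.MolFromSmiles("Nc1ccc(C(=O)O)cc1")  # 4-aminobenzoic acid: amine AND free acid

for (product,) in amide_coupling.RunReactants((acid, amine)):
    Chem.SanitizeMol(product)
    print(Chem.MolToSmiles(product))
# The template happily emits the "correct" amide while ignoring that the
# substrate's own carboxylic acid would cross-react under real coupling
# conditions: the global context that multistep planning cannot ignore.
```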

The true test is Multistep Retrosynthesis. In a long synthetic route, "bystander" functional groups become active liabilities. A ketone that is irrelevant in Step 1 becomes a fatal flaw in Step 5 if a Grignard reagent is introduced. To successfully plan a route, the model must track the latent reactivity of every atom across the entire sequence. It cannot simply match a local template; it must reason about protection, orthogonality, and global survival. This objective function forces the model to learn the grammar of chemistry not as a set of isolated rules, but as a cohesive, interdependent system.
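
The bookkeeping this implies can be caricatured in a few lines. The functional-group tags, the incompatibility table, and the two-step route below are all invented for illustration; a real planner would have to derive them from the structures themselves, which is precisely the representational burden being argued for.

```python
# Caricature of route-level bookkeeping: every functional group carried by an
# intermediate must be checked against every later step's reagent class.
# The tags, table, and route below are invented purely for illustration.

# Reagent classes that destroy (or are destroyed by) a latent functional group.
INCOMPATIBLE = {
    "grignard":    {"ketone", "aldehyde", "ester", "carboxylic_acid", "alcohol"},
    "lialh4":      {"ester", "ketone", "aldehyde", "amide"},
    "strong_acid": {"acetal", "boc_amine", "silyl_ether"},
}

def audit_route(route):
    """Flag steps whose reagent clashes with functional groups still present."""
    clashes = []
    for i, step in enumerate(route, start=1):
        hazards = INCOMPATIBLE.get(step["reagent"], set())
        for group in step["groups_present"]:
            if group in hazards:
                clashes.append(f"Step {i}: {step['reagent']} vs latent {group}")
    return clashes

# A ketone that is harmless in step 1 becomes fatal when the Grignard arrives.
route = [
    {"reagent": "amide_coupling", "groups_present": {"ketone", "silyl_ether"}},
    {"reagent": "grignard",       "groups_present": {"ketone", "silyl_ether"}},
]
print(audit_route(route))   # -> ['Step 2: grignard vs latent ketone']
```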


IV. The Infrastructure Constraint

The crisis we face is that our capacity to generate phenomenology has outstripped our capacity to validate it.

The old scientific process had a slow, built-in validation loop. It took a human lifetime to refine a theory like Thermodynamics. Today, a graduate student can generate a thousand potential phenomenological models in an afternoon. We have automated the discovery of "tricky ways"; we have not automated the validation they demand.

Consequently, the core skill of a 21st-century scientist is no longer just hypothesis generation; it is epistemological taste. It is the ability to look at a correlation found by a model and distinguish between:

  • A useless artifact (Clever Hans).
  • A useful heuristic (Electronegativity).
  • A hint of a new deep law (AlphaFold treating evolution as a proxy for physics).

We cannot exercise this taste at scale with our current tools. Science is infrastructure-constrained (see my full argument).

In Heidegger's phenomenology, there is a distinction between tools that are ready-to-hand (zuhanden) and those that are merely present-at-hand (vorhanden). A good tool becomes an invisible extension of the will; you do not think about the hammer, you think about the nail. Bad software forces the mind out of flow and into debugging; it becomes an object of scrutiny that blocks the scientific process.

Academia has accumulated decades of "infrastructure debt"—one-off scripts and brittle pipelines that keep our tools in a perpetual state of Vorhandenheit.

My work on RetroCast and SynthArena is born of this conviction: building rigorous benchmarks is not a side quest. It is the only thing that matters in a world where generation is free and validation is expensive.


V. Toward Aletheia

If we succeed in building this structural foundation—grounding our models in reactivity rather than static properties—we unlock a new trajectory for Artificial Chemical Intelligence. This is not a linear improvement, but a phased evolution of capability:

Stage I: The Librarian (Retrospective)

You are here. We have achieved powerful semantic search over known chemistry. The model knows what has been done; it mirrors the past with high fidelity. However, it struggles to extrapolate beyond the distribution of reported literature because it has memorized the what without internalizing the why.

Stage II: The Physics Engine (Predictive)

A model that has internalized the rules of reactivity through massive-scale retrosynthetic pre-training. This model possesses an implicit representation of quantum mechanical constraints. It offers true predictive power for novel reactions and can generate thermodynamically plausible molecules under complex constraints. It transitions from a map of history to a map of the possible.

Stage III: The Engine of Aletheia (Ontological)

Aletheia is the Greek concept of truth as "un-concealment." A Stage III intelligence does not just answer questions; it identifies the blind spots in our ontology. By exhaustively mapping the possible (Stage II), it identifies the "dark matter" of chemical space—the reactions that should exist but don't, and the molecules that defy our current heuristics. It becomes an engine of discovery, revealing the questions we did not yet know to ask.

My work is focused on building the structural and epistemic foundation to bridge the gap between Stage I and Stage II. We stop guessing. We start measuring. We build the engine.