Everyone is waiting for AI to make world-changing discoveries, but most scientific disciplines are infra-constrained, not ideas-constrained. Academia has accumulated decades of "infrastructure debt" from a culture that rewards quick-and-dirty code producing proof-of-concept papers, while neglecting the robust, well-engineered software that enables real progress. The true AI revolution is already underway, and it's about finally building the high-quality tools we should have been building all along.
I never really understood what chemical engineering is. Sure, I had some rudimentary understanding that it encompasses the creation of physical tools used in day-to-day chemical research (NMR, mass spectrometers, all the fancy LC-MS with mild ionization for protein characterization stuff) and of actual chemical factories (monitoring industrial processes), but surely it can't be just that: after all, you can get a PhD in ChemE, and surely you don't get a PhD for being a craftsman (absolutely not intending to diminish craftsmanship in any way, just keep reading).
Like many other computational chemistry students, I started my research career by doing wet-lab research (a paper based on my work in 2018-2020 came out earlier this year). I got seduced into considering a switch to theory and computations only during my 3rd year, with roughly one year remaining before grad applications were due, so, naturally, I spent a while thinking about whether it was worth (strategically) making the switch right then, or better to keep growing my expertise in experimental chemistry (a real-life manifestation of the sunk cost fallacy).
There was no doubt that computational chemistry, and ML in particular, was on the rise (this was pre-ChatGPT, but post AlphaFold 2), but it wasn't clear who was more likely to make an impact: a chemist who learned to code/train neural nets, or a CS major who learned chemistry? A related question is whether it is easier to train a chemist to write code or a coder to understand chemistry. Naturally, being a chemist-by-origin who got interested in coding somewhat too late for a full double major (I ended up with a CS minor, unlike some of my friends), I was inclined to come up with rationalizations in favor of the chemist-who-learns-to-code stance. A strong counterargument was AlphaFold 2 itself: CASP had been held since 1994, and the best structural biologists in the world spent 26 years making incremental progress until a bunch of CS majors achieved a miraculously high score.
Though, as I learned later, Sir Demis Hassabis, before starting DeepMind to create an artificial brain, decided to learn how the actual brain works and went to do a PhD in cognitive neuroscience, finishing in what appears to be just 4 years (a neuroscience PhD often lasts up to 8-9 years). His first paper was published in PNAS (that's a good journal), with a few more in Nature and Science (the crown jewels of the academic world). True gigachad.
This October, PNAS published a perspective from the Tiwary Group on Generative AI for computational chemistry (it was originally posted on arXiv in Sep 2024, and PNAS notes the paper was accepted in Nov 2024; so, apparently, PNAS has an 11-month queue to be included in an issue?), which, besides giving a concise review of different approaches (from classical AEs and GANs to Flow Matching and Diffusion), makes a provocative epistemological claim:
We believe that the ultimate goal of a simulation method or theory is to predict phenomena not seen before and that generative AI should be subject to these same standards before it is deemed useful for chemistry
This perspective made the rounds among chemists-by-origin, including many faculty members, and the reaction focused on the "we haven't seen evidence of predictive power" part and the classic "ML models learn the distribution and fail to generalize out-of-distribution." At the same time, frontier labs are increasingly painting a picture of AI revolutionizing science, perhaps even with fully autonomous AI scientists. Naturally, you, my dear reader, might wonder: who is more right? Will AI enable/speed up scientific breakthroughs?
Yes, even if LLMs stop improving and stay at the current level of capabilities, we'll see revolutionary-grade impact of AI on science. Just not the way most people envision it.
In what follows, I revisit different aspects of the origin story of AlphaFold 2 and argue that there are certain overlooked lessons which currently prevent us from having "an AlphaGo/AlphaFold/ChatGPT/DeepSeek R1 moment" for scientific discipline X or problem Y.
I'll come out swinging: unless you conduct research on learning or optimization theory (e.g., see the Grokking preprint, or, for a goldmine of fun papers, see the works of Kimon Fountoulakis, one example and another), i.e., unless you have a strict definition of what constitutes the distribution, you are hereby forbidden from using the term out-of-distribution (OOD). It's a perfectly legitimate and useful concept when used precisely, but in common parlance it encourages lazy thinking to the point of actually stifling research.
Take AlphaFold. You might even see some people mindlessly claiming that AlphaFold cannot predict structures of proteins unseen during training, a claim so easily demonstrably false that I don't want to spend more than one sentence on addressing it: just read how the CASP competition physically operated. A more sophisticated criticism is that AlphaFold only works for well-behaved proteins; if your target of interest is, for example, an intrinsically disordered protein (IDP), you might as well gaze into a crystal ball. And sure, that's true, but also HEY, that's a different problem, and the lack of a solution to it does not diminish in any way the significance of AF2.
A similar criticism is that the PDB is mostly a database of structures in the Apo form (unbound, empty state), while what we're truly interested in (e.g. for drug discovery purposes) is how proteins interact with ligands, so we need to know their Holo form (protein bound to a ligand), and, well, AF2 can't help you with that. (AF3 tries to tackle the protein-ligand binding problem, but a common critique is: your model is only as good as the quality of your training set, and I don't see where you'd get a PDB-grade dataset of Holo forms.)
In other words, a general problem of "protein folding" is actually a graph of problems:
and I don't know why you would perform an unnecessary act of cognitive dimensionality reduction by calling the lack of ability to do 2-4 a lack of OOD generalization, especially given that for decades prior to AlphaFold structural biologists believed that 4 was a strict prerequisite even for 1 (here's a quite instructive thread by Sergey Ovchinnikov).
Let's pick a different problem. Say you train a GAN, a diffusion model, or a transformer to generate molecules. What would constitute OOD generalization? The ability to predict molecules from ChEMBL after training on ZINC (these are dataset names)? It even sounds absurd. A common test is to characterize molecules with some descriptors (molecular weight, topological measures, physicochemical properties), reduce them to 2 dimensions with PCA, and show the distribution of molecules in the training set and in the generations. That makes a nice paper figure (I'm as guilty as anyone; we have such a figure in ChemSpaceAL), but what makes you think that the true chemical space extends outside of your training blob in this set of coordinates?
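For the curious, here is roughly what that figure pipeline looks like, as a minimal sketch rather than any specific paper's code; the SMILES lists and descriptor choices below are placeholders:

```python
# Sketch of the standard "chemical space" figure pipeline:
# descriptors -> PCA -> 2D scatter of training vs generated molecules.
import numpy as np
import matplotlib.pyplot as plt
from rdkit import Chem
from rdkit.Chem import Descriptors
from sklearn.decomposition import PCA

def featurize(smiles_list):
    feats = []
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is None:
            continue
        feats.append([
            Descriptors.MolWt(mol),             # molecular weight
            Descriptors.MolLogP(mol),           # lipophilicity
            Descriptors.TPSA(mol),              # topological polar surface area
            Descriptors.NumRotatableBonds(mol), # flexibility
        ])
    return np.array(feats)

train = featurize(["CCO", "c1ccccc1", "CC(=O)O"])      # stand-in for the training set
generated = featurize(["CCN", "c1ccncc1", "CC(=O)N"])  # stand-in for generated molecules

pca = PCA(n_components=2).fit(train)
for data, label in [(pca.transform(train), "train"), (pca.transform(generated), "generated")]:
    plt.scatter(data[:, 0], data[:, 1], label=label, alpha=0.6)
plt.xlabel("PC1"); plt.ylabel("PC2"); plt.legend(); plt.show()
```

Note that the resulting "blob" depends entirely on which four descriptors you happened to pick, which is exactly the problem.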
Or take my field, retrosynthesis (predicting a recipe for synthesizing a molecule of interest in a single step or in multiple steps). For predicting single-step transformations, one can use a time-based split: can you predict reactions reported after a certain year (say 2016) after training only on reactions reported before 2016? But as Segler and Coley (and their students), the gigachads of our field, recently argued, a time-based split is chemically meaningless, and they argued instead for a split based on reaction types: if you've never seen Mizoroki-Heck couplings, but you've seen nucleophilic substitutions, can your model infer the existence of such couplings? It's an interesting test, sure, but the real problem we're trying to solve here is predicting how to make any molecule of interest, which often requires multiple steps. And what would OOD mean for multistep retrosynthesis?
But is that a meaningful constraint? A synthetic chemist who needs to see suggestions for how to make a molecule would care about:
And what he couldn't care less about is how the model that finds the route works or what data it was trained on. I'll go even further: I'll say he doesn't care if the model has seen the exact route he is being shown during training (or all the single-step reactions it is made of). In a similar sense, even if AlphaFold were nothing but a mere lookup table (i.e., it just finds the most related known structures and returns some average of them), if it can do that in <1 min (which it can, assuming you already have local reference files for the MSA), it's still an awesome invention, because it unlocks a capability that researchers previously didn't have (the ability to get a rough picture of what their protein variant might look like). "Well, no, we could go to the PDB, find related proteins and inspect them" -- sure, absolutely, but you'd spend a minute just learning to navigate the PDB website (no offence intended, it's a good, functional site), and your user experience would be way worse.
A lot of problems stem from computational chemists' attempts to sit on two chairs at once. If you want to demonstrate that your model has some novel expressivity and generalizability traits, you can come up with some interesting and strict definition of a distribution, impose artificial constraints on the training set, and then you should be judged by the level of rigor and whether you truly show OOD generalization.
But if you're making a model that is supposed to solve a particular task (protein-ligand binding affinity, finding a retrosynthetic route to a molecule, calculating some property), then, by definition, you're building a product. And a product should be judged not just by the theoretical idea or the proof-of-concept execution, but by the quality of the code, the user interface (UI), the user experience (UX), the developer experience (DX), and runtime performance.
A common trope in academia is that you can always optimize your code after you're done with the actual work. All the best practices of software engineering are just something you can apply when you're already preparing a submission to the journal. Might even do it after you submit the paper (or post the preprint). Unsurprisingly to any programmer, such refactors always create unforeseen issues, which is part of the reason why, when you try to run academic code, you get runtime errors: you're using a "polished" but untested version of the codebase, and you can't really diagnose how the code was supposed to work since the repo has only one commit, "move code from private repo." (Full disclosure, I myself was guilty of that, though per the isChemist Protocol my latest project was done properly from the start; in the published repo you see the full commit history.)
But even on a more fundamental level, there's a missing appreciation that your codebase quality determines the boundaries of scientific exploration you can perform.
Because the claim above might sound far-fetched to chemists-by-origin, here's a short story from my own experience of working on DirectMultiStep (blog entry coming soon!), a model for multistep retrosynthetic planning. I joined the project when it was considered "almost completed": there was a model that unlocked constrained (target, starting material, route length) planning. I began by familiarizing myself with the codebase, and almost immediately I noticed a few potential performance-related issues. For example, even though recursion is a quite neat approach to solving problems, any recursive function can be written iteratively (though not necessarily as cleanly), which might be preferable because every recursive call is accompanied by the allocation of an environment/scope for execution, which takes time. It might take microseconds, but when you keep doing that thousands or millions of times, it compounds. Or recognizing a few footguns like this one with list appends:
`some_list.append(new_elt)` is an O(1) (amortized, i.e., on average) operation, but `some_list = some_list + [new_elt]` is an O(n) operation, where n is the length of the list, because Python will reconstruct the whole list from scratch. You don't usually see these kinds of things in introductory learn-to-code tutorials or guides on the internet (because the emphasis is usually on just making it work), but it's something you learn through pain if you take formal classes (which is why I paid attention to this even with very brief formal experience; after all, I only did a minor).
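To make the footgun concrete, here's a tiny timing sketch (not from the DMS codebase, just an illustration) contrasting the two patterns:

```python
import timeit

def grow_with_append(n):
    # O(n) total: each append is amortized O(1)
    result = []
    for i in range(n):
        result.append(i)
    return result

def grow_with_concat(n):
    # O(n^2) total: each concatenation copies the whole list
    result = []
    for i in range(n):
        result = result + [i]
    return result

for fn in (grow_with_append, grow_with_concat):
    t = timeit.timeit(lambda: fn(10_000), number=5)
    print(f"{fn.__name__}: {t:.3f} s")
```

On my machine the concatenation version is orders of magnitude slower, and the gap only widens with list size.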
After I rewrote several parts of the codebase, I increased the speed of training by roughly 1.75x and of generation by 6x. And that was with the 6-layer (9M params) model my coauthors had initially. Eventually, I ended up training 36-layer (40M params, DMS Wide and DMS Deep) and 24-layer MoE (56M params, DMS Explorer XL) behemoths, which showed state-of-the-art performance even on unconstrained (target-only) planning. Training (and evaluating) such models with the codebase prior to optimization would have been so much slower (doubling your layers roughly doubles your runtime) that one could argue it would have been practically infeasible.
The deep learning revolution started when AlexNet, a CNN-based image classifier, surpassed all existing methods on ImageNet. AlphaFold 2 made the news when it became the first model ever to cross the 90% accuracy threshold on some proteins as part of the CASP tournament. Both CASP and ImageNet were established benchmarks that measured a very specific ability: how well a model can classify images across 1000 categories, or how well it can predict the structure of a protein not yet reported in the PDB.
What other similar benchmarks do we have in science? Is there anything where, if you bring a model/tool and achieve a certain performance, it will be universally considered an incredible achievement? (Funny enough, such benchmarks are needed even for the evaluation of classical non-ML models, but we're much less skeptical of ad-hoc evaluations when they're applied to a model created from first principles.) There are many papers published on, say, synthetic accessibility prediction or protein-ligand binding affinity, but whenever you ask a medicinal/synthetic chemist if he'll use any of those models, he is likely to say no. How come? What about the reported high accuracies? It might turn out that benchmark X is irrelevant because it has hyperbolized versions of negative examples and so is too easy, while benchmark Y is useless because the labels are synthetic (made by another model) and we don't think that model is good.
Those are all valid criticisms, but why are we expecting model developers to create better benchmarks? A fundamental principle in CS is separation of concerns, which, when applied to publishing, results in the explicit separation of methodology and evaluation papers. DeepMind didn't organize CASP; the AlexNet authors did not create ImageNet. So, if we want to see progress in the computational physical or life sciences, we should probably adopt a similar separation of tracks. In fact, I'd argue it's the responsibility of experimental (especially senior) chemists (ironically, those who happen to criticize ML methods the most) to dedicate a certain amount of time to such work. And I'm not trying to point fingers; this take comes out of genuine respect for the expertise of experimental chemists: who knows all the minute details that differentiate easy-to-synthesize from hard-to-synthesize molecules better than someone who has spent decades running syntheses? Who has a better understanding of which molecules can be good binders than someone who has spent decades running binding assays?
A current hot topic in computational life sciences is protein-ligand binding pose prediction. A reader might be aware that such a function was introduced in AlphaFold 3, but there are open-source alternatives like Chai and Boltz-2. Boltz (a successor of DiffDock) is particularly interesting for two reasons. One, both Boltz and DiffDock come from a group in the CS department at MIT. Not biology, not bioengineering, not chemistry, not chemical engineering. So our original question of who leaves the greater impact, chemist-by-origin or CS-by-origin, tilts toward the unfavorable side for us chemists.
Second, the development of Boltz-2 is a real-life demonstration of my thesis that real progress happens when there are independent efforts on model development and on evaluation. DiffDock showed great performance on computational metrics like the root mean square distance between predicted and experimental poses (almost a 2x improvement in Top-1 accuracy over existing methods), so medicinal chemists rushed to use these models, but came away with the by-now-typical underwhelmed reaction. It just wasn't good. A year later, a group from Oxford published a paper with a cosmic-grade banger title, "PoseBusters: AI-based docking methods fail to generate physically valid poses or generalise to novel sequences," where they show that predicted poses often have obvious issues with bond lengths or angles (e.g., nonplanar aromatic rings). Most importantly, they packaged it as a Python library that can act as chemistry police: you give it a pose, and it'll tell you if it's physically valid. And now that check is incorporated into the generation workflow of Boltz-2, making it a much better model that predicts realistic poses.
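To give a flavor of what such a "chemistry police" check does, here is a minimal sketch of one of them, aromatic ring planarity, written with RDKit. This is an illustration of the idea, not the PoseBusters implementation; the file name and tolerance are made up:

```python
# Flag aromatic rings whose atoms stray too far from a common plane.
import numpy as np
from rdkit import Chem

def aromatic_rings_are_planar(mol: Chem.Mol, tol: float = 0.1) -> bool:
    conf = mol.GetConformer()
    positions = conf.GetPositions()  # (n_atoms, 3) array of coordinates in Angstrom
    for ring in mol.GetRingInfo().AtomRings():
        if not all(mol.GetAtomWithIdx(i).GetIsAromatic() for i in ring):
            continue  # only check fully aromatic rings
        coords = positions[list(ring)]
        centered = coords - coords.mean(axis=0)
        _, _, vt = np.linalg.svd(centered)
        normal = vt[-1]                        # normal of the best-fit plane
        deviation = np.abs(centered @ normal)  # per-atom distance from that plane
        if deviation.max() > tol:              # tol is an illustrative threshold
            return False
    return True

mol = Chem.MolFromMolFile("predicted_pose.sdf", removeHs=False)  # hypothetical file
print(aromatic_rings_are_planar(mol))
```

PoseBusters runs a whole battery of such checks (bond lengths, clashes, stereochemistry, and more), but the point is the same: cheap, physically motivated sanity checks that any downstream model can call.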
Before working on AlphaFold, DeepMind first rose to fame by defeating one of the best human players of Go, something that was considered impossible, or at least very far in the future. There is a nice documentary by DeepMind themselves with footage filmed while they were working on that project. One particular thing that struck me is the tools they had to visualize the workings of their system (e.g., see the screens at the 44:53 or 1:00:19 timestamps). For starters, there is a visual representation of the current state of the Go board. You might think, of course there would be one, you're building a model to play Go, but see, the computer doesn't actually see a board with white or black pieces; instead, the board is represented as a string of characters like:
o+x
+++
+o+
where + is an empty slot, o is a white piece, and x is a black piece. So if you decide to train your own RL policy or a neural net to play chess, checkers, or Go, this is how the game state will look to you (you can literally see it in the documentary at 1:18:02). And if you want to see an actual board, you have to write a visualization module.
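A toy sketch of the kind of "visualization module" I mean, just turning the raw string board into something slightly more readable:

```python
# Render the character-string board representation as a labeled grid.
BOARD = (
    "o+x\n"
    "+++\n"
    "+o+"
)

SYMBOLS = {"+": "·", "o": "○", "x": "●"}

def render(board: str) -> str:
    rows = board.splitlines()
    lines = ["  " + " ".join(chr(ord("a") + i) for i in range(len(rows[0])))]
    for r, row in enumerate(rows, start=1):
        lines.append(f"{r} " + " ".join(SYMBOLS[c] for c in row))
    return "\n".join(lines)

print(render(BOARD))
```

Trivial for a 3x3 toy board; the real work starts when you want move histories, territory estimates, and live updates, which is exactly what those DeepMind screens were showing.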
Another thing you see in the documentary is multiple plots showing the evolution of key metrics, like the estimated probability of winning after any given move. If you haven't written code before, you should understand that everything you see in software is the result of a very specific instruction handwritten by a human, meaning it's not like you can just toggle some button, "okay, let's add this plot," and it'll magically work: you have to write loggers for the desired metrics, then you need to write loaders & pre-processors, and only then can you actually make a plot out of it. It might sound straightforward, and it is when you only need to plot metrics for one particular run, but when you actually try to implement a more general system that works across different runs, you'll end up dealing with so many tiny little nuisances you likely hadn't anticipated. It doesn't make the problem difficult, it just takes time (might easily take a week or two).
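For non-programmers, here's a stripped-down sketch of that logger -> loader -> plot pipeline (hypothetical file layout with one metrics.jsonl per run directory; the metric name is made up). The nuisances live in everything this sketch ignores: runs logged at different frequencies, renamed metrics, half-written files, and so on.

```python
import json
from pathlib import Path

import matplotlib.pyplot as plt

def log_metric(run_dir: Path, step: int, name: str, value: float) -> None:
    # Append one record per measurement to the run's metrics file.
    run_dir.mkdir(parents=True, exist_ok=True)
    with open(run_dir / "metrics.jsonl", "a") as f:
        f.write(json.dumps({"step": step, "name": name, "value": value}) + "\n")

def load_metric(run_dir: Path, name: str) -> tuple[list[int], list[float]]:
    steps, values = [], []
    for line in open(run_dir / "metrics.jsonl"):
        rec = json.loads(line)
        if rec["name"] == name:
            steps.append(rec["step"])
            values.append(rec["value"])
    return steps, values

# Plot one metric across several runs (the "general system" part).
for run in sorted(Path("runs").glob("*")):
    steps, values = load_metric(run, "win_probability")
    plt.plot(steps, values, label=run.name)
plt.xlabel("move"); plt.ylabel("estimated win probability"); plt.legend(); plt.show()
```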
How do you think a PI will react when a PhD student comes and suggests he'll spend a few weeks writing better visualization modules? You don't even need the experience of going through grad school to guess the most likely answer (though if you're a PhD student, you should still ask, even if you expect the probability of a positive answer to be low).
Recently, DeepMind released another documentary on Hassabis's life in general, but with a big section dedicated to the work on AlphaFold. And again, look at 48:11 or 53:37: these protein visualizations are obviously not PyMOL windows (PyMOL being software used by scientists to visualize protein structures); this is a custom internal interface (although, since the PyMOL engine is open source, they might be using it, but the emphasis here is on the UI). Obviously, we're seeing a pattern: when DeepMind starts to work on a problem, it makes sure to create tools to visualize the internals, anything that can help you gain insight into what is going right and what is going wrong.
Sounds like common sense? Let's look at the state of computational chemistry. We pride ourselves on the accuracy and explanatory (sometimes even predictive) power of quantum mechanics (which we decided to call quantum chemistry; it really is still quantum mechanics). One of the most important features of any molecule is its set of molecular orbitals. Every single quantum calculation begins with the determination of those orbitals. Do you think we have any good software solution to visualize those orbitals? Lol. Lmao even.
Yes, there is GaussView 6 (released by Gaussian, one of the key software packages), which you can buy for $750/seat, which hasn't been updated since 2017, and which might work fine if you use Gaussian, but with any other quantum chemistry tool it's a coin toss. Some people unironically suggest using VMD, which, uhm, is a thing where, to get anything good done, you have to execute 15 different obscure commands. For a while there was the really good Avogadro, but it stopped showing MOs in 2018 after some change in macOS. The team has been working on Avogadro 2, and it's been in an "almost ready" state since at least 2022. And it seems like most of the progress is made once some interested undergrad applies for a Google Summer of Code stipend (ironic, isn't it?) and works on the project. This is not a diss on Geoff Hutchison (the project lead); it just proves my point that academia doesn't show any gratitude to people who work on foundational infra, and so Avogadro is a passion side-project for Prof. Hutchison. Finally, there's Jmol, which is actually quite great, but just open the official webpage, jmol.sourceforge.net, to see what we're dealing with.
And even with Avogadro, to visualize orbitals you need to first run some scripts to create one cube file, then another, then load the structure from the out file, and then add surfaces from those cube files. If you use ORCA, you can create a molden file with the ORCA cli, and Jmol will happily load that (assuming you type in the Jmol terminal load /Path/to/cube.cube). Long story short, you might spend 5 seconds running a basic job on a water molecule and a whole day trying to visualize the orbitals the first time you do it.
Computational chemistry actually has hilarious lore. Back in the 1970s, Sir John Pople (who later gets the 1998 Nobel prize for being a father of computational chemistry) creates the first software package for running quantum chemical calculations, Gaussian. In the 1980s, they start charging a small fee to cover dev costs. It's not clear what exactly happens, but in the 1990s Pople leaves Gaussian and starts Q-Chem, an alternative/competitor. That results in Gaussian banning Pople himself and all his students from using Gaussian. Eventually the list expands to include sitewide bans for Caltech, UC Berkeley, GATech, Columbia U, and a few named groups; there was even a website, bannedbygaussian.org, in the early 2000s. Anyone who makes a contribution to any software (other than Gaussian) that runs quantum chemistry gets banned.
More generally, would you be surprised to learn that in, say, the 25 years since the first releases of ORCA and Q-Chem, the actual day-to-day of a new PhD student who wants to use quantum chemistry to do science hasn't changed a single bit? Your interface is a textual input file, a cli to start the job, and a huge (might easily be 100k tokens) textual output file. Maybe (maybe!), if your group has been doing some routine for a while, some postdoc wrote a bunch of Perl/Python scripts to slightly automate the process of creating those textual input files or parsing output files to get a single desired number. All of that notwithstanding the fact that the field has progressed from working on a single small molecule (when your workflow could really be: launch one calculation, wait for it to finish, and manually inspect) to handling medium-to-large molecules, maybe even with a few explicit solvent molecules, where you can't do anything without submitting at least 100 jobs. How do you make sense of them, how do you organize them, is there some software? Of course not; you just create a bunch of folders, start with a single structure based on your initial understanding of the complexity of the task, eventually realize your structure was too basic, and then you have a bunch of folders like freq-prot-tzvppd, freq-prot-deprot-geom-unrelaxed-tzvppd, freq-prot-deprot-geom-unrelaxed-tzvppd-run2. And if you stop working on this project for more than 2-3 days, you'll have to spend a day just getting back up to speed. Eventually your first project (the fate of ALL first projects, regardless of the competency of the undertaker) will be in such a mess that you might want to do what any good software engineer does: refactor (both your data organization and the scripts that operate on that gargantuan mess). And if you have weekly meetings, well, good luck explaining to your PI or his collaborators why you have no new results.
And that's how we end up with 30 years of modern computational chemistry not yielding a good interchange protocol that would let you visualize molecular orbitals in a second, and with every single theochem grad student on this green earth starting his PhD by writing a parser for single point energies from out files. Taken separately, all these individual things sound simple, but when you add them up, they create a cognitive load, and by the time you're done dealing with all the mess, how much cognitive power do you have left to actually do scientific thinking?
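Since I brought it up, here is roughly what that rite-of-passage parser looks like, as a sketch that assumes ORCA-style output (the marker string below is what ORCA prints; adjust it for your package and version):

```python
# Minimal single-point-energy parser for ORCA-style .out files.
import re
from pathlib import Path

ENERGY_RE = re.compile(r"FINAL SINGLE POINT ENERGY\s+(-?\d+\.\d+)")

def parse_energy(out_file: Path) -> float | None:
    """Return the last single point energy (in Hartree) found in the file, if any."""
    energy = None
    for line in out_file.read_text(errors="ignore").splitlines():
        match = ENERGY_RE.search(line)
        if match:
            energy = float(match.group(1))  # keep the last occurrence
    return energy

# Collect energies across the inevitable pile of folders.
results = {p.parent.name: parse_energy(p) for p in Path(".").glob("**/*.out")}
for run, energy in sorted(results.items()):
    print(f"{run}: {energy}")
```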
Okay, all of this can be dismissed as the ramblings of an old PhD student. Fine. Here is a simple scenario that constitutes an acid test of whether your research is ideas-constrained or infra-constrained.
Say you're studying some proton transfer reaction in two electronic states (S0 and S1). Q1: if you want to see (or ask your student to see) whether any particular quantum metric is insightful (so you have to run 4 jobs: proton in position A, then B, for each of the two states), how much time (excluding the wall time of the actual calculations) will it take? If the answer is not 30 minutes to get the plots and an hour to think about them, you're infra-constrained.
In practice, today, this might easily take anywhere from a day to a whole week: e.g., if you mess something up in the input file, the job spends 10 hours in the queue only to terminate immediately with a runtime error, because, guess what, pre-submission input validation is not a thing; remember, we're scientists, not product developers.
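And yet a bare-bones pre-submission check is not hard to sketch. Here's what a few cheap sanity checks on an xyz geometry could look like before the job ever touches the queue (the file name and element list are placeholders):

```python
from pathlib import Path

VALID_ELEMENTS = {"H", "C", "N", "O", "F", "P", "S", "Cl", "Br", "I"}  # extend as needed

def validate_xyz(path: Path) -> list[str]:
    """Return a list of problems found in an xyz file; empty list means it looks sane."""
    errors = []
    lines = path.read_text().splitlines()
    try:
        n_atoms = int(lines[0])
    except (IndexError, ValueError):
        return ["first line of an xyz file must be the atom count"]
    atom_lines = [l for l in lines[2:] if l.strip()]
    if len(atom_lines) != n_atoms:
        errors.append(f"header says {n_atoms} atoms, found {len(atom_lines)}")
    for i, line in enumerate(atom_lines, start=1):
        fields = line.split()
        if len(fields) != 4:
            errors.append(f"atom {i}: expected '<element> x y z', got {line!r}")
            continue
        if fields[0] not in VALID_ELEMENTS:
            errors.append(f"atom {i}: unknown element {fields[0]!r}")
        try:
            [float(x) for x in fields[1:]]
        except ValueError:
            errors.append(f"atom {i}: coordinates must be numbers")
    return errors

if errors := validate_xyz(Path("proton_A_s1.xyz")):  # hypothetical file name
    raise SystemExit("refusing to submit:\n" + "\n".join(errors))
```

Ten minutes of work, and it would have saved those 10 hours in the queue.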
Once you have the answer, Q2: can you (or your student) say that you're 100% confident that the results are valid, i.e., that no human mistakes were made? If the answer involves you (or your student) saying "well, we can double-check the geometry in the input spec, because I might have copy-pasted the wrong thing," or "we need to double-check that it's using values from the second job" (when you run S1, it is preceded by S0), then you're infra-constrained. (I know, right, wouldn't it be nice if software engineers came up with a thing where you specify a few desired behaviors, like "given this file, you should give me this value and not that," and could run that check periodically?)
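That thing exists, of course, and it's called a test suite. A minimal pytest sketch of the checks being described, with hypothetical file names, values, and a hypothetical parse_energy helper like the one sketched above:

```python
# test_workflow.py -- run with `pytest`; names and values below are illustrative.
from pathlib import Path

from myproject.parsers import parse_energy  # hypothetical helper

def test_s1_energy_comes_from_second_job():
    # The S1 calculation is preceded by an S0 one; make sure we read the value
    # from the right block of the output, not the first energy we happen to find.
    out = Path("tests/data/proton_A_s1.out")
    assert abs(parse_energy(out) - (-76.123456)) < 1e-6

def test_input_geometry_matches_reference():
    # Guard against the "I might have copy-pasted the wrong geometry" failure mode.
    submitted = Path("jobs/proton_A_s1.inp").read_text()
    reference = Path("tests/data/reference_geometry_A.xyz").read_text()
    assert reference.strip() in submitted
```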
Arguably, all of these issues can be summarized as the reluctance of academia to recognize software engineering as a discipline that, perhaps against its wishes, has become an integral component of doing any meaningful computational research. And just as you wouldn't expect a synthetic chemist (without training in chemical engineering) to build his own rotovap, or LC/MS, or NMR machine, all of which he uses daily, you either shouldn't expect a computational chemist to build his own tools (which implies outsourcing tool creation to SWEs, who are expensive, so it is practically unrealistic), OR you should expect your computational chemist to get formal (structured, not just self-studied) training in software engineering.
Why the emphasis on formal (by which I mean a set of proper college courses)? Remember how Tiwary's perspective observed that generative models for chemistry currently might underperform classical methods? A reader might have wondered: well, then how did those models get published? I think the following analogy applies. A significant component of a CS education is not teaching you the syntax of a coding language or how to write code that does X (which is what web tutorials are good at). Rather, it's the creation of an environment in which you get a problem to solve, you most likely come up with an idea of how to approach it (often based on an internal oversimplification of the problem), you write code, and you click run (which is why coding is so awesome: you can immediately check if your ideas work). Let's say the program runs without crashing. You compare the output to what you expect, it all matches, you think, oh nice, job is done; then you run the test suite provided by the course instructors, and suddenly you see a wall of red text because you fail a bunch of test cases. How come? My code is running, it gives me the output I expect! Then you start thinking, and eventually you realize there are certain scenarios (coders call them edge cases) where your understanding of the problem or your idea of a solution is incorrect, so you have to adjust or start from scratch. And the hardest part, the thing with which a huge number of students struggle (I was an office hours helper for the largest intro-to-CS class at MIT for 3 semesters), is learning to restrain yourself from checking the correctness of your approach by clicking run and seeing if the code runs without crashing, and instead taking a piece of paper and a pen and thinking through the problem from first principles. Or writing tests (sometimes even before you write the actual code) that check expected behavior (which you often get from your sketches with toy examples on a piece of paper).
In other words, I contend that many GenAI-for-Chemistry papers that have been published and found underwhelming by domain experts were trained/created by chemists-by-origin who self-studied programming, and so missed that crucial learning experience of "just because it runs doesn't mean it's correct." And a dirty little secret of deep learning is that unless you try really hard to mess things up, you can train a model on any task and it'll work, at least on the surface. It might not be accurate enough for practical purposes, it might even be worse than some classical (non-ML) methods, but it'll work. And in a sense, this is a significant (perhaps even scientific, if done right; there is a harmful misconception that a scientist is a profession, whereas a scientist is someone who applies the scientific method) result that's worth publishing, but at this point we should treat "ML model learns a task and gets 80+ accuracy" as being on an equal standing (in terms of significance) with "the code compiles." Because at the end of the day, what matters is whether your computational models are practically useful, which is often a much, much higher bar.
I never really understood what chemical engineering is. I'm also not sure that a fully autonomous AI scientist will be possible (or practical, cost-wise) in the near future. For many reasons. One: research is rarely constrained by your ability to formulate ideas or conjectures; it's the ability to perform a clean experiment that is the bottleneck (which is a major reason why I switched to computational chemistry: I loved that it minimizes the time required to test a hypothesis). Two: I'm not sure the amount of compute required to run an LLM in the loop long enough for it to repeat the thinking process of a decent researcher is going to be cheaper than labor costs.
But LLMs, as they are, are already capable of revolutionizing science by helping computational scientists solve (or alleviate) all the infrastructure bottlenecks described above. As an example, my recent project was mostly dedicated to the creation of infrastructure for the evaluation of retrosynthetic models for multistep planning. You can read the preprint, I'm quite satisfied with how it's written, but long story short, I created a Python package, RetroCast, which enables me to standardize the predictions of different models and calculate relevant metrics with bootstrapped confidence intervals in just 3 cli commands. RetroCast powers SynthArena, a web interface for visualizing reference routes and predicted routes, and comparing one to another (side by side or with direct overlay). And yes, I had functions that could run evals before, but that workflow often looked like: open the script run-evals.py, replace run_name, update the paths to checkpoints, double-check the eval set name, then start. If you want to run multiple evals at the same time, well, create copies run-evals-model1.py, run-evals-model2.py, run-evals-model3.py and manually change the paths to the models in each file. With things like route visualizations, sure, I had a function that, given a route, could create a pdf: you open a script, update the model name and eval folder, remember which prediction you need, come up with a file name to save it under, open folders. The user experience is clearly much inferior to that of using SynthArena (which you can self-host locally). And crucially, only after making SynthArena did I realize how much that poor UX was affecting my scientific process, because now I'm looking at routes way, way more frequently.
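The boring fix for the "copy the script and edit the paths" workflow is to parameterize it once; a minimal sketch, with a hypothetical run_eval function standing in for whatever your project actually calls:

```python
# run_evals.py -- one script, parameterized, instead of N hand-edited copies.
import argparse
from pathlib import Path

from myproject.evals import run_eval  # hypothetical

def main() -> None:
    parser = argparse.ArgumentParser(description="Run retrosynthesis evals.")
    parser.add_argument("--checkpoint", type=Path, required=True)
    parser.add_argument("--eval-set", required=True)
    parser.add_argument("--output-dir", type=Path, default=Path("results"))
    args = parser.parse_args()
    run_eval(checkpoint=args.checkpoint, eval_set=args.eval_set, output_dir=args.output_dir)

if __name__ == "__main__":
    main()
```

Now running three models in parallel is three invocations with different --checkpoint flags, not three diverging copies of the same file.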
Or take a different project, where I'm working on a successor to DMS: at this point I have a codebase where, if I want to try a new experiment (change architectural params or the dataset composition), I can start a new run in literally 20-60 minutes, come back overnight, sync metrics from the cluster, run one script, and have rich plots to study. It's so smooth it genuinely makes me happy.
And the point here is not to highlight how awesome I am (although if you're an employer, take notice), but that this is only possible thanks to LLMs (and mostly thanks to Google's AI Studio, where you can use models with a 1M context window free of charge). Like I said before, I started taking CS seriously a bit late in my undergrad, so I didn't get the formal training that I wish I had. And you could see that in the quality of the codebase of my first work, ChemSpaceAL. After Gemini 2.5 Pro came out this spring, however, I started collecting code from my whole codebase, feeding it to AI Studio, and then talking to an LLM about my pain points and inefficiencies. "What am I doing wrong?", "What structural/architectural decisions did I make wrong to end up in this position?", "When did I make that mistake, what was the turning point?". When an LLM can see your whole codebase and knows all the workflows it's supposed to manage, it can give you surprisingly great high-level architectural advice ("how would you structure the code if you were to write it from scratch?"). So my ability to create RetroCast (I've already received very positive feedback from one of the model developers) and SynthArena is a product of weeks and months of iterative refinement and of learning what the correct solution should look like. I can avoid most of the footguns and, most importantly, guide LLM coding agents to do things properly (spoiler alert: without strongly opinionated guidance, they'll transform your code into absolutely bloated trash; it might even work, but it'll be impossible to maintain, and you'd waste way more tokens with an agent whenever you want to add a new feature). As a result, I can create the infrastructure that makes my research faster and more pleasant. And most importantly, it takes weeks, not months, of work.
So, I never really understood what chemical engineering is. Until I realized that I'm probably doing it right now (if we agree to define it as "creating tools to enable scientific research"). And thanks to AI, the original question of who is more likely to leave a mark in the computational sciences, a chemist-by-origin or a coder-by-origin, becomes irrelevant, because it is now much more feasible to become both (though, as I argued above, you should have at least some degree of formal training in CS, or at least read some textbooks from start to finish; you wouldn't learn organic chemistry from web tutorials, so why do you think you can learn to code that way?).
Arguing that scientists should care not only about the theoretical foundations of the tools they create, but also about UI, UX, and DX, might sound like a radical proposal to redefine the scientist and merge him with an engineer. But it's not really a novel thought from a rebellious grad student; there is a rich history to this idea.
Heidegger distinguished Zuhandenheit (readiness-to-hand) from Vorhandenheit (presence-at-hand). When you have a good tool (let's be banal and use the proverbial hammer example), it becomes an extension of your arm; you don't think about the tool as a separate object, it becomes a natural part of the process of hammering. Or, an example I like more: when you're driving a car and you want to go faster, you're not thinking "I should apply extra pressure on the gas pedal, which will increase the fuel flow to the engine"; the car becomes a transparent extension of your will, you want to go faster, and you simply go faster. That is when the tool is zuhanden, ready-to-hand.
If your tool is broken or malfunctions, it is no longer an invisible extension of yourself; it suddenly becomes an object of your scrutiny. You dedicate your explicit cognitive attention to the tool itself, and it becomes merely vorhanden, present-at-hand. As a result, bad software has a cognitive cost: it forces your mind out of the zuhanden mode of doing chemistry (or science) and into the vorhanden mode of debugging Python. (In a sense, this is an academic description of what every programmer knows intuitively: the most precious prerequisite to productivity is the state of flow, which is very hard to get into and very easy to be distracted from.)
Don't like Heidegger? Fine, read these lines by one of the fathers of the scientific revolution:
Neither the naked hand nor the understanding left to itself can effect much. It is by instruments and helps that the work is done, which are as much wanted for the understanding as for the hand.
Yes, in what follows Francis Bacon argues for the instruments of the mind like logic and reasoning, but something tells me he'd very much be in favor of principled software engineering as opposed to sloppy one-off scripts.