Research & Writings
Summaries of my academic work, long-form articles, tutorials, and miscellaneous notes. Filterable by topic.
An observation well-known to any designer or coder is that success and quality are a function of the number of iterations. (Fun story: my freshman winter break I took a How to Design intensive (4.02A) from MIT’s Architecture department, and I expected to be taught the methodological procedure for creating good design, but instead the class was basically: here’s a piece of styrofoam, just do something with it. And when you try to do something and get something shitty, you expect some magical advice, but all you hear is: that’s fine, just iterate. At the time, I thought those 3 weeks were such a waste of my time, but looking back, it was honestly one of the most based classes.) What if this is a more universal law of nature? What if it applies equally well to the physical and life sciences?
A huge reason (or at least post-factum rationalization) for my shift from experimental to computational research was the realization that the time gap between a formulation and verification of a hypothesis in computational sciences is orders of magnitude smaller, so if I liked coming up with ideas and testing them (which I did), I could do much more with a computer than with my hands. I always considered this to just be a personal preference, but what if rapid iteration is a prerequisite for any good science?
What if the reason for the difference in the rates of progress in the worlds of atoms and bits (if you’re not terminally online: this is a distinction introduced by Peter Thiel, who observed that over the past 50 years we had tremendous progress in the world of bits, while in terms of atoms the world hasn’t changed much since the 1970s) is not that “we’ve already picked all the low-hanging fruit”, nor the lack of effort, nor even a combination thereof, but the lack of specific engineering effort targeted at reducing experimentation latency as hard as possible? Which would be very plausible, since it might necessitate spending weeks, months, if not years on work that is orthogonal to actual “doing science” (even though it might be a prerequisite for paradigm shifts in the long term)?
Revisiting the argument made in “We need SWE minds”, what if linting, type checking, testing, versioning (branching, commits, PRs) as abstract concepts are not unique to programming in any deep ontological sense, but are more universal ways of accelerating and measuring progress? And so it’s not that “you can’t define these things in experimental science”, it’s just that it’s a bit harder to define them in experimental science, and they’re less likely to organically crystallize as common-sense best practices (like they do in coding), but if we force ourselves to search for how to define the parallels, we will find them and as a bonus will accelerate experimental research.
In other words, what if some disciplines provide a more fertile ground for the articulation of best practices (because they have cleaner feedback loops or are, for structural reasons, more susceptible to formalization), but these practices, once abstracted to their normative core, have universal scope?
The rapid improvement of AI capabilities in math and coding is partially caused by the success of reinforcement learning on verifiable rewards (RLVR). The basic idea is that you can easily check whether a math proof is correct (e.g. by checking if it compiles in Lean), or whether code is correct by checking if it passes test cases.
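To make this concrete, here is a minimal sketch of what a verifiable reward for code generation could look like: run a candidate solution against test cases and return 1.0 only if everything passes. The task (sorting) and all names here are made up for illustration; real RLVR pipelines sandbox the execution and do much more.

```python
# Toy "verifiable reward" for generated code: the candidate source is
# expected to define a function `solve`; the reward is 1.0 iff it passes
# every test case, 0.0 otherwise (including on crashes).

def verifiable_reward(candidate_src, test_cases):
    """Return 1.0 if the candidate passes all test cases, else 0.0."""
    namespace = {}
    try:
        exec(candidate_src, namespace)  # candidate defines `solve`
        solve = namespace["solve"]
        for args, expected in test_cases:
            if solve(*args) != expected:
                return 0.0
    except Exception:
        return 0.0
    return 1.0

cases = [(([3, 1, 2],), [1, 2, 3]), (([],), [])]
good = "def solve(xs):\n    return sorted(xs)\n"
bad = "def solve(xs):\n    return xs\n"
print(verifiable_reward(good, cases), verifiable_reward(bad, cases))  # 1.0 0.0
```

The point is that the reward is computed, not judged: no human in the loop, so you can run it millions of times.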
And things like versioning/commits make it incredibly easy to generate rich synthetic datasets. For those unfamiliar, a commit is an efficient snapshot of a codebase: basically, programming gives you a way to track every single change to your documents without creating manual copies of each document and ending up with a zoo of file names:
.
├── my-code-first-version-v1.txt
├── my-code-some-feature-added-v2.txt
├── my-code-some-bug-fixed-v3.txt
├── my-code-final-release-v4.txt
└── my-code-bug-fix-final-final-release-v5.txt

One obvious way to train on code is to use commit history as paired examples of intent and targeted change. Take my-code-first-version-v1.txt, craft a prompt for which a perfect response would be all the differences added to make my-code-some-feature-added-v2.txt, and then train a model on the resulting (prompt, code changes) pairs. For mature codebases, commits are usually small, targeted, atomic changes, so a hobby or open-source project might have hundreds or thousands of them, and enterprise codebases (e.g. Google’s monorepo) have tens of millions of commits. And you can create a very high-quality training scenario from any pair of commits.
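The pipeline above can be sketched in a few lines by shelling out to git: use each commit message as a stand-in for the “intent” prompt and the commit’s diff as the target response. This is a deliberately naive sketch (a real pipeline would filter merge commits, giant diffs, vendored files, etc.), and the prompt construction here is the crudest possible choice.

```python
# Naive sketch: turn a repo's commit history into (prompt, diff) pairs.
# The commit message stands in for the intent; the diff is the target.
import subprocess

def _git(repo, *args):
    """Run a git command in `repo` and return its stdout."""
    return subprocess.run(
        ["git", "-C", repo, *args],
        capture_output=True, text=True, check=True,
    ).stdout

def commit_pairs(repo, limit=100):
    """Yield (commit message, diff) pairs from the repo's history."""
    hashes = _git(repo, "rev-list", "--no-merges",
                  f"--max-count={limit}", "HEAD").split()
    for h in hashes:
        message = _git(repo, "log", "-1", "--format=%s", h).strip()
        diff = _git(repo, "show", "--format=", h)  # diff only, no header
        yield message, diff
```

Usage is just `list(commit_pairs("/path/to/repo"))`; each yielded pair is one training example of the kind described above.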
Naturally, when an AI bro, seeing how far you can push AI models in this environment, says “it’s so over, we’re going to cure all disease,” it’s very easy and tempting to counter-argue that none of this transfers to experimental science: there are no commits, no test cases, no verifiable rewards.
And I find it very, very hard to even consider for a second that git-style versioning would not be revolutionary in experimental research. Maybe we can’t perfectly repeat an experiment because life is inherently stochastic and cells might just decide not to reproduce. Or maybe you’re simply following a slightly different procedure because whatever you wrote in your lab notebook is an inaccurate representation of the steps you actually took last time.
Say you want to have a perfect account of all chemical reactions performed in a lab. Here’s how a human typically approaches synthesis: you find a literature precedent for the reaction you want to perform (say A + B = C) and you check the ratios of moles of each compound used, e.g. 1.1 moles of B for 1 mole of A (optimal ratios might not be 1:1 even for a reaction that is on paper 1:1). You recalculate the amounts of A and B you need for your particular scale (say you need 50 mg of C), and you write in your notebook the masses of A and B to add, like 34.2 mg of B and 5.3 mg of A. Then you read the procedure from the precedent, which might look like “dissolve A in solvent, cool down to 0°C, add B”, and ideally you write that procedure in your lab notebook. You go to the lab, get the actual containers with A and B, go to the balances, and try to weigh out roughly 34.2 mg of B and 5.3 mg of A. Because the density of compounds in solid form can vary quite a bit, you might misjudge the mass corresponding to a full scoop and take 5.5 mg of A out of the container. If you’re doing things properly, you’re not supposed to put stuff back into the container (compounds might be hygroscopic, i.e. they absorb moisture from air), so you either need to discard the rest, or you might think: ah, not too big of a difference, let’s yolo 5.5 mg.
If you were to ask a competent SWE to engineer a system for accounting of all chemical reactions, he’d create dedicated fields for every mass measurement and an inventory of all compounds, which would let a user pre-select compounds from a list so the program automatically knows things like molecular weight. Then, when a user enters 5.5 mg, the system can do a sanity check and recalculate the de-facto ratios of compounds: if your intent was 1.1 B to 1 of A, with those extra 0.2 mg of A you have roughly 3.8% more A, so your de-facto ratio is more like 1.06 B to 1 of A, which might be a very significant difference.
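That sanity check is one line of arithmetic once the data is structured. A minimal sketch, with all names invented and nothing drawn from any real electronic-lab-notebook system: since we compare compound A against itself, the molecular weight cancels out and moles scale linearly with mass.

```python
# Hypothetical ELN sanity check: recompute the de-facto molar ratio of
# B to A after a weighing error on A. Names and numbers are illustrative.

def defacto_ratio(planned_mass_a_mg, actual_mass_a_mg, intended_ratio_b_to_a):
    """De-facto B:A molar ratio given the mass of A actually weighed out.

    Same compound on both sides, so molecular weight cancels:
    moles of A scale linearly with its mass.
    """
    excess_factor = actual_mass_a_mg / planned_mass_a_mg  # e.g. 5.5 / 5.3
    return intended_ratio_b_to_a / excess_factor

ratio = defacto_ratio(5.3, 5.5, 1.1)
print(f"de-facto ratio: {ratio:.2f} B : 1 A")  # 1.06 instead of the intended 1.10
```

A single-text-box lab notebook can never run this check, which is exactly the point of the next paragraph.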
Never in a million years would an experienced SWE think to just slap in a single text box and ask a user to enter all relevant information. Because if you think you’ll collect high-quality data like that, boy, I have a bridge to sell you. And yet this is exactly how thousands if not millions of experimental scientists log their experiments every day! And you’d be lucky if the procedure is written in a lab notebook before the experiment is actually performed, and not post-factum from the scientist’s recollection at the end of the day (hopefully that same day). Because it surely can never happen that a grad student comes to the lab at 9am, realizes that his experiment would run 6 hours, has some “research in progress” talks at 10-11am, and realizes that if he writes everything down before the experiment, he won’t have time to actually set it up before the RIP (yeah, that’s the abbreviation many departments unironically use) talks, but starting it at 11 would mean he wouldn’t even be able to start a workup before 5pm. Which of course is never going to be a relevant consideration, because grad students are surely trained to treat science as a holy mission, so he surely won’t decide to set up the experiment before the RIP talks real quick and do the writeup later, after his short-term memory has been cognitively overloaded with slides showing tens of bizarre abbreviations denoting proteins and their intricate interplays (here’s an example sentence pulled from a Nature paper: “to provide further evidence that IκB modification may be regulated by NHERF2, we co-transfected Flag-tagged IκB and Myc-tagged NHERF2 with two HA-tagged Ub (wild type Ub or K48-linkage-type-only Ub, HA-Ub or HA-K48) into HEK293T cells”) and not-at-all self-explanatory charts. Surely his recollection of the actual steps performed 2 hours ago to set up an experiment will be accurate. Right? RIGHT?
Basically, my point is: yes, we can’t cleanly extrapolate the way we train AI to write code into experimental science research, but not because the latter is somehow epistemically exceptional and therefore immune to the technological scaffolding that makes dataset construction and verification computationally tractable. It’s because academia was too proud to pay market rates to competent SWEs to digitize data collection, so it hasn’t been done yet. But there’s no reason to believe it couldn’t be done, and most importantly, no reason to believe it shouldn’t be done. Because maybe (and this is a crazy thought) human-led progress also requires highly structured, accurate accounting of all performed actions and helpful runtime sanity checks on all quantitative data. So maybe the lack of clear RLVR is a roadblock not just to AI progress, but to any progress.
other voices in this fugue
A crude script for scheduling olympiad arbitration shows how mundane software can carry absurd downstream stakes.
The habits and thinking process trained by software engineering are valuable in almost every other domain.
A full translation of an interview with Grigori Perelman's math teacher. He explains Perelman's rejection of the Fields Medal as a protest against a 'dishonorable' math community that treats theorems as a commodity to be stolen. Also features a brutal, unapologetic defense of Soviet-era educational philosophy.
Deriving the necessity of eternal punishment from the Prisoner's Dilemma. How infinite repeated games, discount factors, and the Folk Theorem explain the structural utility of Hell in fostering human cooperation.