On Mental Models vs LLM Context in Programming
Still thinking about that METR study and I saw John Whiles write this:
"The developers could provide chunks of that mental model to their AI tools - but doing so is a slow and lossy process that will never truly capture the theory of the program that exists in their minds. By offloading their software development work to an LLM they hampered their unique ability to work on their codebases effectively."
This feels to me like a good take on why AI assistants slowed down developers in that study. A core challenge with AI coding assistants is context composition and management. But creating that context takes time and effort that's often orthogonal to writing code.
For developers who can manage complex context in their heads as wordless thought-forms, just writing the code can be faster. AI agents aren't mind-readers, and LLM context windows don't match human headspace.
Here's the irony: AI tools help most in situations where human mental models are weakest, and those situations often exist because expert programmers never tried to write down their mental models in the first place. On a project built by people who left years ago, newly onboarded engineers struggle to form an accurate mental model because the veterans who had one never documented it. New engineers have to start from scratch.
LLMs can ingest unfamiliar code faster than humans and generate changes that work because they resemble historical patterns. More importantly, they excel at transforming code symbols into readable explanations, roughly reverse-engineering a likely facsimile of the documentation that should have existed.
If you expect to work on a project long-term and want to truly understand it, you need to do the homework: write the code yourself with maybe a little LLM advice and way-finding. At the very least, you should be reviewing LLM-generated code in detail before merging it.
The best approach treats these tools as what they are: power tools for symbol transformation and pattern completion, grinding through boilerplate work that doesn't demand high creativity. They're not replacements for human thinking about architecture, requirements, and constraints.