Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity - METR:
We conduct a randomized controlled trial (RCT) to understand how early-2025 AI tools affect the productivity of experienced open-source developers working on their own repositories. Surprisingly, we find that when developers use AI tools, they take 19% longer than without—AI makes them slower.
I'm seeing plenty of "I told you so" as this makes the rounds. But having spent the past month deep in AI-assisted coding, I find it directly contradicts my experience. Maybe I've drunk too much Kool-Aid, but I don't think I'm entirely delusional. I want to head-scratch through this, though.
Small sample, specific scenario
we recruited 16 experienced developers from large open-source repositories (averaging 22k+ stars and 1M+ lines of code) that they’ve contributed to for multiple years.
That's a rather small sample and a very specific scenario, isn't it?
The researchers themselves acknowledge the limitations:
We caution readers against overgeneralizing on the basis of our results. The slowdown we observe does not imply that current AI tools do not often improve developer’s productivity—we find evidence that the high developer familiarity with repositories and the size and maturity of the repositories both contribute to the observed slowdown, and these factors do not apply in many software development settings. For example, our results are consistent with small greenfield projects or development in unfamiliar codebases seeing substantial speedup from AI assistance.
Maybe this double-negative-enriched bit from the "Key Caveats" section basically jibes with my experience? My recent work (admittedly, a sample of 1) has indeed been largely with greenfield projects and relatively unfamiliar codebases.
It does matter how you use it
The researchers hint at something important:
We expect that AI systems that have higher fundamental reliability, lower latency, and/or are better elicited (e.g. via more inference compute/tokens, more skilled prompting/scaffolding, or explicit fine-tuning on repositories) could speed up developers in our setting (i.e. experienced open-source developers on large repositories).
I think this rhymes with my experience. When I've just charged into a rambling chat & autocomplete session with Cursor, things steer into the ditch early and often.
But when I've worked with Claude Code through a multi-step process of describing the problem, asking the agent to prompt me with clarifying questions, reviewing the problem and considering a solution, breaking it down into parts, and then asking the agent to methodically execute—that's yielded decently reliable success.
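To make that concrete, here's a rough sketch of the kind of session I mean. The prompts are made up for illustration, not a transcript:

```text
1. "Here's the problem: [description of the bug/feature and the relevant files]."
2. "Before writing any code, ask me clarifying questions about anything that's ambiguous."
3. (Answer the questions, then:) "Summarize the problem back to me and propose a solution."
4. "Break that solution into small, independently verifiable steps."
5. "Execute step 1 only, then stop so I can review before we move on."
```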
Waiting, or lack thereof
The study notes:
All else equal, faster AI generations would result in developers being slowed down less. Qualitatively, a minority of developers note that they spend significant time waiting on AI to generate code.
I rarely wait, because I'm juggling multiple projects. When one agent instance is working, I switch to another window. Sometimes it's a separate git worktree of the same codebase. Yes, context switching is tiring, but it also seems to help me overcome ADHD-related activation energy barriers?
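The worktree trick is just stock git. A minimal sketch, with hypothetical paths and branch names:

```bash
# Add a second working copy of the same repo on a new branch,
# so another agent instance can run there without disturbing the first checkout.
git worktree add -b agent-experiment ../myproject-experiment

# ...let the agent work in ../myproject-experiment in another window...

# Clean up the extra checkout when you're done with it.
git worktree remove ../myproject-experiment
```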
Over the years, there've been days when I just sit there staring at the IDE window, poking my brain with a stick saying "c'mon, do something" and nothing happens for an hour or more. I'm not planning my next move, I'm just dissociating. My executive function doesn't, like, function. Often. My own brain makes me wait long periods of time before it starts generating useful results. 😅
Maybe it's the cycling novelty that gets me going? I enjoy task switching between prosing and coding. I enjoy finding that the model appears to have "read" everything—evidenced by it echoing my intent back in code or follow-up questions. I enjoy discovering that while I was in another window, new things happened in the background for me to review.
I've also found that many agents are reliable at handling drudgery. Re-jiggering data structures, applying repeated refactorings, etc. Those tasks can seize me up for tens of minutes at a time with brain-killing waves of tedium. But usually, I can just tell the bot to do it, while I turn to more interesting stuff.
Summing up
One last quote from the study:
Although the influence of experimental artifacts cannot be entirely ruled out, the robustness of the slowdown effect across our analyses suggests it is unlikely to primarily be a function of our experimental design.
This study provides one data point about one specific scenario: experienced developers using specific tools on massive, mature codebases. The researchers themselves caution against overgeneralization, noting that different contexts likely yield different results.
These tools aren't magic and they're not universally beneficial. But dismissing them based on this narrow study would be premature. The key is understanding when, how, and why to use them—something that's still evolving rapidly as both tools and techniques improve.