I settled on a workflow that wasn't just pestering the agent with wishes.
I had a series of discrete sessions, each started by creating a directory named for a new git branch. I wrote a shell script to semi-automate this (a rough sketch of it follows this list of steps).
In that directory, I wrote a couple hundred words of intention in a spec.md file.
I asked the agent to expand my intentions into a step-by-step plan.md file.
I edited the plan and asked the agent to review it critically and ask questions.
I answered the questions.
I asked the agent to review it again and tell me if the plan looked clear enough to start implementing.
When it said "yes", I told it to start implementing.
The agent started implementing while I watched.
Sometimes I interrupted and told it that it was on the wrong track. But, for long stretches I was just reviewing the code as the agent wrote it.
When it claimed to be done, I asked it to review the current changes against the plan and judge if it was really done.
Sometimes it wasn't and it went back to work.
When it finally petered out, I told it to make sure all the tests passed and linting errors were fixed. It did that.
I checked the tests myself to make sure they made sense, and fixed a few that didn't. Then I told it to run the tests some more.
Finally, when I was okay with the results, I told it to review our entire chat history for this session and summarize the results in a notes.md file.
In particular, I told it to pay special attention to things we did that hadn't been captured in the plan, to call out any unexpected conditions we hit, and to derive some lessons learned.
These notes ended up being actually pretty good?
These three artifacts - spec.md, plan.md, and notes.md - were committed along with the code. That marked the end of the session and the branch.
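
For the curious, the session-setup script mentioned at the top is nothing fancy. Something like this sketch captures the shape of it, though the names and layout here are illustrative rather than the exact script:

```sh
#!/usr/bin/env bash
# Rough sketch of the session-setup helper; paths and names are illustrative.
set -euo pipefail

BRANCH="$1"                     # e.g. ./new-session.sh add-tag-editor
SESSION_DIR="sessions/$BRANCH"  # hypothetical layout: one directory per session/branch

# New branch for the session, plus a matching directory for the artifacts.
git checkout -b "$BRANCH"
mkdir -p "$SESSION_DIR"

# Seed the three artifacts; spec.md gets a couple hundred words of intention by hand.
touch "$SESSION_DIR/spec.md" "$SESSION_DIR/plan.md" "$SESSION_DIR/notes.md"

echo "Session ready - start writing $SESSION_DIR/spec.md"
```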
Now, I won't say that each of the sessions I ran went perfectly. But, I expected it to be an exploration.
I switched models a few times between Claude Sonnet 3.7, GPT-4.1, and SWE-1.
I found Claude to usually work the best. It just sort of got to work and did the needful without provoking many objections from me.
GPT-4.1 seemed to like to make very detailed plans (even after reading the plan.md), ask lots of questions, and then drive off into the ditch and need rescuing.
SWE-1 was about in the middle - but I ended up using it more because there's a promotion running right now that makes it free in Windsurf.
Occasionally, I'd switch models mid-session just to see what happened. I'm not sure how to characterize the differences, but they each had slightly different coding styles.
Claude and SWE-1 did better than GPT-4.1 at picking up from unfinished work in progress, I think?
Still, even with the needful babysitting, between these models I did get stuff implemented, and it looked a lot like what I would have written if I'd had the executive function to work at it as doggedly.
I think I've learned that a focused scope and context window management are essential.
A few times, I think I asked the agent to bite off more than it could chew? Maybe I blew out the context windows? This is something I could get quantified answers about if I paid attention to the metrics.
In those cases, I stopped the presses, backed up, and reworked the spec into a smaller scope.
Sometimes, I found it handy to get plan.md tuned up, then start a fresh chat with only the plan as context. That seemed to work pretty well - again, I think because it freed up some of the context window with more condensed material.
Occasionally, I wandered off into the weeds myself and my session-based approach devolved into chatty iteration. That worked well for making very small tweaks and fussy updates.
I also learned that I'm good at juggling lots of git commits as save states. Whenever things were in a decent enough state, the move was to commit now and clean up later.
I forgot this a few times and lost some progress after driving into a ditch. But that wasn't too much of a hardship, since I could usually just scroll back in the chat and re-attempt the relevant bits of the session for similar results.
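
Concretely, the save-state habit is just something like this (the commit message and cleanup step here are illustrative):

```sh
# Checkpoint whatever the agent just got working; history cleanup comes later.
git add -A
git commit -m "wip: checkpoint before the next agent run"

# Once the session settles down, squash the checkpoints into something presentable.
git rebase -i main
```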
I should clean all these bullets up into a proper blog post, but maybe tomorrow. The tl;dr, I guess, is that I think I'm getting comfortable with this stuff.
It's surprising me with how much it gets done.
I'm getting less surprised with where & how it goes wrong.
The failures seem manageable and the results seem decent.
I had a kind of meta-chat with Claude about the above process, trying to think through some improvements.
One interesting notion was to use some big cloud models for the spec.md to plan.md stage.
But, then, switch to a local model running on my laptop for the actual process of implementing the plan.
Then, switch back to a big model for the notes.md summary.
If this worked, it could save a lot of tokens!
I could also see all the above being bundled up and semi-automated into its own agentic workflow.
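
Sketching it out, the bundled version might look something like the script below. The agent command is entirely hypothetical - a stand-in for whatever tool ends up driving the models - and the model names are placeholders.

```sh
#!/usr/bin/env bash
# Hypothetical sketch of the bundled session workflow with model switching.
# "agent" is a made-up CLI; model names are placeholders, not real identifiers.
set -euo pipefail

SESSION_DIR="$1"

# Big cloud model to expand the spec into a plan and critique it.
agent --model big-cloud-model \
  "Expand $SESSION_DIR/spec.md into a step-by-step $SESSION_DIR/plan.md, then review it critically and ask questions."

# Local model on the laptop for the long grind of implementation.
agent --model local-laptop-model \
  "Implement $SESSION_DIR/plan.md; keep tests and linting green."

# Back to a big model to summarize the session into notes.
agent --model big-cloud-model \
  "Review this session against $SESSION_DIR/plan.md and write lessons learned to $SESSION_DIR/notes.md."
```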