Designing the factory: AI coding beyond the prompt

A year ago, the limiting factor in shipping software was how fast you could type. That's gone. The machine writes faster than any of us — faster than you can read, let alone review. So the question stops being "can I build this?" and becomes: how do I stay in control when the thing I'm directing moves faster than I can keep up with?

That's the orchestration problem. This is what I've learned trying to solve it — scaling up beyond a single Claude Code session to managing many agents at once, while making sure things don't run out of control. Building the factory that's run by AI agents.

The bottleneck moved

Writing code used to be the slow part. The job we trained for was heads-down, writing code by hand. But the typing is free now. The slow part is everything around it: directing the agents, checking their work, deciding what's good and what isn't.

The bottleneck didn't disappear. It moved — to managing agents, and to managing their output. And it moved to the part that's hardest to automate: judgment.

The shape of the work

For about as long as we've had software, we've described development as two loops.

The inner loop is you and your code — your editor, writing and testing and debugging. The tight, personal cycle. The outer loop is delivery and operations — CI/CD, deploy, the big slow cycle that gets your work out into the world. We've optimized both to death over twenty years.

What's happening now is that a third loop is appearing, right in the gap between them. And unlike the other two, nobody's spent twenty years figuring it out. It doesn't have settled tools, settled practices, or even a settled name.

[Figure: two interlocking development loops with a faint third loop forming in the gap between them]

I didn't coin the term. It came out of a Thoughtworks retreat earlier this year, marking the 25th anniversary of the Agile Manifesto. A room full of senior engineers kept circling the same thing: a new layer of work, sitting between writing code and shipping it. The middle loop.

It's supervisory work. Directing agents. Evaluating their output. Holding architectural coherence across a dozen parallel streams you didn't personally write. It demands the ability to decompose problems into agent-sized work packages, calibrate trust in agent output, and recognize when agents are producing results that look plausible but are subtly wrong.

And here's the important part: it needs a completely different skill set than coding. It's also genuinely hard to evaluate. How well is it going? Is the number of tokens you burn tied to any real output? We don't have good answers yet — that's what makes this the interesting frontier.

The eight stages

So how do you get from being stuck in the inner loop to working in the middle loop and beyond? Steve Yegge described eight stages of how developers adopt AI. They make a useful staircase — and it's worth asking, honestly, which step you're standing on today.

[Figure: a staircase of glowing connected nodes ascending into greater complexity]

Stage 1 — Hands on the keyboard. AI lives at the edges of your workflow. Copilot-style completions appear at your cursor; you accept them when they look right, or occasionally ask a chatbot a question in a separate window. You still write the code. This is where we were three or four years ago.

Stage 2 — IDE agent, with permissions. A coding agent runs in a sidebar of your IDE, but you approve every tool it wants to use. Each command, each file edit prompts a confirmation. The training-wheels phase — because you're still afraid it'll delete your production database.

Stage 3 — IDE agent, off the leash. Trust has grown. You turn permissions off and let the agent run commands and edit files without asking. It gets bolder; you get faster. Some current tools offer an auto mode that still blocks the genuinely destructive actions, which makes this leap easier to take.

Stage 4 — IDE agent, full-screen. The agent's chat or plan window has taken over most of your screen. You spend more time reading and responding to its output than writing code yourself. Code mostly shows up as diffs you review before they ship.

Stage 5 — One agent in the terminal. You've left the IDE for a terminal-based agent like Claude Code. One instance running, diffs scrolling past faster than you can read. You realize you don't actually need the editor anymore — just a terminal and the agent doing whatever needs doing. You glance at the diffs. Sometimes you don't.

Stage 6 — Several agents in parallel. Three to five agents at once — different worktrees, different terminals, different tasks. One research agent, one code-review agent, one implementation agent. The throughput is real, and so is the cognitive load. Juggling all those sessions starts to take serious mental overhead.

Stage 7 — A swarm, hand-managed. Ten or more agents in parallel, coordinated by you, manually. Maybe a coordinator agent dispatching work; agents starting to talk to each other. But it's still a lot of manual intervention, and you can feel the ceiling — tracking which branches are live, what's been merged, who's stuck waiting. Effective, but hard to sustain. This is where I spend most of my time.

Stage 8 — A self-managing system. You stop managing agents one by one and start building the system that manages them. Persistent state, coordination layers, automated dispatch. You're not using the infrastructure anymore — you're building it. More on this below.

My workflow: stages 6 and 7

For me, stages six and seven look like an always-on laptop running persistent Claude Code sessions, with a few supporting pieces: Git worktrees and GitHub Actions to give agents isolated branches and CI, agent-browser for QA and testing, and a deep-research tool for the investigative work.
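To make the worktree piece concrete, here is a minimal sketch of giving each agent task its own branch and working directory. The `agent/<task>` branch naming and the sibling-directory layout are assumptions for illustration, not a description of my actual scripts; the only real interface used is `git worktree add -b <branch> <path>`.

```python
from pathlib import Path
import subprocess


def worktree_command(repo: Path, task: str) -> list[str]:
    """Build the git command that creates an isolated worktree and branch for one task."""
    repo = repo.resolve()                        # absolute paths, so -C cannot shift them
    branch = f"agent/{task}"                     # hypothetical branch naming scheme
    path = repo.parent / f"{repo.name}-{task}"   # hypothetical layout: one sibling dir per task
    return ["git", "-C", str(repo), "worktree", "add", "-b", branch, str(path)]


def create_worktrees(repo: Path, tasks: list[str], dry_run: bool = True) -> list[list[str]]:
    """Return one command per task; only run them when dry_run is False."""
    commands = [worktree_command(repo, task) for task in tasks]
    if not dry_run:
        for cmd in commands:
            subprocess.run(cmd, check=True)      # raises if git reports an error
    return commands


if __name__ == "__main__":
    for cmd in create_worktrees(Path("."), ["fix-login", "add-search"]):
        print(" ".join(cmd))
```

The dry-run default means the script only prints what it would do until you opt in. Each agent session then starts inside its own directory, so parallel agents never edit the same checkout.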

The interesting part is how I supervise the fleet. Three tools, and stepping back, they're really one idea — three time horizons of supervision over one swarm:

Telegram → decide now. A bot gives me real-time control from my pocket.
Beads → remember always. Persistent, structured memory and agent coordination — the swarm's long-term state.
Notion → review later. Agent reports, prompt templates, and a knowledge base for async judgment.

Now, always, later. Three horizons, one fleet underneath. And the control surface for that fleet can be boring and already in your pocket — you don't need to build mission control. Purpose-built tools for exactly this are appearing every week, too.
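One way to picture the three horizons is as a routing decision made once per agent event. The sketch below is illustrative only: the event kinds and the `ROUTES` table are hypothetical, and it makes no calls to the Telegram, Beads, or Notion APIs. It only decides which horizon an event belongs to.

```python
from dataclasses import dataclass
from enum import Enum


class Horizon(Enum):
    NOW = "telegram"    # decide now
    ALWAYS = "beads"    # remember always
    LATER = "notion"    # review later


@dataclass(frozen=True)
class AgentEvent:
    agent: str
    kind: str       # hypothetical event kinds, e.g. "needs_approval"
    summary: str


# Hypothetical mapping from event kind to supervision horizon.
ROUTES = {
    "needs_approval": Horizon.NOW,
    "blocked": Horizon.NOW,
    "task_state": Horizon.ALWAYS,
    "report": Horizon.LATER,
}


def route(event: AgentEvent) -> Horizon:
    """Pick a horizon for an event; unknown kinds fall back to async review."""
    return ROUTES.get(event.kind, Horizon.LATER)


if __name__ == "__main__":
    event = AgentEvent("impl-3", "needs_approval", "wants to run a schema migration")
    print(route(event).value)
```

Sending unknown event kinds to the review queue rather than to the phone is a deliberate choice in this sketch: the real-time channel stays quiet unless something actually needs a decision.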

What changes at stage 8

At stage eight the work changes shape completely. You stop managing agents and start governing them.

You can't be in every loop — there are too many. So you encode yourself into the loop: you write the rules, and the system enforces them. The thing that struck me while building it is that the bottleneck stops being your hands and becomes your judgment, up front. You're not deciding task by task anymore — you're deciding the rules before you ever see the work. More autonomy means less live control. By design.
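As a rough illustration of writing the rules and letting the system enforce them, the sketch below expresses a few policies as data and applies them to every proposed change. The `Change` fields, the three rules, and the 400-line threshold are all hypothetical; the point is only that the judgment is written down once, ahead of time, and applied without you in the loop.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Change:
    agent: str
    files: tuple[str, ...]
    lines_changed: int
    tests_passed: bool


@dataclass(frozen=True)
class Rule:
    name: str
    allows: Callable[[Change], bool]   # True when the change satisfies the rule


# Hypothetical rules; a real set would encode your own judgment.
RULES = (
    Rule("tests must pass", lambda c: c.tests_passed),
    Rule("no edits to migrations",
         lambda c: not any(f.startswith("migrations/") for f in c.files)),
    Rule("small diffs only", lambda c: c.lines_changed <= 400),
)


def violations(change: Change, rules: tuple[Rule, ...] = RULES) -> list[str]:
    """Return the names of every rule the change breaks, in rule order."""
    return [rule.name for rule in rules if not rule.allows(change)]


def decide(change: Change) -> str:
    """Auto-merge clean changes; escalate anything that breaks a rule."""
    broken = violations(change)
    return "auto-merge" if not broken else "escalate: " + ", ".join(broken)


if __name__ == "__main__":
    print(decide(Change("impl-1", ("app/views.py",), 42, True)))
    print(decide(Change("impl-2", ("migrations/0007.py",), 900, True)))
```

Anything that breaks a rule goes to a human instead of being silently rejected. That keeps the trade-off visible: the rules buy autonomy for the common case and reserve live attention for the exceptions.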

If you've ever watched a founder become a leader — instructions hardening into policy — it's exactly that. It's org design. Just for non-humans.

[Figure: an organizational network of glowing agent nodes branching from a single governing point]

The honest part: is it any good?

So the obvious question — is any of it actually good? Honest answer: it's a mix. Some of it is ready to ship today. Some is a decent first draft that saves me an hour of staring at a blank page. And some is confidently, plausibly wrong — the worst kind, because it looks finished.

An hour of generation bought me a campaign's worth of raw material — not a campaign. And that's the real lesson. The work didn't go away; it moved. Turning a pile of raw material into something shippable — selecting, judging, killing the bad stuff, fixing the almost-right stuff — is the job now.

It's not free, either. Even simple tasks propagate through several agents and burn tokens. And get the constitution wrong — the rules you encoded at stage eight — and the whole swarm is wrong, fast.

Which is why the engineers who do well here aren't the ones who can prompt the fastest. They're the ones who can look at a wall of plausible output and instantly tell what's gold and what's garbage. That's taste. That's judgment. And it got more valuable, not less.

What I'd take home

The validation layer matters more than the generation layer. Build deterministic gates for non-deterministic output — that's what tests are for now.
The control surface can be boring and already in your pocket. You don't need to build mission control to supervise a fleet.
Find your stage, then deliberately try the next one. You don't graduate by deciding to. You graduate when the friction forces you.
Most teams will never need stage eight, and that's fine. The skill isn't getting to eight — it's knowing which stage is enough for what you're building.
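The first takeaway, deterministic gates for non-deterministic output, can be sketched in a few lines. This is a minimal example under stated assumptions, not a production gate: the function name `gate`, the case format, and the `<agent>` filename are made up for illustration, and `exec` runs the generated code in-process, which a real setup would replace with a sandbox.

```python
import ast


def gate(source: str, func_name: str, cases: list[tuple[tuple, object]]) -> tuple[bool, str]:
    """Accept generated source only if it parses, defines func_name, and passes every case."""
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        return False, f"syntax error: {exc.msg}"

    defined = {node.name for node in tree.body if isinstance(node, ast.FunctionDef)}
    if func_name not in defined:
        return False, f"missing function {func_name}"

    namespace: dict = {}
    try:
        # NOTE: this executes untrusted code; sandbox this step in real use.
        exec(compile(tree, "<agent>", "exec"), namespace)
        for args, expected in cases:
            got = namespace[func_name](*args)
            if got != expected:
                return False, f"{func_name}{args} returned {got!r}, expected {expected!r}"
    except Exception as exc:
        return False, f"raised {type(exc).__name__}"
    return True, "ok"


if __name__ == "__main__":
    candidate = "def add(a, b):\n    return a + b\n"
    print(gate(candidate, "add", [((1, 2), 3), ((0, 0), 0)]))
```

The generation step can vary from run to run, but the same candidate always gets the same verdict. That is what lets you trust output you have not read line by line.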

The engineers who do well in all this aren't the fastest typists anymore. They're the ones with the judgment to know what's worth shipping. That's a better job than the one we had.