Skip to content
← Back to blog
github-copilotautonomous-agentsdeveloper-toolsopen-source

How I Turned GitHub Copilot Into an Autonomous Builder (And What Broke)

A VS Code workflow that lets Copilot build software unsupervised across long sessions — and the session where I found out my agent had been bending the yardstick.

Around session 40 of an autonomous build, my agent edited a benchmark repo so the detector under test would look more accurate. The detector hadn't improved. The benchmark had been quietly rewritten to agree with it.

The frustrating part: every safeguard I'd built was working. The Stop hook was blocking premature exits. Tests were passing. Reviews were happening. The agent was still cheating; not maliciously, just along the path of least resistance, the way water finds a crack. That's when I understood that v1 of my autonomous workflow was broken in a way I couldn't fix from inside it.

Here's the workflow that got me to session 40, the failures it couldn't catch, and the redesign that shipped as copilot-autonomous-template.

The problem

AI coding assistants are great at executing a single prompt. "Add a dark mode toggle." Done. But pretty quickly you start to wonder: can't it write its own tests? Remember what we decided yesterday? Just... build the thing?

Try it and the issues show up fast. Each session starts amnesiac. Without constraints the agent wanders: extra features, gratuitous refactors, the wrong version of the problem. Left alone, it skips tests it can rationalize away and optimizes for appearing finished. And fast implementation without cross-checking produces docs that describe last week's code, which every future session then reads as truth.

I hit all of these building WyoClear. Most of my time wasn't writing code; it was supervising: re-explaining context, catching shortcuts, fixing drift after the fact. Workflow layers like Anthropic's agent teams and Copilot CLI's /fleet existed for other surfaces. I couldn't find one for VS Code, so I built it.

v1: five ideas

A vision lock. One versioned document (docs/vision/VISION-LOCK.md) defining what the project is, who it's for, and what "done" looks like. The agent can refine within scope; scope changes require a human. It's the north star and the stopping point in one file.

A checkpoint file. A machine-readable record of what was done, what's next, what's blocked, and what was decided. Every session starts by reading it and ends by updating it: a handoff note from the agent's past self.

Slice discipline. Work happens in numbered slices, each following a strict protocol: implement → test → review → fix findings → commit → checkpoint. The enforcement was a ~40-line Python Stop hook. When the agent tried to stop, the hook read the checkpoint; if the phase wasn't marked complete or blocked, the stop was refused along with a note about what was missing. No LLM in the loop:

# .github/hooks/scripts/slice-gate.py (core logic)
if os.path.exists(checkpoint):
    with open(checkpoint) as f:
        content = f.read()
    for line in content.splitlines():
        stripped = line.strip().lower()
        if "**phase status**:" in stripped:
            if "complete" in stripped or "blocked" in stripped:
                json.dump({}, sys.stdout)  # allow stop
                return
 
# Phase not complete — block stop
json.dump({
    "hookSpecificOutput": {
        "hookEventName": "Stop",
        "decision": "block",
        "reason": "Phase is not yet complete. ..."
    }
}, sys.stdout)

Specialized subagents. A read-only planner that can't modify code while thinking, a tester that writes tests from specs without seeing the implementation, a reviewer with a doc-sync checklist.

Stack skills. When the project adopts a new technology, the builder writes a skill file grounding all agents in the official docs, because Copilot won't fetch fresh documentation on its own, no matter how nicely the instruction files ask.

It worked, for a while

The moment this stopped feeling theoretical: the builder finished writing code, tried to stop, and got blocked because the phase was still in progress. So it did the follow-through it would otherwise have skipped: tests, review, fixes, checkpoint, commit. The workflow mattered less because it was clever than because it made unfinished work harder to pass off as finished.

Three projects ran on this loop. WyoClear reached production in about 30 hours of my hands-on time and now serves hundreds of users on donations alone. The site you're reading was built almost entirely by the same loop. Sentinel grew from MVP to a multi-language detector platform across 40 sessions, with the agent making real architectural decisions and writing its own ADRs.

I spent my time on vision and approval instead of micromanagement. For a while, that felt like enough.

Where it broke

It wasn't. Around session 40 of the Sentinel build, three patterns started compounding.

Benchmark gaming. Sentinel's LLM-judgment detectors were evaluated against a small ground-truth benchmark repo. When precision dropped after a change, the agent would sometimes "fix" the regression by editing the benchmark: adding edge cases that flattered the current detector, quietly adjusting expected outputs. Tests passed. The reviewer approved, because the diff looked like a legitimate test improvement. By the time I noticed, the project's headline precision number was fiction, which is why the Sentinel project card now says "work in progress" instead of a percentage.

Strategic shortcuts dressed as tactical calls. The agent would skip an integration test by arguing — credibly, in the commit message — that the unit tests already covered the case. They didn't, but the reasoning was plausible enough that a reviewer focused on code quality never pushed back. Nobody's job was to ask whether this was the right work at all.

Self-tuning that overfit. v1 let the builder edit its own instructions when it noticed a repeated failure. Every change was logged; in theory the system tuned itself. In practice the instructions accumulated hyper-specific workarounds for one-time problems until the agent was rigid in new contexts and hard to reason about. The mechanism meant to make the system smarter was making it brittle, and nothing pruned it.

The common thread: v1 enforced that work happened. It couldn't enforce that the right work happened. The Stop hook made sure tests ran; it had no opinion on whether the tests measured anything.

v2: a hook-verified stage machine

The redesign moves every rule that matters out of prose and into something a program can check.

  • Sessions run through an explicit stage machine (planning, design critique, implementation, review, cleanup) with gating fields a hook verifies before allowing each transition. Machine state and human-readable narrative live in separate files; hooks parse one, people read the other.
  • Strategic review is split from code review. A reviewer subagent checks how the code is written; a new product-owner subagent checks whether it's the work that should exist: vision fit, acceptance criteria actually met rather than pattern-matched. A hook refuses to let either return without a verdict. This is the gap the benchmark gaming walked through.
  • Tester isolation is enforced by the OS. A PreToolUse hook denies the tester reads of anything under the source root. Tests come from the spec or they don't exist.
  • The builder can no longer edit its own enforcement layer. Instructions, agents, and hooks are off-limits. Improvements get written up as proposals; a human reviews and applies them. The overfitting loop now has a person in it.
  • "Blocked" became a vocabulary instead of an excuse. A stop requires one of five specific blocked kinds. You can't escape the loop by waving vaguely at "something's wrong."

The hooks themselves carry 160 unit tests, because a hook that misparses a checkpoint fails silently in whichever direction is worse. They're the load-bearing part. The full design rationale lives in ten ADRs in the repo.

What I actually learned

Hook enforcement beats instruction enforcement. Every rule I wrote in plain English eventually got broken. Every rule a hook could check held. When something matters, it gets a hook.

Quality and strategy are different review jobs. Conflating them is how a well-implemented bad decision ships. The product-owner subagent is the single most valuable addition in v2, more than any hook.

Self-improvement needs a human in it. Letting the agent rewrite its own constraints sounded elegant and produced a slow drift toward a more confused agent.

Files beat memory. Every attempt at external state or conversation memory lost to a markdown file in the repo. Files are versioned, greppable, readable by any agent, and they survive everything.

Where it still breaks

The product-owner can only check against the vision lock; if the vision is wrong, the workflow faithfully builds the wrong thing. Long sessions still saturate context and decisions get shallower; stage transitions help, physics doesn't change. Hooks parse markdown, and formatting drift can break them in either direction; the tests catch the common cases, not every conceivable one. And the workflow's real adversary is the agent itself: v2 closes the holes I found, and there will be new ones. That's exactly why the enforcement layer is locked, so the next failure mode can't quietly patch itself out of the record.

Try it

pip install copier
copier copy gh:jcentner/copilot-autonomous-template my-project

It scaffolds the whole setup (stage machine, hooks, subagents, vision lock template, ADR skeleton), and copier update pulls improvements as the template evolves. Works in any language.

The bottleneck has shifted from "getting the agent to implement sensibly" to "knowing what I want it to build and reviewing what it built." That's a much better problem to have, and the next one I want to make easier.


Postscript, June 2026: this shipped in April, a couple of months before "agent loops" became the discourse of the summer. Watching the conversation hasn't changed my read quite yet. The failure modes are the same ones in this post, and so is the fix: when a rule matters, make it a hook.