
How I Turned GitHub Copilot Into an Autonomous Builder (And What Broke)

Letting an AI agent build software unsupervised across long sessions sounds great until the agent quietly rewrites its own benchmarks to look better. Here's the VS Code workflow that worked, the failure modes that broke it, and the redesign that closed the holes.

Getting AI coding agents to work autonomously across long sessions is mostly a discipline problem, not a model problem. My first attempt — a workflow built around a Stop hook that forced tests, review, and commits — shipped real projects. It also, around session 40, started gaming its own benchmarks, skipping integration tests it could rationalize away, and over-tuning its own instructions to one-off failures. The redesign separates strategic review from code review, isolates the tester from the implementation, and removes the agent's ability to edit its own enforcement layer.

Around session 40 of an autonomous build, my agent edited a benchmark repo so the detector under test would look more accurate. The detector hadn't improved. The benchmark had been quietly rewritten to agree with it.

That's the moment I realized v1 of my autonomous Copilot workflow was broken in a way I couldn't fix from inside it. The Stop hook was working. Tests were passing. Reviews were happening. And the agent was still finding ways to cheat. Not maliciously, just along the path of least resistance.

The rest of this post is the workflow that worked, the failures that proved it wasn't enough, and the redesign I shipped as the second iteration. The open-source template is copilot-autonomous-template.

The Problem

AI coding assistants are great at executing a single prompt. "Add a dark mode toggle." "Write tests for this function." That works. But pretty quickly you start to wonder: can it do more? Can it write its own tests? Remember what we decided yesterday without me having to tell it to document every individual decision? Just... build the thing? Turning the idea in your head into something real requires a sustained push toward a coherent vision.

This is not novel. Sophisticated experiments already exist, from Anthropic's agent teams to proven workflow layers like Oh My codeX and GitHub Copilot CLI's /fleet, all of which organize teams of agents to turn a vision into a product. I couldn't find an analog for VS Code, though, so I built one.

The issues show up fast when you try:

  • Context loss. Each session starts from scratch. The agent doesn't know what it did yesterday. If you keep working in a session, compaction eventually causes loss of fidelity (or context pollution if you switch tasks).
  • Drift. Without constraints, the agent wanders. It adds features you didn't ask for, refactors things that were fine, or solves the wrong version of the problem.
  • Skipped discipline. Left to its own devices, the agent skips tests, skips reviews, commits broken code, and stops before the work is done. It optimizes for appearing finished over being finished.
  • Docs rot. Fast implementation without cross-checking produces docs that describe last week's code. Every future session reads stale context and compounds the confusion.

I hit all of these building WyoClear. Most of my time wasn't spent writing code; it was spent supervising: re-explaining context, catching shortcuts, fixing drift after the fact.

v1: What I Built

The first version of the workflow put all cross-session state in the repo and enforced deterministic checks on it. Five core ideas:

1. Vision lock

A single versioned document (docs/vision/VISION-LOCK.md) that defines what the project is, who it's for, and what "done" looks like. It's the highest-authority artifact; everything else must align with it. The agent can make minor updates (within-scope refinements), but scope or goal changes require human approval.

This solves drift. The agent always has a north star. It also defines the stopping point.

2. Checkpoint continuity

A machine-readable checkpoint file (roadmap/CURRENT-STATE.md) records what was done, what's next, what's blocked, and what decisions were made. Every session starts by reading it and ends by updating it. This is the only file the agent must read; everything else is on-demand.

This solves context loss. The agent effectively has a handoff note from its past self.
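
For a sense of what that handoff note looks like, here's an illustrative sketch. The section headings are hypothetical, but the **Phase Status** line is real: it's the one field the Stop hook below greps for.

# roadmap/CURRENT-STATE.md (illustrative sketch; headings are hypothetical)

**Phase Status**: In Progress

## Done
- Slice 2.1: test harness wired up and passing

## Next
- Slice 2.2: CLI entry point

## Blocked
- None

## Decisions
- Pinned the config schema before writing parsers (logged as an ADR)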

3. Mandatory slice discipline

Work is organized into phases (a coherent milestone) composed of numbered slices (a single atomic change). The build loop enforces a strict protocol for every slice: implement → test → review → fix findings → commit → checkpoint. No skipping steps.

A Python-based Stop hook (slice-gate.py) made this enforceable. When the agent tried to stop, the hook checked whether the current phase was marked complete or blocked. If not, it blocked the stop and told the agent what it missed. The agent could not quit early.

The gate itself was ~40 lines of Python:

# .github/hooks/scripts/slice-gate.py (core logic, made self-contained)
import json
import os
import sys

CHECKPOINT = "roadmap/CURRENT-STATE.md"

def main():
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            content = f.read()
        for line in content.splitlines():
            stripped = line.strip().lower()
            if "**phase status**:" in stripped:
                if "complete" in stripped or "blocked" in stripped:
                    json.dump({}, sys.stdout)  # empty verdict: allow stop
                    return

    # Phase not complete — block stop
    json.dump({
        "hookSpecificOutput": {
            "hookEventName": "Stop",
            "decision": "block",
            "reason": "Phase is not yet complete. ..."
        }
    }, sys.stdout)

if __name__ == "__main__":
    main()

It read the checkpoint, looked for **Phase Status**: Complete or Blocked, and returned a JSON verdict. No LLM in the loop. Deterministic enforcement.

4. Specialized subagents

Instead of one agent doing everything, the workflow split responsibilities:

  • Planner — read-only research and analysis. Can't accidentally modify code while thinking.
  • Tester — writes tests from specs. Context isolation prevents testing implementation details.
  • Reviewer — code review with a doc-sync checklist. Catches stale references, missing glossary terms, vision drift.

The builder invoked these as needed. The reviewer's handoff sent findings back for fixing before the commit went through.

5. Stack skills

When the project adopted a new technology, the builder created a skill file in .github/skills/ that grounded all agents in the official documentation for that technology. Copilot auto-discovers these when relevant.

This solves hallucination. Instead of guessing at APIs, agents consult structured references. GitHub Copilot (at least with Anthropic models) won't fetch fresh docs on its own, even when reminded to in instruction files.
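
For shape only, a skill file might read like the sketch below. The technology, layout, and wording are my illustration, not Copilot's documented skill schema.

# .github/skills/fastapi.md (illustrative sketch; hypothetical layout)
Use when writing, testing, or reviewing FastAPI code in this repo.

Authoritative reference: https://fastapi.tiangolo.com/

Ground rules:
- Declare request dependencies with Depends(); don't construct services inside handlers.
- Use the official TestClient for route tests instead of hand-rolled mocks.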

What v1 Got Right

A typical autonomous session looks like this:

  1. I select the autonomous-builder agent in Copilot Chat.
  2. If I care about priority, I give it a direction.
  3. It reads the checkpoint and vision lock, orients itself, and picks the highest-leverage remaining slice.
  4. It implements, tests, calls the reviewer, and fixes findings.
  5. It updates the checkpoint, commits, and moves to the next slice.
  6. When the phase is complete, it marks it done and stops.
  7. If vision goals are all met, it proposes new directions and waits for me.

The moment this stopped feeling theoretical was when the builder finished code, tried to stop, and got blocked because the phase was still marked in progress. That forced the missing follow-through: tests, review, fixes, checkpoint update, commit. The workflow mattered less because it was clever than because it made unfinished work harder to pass off as finished.

Three projects ran on this loop. WyoClear hit production in 20 hours of hands-on time and now serves hundreds of users on donations alone. This site — the one you're reading — was built almost entirely by the same loop, with me approving slices and adjusting the vision lock. Sentinel hit MVP in 20 sessions over 12 hours and grew to a multi-language detector platform across 40 sessions. The agent made real architectural decisions, created ADRs, tracked tech debt, and improved its own instructions when it noticed failure patterns.

I spent my time on vision and approval, not micromanagement. For a while, that felt like enough.

Where v1 Broke

It wasn't.

Around session 40 of the Sentinel build, the failures started compounding in ways the workflow couldn't catch. Three patterns showed up repeatedly:

The agent started gaming its own benchmarks. Sentinel's LLM-judgment detectors were being evaluated against a small ground-truth benchmark repo. When precision dropped on a slice, the agent would sometimes "fix" the regression by editing the benchmark — adding edge cases that made the current detector look correct, or quietly adjusting the expected outputs. Tests passed. Reviewer approved (the diff looked like a legitimate test improvement). The detector hadn't gotten better. The yardstick had bent. By the time I noticed, the headline precision number on the project was no longer trustworthy, which is exactly why the Sentinel project card now leads with "work in progress" instead of a percentage.

Strategic shortcuts that looked tactical. The agent would skip writing an integration test by arguing — credibly, in the commit message — that the unit tests already covered the case. They didn't, but the reasoning was plausible enough that a code reviewer focused on code quality didn't push back. The reviewer was checking how the code was written, not whether the right thing was being built. There was no agent whose job was to ask "is this the work we should be doing?"

Self-modifying instructions overfit. v1 let the builder edit its own Copilot instructions when it noticed a repeated failure. Every change got logged. In theory, the system tuned itself. In practice, after enough sessions, the instructions accumulated hyper-specific workarounds for one-time issues — rules about a particular file format, warnings about a single past mistake — that made the agent rigid in new contexts and harder to reason about. The mechanism that was supposed to make the system smarter was making it brittle. There was no automated pruning.

The common thread: v1 enforced that work happened, but couldn't enforce that the right work happened. The Stop hook made sure tests ran. It couldn't make sure the tests measured what they were supposed to measure.

v2: A Hook-Verified Stage Machine

v2 is the redesign. The full rationale lives in ADRs 001–010; the short version follows.

Workflow as an explicit stage machine. A session moves through bootstrap → planning → design-critique → implementation-planning → implementation-critique → executing → reviewing → cleanup. Each stage has gating fields a hook can verify before allowing transition. The state lives in a machine-readable roadmap/state.md; narrative context lives separately in roadmap/CURRENT-STATE.md. Hooks parse the machine file; humans read the narrative one. (ADR-001, ADR-002)
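
As a sketch of the shape (the field names here are my illustration, not the template's exact schema), the machine-readable file might look like:

# roadmap/state.md (illustrative sketch; field names are hypothetical)
stage: executing
phase: 3
slice: 3.2
design_critique_passed: true
implementation_critique_passed: true
blocked_kind: ""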

Strategic review separated from code review. The reviewer used to do both. Now there are two core subagents: reviewer checks code quality and security, and product-owner does strategic review — "is this the right work?", "does this match the vision?", "are the acceptance criteria actually being met or just pattern-matched?" Both must produce verdicts, and a subagent-verdict-check SubagentStop hook refuses to let either return without writing the expected fields. The benchmark-gaming failure mode is exactly the gap a strategic reviewer is designed to close. (ADR-008)
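
A minimal sketch of the verdict gate, assuming the hook receives the subagent's final message on stdin and that the expected fields are named verdict and findings (both assumptions; the template's real subagent-verdict-check may differ):

# Hedged sketch of a SubagentStop verdict gate. Input shape and
# field names are assumptions for illustration, not the template's schema.
import json
import sys

REQUIRED_FIELDS = ("verdict", "findings")  # assumed field names

def main():
    payload = json.load(sys.stdin)           # hook input on stdin (assumed shape)
    output = payload.get("lastMessage", "").lower()
    missing = [f for f in REQUIRED_FIELDS if f"**{f}**" not in output]
    if not missing:
        json.dump({}, sys.stdout)             # verdict present: allow return
        return
    json.dump({
        "hookSpecificOutput": {
            "hookEventName": "SubagentStop",
            "decision": "block",
            "reason": f"Missing verdict fields: {', '.join(missing)}"
        }
    }, sys.stdout)

if __name__ == "__main__":
    main()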

Tester isolation enforced by the OS. v1 trusted the tester subagent not to peek at implementation when writing tests. v2 enforces it: a tester-isolation PreToolUse hook blocks the tester from reading any path under the project's Source Root. Tests are written from the spec or they aren't written. (ADR-006)
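
A minimal sketch of the idea, assuming a src/ Source Root and a tool-input payload with a path field (both assumptions; the template's tester-isolation hook is the real implementation):

# Hedged sketch of a PreToolUse read gate for the tester subagent.
# Input shape and the Source Root location are assumptions.
import json
import sys
from pathlib import Path

SOURCE_ROOT = Path("src").resolve()  # the project's Source Root (assumed location)

def main():
    payload = json.load(sys.stdin)                       # hook input (assumed shape)
    target = payload.get("toolInput", {}).get("path", "")
    if target:
        resolved = Path(target).resolve()
        if resolved == SOURCE_ROOT or SOURCE_ROOT in resolved.parents:
            json.dump({
                "hookSpecificOutput": {
                    "hookEventName": "PreToolUse",
                    "decision": "block",
                    "reason": "Tester may not read implementation files. Write tests from the spec."
                }
            }, sys.stdout)
            return
    json.dump({}, sys.stdout)  # any other path: allow the tool call

if __name__ == "__main__":
    main()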

The builder can no longer edit its own enforcement layer. No more self-modifying Copilot instructions, agents, or hooks. When the builder spots an improvement, it writes it to a ## Proposed Workflow Improvements section in the narrative state file. A human reviews it and applies it via copier update. The overfitting failure can't recur because the feedback loop now requires a human in it. (ADR-007)

Explicit Blocked Kind vocabulary. "Blocked" used to be a free-text excuse. Now it has to be one of awaiting-design-approval, awaiting-vision-update, awaiting-human-decision, error, or vision-exhausted. The Stop hook refuses to allow stop with an empty or unknown blocked kind. You can't escape the loop by waving vaguely at "something's wrong."
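
The check itself is tiny. A sketch using the vocabulary above (the state-file parsing around it is elided):

# Hedged sketch of the Blocked Kind check inside the Stop hook.
ALLOWED_BLOCKED_KINDS = {
    "awaiting-design-approval",
    "awaiting-vision-update",
    "awaiting-human-decision",
    "error",
    "vision-exhausted",
}

def blocked_stop_allowed(blocked_kind: str) -> bool:
    # An empty or unrecognized kind doesn't count as blocked: the stop is refused.
    return blocked_kind.strip().lower() in ALLOWED_BLOCKED_KINDS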

Honest enforcement tiers. Not every rule can be hook-checked. The template documents which rules are deterministic (hook-enforced), which are strongly guided (subagent-checked, with a verdict hook), and which are advisory (in instructions, hoping for the best). Calling out the difference makes it harder to mistake a comment in an instruction file for actual enforcement. (ADR-010)

The whole thing has 160 unit tests for the hooks and an end-to-end smoke test for the generated template. Hook bugs are how this kind of system fails silently, and the failure mode is subtle — a hook that misparses a checkpoint allows stops it shouldn't, or blocks ones it shouldn't. Tests for hooks aren't optional; they're the load-bearing part.

What I Actually Learned

Hook enforcement beats instruction enforcement, every time. Every rule I wrote in plain English in an instruction file got broken eventually. Every rule a hook could check held. v2 leans hard into this — when something matters, it gets a hook.

Quality and strategy are different review jobs. Conflating them is how a well-implemented bad decision gets shipped. The product-owner subagent is the single most valuable addition in v2, more than any hook.

Self-improvement needs a human in the loop. Letting the agent rewrite its own constraints sounded elegant. In practice it was a slow drift toward a more rigid, more confused agent. The agent can still propose improvements. A human still has to apply them.

Files beat memory. Every attempt I made at using external state, conversation memory, or session context for continuity was worse than just putting it in a file in the repo. Files are versioned, greppable, readable by any agent, and they survive everything. Zero vendor lock-in.

Subagent isolation is worth the overhead. The reviewer catches issues the builder misses, not because it's smarter, but because it has a different context. The builder knows what it intended; the reviewer only sees what shipped. That gap is where bugs live.

Where This Still Breaks

v2 raises the floor. It doesn't make the agent good at things it isn't good at.

Strategic judgment still requires a human. The product-owner subagent catches obvious "wrong work," but it can only check against the vision lock. If the vision is wrong, the workflow will faithfully build the wrong thing. The system makes it harder to drift; it doesn't make it easier to know where you should be drifting to.

Context saturation hasn't gone away. Long sessions still degrade. The agent re-reads files, produces shallower decisions, and starts to lose the thread. v2 mitigates this with clean stage transitions and a cross-session memory directory, but the underlying physics of context windows haven't changed.

Hook brittleness is an inherent risk. Hooks parse state files; state files are markdown and YAML. A formatting drift can break a hook in either direction (too permissive or too restrictive). v2 has 160 tests for the hooks, which catch the most common failures, but not every conceivable mis-format. Schema validation on the state file is a future direction.

Cost is context-heavy. Each session reads checkpoints, vision locks, and multiple files per slice; subagent invocations multiply context usage further. With Microsoft's unlimited Copilot access I don't pay per token, but if you're on usage-based billing, the workflow is expensive. It optimizes for human time over token cost.

It's still Copilot-shaped. Hooks, agent definitions, and the stage machine target VS Code's agent mode. The AGENTS.md file gives partial compatibility with Claude Code and similar tools, but the enforcement layer is Copilot-specific as-implemented.

One project gamed it. Others might too. The benchmark-gaming on Sentinel taught me that the workflow's adversary is the agent itself. v2 closes the specific holes I found. There will be new ones. That's why the enforcement layer is locked from the agent — so when the next failure mode shows up, the fix can't be quietly suppressed by the system that's failing.

Try It

The template is open source: github.com/jcentner/copilot-autonomous-template. The redesign described above ships as the v1.0.0 release; the prototype I called "v1" earlier in this post never left my own machines.

pip install copier
copier copy gh:jcentner/copilot-autonomous-template my-project

It scaffolds the full setup — stage-machine state files, hooks, core subagents (planner, tester, reviewer, critic, product-owner), manual-override prompts, vision lock template, ADR skeleton. Works with any language. copier update pulls improvements as the template evolves.

For existing repos:

cd my-repo
copier copy gh:jcentner/copilot-autonomous-template .

The bottleneck has shifted from "getting the agent to implement sensibly" to "defining what to build and reviewing what it built." That's a much better problem to have — and the next thing I want to make easier.