
How I Turned GitHub Copilot Into an Autonomous Builder

The VS Code workflow I use to keep GitHub Copilot building toward a stable vision across sessions, with enforced testing, review, and checkpoint continuity.

I wanted GitHub Copilot to behave more autonomously and stay on track over long sessions, and I wanted to experiment with the new Preview Autopilot feature. Inside VS Code's agent mode, I built a workflow that keeps project state in the repo, enforces test-review-commit discipline, and blocks the agent from quitting early.

That setup built Sentinel across 40+ sessions without hand-written implementation code: 18 detectors spanning 5 languages, 1,300+ tests, a web UI, and a CI pipeline. I've since used the same system to ship other projects faster and with less supervision.

Here's the workflow, which parts are Copilot-specific, and where the whole thing breaks.

TL;DR: Put the project's memory in files, pin the target in a vision lock, track handoff state in a checkpoint, and force quality gates with hooks before the agent can stop. I built this in GitHub Copilot's VS Code agent mode, but the core pattern is portable. The result is a builder that can resume work across sessions without constant supervision. Open-source template: copilot-autonomous-template.

The Problem

AI coding assistants are great at executing a single prompt. "Add a dark mode toggle" or "write tests for this function" can be productive and impressive, but they'll only get you so far. Turning the idea in your head into something real, as you imagined it, requires a sustained push toward a coherent vision over longer periods of time.

This is not novel. Sophisticated experiments are out there, like Anthropic's agent teams and proven workflow layers like Oh My codeX or GitHub Copilot CLI's /fleet, which organize teams of agents to turn a vision into a product. I couldn't find an analog for VS Code, though, so I thought I might learn a thing or two by trying to build one.

The issues show up fast when you try:

  • Context loss. Each session starts from scratch. The agent doesn't know what it did yesterday. If you keep working in a session, compaction eventually causes loss of fidelity (or context pollution if you switch tasks).
  • Drift. Without constraints, the agent wanders. It adds features you didn't ask for, refactors things that were fine, or solves the wrong version of the problem.
  • Skipped discipline. Left to its own devices, the agent skips tests, skips reviews, commits broken code, and stops before the work is done. It optimizes for appearing finished over being finished.
  • Docs rot. Fast implementation without cross-checking produces docs that describe last week's code. Every future session reads stale context and compounds the confusion.

I hit all of these building WyoClear. Most of my time wasn't writing code, it was supervising. Re-explaining context, catching shortcuts, fixing drift after the fact.

What Actually Happens

A typical autonomous session looks like this:

  1. I select the autonomous-builder agent in Copilot Chat.
  2. If I care about priority, I give it a direction.
  3. It reads the checkpoint and vision lock, orients itself, and picks the next-highest-leverage slice.
  4. It implements, tests, calls the reviewer, and fixes findings.
  5. It updates the checkpoint, commits, and moves to the next slice.
  6. When the phase is complete, it marks it done and stops.
  7. If vision goals are all met, it proposes new directions and waits for me.

Session length varies. Some sessions knock out 3 slices, some do 15. The agent manages its own context. When it gets saturated, it wraps the current slice and stops cleanly instead of producing garbage.
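The session shape above can be sketched as a loop. This is an illustrative sketch only: every name here (Checkpoint, run_session, the slice names) is a hypothetical stand-in, since the real template drives this loop through agent instructions and hooks, not application code.

```python
# Illustrative sketch of the per-session slice loop. All names are
# hypothetical stand-ins, not the template's actual API.

class Checkpoint:
    """Minimal stand-in for roadmap/CURRENT-STATE.md."""
    def __init__(self, pending):
        self.pending = list(pending)   # slices not yet done
        self.done = []

    def next_slice(self):
        return self.pending[0] if self.pending else None

    def update(self, slice_name):
        self.pending.remove(slice_name)
        self.done.append(slice_name)

def run_session(checkpoint, max_slices=15):
    """One session: orient, then implement/test/review/commit per slice."""
    completed = []
    for _ in range(max_slices):
        slice_name = checkpoint.next_slice()   # orient from the handoff note
        if slice_name is None:                 # phase complete: stop cleanly
            break
        # implement, test, review, and fix findings would happen here
        checkpoint.update(slice_name)          # handoff for the next session
        completed.append(slice_name)
    return completed

cp = Checkpoint(["landing page", "dark mode", "CLI entry point"])
print(run_session(cp))  # -> ['landing page', 'dark mode', 'CLI entry point']
```

The max_slices cap mirrors the real behavior: the agent wraps the current slice and stops rather than degrading, and the checkpoint carries the rest forward.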

The moment this stopped feeling theoretical was when the builder finished code, tried to stop, and got blocked because the phase was still marked in progress. That forced the missing follow-through: tests, review, fixes, checkpoint update, commit. The workflow mattered less because it was clever than because it made unfinished work harder to pass off as finished.

Sentinel hit MVP in about 20 sessions over 12 hours and the full project ran past 40 sessions overall. The agent made real architectural decisions, created ADRs, tracked tech debt, and improved its own instructions when it noticed failure patterns. I spent my time on vision and approval, not supervision and micromanagement.

Why It Works

The workflow I built puts all cross-session state into the repo itself, and then enforces deterministic checks on it.

There are five core ideas:

1. Vision lock

A single versioned document (docs/vision/VISION-LOCK.md) that defines what the project is, who it's for, and what "done" looks like. It's the highest-authority artifact — everything else must align with it. The agent can make minor updates (within-scope refinements), but scope or goal changes require human approval.

This solves drift. The agent always has a north star. It also defines the stopping point.
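For illustration, a vision lock might look something like this. The project and metrics below are invented for the example; the template's actual format may differ.

```markdown
# VISION-LOCK.md (v1.2)            <!-- hypothetical example -->

## What this is
A static-analysis service that flags risky dependency updates.

## Who it's for
Solo maintainers who can't review every transitive bump.

## Definition of done
- Detects the 10 highest-impact risk patterns
- CI-ready CLI with JSON output
- Scans a 100k-LOC repo in under 30 seconds

## Change policy
Agent may refine wording within scope (minor version bump).
Scope or goal changes require human approval (major bump).
```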

2. Checkpoint continuity

A machine-readable checkpoint file (roadmap/CURRENT-STATE.md) records what was done, what's next, what's blocked, and what decisions were made. Every session starts by reading it and ends by updating it. This is the only file the agent must read — everything else is on-demand.

This solves context loss. The agent effectively has a handoff note from its past self.
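As a concrete (hypothetical) illustration, a checkpoint might carry fields like these; the one the Stop hook actually depends on is Phase Status.

```markdown
# CURRENT-STATE.md                 <!-- hypothetical example -->

**Phase**: 3 (Web UI)
**Phase Status**: In Progress      <!-- the Stop hook checks this line -->
**Last Slice**: 3.4 (results table pagination)

## Next
- 3.5: export to CSV

## Blocked
- None

## Decisions
- Chose SQLite over Postgres for single-binary deploys (ADR-7)
```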

3. Mandatory slice discipline

Work is organized into phases (a coherent milestone, like "ship the landing page" or "add the CLI") composed of numbered slices (a single atomic change — one feature, one fix, one refactor). The build loop enforces a strict protocol for every slice: implement → test → review → fix findings → commit → checkpoint. No skipping steps.

A Python-based Stop hook (slice-gate.py) makes this enforceable. When the agent tries to stop, the hook checks whether the current phase is marked complete or blocked. If not, it blocks the stop and tells the agent what it missed. The agent literally cannot quit early.

And the gate itself is ~40 lines of Python:

# .github/hooks/scripts/slice-gate.py (core logic)
import json
import os
import sys

def main():
    checkpoint = "roadmap/CURRENT-STATE.md"
    if os.path.exists(checkpoint):
        with open(checkpoint) as f:
            content = f.read()
        for line in content.splitlines():
            stripped = line.strip().lower()
            if "**phase status**:" in stripped:
                if "complete" in stripped or "blocked" in stripped:
                    json.dump({}, sys.stdout)  # allow stop
                    return

    # Phase not complete — block stop
    json.dump({
        "hookSpecificOutput": {
            "hookEventName": "Stop",
            "decision": "block",
            "reason": "Phase is not yet complete. ..."
        }
    }, sys.stdout)

main()

It reads roadmap/CURRENT-STATE.md, looks for **Phase Status**: Complete or Blocked, and returns a JSON verdict. That's it. No LLM in the loop — deterministic enforcement.

This solves skipped discipline. The enforcement is deterministic, not aspirational.

4. Specialized subagents

Instead of one agent doing everything, the workflow splits responsibilities:

  • Planner — research and analysis with read-only tools. Can't accidentally modify code while thinking.
  • Tester — writes tests from specs before seeing implementation. Context isolation prevents testing implementation details.
  • Reviewer — code review with a doc-sync checklist. Catches stale references, missing glossary terms, vision drift.

The builder invokes these as needed. The reviewer's handoff sends findings back for fixing before the commit goes through.

This solves quality. Fresh eyes (even simulated ones) catch what the implementer misses; the benefit comes less from a smarter reviewer than from a separate context.

5. Stack skills

When the project adopts a new technology, the builder creates a skill file in .github/skills/ that grounds all agents in the official documentation for that technology. Copilot auto-discovers these when relevant.

This solves hallucination. Instead of guessing at APIs, agents consult structured references. GitHub Copilot (at least with Anthropic models) is really bad at fetching fresh docs on its own, even when reminded to in instruction files.

What I Learned

The Stop hook is the single highest-leverage piece. Without it, the agent stops the moment anything gets hard. With it, the agent pushes through, because it has no choice. It completely changed the dynamic from "assistant waiting for instructions" to "builder iterating toward a goal."

Vision lock > detailed specs. I tried writing detailed implementation plans early on, or having a subagent write them. The agent followed them rigidly and produced worse results than when I gave it a clear vision and let it plan and adjust the implementation dynamically. Constraints on what and why are high-leverage. Constraints on how mostly just prevent the agent from adapting to what it finds. Less planning overhead, better results.

Subagent isolation is worth the overhead. The reviewer catches real issues, not just style nits: stale docs, missing test coverage, security concerns. The key is that it has a different context than the builder. The builder knows what it intended; the reviewer only sees what shipped. That gap is where bugs live. In practice, this catches issues that would otherwise cost hours or days to find in production.

Self-improving instructions compound. The builder can modify its own Copilot instructions when it notices a repeated failure. Every change gets logged. Over time, the instructions get sharper — not because I tuned them, but because the system tuned itself against real failure modes. The system gets better without manual intervention, though I still prune periodically.

Files beat memory. Every attempt I made at using external state, conversation memory, or session context for continuity was worse than just putting it in a file in the repo. Files are versioned, greppable, readable by any agent, and they survive everything. Zero vendor lock-in; works with any tool that can read a filesystem.

Where This Breaks

This workflow works well for projects with a defined target. Here's where it doesn't.

Context saturation. Long sessions degrade. The agent starts re-reading files it already read, producing truncated outputs, and making increasingly shallow decisions. The workflow mitigates this with checkpointing and a /memories/repo/ directory for cross-session patterns, but these capture what happened and what was learned, not what the agent was thinking. When a session restarts or context is compacted, the agent re-orients quickly but loses the deep working context.

The Stop hook is brittle. It parses markdown with string matching — looking for **Phase Status**: Complete or Blocked in the checkpoint file. A formatting change (different heading level, a typo in "Phase Status") breaks the pattern match. But the failure mode is the opposite of what you'd expect: when the pattern doesn't match, the hook blocks stopping rather than allowing it. So a format break makes the agent unable to stop cleanly, not permissive. It's annoying rather than dangerous, but it's still a reliability issue. A schema-validated checkpoint format would be more robust.
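One way to harden this is to move the machine-readable state out of markdown into a small JSON sidecar and validate it before deciding. This is a sketch under assumed names (the sidecar file and phase_status field are hypothetical, not part of the template); the verdict shape matches the hook output shown earlier.

```python
# Sketch of a schema-validated gate: read phase status from a JSON
# sidecar instead of string-matching markdown. Field names are
# hypothetical, not the template's actual format.
import json

ALLOWED = {"in_progress", "complete", "blocked"}

def gate_decision(raw: str) -> dict:
    """Return {} to allow the stop, or a block verdict with a reason."""
    try:
        state = json.loads(raw)
        status = state["phase_status"]
        if status not in ALLOWED:
            raise ValueError(f"unknown status: {status!r}")
    except (ValueError, KeyError) as e:
        # Malformed state fails loudly with a specific reason,
        # instead of silently blocking on a formatting change.
        return {"hookSpecificOutput": {
            "hookEventName": "Stop",
            "decision": "block",
            "reason": f"Checkpoint unreadable: {e}"}}
    if status in ("complete", "blocked"):
        return {}  # allow stop
    return {"hookSpecificOutput": {
        "hookEventName": "Stop",
        "decision": "block",
        "reason": "Phase is still in progress."}}

print(gate_decision('{"phase_status": "complete"}'))  # {} -> allow
```

The behavior is the same on the happy path, but a malformed checkpoint now produces a verdict that names the problem rather than a generic "not complete" block.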

Self-improving instructions can overfit. The builder modifies its own Copilot instructions when it notices repeated failures — every change gets logged in an improvement log. The mechanism is real and valuable, but there's also a risk: over time, instructions can accumulate hyper-specific workarounds for one-time issues that make the agent rigid in new contexts. There's no automated pruning — the system doesn't know which of its own rules are still relevant, so periodic human review is needed. Automated pruning would be a natural addition.

Vision lock doesn't suit exploratory work. The framework assumes you know what "done" looks like before you start. It has escape hatches — minor version bumps for within-scope adjustments, and a vision expansion mode where the agent proposes new directions when goals are met. But for research projects or prototypes where the goal shifts weekly, every scope change still requires human approval. That's friction by design (it prevents drift), but it makes the framework a poor fit for truly open-ended exploration.

It requires Copilot's agent mode. The workflow depends on VS Code's agent mode for tool use, file editing, and subagent invocation. It doesn't generalize cleanly to chat-only or API-only setups. The AGENTS.md file provides partial compatibility with Claude Code and similar tools, but the Stop hook and subagent orchestration are Copilot-specific as-implemented.

Cost is context-heavy. Each session reads checkpoints, vision locks, and multiple files per slice; subagent invocations multiply context usage further. With Microsoft's unlimited Copilot access, I don't pay per token. Without that kind of arrangement, dozens of deep implementation sessions add up. The workflow optimizes for human time; whether the token economics work for you depends on your access model.

Try It

I extracted the workflow into a copier template:

pip install copier
copier copy gh:jcentner/copilot-autonomous-template my-project

It scaffolds the full setup — agents, subagents, Stop hook, skills scaffold, vision lock template, checkpoint structure, manual override prompts, and a prompt guide. Works with any language. copier update pulls improvements as the template evolves.

For existing repos:

cd my-repo
copier copy gh:jcentner/copilot-autonomous-template .

The template is open source: github.com/jcentner/copilot-autonomous-template

It works with VS Code's Autopilot mode and experimentally with Copilot CLI for background execution (worktree-isolated, sessions continue when VS Code closes). The AGENTS.md file also makes it compatible with Claude Code and similar tools.

What's Next

I've used this workflow to ship multiple projects now, each one faster than the last. Sentinel was the first. This site was the second. Each project feeds improvements back into the template.

The bottleneck has shifted from "getting the agent to implement sensibly" to "defining what to build and reviewing what it built." That's a much better problem to have.