Sentinel: What Happens When You Let AI Build Your Code Health Tool
I built a local repo health monitor in 12 hours using autonomous AI agents. Here's what I learned about AI-assisted development, docs drift, and the tools we're missing.
The faster you build with AI, the faster things drift.
I noticed this first on my own projects. Ship a feature in an afternoon, and by the next morning the README describes a CLI flag that got renamed, the tests cover a function signature that changed, and three TODOs have been sitting there for two weeks. AI makes writing code easy. It doesn't make maintaining coherence easy.
Other developers mentioned the same thing. The tools that help you write code faster don't help you notice when your docs, tests, and code start disagreeing with each other.
So I built Sentinel.
What It Does
Sentinel is a local repo health monitor. It scans your codebase with 14 detectors, optionally runs findings through a local LLM judgment layer, deduplicates across runs, and produces a morning report of issues worth reviewing.
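That shape — run detectors, then suppress anything already reported — is easy to picture in code. A minimal sketch, with a hypothetical `Finding` type and fingerprint-based dedup (not Sentinel's actual internals):

```python
from dataclasses import dataclass
from hashlib import sha256

@dataclass(frozen=True)
class Finding:
    detector: str
    path: str
    message: str

    def fingerprint(self) -> str:
        # Stable ID so the same finding isn't re-reported across runs.
        return sha256(f"{self.detector}:{self.path}:{self.message}".encode()).hexdigest()

def run_scan(detectors, repo_path: str, seen: set[str]) -> list[Finding]:
    """Run every detector, then drop findings reported in earlier runs."""
    findings = [f for d in detectors for f in d(repo_path)]
    fresh = [f for f in findings if f.fingerprint() not in seen]
    seen.update(f.fingerprint() for f in fresh)
    return fresh
```

The `seen` set would be persisted between runs; that persistence is what turns a scanner into a morning report of only what's new.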
The detectors cover the usual suspects — linting, TODOs, dependency audits, complexity — but the part I'm most interested in is cross-artifact analysis. Specifically, docs-drift detection: comparing what your documentation says against what your code actually does.
No existing tool does this well. Linters check code. Doc tools check formatting. Nothing checks whether your README still describes reality.
Sentinel does. Its stale-reference detector (which checks if files and paths mentioned in docs actually exist) hit 100% accuracy across 56 findings when I ran it against a real project. The semantic drift detector asks a local LLM: "does this documentation accurately describe this code?" — and even a 4B model can reliably answer that binary question.
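The deterministic half of that check is simple in principle: pull path-looking strings out of the docs and test them against the filesystem. A rough sketch, assuming backticked paths in markdown (the regex and function name are mine, not Sentinel's):

```python
import re
from pathlib import Path

# Heuristic: backticked tokens that end in a short file extension,
# e.g. `src/cli.py` or `config.yml`.
PATH_RE = re.compile(r"`([\w./-]+\.[a-z]{1,4})`")

def stale_references(doc_text: str, repo_root: Path) -> list[str]:
    """Return paths mentioned in docs that don't exist on disk."""
    stale = []
    for match in PATH_RE.finditer(doc_text):
        rel = match.group(1)
        if not (repo_root / rel).exists():
            stale.append(rel)
    return stale
```

The hard part, as the precision discussion below suggests, is everything this naive regex gets wrong — which is where the judgment layer earns its keep.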
How It Was Built
Here's the part that surprised me: Sentinel was built almost entirely by AI agents.
I'd been developing with GitHub Copilot for a while, and one project (TSGBuilder) pushed me into more structured agentic workflows. But I kept hitting limits — the agent would lose context between sessions, skip reviews, accumulate drift in its own project docs. Sound familiar?
So I built a workflow to fix that: a vision lock document that defines success criteria, ordered implementation slices, a reviewer subagent that audits code after each slice, and a checkpoint file that maintains continuity across sessions. The agent reads the checkpoint, implements the next slice, tests it, gets it reviewed, commits, and updates the checkpoint.
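The checkpoint itself can be nothing more than structured state the agent reads at session start and rewrites at session end. A hypothetical minimal version, assuming a JSON file (the actual template uses its own format):

```python
import json
from pathlib import Path

CHECKPOINT = Path("checkpoint.json")

def read_checkpoint() -> dict:
    # First session: start at slice 0 with nothing completed.
    if not CHECKPOINT.exists():
        return {"next_slice": 0, "completed": [], "notes": ""}
    return json.loads(CHECKPOINT.read_text())

def complete_slice(state: dict, slice_name: str, notes: str) -> None:
    # Called after tests pass and the reviewer signs off.
    state["completed"].append(slice_name)
    state["next_slice"] += 1
    state["notes"] = notes
    CHECKPOINT.write_text(json.dumps(state, indent=2))
```

The point is continuity: a fresh agent session can reconstruct where the last one stopped without re-reading the whole repo.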
The MVP took one morning. 15 implementation slices, 129 tests, 3 detectors, a full CLI. By the end of the first day: 5 detectors, 217 tests, GitHub integration, and a persistence scoring system. By day 10: 14 detectors, a web UI, CI pipeline, multi-language support, and 1,000+ tests across ~20 autonomous sessions.
I didn't write most of the code by hand. I wrote the architecture, the constraints, and the acceptance criteria. The agent did the implementation. A separate reviewer agent caught the bugs I would have missed — including an exception-safety issue in the per-detector provider system that would have been a production mystery months later.
What I Actually Learned
Precision matters more than breadth. Early on, the TODO scanner flagged TODOs inside string literals and markdown files. The docs-drift detector matched regex patterns as broken links. Every false positive erodes trust, and trust is the product. Most of my time went into precision engineering — targeted heuristics for each specific failure mode.
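One way to avoid the string-literal trap is to stop pattern-matching raw text and tokenize instead. For Python sources, the standard `tokenize` module distinguishes comments from strings for free — a sketch of that idea, not Sentinel's actual heuristics:

```python
import io
import tokenize

def todo_comments(source: str) -> list[tuple[int, str]]:
    """Flag TODOs only in real comments, never inside string literals."""
    hits = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.COMMENT and "TODO" in tok.string:
            hits.append((tok.start[0], tok.string.strip()))
    return hits
```

A line like `msg = "TODO: not real"` produces a STRING token, not a COMMENT token, so it's skipped without any special-casing.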
LLMs will confidently explain why a date is a missing file. During real-world validation, the LLM judge confirmed 42 out of 42 obvious non-path patterns (dates, CSS values, URLs) as real issues, each with plausible-sounding reasoning. "The referenced file 2026-01-15 appears to be missing from the repository." This is why Sentinel uses deterministic detectors first and the LLM only as a judgment layer for genuinely ambiguous cases.
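The deterministic-first idea is just a cheap filter in front of an expensive, credulous judge. A hypothetical version of such a filter (these patterns are illustrative, not Sentinel's):

```python
import re

# Candidates that look path-like but obviously aren't file references.
NOT_A_PATH = [
    re.compile(r"^\d{4}-\d{2}-\d{2}$"),         # dates like 2026-01-15
    re.compile(r"^\d+(\.\d+)?(px|em|rem|%)$"),  # CSS values like 1.5rem
    re.compile(r"^https?://"),                  # URLs
]

def needs_llm_judgment(candidate: str) -> bool:
    """Only genuinely ambiguous candidates reach the LLM."""
    return not any(p.match(candidate) for p in NOT_A_PATH)
```

Every candidate caught here is one the LLM can no longer confirm with plausible-sounding reasoning.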
The autonomous workflow is the real product. The agent configuration, checkpoint protocol, and reviewer pattern that built Sentinel turned out to be reusable. I extracted it into a copier template so anyone can bootstrap the same workflow. Tell it to start Phase 0, and it synthesizes a vision from your existing repo, creates ADRs, and starts building.
Cross-artifact analysis is an underserved problem. Most of Sentinel's lint-wrapper detectors duplicate what you'd get from running ruff or ESLint directly. The value is in finding things that span boundaries — docs that don't match code, tests that don't cover recent changes, environment configs that reference variables nobody sets anymore. These are the issues that hide in plain sight because no single tool owns the whole picture.
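The environment-config case shows why these checks are cheap once you decide to own both sides of the boundary. A toy version for Python code against a `.env` file — my own sketch, with an intentionally simple regex:

```python
import re

# Matches os.environ["NAME"] and os.getenv("NAME") reads.
GETENV_RE = re.compile(r"os\.(?:environ\[[\"']|getenv\([\"'])([A-Z0-9_]+)")

def unset_env_vars(code: str, env_file: str) -> set[str]:
    """Vars the code reads that no .env entry defines."""
    used = set(GETENV_RE.findall(code))
    defined = {line.split("=", 1)[0].strip()
               for line in env_file.splitlines()
               if "=" in line and not line.lstrip().startswith("#")}
    return used - defined
```

Neither a linter (which only sees the code) nor a config validator (which only sees the `.env`) can report this; the finding exists only in the intersection.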
Try It
Sentinel is open source: github.com/jcentner/sentinel
git clone https://github.com/jcentner/sentinel && cd sentinel
pip install -e ".[detectors]"
sentinel scan /path/to/your/repo
It works without an LLM (you'll get raw findings instead of judged ones). If you have Ollama running locally, it'll use that for the judgment layer.
If you're interested in the autonomous development workflow, the template is at github.com/jcentner/copilot-autonomous-template.