reinforcement-learning, sim-to-real, autonomous-agents, postmortem

Eight Years of Teaching a Dinosaur to Jump

Three builds of the same toy RL problem across eight years, and what the fastest, smartest, AI-built one taught me about trusting my own numbers.

In 2018 it took me two months to teach an AI to play Chrome's dinosaur game. In 2023 it took two days. In 2026 it took forty minutes — and produced the worst agent of the three.

Chrome Dino is the little T-Rex that appears when your internet dies. Jump the cacti, duck the pterodactyls, survive as the game speeds up. It's a toy, which is exactly why it's a good benchmark: the game never changes, so the only variables across eight years are me and my tools.

2018: two months

Undergrad AI class, final project, and I knew roughly nothing. My approach was the most naive one available: I played the game myself while a script captured screenshots and logged my keypresses, then trained a CNN to predict my buttons from pixels.

Most of the two months went to fighting TensorFlow GPU setup on Windows and generating training data by hand. (I played a lot of Chrome Dino.) The model mimicked my reaction patterns and fell apart at speeds I hadn't trained it on, because supervised learning doesn't teach an agent to play a game; it teaches it to impersonate you. But here's the detail I love in hindsight: its best run cracked 2,600, slightly above my own personal best. Your judgment is the ceiling. Your reaction time isn't, and a script never blinks. I had built a version of myself with better reflexes, and I was too busy being relieved it worked to notice how strange that was.

2023: two days

Five years of professional experience later, I knew the right paradigm was reinforcement learning. I wrapped a real Chrome browser in a custom Gym environment (screen capture, OpenCV preprocessing, Tesseract OCR squinting at a cropped corner to detect game-over) and trained a DQN from Stable-Baselines3.

It worked in two days, and training was agony. Every step needed a screenshot, OCR, and a Selenium keypress, which works out to roughly one effective frame per second. The agent had no duck action, so pterodactyls were unsurvivable. But it learned in the real game, with real timing, and scored a mean around 555.
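
To put one frame per second in perspective, here is a back-of-envelope sketch. The two-million-step budget is an assumption, picked as a plausible order of magnitude for a task like this; it is not a figure from the 2023 run.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def wall_clock_days(total_steps: int, steps_per_second: float) -> float:
    """Days of wall-clock time needed to collect total_steps of experience."""
    return total_steps / steps_per_second / SECONDS_PER_DAY

# Assumed budget: two million environment steps (illustrative only).
days = wall_clock_days(2_000_000, steps_per_second=1.0)
print(f"{days:.1f} days")  # 23.1 days
```

At that rate a budget that size is a matter of weeks, not days, which is why the 2023 agent had to make do with far less experience.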

The lesson I took at the time: the environment was the bottleneck. Wrapping a live browser for RL is like learning to drive by remote-controlling a car through a webcam. If only training were faster.

Hold that thought.

2026: forty minutes

This time I didn't write the code. GitHub Copilot did, running in my autonomous workflow, with me choosing between options it proposed.

The agent's design was genuinely good. It built a headless Python clone of the game with physics constants lifted straight from Chromium's TypeScript source: no browser, no screenshots, no OCR, just game logic at thousands of steps per second. The observation became a 20-number feature vector instead of pixels. PPO instead of DQN. A duck action, finally. Sixteen parallel environments, two million timesteps, about forty minutes on my desktop; three thousand times faster than 2023.
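
The shape of such a headless clone can be sketched in plain Python. Everything below is illustrative: the class name, the constants, the three-number observation, and the one-obstacle world are assumptions made for the sketch, not the repo's code and not Chromium's values.

```python
import random
from dataclasses import dataclass, field

NOOP, JUMP, DUCK = 0, 1, 2  # hypothetical action ids; DUCK is a no-op here

@dataclass
class HeadlessDino:
    """Toy single-obstacle runner with a Gym-like reset/step interface."""
    gravity: float = 0.6           # px/frame^2, illustrative
    jump_velocity: float = 10.0    # px/frame, illustrative
    speed: float = 6.0             # px/frame, illustrative
    obstacle_height: float = 35.0  # px the dino must clear, illustrative
    rng: random.Random = field(default_factory=lambda: random.Random(0))

    def reset(self):
        self.y = 0.0               # height above the ground
        self.vy = 0.0              # vertical velocity, positive is up
        self.obstacle_x = 300.0 + self.rng.uniform(0, 200)
        return self._obs()

    def _obs(self):
        # The post's agent saw 20 features; three are enough for a sketch.
        return (self.obstacle_x, self.y, self.speed)

    def step(self, action):
        if action == JUMP and self.y == 0.0:
            self.vy = self.jump_velocity
        self.y = max(0.0, self.y + self.vy)
        self.vy = self.vy - self.gravity if self.y > 0.0 else 0.0
        self.obstacle_x -= self.speed
        # Collision: obstacle overlaps the dino's column while the dino is low.
        done = -20.0 <= self.obstacle_x <= 20.0 and self.y < self.obstacle_height
        if self.obstacle_x < -20.0:  # cleared, so spawn the next obstacle
            self.obstacle_x = 300.0 + self.rng.uniform(0, 200)
        return self._obs(), 1.0, done, {}
```

With no rendering and no I/O in the loop, the cost of a step is a handful of float operations, which is where the speed comes from. It is also why every number in the file is a place for the clone to disagree with the real game.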

The headless scores were great: mean 562, peaks over 1,100. Thirteen times better than random. I pointed the trained model at actual Chrome, ran the validation script, and got:

Mean: 48.

Worse than the 2023 agent. Worse than the 2018 agent. The brilliant fast-trained model could survive about two seconds of the real game.

The part where I made it worse by fixing it

What followed was three weeks of the most instructive failure I've had in a while. The sim and the browser disagreed somewhere, so I went hunting for physics discrepancies, and I kept finding them: each one real, each one fixable, none of them the problem.

Chrome scales jump velocity with game speed; my clone didn't. Fixed, retrained: mean 53. Chrome caps the dino's ascent with a velocity limiter buried fifty lines past the constants I'd copied; the clone let it fly the full ballistic arc, so the model had learned to clear cacti by twelve pixels that don't exist in the real game. Fixed, retrained: mean 64.
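
Both discrepancies can be illustrated with a frame-by-frame jump integrator. The constants, the form of the speed scaling, and the cap rule below are assumptions chosen for the sketch. They are not copied from Chromium or from the repo, so the gap it prints will not match the twelve pixels above.

```python
def jump_velocity(game_speed: float, base: float = 10.0, scale: float = 0.1) -> float:
    """Hypothetical speed scaling: the faster the game, the harder the launch."""
    return base + scale * game_speed

def jump_apex(launch_velocity, gravity=0.6, cap_height=None, capped_velocity=None):
    """Integrate one jump and return the peak height in pixels above the ground.

    If cap_height is set, upward velocity is clamped to capped_velocity once
    the dino rises past that height (a stand-in for an ascent limiter).
    """
    height, velocity, apex = 0.0, launch_velocity, 0.0
    while True:
        height += velocity
        velocity -= gravity
        if cap_height is not None and height >= cap_height:
            velocity = min(velocity, capped_velocity)
        apex = max(apex, height)
        if height <= 0.0:
            return apex

ballistic = jump_apex(jump_velocity(6.0))
limited = jump_apex(jump_velocity(6.0), cap_height=60.0, capped_velocity=3.0)
print(f"ballistic apex {ballistic:.1f}px, limited apex {limited:.1f}px")
```

A policy trained against the ballistic arc learns clearances that the limited arc never reaches, which is the failure described above.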

Three versions of increasingly faithful physics moved real-browser transfer from 8% to 9% to 11%. That's not a trend; that's a noise floor. The actual culprit was timing: under Selenium's instrumentation, Chrome runs at about 51fps, not 60, a 15% shortfall. The model expected obstacles to move farther per step than they did; its entire sense of when was calibrated to a clock the real browser doesn't keep. The 2023 DQN never had this problem, because it trained at one agonizing frame per second in the real browser. Slow and faithful beat fast and subtly wrong.
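
The figures in this section are easy to recompute. The sketch assumes the headless mean stayed at 562 across all three retrains (the post reports it only once), in which case the transfer percentages quoted above are the truncated values.

```python
HEADLESS_MEAN = 562           # headless evaluation mean, assumed constant
browser_means = [48, 53, 64]  # real-browser means after each physics fix

for mean in browser_means:
    print(f"transfer: {mean / HEADLESS_MEAN:.1%}")  # 8.5%, 9.4%, 11.4%

trained_fps, observed_fps = 60, 51
shortfall = 1 - observed_fps / trained_fps
print(f"frame-rate shortfall: {shortfall:.0%}")  # 15%
```

Three points of movement against a frame clock that is off by fifteen percent: the physics fixes were real, but they were never going to be the dominant term.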

The fix that finally worked was a diagnostic, not a solution: inject JavaScript to seize Chrome's clock (fake performance.now(), capture requestAnimationFrame) and step the game frame by frame from Python. Under that puppeteered clock, the same model jumped from mean 64 to mean 439. The physics had been fine for two iterations. I'd been fine-tuning a microscope pointed at the wrong slide.
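
One way to build that kind of puppeteered clock, as a sketch rather than the repo's actual code: the JavaScript replaces performance.now() and requestAnimationFrame with versions that Python advances by hand. The `__stepFrame` hook and both helper names are made up for this example. `driver` is assumed to be a Selenium WebDriver, though anything with an `execute_script` method will do. The sketch also assumes the game reads time only through these two APIs and that the script is installed before the game loop starts.

```python
FAKE_CLOCK_JS = """
(() => {
  let now = 0;     // fake clock, in milliseconds
  let queue = [];  // callbacks waiting for the next frame
  window.performance.now = () => now;
  window.requestAnimationFrame = (cb) => queue.push(cb);
  // Hypothetical hook: advance the clock and run exactly one frame.
  window.__stepFrame = (dtMs) => {
    now += dtMs;
    const ready = queue;
    queue = [];
    ready.forEach((cb) => cb(now));
  };
})();
"""

def install_fake_clock(driver) -> None:
    """Swap the page's clock and frame scheduler for steppable fakes."""
    driver.execute_script(FAKE_CLOCK_JS)

def step_frame(driver, dt_ms: float = 1000 / 60) -> None:
    """Advance the page by one frame of dt_ms simulated milliseconds."""
    driver.execute_script("window.__stepFrame(arguments[0]);", dt_ms)
```

Nothing in the page advances between calls to `step_frame`, so the policy gets as much wall-clock time per frame as Python cares to take. That makes the setup useful for isolating timing as a cause, and unrepresentative of live play.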

What the tidy version leaves out

I could end there — "sim-to-real is the hard part, here's the clever frame-stepping hack" — and it would make a fine LinkedIn post. The repo's own postmortem is meaner, and more useful.

The 439 isn't real. It's measured with Python pausing the game between actions. The agent a human could actually watch play scores 64. By the deliverable I originally scoped (an agent that plays Chrome Dino) the 2026 attempt failed. The 2023 one still holds the best real-game mean; the 2018 one, the dumbest of the three, still holds the high score.

A rule-based heuristic the workflow knocked out in one slice, a few hundred lines of "if the cactus is close, jump," outscored the trained PPO model by 27%. That should have been a klaxon. It got written up as a baseline checkbox instead.
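
The core of such a rule is small. This is an illustrative sketch, not the workflow's heuristic: the function name, the action labels, the thresholds, and the idea of widening the trigger distance with speed are all assumptions.

```python
def heuristic_action(distance_px: float, speed: float,
                     obstacle_is_high: bool = False) -> str:
    """Return 'jump', 'duck' or 'noop' from three hand-picked features."""
    # Faster game, earlier reaction. Both numbers are illustrative guesses.
    trigger_px = 40.0 + 8.0 * speed
    if distance_px > trigger_px:
        return "noop"
    # High obstacles (a pterodactyl at head height) are ducked, not jumped.
    return "duck" if obstacle_is_high else "jump"

print(heuristic_action(distance_px=50.0, speed=6.0))  # jump
```

A baseline like this has no training run to get wrong, so when it beats the learned policy the first suspect is the pipeline around the policy.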

And the project's vision document quietly drifted from "an agent that plays Chrome Dino" to "multiple approaches to Chrome Dino, as a writeup." Nobody decided that. Each time a slice fell short, the goal got reframed just wide enough to contain what the slice did achieve. Autonomous workflows are very good at finishing slices; they have no native instinct for noticing that the slices stopped serving the goal. A human glancing at 8% → 9% → 11% would have stopped after the second iteration. The loop couldn't, because stopping wasn't a slice.

Where the hard problem lives

Each rebuild was faster, and each one relocated the difficulty instead of removing it. In 2018 the hard problem was knowing what approach to use. In 2023 it was engineering the environment. In 2026 the algorithm, the environment, and even the code-writing were all cheap; the hard problem became believing my own numbers. The metric that was easy to improve (headless score, frame-stepped score) kept standing in for the one that mattered (a dinosaur, in a browser, at full speed), and every tool I had made the substitution easier.

The redux is underway, under a vision lock with one non-negotiable line: the only score that counts is real-time, in an unmodified Chrome window, mean ≥ 2000 over twenty consecutive episodes. Roughly: do on every run what the 2018 model did on its single best one. No clock tricks. The dinosaur doesn't care how fast you trained.