AI Engineering

Eval-Driven Development

When your AI agent ships a black rectangle that passes every test, you need a different kind of eval.

Pikkosoft Team
March 29, 2026
5 min read
# Eval-Driven Development

An AI agent builds a Street View panorama. The code compiles. The API responds 200. Tests pass. The user sees a black rectangle.

Nothing is broken. The Street View API key was never loaded. The map canvas rendered correctly — into nothing. Every structural assertion passed. No assertion about experience was ever written. The black background was never flagged as wrong.

The agent had no mental model of what correct looks like. It had a model of what executes.

---

**What does a passing, correct run look like?**

Not "what does the code do?" Literally: what does the user see on screen when this works? What color is the background? Is there imagery? Does the panel show real data or an empty container?

If you can't answer that before the agent writes a single line of code, the agent can't answer it either.

---

Eval-Driven Development starts from the observable outcome and works backward. The user sees real Street View imagery. That's the bar. Real imagery requires a real API key. Is the key in the environment? Check. Use it. Highest-fidelity outcome.

Goal-directed from the experience backward. Resources checked opportunistically.

---

Two agents. Structurally isolated. They never share context.

**The Inspector** opens the app like a user. Clicks, types, navigates, scrolls. Takes screenshots. Extracts data from the page. Reports facts only — "this element is visible," "this value is 42." It never judges. It never sees what the ground truth says should happen.

**The Evaluator** receives raw observations and a ground truth document. It never runs code. It never sees the inspector's instructions. It looks at screenshots and data and judges each assertion: pass or fail. No benefit of the doubt. No partial credit.

If the inspector knows what "correct" looks like, it will unconsciously steer toward it. If the evaluator knows how the inspector works, it will excuse failures. Isolation isn't optional. It's what makes the eval trustworthy.
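The isolation boundary can be sketched in miniature. Everything below is illustrative: `run_inspector` and `run_evaluator` are hypothetical stand-ins (a real inspector would drive a browser; a real evaluator would judge screenshots with a model), but the shape is the point: only the runner touches both sides, and neither agent sees the other's inputs.

```python
def run_inspector(steps: str) -> dict:
    # Toy stand-in: a real inspector would execute browser steps
    # (click, type, screenshot) and report observed facts only.
    # It receives steps, never expectations.
    return {
        "map_panel_visible": True,
        "result_count": 3,
        "console_errors": 0,
    }

def run_evaluator(observations: dict, ground_truth: dict) -> dict:
    # The evaluator judges each assertion pass/fail from raw
    # observations alone. No partial credit, no benefit of the doubt.
    # It receives expectations, never the inspector's steps.
    return {name: check(observations) for name, check in ground_truth.items()}

# Assertions stated as predicates over observations (here as lambdas;
# in the article's scheme they live in ground-truth.md as prose).
ground_truth = {
    "map shows imagery": lambda o: o["map_panel_visible"],
    "at least 3 results": lambda o: o["result_count"] >= 3,
    "no console errors": lambda o: o["console_errors"] == 0,
}

verdict = run_evaluator(run_inspector("open URL, click, screenshot"), ground_truth)
```

The runner is the only place the two files meet; the agents themselves stay contamination-free.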
---

The contract is a pair of files.

**inspector.md** — steps only. "Open this URL. Click this button. Wait for this element. Take a screenshot." The word "should" does not appear in this file.

**ground-truth.md** — expectations only. "The map panel shows real satellite imagery. The sidebar displays at least 3 results. No console errors." The word "click" does not appear in this file.

The moment an inspector step says "verify the panel shows data," it's no longer inspecting — it's evaluating. The moment a ground truth assertion says "navigate to settings," it's no longer judging — it's instructing. Either contamination makes the eval unreliable.

---

The inspector runs a continuous capture loop. A screenshot every 200 milliseconds, but only when pixels actually change. The result: a sequence of lossless PNGs — `change-0000.png`, `change-0001.png`, `change-0002.png`. Each frame is a distinct visual state. A loading spinner appearing. Data populating a panel. A map tile rendering real imagery.

Not video. Video compresses frames, introduces artifacts. These are already stills. Lossless. Individually addressable by filename. The evaluator checks `change-0015.png` for whether the map loaded and gets a pixel-perfect answer.

---

No mocks. Every eval runs against real APIs, real backends, real keys from environment variables. Missing key? The inspector fails fast. It never stubs the service.

A mocked API returns predictable responses. A real API returns pagination quirks, rate limits, slow responses that trigger loading states. Your eval isn't testing your app. It's testing your mocks.

`page.evaluate` — a browser API for running JavaScript in the page — is for reading state only. Extracting DOM values, checking element counts. If a user can't do it with their hands, the inspector can't do it in a step.

---

The black background was never a bug. The agent wrote code that executed correctly toward a target nobody had described: "what does the user see when this works?"
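The change-detecting capture loop described above is simple to sketch. This is a minimal, hypothetical version: `grab_frame` stands in for whatever produces a screenshot as PNG bytes (for example, Playwright's `page.screenshot()`), and frames are compared by hash rather than pixel diff.

```python
import hashlib
import time
from pathlib import Path

def capture_changes(grab_frame, out_dir, interval=0.2, max_frames=50):
    """Poll grab_frame() every `interval` seconds; save a frame only
    when it differs from the previously saved one. Each saved PNG is
    individually addressable: change-0000.png, change-0001.png, ..."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    last_digest = None
    saved = 0
    for _ in range(max_frames):
        png = grab_frame()
        digest = hashlib.sha256(png).hexdigest()
        if digest != last_digest:  # pixels actually changed
            (out / f"change-{saved:04d}.png").write_bytes(png)
            saved += 1
            last_digest = digest
        time.sleep(interval)
    return saved
```

Because the files are lossless stills rather than video frames, the evaluator can reference any moment by filename and trust every pixel in it.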
Eval-Driven Development answers that question first. Write down what success looks like as ground truth. Build an inspector that captures what actually happens. Run an evaluator that judges the gap. Keep them isolated so neither can cheat. The agent still writes the code. But it writes toward a picture of success, not a checklist of function calls.