What I learned building a coding eval harness


I spent the past few weeks building a coding eval harness from scratch — 20 Python bug-fixing tasks, sandboxed execution in Docker, and a leaderboard with confidence intervals. The goal was to understand how LLM evaluation actually works end-to-end, not just use someone else’s benchmark.

Here is what I found and what surprised me.

What I built

The setup is straightforward: each task is a buggy Python function with a test suite. A model gets the buggy code and must return a fixed version. The fixed code runs in a Docker container, the tests execute, and the result is pass or fail.
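A minimal sketch of that task shape, under two assumptions of mine: one file of buggy code plus a pytest-style test file, and the tests held out from the model (the names here are illustrative, not the harness's actual API):

```python
from dataclasses import dataclass

@dataclass
class Task:
    """One bug-fixing task: buggy source plus the tests that define 'fixed'."""
    task_id: str
    buggy_code: str   # shown to the model
    test_code: str    # run against the model's returned code, not shown to it

def build_prompt(task: Task) -> str:
    # The model sees only the buggy function, so it must reason about
    # intent rather than pattern-match against visible assertions.
    return (
        "Fix the bug in the following Python function. "
        "Return the complete corrected function.\n\n" + task.buggy_code
    )

task = Task(
    task_id="off-by-one",
    buggy_code="def last_index(xs):\n    return len(xs)\n",
    test_code="assert last_index([1, 2, 3]) == 2\n",
)
```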

I ran three models through OpenRouter — google/gemini-2.0-flash, openai/gpt-4o-mini, and qwen/qwen3-coder-next — across 20 tasks. Here are the results:

Model                         Tasks    Pass   Pass Rate     95% CI
google/gemini-2.0-flash          20      20     100.0%  [83.9, 100.0]
openai/gpt-4o-mini               20      18      90.0%  [69.9, 97.2]
qwen/qwen3-coder-next            20      14      70.0%  [48.1, 85.5]

Gemini passed everything. GPT-4o-mini failed two. Qwen failed six. But the confidence intervals tell a more honest story.

The confidence interval problem

The naive way to report eval results is a pass rate: 100%, 90%, 70%. But with 20 tasks, those numbers are much less certain than they look.

I used Wilson score confidence intervals instead of the standard p ± 1.96·√(p(1-p)/n). The reason: the naive formula collapses at the boundaries. If a model passes all 20 tasks, it gives you [100%, 100%] — which claims certainty you don’t have. Wilson scores stay honest because they’re derived differently: instead of expanding outward from the observed rate, they ask which true proportions are consistent with what you observed.
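The Wilson interval is a small closed-form computation. A sketch (this is the standard formula, nothing harness-specific):

```python
from math import sqrt

def wilson_ci(passes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate of passes/n (z=1.96 for 95%)."""
    p = passes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom          # pulled toward 0.5
    half = (z / denom) * sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return max(0.0, center - half), min(1.0, center + half)

# 20/20 does NOT collapse to [100%, 100%]:
lo, hi = wilson_ci(20, 20)
print(f"[{lo:.1%}, {hi:.1%}]")  # [83.9%, 100.0%]
```

Note the two ways it differs from the naive interval: the center is shrunk toward 50%, and the width never hits zero at p = 0 or p = 1.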

The practical consequence: Gemini’s [83.9%, 100%] CI and GPT-4o-mini’s [69.9%, 97.2%] CI overlap substantially. You cannot conclude from 20 tasks that Gemini is actually better than GPT-4o-mini. You would need more tasks to make that claim with confidence. The leaderboard is useful, but it should be read with that caveat in mind.

Task design is the hard part

I assumed the hard part would be the infrastructure — Docker sandboxing, model API calls, result aggregation. It wasn’t. The hard part was designing tasks that actually discriminate between models.

The bugs I wrote fell into two categories once I saw the results:

Easy tasks — tasks where all three models passed. These turned out to be bugs that are immediately obvious from the function signature and docstring: a wrong operator (- instead of +), a missing .lower() call, returning -1 instead of None. Models recognize these patterns immediately.

Discriminating tasks — tasks that separated models. Two of them (none-input and string-unicode) tripped up more than one model. Both required reasoning about edge cases that aren’t explicitly stated in the docstring: what happens when you pass None to a function that iterates over its argument? What happens when you try to ASCII-encode a unicode string?
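To make the None-input pattern concrete, here is a hypothetical, reconstructed example in the same spirit (not one of the actual 20 tasks):

```python
# Buggy: iterating over None raises TypeError.
def count_items(items):
    """Return the number of items, 0 for empty input."""
    return sum(1 for _ in items)

# Fixed: "0 for empty input" arguably covers None, but nothing says so
# explicitly -- that ambiguity is exactly what makes the task discriminate.
def count_items_fixed(items):
    if items is None:
        return 0
    return sum(1 for _ in items)
```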

The lesson: a good eval task tests whether a model understands the intent of the code, not just whether it can pattern-match to a common fix. Writing tasks that consistently land in the “sometimes fails” range — where you actually learn something — is a design problem, not an engineering problem.

Why Docker matters

Running model-generated code directly on the host is an obvious security problem, but there is a subtler reason Docker matters for eval correctness: isolation prevents cross-contamination between runs.

My initial instinct was to keep a persistent container and reset it between runs — avoid the startup cost. The problem is that resetting requires you to enumerate everything that could be dirty: files, installed packages, imported modules in the Python process. You will miss something, and you will get a subtly wrong result that is hard to reproduce.

The right approach is a fresh container per run with volume-mounted task files. Container startup adds ~0.5 seconds per run. For 20 tasks × 3 models = 60 runs, that is 30 seconds of overhead. Not a problem. In exchange, you get a hard guarantee that run N cannot affect run N+1.
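A sketch of the fresh-container invocation, assuming a prebuilt image (python:3.12-slim here) and standard docker CLI flags; the harness's exact flags may differ:

```python
import pathlib
import subprocess
import tempfile

def docker_cmd(task_dir: str) -> list[str]:
    """Command line for one throwaway container run."""
    return [
        "docker", "run",
        "--rm",                        # container deleted on exit: no state survives
        "--network", "none",           # untrusted code gets no network access
        "-v", f"{task_dir}:/task:ro",  # read-only mount of the task files
        "python:3.12-slim",
        "python", "/task/run_tests.py",
    ]

def run_task(solution_code: str, test_code: str, timeout: int = 30) -> bool:
    """True iff the model's fix passes the tests, in a fresh container."""
    with tempfile.TemporaryDirectory() as tmp:
        work = pathlib.Path(tmp)
        (work / "solution.py").write_text(solution_code)
        (work / "run_tests.py").write_text("from solution import *\n" + test_code)
        try:
            result = subprocess.run(
                docker_cmd(str(work)), capture_output=True, timeout=timeout
            )
        except (subprocess.TimeoutExpired, FileNotFoundError):
            return False  # a hang (or missing docker binary) counts as a failure
        return result.returncode == 0
```

The read-only mount and --network none are defense in depth on top of the fresh-container guarantee: even within a single run, the code cannot modify the task files or phone home.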

Structured outputs over prompt engineering

Getting models to return clean Python code without markdown fences, preamble, or explanation is a classic prompt engineering problem. The standard approach is to ask nicely (“return only the code, no explanation”) and then strip markdown fences with a regex. This is fragile — models occasionally ignore formatting instructions when they are reasoning hard about the code.

A better approach: use the model API’s structured output feature to force the response into a JSON schema. I used OpenRouter’s json_schema response format, which requires the model to return a JSON object with a solution_code field. The extraction becomes a dict lookup instead of a regex. It works reliably across models.
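The request shape, as I understand OpenRouter's OpenAI-compatible json_schema format (the field names follow that convention; verify against the current OpenRouter docs before relying on them):

```python
import json

def fix_request(model: str, buggy_code: str) -> dict:
    """Chat-completions payload forcing a {"solution_code": ...} response."""
    return {
        "model": model,
        "messages": [
            {"role": "user", "content": f"Fix the bug:\n\n{buggy_code}"}
        ],
        "response_format": {
            "type": "json_schema",
            "json_schema": {
                "name": "bug_fix",
                "strict": True,
                "schema": {
                    "type": "object",
                    "properties": {"solution_code": {"type": "string"}},
                    "required": ["solution_code"],
                    "additionalProperties": False,
                },
            },
        },
    }

def extract_solution(response_content: str) -> str:
    # No regex over markdown fences: the schema guarantees this key exists.
    return json.loads(response_content)["solution_code"]
```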

The tradeoff: not all models support structured output through OpenRouter. For the three I tested, it worked. If you add models, verify support first.

What I would do differently

More tasks. 20 tasks is enough to see patterns but not enough to draw statistically confident conclusions. The CIs are wide. I would want 50+ tasks to make claims I could defend.

Partial credit. Pass/fail per task loses information. A model that passes 3 of 4 assertions is meaningfully different from one that passes 0 of 4. Adding per-assertion scoring would give finer-grained signal without requiring more tasks.
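What per-assertion scoring could look like, assuming each task's tests can be split into independent assertion callables (a sketch, not something the current harness does):

```python
from typing import Callable

def partial_score(assertions: list[Callable[[], None]]) -> float:
    """Fraction of assertions that pass, instead of a single pass/fail bit."""
    passed = 0
    for check in assertions:
        try:
            check()
            passed += 1
        except Exception:
            pass  # any failure (AssertionError or a crash) counts as a miss
    return passed / len(assertions)

def failing() -> None:
    raise AssertionError

# A fix that handles 3 of 4 cases scores 0.75 rather than 0.
checks = [lambda: None, lambda: None, lambda: None, failing]
print(partial_score(checks))  # 0.75
```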

Harder bugs. My bugs are all single-function, single-bug, self-contained. Real codebases have bugs that require understanding multiple functions, or bugs that only manifest under specific runtime conditions. The tasks I wrote are a good start but they test a narrow slice of what “debugging ability” actually means.

The code

Everything is on GitHub. The harness runs with one command and supports any model available on OpenRouter.

This is the first of three artifacts I am building. The next one is a minimal coding agent — no frameworks, ~500 lines of Python — that uses this eval harness as its feedback signal.