🎯 PDB: Precise Debugging Benchmark

PDB is an automatic pipeline that converts any coding dataset into a debugging benchmark with precision-aware evaluation. It generates buggy programs by synthesizing verified atomic bugs and composing them into multi-bug programs, then evaluates models using edit-level precision and bug-level recall.

Abstract

Unlike code completion, debugging requires localizing faults and applying targeted edits. We observe that frontier LLMs often regenerate correct but over-edited solutions during debugging. To evaluate how far LLMs are from precise debugging, we introduce the Precise Debugging Benchmarking (PDB) framework, which automatically converts any coding dataset into a debugging benchmark with precision-aware evaluation.

PDB generates buggy programs by synthesizing verified atomic bugs and composing them into multi-bug programs. We define two novel metrics, edit-level precision and bug-level recall, which measure, respectively, how many of a model's edits are necessary and how many bugs are resolved. We release two evaluation benchmarks: PDB-Single-Hard (5,751 single-line bug examples) and PDB-Wild (484 examples: 256 multi-line synthesized bugs from BigCodeBench and LiveCodeBench plus 228 real-world repository bugs from SWE-bench).
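As a rough illustration of the composition step, the sketch below overlays single-line atomic bugs onto a reference program. It assumes each atomic bug is a one-line replacement at a distinct line index; the helper name `compose_bugs`, the toy program, and the two bugs are hypothetical, and the verification of each atomic bug that PDB performs is not modeled here.

```python
def compose_bugs(reference: str, atomic_bugs: dict[int, str]) -> str:
    """Overlay single-line atomic bugs onto a reference program.

    `atomic_bugs` maps a 0-based line index to the buggy replacement line.
    Hypothetical helper: it does not check that each bug breaks any test.
    """
    lines = reference.splitlines()
    for index, buggy_line in atomic_bugs.items():
        if lines[index] == buggy_line:
            raise ValueError(f"line {index} is unchanged, so it is not a bug")
        lines[index] = buggy_line
    return "\n".join(lines)


reference = "\n".join([
    "def mean(xs):",
    "    total = 0",
    "    for x in xs:",
    "        total += x",
    "    return total / len(xs)",
])

# Two illustrative atomic bugs on distinct lines (made up for this example).
atomic_bugs = {1: "    total = 1", 3: "        total -= x"}
buggy = compose_bugs(reference, atomic_bugs)
print(buggy)
```

Keeping the map from line index to injected bug alongside the composed program is what would later allow an evaluator to tell necessary edits from unnecessary ones.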

Experiments show that frontier models, such as GPT-5.1-Codex and DeepSeek-V3.2-Thinking, achieve unit-test pass rates above 76% but precision at or below 45%, even when explicitly instructed to perform minimal debugging. Iterative and agentic debugging strategies do not substantially improve precision or recall, highlighting the need to rethink post-training pipelines for coding models.
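The gap between passing tests and editing precisely can be made concrete with a small sketch. It assumes a line-based reading of the two metrics: precision is the share of edited lines that are known bug lines, and recall is the share of known bugs that are resolved. The exact definitions used by PDB may differ; the function names, the toy program, and the choice to score an empty edit set as 0.0 are assumptions for illustration.

```python
import difflib


def edited_lines(buggy: str, fixed: str) -> set[int]:
    """0-based indices of lines in `buggy` that the fix replaced or deleted.

    A pure insertion is attributed to the line it was inserted before.
    """
    a, b = buggy.splitlines(), fixed.splitlines()
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    touched: set[int] = set()
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            touched.update(range(i1, i2))
        elif tag == "insert":
            touched.add(i1)
    return touched


def edit_precision(edited: set[int], bug_lines: set[int]) -> float:
    """Share of edited lines that were actually buggy (0.0 if nothing was edited)."""
    if not edited:
        return 0.0  # assumption: an empty edit set earns no precision credit
    return len(edited & bug_lines) / len(edited)


def bug_recall(resolved: set[int], bug_lines: set[int]) -> float:
    """Share of known bugs that were resolved."""
    return len(resolved & bug_lines) / len(bug_lines)


buggy = "\n".join([
    "def mean(xs):",
    "    total = 1",           # bug: should start at 0
    "    for x in xs:",
    "        total -= x",      # bug: should add
    "    return total / len(xs)",
])
# A fix that is correct but also renames the loop variable (an unnecessary edit).
fixed = "\n".join([
    "def mean(xs):",
    "    total = 0",
    "    for value in xs:",
    "        total += value",
    "    return total / len(xs)",
])

bug_lines = {1, 3}
edited = edited_lines(buggy, fixed)
precision = edit_precision(edited, bug_lines)
# Here both bugs are treated as resolved; in practice this would come from tests.
recall = bug_recall({1, 3}, bug_lines)
print(sorted(edited), round(precision, 3), recall)
```

In this toy case the fix resolves both bugs, so recall is 1.0, but one of the three edited lines was never buggy, so precision is 2/3.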

Datasets

We release two evaluation sets built with the PDB generation and evaluation pipeline, sourced from three existing coding benchmarks: BigCodeBench (API usage), LiveCodeBench (algorithmic reasoning), and SWE-bench (real-world repository-level software engineering).

PDB-Single-Hard — 5,751 single-line bug examples, filtered from an initial pool of 7,591 PDB-Single examples to retain only cases that are not easily solved by a quorum of reference models.
PDB-Wild — 484 examples covering both synthesized multi-line bugs (256 contiguous 2–4 line bug blocks from BigCodeBench and LiveCodeBench, atomicity-filtered) and real-world bugs (228 examples from SWE-bench repositories).
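The hardness filter behind PDB-Single-Hard can be sketched as a quorum check. The version below keeps an example when fewer than `quorum` reference models solve it; the threshold, the model names, and the meaning of "solve" are assumptions, since the text above does not specify them.

```python
def retain_hard(results: dict[str, dict[str, bool]], quorum: int) -> list[str]:
    """Keep example ids solved by fewer than `quorum` reference models.

    `results[example_id][model_name]` is True if that model solved the example.
    """
    return [
        example_id
        for example_id, solved_by in results.items()
        if sum(solved_by.values()) < quorum
    ]


# Hypothetical outcomes for three examples and three reference models.
results = {
    "ex-1": {"model_a": True, "model_b": True, "model_c": True},
    "ex-2": {"model_a": True, "model_b": False, "model_c": False},
    "ex-3": {"model_a": False, "model_b": False, "model_c": False},
}
hard = retain_hard(results, quorum=2)
print(hard)
```

With a quorum of two, only the example solved by every reference model is dropped.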

PDB-Single-Hard data distribution: 5,751 examples across source benchmark, bug count (1–4), and bug category.

Results

Model rankings on debugging precision differ strikingly from rankings on unit-test pass rate. The discrepancy persists across both PDB-Single-Hard (single-line bugs) and PDB-Wild (multi-line synthesized bugs and real-world repository bugs from SWE-bench), indicating that neither bug granularity nor real-world provenance closes the gap.

Scatter plots for each benchmark follow. Full tables are on the Leaderboard.

PDB-Single-Hard (single-line)

PDB-Wild (multi-line + repository-level)

Bug-count breakdown on PDB-Single-Hard (BigCodeBench).

Bug-count breakdown on PDB-Single-Hard (LiveCodeBench).

Ablation Studies

Ablation (Prompting): Freeform vs. minimal debugging

Across all evaluated models, freeform prompting produces a substantial drop in edit-level precision and bug-level recall. Even the strongest models, including Claude-Sonnet-4.5 and Qwen3-Coder-480B, achieve less than 60% precision without a minimal-edit constraint; Gemini-2.5-Pro drops by 40 percentage points, and GPT-5.1-Codex fails to reach 20%. Prompt-level constraints are necessary but insufficient: minimal-debug prompts reduce over-editing, but they do not fundamentally change the underlying regeneration tendency.

Model performance under minimal-debug vs. freeform prompting.

Ablation (Iteration & agents): Iterative and agentic debugging

Iterative (up to three revision attempts) and agentic (with unit-test and execution feedback) settings consistently improve unit-test scores and recall, but precision stays flat or degrades. In fact, agentic debugging often underperforms plain iterative debugging in precision, suggesting that additional feedback exacerbates regeneration-oriented strategies. Even Claude-Code, the strongest agentic baseline, reaches only roughly 50% precision.

Model performance under iterative and agentic setups.
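For orientation, an iterative setup with a cap of three revision attempts might look like the loop below. This is a sketch under assumptions rather than the harness used in the experiments: `propose_fix` and `passes_tests` are hypothetical callables standing in for a model call and a unit-test run, and stopping as soon as the tests pass is an assumption.

```python
from typing import Callable


def iterative_debug(
    program: str,
    propose_fix: Callable[[str], str],
    passes_tests: Callable[[str], bool],
    max_attempts: int = 3,
) -> tuple[str, int]:
    """Revise `program` up to `max_attempts` times, stopping once tests pass."""
    current = program
    attempts = 0
    while attempts < max_attempts:
        current = propose_fix(current)
        attempts += 1
        if passes_tests(current):
            break
    return current, attempts


# Toy stand-ins: the "model" repairs one known bug per call.
KNOWN_FIXES = [("total = 1", "total = 0"), ("-=", "+=")]


def toy_propose_fix(source: str) -> str:
    for old, new in KNOWN_FIXES:
        if old in source:
            return source.replace(old, new)
    return source


def toy_passes_tests(source: str) -> bool:
    return "total = 0" in source and "+=" in source


final, attempts = iterative_debug("total = 1\ntotal -= x", toy_propose_fix, toy_passes_tests)
print(attempts)
print(final)
```

A loop of this shape rewards reaching a passing program and places no cost on how much of the program each revision rewrites.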

Ablation (Bug categories): Which defect categories are easiest to repair?

With the exception of Gemini-2.5-Pro, which exhibits a relatively uniform recall of roughly 70% across all categories, most models show markedly higher recall on Build/Package/Merge defects. We hypothesize that this advantage arises from the higher prevalence of such defects in pretraining data, which makes them easier to recognize and repair than algorithmic or boundary-condition faults.