pith. machine review for the scientific record.

arxiv: 2605.09134 · v3 · submitted 2026-05-09 · 💻 cs.AI · cs.SE

Recognition: 2 Lean theorem links

BoostAPR: Boosting Automated Program Repair via Execution-Grounded Reinforcement Learning with Dual Reward Models

Hongbo Wang, Xiaotang Shang, Xuhong Chen, Xunzhu Tang, Yiming Cao, Yuanhao Li

Pith reviewed 2026-05-13 05:57 UTC · model grok-4.3

classification 💻 cs.AI cs.SE
keywords automated program repair · reinforcement learning · dual reward models · line-level credit assignment · program repair · PPO optimization · SWE-bench · cross-language transfer

The pith

BoostAPR improves automated program repair by training a line-level credit allocator from execution outcomes to guide reinforcement learning toward more precise edits.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a three-stage approach to automated program repair that starts with supervised fine-tuning on verified code fixes, then trains two separate reward models from test execution results, and finally applies PPO optimization where one model reallocates credit at the level of individual code lines. This setup tackles the difficulty of sparse feedback and coarse sequence-level rewards that make it unclear which specific changes actually resolve bugs. The line-level allocator works at a natural scale for code modifications, allowing the learning process to focus rewards on effective edit regions rather than treating entire programs as single units. If the method works as described, it produces measurable gains in bug-fixing success rates and supports transfer of repair skills between programming languages. Readers would care because more reliable automated repair reduces the manual effort needed to diagnose and correct software defects.
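As a concrete illustration of the stage-three mechanism, here is a minimal sketch of how per-line credits might be spread over token positions and combined with a terminal sequence-level score. The function name, the alpha mixing weight, and the span representation are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch, not the authors' code: spread each edited line's credit
# over the tokens of that line, and land the sequence-level score on the
# final token, as is common in RLHF-style PPO setups.

def token_rewards(seq_reward, line_credits, line_token_spans, alpha=0.5):
    """line_credits[i] is the allocator's credit for edited line i;
    line_token_spans[i] is the (start, end) token range of that line."""
    n_tokens = max(end for _, end in line_token_spans)
    rewards = [0.0] * n_tokens
    for credit, (start, end) in zip(line_credits, line_token_spans):
        per_token = alpha * credit / max(end - start, 1)
        for t in range(start, end):
            rewards[t] += per_token
    rewards[-1] += (1 - alpha) * seq_reward  # terminal sequence-level score
    return rewards

# Example: two edited lines, the second carrying most of the credit.
print(token_rewards(1.0, [0.2, 0.9], [(0, 4), (4, 10)]))
```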

Core claim

BoostAPR trains a sequence-level assessor and a line-level credit allocator from execution outcomes, then uses the line-level model during PPO to redistribute rewards toward the edits that matter, yielding higher repair success on multiple benchmarks including 40.7 percent on SWE-bench Verified and strong cross-language results on Defects4J.

What carries the argument

The line-level credit allocator, which learns to assign partial rewards to individual code lines based on how execution outcomes change after edits.
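One plausible way to derive training targets for such an allocator, sketched under our own assumptions since the review does not spell out the paper's labeling procedure: leave each edited line out of the verified patch and credit the lines whose removal breaks the tests. `apply_edits` and `run_tests` are caller-supplied hypothetical helpers.

```python
# Leave-one-out ablation sketch for line-level credit labels. This is an
# illustrative assumption, not necessarily BoostAPR's actual procedure.

def line_credit_labels(buggy_code, edits, apply_edits, run_tests):
    """edits: list of per-line edits; returns 0/1 causal labels."""
    assert run_tests(apply_edits(buggy_code, edits)), "full patch must pass"
    labels = []
    for i in range(len(edits)):
        ablated = edits[:i] + edits[i + 1:]          # drop edit i
        still_passes = run_tests(apply_edits(buggy_code, ablated))
        labels.append(0.0 if still_passes else 1.0)  # causal iff needed
    return labels
```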

If this is right

  • Repair success rates rise substantially, reaching 40.7 percent on SWE-bench Verified and 24.8 percent on Defects4J with Python-to-Java transfer.
  • The same trained model generalizes competitively to HumanEval-Java at 84.5 percent and QuixBugs at 95.0 percent.
  • Credit assignment at line granularity produces more stable reinforcement learning updates than sequence-level rewards alone.
  • Open-source models can reach performance levels previously seen only in closed systems when dual rewards are used.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The dual-reward pattern could be tested on other editing tasks such as automated refactoring or test generation where partial success is also hard to credit.
  • Combining the line-level allocator with richer test suites or symbolic execution might further reduce noise from coincidental passes.
  • If the credit allocator generalizes, similar intermediate-granularity rewards could apply to non-code sequence tasks like proof generation or dialogue response improvement.

Load-bearing premise

That execution outcomes provide clean enough signals to train a line-level model that correctly credits only the edits responsible for fixing bugs rather than coincidental test passes.

What would settle it

A test set where the line-level allocator consistently assigns high credit to edits that pass tests only because of incomplete coverage or side effects, without actually resolving the underlying bug.
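A hedged sketch of what that falsification harness could look like, assuming a trained allocator exposing a credit(code, line) method and access to both the visible test suite and a stricter held-out suite; all names here are illustrative.

```python
# Sketch of the falsification test described above: measure how often the
# allocator still assigns high credit to coincidental fixes, i.e. patches
# that pass the visible tests but fail a held-out suite.

def overfit_credit_rate(allocator, patches, visible_pass, heldout_pass,
                        threshold=0.5):
    """Fraction of coincidental fixes whose edited lines still receive
    high credit from the allocator."""
    coincidental = [p for p in patches
                    if visible_pass(p) and not heldout_pass(p)]
    if not coincidental:
        return 0.0
    fooled = sum(
        1 for p in coincidental
        if max(allocator.credit(p.code, ln) for ln in p.edited_lines)
        >= threshold
    )
    return fooled / len(coincidental)
```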

Figures

Figures reproduced from arXiv: 2605.09134 by Hongbo Wang, Xiaotang Shang, Xuhong Chen, Xunzhu Tang, Yiming Cao, Yuanhao Li.

Figure 1
Figure 1: Overview of the BOOSTAPR training framework. Our approach consists of three stages: Stage I performs supervised fine-tuning on execution-verified demonstrations with reasoning traces; Stage II trains dual reward models using a hybrid regression-preference objective on execution outcomes; Stage III optimizes the policy via PPO with token-level rewards derived from the combination of Rseq and Rline. The line… view at source ↗
Figure 2
Figure 2: PPO training dynamics. Performance improves steadily until approximately step 250, then plateaus. Shaded region shows standard deviation across 3 seeds. (Panel title: Credit Assignment Strategies.) view at source ↗
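Figure 1's caption names a hybrid regression-preference objective for training the reward models. A minimal sketch of one such loss, under our own reading rather than the paper's published code: a regression term anchors predicted rewards to execution-derived scores, while a Bradley-Terry preference term ranks passing patches above failing ones; the mixing weight lam is an assumed hyperparameter.

```python
import torch.nn.functional as F

def hybrid_reward_loss(pred_scores, exec_scores, chosen, rejected, lam=0.5):
    """pred_scores/exec_scores: reward-model outputs vs execution-derived
    targets; chosen/rejected: scores for preferred vs dispreferred patches."""
    regression = F.mse_loss(pred_scores, exec_scores)     # anchor to execution signal
    preference = -F.logsigmoid(chosen - rejected).mean()  # Bradley-Terry ranking term
    return lam * regression + (1 - lam) * preference
```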
read the original abstract

Reinforcement learning for program repair is hindered by sparse execution feedback and coarse sequence-level rewards that obscure which edits actually fix bugs. We present BoostAPR, a three-stage framework addressing these challenges: (1) supervised fine-tuning on execution-verified demonstrations with reasoning traces, (2) training dual reward models--a sequence-level assessor and a line-level credit allocator--from execution outcomes, and (3) PPO optimization where the line-level model redistributes rewards to critical edit regions. This line-level credit assignment operates at an intermediate granularity naturally suited to code changes. Trained on SWE-Gym and evaluated on four benchmarks, BoostAPR achieves 40.7% on SWE-bench Verified (+22.9pp over base model), 24.8% on Defects4J (Python-to-Java transfer), 84.5% on HumanEval-Java, and 95.0% on QuixBugs, achieving competitive results among open-source models with strong cross-language generalization.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated authors' rebuttal, circularity check, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents BoostAPR, a three-stage framework for automated program repair (APR) that uses supervised fine-tuning on execution-verified demonstrations, trains dual reward models (a sequence-level assessor and a line-level credit allocator) from execution outcomes, and applies PPO optimization with rewards redistributed to critical edit regions by the line-level model. It reports strong performance gains on SWE-bench Verified (40.7%, +22.9 percentage points over the base model), Defects4J (24.8% with Python-to-Java transfer), HumanEval-Java (84.5%), and QuixBugs (95.0%), positioning it as competitive among open-source models with notable cross-language generalization.

Significance. Should the empirical results prove robust and the line-level credit assignment mechanism reliably isolate causal bug-fixing edits, this approach would represent a meaningful advance in addressing the sparse and coarse reward problem in RL for code repair. The dual-reward design and execution-grounded training are promising for finer-grained credit assignment in program repair tasks. The cross-benchmark and cross-language results suggest good generalization potential.

major comments (2)
  1. [Abstract and §3 (three-stage framework)] The description of training the line-level credit allocator in stage (2) does not detail mechanisms to prevent reward assignment to non-causal edits that coincidentally pass tests due to incomplete coverage. This assumption is load-bearing for the central claim, as the performance improvements are attributed to the PPO redistribution in stage (3); without evidence that the allocator isolates true fixes, the gains may reflect spurious correlations rather than improved repair.
  2. [Evaluation section (implied by results)] No ablation studies or statistical tests are referenced to isolate the contribution of the dual-reward PPO stage from the supervised fine-tuning stage alone. This is necessary to substantiate that the +22.9pp gain on SWE-bench Verified stems specifically from the line-level credit allocation.
minor comments (2)
  1. The abstract would benefit from specifying the base model used for the +22.9pp comparison and the exact baselines for 'competitive results among open-source models'.
  2. Clarify the training data split between SWE-Gym and the evaluation benchmarks to address potential data leakage concerns.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important areas for clarification and additional evidence, which we will address in the revision.

read point-by-point responses
  1. Referee: [Abstract and §3 (three-stage framework)] The description of training the line-level credit allocator in stage (2) does not detail mechanisms to prevent reward assignment to non-causal edits that coincidentally pass tests due to incomplete coverage. This assumption is load-bearing for the central claim, as the performance improvements are attributed to the PPO redistribution in stage (3); without evidence that the allocator isolates true fixes, the gains may reflect spurious correlations rather than improved repair.

    Authors: We agree that the current description in §3 could be expanded to more explicitly address potential non-causal edits arising from incomplete test coverage. In the revised manuscript, we will elaborate on the training procedure for the line-level credit allocator, detailing how it is trained exclusively on execution-verified fixes from SWE-Gym, the use of differential execution feedback to prioritize edits that directly contribute to test passage, and any filtering steps to reduce spurious assignments (a sketch of such a filter appears after this list). This will strengthen the justification for attributing gains to the PPO redistribution in stage (3). revision: yes

  2. Referee: [Evaluation section (implied by results)] No ablation studies or statistical tests are referenced to isolate the contribution of the dual-reward PPO stage from the supervised fine-tuning stage alone. This is necessary to substantiate that the +22.9pp gain on SWE-bench Verified stems specifically from the line-level credit allocation.

    Authors: We concur that isolating the contribution of the dual-reward PPO stage requires explicit ablations. We will add ablation experiments in the evaluation section comparing the full BoostAPR pipeline against the supervised fine-tuning baseline alone, reporting results on SWE-bench Verified and other benchmarks. We will also include statistical significance tests (e.g., bootstrap resampling for confidence intervals and paired tests where applicable) to quantify the incremental gains from the line-level credit allocation and PPO optimization. revision: yes
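Two of the commitments above are easy to make concrete. For point 1, a differential-execution filter could keep an edit as a training example only when it flips at least one failing test to passing without regressing any passing test; the helper below is a sketch with hypothetical inputs, not the authors' code.

```python
def is_causal_edit(outcomes_before, outcomes_after):
    """outcomes_*: dict mapping test id -> True (pass) / False (fail)."""
    fixed = any(not outcomes_before[t] and outcomes_after[t]
                for t in outcomes_before)
    regressed = any(outcomes_before[t] and not outcomes_after[t]
                    for t in outcomes_before)
    return fixed and not regressed

# Example: the edit repairs test_b without breaking test_a.
print(is_causal_edit({"test_a": True, "test_b": False},
                     {"test_a": True, "test_b": True}))  # True
```

For point 2, the proposed bootstrap test over paired per-instance outcomes (1 = resolved, 0 = unresolved) could look like this sketch:

```python
import random

def bootstrap_delta_ci(full, sft_only, n_boot=10_000, alpha=0.05, seed=0):
    """95% CI (default alpha) for the paired success-rate gain of the full
    pipeline over the SFT-only baseline on the same benchmark instances."""
    rng = random.Random(seed)
    n = len(full)
    deltas = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample instances
        deltas.append(sum(full[i] - sft_only[i] for i in idx) / n)
    deltas.sort()
    return (deltas[int(alpha / 2 * n_boot)],
            deltas[int((1 - alpha / 2) * n_boot) - 1])
```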

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

full rationale

The described three-stage framework grounds rewards in external execution outcomes from training data and measures final performance on held-out external benchmarks (SWE-bench Verified, Defects4J, HumanEval-Java, QuixBugs). No step reduces the headline performance claims to the training inputs by construction, self-definition, or self-citation. Reward model training follows standard supervised fitting on execution signals, with evaluation kept separate; this is self-contained empirical work rather than a tautological derivation.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The framework rests on standard RL training assumptions plus the domain premise that execution feedback can be decomposed into reliable line-level signals; no new physical entities are postulated.

free parameters (1)
  • Training hyperparameters for SFT, reward models, and PPO
    Typical RL hyperparameters such as learning rates and batch sizes are chosen or tuned but not enumerated in the abstract; an illustrative placeholder config follows the ledger.
axioms (1)
  • domain assumption Execution outcomes supply a reliable training signal for identifying which specific code edits fix bugs
    Invoked when the line-level model is trained to redistribute rewards from execution results.
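Spelled out, the ledger's single free-parameter entry covers a family of knobs like the following. Every value is a placeholder typical of RLHF-style PPO pipelines, not a number reported by the paper (the 3 seeds come from the Figure 2 caption).

```python
# Illustrative only: placeholder hyperparameters, NOT values from the paper.
CONFIG = {
    "sft":    {"lr": 2e-5, "epochs": 2, "batch_size": 32},
    "reward": {"lr": 1e-5, "hybrid_loss_weight": 0.5},
    "ppo":    {"lr": 1e-6, "clip_range": 0.2, "kl_coef": 0.05,
               "rollouts_per_update": 256, "seeds": 3},
}
```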

pith-pipeline@v0.9.0 · 5488 in / 1291 out tokens · 74843 ms · 2026-05-13T05:57:19.035400+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.