BoostAPR: Boosting Automated Program Repair via Execution-Grounded Reinforcement Learning with Dual Reward Models
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-13 05:57 UTC · model grok-4.3
The pith
BoostAPR improves automated program repair by training a line-level credit allocator from execution outcomes to guide reinforcement learning edits more precisely.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
BoostAPR trains a sequence-level assessor and a line-level credit allocator from execution outcomes, then uses the line-level model during PPO to redistribute rewards toward the edits that matter. This yields higher repair success across multiple benchmarks, including 40.7 percent on SWE-bench Verified and strong cross-language results on Defects4J.
What carries the argument
The line-level credit allocator, which learns to assign partial rewards to individual code lines based on how execution outcomes change after edits.
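The matched excerpt below mentions allocation weights computed via a temperature-controlled softmax. As a minimal sketch of that idea (the function name, score inputs, and temperature default are assumptions for illustration, not the paper's implementation), line-level credit could redistribute a scalar execution reward like this:

```python
import math

def allocate_line_credit(line_scores, total_reward, temperature=1.0):
    """Distribute a sequence-level reward across edited lines.

    line_scores: per-line scores from a learned credit model (assumed input).
    Returns one partial reward per line via a temperature-controlled softmax;
    lower temperature concentrates credit on the highest-scoring line.
    """
    exps = [math.exp(s / temperature) for s in line_scores]
    z = sum(exps)
    return [total_reward * e / z for e in exps]

# Example: three edited lines; the second is judged most responsible.
credits = allocate_line_credit([0.1, 2.0, 0.3], total_reward=1.0, temperature=0.5)
assert abs(sum(credits) - 1.0) < 1e-9   # credit is conserved
assert credits.index(max(credits)) == 1  # most credit goes to line 2
```

The conservation property (partial rewards summing to the original sequence reward) is what makes this a redistribution rather than a new reward signal.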
If this is right
- Repair success rates rise substantially, reaching 40.7 percent on SWE-bench Verified and 24.8 percent on Defects4J with Python-to-Java transfer.
- The same trained model generalizes competitively to HumanEval-Java at 84.5 percent and QuixBugs at 95.0 percent.
- Credit assignment at line granularity produces more stable reinforcement learning updates than sequence-level rewards alone.
- Open-source models can reach performance levels previously seen only in closed systems when dual rewards are used.
Where Pith is reading between the lines
- The dual-reward pattern could be tested on other editing tasks such as automated refactoring or test generation where partial success is also hard to credit.
- Combining the line-level allocator with richer test suites or symbolic execution might further reduce noise from coincidental passes.
- If the credit allocator generalizes, similar intermediate-granularity rewards could apply to non-code sequence tasks like proof generation or dialogue response improvement.
Load-bearing premise
That execution outcomes provide clean enough signals to train a line-level model that correctly credits only the edits responsible for fixing bugs rather than coincidental test passes.
What would settle it
A probe set of edits that pass tests only because of incomplete coverage or side effects, without resolving the underlying bug: if the line-level allocator consistently assigns such edits high credit, the load-bearing premise fails; if it withholds credit, the premise holds.
Original abstract
Reinforcement learning for program repair is hindered by sparse execution feedback and coarse sequence-level rewards that obscure which edits actually fix bugs. We present BoostAPR, a three-stage framework addressing these challenges: (1) supervised fine-tuning on execution-verified demonstrations with reasoning traces, (2) training dual reward models--a sequence-level assessor and a line-level credit allocator--from execution outcomes, and (3) PPO optimization where the line-level model redistributes rewards to critical edit regions. This line-level credit assignment operates at an intermediate granularity naturally suited to code changes. Trained on SWE-Gym and evaluated on four benchmarks, BoostAPR achieves 40.7% on SWE-bench Verified (+22.9pp over base model), 24.8% on Defects4J (Python-to-Java transfer), 84.5% on HumanEval-Java, and 95.0% on QuixBugs, achieving competitive results among open-source models with strong cross-language generalization.
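Stage (3)'s redistribution step can be illustrated with a minimal sketch. The per-token spreading scheme and the function signature here are assumptions chosen for clarity, not BoostAPR's actual PPO plumbing:

```python
def shape_rewards(seq_reward, line_weights, line_lengths):
    """Redistribute a sequence-level reward into per-token PPO rewards.

    seq_reward: scalar from the sequence-level assessor (assumed).
    line_weights: normalized credit weights from the line-level allocator.
    line_lengths: token count of each edited line; each line's share is
    spread evenly over its tokens.
    Returns a flat per-token reward list whose sum equals seq_reward.
    """
    rewards = []
    for w, n in zip(line_weights, line_lengths):
        rewards.extend([seq_reward * w / n] * n)
    return rewards

# Two edited lines, weighted 0.7 / 0.3, with 2 and 3 tokens respectively.
per_token = shape_rewards(1.0, [0.7, 0.3], [2, 3])
assert abs(sum(per_token) - 1.0) < 1e-9
```

Under this scheme the PPO advantage estimates see dense, line-localized rewards instead of a single terminal scalar, which is the stability argument the abstract makes.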
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents BoostAPR, a three-stage framework for automated program repair (APR) that uses supervised fine-tuning on execution-verified demonstrations, trains dual reward models (a sequence-level assessor and a line-level credit allocator) from execution outcomes, and applies PPO optimization with rewards redistributed to critical edit regions by the line-level model. It reports strong performance gains on SWE-bench Verified (40.7%, +22.9 percentage points over the base model), Defects4J (24.8% with Python-to-Java transfer), HumanEval-Java (84.5%), and QuixBugs (95.0%), positioning it as competitive among open-source models with notable cross-language generalization.
Significance. Should the empirical results prove robust and the line-level credit assignment mechanism reliably isolate causal bug-fixing edits, this approach would represent a meaningful advance in addressing the sparse and coarse reward problem in RL for code repair. The dual-reward design and execution-grounded training are promising for finer-grained credit assignment in program repair tasks. The cross-benchmark and cross-language results suggest good generalization potential.
Major comments (2)
- [Abstract and §3 (three-stage framework)] The description of training the line-level credit allocator in stage (2) does not detail mechanisms to prevent reward assignment to non-causal edits that coincidentally pass tests due to incomplete coverage. This assumption is load-bearing for the central claim, as the performance improvements are attributed to the PPO redistribution in stage (3); without evidence that the allocator isolates true fixes, the gains may reflect spurious correlations rather than improved repair.
- [Evaluation section (implied by results)] No ablation studies or statistical tests are referenced to isolate the contribution of the dual-reward PPO stage from the supervised fine-tuning stage alone. This is necessary to substantiate that the +22.9pp gain on SWE-bench Verified stems specifically from the line-level credit allocation.
Minor comments (2)
- The abstract would benefit from specifying the base model used for the +22.9pp comparison and the exact baselines for 'competitive results among open-source models'.
- Clarify the training data split between SWE-Gym and the evaluation benchmarks to address potential data leakage concerns.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. The comments highlight important areas for clarification and additional evidence, which we will address in the revision.
Point-by-point responses
Referee: [Abstract and §3 (three-stage framework)] The description of training the line-level credit allocator in stage (2) does not detail mechanisms to prevent reward assignment to non-causal edits that coincidentally pass tests due to incomplete coverage. This assumption is load-bearing for the central claim, as the performance improvements are attributed to the PPO redistribution in stage (3); without evidence that the allocator isolates true fixes, the gains may reflect spurious correlations rather than improved repair.
Authors: We agree that the current description in §3 could be expanded to more explicitly address potential non-causal edits arising from incomplete test coverage. In the revised manuscript, we will elaborate on the training procedure for the line-level credit allocator, detailing how it is trained exclusively on execution-verified fixes from SWE-Gym, the use of differential execution feedback to prioritize edits that directly contribute to test passage, and any filtering steps to reduce spurious assignments. This will strengthen the justification for attributing gains to the PPO redistribution in stage (3).
revision: yes
Referee: [Evaluation section (implied by results)] No ablation studies or statistical tests are referenced to isolate the contribution of the dual-reward PPO stage from the supervised fine-tuning stage alone. This is necessary to substantiate that the +22.9pp gain on SWE-bench Verified stems specifically from the line-level credit allocation.
Authors: We concur that isolating the contribution of the dual-reward PPO stage requires explicit ablations. We will add ablation experiments in the evaluation section comparing the full BoostAPR pipeline against the supervised fine-tuning baseline alone, reporting results on SWE-bench Verified and other benchmarks. We will also include statistical significance tests (e.g., bootstrap resampling for confidence intervals and paired tests where applicable) to quantify the incremental gains from the line-level credit allocation and PPO optimization.
revision: yes
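The bootstrap resampling the rebuttal proposes is straightforward to sketch. A minimal percentile-bootstrap confidence interval for a repair success rate might look like the following; the 1000-task outcome vector is purely illustrative, not the paper's data:

```python
import random

def bootstrap_ci(successes, n_resamples=2_000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for a success rate.

    successes: list of 0/1 outcomes, one per benchmark task.
    Returns (low, high) bounds of the (1 - alpha) interval.
    """
    rng = random.Random(seed)
    n = len(successes)
    rates = sorted(
        sum(rng.choices(successes, k=n)) / n for _ in range(n_resamples)
    )
    lo = rates[int((alpha / 2) * n_resamples)]
    hi = rates[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Illustrative only: 1000 hypothetical tasks at a 40.7% success rate.
outcomes = [1] * 407 + [0] * 593
lo, hi = bootstrap_ci(outcomes)
assert lo <= 0.407 <= hi  # interval brackets the point estimate
```

For the between-system comparison the referee asks about, a paired test on per-task outcomes (e.g., resampling the per-task difference between the full pipeline and the SFT-only baseline) would be the matching procedure.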
Circularity Check
No significant circularity in the derivation chain
Full rationale
The described three-stage framework grounds rewards in external execution outcomes from training data and measures final performance on held-out external benchmarks (SWE-bench Verified, Defects4J, HumanEval-Java, QuixBugs). No step reduces the headline performance claims to the training inputs by construction, self-definition, or self-citation. Reward model training follows standard supervised fitting on execution signals, with evaluation kept separate; this is self-contained empirical work rather than a tautological derivation.
Axiom & Free-Parameter Ledger
Free parameters (1)
- Training hyperparameters for SFT, reward models, and PPO
Axioms (1)
- Domain assumption: Execution outcomes supply a reliable training signal for identifying which specific code edits fix bugs.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean (J-cost uniqueness), theorem washburn_uniqueness_aczel, relevance unclear. Matched passage: "Stage II trains dual reward models—a sequence-level assessor and a line-level credit allocator—from execution outcomes, and (3) PPO optimization where the line-level model redistributes rewards to critical edit regions."
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, theorem reality_from_one_distinction, relevance unclear. Matched passage: "Rline assigns credit over edit-line spans... allocation weights via temperature-controlled softmax"