arxiv: 2603.05863 · v2 · submitted 2026-03-06 · 💻 cs.CL · cs.LG · cs.SE

Recognition: 2 theorem links · Lean Theorem

ReflexiCoder: Teaching Large Language Models to Self-Reflect on Generated Code and Self-Correct It via Reinforcement Learning

Juyong Jiang , Jiasi Shen , Sunghun Kim , Kang Min Yoo , Jeonghoon Kim , Sungju Kim

Pith reviewed 2026-05-15 15:52 UTC · model grok-4.3

classification 💻 cs.CL · cs.LG · cs.SE

keywords code generation · self-reflection · reinforcement learning · self-correction · large language models · code benchmarks · autonomous refinement

The pith

Reinforcement learning embeds self-reflection and code correction directly into an 8B model's weights for autonomous fixes at inference time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ReflexiCoder, a reinforcement learning framework that trains large language models to generate code, reflect on bugs and optimizations, and correct errors as one internalized sequence. This replaces reliance on external oracles or repeated prompt cycles with a process that runs entirely inside the model after training. Granular rewards guide the full trajectory during RL so the model learns to debug without ground-truth feedback or execution engines. Experiments show the resulting 8B model reaching high single-attempt scores on HumanEval, MBPP, BigCodeBench, LiveCodeBench, and CodeForces while using roughly 40 percent fewer tokens than base models. The central goal is to raise the performance ceiling of code generation on complex tasks by building correction skills into the weights themselves.

Core claim

ReflexiCoder uses an RL-only training paradigm with granular reward functions to optimize the entire reflection-correction trajectory. The aim is to internalize initial generation, bug- and optimization-aware reflection, and self-correction directly into the model weights, so that self-correction works autonomously at inference time without ground-truth feedback or execution engines.

What carries the argument

Reinforcement learning with granular reward functions applied across the full generation-reflection-correction sequence.

Load-bearing premise

That reinforcement learning on reflection-correction trajectories can embed reliable self-correction ability into the model so it succeeds without any external verification at inference time.

What would settle it

Run the trained ReflexiCoder model on the same benchmarks with no execution feedback or ground truth supplied at inference and measure whether accuracy stays at the reported levels or falls back to the base model's performance.

Original abstract

While Large Language Models (LLMs) have revolutionized code generation, standard "System 1" approaches that generate solutions in a single forward pass often hit a performance ceiling on complex algorithmic tasks. Existing iterative refinement strategies attempt to bridge this gap at inference time, yet they predominantly rely on external oracles, execution feedback, or computationally expensive prompt-response cycles. In this work, we propose ReflexiCoder, a novel reinforcement learning (RL) framework that internalizes the structured reasoning trajectory, encompassing initial generation, bug- and optimization-aware reflection, and self-correction, directly into the model's weights. Unlike prior methods, ReflexiCoder shifts the paradigm from external-dependent refinement to intrinsic, fully autonomous self-reflection and self-correction capabilities at inference time. We utilize an RL-only training paradigm with granular reward functions to optimize the entire reflection-correction trajectory, teaching the model how to debug without reliance on ground-truth feedback or execution engines at inference time. Extensive experiments across seven benchmarks demonstrate that our ReflexiCoder-8B establishes a new state-of-the-art (SOTA) among leading open-source models in the 1.5B to 14B range, achieving 94.51% (87.20%) on HumanEval (Plus), 81.80% (78.57%) on MBPP (Plus), 35.00% on BigCodeBench, 52.21% on LiveCodeBench, and 37.34% on CodeForces in a single-attempt setting, rivaling or surpassing proprietary models like GPT-5.1. Notably, our framework is significantly more token-efficient than base models, reducing inference-time compute overhead by approximately 40% through disciplined, efficient reasoning and reflection patterns. The source code and data are available at https://github.com/juyongjiang/ReflexiCoder.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

ReflexiCoder trains an 8B model with RL to embed generate-reflect-correct into weights and reports strong single-attempt code benchmark scores, but the abstract leaves the reward setup and inference procedure too opaque to verify true autonomy.

The main thing here is that ReflexiCoder applies RL-only training with granular rewards to internalize the full reflection and self-correction trajectory into an 8B model's weights, so it can produce corrected code in a single generation pass without external oracles or execution feedback at inference. They report clear gains: 94.51% on HumanEval, 81.80% on MBPP, plus solid numbers on BigCodeBench and LiveCodeBench, all while cutting token use by about 40% compared to the base model. Releasing the code and data is also useful for anyone who wants to inspect or extend the work. The shift from prompt cycles to baked-in behavior via RL is the concrete difference from the methods they cite.

The soft spot is the missing detail on exactly how the rewards are shaped, what the training trajectories look like, and whether ablations confirm that the gains come from internalization rather than data or prompting artifacts. The stress-test point lands because the benchmarks still rely on post-generation execution to score correctness, so the paper needs to show explicitly that inference stays single-pass and free of any hidden feedback. Without those sections, the autonomy claim is hard to assess from the abstract alone.

This paper is for people working on training smaller code models to reason better rather than relying on inference tricks. A reader focused on RL for structured reasoning would get value from the numbers and the released artifacts, even if they have to wait for the full methods to replicate. I would send it to peer review so referees can check the training procedure and confirm whether the internalization actually holds up.

Referee Report

3 major / 2 minor

Summary. The paper introduces ReflexiCoder, a reinforcement learning framework that internalizes a generate-reflect-correct trajectory into LLM weights via granular rewards, enabling fully autonomous self-correction at inference time without external oracles or execution feedback. For an 8B model, it reports new SOTA results among open-source models in the 1.5B-14B range on HumanEval (94.51%), MBPP (81.80%), BigCodeBench (35.00%), LiveCodeBench (52.21%), and CodeForces (37.34%), along with a reduction of roughly 40% in inference-time compute overhead.

Significance. If the central internalization claim holds and is reproducible, the work would meaningfully advance autonomous code generation by shifting refinement from inference-time prompting or tools into model weights, with potential efficiency benefits over iterative baselines.

major comments (3)

[Methods] Methods section: The paper states that 'granular reward functions' optimize the full reflection-correction trajectory but supplies no explicit formulation, weighting scheme, or pseudocode for these rewards (contrast with standard RLHF reward models). This definition is load-bearing for the claim that the trajectory is internalized into weights rather than learned as prompting behavior.
[Experiments] Experiments (§4 or equivalent): No training details, hyperparameters, ablation studies on reward components, or error analysis are provided, despite the abstract reporting precise benchmark numbers. Without these, it is impossible to verify whether the SOTA scores arise from the proposed RL internalization or from standard supervised fine-tuning plus prompting.
[Results] Inference protocol (abstract and results): The single-attempt autonomous setting is asserted, yet all reported metrics rely on post-generation execution for scoring. The manuscript must clarify and demonstrate (e.g., via controlled comparison) that inference uses no implicit external validation or multi-turn loops, as this distinction is central to the paradigm-shift claim.

minor comments (2)

[Abstract] Abstract: The 40% token-efficiency claim should be backed by a specific table or figure in the main text showing token counts or latency for ReflexiCoder versus baselines.
[Abstract] Notation: The parenthetical scores (e.g., 94.51% (87.20%)) on HumanEval (Plus) are unclear without an explicit definition of the 'Plus' variant in the main text.
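
As a sketch of the kind of evidence the first minor comment asks for, the snippet below computes a relative token reduction from per-problem token counts. The function name `relative_token_reduction` and the numbers in the usage note are hypothetical placeholders, not measurements from the paper; the paper's own definition of "compute overhead" may differ.

```python
from statistics import mean
from typing import Sequence


def relative_token_reduction(base_tokens: Sequence[int],
                             model_tokens: Sequence[int]) -> float:
    """Return 1 - mean(model) / mean(base) over the same set of problems.

    ASSUMPTION: efficiency is measured as mean generated tokens per
    problem; the paper may define its metric differently.
    """
    if not base_tokens or len(base_tokens) != len(model_tokens):
        raise ValueError("need two non-empty, equally sized token lists")
    return 1.0 - mean(model_tokens) / mean(base_tokens)
```

With placeholder counts of `[1000, 2000]` for a base model and `[600, 1200]` for the trained model, the function returns a reduction of about 0.4, which is the shape of result a supporting table would need to report per benchmark.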

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point-by-point below. We agree that several clarifications and additions are needed to strengthen the presentation of our RL internalization approach and will revise the manuscript accordingly.

Referee: [Methods] Methods section: The paper states that 'granular reward functions' optimize the full reflection-correction trajectory but supplies no explicit formulation, weighting scheme, or pseudocode for these rewards (contrast with standard RLHF reward models). This definition is load-bearing for the claim that the trajectory is internalized into weights rather than learned as prompting behavior.

Authors: We agree that the explicit formulation of the granular reward functions is missing from the current Methods section and is necessary to substantiate the internalization claim. In the revised manuscript we will add the full mathematical definitions of each reward component (correctness, reflection quality, and correction efficiency), the weighting scheme used to combine them into a single scalar reward, and pseudocode for the reward computation and trajectory optimization process. These additions will make clear how the generate-reflect-correct behavior is shaped directly in the weights via RL rather than through inference-time prompting.

revision: yes

Referee: [Experiments] Experiments (§4 or equivalent): No training details, hyperparameters, ablation studies on reward components, or error analysis are provided, despite the abstract reporting precise benchmark numbers. Without these, it is impossible to verify whether the SOTA scores arise from the proposed RL internalization or from standard supervised fine-tuning plus prompting.

Authors: We acknowledge the absence of these details and agree they are required for reproducibility and to isolate the contribution of the RL stage. The revised Experiments section will include a dedicated training-details subsection with all hyperparameters (learning rate, batch size, number of RL steps, discount factor, etc.), ablation studies that remove or re-weight individual reward components, and an error analysis of failure modes. These additions will allow readers to verify that the reported gains derive from the proposed internalization rather than from supervised fine-tuning or prompting alone.

revision: yes

Referee: [Results] Inference protocol (abstract and results): The single-attempt autonomous setting is asserted, yet all reported metrics rely on post-generation execution for scoring. The manuscript must clarify and demonstrate (e.g., via controlled comparison) that inference uses no implicit external validation or multi-turn loops, as this distinction is central to the paradigm-shift claim.

Authors: We confirm that inference in ReflexiCoder is strictly single-pass and autonomous: the model generates the full reflection-correction trajectory in one uninterrupted generation, with no external oracles, execution feedback, or multi-turn loops at inference time. Post-generation execution is used only for offline benchmark scoring, which is standard and does not influence the model's behavior. In the revision we will add an explicit inference-protocol paragraph and a controlled comparison (single-pass vs. any hypothetical multi-turn variant) to demonstrate that the reported numbers are obtained under the claimed autonomous setting.

revision: yes
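
The separation claimed in this last response, generation first and execution only for scoring, can be sketched as a two-phase harness. This is a minimal illustration under stated assumptions, not the paper's evaluation code: `generate` is a hypothetical callable standing in for the model, it is assumed to return only the final corrected code, and the problem dictionaries with `prompt` and `tests` keys are an invented format.

```python
from typing import Callable, Dict, List


def score_offline(code: str, tests: str) -> bool:
    """Run test assertions against already-finalized code (scoring only).

    NOTE: exec on untrusted model output should be sandboxed in practice.
    """
    namespace: Dict[str, object] = {}
    try:
        exec(code, namespace)   # define the candidate solution
        exec(tests, namespace)  # failing assertions raise an exception
    except Exception:
        return False
    return True


def pass_at_1(generate: Callable[[str], str],
              problems: List[Dict[str, str]]) -> float:
    """Single-attempt accuracy with no feedback path to the model."""
    # Phase 1: exactly one generate() call per problem; nothing is executed.
    outputs = [generate(p["prompt"]) for p in problems]
    # Phase 2: execution happens only after every output is fixed.
    passed = [score_offline(o, p["tests"]) for o, p in zip(outputs, problems)]
    return sum(passed) / len(problems)
```

Because all outputs are collected before any test runs, no execution result can reach the model. A multi-turn variant would interleave the two phases, which is the contrast a controlled comparison would need to make.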

Circularity Check

0 steps flagged

No circularity: empirical RL training outcomes with no definitional or self-referential reductions

The paper describes an RL training process that uses granular rewards to teach reflection-correction trajectories, with all reported results framed as measured benchmark performance after training. No equations, derivations, or fitted parameters are presented that reduce the central claim (autonomous self-correction at inference) to its own inputs by construction. No load-bearing self-citations or uniqueness theorems are invoked. The internalization effect is an empirical training result, not a definitional equivalence.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The framework rests on the assumption that custom reward signals can shape internal reasoning trajectories; no explicit free parameters or invented physical entities are named, but the ReflexiCoder training paradigm itself is the novel construct.

free parameters (1)

granular reward functions
Designed to optimize the entire reflection-correction trajectory; concrete definitions and weighting not provided in abstract.

axioms (1)

domain assumption: Reinforcement learning can embed complex multi-step reasoning (generation, reflection, correction) into model parameters without external feedback at inference.
Central premise of the RL-only training paradigm described in the abstract.

invented entities (1)

ReflexiCoder framework · no independent evidence
purpose: Internalizes structured self-reflection and self-correction for code generation
Newly proposed training method whose effectiveness is asserted via benchmark results.

pith-pipeline@v0.9.0 · 5666 in / 1362 out tokens · 54739 ms · 2026-05-15T15:52:11.053905+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel echoes

ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

We utilize an RL-only training paradigm with granular reward functions to optimize the entire reflection-correction trajectory... $R_{\text{trajectory}}(\tau) = \mathbb{I}[r_n = r_{\max}] + \eta \sum w_t m_t$ ... $P(n)$ decay for $n > n_0$ ... $E(n) = \mathbb{I}[r_n \ge \tau_q]/n + (r_n - r_0)/\max(1, n-1)$ ... $R_{\text{overall}}(\tau) = \mathbb{I}[F(\tau) = 1]\, P(n)\, (\phi R_{\text{trajectory}} + \psi E(n)) + \xi F(\tau)$

IndisputableMonolith/Foundation/ArrowOfTime.lean z_monotone_absolute echoes

ReflexiCoder is significantly more token-efficient... reducing inference-time compute overhead by approximately 40% through disciplined, efficient reasoning and reflection patterns... executes exactly one reflection cycle in virtually all cases
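
To make the reward expression quoted in the first excerpt concrete, here is a minimal sketch that evaluates it for one trajectory. The excerpt elides several definitions, so the following are assumptions made only for illustration: the exponential form of P(n) and its `decay` rate, F(τ) treated as a 0/1 format flag, `w` and `m` passed in as opaque per-step weights and signals, `r_max` supplied by the caller, and at least one reflection cycle (n ≥ 1).

```python
import math
from typing import Sequence


def length_penalty(n: int, n0: int, decay: float) -> float:
    """P(n): no penalty up to n0 cycles, then decay.

    ASSUMPTION: the excerpt only says P(n) decays for n > n0; the
    exponential form and the `decay` rate are illustrative choices.
    """
    return 1.0 if n <= n0 else math.exp(-decay * (n - n0))


def efficiency(r: Sequence[float], tau_q: float) -> float:
    """E(n) = I[r_n >= tau_q] / n + (r_n - r_0) / max(1, n - 1)."""
    n = len(r) - 1  # r = [r_0, ..., r_n]; ASSUMPTION: n >= 1
    if n < 1:
        raise ValueError("need r_0 and at least one later score")
    return float(r[-1] >= tau_q) / n + (r[-1] - r[0]) / max(1, n - 1)


def trajectory_reward(r: Sequence[float], w: Sequence[float],
                      m: Sequence[float], r_max: float, eta: float) -> float:
    """R_trajectory = I[r_n = r_max] + eta * sum_t w_t * m_t."""
    hit_max = float(math.isclose(r[-1], r_max))  # tolerant float equality
    return hit_max + eta * sum(wt * mt for wt, mt in zip(w, m))


def overall_reward(r: Sequence[float], w: Sequence[float],
                   m: Sequence[float], format_ok: bool, *,
                   r_max: float, tau_q: float, eta: float, phi: float,
                   psi: float, xi: float, n0: int = 1,
                   decay: float = 0.5) -> float:
    """R_overall = I[F=1] * P(n) * (phi*R_traj + psi*E(n)) + xi*F."""
    f = 1.0 if format_ok else 0.0  # ASSUMPTION: F(tau) is a 0/1 flag
    n = len(r) - 1
    shaped = (phi * trajectory_reward(r, w, m, r_max, eta)
              + psi * efficiency(r, tau_q))
    return f * length_penalty(n, n0, decay) * shaped + xi * f
```

Under these assumptions, a trajectory that fails the format check receives zero reward regardless of its scores, and trajectories longer than `n0` cycles are discounted. That is consistent with the second excerpt's remark that the model executes exactly one reflection cycle in virtually all cases.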

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

WebGen-R1: Incentivizing Large Language Models to Generate Functional and Aesthetic Websites with Reinforcement Learning
cs.CL · 2026-04 · unverdicted · novelty 6.0

WebGen-R1 uses end-to-end RL with scaffold-driven generation and cascaded rewards for structure, function, and aesthetics to transform a 7B model into a generator of deployable multi-page websites that rivals much lar...