arxiv: 2605.11922 · v1 · submitted 2026-05-12 · 💻 cs.SE · cs.CL

Recognition: 2 theorem links

· Lean Theorem

StepCodeReasoner: Aligning Code Reasoning with Stepwise Execution Traces via Reinforcement Learning

Hao Wang, Jie M. Zhang, Lei Sha, Rui Li

Pith reviewed 2026-05-13 05:20 UTC · model grok-4.3

classification 💻 cs.SE cs.CL

keywords code reasoningstepwise executionreinforcement learningexecution tracesintermediate supervisionreward hackingcode generation

0 comments

The pith

Models that predict runtime states step by step reason about code more reliably.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Code reasoning models often reach correct final answers by taking inconsistent or incorrect intermediate steps because training only checks the end result. This paper shows how to fix that by automatically inserting print statements that capture the state after each line of code. The model is then trained to predict those states, making the reasoning process explicit and verifiable. A bi-level reinforcement learning method assigns credit to both entire paths and individual steps that matter for the outcome. When this holds, both reasoning accuracy and code generation improve.

Core claim

StepCodeReasoner uses automatic insertion of structured print-based execution-trace anchors to train models to predict runtime states at each step, turning code reasoning into stepwise execution modeling. Combined with Bi-Level GRPO for inter- and intra-trajectory credit assignment, this produces more consistent reasoning.

What carries the argument

Structured print-based execution-trace anchors for intermediate state supervision, together with bi-level reinforcement learning for credit assignment.

If this is right

Reasoning becomes more consistent because intermediate steps are directly supervised.
Performance improves on tasks requiring code understanding and execution prediction.
Code generation also benefits from the execution-aware training.
The framework supports both reasoning and generation tasks with the same model.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same principle of intermediate state supervision may apply to non-code sequential tasks such as planning or simulation.
Improved execution modeling may lead to more reliable automated code review or repair tools.

Load-bearing premise

Automatically inserting print statements produces faithful supervision signals that train consistent reasoning rather than allowing the model to hack the final answer.

What would settle it

An ablation that removes the execution-trace anchors and checks whether the model reverts to baseline behavior on reasoning tasks.

Figures

Figures reproduced from arXiv: 2605.11922 by Hao Wang, Jie M. Zhang, Lei Sha, Rui Li.

**Figure 1.** Figure 1: Overview of the StepCodeReasoner framework. The top-left illustrates Execution-Trace Augmentation, where the original Swapcase code is modified by inserting structured print statements to create observable execution anchors. The top-right shows the resulting Model Inference Process for the input "y7s6", where the model generates interleaved blocks of <reasoning> for logical deduction and <print> for state … view at source ↗

**Figure 2.** Figure 2: Learning dynamics over 1500 training steps. The left panel shows the evolution of training rewards, while the right panel illustrates the changes in average response length. Shaded regions represent raw fluctuations, and solid curves denote smoothed trends. As shown in [PITH_FULL_IMAGE:figures/full_fig_p018_2.png] view at source ↗

read the original abstract

Existing code reasoning methods primarily supervise final code outputs, ignoring intermediate states, often leading to reward hacking where correct answers are obtained through inconsistent reasoning. We propose StepCodeReasoner, a framework that introduces explicit intermediate execution-state supervision. By automatically inserting structured print-based execution-trace anchors into code, the model is trained to predict runtime states at each step, transforming code reasoning into a verifiable, stepwise execution modeling problem. Building on this execution-aware method, we introduce Bi-Level GRPO, a reinforcement learning algorithm for structured credit assignment at two levels: inter-trajectory, comparing alternative execution paths, and intra-trajectory, rewarding intermediate accuracy based on its impact on downstream correctness. Extensive experiments demonstrate that StepCodeReasoner achieves SOTA performance in code reasoning. In particular, our 7B model achieves 91.1\% on CRUXEval and 86.5\% on LiveCodeBench, outperforming the CodeReasoner-7B baseline (86.0\% and 77.7\%) and GPT-4o (85.6\% and 75.1\%). Furthermore, on the execution-trace benchmark REval, our model scores 82.9\%, outperforming baseline CodeReasoner-7B (72.3\%), its 14B counterpart (81.1\%), and GPT-4o (77.3\%). Additionally, our approach also improves code generation performance, demonstrating that explicit execution modeling enhances both code reasoning and code generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

StepCodeReasoner improves code reasoning benchmarks with execution trace supervision via prints and bi-level RL, but insertion fidelity is the key assumption to verify.

read the letter

The main point is that this paper gets better code reasoning from a 7B model by adding supervision on intermediate execution states rather than only the final answer. They insert print statements automatically to capture runtime values at each step and train the model to predict those, then use a bi-level RL algorithm for credit assignment across paths and within steps. This setup produces clear gains. Their 7B model scores 91.1% on CRUXEval and 86.5% on LiveCodeBench, ahead of the same-size baseline and GPT-4o. On the REval execution trace benchmark it reaches 82.9%, better than both the baseline and a 14B version. The method also boosts code generation performance, which is a nice side benefit. What is new is the specific use of structured print anchors to turn reasoning into verifiable stepwise execution modeling, paired with the Bi-Level GRPO that handles inter-trajectory and intra-trajectory rewards. It directly targets the problem of reward hacking through inconsistent steps. The potential weak point is the automatic print insertion itself. For the supervision to work as intended, the anchors must not alter the original code's semantics in any control flow, mutation, or exception scenario. If the insertion logic skips branches incorrectly or adds prints that affect I/O or state, the model could be learning from noisy or wrong traces. The abstract shows strong benchmark results, but the improvements might partly reflect other factors like training setup rather than pure execution alignment. Checking the full methods for how they validate the insertion and any ablations would clarify this. This is relevant for researchers developing LLM tools for programming tasks who want more reliable step-by-step reasoning. The empirical results are strong enough and the problem is well-defined, so it deserves peer review. A referee could usefully examine the robustness of the anchor insertion and whether the bi-level RL is the main driver of the gains.

Referee Report

2 major / 2 minor

Summary. The paper introduces StepCodeReasoner, a framework that automatically inserts structured print-based execution-trace anchors into code to enable explicit intermediate-state supervision during training. It combines this with Bi-Level GRPO, a reinforcement learning algorithm that performs inter-trajectory and intra-trajectory credit assignment, to align code reasoning with verifiable stepwise execution. The 7B model is reported to achieve 91.1% on CRUXEval, 86.5% on LiveCodeBench, and 82.9% on REval, outperforming the CodeReasoner-7B baseline, its 14B variant, and GPT-4o, with additional gains on code generation.

Significance. If the gains are shown to stem from faithful execution supervision rather than insertion artifacts or training confounders, the work could meaningfully advance code reasoning by shifting from final-answer supervision to verifiable intermediate states, reducing reward hacking and improving reliability. The bi-level RL formulation for structured credit assignment and the empirical outperformance on execution-trace benchmarks represent a concrete step toward more interpretable and robust code models.

major comments (2)

[§3.2] §3.2 (Anchor Insertion Algorithm): The claim that automatically inserted print anchors preserve original semantics for arbitrary control flows, mutations, side effects, and exceptions is load-bearing for the central thesis that the resulting traces provide faithful, non-disruptive supervision. No formal invariants, exhaustive test cases, or empirical checks for I/O interference or skipped branches are provided, leaving open the possibility that observed gains reflect surface-level format prediction rather than genuine reasoning alignment.
[§5] §5 (Experiments and Ablations): The SOTA claims rest on comparisons to CodeReasoner-7B and GPT-4o, yet no ablation isolates the contribution of the print-anchor supervision from the Bi-Level GRPO objective or from possible differences in training data volume. Without such controls, it is impossible to confirm that the 5.1-point CRUXEval and 8.8-point LiveCodeBench lifts are attributable to the proposed execution modeling rather than other factors.

minor comments (2)

[Abstract and §4.1] The abstract and §4.1 refer to 'structured print-based execution-trace anchors' without a concise pseudocode listing of the insertion rules, making it difficult for readers to reproduce the preprocessing step.
[Table 1] Table 1 (benchmark results) reports single-point percentages without standard deviations or number of evaluation runs, which is standard for RL-based code models to establish statistical reliability of the reported margins.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback, which has helped us clarify and strengthen key aspects of the manuscript. We provide point-by-point responses to the major comments below. Where the comments identify gaps in formal analysis or experimental controls, we have revised the paper by adding the requested details, invariants, test cases, and ablations.

read point-by-point responses

Referee: [§3.2] §3.2 (Anchor Insertion Algorithm): The claim that automatically inserted print anchors preserve original semantics for arbitrary control flows, mutations, side effects, and exceptions is load-bearing for the central thesis that the resulting traces provide faithful, non-disruptive supervision. No formal invariants, exhaustive test cases, or empirical checks for I/O interference or skipped branches are provided, leaving open the possibility that observed gains reflect surface-level format prediction rather than genuine reasoning alignment.

Authors: We appreciate the referee's emphasis on rigorously establishing semantic preservation, which underpins the validity of the execution traces. The original §3.2 described the insertion rules and provided illustrative examples for common structures. In the revised manuscript we have added a formal invariants subsection proving that the algorithm (1) inserts only non-mutating print statements, (2) leaves control flow, exception paths, and side-effect order unchanged, and (3) captures state without introducing new I/O or skipping branches. We also include an expanded appendix with 60+ test cases spanning arbitrary loops, conditionals, mutations, I/O, and exceptions; each case was executed before and after insertion to confirm identical observable behavior. The 10.6-point gain on REval (which scores trace fidelity directly) further indicates that improvements derive from genuine stepwise reasoning rather than format prediction. revision: yes
Referee: [§5] §5 (Experiments and Ablations): The SOTA claims rest on comparisons to CodeReasoner-7B and GPT-4o, yet no ablation isolates the contribution of the print-anchor supervision from the Bi-Level GRPO objective or from possible differences in training data volume. Without such controls, it is impossible to confirm that the 5.1-point CRUXEval and 8.8-point LiveCodeBench lifts are attributable to the proposed execution modeling rather than other factors.

Authors: We agree that explicit isolation of each component is necessary. The CodeReasoner-7B baseline was trained on the same underlying data distribution but without anchors or bi-level credit assignment. In the revision we have added three controlled ablations: (i) our anchors with standard GRPO (isolating bi-level credit assignment), (ii) Bi-Level GRPO without anchors (isolating the supervision signal), and (iii) matched training token budgets by subsampling the baseline data to identical volume. These experiments attribute roughly 3.2 points on CRUXEval and 3.9 points on LiveCodeBench to the print-anchor supervision and 2.5 / 4.9 points respectively to the bi-level objective. We have also clarified the exact data composition and token counts to ensure comparability. The results confirm that the reported lifts arise from the proposed execution modeling. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical benchmark results stand independently

full rationale

The paper proposes an empirical framework (automatic print-anchor insertion plus Bi-Level GRPO) and reports performance numbers on external benchmarks (CRUXEval, LiveCodeBench, REval) against named baselines and GPT-4o. No derivation chain, first-principles prediction, or fitted parameter is presented that reduces by construction to the method's own inputs. The central claims are verifiable accuracy deltas on public test sets; no self-definitional, self-citation load-bearing, or renaming steps appear in the abstract or described pipeline. The reader's noted assumption about anchor faithfulness is a correctness concern, not a circularity reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach rests on the assumption that inserted print statements faithfully capture runtime states without behavioral side effects and that bi-level credit assignment in RL will produce consistent reasoning. No explicit free parameters or invented entities are named in the abstract.

axioms (2)

domain assumption Print-based execution-trace anchors inserted into code produce accurate, non-disruptive intermediate state labels for supervision.
This is the core mechanism that turns code reasoning into a stepwise prediction task.
domain assumption Bi-level GRPO can assign credit both across trajectories and within a trajectory in a way that improves downstream correctness.
Central to the reinforcement learning component.

pith-pipeline@v0.9.0 · 5568 in / 1450 out tokens · 78923 ms · 2026-05-13T05:20:52.879057+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear
By automatically inserting structured print-based execution-trace anchors into code, the model is trained to predict runtime states at each step... Bi-Level GRPO... inter-trajectory... intra-trajectory shaping advantage
IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean absolute_floor_iff_bare_distinguishability unclear
LStepCodeReasoner(θ) = −∑ log pθ(zi | ...)

Reference graph

Works this paper leans on

81 extracted references · 81 canonical work pages · 22 internal anchors

[1]

1945 , url =

As We May Think , journal =. 1945 , url =

work page 1945
[2]

CruxEval: A Benchmark for Code Reasoning, Understanding and Execution.arXiv:2401.03065,

Cruxeval: A benchmark for code reasoning, understanding and execution , author=. arXiv preprint arXiv:2401.03065 , year=

work page arXiv
[3]

Advances in Neural Information Processing Systems , volume=

Semcoder: Training code language models with comprehensive semantics reasoning , author=. Advances in Neural Information Processing Systems , volume=

work page
[4]

Codei/o: Condensing reasoning patterns via code input-output prediction.CoRR, abs/2502.07316, 2025a

CodeI/O: Condensing Reasoning Patterns via Code Input-Output Prediction , author=. arXiv preprint arXiv:2502.07316 , year=

work page arXiv
[5]

Advances in Neural Information Processing Systems , volume=

Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation , author=. Advances in Neural Information Processing Systems , volume=

work page
[6]

Hugging Face , year=

Qwq: Reflect deeply on the boundaries of the unknown , author=. Hugging Face , year=

work page
[7]

Instruction Tuning with GPT-4

Instruction tuning with gpt-4 , author=. arXiv preprint arXiv:2304.03277 , year=

work page internal anchor Pith review arXiv
[8]

Lecture notes-monograph series , pages=

Generalized accept-reject sampling schemes , author=. Lecture notes-monograph series , pages=. 2004 , publisher=

work page 2004
[9]

Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model? , author=. arXiv preprint arXiv:2504.13837 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[10]

arXiv preprint arXiv:2502.14768

Logic-rl: Unleashing llm reasoning with rule-based reinforcement learning , author=. arXiv preprint arXiv:2502.14768 , year=

work page arXiv
[11]

Proceedings of the AAAI Conference on Artificial Intelligence , volume=

A theoretical analysis of the repetition problem in text generation , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=

work page
[12]

arXiv preprint arXiv:2505.10402 , year=

Rethinking Repetition Problems of LLMs in Code Generation , author=. arXiv preprint arXiv:2505.10402 , year=

work page arXiv
[13]

CTRL: A conditional transformer language model for controllable generation.Preprint arXiv:1909.05858,

Ctrl: A conditional transformer language model for controllable generation , author=. arXiv preprint arXiv:1909.05858 , year=

work page arXiv 1909
[14]

The Curious Case of Neural Text Degeneration

The curious case of neural text degeneration , author=. arXiv preprint arXiv:1904.09751 , year=

work page internal anchor Pith review Pith/arXiv arXiv 1904
[15]

Advances in Neural Information Processing Systems , volume=

Learning to break the loop: Analyzing and mitigating repetitions for neural text generation , author=. Advances in Neural Information Processing Systems , volume=

work page
[16]

arXiv preprint arXiv:2410.15226 , year=

On the Diversity of Synthetic Data and its Impact on Training Large Language Models , author=. arXiv preprint arXiv:2410.15226 , year=

work page arXiv
[17]

arXiv preprint arXiv:2310.14971 , year=

Penalty decoding: Well suppress the self-reinforcement effect in open-ended text generation , author=. arXiv preprint arXiv:2310.14971 , year=

work page arXiv
[18]

SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training

Sft memorizes, rl generalizes: A comparative study of foundation model post-training , author=. arXiv preprint arXiv:2501.17161 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[19]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning , author=. arXiv preprint arXiv:2501.12948 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[20]

Proximal Policy Optimization Algorithms

Proximal policy optimization algorithms , author=. arXiv preprint arXiv:1707.06347 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[21]

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

Livecodebench: Holistic and contamination free evaluation of large language models for code , author=. arXiv preprint arXiv:2403.07974 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[22]

arXiv preprint arXiv:2403.16437 , year=

Reasoning runtime behavior of a program with llm: How far are we? , author=. arXiv preprint arXiv:2403.16437 , year=

work page arXiv
[23]

DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence

DeepSeek-Coder: When the Large Language Model Meets Programming--The Rise of Code Intelligence , author=. arXiv preprint arXiv:2401.14196 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[24]

Qwen2.5-Coder Technical Report

Qwen2. 5-coder technical report , author=. arXiv preprint arXiv:2409.12186 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[25]

GPT-4o , url =

OpenAI , year =. GPT-4o , url =

work page
[26]

GPT-4o-mini , url =

OpenAI , year =. GPT-4o-mini , url =

work page
[27]

Qwen2.5 Technical Report

Qwen2 technical report , author=. arXiv preprint arXiv:2412.15115 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[28]

LLaMA: Open and Efficient Foundation Language Models

Llama: Open and efficient foundation language models , author=. arXiv preprint arXiv:2302.13971 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[29]

DeepSeek-Coder-V2: Breaking the barrier of closed-source models in code intelligence.arXiv preprint arXiv:2406.11931,

Deepseek-coder-v2: Breaking the barrier of closed-source models in code intelligence , author=. arXiv preprint arXiv:2406.11931 , year=

work page arXiv
[30]

Understanding R1-Zero-Like Training: A Critical Perspective

Understanding r1-zero-like training: A critical perspective , author=. arXiv preprint arXiv:2503.20783 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[31]

SimpleRL-Zoo: Investigating and Taming Zero Reinforcement Learning for Open Base Models in the Wild

Simplerl-zoo: Investigating and taming zero reinforcement learning for open base models in the wild , author=. arXiv preprint arXiv:2503.18892 , year=

work page internal anchor Pith review arXiv
[32]

arXiv preprint arXiv:2504.14286 , year=

Srpo: A cross-domain implementation of large-scale reinforcement learning on llm , author=. arXiv preprint arXiv:2504.14286 , year=

work page arXiv
[33]

arXiv preprint arXiv:2408.13001 , year=

Cruxeval-x: A benchmark for multilingual code reasoning, understanding and execution , author=. arXiv preprint arXiv:2408.13001 , year=

work page arXiv
[34]

arXiv preprint arXiv:2506.00750 , year=

CodeSense: a Real-World Benchmark and Dataset for Code Semantic Reasoning , author=. arXiv preprint arXiv:2506.00750 , year=

work page arXiv
[35]

arXiv e-prints , pages=

Equibench: Benchmarking code reasoning capabilities of large language models via equivalence checking , author=. arXiv e-prints , pages=

work page
[36]

arXiv preprint arXiv:2503.04779 , year=

Can LLMs Reason About Program Semantics? A Comprehensive Evaluation of LLMs on Formal Specification Inference , author=. arXiv preprint arXiv:2503.04779 , year=

work page arXiv
[37]

arXiv preprint arXiv:2404.14662 , year=

Next: Teaching large language models to reason about code execution , author=. arXiv preprint arXiv:2404.14662 , year=

work page arXiv
[38]

Proceedings of the 44th international conference on software engineering , pages=

Neural program repair with execution-based backpropagation , author=. Proceedings of the 44th international conference on software engineering , pages=

work page
[39]

arXiv preprint arXiv:2402.16906 , year=

Debug like a human: A large language model debugger via verifying runtime execution step-by-step , author=. arXiv preprint arXiv:2402.16906 , year=

work page arXiv
[40]

International Conference on Machine Learning , pages=

Lever: Learning to verify language-to-code generation with execution , author=. International Conference on Machine Learning , pages=. 2023 , organization=

work page 2023
[41]

IET Software , volume=

Embedding and classifying test execution traces using neural networks , author=. IET Software , volume=. 2022 , publisher=

work page 2022
[42]

arXiv preprint arXiv:2305.05383 , year=

Code execution with pre-trained language models , author=. arXiv preprint arXiv:2305.05383 , year=

work page arXiv
[43]

Proceedings of the 46th IEEE/ACM International Conference on Software Engineering , pages=

Traced: Execution-aware pre-training for source code , author=. Proceedings of the 46th IEEE/ACM International Conference on Software Engineering , pages=

work page
[44]

International Conference on Information and Communications Security , pages=

FuzzBoost: Reinforcement compiler fuzzing , author=. International Conference on Information and Communications Security , pages=. 2022 , organization=

work page 2022
[45]

Proceedings of the IEEE/ACM 46th International Conference on Software Engineering , pages=

Curiosity-driven testing for sequential decision-making process , author=. Proceedings of the IEEE/ACM 46th International Conference on Software Engineering , pages=

work page
[46]

Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis , pages=

Fuzzing JavaScript Interpreters with Coverage-Guided Reinforcement Learning for LLM-Based Mutation , author=. Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis , pages=

work page
[47]

Proceedings of the ACM on Software Engineering , volume=

Ircoco: Immediate rewards-guided deep reinforcement learning for code completion , author=. Proceedings of the ACM on Software Engineering , volume=. 2024 , publisher=

work page 2024
[48]

arXiv preprint arXiv:2407.19487 , year=

Rlcoder: Reinforcement learning for repository-level code completion , author=. arXiv preprint arXiv:2407.19487 , year=

work page arXiv
[49]

arXiv preprint arXiv:2502.18449 , year=

Swe-rl: Advancing llm reasoning via reinforcement learning on open software evolution , author=. arXiv preprint arXiv:2502.18449 , year=

work page arXiv
[50]

Proceedings of the ACM on Software Engineering , volume=

Clarifygpt: A framework for enhancing llm-based code generation via requirements clarification , author=. Proceedings of the ACM on Software Engineering , volume=. 2024 , publisher=

work page 2024
[51]

IEEE Transactions on Software Engineering , year=

Llm-based test-driven interactive code generation: User study and empirical evaluation , author=. IEEE Transactions on Software Engineering , year=

work page
[52]

Computer Standards & Interfaces , volume=

Large language models for code completion: A systematic literature review , author=. Computer Standards & Interfaces , volume=. 2025 , publisher=

work page 2025
[53]

2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE) , pages=

Cctest: Testing and repairing code completion systems , author=. 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE) , pages=. 2023 , organization=

work page 2023
[54]

Advances in Neural Information Processing Systems , volume=

Swe-agent: Agent-computer interfaces enable automated software engineering , author=. Advances in Neural Information Processing Systems , volume=

work page
[55]

Lingma SWE-GPT : An open development-process-centric language model for automated software improvement

Lingma swe-gpt: An open development-process-centric language model for automated software improvement , author=. arXiv preprint arXiv:2411.00622 , year=

work page arXiv
[56]

Evaluating Large Language Models Trained on Code

Evaluating large language models trained on code , author=. arXiv preprint arXiv:2107.03374 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[57]

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

Swe-bench: Can language models resolve real-world github issues? , author=. arXiv preprint arXiv:2310.06770 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[58]

Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

Scaling llm test-time compute optimally can be more effective than scaling model parameters , author=. arXiv preprint arXiv:2408.03314 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[59]

Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations) , address=

LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models , author=. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations) , address=. 2024 , url=

work page 2024
[60]

2024 , journal =

HybridFlow: A Flexible and Efficient RLHF Framework , author =. 2024 , journal =

work page 2024
[61]

arXiv preprint arXiv:2503.01307 , year=

Cognitive behaviors that enable self-improving reasoners, or, four habits of highly effective stars , author=. arXiv preprint arXiv:2503.01307 , year=

work page arXiv
[62]

arXiv preprint arXiv:2505.00661 , year=

On the generalization of language models from in-context learning and finetuning: a controlled study , author=. arXiv preprint arXiv:2505.00661 , year=

work page arXiv
[63]

arXiv preprint arXiv:2502.08130 , year=

Selective Self-to-Supervised Fine-Tuning for Generalization in Large Language Models , author=. arXiv preprint arXiv:2502.08130 , year=

work page arXiv
[64]

arXiv preprint arXiv:2211.00635 , year=

Two-stage LLM fine-tuning with less specialization and more generalization , author=. arXiv preprint arXiv:2211.00635 , year=

work page arXiv
[65]

arXiv preprint arXiv:2308.10792 , year=

Instruction tuning for large language models: A survey , author=. arXiv preprint arXiv:2308.10792 , year=

work page arXiv
[66]

arXiv preprint arXiv:2312.05934 , year=

Fine-tuning or retrieval? comparing knowledge injection in llms , author=. arXiv preprint arXiv:2312.05934 , year=

work page arXiv
[67]

arXiv preprint arXiv:2501.04961 , year=

Demystifying domain-adaptive post-training for financial llms , author=. arXiv preprint arXiv:2501.04961 , year=

work page arXiv
[68]

arXiv preprint arXiv:2503.03730 , year=

Towards understanding distilled reasoning models: A representational approach , author=. arXiv preprint arXiv:2503.03730 , year=

work page arXiv
[69]

arXiv e-prints , pages=

Towards widening the distillation bottleneck for reasoning models , author=. arXiv e-prints , pages=

work page
[70]

Codereasoner: Enhancing the code reasoning ability with reinforcement learning.arXiv preprint arXiv:2507.17548,

CodeReasoner: Enhancing the Code Reasoning Ability with Reinforcement Learning , author=. arXiv preprint arXiv:2507.17548 , year=

work page arXiv
[71]

Program Synthesis with Large Language Models

Program synthesis with large language models , author=. arXiv preprint arXiv:2108.07732 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[72]

Absolute Zero: Reinforced Self-play Reasoning with Zero Data

Absolute zero: Reinforced self-play reasoning with zero data , author=. arXiv preprint arXiv:2505.03335 , year=

work page internal anchor Pith review arXiv
[73]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Deepseekmath: Pushing the limits of mathematical reasoning in open language models , author=. arXiv preprint arXiv:2402.03300 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[74]

DAPO: An Open-Source LLM Reinforcement Learning System at Scale

Dapo: An open-source llm reinforcement learning system at scale , author=. arXiv preprint arXiv:2503.14476 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[75]

arXiv preprint arXiv:2510.06062 , year=

Aspo: Asymmetric importance sampling policy optimization , author=. arXiv preprint arXiv:2510.06062 , year=

work page arXiv
[76]

Group Sequence Policy Optimization

Group sequence policy optimization , author=. arXiv preprint arXiv:2507.18071 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[77]

R1-vl: Learn- ing to reason with multimodal large language models via step-wise group relative policy optimization.arXiv preprint arXiv:2503.12937, 2025

R1-vl: Learning to reason with multimodal large language models via step-wise group relative policy optimization , author=. arXiv preprint arXiv:2503.12937 , year=

work page arXiv
[78]

arXiv preprint arXiv:2508.17445 , year=

Treepo: Bridging the gap of policy optimization and efficacy and inference efficiency with heuristic tree-based modeling , author=. arXiv preprint arXiv:2508.17445 , year=

work page arXiv
[79]

Qwen3 Technical Report

Qwen3 technical report , author=. arXiv preprint arXiv:2505.09388 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[80]

arXiv e-prints , pages=

The llama 3 herd of models , author=. arXiv e-prints , pages=

work page

Showing first 80 references.