pith. machine review for the scientific record.

arxiv: 2603.29957 · v3 · submitted 2026-03-31 · 💻 cs.SE · cs.LG

Recognition: 2 Lean theorem links

Think Anywhere in Code Generation

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 23:11 UTC · model grok-4.3

classification 💻 cs.SE · cs.LG

keywords code generation · reasoning LLMs · adaptive reasoning · on-demand thinking · reinforcement learning · LLM post-training · benchmark evaluation

The pith

LLMs can invoke reasoning at any token position during code generation rather than only upfront.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Upfront reasoning in LLMs often falls short for code generation because a problem's full complexity appears only while the code is being written. Think-Anywhere trains models to decide on their own when to insert thinking steps at any point in the token sequence: it starts with imitation of useful reasoning patterns, then uses outcome-based reinforcement learning to explore effective placement. The result is stronger performance than prior reasoning techniques and post-training methods across standard benchmarks. Readers care because it makes the reasoning process adaptive to varying difficulty instead of fixed in advance.

Core claim

The paper introduces Think-Anywhere as a mechanism that lets LLMs invoke thinking on-demand at any token position during code generation. Cold-start imitation training teaches initial reasoning patterns, after which outcome-based RL rewards drive the model to discover when and where to reason autonomously. Experiments on LeetCode, LiveCodeBench, HumanEval, and MBPP show state-of-the-art results over existing reasoning methods and recent post-training approaches, with consistent generalization across LLMs and adaptive invocation at high-entropy positions for improved interpretability.

What carries the argument

Think-Anywhere: on-demand reasoning invocation at arbitrary token positions during generation, learned through cold-start imitation followed by outcome-based RL.
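
To make the mechanism concrete, here is a minimal decoding-loop sketch, not the authors' implementation: it assumes the <thinkanywhere> tag vocabulary shown in Figure 1, and the stub sampler stands in for a trained model that assigns probability to opening a thinking block at every position. The stripping step mirrors how inserted reasoning would be removed before the code is executed or scored.

```python
# Minimal sketch of on-demand thinking at decoding time (illustrative only).
# The special tags come from the paper's Figure 1; `sample_next_token` is a
# stub for a real LLM forward pass + sampling step.
import random

THINK_OPEN, THINK_CLOSE, EOS = "<thinkanywhere>", "</thinkanywhere>", "<eos>"
CODE_TOKENS = ["def", "solve():", "x", "=", "1", "return", "x"]

def sample_next_token(context):
    """Stub for one decoding step. A trained Think-Anywhere model would itself
    assign probability to THINK_OPEN at every position; a coin flip stands in
    for that learned decision here."""
    if len(context) > 24:
        return EOS
    if random.random() < 0.15:
        return THINK_OPEN
    return random.choice(CODE_TOKENS)

def generate(prompt):
    tokens = list(prompt)
    while tokens[-1] != EOS:
        tok = sample_next_token(tokens)
        tokens.append(tok)
        if tok == THINK_OPEN:
            # Hidden reasoning span: generated, then closed, then decoding
            # resumes on code tokens as if nothing interrupted it.
            tokens += ["...", "reason", "about", "edge", "cases", "..."]
            tokens.append(THINK_CLOSE)
    return tokens

def strip_thinking(tokens):
    """Remove all <thinkanywhere>...</thinkanywhere> spans so only the
    executable code remains for evaluation."""
    out, depth = [], 0
    for t in tokens:
        if t == THINK_OPEN:
            depth += 1
        elif t == THINK_CLOSE:
            depth = max(0, depth - 1)
        elif depth == 0 and t != EOS:
            out.append(t)
    return out

print(" ".join(strip_thinking(generate(["def", "solve():"]))))
```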

If this is right

  • Outperforms both existing reasoning methods and recent post-training approaches on four mainstream code generation benchmarks.
  • Demonstrates consistent generalization across diverse LLMs.
  • Enables the model to adaptively invoke reasoning at high-entropy positions during generation (see the entropy sketch just after this list).
  • Provides enhanced interpretability of when the model chooses to reason.
  • Allocates reasoning effort where difficulty varies during implementation rather than only before starting.
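
The entropy signal behind the high-entropy-positions finding is easy to state. In the sketch below, the distributions and the 1-nat threshold are illustrative assumptions: the paper reports a correlation between learned invocations and high-entropy positions, not an explicit threshold rule.

```python
# Shannon entropy of the next-token distribution as an uncertainty signal.
# Distributions and the 1.0-nat cutoff are hypothetical, for illustration.
import math

def token_entropy(probs):
    """Shannon entropy (nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# A confident position (boilerplate) vs. an uncertain one (algorithmic choice).
confident = [0.97, 0.01, 0.01, 0.01]
uncertain = [0.30, 0.28, 0.22, 0.20]

for name, dist in [("boilerplate", confident), ("branch point", uncertain)]:
    h = token_entropy(dist)
    flag = "candidate for <thinkanywhere>" if h > 1.0 else "keep writing code"
    print(f"{name}: H = {h:.2f} nats -> {flag}")
```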

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same on-demand mechanism could be tested on sequential tasks outside code such as mathematical proof construction or story writing.
  • Models might learn to interleave short reasoning bursts with action steps in interactive settings like agent workflows.
  • Combining this approach with process-level rewards could further refine the quality of inserted reasoning steps.
  • High-entropy position selection may offer a general signal for monitoring model uncertainty in other generation domains.

Load-bearing premise

That outcome-based RL will cause the model to learn genuinely useful adaptive positions for reasoning rather than simply memorizing reward patterns or introducing new failure modes.
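
For intuition about what "outcome-based" means here, a minimal sketch of a unit-test reward with group-normalized advantages, assuming a GRPO-style recipe (GRPO is the paper's named baseline; the paper's exact reward design is not given in the reviewed text). Note the reward never mentions thinking positions, which is exactly why memorizing reward patterns is a live risk.

```python
# Sketch of an outcome-based reward for code RL, assuming unit-test execution
# is the only signal. Group-mean normalization follows the GRPO recipe; the
# test cases and candidates below are illustrative.
import statistics

def outcome_reward(candidate_fn, tests) -> float:
    """1.0 if the (thinking-stripped) code passes every test, else 0.0; the
    reward says nothing about where, or whether, thinking blocks appeared."""
    try:
        return float(all(candidate_fn(x) == y for x, y in tests))
    except Exception:
        return 0.0

def grpo_advantages(rewards):
    """Per-sample advantage: reward minus the group mean, divided by the group
    std, so only outcome differences drive learning of thinking positions."""
    mu = statistics.mean(rewards)
    sd = statistics.pstdev(rewards) or 1.0
    return [(r - mu) / sd for r in rewards]

tests = [(2, 4), (3, 9)]                       # hypothetical squaring task
group = [lambda x: x * x, lambda x: x + x, lambda x: x ** 2, lambda x: 0]
rewards = [outcome_reward(f, tests) for f in group]
print(rewards, grpo_advantages(rewards))
```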

What would settle it

If Think-Anywhere shows no performance gain over standard upfront reasoning on a new code-generation benchmark while still generating code at similar speed, the central performance claim would be falsified.

Figures

Figures reproduced from arXiv: 2603.29957 by Binhua Li, Ge Li, Mengyang Liu, Taozhi Chen, Tianyu Zhang, Wenpin Jiao, Xue Jiang, Yihong Dong, Yongbin Li, Zhenhua Xu, Zhi Jin.

Figure 1
Figure 1: Illustration of THINK-ANYWHERE. Reasoning can be invoked at any token position during code generation; the ellipsis ("...") within <think> or <thinkanywhere> blocks marks thinking content truncated for brevity. Research on coding cognition shows that developers not only think before coding but also pause to think at any point during implementation, which suggests this is the more natural way to reason while coding. view at source ↗
Figure 2
Figure 2: Results of thinking-position analysis. Shown alongside is a variant that removes the initial thinking block and relies solely on THINK-ANYWHERE within the code, plus, to isolate the impact of the mechanism itself, a Padding Thinking inference variant: during THINK-ANYWHERE generation, the content within <thinkanywhere> blocks is replaced with padding tokens before generation continues. view at source ↗
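
The Padding Thinking variant described in this caption is easy to state in code. A minimal sketch, with illustrative tag and pad-token names: block positions and lengths are preserved while content is blanked, so any residual gain can be attributed to position and extra compute rather than to the reasoning text.

```python
# Sketch of the "Padding Thinking" inference variant from Figure 2: keep each
# thinking block's position and length but replace its content with padding.
THINK_OPEN, THINK_CLOSE, PAD = "<thinkanywhere>", "</thinkanywhere>", "<pad>"

def pad_thinking(tokens):
    out, depth = [], 0
    for t in tokens:
        if t == THINK_OPEN:
            depth += 1
            out.append(t)
        elif t == THINK_CLOSE:
            depth = max(0, depth - 1)
            out.append(t)
        else:
            # Inside a block: same length, zero content. If padded runs score
            # as well as real ones, position alone (extra compute) explains
            # the gain; if they score worse, the thinking content matters.
            out.append(PAD if depth > 0 else t)
    return out

seq = ["x", "=", THINK_OPEN, "check", "overflow", THINK_CLOSE, "1"]
print(pad_thinking(seq))
```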
Figure 3
Figure 3: Token cost of different methods. Computational efficiency comparison: the inference efficiency of THINK-ANYWHERE is evaluated by measuring the average number of tokens generated per solution, against two reasoning baselines, GRPO (upfront thinking) and CoT prompting. view at source ↗
Figure 4
Figure 4: Pass@k comparison between GRPO and THINK-ANYWHERE across four benchmarks. Pass@k reflects the upper bound of a model's capability by evaluating whether at least one correct solution exists among k sampled candidates; results for both methods across all benchmarks probe whether THINK-ANYWHERE expands the model's capability boundary. view at source ↗
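
For reference, the standard unbiased pass@k estimator (Chen et al., 2021) that such comparisons typically use; whether the paper uses this exact estimator is an assumption, though it is the field's default. Given n samples per problem with c correct, it computes the probability that at least one of k drawn samples is correct.

```python
# Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k).
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:  # fewer than k incorrect samples: every draw of k hits a correct one
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 samples per problem, 13 of them correct.
print(round(pass_at_k(200, 13, 1), 4))   # 0.065, equals c/n when k = 1
print(round(pass_at_k(200, 13, 10), 4))  # far higher capability upper bound
```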
Original abstract

Recent advances in reasoning Large Language Models (LLMs) have primarily relied on upfront thinking, where reasoning occurs before final answer. However, this approach suffers from critical limitations in code generation, where upfront thinking is often insufficient as problems' full complexity only reveals itself during code implementation. Moreover, it cannot adaptively allocate reasoning effort throughout the code generation process where difficulty varies significantly. In this paper, we propose Think-Anywhere, a novel reasoning mechanism that enables LLMs to invoke thinking on-demand at any token position during code generation. We achieve Think-Anywhere by first teaching LLMs to imitate the reasoning patterns through cold-start training, then leveraging outcome-based RL rewards to drive the model's autonomous exploration of when and where to invoke reasoning. Extensive experiments on four mainstream code generation benchmarks (i.e., LeetCode, LiveCodeBench, HumanEval, and MBPP) show that Think-Anywhere achieves state-of-the-art performance over both existing reasoning methods and recent post-training approaches, while demonstrating consistent generalization across diverse LLMs. Our analysis further reveals that Think-Anywhere enables the model to adaptively invoke reasoning at high-entropy positions, providing enhanced interpretability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 0 minor

Summary. The manuscript proposes Think-Anywhere, a reasoning mechanism that enables LLMs to invoke thinking on-demand at any token position during code generation. This is achieved by first using cold-start imitation training to teach reasoning patterns, followed by outcome-based RL to allow autonomous exploration of reasoning positions. The abstract reports that this approach achieves state-of-the-art performance on LeetCode, LiveCodeBench, HumanEval, and MBPP benchmarks compared to existing reasoning methods and post-training approaches, with consistent generalization across diverse LLMs and adaptive invocation at high-entropy positions.

Significance. Should the reported results be confirmed through detailed experiments, the work would offer a valuable contribution to adaptive reasoning in code generation, addressing the limitations of upfront thinking by dynamically allocating reasoning effort where needed. This could lead to more efficient and effective code generation models with improved interpretability.

major comments (3)
  1. [Abstract] The central SOTA claims lack any supporting numerical results, baseline comparisons, ablation details, or statistical analysis, making it impossible to evaluate whether the performance gains are due to the proposed adaptive mechanism or other factors.
  2. [Abstract] The description of the training process (cold-start imitation followed by outcome-based RL) provides no specifics on implementation details, reward design, or analysis showing that reasoning is invoked at high-entropy positions for meaningful reasons rather than spurious correlations.
  3. [Abstract] Claims of generalization across diverse LLMs are made without identifying the specific models tested or presenting evidence of consistent improvements.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our abstract. We agree that strengthening the abstract with more concrete details will improve clarity and will revise it accordingly while keeping it concise. We address each major comment below.

Point-by-point responses
  1. Referee: [Abstract] The central SOTA claims lack any supporting numerical results, baseline comparisons, ablation details, or statistical analysis, making it impossible to evaluate whether the performance gains are due to the proposed adaptive mechanism or other factors.

    Authors: We will revise the abstract to include key numerical results from our experiments that support the SOTA claims, such as pass@1 scores on LeetCode, LiveCodeBench, HumanEval, and MBPP with direct comparisons to existing reasoning methods and post-training baselines. We will also briefly reference the ablation studies and statistical analysis presented in the main text to substantiate that gains arise from the adaptive mechanism. revision: yes

  2. Referee: [Abstract] The description of the training process (cold-start imitation followed by outcome-based RL) provides no specifics on implementation details, reward design, or analysis showing that reasoning is invoked at high-entropy positions for meaningful reasons rather than spurious correlations.

    Authors: We will expand the abstract to include high-level specifics on the cold-start imitation training and outcome-based RL, including the reward design based on code execution success. The analysis confirming adaptive invocation at high-entropy positions (correlating with problem difficulty rather than spurious factors) is detailed in the experimental section; we will add a concise reference to this in the revised abstract. revision: partial

  3. Referee: [Abstract] Claims of generalization across diverse LLMs are made without identifying the specific models tested or presenting evidence of consistent improvements.

    Authors: We will update the abstract to explicitly name the diverse LLMs tested and note the consistent performance improvements observed across them. Full per-model results and generalization analysis are provided in the experiments section of the manuscript. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical claims on external benchmarks

Full rationale

The paper describes an empirical training procedure (cold-start imitation followed by outcome-based RL) and reports performance on four external code-generation benchmarks. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the available text. All load-bearing claims are evaluated against independent test sets rather than reducing to internal definitions or self-referential fits, satisfying the criteria for a self-contained empirical result.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review reveals no explicit free parameters, mathematical axioms, or newly postulated entities; the approach relies on standard LLM fine-tuning and RL techniques.

pith-pipeline@v0.9.0 · 5502 in / 1099 out tokens · 59927 ms · 2026-05-13T23:11:44.740075+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Evaluating the Formal Reasoning Capabilities of Large Language Models through Chomsky Hierarchy

    cs.CL · 2026-04 · unverdicted · novelty 7.0

    LLMs display clear performance stratification on formal language tasks aligned with Chomsky hierarchy complexity levels, limited by severe efficiency barriers rather than absolute capability.

Reference graph

Works this paper leans on

26 extracted references · 26 canonical work pages · cited by 1 Pith paper · 15 internal anchors

  1. [1]

    OpenCodeReasoning: Advancing Data Distillation for Competitive Coding

    Wasi Uddin Ahmad, Sean Narenthiran, Somshubra Majumdar, Aleksander Ficek, Siddhartha Jain, Jocelyn Huang, Vahid Noroozi, and Boris Ginsburg. OpenCodeReasoning: Advancing data distillation for competitive coding. arXiv preprint arXiv:2504.01943.

  2. [2]

    Program Synthesis with Large Language Models

    Jacob Austin, Augustus Odena, Maxwell I. Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie J. Cai, Michael Terry, Quoc V. Le, and Charles Sutton. Program synthesis with large language models. CoRR, abs/2108.07732.

  3. [3]

    Yifan Bai, Yiping Bao, Guanduo Chen, Jiahao Chen, Ningxin Chen, Ruijue Chen, Yanru Chen, Yuankun Chen, Yutian Chen, Zhuofu Chen, Jialei Cui, Hao Ding, Mengnan Dong, Angang Du, Chenzhuang Du, Dikang Du, Yulun Du, Yu Fan, Yichen Feng, Kelin Fu, Bofei Gao, Hongcheng Gao, Peizhong Gao, Tong Gao, Xinran Gu, Longyu Guan, Haiqing Guo, Jianhang Guo, Hao Hu, Xiaor...

  4. [4]

    URL https://matharena.ai/. Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Pondé de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukas...

  5. [5]

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit S. Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, Luke Marris, Sam Petulla, Colin Gaffney, Asaf Aharoni, Nathan Lintz, Tiago Cardal Pais, Henrik Jacobsson, Idan Szpektor, Nan-Jiang Jiang, Krishna Haridasan, Ahmed Omran, Nikunj Saunshi, Dara Bahri, Gaurav ...

  6. [6]

    Self-Collaboration Code Generation via ChatGPT

    Yihong Dong, Xue Jiang, Zhi Jin, and Ge Li. Self-collaboration code generation via ChatGPT. ACM Trans. Softw. Eng. Methodol., 33(7):189:1–189:38, 2024. Yihong Dong, Xue Jiang, Huanyu Liu, Zhi Jin, Bin Gu, Mengfei Yang, and Ge Li. Generalization or memorization: Data contamination and trustworthy evaluation for large language models...

  7. [7]

    DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence

    Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Y. Wu, Y. K. Li, Fuli Luo, Yingfei Xiong, and Wenfeng Liang. DeepSeek-Coder: When the large language model meets programming -- the rise of code intelligence. arXiv preprint arXiv:2401.14196. URL https://openreview.net/forum?id=ph04CRkPdC.

  8. [8]

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai Dai, Deli Chen...

  9. [9]

    Qwen2.5-Coder Technical Report

    Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Kai Dang, An Yang, Rui Men, Fei Huang, Xingzhang Ren, Xuancheng Ren, Jingren Zhou, and Junyang Lin. Qwen2.5-Coder technical report. arXiv preprint arXiv:2409.12186. URL https://github.com/huggingface/open-r1.

  10. [10]

    OpenAI o1 System Card

    Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, Alex Iftimie, Alex Karpenko, Alex Tachard Passos, Alexander Neitz, Alexander Prokofiev, Alexander Wei, Allison Tam, Ally Bennett, Ananya Kumar, Andre Saraiva, Andrea Vallone, Andrew Dubers...

  11. [11]

    LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

    Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. LiveCodeBench: Holistic and contamination free evaluation of large language models for code. arXiv preprint arXiv:2403.07974.

  12. [12]

    CodeRL+: Improving Code Generation via Reinforcement with Execution Semantics Alignment

    Xue Jiang, Yihong Dong, Mengyang Liu, Hongyi Deng, Tian Wang, Yongding Tao, Rongyu Cao, Binhua Li, Zhi Jin, Wenpin Jiao, Fei Huang, Yongbin Li, and Ge Li. CodeRL+: Improving code generation via reinforcement with execution semantics alignment. CoRR, abs/2510.18471.

  13. [13]

    KOCO-BENCH: Can Large Language Models Leverage Domain Knowledge in Software Development?

    Xue Jiang, Jiaru Qian, Xianjie Shi, Chenjie Li, Hao Zhu, Ziyu Wang, Jielun Zhang, Zheyu Zhao, Kechi Zhang, Jia Li, Wenpin Jiao, Zhi Jin, Ge Li, and Yihong Dong. KOCO-BENCH: Can large language models leverage domain knowledge in software development? CoRR, abs/2601.13240.

  14. [14]

    CodeI/O: Condensing Reasoning Patterns via Code Input-Output Prediction

    Junlong Li, Daya Guo, Dejian Yang, Runxin Xu, Yu Wu, and Junxian He. CodeI/O: Condensing reasoning patterns via code input-output prediction. CoRR, abs/2502.07316, 2025. Qingyao Li, Xinyi Dai, Xiangyang Li, Weinan Zhang, Yasheng Wang, Ruiming Tang, and Yong Yu. CodePRM: Execution feedback-enhanced process reward model for code generation. In Findings of th...

  15. [15]

    URL https://arxiv.org/abs/2512.03176. Anton Lozhkov, Raymond Li, Loubna Ben Allal, Federico Cassano, Joel Lamy-Poirier, Nouamane Tazi, Ao Tang, Dmytro Pykhtar, Jiawei Liu, Yuxiang Wei, Tianyang Liu, Max Tian, Denis Kocetkov, Arthur Zucker, Younes Belkada, Zijian Wang, Qian Liu, Dmitry Abulkhanov, Indraneil Paul, Zhuang Li, Wen-Ding Li, Megan Risdal, Jia ...

  16. [16]

    StarCoder 2 and The Stack v2: The Next Generation

    URL https://arxiv.org/abs/2402.19173. Mathematical Association of America. American Invitational Mathematics Examination (AIME).

  17. [17]

    Let's Think Dot by Dot: Hidden Computation in Transformer Language Models

    Jacob Pfau, William Merrill, and Samuel R. Bowman. Let's think dot by dot: Hidden computation in transformer language models. In First Conference on Language Modeling. Accessed: 2025-10-06.

  18. [18]

    URL https://openreview.net/forum?id=NikbrdtYvG. Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Romain Sauvestre, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Cope...

  19. [19]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. doi: 10.64434/tml.20250929. https://thinkingmachines.ai/blog/lora/.

  20. [20]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. CoRR, abs/2402.03300.

  21. [21]

    HybridFlow: A Flexible and Efficient RLHF Framework

    Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. HybridFlow: A flexible and efficient RLHF framework. arXiv preprint arXiv:2409.19256.

  22. [22]

    CodeReasoner: Enhancing the Code Reasoning Ability with Reinforcement Learning

    Lingxiao Tang, He Ye, Zhongxin Liu, Xiaoxue Ren, and Lingfeng Bao. CodeReasoner: Enhancing the code reasoning ability with reinforcement learning. arXiv preprint arXiv:2507.17548.

  23. [23]

    CodeBoost: Boosting Code LLMs by Squeezing Knowledge from Code Snippets with RL

    Sijie Wang, Quanjiang Guo, Kai Zhao, Yawei Zhang, Xin Li, Xiang Li, Siqi Li, Rui She, Shangshu Yu, and Wee Peng Tay. CodeBoost: Boosting code LLMs by squeezing knowledge from code snippets with RL. arXiv preprint arXiv:2508.05242.

  24. [24]

    LeetCodeDataset: A Temporal Dataset for Robust Evaluation and Efficient Training of Code LLMs

    Yunhui Xia, Wei Shen, Yan Wang, Jason Klein Liu, Huifeng Sun, Siyue Wu, Jian Hu, and Xiaolong Xu. LeetCodeDataset: A temporal dataset for robust evaluation and efficient training of code LLMs. arXiv preprint arXiv:2504.14655, 2025.

  25. [25]

    Interleaved Reasoning for Large Language Models via Reinforcement Learning

    Roy Xie, David Qiu, Deepak Gopinath, Dong Lin, Yanchao Sun, Chong Wang, Saloni Potdar, and Bhuwan Dhingra. Interleaved reasoning for large language models via reinforcement learning. CoRR, abs/2505.19640.

  26. [26]

    For GRPO and CoT, the token cost consists entirely of upfront thinking tokens

    Appendix A, Token Cost Breakdown: Table 6 provides a detailed breakdown of reasoning token costs. For GRPO and CoT, the token cost consists entirely of upfront thinking tokens. For THINK-ANYWHERE, we separately report the upfront thinking length and the <thinkanywhere> block length. The upfront thinking phase of THINK-ANYWHERE is substantially...