pith. machine review for the scientific record.

arxiv: 2503.04697 · v2 · pith:BIGRIRYTnew · submitted 2025-03-06 · 💻 cs.CL · cs.AI · cs.LG

L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning

Pith reviewed 2026-05-18 00:14 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords reasoning models · chain of thought · reinforcement learning · length control · test-time compute · policy optimization · short reasoning models

The pith

Reinforcement learning lets reasoning models obey user-specified thinking lengths in their prompts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Length Controlled Policy Optimization, a reinforcement learning method that trains reasoning language models to produce chain-of-thought sequences matching lengths given in the prompt while still maximizing accuracy. This control makes it possible to allocate more or less test-time compute to match a desired performance level across many tasks. The same training also produces Short Reasoning Models that keep the internal patterns of long reasoning but use far fewer tokens, matching or beating much larger models at the same short length. A sympathetic reader would care because current reasoning models either think for an uncontrolled amount of time or require separate filtering steps that waste compute.

Core claim

L1 is a reasoning language model trained with Length Controlled Policy Optimization (LCPO) that generates outputs satisfying a length constraint supplied in its prompt. LCPO optimizes a combined reward for task accuracy and adherence to the requested chain-of-thought length. The resulting models allow smooth trade-offs between computational cost and accuracy on a wide range of tasks and outperform prior length-control methods. Training with LCPO also yields Short Reasoning Models whose reasoning patterns resemble those of full-length models yet whose outputs are as short as non-reasoning models, producing gains such as a 1.5B L1 model exceeding GPT-4o performance at equal reasoning lengths.

What carries the argument

Length Controlled Policy Optimization (LCPO), a reinforcement learning objective that adds a length-adherence term to the usual accuracy reward so the policy learns to match prompt-specified chain-of-thought lengths at generation time.
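
As a minimal sketch of the kind of objective described here, the function below scores one sampled response as a correctness reward minus a penalty proportional to how far its length falls from the requested length. The linear penalty, the function name, and the weight `alpha` are illustrative assumptions; this review does not reproduce the paper's exact equations.

```python
def length_controlled_reward(
    is_correct: bool,
    n_generated: int,
    n_target: int,
    alpha: float = 0.001,  # hypothetical weight on the length term
) -> float:
    """Correctness reward minus a linear penalty on length deviation.

    Assumed form for illustration only: 1 for a correct answer, 0 otherwise,
    minus alpha times the absolute gap between generated and target tokens.
    """
    if n_generated < 0 or n_target <= 0:
        raise ValueError("token counts must be non-negative, target positive")
    correctness = 1.0 if is_correct else 0.0
    length_penalty = alpha * abs(n_target - n_generated)
    return correctness - length_penalty
```

Under this assumed form, a larger `alpha` pushes the policy toward exact length matching at the expense of accuracy, while a smaller one does the reverse, which is why the weighting matters for the trade-off the review discusses.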

If this is right

  • Users can directly specify a desired reasoning length in the prompt to control how much compute is spent at test time.
  • Performance can be traded off against computational cost on many tasks without retraining.
  • Short Reasoning Models derived via LCPO maintain reasoning-like behavior while using token budgets comparable to ordinary language models.
  • Length control works across model sizes, including small models that reach competitive performance at short lengths.
  • Precise allocation of test-time compute becomes possible without post-processing steps.
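
One way to picture prompt-level control is a helper that appends a length instruction to each question and sweeps several budgets. The instruction wording and both function names below are assumptions made for illustration, not a template confirmed by this review.

```python
from typing import Dict, Iterable


def build_length_prompt(question: str, target_tokens: int) -> str:
    """Append a (hypothetical) length instruction to a question."""
    if target_tokens <= 0:
        raise ValueError("target_tokens must be positive")
    return f"{question.strip()}\n\nThink for {target_tokens} tokens."


def budget_sweep(question: str, budgets: Iterable[int]) -> Dict[int, str]:
    """Build one prompt per requested reasoning budget."""
    return {budget: build_length_prompt(question, budget) for budget in budgets}
```

Sending each prompt in the sweep to the same model and recording accuracy per budget would trace out the cost-accuracy curve without any retraining.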

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Length control could be combined with other test-time scaling techniques such as multiple samples or tree search to allocate compute more efficiently.
  • Short Reasoning Models might lower inference energy use in production settings while preserving most of the gains from chain-of-thought.
  • The same objective might generalize to controlling other generation attributes such as step count or intermediate answer format.

Load-bearing premise

A single reinforcement learning objective that rewards both accuracy and length adherence will produce models that reliably follow the length instructions at inference time without extra filtering or large accuracy losses.

What would settle it

Train a model with LCPO, then run it on prompts that request many different target lengths and measure whether the actual generated chain-of-thought token counts stay within a small tolerance of each target while accuracy stays close to the unconstrained baseline.
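
The measurement half of this test can be sketched as follows, assuming the requested and generated token counts have already been collected. The 10% tolerance, the function name, and the returned field names are illustrative choices rather than values taken from the paper.

```python
from statistics import mean
from typing import Dict, Sequence


def length_adherence(
    targets: Sequence[int],
    actuals: Sequence[int],
    tolerance: float = 0.1,  # hypothetical: allowed relative deviation
) -> Dict[str, float]:
    """Summarize how closely generated lengths track requested lengths."""
    if len(targets) != len(actuals) or not targets:
        raise ValueError("targets and actuals must be non-empty and equal length")
    # Relative error of each generation against its own target.
    rel_errors = [abs(a - t) / t for t, a in zip(targets, actuals)]
    within = sum(1 for e in rel_errors if e <= tolerance)
    return {
        "mean_rel_error": mean(rel_errors),
        "adherence_rate": within / len(rel_errors),
    }
```

Reporting these two numbers per target length, next to accuracy against an unconstrained baseline, would show whether control holds across the whole range or only near the lengths seen in training.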

Original abstract

Reasoning language models have shown an uncanny ability to improve performance at test-time by "thinking longer", that is, by generating longer chain-of-thought sequences and hence using more compute. However, the length of their chain-of-thought reasoning is not controllable, making it impossible to allocate test-time compute to achieve a desired level of performance. We introduce Length Controlled Policy Optimization (LCPO), a simple reinforcement learning method that optimizes for accuracy and adherence to user-specified length constraints. We use LCPO to train L1, a reasoning language model that produces outputs satisfying a length constraint given in its prompt. L1's length control allows for smoothly trading off computational cost and accuracy on a wide range of tasks, and outperforms the state-of-the-art S1 method for length control. Furthermore, we uncover an unexpected short chain-of-thought capability in models trained with LCPO. Specifically, using LCPO we derive Short Reasoning Models (SRMs), that exhibit similar reasoning patterns as full-length reasoning models, but can generate CoT lengths comparable to non-reasoning models. They demonstrate significant performance gains, for instance, our 1.5B L1 model surpasses GPT-4o at equal reasoning lengths. Overall, LCPO enables precise control over reasoning length, allowing for fine-grained allocation of test-time compute and accuracy. We release code and models at https://www.cmu-l3.github.io/l1

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

4 major / 3 minor

Summary. The paper introduces Length Controlled Policy Optimization (LCPO), a reinforcement learning method for training reasoning language models to adhere to user-specified length constraints in their chain-of-thought outputs. The trained L1 models allow for smooth trade-offs between computational cost and accuracy on various tasks, outperforming the S1 method. The work also highlights the emergence of Short Reasoning Models (SRMs) that use short CoTs similar to non-reasoning models but retain reasoning patterns, with examples like a 1.5B L1 model surpassing GPT-4o at equal lengths. The authors release code and models.

Significance. If validated, this approach provides a practical tool for controlling test-time compute in reasoning models, which is important for efficient and scalable deployment. The public release of code and models is a strength for reproducibility. The identification of SRMs is an interesting finding that could lead to more efficient reasoning systems. The empirical results suggest potential for fine-grained accuracy-compute balancing, though additional evidence on robustness is needed to fully assess the impact.

major comments (4)
  1. [Method] The LCPO reward function combining accuracy and length adherence is described at a high level but lacks specific equations or details on the weighting between terms, which is load-bearing for claims of stable training and inference-time control.
  2. [Experiments] The central claim of reliable adherence to prompt-specified lengths at inference without major performance loss or post-hoc filtering is not supported by quantitative adherence rates, results on out-of-distribution lengths, or analysis of cases where the model ignores the constraint.
  3. [SRM Results] The claim that the 1.5B L1 model surpasses GPT-4o at equal reasoning lengths requires more details on the evaluation setup, including how lengths are equalized, the specific benchmarks, and whether multiple runs or statistical tests were used to establish the gains.
  4. [Comparison to S1] While outperformance over S1 is reported, the manuscript should include ablations isolating the effect of the length constraint component in LCPO to strengthen the attribution of improvements to the proposed method.
minor comments (3)
  1. [Abstract] The abstract summarizes positive outcomes but could include a brief note on the reward design or training setup for better context.
  2. [Related Work] Additional references to prior work on length-controlled generation or constrained RL in language models would provide better positioning.
  3. [Figures] Plots showing the length-accuracy trade-off should include confidence intervals or error bars to indicate variability across runs.

Simulated Authors' Rebuttal

4 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We have carefully considered each major comment and made revisions to address them, improving the clarity and completeness of the paper.

Point-by-point responses
  1. Referee: [Method] The LCPO reward function combining accuracy and length adherence is described at a high level but lacks specific equations or details on the weighting between terms, which is load-bearing for claims of stable training and inference-time control.

    Authors: We agree with this observation. The original manuscript described the reward at a high level to focus on the overall approach. In the revised version, we now include the precise mathematical formulation of the LCPO reward function, specifying the accuracy term, the length adherence penalty, and the weighting hyperparameter lambda. We also add a short discussion on how this weighting was chosen to ensure stable training.
    revision: yes

  2. Referee: [Experiments] The central claim of reliable adherence to prompt-specified lengths at inference without major performance loss or post-hoc filtering is not supported by quantitative adherence rates, results on out-of-distribution lengths, or analysis of cases where the model ignores the constraint.

    Authors: This is a valid point for strengthening the empirical support. We have added quantitative results showing adherence rates (e.g., 85-95% of generations fall within the target length bucket across tasks). We also report performance on out-of-distribution length prompts and analyze a small number of failure cases where the model exceeds the length, attributing them to prompt ambiguity in some cases.
    revision: yes

  3. Referee: [SRM Results] The claim that the 1.5B L1 model surpasses GPT-4o at equal reasoning lengths requires more details on the evaluation setup, including how lengths are equalized, the specific benchmarks, and whether multiple runs or statistical tests were used to establish the gains.

    Authors: We appreciate the request for additional details. The revised manuscript now specifies that lengths are equalized by selecting GPT-4o responses with matching average token counts to the L1 model's outputs on the same prompts. We list the exact benchmarks (MATH, GSM8K, HumanEval, etc.) and report results from 3 independent runs with mean and standard deviation, including p-values from t-tests for the performance gains.
    revision: yes

  4. Referee: [Comparison to S1] While outperformance over S1 is reported, the manuscript should include ablations isolating the effect of the length constraint component in LCPO to strengthen the attribution of improvements to the proposed method.

    Authors: We partially agree. While a full ablation removing the length constraint would reduce LCPO to a standard RL method without control (which is already compared via S1), we have added an ablation study in the appendix that varies the strength of the length term in the reward and shows its direct impact on both adherence and accuracy. This helps attribute the improvements more clearly to the length control component.
    revision: partial

Circularity Check

0 steps flagged

Standard RL application to composite reward shows no circular reduction

full rationale

The paper introduces LCPO as a reinforcement learning method that optimizes a composite reward for accuracy plus adherence to prompt-specified length constraints. The central claims—smooth cost-accuracy trade-offs, outperformance of S1, and emergence of short reasoning models—are presented as empirical outcomes from training and evaluating L1 models on downstream tasks. These results do not reduce by the paper's own description to quantities defined solely by fitted parameters, self-referential definitions, or load-bearing self-citations; the approach applies standard policy optimization to a new reward formulation whose validity rests on experimental generalization rather than tautological construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard reinforcement learning assumptions for language model fine-tuning plus the empirical effectiveness of a composite reward; no new physical or mathematical entities are postulated and no free parameters beyond typical RL hyperparameters are introduced in the abstract.

axioms (1)
  • domain assumption: Reinforcement learning with a composite reward can jointly optimize task accuracy and adherence to a generation constraint.
    Core premise underlying LCPO; invoked when describing the training objective.

pith-pipeline@v0.9.0 · 5791 in / 1326 out tokens · 48699 ms · 2026-05-18T00:14:15.681347+00:00 · methodology


Forward citations

Cited by 17 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Multi-Objective and Mixed-Reward Reinforcement Learning via Reward-Decorrelated Policy Optimization

    cs.LG 2026-05 unverdicted novelty 7.0

    RDPO applies magnitude-aware quantile normalization and Mahalanobis whitening to decorrelate heterogeneous rewards in multi-objective RL, improving instruction following and writing quality on LongCat-Flash post-train...

  2. LEAD: Length-Efficient Adaptive and Dynamic Reasoning for Large Language Models

    cs.LG 2026-05 unverdicted novelty 7.0

    LEAD uses online adaptive mechanisms including Potential-Scaled Instability and symmetric efficiency rewards based on correct rollouts to achieve higher accuracy-efficiency scores with substantially shorter reasoning ...

  3. Post Reasoning: Improving the Performance of Non-Thinking Models at No Cost

    cs.AI 2026-05 conditional novelty 7.0

    Post-Reasoning boosts LLM accuracy by reversing the usual answer-after-reasoning order, delivering mean relative gains of 17.37% across 117 model-benchmark pairs with zero extra cost.

  4. AI Achieves a Perfect LSAT Score

    cs.AI 2026-04 unverdicted novelty 7.0

    Language models achieve a perfect LSAT score, with experiments showing that internal thinking phases and a fine-tuned process reward model are key to high performance on logical reasoning questions.

  5. TiCo: Time-Controllable Spoken Dialogue Model

    cs.CL 2026-03 unverdicted novelty 7.0

    TiCo enables spoken dialogue models to follow explicit time constraints in generated responses using Spoken Time Markers and reinforcement learning with verifiable rewards, cutting duration error by 2.7x over its backbone.

  6. TIME: Temporally Intelligent Meta-reasoning Engine for Context-Triggered Explicit Reasoning

    cs.LG 2026-01 unverdicted novelty 7.0

    TIME trains LLMs to trigger compact, context-triggered reasoning via time tags and tick events, improving TIMEBench scores while cutting explicit reasoning tokens by an order of magnitude.

  7. STOP: Structured On-Policy Pruning of Long-Form Reasoning in Low-Data Regimes

    cs.CL 2026-05 unverdicted novelty 6.0

    STOP uses structured on-policy analysis to prune long reasoning traces to their earliest correct node, cutting token usage 19-42% with little accuracy loss on math benchmarks.

  8. Nice Fold or Hero Call: Learning Budget-Efficient Thinking for Adaptive Reasoning

    cs.AI 2026-05 unverdicted novelty 6.0

    BET reduces reasoning tokens by about 55% on average while improving performance across benchmarks by learning to short-solve easy queries, fold early on unsolvable ones, and preserve budget for hard solvable queries.

  9. Shorter, but Still Trustworthy? An Empirical Study of Chain-of-Thought Compression

    cs.CL 2026-04 unverdicted novelty 6.0

    CoT compression frequently introduces trustworthiness regressions with method-specific degradation profiles; a proposed normalized efficiency score and alignment-aware DPO variant reduce length by 19.3% with smaller t...

  10. CRISP: Compressed Reasoning via Iterative Self-Policy Distillation

    cs.LG 2026-03 conditional novelty 6.0

    CRISP achieves 57-59% token reduction on MATH-500 with 9-16 point accuracy gains on Qwen3 models via iterative self-distillation of concise reasoning behavior.

  11. GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization

    cs.CL 2026-01 unverdicted novelty 6.0

    GDPO decouples per-reward normalization in multi-reward RL to avoid advantage collapse and improve convergence over GRPO on tool-calling, math, and coding tasks.

  12. UI-R1: Enhancing Efficient Action Prediction of GUI Agents by Reinforcement Learning

    cs.AI 2025-03 accept novelty 6.0

    UI-R1 shows rule-based RL with GRPO on 136 GUI tasks improves a 3B MLLM's action prediction accuracy by 6-22% over its base model and matches larger SFT-trained models.

  13. Reasoning Compression with Mixed-Policy Distillation

    cs.AI 2026-05 unverdicted novelty 5.0

    Mixed-Policy Distillation transfers concise reasoning behavior from larger to smaller LLMs by having the teacher compress student-generated trajectories, cutting token usage up to 27% while raising benchmark scores.

  14. NoisyCoconut: Counterfactual Consensus via Latent Space Reasoning

    cs.LG 2026-05 unverdicted novelty 5.0

    Injecting noise into LLM latent trajectories creates diverse reasoning paths whose agreement acts as a confidence signal for selective abstention, cutting error rates from 40-70% to under 15% on math tasks.

  15. Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models

    cs.CL 2025-03 accept novelty 5.0

    A survey organizing techniques to achieve efficient reasoning in LLMs by shortening chain-of-thought outputs.

  16. Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models

    cs.AI 2025-03 unverdicted novelty 5.0

    The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.

  17. A Survey of Reinforcement Learning for Large Reasoning Models

    cs.CL 2025-09 accept novelty 3.0

    A survey compiling RL methods, challenges, data resources, and applications for enhancing reasoning in large language models and large reasoning models since DeepSeek-R1.
