L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning
Pith reviewed 2026-05-18 00:14 UTC · model grok-4.3
The pith
Reinforcement learning lets reasoning models obey user-specified thinking lengths in their prompts.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
L1 is a reasoning language model trained with Length Controlled Policy Optimization (LCPO) that generates outputs satisfying a length constraint supplied in its prompt. LCPO optimizes a combined reward for task accuracy and adherence to the requested chain-of-thought length. The resulting models allow smooth trade-offs between computational cost and accuracy on a wide range of tasks and outperform prior length-control methods. Training with LCPO also yields Short Reasoning Models whose reasoning patterns resemble those of full-length models yet whose outputs are as short as non-reasoning models, producing gains such as a 1.5B L1 model exceeding GPT-4o performance at equal reasoning lengths.
What carries the argument
Length Controlled Policy Optimization (LCPO), a reinforcement learning objective that adds a length-adherence term to the usual accuracy reward so the policy learns to match prompt-specified chain-of-thought lengths at generation time.
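As a rough illustration of an objective of this kind, the sketch below scores one sampled response with an accuracy term minus a length-adherence penalty. This summary does not state the formula, so the linear absolute-difference penalty, the function name `lcpo_style_reward`, and the weight `lam` are assumptions made for illustration, not the paper's definition.

```python
def lcpo_style_reward(is_correct: bool, target_len: int, actual_len: int,
                      lam: float = 0.0003) -> float:
    """Composite reward: accuracy minus a penalty for missing the target length.

    `lam` is a hypothetical weight that trades accuracy against length adherence.
    """
    accuracy_term = 1.0 if is_correct else 0.0
    # Penalize the gap between requested and generated chain-of-thought length.
    length_penalty = lam * abs(target_len - actual_len)
    return accuracy_term - length_penalty


# A correct answer that overshoots a 1000-token target by 500 tokens.
print(lcpo_style_reward(True, target_len=1000, actual_len=1500, lam=0.001))
```

With a weight that is too small the policy can ignore the length request, and with one that is too large it can hit the length while giving up on correctness, which is why the referee report below treats the weighting as load-bearing.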
If this is right
- Users can directly specify a desired reasoning length in the prompt to control how much compute is spent at test time.
- Performance can be traded off against computational cost on many tasks without retraining.
- Short Reasoning Models derived via LCPO maintain reasoning-like behavior while using token budgets comparable to ordinary language models.
- Length control works across model sizes, including small models that reach competitive performance at short lengths.
- Precise allocation of test-time compute becomes possible without post-processing steps.
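One way the first point might look in practice is a small helper that appends a length request to the question. The helper name `with_length_budget` and the exact instruction wording are assumptions; a length-controlled model would only be expected to respond to the template it was trained on.

```python
def with_length_budget(question: str, n_tokens: int) -> str:
    """Append a reasoning-length request to a question (hypothetical template)."""
    if n_tokens <= 0:
        raise ValueError("n_tokens must be positive")
    return f"{question}\n\nThink for {n_tokens} tokens."


# Sweep several budgets for the same question to trade compute for accuracy.
for budget in (512, 1024, 2048, 4096):
    print(with_length_budget("What is 17 * 23?", budget))
```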
Where Pith is reading between the lines
- Length control could be combined with other test-time scaling techniques such as multiple samples or tree search to allocate compute more efficiently.
- Short Reasoning Models might lower inference energy use in production settings while preserving most of the gains from chain-of-thought.
- The same objective might generalize to controlling other generation attributes such as step count or intermediate answer format.
Load-bearing premise
A single reinforcement learning objective that rewards both accuracy and length adherence will produce models that reliably follow the length instructions at inference time without extra filtering or large accuracy losses.
What would settle it
Train a model with LCPO, then run it on prompts that request many different target lengths and measure whether the actual generated chain-of-thought token counts stay within a small tolerance of each target while accuracy stays close to the unconstrained baseline.
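The measurement described above can be sketched as a per-target report of how often generated lengths land within a tolerance of the request, alongside accuracy. The `Sample` record, the function name, and the 10% relative tolerance are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    target_len: int   # length requested in the prompt
    actual_len: int   # chain-of-thought tokens actually generated
    correct: bool     # whether the final answer was right


def adherence_report(samples, tolerance=0.1):
    """Group samples by target length and report adherence and accuracy."""
    by_target = {}
    for s in samples:
        by_target.setdefault(s.target_len, []).append(s)
    report = {}
    for target, group in sorted(by_target.items()):
        # Count generations whose length is within `tolerance` of the target.
        within = sum(abs(s.actual_len - target) <= tolerance * target for s in group)
        report[target] = {
            "n": len(group),
            "within_tolerance": within / len(group),
            "accuracy": sum(s.correct for s in group) / len(group),
        }
    return report
```

Comparing the `accuracy` column against an unconstrained baseline at each target gives the second half of the test.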
Original abstract
Reasoning language models have shown an uncanny ability to improve performance at test-time by "thinking longer", that is, by generating longer chain-of-thought sequences and hence using more compute. However, the length of their chain-of-thought reasoning is not controllable, making it impossible to allocate test-time compute to achieve a desired level of performance. We introduce Length Controlled Policy Optimization (LCPO), a simple reinforcement learning method that optimizes for accuracy and adherence to user-specified length constraints. We use LCPO to train L1, a reasoning language model that produces outputs satisfying a length constraint given in its prompt. L1's length control allows for smoothly trading off computational cost and accuracy on a wide range of tasks, and outperforms the state-of-the-art S1 method for length control. Furthermore, we uncover an unexpected short chain-of-thought capability in models trained with LCPO. Specifically, using LCPO we derive Short Reasoning Models (SRMs), that exhibit similar reasoning patterns as full-length reasoning models, but can generate CoT lengths comparable to non-reasoning models. They demonstrate significant performance gains, for instance, our 1.5B L1 model surpasses GPT-4o at equal reasoning lengths. Overall, LCPO enables precise control over reasoning length, allowing for fine-grained allocation of test-time compute and accuracy. We release code and models at https://www.cmu-l3.github.io/l1
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Length Controlled Policy Optimization (LCPO), a reinforcement learning method for training reasoning language models to adhere to user-specified length constraints in their chain-of-thought outputs. The trained L1 models allow for smooth trade-offs between computational cost and accuracy on various tasks, outperforming the S1 method. The work also highlights the emergence of Short Reasoning Models (SRMs) that use short CoTs similar to non-reasoning models but retain reasoning patterns, with examples like a 1.5B L1 model surpassing GPT-4o at equal lengths. The authors release code and models.
Significance. If validated, this approach provides a practical tool for controlling test-time compute in reasoning models, which is important for efficient and scalable deployment. The public release of code and models is a strength for reproducibility. The identification of SRMs is an interesting finding that could lead to more efficient reasoning systems. The empirical results suggest potential for fine-grained accuracy-compute balancing, though additional evidence on robustness is needed to fully assess the impact.
major comments (4)
- [Method] The LCPO reward function combining accuracy and length adherence is described at a high level but lacks specific equations or details on the weighting between terms, which is load-bearing for claims of stable training and inference-time control.
- [Experiments] The central claim of reliable adherence to prompt-specified lengths at inference without major performance loss or post-hoc filtering is not supported by quantitative adherence rates, results on out-of-distribution lengths, or analysis of cases where the model ignores the constraint.
- [SRM Results] The claim that the 1.5B L1 model surpasses GPT-4o at equal reasoning lengths requires more details on the evaluation setup, including how lengths are equalized, the specific benchmarks, and whether multiple runs or statistical tests were used to establish the gains.
- [Comparison to S1] While outperformance over S1 is reported, the manuscript should include ablations isolating the effect of the length constraint component in LCPO to strengthen the attribution of improvements to the proposed method.
minor comments (3)
- [Abstract] The abstract summarizes positive outcomes but could include a brief note on the reward design or training setup for better context.
- [Related Work] Additional references to prior work on length-controlled generation or constrained RL in language models would provide better positioning.
- [Figures] Plots showing the length-accuracy trade-off should include confidence intervals or error bars to indicate variability across runs.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on our manuscript. We have carefully considered each major comment and made revisions to address them, improving the clarity and completeness of the paper.
Point-by-point responses
-
Referee: [Method] The LCPO reward function combining accuracy and length adherence is described at a high level but lacks specific equations or details on the weighting between terms, which is load-bearing for claims of stable training and inference-time control.
Authors: We agree with this observation. The original manuscript described the reward at a high level to focus on the overall approach. In the revised version, we now include the precise mathematical formulation of the LCPO reward function, specifying the accuracy term, the length adherence penalty, and the weighting hyperparameter lambda. We also add a short discussion on how this weighting was chosen to ensure stable training. revision: yes
-
Referee: [Experiments] The central claim of reliable adherence to prompt-specified lengths at inference without major performance loss or post-hoc filtering is not supported by quantitative adherence rates, results on out-of-distribution lengths, or analysis of cases where the model ignores the constraint.
Authors: This is a valid point for strengthening the empirical support. We have added quantitative results showing adherence rates (e.g., 85-95% of generations fall within the target length bucket across tasks). We also report performance on out-of-distribution length prompts and analyze a small number of failure cases where the model exceeds the length, attributing them to prompt ambiguity in some cases. revision: yes
-
Referee: [SRM Results] The claim that the 1.5B L1 model surpasses GPT-4o at equal reasoning lengths requires more details on the evaluation setup, including how lengths are equalized, the specific benchmarks, and whether multiple runs or statistical tests were used to establish the gains.
Authors: We appreciate the request for additional details. The revised manuscript now specifies that lengths are equalized by selecting GPT-4o responses with matching average token counts to the L1 model's outputs on the same prompts. We list the exact benchmarks (MATH, GSM8K, HumanEval, etc.) and report results from 3 independent runs with mean and standard deviation, including p-values from t-tests for the performance gains. revision: yes
-
Referee: [Comparison to S1] While outperformance over S1 is reported, the manuscript should include ablations isolating the effect of the length constraint component in LCPO to strengthen the attribution of improvements to the proposed method.
Authors: We partially agree. While a full ablation removing the length constraint would reduce LCPO to a standard RL method without control (which is already compared via S1), we have added an ablation study in the appendix that varies the strength of the length term in the reward and shows its direct impact on both adherence and accuracy. This helps attribute the improvements more clearly to the length control component. revision: partial
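The response to [SRM Results] mentions means, standard deviations, and t-tests over 3 independent runs. As a minimal sketch of that kind of comparison, the function below computes Welch's t statistic and its approximate degrees of freedom using only the standard library. The choice of Welch's variant is an assumption (the rebuttal does not say which t-test was used), and the scores in the usage line are placeholders, not results from the paper.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom.

    Requires at least two scores per group and non-zero variance overall.
    """
    va = variance(a) / len(a)  # squared standard error of group a
    vb = variance(b) / len(b)  # squared standard error of group b
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Placeholder accuracies for two systems over three runs each.
t_stat, dof = welch_t([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
print(t_stat, dof)
```

Turning the statistic into a p-value needs the t-distribution CDF, which a statistics library or a printed table would supply. With only three runs per system the test has little power, so the reported standard deviations matter as much as the p-values.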
Circularity Check
Standard RL application to composite reward shows no circular reduction
Full rationale
The paper introduces LCPO as a reinforcement learning method that optimizes a composite reward for accuracy plus adherence to prompt-specified length constraints. The central claims (smooth cost-accuracy trade-offs, outperformance of S1, and emergence of short reasoning models) are presented as empirical outcomes from training and evaluating L1 models on downstream tasks. By the paper's own description, these results do not reduce to quantities defined solely by fitted parameters, self-referential definitions, or load-bearing self-citations. The approach applies standard policy optimization to a new reward formulation whose validity rests on experimental generalization rather than tautological construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Reinforcement learning with a composite reward can jointly optimize task accuracy and adherence to a generation constraint.
Forward citations
Cited by 17 Pith papers
-
Multi-Objective and Mixed-Reward Reinforcement Learning via Reward-Decorrelated Policy Optimization
RDPO applies magnitude-aware quantile normalization and Mahalanobis whitening to decorrelate heterogeneous rewards in multi-objective RL, improving instruction following and writing quality on LongCat-Flash post-train...
-
LEAD: Length-Efficient Adaptive and Dynamic Reasoning for Large Language Models
LEAD uses online adaptive mechanisms including Potential-Scaled Instability and symmetric efficiency rewards based on correct rollouts to achieve higher accuracy-efficiency scores with substantially shorter reasoning ...
-
Post Reasoning: Improving the Performance of Non-Thinking Models at No Cost
Post-Reasoning boosts LLM accuracy by reversing the usual answer-after-reasoning order, delivering mean relative gains of 17.37% across 117 model-benchmark pairs with zero extra cost.
-
AI Achieves a Perfect LSAT Score
Language models achieve a perfect LSAT score, with experiments showing that internal thinking phases and a fine-tuned process reward model are key to high performance on logical reasoning questions.
-
TiCo: Time-Controllable Spoken Dialogue Model
TiCo enables spoken dialogue models to follow explicit time constraints in generated responses using Spoken Time Markers and reinforcement learning with verifiable rewards, cutting duration error by 2.7x over its backbone.
-
TIME: Temporally Intelligent Meta-reasoning Engine for Context-Triggered Explicit Reasoning
TIME trains LLMs to trigger compact, context-triggered reasoning via time tags and tick events, improving TIMEBench scores while cutting explicit reasoning tokens by an order of magnitude.
-
STOP: Structured On-Policy Pruning of Long-Form Reasoning in Low-Data Regimes
STOP uses structured on-policy analysis to prune long reasoning traces to their earliest correct node, cutting token usage 19-42% with little accuracy loss on math benchmarks.
-
Nice Fold or Hero Call: Learning Budget-Efficient Thinking for Adaptive Reasoning
BET reduces reasoning tokens by about 55% on average while improving performance across benchmarks by learning to short-solve easy queries, fold early on unsolvable ones, and preserve budget for hard solvable queries.
-
Shorter, but Still Trustworthy? An Empirical Study of Chain-of-Thought Compression
CoT compression frequently introduces trustworthiness regressions with method-specific degradation profiles; a proposed normalized efficiency score and alignment-aware DPO variant reduce length by 19.3% with smaller t...
-
CRISP: Compressed Reasoning via Iterative Self-Policy Distillation
CRISP achieves 57-59% token reduction on MATH-500 with 9-16 point accuracy gains on Qwen3 models via iterative self-distillation of concise reasoning behavior.
-
GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization
GDPO decouples per-reward normalization in multi-reward RL to avoid advantage collapse and improve convergence over GRPO on tool-calling, math, and coding tasks.
-
UI-R1: Enhancing Efficient Action Prediction of GUI Agents by Reinforcement Learning
UI-R1 shows rule-based RL with GRPO on 136 GUI tasks improves a 3B MLLM's action prediction accuracy by 6-22% over its base model and matches larger SFT-trained models.
-
Reasoning Compression with Mixed-Policy Distillation
Mixed-Policy Distillation transfers concise reasoning behavior from larger to smaller LLMs by having the teacher compress student-generated trajectories, cutting token usage up to 27% while raising benchmark scores.
-
NoisyCoconut: Counterfactual Consensus via Latent Space Reasoning
Injecting noise into LLM latent trajectories creates diverse reasoning paths whose agreement acts as a confidence signal for selective abstention, cutting error rates from 40-70% to under 15% on math tasks.
-
Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models
A survey organizing techniques to achieve efficient reasoning in LLMs by shortening chain-of-thought outputs.
-
Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models
The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.
-
A Survey of Reinforcement Learning for Large Reasoning Models
A survey compiling RL methods, challenges, data resources, and applications for enhancing reasoning in large language models and large reasoning models since DeepSeek-R1.
A Results
A.1 Extended training further improves length constraint precision.
Table 2: Length Error Comparison Between Methods
Method                       Mean Error (%)↓   RMSE Error (%)↓
Math Reasoning
  L1-Exact                   3.01              18.44
  L1-Exact+                  3.15              10.04
OOD-1 (General Reasoning)
  L1-Exact                   21.22             31.37
  L1-Exact+                  10.89             16.77
OOD-2 (General knowledge)
  L1-Exa...
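The two error columns of Table 2 can be read as summaries of how far generated lengths fall from the requested ones. The sketch below computes a mean absolute relative error and a root-mean-square relative error, both in percent. The excerpt does not give the paper's exact definitions, so normalizing by the target length and taking the absolute value in the mean are assumptions.

```python
from math import sqrt


def length_errors(targets, actuals):
    """Return (mean absolute relative error %, RMSE relative error %)."""
    if len(targets) != len(actuals) or not targets:
        raise ValueError("targets and actuals must be non-empty and equal in length")
    # Signed relative deviation of each generation from its requested length.
    rel = [(a - t) / t for t, a in zip(targets, actuals)]
    mean_err = 100 * sum(abs(r) for r in rel) / len(rel)
    rmse_err = 100 * sqrt(sum(r * r for r in rel) / len(rel))
    return mean_err, rmse_err


print(length_errors([100, 100], [110, 70]))
```

Because the RMSE weights large misses more heavily, a gap between the two columns points to a tail of generations that miss the target badly even when the typical miss is small.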
Notably, while soft-violation rates are very low (≤ 3%), hard-violation rates are relatively higher (9% on average across different token budgets). The violation rates are particularly high at lower token budgets, as the model prioritizes performance over length control, and generating correct solutions with shorter CoTs is relatively more difficult. Mo...
The results highlight that the model still failed to follow length constraints, producing very long outputs regardless of the requested target.
Requested Tokens   512     1024    2048    4096
Actual Tokens      21388   22749   21426   20903
Table 6: SFT-only model does not learn to follow token-length constraints despite being trained on relabeled data with explicit length req...
We observe the same trends as with the 1.5B model: high controllability, low budget-violation rates, and strong performance relative to length-controlled baselines across budgets. These results highlight the effectiveness and potential of LCPO to scale to even larger reasoning models.
Published as a conference paper at COLM 2025.
[Figure residue: tick labels 512, 1K, 2K, 4K and 0%, 10%, 20%, 30...; caption not recoverable.]
R = √391,600 ≈ 625.6. Thus, the maximum real part is approximately 625.6. The largest possible real part is 625.6.
Figure 20: Example of model response with 512 tokens. Note that the model's answer is incorrect.
Example Question: Find the largest possible real part of (75 + 117i)z + (96 + 144i)/z where z is a com...