pith. machine review for the scientific record.

arxiv: 2605.01111 · v1 · submitted 2026-05-01 · 💻 cs.LG · cs.AI · cs.CL

Recognition: unknown

When Less is Enough: Efficient Inference via Collaborative Reasoning

Yilei Chen, Sharut Gupta, Yannis Paschalidis, Ayush Sekhari, Aldo Pacchiano

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 19:23 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL
keywords collaborative inference · efficient reasoning · dual-model systems · length-penalized training · token efficiency · reasoning benchmarks · large language models
0 comments

The pith

DUET shows that a large model and a lightweight model can collaborate on reasoning tasks at up to 60% lower inference cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces DUET, a collaborative inference framework that splits a reasoning task into two stages so a capable large model generates a reasoning signal while a lightweight model produces the final answer. A length-penalized joint training objective pushes the large model to send only the minimal information needed, keeping overall task performance intact. This matters because running full end-to-end inference on large models for every query drives up costs, and the method demonstrates a practical way to reduce those costs on hard benchmarks such as AIME and GPQA.

Core claim

DUET decomposes inference into two stages where the capable model produces a reasoning signal and the lightweight model interprets the signal to generate the final answer. A length-penalized joint training objective encourages the capable model to transmit only the information sufficient for the lightweight model to solve the task. The result is strong reasoning performance with substantially lower inference cost than end-to-end inference using a large model alone, saving up to 60% of the large model's output tokens on challenging reasoning benchmarks including AIME and GPQA.
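The split the core claim describes can be sketched as a pair of calls. Everything here is a hypothetical stand-in: the paper's actual models, prompt format, and `[signal]` delimiter are not specified on this page, so toy callables carry the control flow.

```python
def duet_infer(question, capable_model, lightweight_model):
    # Stage 1: the capable model emits a concise reasoning signal. This is
    # the only call whose output length the training objective penalizes,
    # which is where the claimed token savings come from.
    signal = capable_model(question)
    # Stage 2: the lightweight model answers conditioned on (input + signal).
    return lightweight_model(question + "\n[signal] " + signal)

# Toy stand-ins so the flow is runnable; real usage would wrap API or
# local-model calls behind the same two callables.
capable = lambda q: "factor the quadratic"
light = lambda prompt: "x = 2 or x = 3" if "[signal]" in prompt else "unsure"

print(duet_infer("Solve x^2 - 5x + 6 = 0", capable, light))  # -> x = 2 or x = 3
```

The point of the shape is that the expensive model's cost now scales with the signal's length, not with a full chain-of-thought trace.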

What carries the argument

length-penalized joint training objective within the DUET two-stage collaborative inference framework
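The page never states the loss itself; it only mentions a length-penalty coefficient λ, a normalization factor B, and a marginal-utility (gain-over-baseline) term in the figure captions and appendix excerpts. The following is therefore only a plausible sketch of how such an objective could combine them, with the functional form and default values assumed.

```python
def capable_model_reward(score_with_signal, baseline_score,
                         signal_tokens, lam=0.5, B=1000):
    # Marginal utility: how much the signal improves the lightweight model
    # over its signal-free baseline answer (per the Figure 3 description).
    marginal_utility = score_with_signal - baseline_score
    # Length penalty: lam scales the cost and B normalizes token counts;
    # both names follow the page's ablation excerpts, but this exact
    # combination is an assumption, not the paper's equation.
    length_penalty = lam * (signal_tokens / B)
    return marginal_utility - length_penalty

# Under this sketch, a longer signal must buy enough extra accuracy
# to pay for its tokens:
short = capable_model_reward(0.80, 0.55, signal_tokens=200)   # positive reward
long_ = capable_model_reward(0.85, 0.55, signal_tokens=1200)  # negative reward
```

This makes concrete why the objective is load-bearing: once the penalty outweighs the marginal gain, the capable model is pushed toward shorter signals even at some accuracy risk.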

If this is right

  • Inference cost for reasoning tasks decreases because the capable model only needs to output a concise signal rather than a full end-to-end solution.
  • Lightweight models can reach high performance on complex tasks when given an optimized signal from a larger model.
  • The two-stage split preserves accuracy on math and science reasoning benchmarks such as AIME and GPQA.
  • Joint training aligns the models so the transmitted signal matches what the lightweight model can use effectively.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar two-stage decompositions could be tested on sequential or multi-step tasks beyond single-answer reasoning.
  • The approach might allow smaller models to handle final output stages in other efficiency-focused pipelines where only part of the work requires high capability.
  • Deployment scenarios with limited compute budgets could benefit if the lightweight model runs on cheaper hardware while the capable model is invoked selectively.

Load-bearing premise

The length-penalized joint training will produce a reasoning signal that is always sufficient for the lightweight model to recover full task performance without new failure modes.

What would settle it

Running DUET on a new out-of-distribution reasoning benchmark and finding that accuracy falls below the large model alone despite the training procedure.

Figures

Figures reproduced from arXiv: 2605.01111 by Aldo Pacchiano, Ayush Sekhari, Sharut Gupta, Yannis Paschalidis, Yilei Chen.

Figure 1
Figure 1. High-level overview of the DUET framework. Standard CoT relies on a large model to generate long reasoning traces, substantially increasing inference cost. DUET instead has the capable model produce a high-level reasoning signal that a lightweight model uses to generate the final answer, reducing generation length (and cost) without sacrificing accuracy. view at source ↗
Figure 2
Figure 2. The number of output tokens generated by … view at source ↗
Figure 3
Figure 3. For each input, the capable model generates a reasoning signal and the lightweight model produces an answer conditioned on (input + signal). In parallel, the lightweight model also produces a baseline answer from the input alone. The baseline score provides a reference point, and the gain achieved by adding the reasoning signal is used to update both models. The capable model is trained with an additional … view at source ↗
Figure 4
Figure 4. We compare DUET with the original large model and naive truncation on MATH500, AMC23, AIME2024, and GPQA Diamond benchmarks. The x-axis shows the total number of output tokens generated by the large model, and the y-axis shows task performance. Dashed vertical and horizontal lines indicate the output tokens and performance of the original large model, respectively. Each DUET point corresponds to a checkpoi… view at source ↗
Figure 5
Figure 5. The performance evolution from DUET training. Each curve corresponds to a different training dataset, on which the large model exhibits a different level of accuracy. Lighter-colored curves indicate training datasets for which the large model has higher original accuracy. The y-axis reports the IPT metric averaged across all benchmarks except MATH500. view at source ↗
Figure 7
Figure 7. The ablation results of marginal utility and length penalty schedule. –MU removes marginal utility; –LPS removes the length-penalty schedule and sets a constant length-penalty coefficient of 0.5. Dashed blue vertical lines indicate the original large-model performance. The performance is evaluated on the AMC23 benchmark. view at source ↗
Figure 8
Figure 8. Training dataset performance of DUET versus the fixed-m variant during training. The small model is Qwen2.5-0.5B-Instruct and models are trained on the DeepScaleR dataset. The full DUET configuration achieves the best overall results. When both marginal utility and the length penalty schedule are removed, training becomes unstable and often collapses, causing the large model to converge to producing empty … view at source ↗
Figure 9
Figure 9. The ablation results of marginal utility and length penalty schedule. –MU removes marginal utility; –LPS removes the length-penalty schedule. Dashed blue vertical lines indicate the original large-model performance. The models are trained on the MATH-LightEval dataset for 150 steps. The hyperparameter B is the length normalization factor which controls the strength of the length penalty. We ev… view at source ↗
Figure 10
Figure 10. The ablation results of marginal utility and length penalty schedule. –MU removes marginal utility; –LPS removes the length-penalty schedule. Dashed blue vertical lines indicate the original large-model performance. The models are trained on the DeepScaleR dataset for 150 steps. view at source ↗
Figure 11
Figure 11. Comparison of training with and without λ clipping. Models are trained on the DeepScaleR dataset and evaluated on the MATH500 benchmark. view at source ↗
read the original abstract

In this work, we introduce DUET (Dual-model Efficient Two-stage inference), a collaborative inference framework in which a capable model and a lightweight model work together to solve a task. Relying on a single large model to perform end-to-end reasoning and prediction often incurs substantial inference cost. In contrast, DUET decomposes inference into two stages: the capable model produces a reasoning signal, and the lightweight model interprets this signal to generate the final answer, allowing reasoning-intensive computation to be handled by the capable model while non-reasoning-intensive components are delegated to the lightweight model without sacrificing task performance. To achieve this objective, we propose a length-penalized joint training objective that encourages the capable model to transmit only the information that is sufficient for the lightweight model to solve the task. As a result, DUET maintains strong reasoning performance with substantially lower inference cost than end-to-end inference using a large model alone, saving up to 60% of the large model's output tokens on challenging reasoning benchmarks, including AIME and GPQA.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces DUET, a collaborative two-stage inference framework in which a large capable model generates a concise reasoning signal and a lightweight model produces the final answer from that signal. A length-penalized joint training objective is used to encourage the capable model to transmit only task-sufficient information, with the central empirical claim being that this yields up to 60% reduction in the large model's output tokens on reasoning benchmarks such as AIME and GPQA while preserving end-to-end task performance.

Significance. If the performance claims are substantiated with proper controls, the work offers a practical and lightweight method for reducing inference cost on reasoning tasks by decomposing computation across heterogeneous models. The length-penalized training objective is a straightforward mechanism that could be adopted more broadly for efficient LLM deployment.

major comments (2)
  1. [§4] §4 (Experimental Evaluation): The headline claim of maintained accuracy with 60% token savings requires explicit reporting of baselines (including the exact end-to-end large-model configuration), statistical significance tests, error bars across multiple runs, and an ablation of the length-penalty coefficient. Without these, the central performance claim cannot be fully verified and the sufficiency of the learned signal remains unproven.
  2. [§3] §3 (Training Objective): The length-penalized joint objective is load-bearing for the sufficiency argument, yet the manuscript provides no information-theoretic bound, exhaustive error analysis on out-of-distribution cases, or demonstration that the lightweight model recovers every necessary inference step. This leaves open the possibility that the reported savings come with unmeasured accuracy degradation on edge cases.
minor comments (2)
  1. [§2] The notation for the two-stage decomposition and the exact form of the length penalty should be clarified with a single consolidated equation early in the method section to improve readability.
  2. [Figure 1] Figure 1 (framework diagram) would benefit from explicit annotation of the token counts saved at inference time to directly illustrate the efficiency gain.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for their constructive comments. We address each major comment below and indicate the revisions made to the manuscript.

read point-by-point responses
  1. Referee: [§4] §4 (Experimental Evaluation): The headline claim of maintained accuracy with 60% token savings requires explicit reporting of baselines (including the exact end-to-end large-model configuration), statistical significance tests, error bars across multiple runs, and an ablation of the length-penalty coefficient. Without these, the central performance claim cannot be fully verified and the sufficiency of the learned signal remains unproven.

    Authors: We agree that these controls are necessary to substantiate the claims. In the revised manuscript we now report the exact end-to-end large-model baseline configuration (including model size, decoding parameters, and prompt format), include error bars as standard deviation over five independent runs with different random seeds, add paired t-test p-values confirming no statistically significant accuracy difference, and provide a full ablation of the length-penalty coefficient λ in new Section 4.3 and Appendix B. These additions directly address the verification concern. revision: yes

  2. Referee: [§3] §3 (Training Objective): The length-penalized joint objective is load-bearing for the sufficiency argument, yet the manuscript provides no information-theoretic bound, exhaustive error analysis on out-of-distribution cases, or demonstration that the lightweight model recovers every necessary inference step. This leaves open the possibility that the reported savings come with unmeasured accuracy degradation on edge cases.

    Authors: We acknowledge the absence of a formal information-theoretic bound in the original submission. While we cannot derive a tight bound for LLM token distributions within the scope of this empirical work, we have added an out-of-distribution analysis in revised Appendix C evaluating performance on held-out reasoning tasks. We also include qualitative case studies in Section 3.2 demonstrating that the lightweight model reaches the correct final answer from the transmitted signal. We have expanded the limitations discussion to note the possibility of edge-case degradation. We maintain that end-to-end accuracy preservation on the reported benchmarks supports signal sufficiency, though we agree further analysis would be valuable. revision: partial

standing simulated objections not resolved
  • Deriving a formal information-theoretic bound on the sufficiency of the length-penalized reasoning signal.

Circularity Check

0 steps flagged

No circularity: empirical joint-training framework with measured benchmark results

full rationale

The paper introduces DUET as an empirical collaborative inference method that decomposes tasks into a capable model generating a reasoning signal and a lightweight model producing the final answer, trained via a length-penalized joint objective. All performance claims (token savings up to 60% on AIME/GPQA while preserving accuracy) are obtained through direct experimentation and evaluation on held-out benchmarks rather than any derivation, prediction, or uniqueness theorem that reduces to fitted parameters or self-citations by construction. No equations or training steps are shown to be tautological with their inputs; the sufficiency of the signal is asserted as an empirical outcome, not proven via self-referential logic. This is a standard empirical ML paper whose central results rest on external validation data.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the empirical effectiveness of the proposed training objective and the assumption that reasoning can be cleanly split between signal generation and interpretation.

free parameters (1)
  • length penalty coefficient
    Hyperparameter in the joint training loss that controls how strongly the capable model is encouraged to shorten its output signal.
axioms (1)
  • domain assumption: Task performance can be preserved when reasoning is decomposed into a compact signal produced by a large model and interpretation by a small model
    Invoked by the two-stage design and the claim that accuracy is maintained.
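Figure 5 and the appendix excerpts report an "intelligence-per-1000-tokens" (IPT) metric without defining it on this page. A plausible reading, assumed here and consistent with the scale of the reported values, is accuracy per thousand large-model output tokens:

```python
def ipt(accuracy, large_model_output_tokens):
    # Assumed form of the page's IPT metric: accuracy divided by the
    # large model's output tokens expressed in thousands. The exact
    # definition used by the paper may differ.
    return accuracy / (large_model_output_tokens / 1000.0)

# Under this reading, halving tokens at equal accuracy doubles IPT:
print(ipt(0.5, 4000))  # -> 0.125
print(ipt(0.5, 2000))  # -> 0.25
```

A metric of this shape rewards exactly the trade the length penalty is meant to make: fewer capable-model tokens for roughly unchanged accuracy.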

pith-pipeline@v0.9.0 · 5492 in / 1251 out tokens · 19733 ms · 2026-05-09T19:23:39.977389+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

61 extracted references · 47 canonical work pages · 13 internal anchors

  1. [1]

    Training language models to reason efficiently

    D. Arora and A. Zanette. Training language models to reason efficiently. arXiv preprint arXiv:2502.04463.

  2. [2]

    A. P. Behera, J. P. Champati, R. Morabito, S. Tarkoma, and J. Gross. Towards efficient multi-llm inference: Characterization and analysis of llm routing and hierarchical techniques. arXiv preprint arXiv:2506.06579.

  3. [3]

    C. Chen, S. Borgeaud, G. Irving, J.-B. Lespiau, L. Sifre, and J. Jumper. Accelerating large language model decoding with speculative sampling. arXiv preprint arXiv:2302.01318.

  4. [4]

    D. Ding, A. Mallick, C. Wang, R. Sim, S. Mukherjee, V. Ruhle, L. V. Lakshmanan, and A. H. Awadallah. Hybrid llm: Cost-efficient and quality-aware query routing. arXiv preprint arXiv:2404.14618.

  5. [5]

    S. Fan, B. Qin, P. Han, S. Shang, Y. Wang, and A. Sun. The price of a second thought: On the evaluation of reasoning efficiency in large language models. arXiv preprint arXiv:2505.22017.

  6. [6]

    G. Fang, X. Ma, and X. Wang. Thinkless: Llm learns when to think. arXiv preprint arXiv:2505.13379.

  7. [7]

    Concise reasoning via reinforcement learning

    M. Fatemi, B. Rafiee, M. Tang, and K. Talamadupula. Concise reasoning via reinforcement learning. arXiv preprint arXiv:2504.05185.

  8. [8]

    T. Fu, Z. Min, H. Zhang, J. Yan, G. Dai, W. Ouyang, and Y. Wang. Cache-to-cache: Direct semantic communication between large language models. arXiv preprint arXiv:2510.03215.

  9. [9]

    C. Gao, H. Li, T. W. Killian, J. She, R. Wang, L. Ma, Z. Cheng, S. Hao, and Z. Xu. Concise reasoning in the lens of lagrangian optimization. arXiv preprint arXiv:2510.10168.

  10. [10]

    G. Grand, J. B. Tenenbaum, V. K. Mansinghka, A. K. Lew, and J. Andreas. Self-steering language models. arXiv preprint arXiv:2504.07081.

  11. [11]

    Y. Gu, L. Dong, F. Wei, and M. Huang. Minillm: Knowledge distillation of large language models. arXiv preprint arXiv:2306.08543.

  12. [12]

    E. Guha, R. Marten, S. Keh, N. Raoof, G. Smyrnis, H. Bansal, M. Nezhurina, J. Mercat, T. Vu, Z. Sprague, et al. Openthoughts: Data recipes for reasoning models. arXiv preprint arXiv:2506.04178.

  13. [13]

    Reasoncache: Teaching llms to reason without weight updates

    S. Gupta, P. Isola, S. Jegelka, D. Lopez-Paz, K. Ahuja, M. Ibrahim, and M. Pezeshki. Reasoncache: Teaching llms to reason without weight updates. arXiv preprint arXiv:2602.02366.

  14. [14]

    S. Han, M. Gao, M. Jiang, Y. Jiang, H. Hu, and S. Mai. Uncertainty-aware collaborative system of large and small models for multimodal sentiment analysis. arXiv preprint arXiv:2509.04459, 2025a. T. Han, Z. Wang, C. Fang, S. Zhao, S. Ma, and Z. Chen. Token-budget-aware llm reasoning. In Findings of the Association for Computational Linguistics: ACL 2025, pag...

  15. [15]

    Distilling the Knowledge in a Neural Network

    G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.

  16. [16]

    Training Compute-Optimal Large Language Models

    J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford, D. d. L. Casas, L. A. Hendricks, J. Welbl, A. Clark, et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556.

  17. [17]

    B. Hou, Y. Zhang, J. Ji, Y. Liu, K. Qian, J. Andreas, and S. Chang. Thinkprune: Pruning long chain-of-thought of llms via reinforcement learning. arXiv preprint arXiv:2504.01296.

  18. [18]

    Distilling step-by-step! Outperforming larger language models with less training data and smaller model sizes

    C.-Y. Hsieh, C.-L. Li, C.-K. Yeh, H. Nakhost, Y. Fujii, A. Ratner, R. Krishna, C.-Y. Lee, and T. Pfister. Distilling step-by-step! Outperforming larger language models with less training data and smaller model sizes. In Findings of the Association for Computational Linguistics: ACL 2023, pages 8003–8017.

  19. [19]

    Small language models fine-tuned to coordinate larger language models improve complex reasoning

    G. Juneja, S. Dutta, S. Chakrabarti, S. Manchanda, and T. Chakraborty. Small language models fine-tuned to coordinate larger language models improve complex reasoning. arXiv preprint arXiv:2310.18338.

  20. [20]

    Scaling Laws for Neural Language Models

    J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361.

  21. [21]

    Revisiting cascaded ensembles for efficient inference

    S. Kolawole, D. Dennis, A. Talwalkar, and V. Smith. Agreement-based cascading for efficient inference. arXiv preprint arXiv:2407.02348.

  22. [22]

    Understanding and improving information preservation in prompt compression for llms

    W. Łajewska, M. Hardalov, L. Aina, N. A. John, H. Su, and L. Màrquez. Understanding and improving information preservation in prompt compression for llms. arXiv preprint arXiv:2503.19114.

  23. [23]

    S. Lei, Z. Cheng, K. Jia, and D. Tao. Revisiting llm reasoning via information bottleneck. arXiv preprint arXiv:2507.18391.

  24. [24]

    J. Li, W. Zhao, Y. Zhang, and C. Gan. Steering llm thinking with budget guidance. arXiv preprint arXiv:2506.13752.

  25. [25]

    S.-Y. Liu, X. Dong, X. Lu, S. Diao, M. Liu, M.-H. Chen, H. Yin, Y.-C. F. Wang, K.-T. Cheng, Y. Choi, et al. Dler: Doing length penalty right-incentivizing more intelligence per token via reinforcement learning. arXiv preprint arXiv:2510.15110.

  26. [26]

    X. Lu, S. Han, D. Acuna, H. Kim, J. Jung, S. Prabhumoye, N. Muennighoff, M. Patwary, M. Shoeybi, B. Catanzaro, et al. Retro-search: Exploring untaken paths for deeper and efficient reasoning. arXiv preprint arXiv:2504.04383.

  27. [27]

    Breadcrumbs reasoning: Memory-efficient reasoning with compression beacons

    G. Monea, Y. Feldman, S. Padmanabhan, K. Brantley, and Y. Artzi. Breadcrumbs reasoning: Memory-efficient reasoning with compression beacons. arXiv preprint arXiv:2510.13797.

  28. [28]

    s1: Simple test-time scaling

    N. Muennighoff, Z. Yang, W. Shi, X. L. Li, L. Fei-Fei, H. Hajishirzi, L. Zettlemoyer, P. Liang, E. Candès, and T. B. Hashimoto. s1: Simple test-time scaling. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 20286–20332.

  29. [29]

    Self-training elicits concise reasoning in large language models

    T. Munkhbat, N. Ho, S. H. Kim, Y. Yang, Y. Kim, and S.-Y. Yun. Self-training elicits concise reasoning in large language models. arXiv preprint arXiv:2502.20122.

  30. [30]

    Concise thoughts: Impact of output length on llm reasoning and cost

    S. Nayab, G. Rossolini, M. Simoni, A. Saracino, G. Buttazzo, N. Manes, and F. Giacomelli. Concise thoughts: Impact of output length on llm reasoning and cost. arXiv preprint arXiv:2407.19825.

  31. [31]

    Beyond chinchilla-optimal: Accounting for inference in language model scaling laws

    N. Sardana, J. Portes, S. Doubov, and J. Frankle. Beyond chinchilla-optimal: Accounting for inference in language model scaling laws. arXiv preprint arXiv:2401.00448.

  32. [32]

    Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. Li, Y. Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300.

  33. [33]

    Z. Shen, H. Yan, L. Zhang, Z. Hu, Y. Du, and Y. He. Codi: Compressing chain-of-thought into continuous space via self-distillation. arXiv preprint arXiv:2502.21074.

  34. [34]

    HybridFlow: A Flexible and Efficient RLHF Framework

    G. Sheng, C. Zhang, Z. Ye, X. Wu, W. Zhang, R. Zhang, Y. Peng, H. Lin, and C. Wu. Hybridflow: A flexible and efficient rlhf framework. arXiv preprint arXiv:2409.19256.

  35. [35]

    Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

    C. Snell, J. Lee, K. Xu, and A. Kumar. Scaling llm test-time compute optimally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314.

  36. [36]

    Walk before you run! concise llm reasoning via reinforcement learning

    M. Song and M. Zheng. Walk before you run! Concise llm reasoning via reinforcement learning. arXiv preprint arXiv:2505.21178.

  37. [37]

    The information bottleneck method

    N. Tishby, F. C. Pereira, and W. Bialek. The information bottleneck method. arXiv preprint physics/0004057.

  38. [38]

    J.-F. Ton, M. F. Taufiq, and Y. Liu. Understanding chain-of-thought in llms through information theory. arXiv preprint arXiv:2411.11984.

  39. [39]

    J. Wang, W. Qiang, Z. Song, C. Zheng, and H. Xiong. Learning to think: Information-theoretic reinforcement fine-tuning for llms. arXiv preprint arXiv:2505.10425.

  40. [40]

    X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, S. Narang, A. Chowdhery, and D. Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171.

  41. [41]

    Z. Wang, L. Zou, S. Wei, F. Liao, J. Zhuo, H. Mi, and R. Lai. Large language model enabled semantic communication systems. arXiv preprint arXiv:2407.14112.

  42. [42]

    X. Wu, K. Li, Y. Zhao, L. Zhang, L. Ou, H. Yin, Z. Zhang, Y. Jiang, P. Xie, F. Huang, M. Cheng, S. Wang, H. Cheng, and J. Zhou. Resum: Unlocking long-horizon search intelligence via context summarization. arXiv preprint arXiv:2509.13313.

  43. [43]

    Y. Wu, Z. Sun, S. Li, S. Welleck, and Y. Yang. Inference scaling laws: An empirical analysis of compute-optimal inference for problem-solving with language models. arXiv preprint arXiv:2408.00724.

  44. [44]

    H. Xia, Y. Li, C. T. Leong, W. Wang, and W. Li. Tokenskip: Controllable chain-of-thought compression in llms. arXiv preprint arXiv:2502.12067.

  45. [45]

    A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388.

  46. [46]

    Q. Yu, Z. Zhang, R. Zhu, Y. Yuan, X. Zuo, Y. Yue, T. Fan, G. Liu, L. Liu, X. Liu, H. Lin, Z. Lin, B. Ma, G. Sheng, Y. Tong, C. Zhang, M. Zhang, W. Zhang, H. Zhu, J. Zhu, J. Chen, J. Chen, C. Wang, H. Yu, W. Dai, Y. Song, X. Wei, H. Zhou, J. Liu, W.-Y. Ma, Y.-Q. Zhang, L. Yan, M. Qiao, Y. Wu, and M. Wang. Dapo: An open...

  47. [47]

    A. L. Zhang, T. Kraska, and O. Khattab. Recursive language models. arXiv preprint arXiv:2512.24601, 2025a. P. Zhang, Z. Liu, S. Xiao, N. Shao, Q. Ye, and Z. Dou. Long context compression with activation beacon. In International Conference on Learning Representations, 2025b. Y. Zheng, Z. Zhao, Z. Li, Y. Xie, M. Gao, L....

  48. [48]

    H. Zhu, S. Hao, Z. Hu, J. Jiao, S. Russell, and Y. Tian. Reasoning by superposition: A theoretical perspective on chain of continuous thought. arXiv preprint arXiv:2505.12514.

  49. [49]

    J. Zou, X. Yang, R. Qiu, G. Li, K. Tieu, P. Lu, K. Shen, H. Tong, Y. Choi, J. He, et al. Latent collaboration in multi-agent systems. arXiv preprint arXiv:2511.20639.

  50. [50]

    Han et al

    introduced a planner–follower paradigm, where a planner model generates a task-specific inference program that is executed by a population of follower models. Han et al. [2025a] proposed an uncertainty-aware collaborative system that offloads high-confidence inputs to a smaller model while reserving uncertain cases for a larger model. Relatedly, Zhang et ...

  51. [51]

    studied periodic context summarization for long-horizon agentic search. These approaches are complementary to DUET: they primarily compress memory or history within a single reasoning process, whereas DUET learns a concise reasoning signal that is explicitly optimized to be consumed by a second model (in an independent auto-regressive run). At the systems...

  52. [52]

    studied the accuracy degradation caused by truncation, and showed that it arises from inadequate RL optimization rather than the lack of sophisticated penalties. Other approaches guide reasoning to adhere to a budget by predicting the remaining thinking length [Li et al., 2025] or dynamically adjusting the token budget based on problem complexity [Han et ...

  53. [53]

    There are also several heuristic approaches

    show that training a prefix cache while keeping the base model frozen is sufficient to improve reasoning performance, and strikingly induces more concise reasoning even without explicitly optimizing for brevity. There are also several heuristic approaches. These include prompt-based methods for eliciting concise reasoning [Nayab et al., 2024], controlled ...

  54. [54]

    In multi-model settings, Fu et al

    provided theoretical support for such latent reasoning by showing that continuous thoughts can encode superpositions of multiple search frontiers. In multi-model settings, Fu et al. [2025], Zheng et al. [2025], and Zou et al. [2025b] explore direct semantic communication through KV caches, latent thoughts, or shared latent working memory instead of textua...

  55. [55]

    Note that you should keep the reasoning concise: include only essential reasoning steps and avoid repetition or unnecessary explanations

    The learning rate for both the large and small model is 1 × 10^-6. For both the large and small models, we sample four rollouts per input to compute advantages during GRPO training. We applied a KL-divergence regularization term to the reward with a coefficient of 0.001. The maximum response lengths for the large model M and the small model m are set to 16...

  56. [56]

    The original model requires 63m27.599s, while DUET reduces this to 41m54.337s, demonstrating a substantial improvement in practical efficiency

    In addition, we measure the wall-clock inference time on the full evaluation dataset. The original model requires 63m27.599s, while DUET reduces this to 41m54.337s, demonstrating a substantial improvement in practical efficiency. Table 3: The performance and efficiency of DUET under different training steps. Each cell reports (accuracy, large-model tokens,...

  57. [57]

    Each cell reports (accuracy, large-model output tokens)

    Each cell reports (accuracy, large-model output tokens). Benchmarks: MATH500, AMC23, AIME2024, and GPQA Diamond. Original large model (0.89, ...

  58. [58]

    Each cell reports the intelligence-per-1000-tokens metric evaluated at 50, 100, and 150 training steps, respectively. Training dataset × benchmark (AIME2024, AIME2025, GPQA Diamond): MATH-LightEval (0.075, 0.126, 0.121), (0.045, 0.073, 0.075), (0.113, 0.162, 0.230); DeepScaleR (0.089, 0.119, 0.109), (0.042, 0.079, 0.106), (0.129, 0.196, 0.369); DAPO-MATH-17k (0.085, 0....

  59. [59]

    E.6 Ablation on Hyperparameters We conduct ablation studies on the hyperparameters in the length penalty, including B and λ (with constant schedule)

    Training only the large model with a standard RL algorithm fails to produce concise reasoning, demonstrating that the ability to generate concise reasoning stems from the DUET framework and length-penalized joint training, rather than from the training dataset itself. E.6 Ablation on Hyperparameters We conduct ablation studies on the hyperparameters in th...

  60. [60]

    Larger values of λ apply stronger length penalties, achieving more compressed reasoning. As shown in the ablations on marginal utility and length penalty schedule (Section E.4), the constant schedule achieves similar ...

  61. [61]

    Models are trained on the DeepScaleR dataset and evaluated at 100 training steps across multiple benchmarks

    We find that including the KL penalty achieves better performance and training stability. Table 8: Ablation study of hyperparameter B. Models are trained on the DeepScaleR dataset and evaluated at 100 training steps across multiple benchmarks. Acc. denotes accuracy (%) and Tokens denotes average output tokens per sample. Value of B MATH500 AMC23 AIME24 AIM...