Zipping the Thought: When and How Compressed Reasoning Data Works in LLM Post-Training

Gouki Minegishi; Kohsei Matsutani; Takeshi Kojima; Yusuke Iwasawa; Yutaka Matsuo

arxiv: 2605.28008 · v1 · pith:ARNF5YJYnew · submitted 2026-05-27 · 💻 cs.AI · cs.LG


Pith reviewed 2026-06-29 12:11 UTC · model grok-4.3

classification 💻 cs.AI cs.LG

keywords chain of thought · compressed reasoning · supervised fine-tuning · reinforcement learning · LLM post-training · synthetic tasks · generalization · memorization

The pith

Coarser chain-of-thought compression (composed or implicit) requires more supervised fine-tuning data than explicit, step-by-step traces.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper examines how different granularities of compressed chain-of-thought reasoning affect LLM post-training outcomes under supervised fine-tuning and reinforcement learning. It defines a taxonomy of explicit CoT that shows every operation, composed CoT that aggregates multiple operations into one step, and implicit CoT that omits intermediate operations. Experiments on a synthetic compositional reasoning task with controlled difficulty show that coarser compression demands larger SFT datasets, while composed and implicit versions improve more from data scaling. Reinforcement learning after SFT tends to expand the compressed steps, and unidirectional ordering aids generalization on longer sequences. These patterns matter for choosing reasoning data formats when training data or token budgets are constrained.

Core claim

Using a synthetic compositional reasoning task that varies difficulty, compression granularity, and data size, experiments across model families show that explicit CoT requires the least SFT data while composed and implicit CoT benefit more from data scaling; composed CoT gains from repetition whereas implicit CoT tends toward memorization; RL with verifiable rewards decomposes the compressed steps acquired in SFT; and unidirectional CoT ordering produces stronger generalization on longer sequential tasks.

What carries the argument

The taxonomy of CoT into Explicit CoT (all operations shown), Composed CoT (multiple operations aggregated), and Implicit CoT (intermediate operations omitted), which controls compression granularity in the synthetic task.
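
As a rough illustration of the three granularities, the sketch below renders one operation chain in each format. The chain and the expected values are reconstructed from the paper's own Composed CoT (g = 4) and Implicit CoT (g = 4) examples; the function names, the `V0`, `V1`, ... variable naming, and the shortened line format are assumptions made for illustration, not the authors' data generator.

```python
MOD = 23  # the task computes every value modulo 23


def apply_op(value, op, k):
    """Apply one operation to a value, reducing modulo MOD."""
    if op == "+":
        return (value + k) % MOD
    if op == "-":
        return (value - k) % MOD
    if op == "*":
        return (value * k) % MOD
    raise ValueError(f"unknown operation: {op}")


def expr(head, group):
    """Write a group of operations as one expression, adding parentheses
    so that ordinary precedence matches left-to-right application."""
    text, additive = head, False
    for op, k in group:
        if op == "*" and additive:
            text, additive = f"({text})", False
        text = f"{text} {op} {k}"
        additive = additive or op in "+-"
    return text


def cot_lines(start, ops, g=1, implicit=False):
    """Render a trace with g operations per step.

    g == 1               -> Explicit CoT (every operation shown)
    g > 1, not implicit  -> Composed CoT (g operations aggregated per step)
    g > 1, implicit      -> Implicit CoT (only the last operation of each
                            group is shown, applied to a hidden intermediate)
    """
    lines, value = [f"V0 = {start}"], start
    for i in range(0, len(ops), g):
        group = ops[i:i + g]
        prev_name, name = f"V{i // g}", f"V{i // g + 1}"
        prev_value, hidden = value, value
        for op, k in group[:-1]:
            hidden = apply_op(hidden, op, k)  # intermediate never written out
        last_op, last_k = group[-1]
        value = apply_op(hidden, last_op, last_k)
        if implicit:
            lines.append(f"{name} = {hidden} {last_op} {last_k} = {value}")
        else:
            lines.append(
                f"{name} = {expr(prev_name, group)}"
                f" = {expr(str(prev_value), group)} = {value}"
            )
    return lines, value


# Operation chain reconstructed from the paper's g = 4 example (op = 8).
OPS = [("*", 3), ("*", 4), ("*", 2), ("-", 1),
       ("*", 2), ("*", 3), ("-", 1), ("-", 4)]

if __name__ == "__main__":
    for label, kwargs in [("Explicit", {"g": 1}),
                          ("Composed g=4", {"g": 4}),
                          ("Implicit g=4", {"g": 4, "implicit": True})]:
        lines, answer = cot_lines(1, OPS, **kwargs)
        print(label, "|", "; ".join(lines), "| Answer:", answer)
```

All three renderings reach the same answer; they differ only in how much of the intermediate computation appears in the trace, which is the axis the paper varies.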

If this is right

Coarser CoT requires more SFT data to reach performance levels comparable to finer forms.
Composed CoT and Implicit CoT benefit more from increases in SFT data volume than Explicit CoT.
Composed CoT improves with data repetition while Implicit CoT tends to produce memorization.
RL with verifiable rewards after SFT decomposes the compressed steps learned during SFT.
Unidirectional CoT ordering improves generalization on longer sequential tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Training pipelines could deliberately choose CoT granularity according to the amount of available data and tolerance for repetition.
RL may function as a corrective step that recovers detail from shortcuts introduced by SFT on compressed traces.
The observed ordering effect suggests testing whether forward-only traces also help on non-sequential reasoning problems.
Datasets could be constructed with mixed CoT types to balance scaling benefits against memorization risks.
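
For readers less familiar with RL with verifiable rewards, the sketch below shows what a verifiable reward could look like on this task. It is a minimal example under stated assumptions: the reward is taken to be a binary exact match on the final `Answer: N` line (the answer format visible in the paper's example traces), and the function name and regular expression are hypothetical. The paper's actual reward definition is not reproduced on this page.

```python
import re

# Matches "Answer: 18" and also "Answer : 18", both of which occur in the
# paper's example traces.
ANSWER_RE = re.compile(r"Answer\s*:\s*(\d+)")


def verifiable_reward(response: str, target: int) -> float:
    """Return 1.0 if the last stated answer equals the target, else 0.0.

    Assumption: reward depends only on the final answer, not on the
    intermediate steps, so the policy is free to change step granularity.
    """
    matches = ANSWER_RE.findall(response)
    if not matches:
        return 0.0
    return 1.0 if int(matches[-1]) == target else 0.0
```

Because a reward of this shape never inspects the intermediate steps, nothing in it penalizes a policy that unpacks a composed step into finer ones, which is compatible with the decomposition behavior the paper reports.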

Load-bearing premise

The synthetic compositional reasoning task produces compression and generalization behaviors that transfer to the natural-language reasoning distributions used in real LLM post-training.

What would settle it

An experiment on a natural-language reasoning benchmark in which implicit CoT shows no greater memorization than explicit CoT, or in which RL fails to decompose SFT-learned compressions, would falsify the reported distinctions.

Figures

Figures reproduced from arXiv: 2605.28008 by Gouki Minegishi, Kohsei Matsutani, Takeshi Kojima, Yusuke Iwasawa, Yutaka Matsuo.

**Figure 2.** Synthetic Dataset for Compositional Tasks. Each question consists of natural language descriptions of inter-parameter relations, including addition, subtraction, and multiplication, with initial parameter values. The task requires sequentially applying the specified operations modulo 23 to infer the value of a target parameter. We use a CoT format of the form: “Define [parameter] as [variable]; so [variabl…

**Figure 4.** Train–Test Split by op. Training is performed on tasks with short op sequences, while tasks with longer op sequences are used for OOD evaluation. For these compositional reasoning tasks, we construct a testbed that lets us control data size, difficulty, and compression granularity.

**Figure 3.** Compression Granularity and Training Steps. The bar chart reports the average performance of Qwen2.5-0.5B, 3B, 7B, and Llama-3.1-8B-Instruct at steps 125, 1000, 4000, and 16000 after SFT with Explicit CoT, Composed CoT, and Implicit CoT with g = 2, 4. Models are trained on tasks with op = 8, 16, 24. Evaluation results are averaged over op = 32, 40, 48, ..., 96, 104 tasks.

**Figure 5.** Data Scaling vs Data Repetition. The bar chart reports average performance of Qwen2.5-3B and Llama-3.2-3B-Instruct after SFT with Composed CoT and Implicit CoT with g = 2. Models are trained with 384k samples for 1 epoch, 6k samples for 64 epochs, and 6k samples for 1 epoch. Evaluation results are averaged over op = 32, 40, 48, ..., 96, 104 tasks. Since the computation is performed modulo 23, the chance…

**Figure 7.** Illustration of Tasks Requiring Decomposition. For tasks with op = 5 (≡ 1 (mod 2)), applying CoT with g = 2 produces g = 1 fractions required to solve the problem.

**Figure 6.** Decomposition of Composed Steps by RLVR. (a) Dumbbell plot showing changes in the average evaluation results over op = 25, 27, 29, ..., 101, 103 before and after RLVR on op = 9, 11, 13, 15 tasks, using checkpoints obtained by SFT on Qwen2.5-3B and Llama-3.2-3B-Instruct with Composed CoT and Implicit CoT with g = 2. (b) Training dynamics of the mean reward, mean rollout response length, and mean token en…

**Figure 8.** Evaluation results after SFT on even and odd op. The results show that, when trained on compressed CoT traces with g = 2, both Composed CoT and Implicit CoT solve OOD tasks with even op but fail on tasks with odd op. [Plot: Pass@1 (%) against op grouped by residue mod 2 and mod 4; panels include Qwen2.5-3B.]

**Figure 9.** Effect of CoT Order. The bar chart reports the average performance of Qwen2.5-3B and Llama-3.2-3B-Instruct after SFT on 384k samples using Forward CoT, Backward CoT, and Hierarchical CoT with op = 8, 16, evaluated on op = 8, 16 (ID) and op = 32, 64, 128 (OOD). Since the computation is performed modulo 23, the chance level (1/23) is indicated by the red dashed line. Response length increases sharply when…

**Figure 10.** Illustration of Different CoT Orders. Forward CoT, Backward CoT, and Hierarchical CoT for tasks with op = 8. We have so far examined compressed CoT mainly for sequential problems such as f8(f7(f6(f5(f4(f3(f2(f1(s0)))))))), where op = 8 and CoT follows the forward order f1 → f2 → f3 → f4 → f5 → f6 → f7 → f8 (we call this forward CoT). Other reasoning orders are also possible. In backward CoT, the model re…

**Figure 11.** Evaluation Results of Qwen2.5 Models. Evaluation results on op = 32, 40, 48, ..., 96, 104 tasks for checkpoints after SFT for one epoch with 6k, 48k, 192k, and 768k samples for each CoT type at op = 8, 16, 24 tasks.

**Figure 12.** Evaluation Results of Llama-3 Models. Evaluation results on op = 32, 40, 48, ..., 96, 104 tasks for checkpoints after SFT for one epoch with 6k, 48k, 192k, and 768k samples for each CoT type at op = 8, 16, 24 tasks.

**Figure 13.** Evaluation Results of Different Number of Epochs. Evaluation results on op = 32, 40, 48, ..., 96, 104 for checkpoints after SFT with 384k samples for one epoch, 6k samples for 64 epochs, and 6k samples for one epoch for each CoT type at op = 8, 16, 24 tasks.

**Figure 14.** Evaluation results before and after RLVR on odd and even op tasks. Training dynamics of the mean reward, mean rollout response length, and mean token entropy at each step. Odd op tasks are op = 9, 11, 13, 15 and even op tasks are op = 10, 12, 14. [Plot: Pass@1 (%) against op, with panels for op ≡ 0 (mod 2) and op ≡ 1 (mod 2).]

**Figure 15.** Evaluation Results Before and After RLVR on odd op tasks. Evaluation results on op = 25, 27, 29, ..., 101, 103 tasks for checkpoints after RLVR on op = 9, 11, 13, 15 tasks for each CoT type.

**Figure 16.** Evaluation Results of Different CoT Orders. Evaluation results on op = 8, 16, 32, 64, 128 tasks for checkpoints after SFT on op = 8, 16 tasks for each CoT order. op = 8, 16 tasks are ID, and op = 32, 64, 128 tasks are OOD.
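
The decomposition requirement described in the captions for Figures 7 and 8 comes down to simple arithmetic: a chain of op operations split into steps of size g leaves a shorter leftover step whenever op is not a multiple of g. The sketch below makes that concrete. The function names are hypothetical, and placing the leftover step at the end of the chain is an assumption, since the figure itself is not reproduced here.

```python
def step_sizes(op: int, g: int) -> list[int]:
    """Split a chain of `op` operations into steps of size `g`.

    Assumption: any leftover operations form one shorter final step.
    """
    full, rest = divmod(op, g)
    return [g] * full + ([rest] if rest else [])


def needs_decomposition(op: int, g: int) -> bool:
    """True when some step must be finer than the trained granularity g."""
    return op % g != 0


if __name__ == "__main__":
    for op in (5, 8, 9):
        print(op, step_sizes(op, 2), needs_decomposition(op, 2))
```

With g = 2, every odd op (including the RLVR training lengths 9, 11, 13, 15) forces at least one single-operation step, which a model trained only on g = 2 traces has never produced during SFT.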

Original abstract

Large language models (LLMs) can now solve complex problems through long chain-of-thought (CoT) reasoning, but the trade-off between performance and token cost remains a central challenge. To address this issue, supervised fine-tuning (SFT) often uses compressed reasoning data, where CoT traces are shortened into compact forms. However, the effect of such compressed reasoning data on post-training remains poorly understood. In this paper, we propose a taxonomy of CoT consisting of Explicit CoT, which outputs all operations without aggregation, Composed CoT, which combines multiple operations into a single step, and Implicit CoT, which omits intermediate operations. We construct a synthetic compositional reasoning task that allows controlled variation of difficulty, compression granularity, and data size, and conduct a comprehensive set of experiments across different model families and sizes. Notably, we find that (i) coarser CoT requires more SFT data, (ii) compared with Explicit CoT, Composed CoT and Implicit CoT benefit more from data scaling, while Composed CoT benefits from data repetition and Implicit CoT tends to lead to memorization, (iii) unlike SFT, subsequent reinforcement learning (RL) with verifiable rewards (RLVR) decomposes compressed steps learned during SFT, and (iv) unidirectional CoT ordering shows stronger generalization on longer sequential tasks. Our findings provide implications for CoT design under data resource constraints and offer important insights into the mechanisms of SFT and RL in LLM post-training.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Controlled synthetic-task experiments cleanly separate how CoT granularity affects SFT scaling, repetition, memorization, and RL decomposition, but the transfer story to natural-language reasoning is untested.


The main things to know are that coarser compression needs more SFT data, Composed and Implicit forms scale differently with data volume and repetition, RL tends to unpack the compressed steps, and unidirectional ordering helps length generalization. All of this comes from a synthetic compositional task with explicit controls on difficulty, granularity, and data size.

What is new is the three-way taxonomy (Explicit, Composed, Implicit) plus the deliberate separation of SFT and RL effects under the same controlled conditions. Prior compression work did not run this kind of decomposition across model families and sizes while varying those axes. The experimental design is straightforward and the patterns are reported with the relevant conditions attached, which makes the within-task results easy to inspect.

The soft spot is exactly the one the stress test flags: everything stays inside the synthetic task. No transfer runs on standard reasoning benchmarks appear, so the stated implications for real post-training under token budgets rest on the assumption that lexical variability, pre-trained knowledge, and multi-hop patterns will behave the same way. That assumption is plausible but untested here, which keeps the practical advice at the level of useful hypotheses rather than ready guidelines.

The work is for people who design reasoning data or study SFT-versus-RL interactions and want mechanism-level signals from a clean setup. A reader who already works with synthetic tasks or is planning follow-up transfer experiments will get the most out of it.

It deserves peer review. The controls are tight enough and the RL decomposition finding is worth having referees pressure-test, even with the generalizability caveat.

Referee Report

2 major / 2 minor

Summary. The paper proposes a taxonomy of chain-of-thought (CoT) reasoning into Explicit, Composed, and Implicit forms, then uses a synthetic compositional reasoning task with controlled variation in difficulty, compression granularity, and data volume to run SFT and RLVR experiments across multiple LLM families and sizes. It reports four main empirical patterns: (i) coarser CoT requires more SFT data; (ii) Composed and Implicit CoT scale better with data volume than Explicit CoT, with Composed CoT benefiting from repetition while Implicit CoT risks memorization; (iii) RLVR tends to decompose compressed steps learned in SFT; and (iv) unidirectional ordering improves length generalization.

Significance. If the reported patterns are robust, the work supplies concrete, controlled evidence on how compression granularity interacts with data scaling, repetition, and the SFT-to-RL transition. The experimental design (explicit variation of difficulty/granularity/size, multi-model replication) is a strength and supports falsifiable claims about post-training mechanisms under resource constraints.

major comments (2)

[Abstract and Conclusion] The manuscript's stated implications for CoT design in real LLM post-training rest on the assumption that compression and scaling behaviors observed on the synthetic compositional task transfer to natural-language reasoning distributions. No transfer experiments on standard reasoning benchmarks are reported, which is load-bearing for the broader claims in the abstract and conclusion.
[Experimental Setup] §4 (Experimental Setup) and the task definition: the precise generation rules for Composed CoT (aggregation of operations) and Implicit CoT (omission of intermediates) are not stated with sufficient formality to allow independent replication or to evaluate how closely they match the distributional properties of natural-language CoT traces.

minor comments (2)

[Figures] Figure captions and axis labels should explicitly state the number of runs and error bars used for each plotted point.
[Methods] The distinction between 'data scaling' and 'data repetition' experiments should be clarified in the methods to avoid reader confusion about whether repetition means multiple epochs on the same examples.
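
The Figure 5 caption already pins down the three regimes (384k samples for 1 epoch, 6k samples for 64 epochs, 6k samples for 1 epoch), and a few lines of arithmetic show why the first two are comparable. The sketch below is illustrative only: it assumes "k" means 1,000 and uses the effective batch size of 48 from the paper's Table 1; the `budget` helper and the regime labels are hypothetical names.

```python
BATCH = 48  # effective batch size reported in the paper's Table 1


def budget(unique_samples: int, epochs: int, batch: int = BATCH) -> dict:
    """Summarize a training regime by unique data, samples seen, and steps."""
    seen = unique_samples * epochs
    return {"unique": unique_samples, "seen": seen, "steps": seen // batch}


# Assumption: "384k" and "6k" mean 384,000 and 6,000 samples.
regimes = {
    "scaling (384k x 1 epoch)": budget(384_000, 1),
    "repetition (6k x 64 epochs)": budget(6_000, 64),
    "baseline (6k x 1 epoch)": budget(6_000, 1),
}

if __name__ == "__main__":
    for name, stats in regimes.items():
        print(f"{name}: {stats}")
```

Under these assumptions the scaling and repetition regimes see the same number of samples and take the same number of optimizer steps, so they differ only in how many unique examples are available. The same arithmetic appears consistent with Figure 3: 6k, 48k, 192k, and 768k samples at batch size 48 give 125, 1000, 4000, and 16000 steps.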

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. We respond to each major comment below.


Referee: [Abstract and Conclusion] The manuscript's stated implications for CoT design in real LLM post-training rest on the assumption that compression and scaling behaviors observed on the synthetic compositional task transfer to natural-language reasoning distributions. No transfer experiments on standard reasoning benchmarks are reported, which is load-bearing for the broader claims in the abstract and conclusion.

Authors: We agree that the broader implications would be strengthened by evidence of transfer to natural language tasks. However, the synthetic task was specifically designed to allow controlled experimentation on the effects of compression granularity, which is difficult to achieve with natural language data. We will revise the abstract and conclusion to emphasize that the findings are from the synthetic setting and discuss the potential implications for real-world post-training as hypotheses for future work.

revision: partial

Referee: [Experimental Setup] §4 (Experimental Setup) and the task definition: the precise generation rules for Composed CoT (aggregation of operations) and Implicit CoT (omission of intermediates) are not stated with sufficient formality to allow independent replication or to evaluate how closely they match the distributional properties of natural-language CoT traces.

Authors: We appreciate this point and will provide more formal definitions in the revised manuscript. Specifically, we will include a detailed description of the generation process for each CoT type, including the rules for operation aggregation in Composed CoT and omission in Implicit CoT, using mathematical notation and examples to ensure replicability.

revision: yes

Circularity Check

0 steps flagged

No circularity; purely experimental results on synthetic task

full rationale

The paper defines a taxonomy of CoT compression types and reports measured outcomes from controlled experiments on a synthetic compositional reasoning task. All four headline findings are obtained directly from performance metrics on held-out task variants under varying data sizes, repetition, and training stages (SFT then RLVR). No equations, parameter fits, or derivations are presented that reduce to their own inputs by construction, and no self-citations are invoked as load-bearing premises for uniqueness or ansatzes. The work is self-contained against its own experimental benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is an empirical study that introduces no new mathematical axioms, free parameters fitted inside a derivation, or postulated entities; all quantities are measured outcomes on a constructed task.

pith-pipeline@v0.9.1-grok · 5825 in / 1106 out tokens · 22391 ms · 2026-06-29T12:11:19.558998+00:00 · methodology


Reference graph

Works this paper leans on

19 extracted references · 5 canonical work pages · 1 internal anchor

[1]

Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168. Ganqu Cui, Yuchen Zhang, Jiacheng Chen, Lifan Yuan, Zhi Wang, Yuxin Zuo, Haozhan Li, Yuchen Fan, Huayu Chen, Weize Chen, Zhiyuan Liu, Hao Peng, Lei Bai, Wanli Ouyang, Yu Cheng, Bowen Zhou, and Ning Ding. 2025. The entropy mechanism of reinforcement learning for reasoning l...
[2]

Implicit chain of thought reasoning via knowledge distillation

Implicit chain of thought reasoning via knowledge distillation. arXiv preprint arXiv:2311.01460. Yanrui Du, Sendong Zhao, Yibo Gao, Danyang Zhao, Qika Lin, Ming Ma, Jiayun Li, Yi Jiang, Kai He, Qianyi Xu, Bing Qin, and Mengling Feng. 2026. S3-CoT: Self-sampled succinct reasoning enables efficient Chain-of-Thought LLMs. arXiv preprint arXiv:2602.01982. Nou...
[3]

In Second Conference on Language Modeling

Training large language models to reason in a continuous latent space. In Second Conference on Language Modeling. Yinghui He, Abhishek Panigrahi, Yong Lin, and Sanjeev Arora. 2025. Skill-Targeted adaptive training. arXiv preprint arXiv:2510.10023. Xanh Ho, Anh-Khoa Duong Nguyen, Saku Sugawara, and Akiko Aizawa. 2020. Constructing a multi-hop QA dataset for...
[4]

The invisible leash: Why rlvr may or may not escape its origin, 2026

Reinforcement learning with verifiable rewards implicitly incentivizes correct reasoning in base LLMs. In The Fourteenth International Conference on Learning Representations. Fang Wu, Weihao Xuan, Ximing Lu, Zaid Harchaoui, and Yejin Choi. 2025a. The invisible leash: Why rlvr may not escape its origin. arXiv preprint arXiv:2507.14843. Yifan Wu, Jingze S...
[5]

In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 11257–11272, Suzhou, China

Back attention: Understanding and enhancing multi-hop reasoning in large language models. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 11257–11272, Suzhou, China. Association for Computational Linguistics. Lifan Yuan, Weize Chen, Yuchen Zhang, Ganqu Cui, Hanbin Wang, Ziming You, Ning Ding, Zhiyuan L...
[6]

In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), Bangkok, Thailand

Llamafactory: Unified efficient fine-tuning of 100+ language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), Bangkok, Thailand. Association for Computational Linguistics. Yang Zhou, Hongyi Liu, Zhuoming Chen, Yuandong Tian, and Beidi Chen. 2025. GSM-$\infty$: How do...
[7]

Define [parameter] as [variable]; so [variable] [operation] = [value]

and targeting skills (He et al., 2025), and self-distillation (Sprague et al., 2026). Yao et al. (2026) analyzed compositional generalization under distribution shift and Lippl et al. (2025) identified compositional geometry of algorithmic primitives. Recently, Yuan et al. (2026) and Cheng et al. (2026) investigated the effect of RL on compositional abili...
[8]

The number of onions equals the number of boots times 3

The number of maples equals 1. The number of onions equals the number of boots times 3. The number of opals equals 3. The number of needles equals 4. The number of cats equals the number of celery times 4. What is the number of scarves? Explicit CoT. Example of Explicit CoT Define whelks as JK; so JK = 1. Define celery as FP; so FP = JK * 3 = 1 * 3 = 3. D...
[9]

Example of Composed CoT (g = 4) Define whelks as JK; so JK = 1

Answer: 18. Example of Composed CoT (g = 4) Define whelks as JK; so JK = 1. Define chairs as DP; so DP = JK * 3 * 4 * 2 - 1 = 1 * 3 * 4 * 2 - 1 = 0. Define scarves as TH; so TH = DP * 2 * 3 - 1 - 4 = 0 * 2 * 3 - 1 - 4 = 18. Answer: 18. Example of Composed CoT (g = 8) Define whelks as JK; so JK = 1. Define scarves as TH; so TH = (JK * 3 * 4 * 2 - 1) * 2 * 3 ...
[10]

Define scarves as TH; so TH = 22 - 4 = 18

Define onions as MP; so MP = 0 * 3 = 0. Define scarves as TH; so TH = 22 - 4 = 18. Answer: 18. Example of Implicit CoT (g = 4) Define whelks as JK; so JK = 1. Define chairs as DP; so DP = 1 - 1 = 0. Define scarves as TH; so TH = 22 - 4 = 18. Answer: 18. Example of Implicit CoT (g = 8) Define whelks as JK; so JK = 1. Define scarves as TH; so TH = 22 - 4 = 18....
[11]

Table 1: SFT Configuration

implemented in the verl framework (Sheng et al., 2024), with hyperparameters listed in Table 2.

Table 1: SFT Configuration.

| Component | Setting |
| --- | --- |
| Effective batch size | 48 |
| Optimizer | AdamW |
| Learning rate | 2.0 × 10^-5 |
| Weight decay | 0.1 |
| Max gradient norm | 1.0 |
| Scheduler | Cosine |
| Warmup ratio | 0.05 |
| Minimum learning rate | 3.0 × 10^-6 |
| Mixed precision | bfloat16 |

Table 2: RLVR Config...
[12]

Define aspens as YL; so YL = (YH - 1) * 3 = (21 - 1) * 3 = 14

Define markers as YH; so YH = YL * 4 * 3 = 19 * 4 * 3 = 21. Define aspens as YL; so YL = (YH - 1) * 3 = (21 - 1) * 3 = 14. Define wolves as YL; so YL = YL - 2 - 2 = 14 - 2 - 2 =
[13]

Answer: 12

Define YL + 2 = 10 + 2 = 12. Answer: 12. Qwen2.5-3B, Implicit CoT (g= 2) Define whelks as ZD; so ZD = 3. Define cats as ZF; so ZF = 4 + 4 = 8. Define chairs as ZU; so ZU = 9 * 2 = 18. Define onions as ZL; so ZL = 8 - 2 =
[14]

Define batons as ZY; so ZY = 15 + 2 = 17

Define scarves as ZC; so ZC = 18 * 2 = 13. Define batons as ZY; so ZY = 15 + 2 = 17. Define hats as ZN; so ZN = 5 + 1 = 6. Define maples as ZD; so ZD = 2 + 3 = 5. Define violets as ZF; so ZF = 15 * 2 = 7. Define flutes as ZD; so ZD = 21 * 2 = 19. Define markers as ZD; so ZD = 7 * 3 = 21. Define aspens as ZD; so ZD = 20 * 3 =
[15]

Define tokens as ZD; so ZD = 10 + 2 = 12

Define wolves as ZD; so ZD = 12 - 2 = 10. Define tokens as ZD; so ZD = 10 + 2 = 12. Answer: 12. Llama3.2-3B-Instruct, Composed CoT (g= 2) Define whelks as ZD; so ZD = 3. Define cats as ZT; so ZT = ZD + 1 + 4 = 3 + 1 + 4 = 8. Define chairs as ZQ; so ZQ = ZT * 4 * 2 = 8 * 4 * 2 = 18. Define onions as ZD; so ZD = ZQ * 3 - 2 = 18 * 3 - 2 = 6. Define scarves a...
[16]

Define aspens as ZD; so ZD = (ZT - 1) * 3 = (21 - 1) * 3 = 14

Define markers as ZT; so ZT = ZD * 4 * 3 = 19 * 4 * 3 = 21. Define aspens as ZD; so ZD = (ZT - 1) * 3 = (21 - 1) * 3 = 14. Define wolves as ZT; so ZT = ZD - 2 - 2 = 14 - 2 - 2 =
[17]

Answer: 12

Define tokens as ZD; so ZD = ZT * 1 + 2 = 10 * 1 + 2 = 12. Answer: 12. Llama3.2-3B-Instruct, Implicit CoT ( g= 2) Define whelks as ZD; so ZD = 3. Define cats as ZC; so ZC = 4 + 4 = 8. Define chairs as ZK; so ZK = 9 * 2 = 18. Define onions as ZU; so ZU = 8 - 2 =
[18]

Define batons as ZD; so ZD = 15 + 2 = 17

Define scarves as ZC; so ZC = 18 * 2 = 13. Define batons as ZD; so ZD = 15 + 2 = 17. Define hats as ZU; so ZU = 5 + 1 = 6. Define maples as ZC; so ZC = 2 + 3 = 5. Define violets as ZD; so ZD = 15 * 2 = 7. Define flutes as ZD; so ZD = 21 * 2 = 19. Define markers as ZD; so ZD = 7 * 3 = 21. Define aspens as ZD; so ZD = 20 * 3 =
[19]

Define tokens as ZD; so ZD = 10 + 2 = 12

Define wolves as ZD; so ZD = 12 - 2 = 10. Define tokens as ZD; so ZD = 10 + 2 = 12. Answer: 12.

D.4 SFT Results on Different CoT Orders

For Qwen2.5-3B and Llama-3.2-3B-Instruct, we consider Forward CoT, Backward CoT, and Hierarchical CoT. We perform SFT on op = 8, 16 tasks (≡ 0 (mod 8)), varying the training dataset size among 6k, 24k, 96k, and 384k. Figu...
