pith. machine review for the scientific record.

arxiv: 2604.15701 · v1 · submitted 2026-04-17 · 💻 cs.CL

Recognition: unknown

Improving Reasoning Capabilities in Small Models through Mixture-of-Layers Distillation with Stepwise Attention on Key Information

Authors on Pith · no claims yet

Pith reviewed 2026-05-10 08:52 UTC · model grok-4.3

classification 💻 cs.CL
keywords reasoning · attention · distillation · information · models · student · during · stepwise

The pith

A CoT distillation framework transfers stepwise teacher attention on key information via a Mixture-of-Layers module to improve reasoning in small language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models can reason step by step using chain-of-thought prompting, but they are expensive to run. Researchers want to transfer this ability to smaller models through distillation, where the small model learns from the large one's outputs. Current approaches copy the large model's final reasoning steps but ignore how the large model gradually focuses its attention on the most important parts of the problem as it reasons. The authors observe that attention in these models shifts progressively toward critical clues. They propose to distill this shifting attention pattern directly, so the student model learns to concentrate on key information in the same stepwise way. To handle the fact that teacher and student have different numbers of layers, they add a Mixture-of-Layers component that dynamically matches layers during training. Experiments on math and commonsense reasoning benchmarks show consistent gains over standard distillation baselines.
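The layer-matching idea in the paragraph above can be illustrated concretely. The paper's Mixture-of-Layers module is not specified here, so the following is a minimal numpy sketch under one common design, not the authors' implementation: each student layer holds a learnable score per teacher layer, a softmax turns the scores into mixing weights, and the teacher's per-layer attention maps are blended into one distillation target per student layer. The names (`mol_align`, `logits`) are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def mol_align(teacher_attn, logits):
    """Blend teacher layers into per-student-layer attention targets.

    teacher_attn: (T_layers, seq, seq) attention maps, averaged over heads.
    logits: (S_layers, T_layers) learnable mixing scores, one row per
            student layer.
    Returns (S_layers, seq, seq) mixed targets.
    """
    # each student layer gets a probability distribution over teacher layers
    weights = softmax(logits, axis=-1)
    # weighted sum of teacher maps: (s,t) x (t,i,j) -> (s,i,j)
    return np.einsum('st,tij->sij', weights, teacher_attn)
```

With all-zero scores the weights are uniform, so every student layer is matched to the plain average of the teacher's layers; training would then sharpen the weights toward whichever teacher layers carry the useful attention signal.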

Core claim

We introduce a novel CoT distillation framework that transfers the teacher's stepwise attention on key information to the student model. ... Our method achieves consistent performance improvements across multiple mathematical and commonsense reasoning datasets. To our knowledge, it is the first method to leverage stepwise attention within CoT distillation to improve small model reasoning.

Load-bearing premise

The framework rests on two assumptions: that the observed progressive attention shifts toward key information are causally important for correct reasoning and can be transferred to guide a student model's internal focus, and that the Mixture-of-Layers module can dynamically align layers without introducing misalignment artifacts.

Figures

Figures reproduced from arXiv: 2604.15701 by Jiawei Sheng, Tingwen Liu, Wenyuan Zhang, Yao Chen.

Figure 1
Figure 1: Stepwise attention on critical tokens implicitly … [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2: Progressive attention pattern on critical tokens [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3: The MoLSAKI framework consists of three components. In the example, the question and rationale have … [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗
Figure 4
Figure 4: We analyse the average column gradient dis… [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗
Figure 6
Figure 6: Comparative visualization of layer weight … [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗
Figure 7
Figure 7: We randomly sampled 100 instances from CommonSenseQA to analyze the average attention allocated by Qwen2.5-32B to critical tokens corresponding to keywords, relative to other tokens, at each step. view at source ↗
Figure 8
Figure 8: Layer weight visualization when τ2 = 1.0, τ1 = 0.1. view at source ↗
Figure 9
Figure 9: Prompt template for generating CoT of the teacher model with dual-phase pipeline. [PITH_FULL_IMAGE:figures/full_fig_p016_9.png] view at source ↗
Figure 10
Figure 10: Prompt template for generating keywords in the reasoning process of the teacher model. [PITH_FULL_IMAGE:figures/full_fig_p017_10.png] view at source ↗
Figure 11
Figure 11: Cases from SVAMP and GSM8K. [PITH_FULL_IMAGE:figures/full_fig_p018_11.png] view at source ↗
Figure 12
Figure 12: We select one example each from SVAMP and GSM8K, visualizing stepwise attention on numerical … [PITH_FULL_IMAGE:figures/full_fig_p019_12.png] view at source ↗
Figure 13
Figure 13: Stepwise attention heatmap on critical tokens from the teacher model Llama3-8B, which consists of 32 layers, for a specific example. For each layer, the average attention is computed over all attention heads. The horizontal axis represents the order of critical tokens; the vertical axis indicates the step number (counted from the beginning of the question). … view at source ↗
read the original abstract

The significant computational demands of large language models have increased interest in distilling reasoning abilities into smaller models via Chain-of-Thought (CoT) distillation. Current CoT distillation methods mainly focus on transferring teacher-generated rationales for complex reasoning to student models. However, they do not adequately explore teachers' dynamic attention toward critical information during reasoning. We find that language models exhibit progressive attention shifts towards key information during reasoning, which implies essential clues for drawing conclusions. Building on this observation and analysis, we introduce a novel CoT distillation framework that transfers the teacher's stepwise attention on key information to the student model. This establishes structured guidance for the student's progressive concentration on key information during reasoning. More importantly, we develop a Mixture of Layers module enabling dynamic alignment that adapts to different layers between the teacher and student. Our method achieves consistent performance improvements across multiple mathematical and commonsense reasoning datasets. To our knowledge, it is the first method to leverage stepwise attention within CoT distillation to improve small model reasoning.
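The abstract's "stepwise attention on key information" objective can be made concrete. As a sketch only, not the paper's actual loss: assume that for each reasoning step we have the attention each model spreads over the input tokens, plus a mask marking the critical tokens (numbers, keywords). A natural distillation target is then the per-step KL divergence between the teacher's and student's attention restricted to those critical tokens.

```python
import numpy as np

def stepwise_attention_loss(student_attn, teacher_attn, key_mask, eps=1e-12):
    """KL(teacher || student) over critical tokens, averaged across steps.

    student_attn, teacher_attn: (steps, seq) attention each reasoning step
        places on every input token.
    key_mask: (seq,) boolean, True at critical tokens.
    """
    s = student_attn[:, key_mask]
    t = teacher_attn[:, key_mask]
    # renormalize over critical tokens so each step is a distribution
    s = s / (s.sum(axis=-1, keepdims=True) + eps)
    t = t / (t.sum(axis=-1, keepdims=True) + eps)
    kl = (t * (np.log(t + eps) - np.log(s + eps))).sum(axis=-1)
    return kl.mean()
```

The loss is zero when the student already allocates attention to the key tokens in the same stepwise proportions as the teacher, and grows as its focus diverges; in the paper's framework this signal would be combined with the usual rationale-matching distillation loss.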

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the empirical observation that language models exhibit progressive attention shifts toward key information; the paper introduces one new module (Mixture of Layers) whose effectiveness is asserted but not derived from prior principles.

axioms (1)
  • domain assumption Language models exhibit progressive attention shifts towards key information during reasoning.
    Stated as the foundational observation in the abstract that motivates the entire framework.
invented entities (1)
  • Mixture of Layers module no independent evidence
    purpose: Enables dynamic alignment that adapts to different layers between teacher and student models.
    New component introduced to handle layer mismatch; no independent evidence of its necessity or mechanism is provided in the abstract.

pith-pipeline@v0.9.0 · 5479 in / 1397 out tokens · 42262 ms · 2026-05-10T08:52:08.557862+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

12 extracted references · 3 canonical work pages · 1 internal anchor
