Attention Illuminates LLM Reasoning: The Preplan-and-Anchor Rhythm Enables Fine-Grained Policy Optimization

Bo Zheng; Han Lu; Jiamang Wang; Jiashun Liu; Junchi Yan; Shaopan Xiong; Weixun Wang; Wenbo Su; Yang Li; Yijia Luo

arxiv: 2510.13554 · v2 · pith:PICR56T5new · submitted 2025-10-15 · 💻 cs.CL · cs.LG

Yang Li, Zhichen Dong, Yuhan Sun, Weixun Wang, Shaopan Xiong, Yijia Luo, Jiashun Liu, Han Lu, Jiamang Wang, Wenbo Su, Bo Zheng, Junchi Yan

keywords: attention, reasoning, tokens, optimization, focused, heads, token, across

Abstract

The reasoning patterns of large language models (LLMs) remain opaque, and reinforcement learning (RL) typically applies uniform credit across an entire generation, blurring the distinction between pivotal and routine steps. This work positions attention as a privileged substrate that renders the internal logic of LLMs legible, not merely as a byproduct of computation but as a mechanistic blueprint of reasoning itself. We first distinguish between attention heads that process information locally and those that process it globally, and reveal that locally focused heads produce a sawtooth pattern near the diagonal indicating phrasal chunks, while globally focused heads expose tokens that exert broad influence over future tokens. We formalize these observations with two metrics: 1) Windowed Average Attention Distance, which measures the extent of backward attention within a clipped window; 2) Future Attention Influence, which quantifies a token's global importance as the average attention it receives from subsequent tokens. Taken together, these signals reveal a recurring preplan-and-anchor mechanism, in which the model first performs a long-range contextual reference to generate an introductory token, which is immediately followed by, or coincides with, a semantic anchor token that organizes subsequent reasoning. Leveraging these insights, we introduce three novel RL strategies that dynamically perform targeted credit assignment to critical nodes (preplan tokens, anchor tokens, and their temporal coupling) and show consistent performance gains across various reasoning tasks. By aligning optimization with the model's intrinsic reasoning rhythm, we aim to transform opaque optimization into an actionable, structure-aware process, offering a potential step toward more transparent and effective optimization of LLM reasoning.
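The two metrics named in the abstract can be illustrated with a small NumPy sketch. The abstract gives only verbal definitions, so the formulas below are one plausible reading and not the paper's exact equations: the distance clipping, the normalization, and the function names `windowed_average_attention_distance`, `future_attention_influence`, and `uniform_causal_attention` are all assumptions made for illustration. The input is assumed to be a single causal attention matrix whose row `t` holds the weights that query token `t` places on tokens `0..t`.

```python
import numpy as np


def uniform_causal_attention(n):
    """Toy causal attention: token t attends equally to tokens 0..t."""
    return np.tril(np.ones((n, n))) / np.arange(1, n + 1)[:, None]


def windowed_average_attention_distance(attn, window):
    """Assumed reading of WAAD: for each query t, the attention-weighted
    mean of the backward distance (t - s), with distances clipped at `window`."""
    attn = np.asarray(attn, dtype=float)
    idx = np.arange(attn.shape[0])
    # Backward distance from query t (rows) to key s (columns), clipped to [0, window].
    dist = np.clip(idx[:, None] - idx[None, :], 0, window)
    # Keep only causal entries (s <= t) before weighting.
    return (np.tril(attn) * dist).sum(axis=1)


def future_attention_influence(attn):
    """Assumed reading of FAI: for each token s, the mean attention it
    receives from strictly later tokens t > s (0 for the last token)."""
    attn = np.asarray(attn, dtype=float)
    n = attn.shape[0]
    received = np.tril(attn, k=-1).sum(axis=0)  # column sums over rows t > s
    later = n - 1 - np.arange(n)                # number of later tokens per column
    return np.divide(received, later, out=np.zeros(n), where=later > 0)


attn = uniform_causal_attention(4)
waad = windowed_average_attention_distance(attn, window=2)
fai = future_attention_influence(attn)
print("WAAD:", waad)
print("FAI: ", fai)
```

Under this reading, a token with low WAAD attends mostly to its immediate neighbors, while a token with high FAI is one that many later tokens look back to. In practice the attention matrix would come from a chosen head, or an average over a group of heads, of the model being analyzed; that selection is outside the scope of this sketch.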

Forward citations

Cited by 6 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

From Table to Cell: Attention for Better Reasoning with TABALIGN
cs.AI 2026-05 unverdicted novelty 7.0

TABALIGN pairs a diffusion language model planner emitting binary cell masks with a trained attention verifier, raising average accuracy 15.76 points over strong baselines on eight table benchmarks while speeding exec...

Self-Distilled RLVR
cs.LG 2026-04 unverdicted novelty 7.0

RLSD mixes self-distillation for token-level policy difference magnitudes with RLVR for reliable update directions from response correctness to reach higher convergence and better training stability.

Reinforcement Learning without Ground-Truth Solutions can Improve LLMs
cs.LG 2026-06 unverdicted novelty 6.0

RiVER applies calibrated ranking rewards from execution scores to train LLMs on score-based tasks without ground-truth, producing gains on both heuristic contests and exact-solution coding benchmarks.

Beyond Fully Random Masking: Attention-Guided Denoising and Optimization for Diffusion Language Models
cs.CL 2026-06 unverdicted novelty 6.0

AGDO improves dLLM reasoning performance by determining denoising order and emphasizing tokens based on attention-derived dependencies rather than random masking.

What and When to Distill: Selective Hindsight Distillation for Multi-Turn Agents
cs.AI 2026-05 unverdicted novelty 6.0

SERL selectively reweights learning using task success and environment feedback to reach 90.0% success on ALFWorld and 80.1% on WebShop, outperforming RL and distillation baselines.

VIMPO: Value-Implicit Policy Optimization for LLMs
cs.LG 2026-06 unverdicted novelty 5.0

VIMPO derives a policy-implied value function from optimality conditions for critic-free RL in LLMs and shows gains over GRPO on math benchmarks.