pith. machine review for the scientific record. sign in

arxiv: 2511.08577 · v2 · submitted 2025-11-11 · 💻 cs.CL · cs.AI· cs.LG· cs.PF

Recognition: unknown

Think-at-Hard: Selective Latent Iterations to Improve Reasoning Language Models

Guohao Dai, Huazhong Yang, Tianyu Fu, Yichen You, Yu Wang, Zekai Chen

Authors on Pith no claims yet
classification 💻 cs.CL cs.AIcs.LGcs.PF
keywords iterationslatentiterationpasstokenattentionbaselinesdecider
0
0 comments X
read the original abstract

Improving reasoning abilities of Large Language Models (LLMs), especially under parameter constraints, is crucial for real-world applications. Looped transformers address this by performing multiple latent iterations to refine each token beyond a single forward pass. However, we identify a latent overthinking phenomenon: most token predictions are already correct after the first pass, but are sometimes revised into errors in later iterations. In this work, we ask whether selectively skipping latent iterations may improve accuracy. We reveal significant potential with an oracle iteration policy that boosts model performance by up to 7.3%. Motivated by this, we propose Think-at-Hard (TaH), a looped transformer optimized for selective iteration. TaH employs a lightweight neural decider to trigger latent iteration only at tokens that are likely incorrect after the standard forward pass. During latent iterations, depth-aware Low-Rank Adaptation (LoRA) modules shift the LLM's objective from general next-token prediction to focused hard-token refinement. A duo-causal attention mechanism extends attention from the token sequence dimension to an additional iteration depth dimension, enabling cross-iteration information flow with full sequential parallelism. Experiments on nine benchmarks show consistent gains across math, QA, and coding tasks. With identical parameter counts, TaH outperforms always-iterate baselines by 3.8-4.4% while skipping iterations on 93% of tokens, and exceeds single-iteration Qwen3 baselines by 3.0-3.8%. When allowing <3% more parameters from LoRA and decider modules, the gains further increase to 5.3-6.2% and 6.1-6.8%, respectively. Our code is available at https://github.com/thu-nics/TaH.

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. How Much Is One Recurrence Worth? Iso-Depth Scaling Laws for Looped Language Models

    cs.LG 2026-04 unverdicted novelty 7.0

    A fitted iso-depth scaling law measures that one recurrence in looped transformers is worth r^0.46 unique blocks in validation loss.

  2. LASAR: Latent Adaptive Semantic Aligned Reasoning for Generative Recommendation

    cs.IR 2026-05 unverdicted novelty 6.0

    LASAR uses two-stage supervised training plus reinforcement learning to ground semantic IDs, align latent reasoning trajectories to CoT hidden states via KL divergence, and adaptively choose reasoning depth, halving a...

  3. Memory-Efficient Looped Transformer: Decoupling Compute from Memory in Looped Language Models

    cs.CL 2026-05 unverdicted novelty 6.0

    MELT decouples reasoning depth from memory in looped LLMs by sharing a single gated KV cache per layer and using two-phase chunk-wise distillation from Ouro, delivering constant memory use while matching or beating st...