pith. machine review for the scientific record.

arxiv: 2605.12922 · v1 · submitted 2026-05-13 · 💻 cs.AI · cs.CL

Recognition: unknown

When Attention Closes: How LLMs Lose the Thread in Multi-Turn Interaction

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 20:16 UTC · model grok-4.3

classification 💻 cs.AI · cs.CL
keywords attention mechanisms · multi-turn interaction · LLM forgetting · residual streams · goal accessibility · channel closure · mechanistic interpretability · instruction following

The pith

LLMs lose track of instructions in multi-turn chats when attention to goal tokens fades, though residual streams may still encode the needed information.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models follow complex instructions in one turn but often drift from goals, personas, and rules over many turns. The paper explains this through a channel-transition account: attention to task-defining tokens drops while goal information can remain in residual representations. The authors measure the drop with the Goal Accessibility Ratio, then use sliding-window ablations and linear probes on residuals to track what survives. Across architectures, the gap between attention loss and residual decodability forecasts whether goal-conditioned behavior continues or collapses. Force-closing attention in one model drops recall sharply and raises rule violations at the exact turns the measures predict.

Core claim

We propose that goal-defining tokens become less accessible through attention while goal-related information persists in residual representations. Using the Goal Accessibility Ratio together with sliding-window ablations and residual-stream probes, we show that the gap between attention loss and residual decodability predicts whether goal-conditioned behavior survives channel closure across architectures and scales. In a causal ablation on Mistral, force-closing the attention channel collapses recall from near-perfect to 11 percent on a 20-fact task and raises persona-constraint violations, with both effects appearing at the predicted crossover turn. Linear probes recover per-episode recall outcomes from residual representations with AUC up to 0.99, while input embeddings remain at chance.

What carries the argument

The Goal Accessibility Ratio (GAR), which measures attention from generated tokens to task-defining goal tokens, combined with sliding-window ablations and linear probes on residual streams.
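The paper's precise GAR formula is not reproduced in this review, so the sketch below is one plausible reading under stated assumptions: GAR at a turn is taken to be the mean attention mass that generated-token positions place on the goal-token span, averaged over layers and heads. The model name, span indices, and aggregation choices are illustrative, not the authors' specification.

```python
# A minimal sketch of a GAR-style metric, assuming GAR is the mean attention
# mass that generated tokens place on goal-token positions, averaged over
# layers and heads. Model name, spans, and aggregation are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "mistralai/Mistral-7B-Instruct-v0.2"  # hypothetical choice

tok = AutoTokenizer.from_pretrained(MODEL)
# "eager" attention is needed so the forward pass can return attention maps
model = AutoModelForCausalLM.from_pretrained(MODEL, attn_implementation="eager")
model.eval()

def gar(input_ids: torch.Tensor, goal_span: tuple[int, int], gen_start: int) -> float:
    """Mean attention mass from generated positions onto the goal-token span."""
    with torch.no_grad():
        out = model(input_ids, output_attentions=True)
    attn = torch.stack(out.attentions)            # (layers, batch, heads, q, k)
    g0, g1 = goal_span                            # e.g. the system-prompt tokens
    mass = attn[..., gen_start:, g0:g1].sum(-1)   # mass each query puts on goals
    return mass.mean().item()                     # average over layers/heads/queries
```

Computed once per turn, this value would trace the GAR(τ) trajectory whose crossing of a model-specific threshold θ_M marks the crossover turn τ_cross described in the figures.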

If this is right

  • When attention to instructions closes, models show qualitatively distinct failure modes depending on architecture.
  • Linear probes recover per-episode recall outcomes from residual representations with AUC up to 0.99, while input embeddings stay at chance.
  • Force-closing the attention channel in Mistral collapses recall to 11 percent on a 20-fact retention task.
  • The layer at which goal encoding emerges in residuals varies from 2 to 27 across models.
  • Failure timing becomes predictable under windowed attention closure.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Monitoring GAR during deployment could flag upcoming instruction drift before behavioral failure appears (a sketch follows this list).
  • Targeted interventions on residual streams might extend goal retention in models where residuals already encode the information.
  • The same attention-versus-residual gap may explain degradation in other long-sequence tasks such as multi-step planning.
  • Training methods that strengthen residual goal encoding could reduce reliance on sustained attention.
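A minimal sketch of the monitoring idea from the first bullet, reusing the gar() function sketched earlier; the alarm threshold is a placeholder, not a calibrated θ_M from the paper.

```python
# Hypothetical deployment-time monitor built on the gar() sketch above.
# THETA is a placeholder alarm level, not a calibrated theta_M.
THETA = 0.02

def goal_drift_alarm(input_ids, goal_span, gen_start) -> bool:
    """True when attention to the goal span has fallen below the alarm level."""
    return gar(input_ids, goal_span, gen_start) < THETA
```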

Load-bearing premise

Linear probes on residual streams recover causally relevant goal information rather than spurious correlations, and sliding-window ablations isolate the attention channel without confounding other mechanisms.
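For concreteness, a minimal sketch of what such a residual-stream probe looks like, assuming cached per-episode activations at one layer and binary recall labels. The data here is synthetic, so the printed AUC sits near chance by construction; layer choice, split, and regularization are assumptions, not the paper's setup.

```python
# A minimal sketch of a residual-stream probe: logistic regression from one
# layer's residual activations to per-episode recall outcomes, scored by AUC.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(512, 4096))   # (episodes, d_model) residuals at one layer; synthetic
y = rng.integers(0, 2, size=512)   # 1 if the fact was later recalled; synthetic

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(f"probe AUC: {roc_auc_score(y_te, probe.decision_function(X_te)):.3f}")
```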

What would settle it

An experiment that keeps attention to goal tokens high while disrupting the residual goal encoding (or vice versa) and checks whether behavior follows the residual state or the attention state.
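One hypothetical implementation of the residual half of that experiment: ablate the probe's learned direction from the residual stream with a forward pre-hook while leaving attention to goal tokens untouched, then test whether behavior follows the residual state. The layer index, module path, and direction vector below are placeholders, not details from the paper.

```python
# Hypothetical residual-direction ablation: remove the component of the hidden
# state along a probe-derived direction w at one layer, leaving attention intact.
import torch

def make_direction_ablation(direction: torch.Tensor):
    d = direction / direction.norm()
    def pre_hook(module, args):
        (hidden,) = args                        # (batch, seq, d_model)
        proj = (hidden @ d).unsqueeze(-1) * d   # component along the direction
        return (hidden - proj,)                 # pass the ablated state onward
    return pre_hook

# Sketch of wiring it up (layer 20 is a placeholder; w comes from the trained probe):
# handle = model.model.layers[20].register_forward_pre_hook(make_direction_ablation(w))
# ...generate and score recall under ablation, then: handle.remove()
```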

Figures

Figures reproduced from arXiv:2605.12922 by Dilek Hakkani-Tür, Joseph Hsieh, Seunghyun Yoon, Trung Bui, Vardhan Dongre, Viet Dac Lai.

Figure 1: The two-channel framework. (A) The attention channel C_attn(τ) from response tokens R_τ to goal tokens G decays with τ; the residual channel C_res(τ) at the response position carries goal information forward through layers. (B) GAR(τ) crosses the model-specific threshold θ_M at τ_cross; for τ ≥ τ_cross direct attention to G is removed, so goal-related information that survives must be carried indirectly (e.g…

Figure 2: Overview of the four tasks: system-prompt goal, conversation schedule, and per-turn …

Figure 3: GAR trajectories per architecture under default conversational pressure (left) and the sliding …

Figure 4: Window-size sweep on Mistral-7B and LLaMA-3.1-8B. Each point is the empirical crossover turn τ_cross at the corresponding window size W. The Attention Channel Closes: Accessibility to goal-defining tokens (measured by GAR; see § 2.2 and Appendix B) declines monotonically with conversation turn across every LLM architecture we evaluate. Beyond the four primary models, this trend is supported by an extended …

Figure 5: Fact recall trajectories for the controlled-complexity task under open-channel (default) and closed-channel conditions. With direct attention to goal tokens at numerical floor, the post-crossover regime across the architectures degrades but does not collapse completely. Across episodes, the model recovers some facts under the all-at-once probe (T = 50) even with no direct attention to those facts during r…

Figure 6: Residual probe AUC by layer per architecture, …

Figure 7: Conversation structure for Task 1 (information retention) & Task 2 (Controlled Complexity).

Figure 8: Conversation structure for Task 3 (persona compliance) under three pressure conditions.

Figure 9: Cross-architecture residual-probe AUC-by-layer curves at two fixed post-closure probe …

Figure 10: Qwen-2.5 cross-scale validation under SW= …

Figure 11: Mistral-7B behavioral analysis under default vs SW=4096 inference.

Figure 12: Inter-annotator reliability on the 50-transcript policy-compliance calibration set. Left: …

Figure 13: Confusion matrices for policy-compliance labeling on the 50-transcript calibration set.

Figure 14: Prompt template used by the impartial policy-compliance judge.
read the original abstract

Large language models can follow complex instructions in a single turn, yet over long multi-turn interactions they often lose the thread of instructions, persona, and rules. This degradation has been measured behaviorally but not mechanistically explained. We propose a channel-transition account: goal-defining tokens become less accessible through attention, while goal-related information may persist in residual representations. We introduce the Goal Accessibility Ratio (GAR), measuring attention from generated tokens to task-defining goal tokens, and combine it with sliding-window ablations and residual-stream probes. When attention to instructions closes, what survives reveals architecture. Across architectures, the transition yields qualitatively distinct failure modes: some models preserve goal-conditioned behavior at vanishing attention, others fail despite decodable residual goal information, and the layer at which this encoding emerges varies from 2 to 27. A within-model causal ablation that force-closes the attention channel in Mistral collapses recall from near-perfect to 11% on a 20-fact retention task and raises persona-constraint violations above an adversarial-pressure baseline without user pressure, with both effects emerging at the predictable crossover turn. Linear probes recover per-episode recall outcomes from residual representations with AUC up to 0.99 across all four primary architectures, while input embeddings remain at chance. Across architectures and model scales, the gap between attention loss and residual decodability predicts whether goal-conditioned behavior survives channel closure. We contribute GAR as a diagnostic, the channel-transition framework as a controlled mechanistic account, and a parametric prediction of failure timing under windowed attention closure.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces a channel-transition account for how LLMs lose goal information during multi-turn interactions. It defines the Goal Accessibility Ratio (GAR) from attention weights to goal tokens, combines it with sliding-window ablations and linear probes on residual streams, and claims that the gap between attention loss and residual decodability predicts survival of goal-conditioned behavior across architectures. A causal ablation on Mistral shows recall collapsing to 11% under forced attention closure, with probes achieving AUC up to 0.99 on residuals versus chance on embeddings.

Significance. If the results hold, the work supplies a mechanistic framework distinguishing attention and residual channels for context degradation, with GAR as a practical diagnostic and architecture-specific predictions of failure timing. The within-model causal ablation and cross-architecture patterns are notable strengths that could inform context-management methods.

major comments (2)
  1. [Residual-stream probes (abstract and methods)] The linear probes achieve AUC up to 0.99 for predicting per-episode recall from residuals, yet the manuscript reports only post-hoc classification accuracy without activation patching, causal tracing, or direction ablation to test whether perturbing the probed residual directions alters goal-conditioned generation under closed attention. This is load-bearing for the central claim that the attention-residual gap predicts behavioral survival.
  2. [Sliding-window ablations] The sliding-window ablations are presented as isolating the attention channel, but potential residual effects on KV-cache statistics or positional signals are not addressed or controlled, which risks confounding the attribution of distinct failure modes specifically to the attention/residual distinction.
minor comments (2)
  1. [Abstract and results] The abstract and results sections would benefit from explicit reporting of error bars, exact data splits, and statistical tests for the AUC values and ablation outcomes to support verification.
  2. [GAR definition] Clarify the precise formula and hyperparameters for GAR computation, including window size and attention aggregation, in the main text or an early methods subsection.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important points about causal validation and experimental controls. We address each major comment below and have made partial revisions to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Residual-stream probes (abstract and methods)] The linear probes achieve AUC up to 0.99 for predicting per-episode recall from residuals, yet the manuscript reports only post-hoc classification accuracy without activation patching, causal tracing, or direction ablation to test whether perturbing the probed residual directions alters goal-conditioned generation under closed attention. This is load-bearing for the central claim that the attention-residual gap predicts behavioral survival.

    Authors: We appreciate this observation. The manuscript's central causal evidence is the direct within-model attention-closure intervention on Mistral, which forces the attention channel closed and collapses recall from near-perfect to 11% while elevating persona violations above the adversarial baseline. This intervention demonstrates that goal-conditioned behavior depends on the attention channel even when residuals retain decodable goal information (probes reach AUC 0.99 while embeddings remain at chance). The probes themselves are presented as providing predictive evidence of residual retention rather than as a standalone causal demonstration. We agree that activation patching or direction ablation on the probed residual directions would provide stronger causal support for the residual channel's role. We will revise the methods and discussion sections to explicitly frame the probes as correlational/predictive, clarify that the attention-closure ablation supplies the primary causal test of the gap, and add a limitations paragraph noting the absence of residual-direction interventions as an avenue for future work. revision: partial

  2. Referee: [Sliding-window ablations] The sliding-window ablations are presented as isolating the attention channel, but potential residual effects on KV-cache statistics or positional signals are not addressed or controlled, which risks confounding the attribution of distinct failure modes specifically to the attention/residual distinction.

    Authors: We acknowledge the potential for residual confounds in the sliding-window setup. The sliding-window ablations were intended to simulate progressive attention closure, but KV-cache statistics and positional encodings could indeed carry over information. Our strongest causal result, however, comes from the separate direct attention-forcing ablation on Mistral rather than from the sliding-window results alone. We will revise the methods section to explicitly discuss these possible confounds, add controls (e.g., verification that positional encodings remain intact and that KV-cache content is not artificially altered beyond attention masking), and clarify that cross-architecture patterns are interpreted in light of the direct causal intervention. This will better isolate the attention/residual distinction. revision: partial
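To make the proposed control concrete, a minimal sketch of windowed masking implemented purely at the attention-mask level, which by construction leaves positional encodings and cached key/value contents untouched; the window size is illustrative.

```python
# A minimal sketch of a sliding-window attention mask: only which keys each
# query may attend to changes; positions and cached KV contents are untouched.
import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean (query, key) mask: True where attention is permitted."""
    q = torch.arange(seq_len).unsqueeze(1)
    k = torch.arange(seq_len).unsqueeze(0)
    return (k <= q) & ((q - k) < window)   # causal AND within the last `window` keys

print(sliding_window_mask(seq_len=8, window=4).int())
```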

Circularity Check

0 steps flagged

No significant circularity in the channel-transition account

full rationale

The paper defines GAR directly from attention weights to goal tokens and trains linear probes to classify per-episode recall outcomes from residual streams, reporting high AUC. The central claim—that the gap between attention loss and residual decodability predicts survival of goal-conditioned behavior—is an empirical cross-architecture correlation supported by independent behavioral measurements and a causal ablation (force-closing attention in Mistral, dropping recall to 11%). No step reduces a prediction to its inputs by construction: the probe AUC measures correlation with outcomes but does not mathematically force the survival rates or the gap's predictive power, which are evaluated separately across models and via the ablation. The derivation chain is checked against external behavioral benchmarks rather than being self-referential.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

The account rests on standard transformer assumptions plus the new claim that attention closure is the primary channel while residuals remain informative; no explicit free parameters are named in the abstract.

axioms (2)
  • domain assumption Attention weights directly measure accessibility of goal tokens to later generations
    Central to GAR definition and channel-transition account
  • domain assumption Residual stream representations can be linearly decoded for goal information without the attention channel
    Required for the probe results and survival prediction
invented entities (2)
  • Goal Accessibility Ratio (GAR) no independent evidence
    purpose: Quantify attention from generated tokens to task-defining goal tokens
    New diagnostic metric introduced in the paper
  • channel-transition account no independent evidence
    purpose: Explain degradation as attention closure while residuals persist
    Core explanatory framework of the paper

pith-pipeline@v0.9.0 · 5599 in / 1371 out tokens · 31482 ms · 2026-05-14T20:16:15.218921+00:00 · methodology

