pith. sign in

arxiv: 2605.30981 · v1 · pith:26D3XP3Dnew · submitted 2026-05-29 · 💻 cs.CL · cs.LG

Cognitive Fatigue in Autoregressive Transformers: Formalization and Measurement

Pith reviewed 2026-06-28 22:25 UTC · model grok-4.3

classification 💻 cs.CL cs.LG
keywords cognitive fatigueFatigue Indexautoregressive transformersruntime monitoringattention decayrepresentational driftentropy miscalibrationLLM reliability
0
0 comments X

The pith

Autoregressive language models suffer from cognitive fatigue in long generations, which can be diagnosed in real time by the Fatigue Index combining three degradation signals.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper defines cognitive fatigue as a state where models lose focus on the initial prompt, drift in their internal representations, and miscalibrate their output probabilities during extended text generation. It proposes the Fatigue Index to measure this by combining those signals according to rules that ensure the measure increases as fatigue worsens, stays within bounds, and remains easy to interpret. A reader would care because this offers an online way to spot when a model is about to repeat itself or ignore instructions, rather than waiting for the output to fail. Experiments on models from 1B to 13B show the index rises in predictable ways and forecasts problems accurately.

Core claim

We formalize degradation during long-horizon generation as cognitive fatigue, characterized by decay in attention to the original prompt, representational drift, and entropy miscalibration. We introduce the Fatigue Index as a model-agnostic diagnostic that aggregates these signals under axioms of monotonicity, boundedness, and interpretability. Across nine models, FI trajectories predict task degradation with AUROC of 0.95 and repetition with Spearman rho of 0.94, and show scaling behaviors and sensitivity to context length and precision.

What carries the argument

The Fatigue Index (FI), a lightweight diagnostic that aggregates prompt attention decay, representational drift, and entropy miscalibration under the axioms of monotonicity, boundedness, and interpretability to enable runtime monitoring of generation quality.

If this is right

  • FI can be used for real-time reliability monitoring in production LLM systems.
  • Instruction-tuned models show different fatigue onset depending on size, with reversal around 7B parameters.
  • Fatigue onset speeds up with longer contexts, middle-positioned evidence, and lower numerical precision.
  • FI trajectories are structured and can predict when repetition or loss of instruction adherence will occur.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If FI works as claimed, systems could dynamically adjust generation parameters like temperature when fatigue is detected to maintain coherence longer.
  • The metric might extend to measuring degradation in other autoregressive generation domains such as code or structured data.
  • Observing FI could help compare the long-context stability of different training regimes beyond standard benchmarks.

Load-bearing premise

That the combination of attention decay, representational drift, and entropy miscalibration under monotonicity, boundedness, and interpretability axioms produces a diagnostic that is generalizable and does not circularly assume the fatigue it measures.

What would settle it

If the Fatigue Index fails to rise before observable increases in repetition or task failures in long-horizon generation experiments, or if it rises without corresponding degradation, the formalization would be challenged.

Figures

Figures reproduced from arXiv: 2605.30981 by Amit Sheth, Atishay Jain, Michael Stewart, Riju Marwah, Ritvik Garimella, Vishal Pallagani.

Figure 1
Figure 1. Figure 1: Overview of the cognitive fatigue measurement framework. (a) A prompt and context are fed to a decoder-only model, with probes extracting attention logits and hidden states. (b) Three signals are computed online at each decoding step: prompt-attention decay, entropy deviation, and embedding drift, normalized and aggregated into the Fatigue Index (FI). Evaluation datasets are used only to validate FI, not t… view at source ↗
Figure 3
Figure 3. Figure 3: Attention to evidence tokens as a function of their posi￾tion in the prompt, measured over 60 generation steps. Evidence placed at the front of the context receives 5–10× higher mean attention than evidence placed in the middle or at the end, and this gap is stable across all steps, confirming a strong primacy bias that may cause the model to underweight late-context evidence during reasoning. 6.3. One-tim… view at source ↗
Figure 5
Figure 5. Figure 5: Universal fatigue trajectories across domains (OPT￾2.7B). Mean Fatigue Index (FI) over generated token positions for HotpotQA (reasoning), TriviaQA (knowledge), and SQuAD (comprehension). Shaded regions denote 95% confidence intervals. All three datasets exhibit a sharp FI rise within the first 5 tokens followed by a consistent logarithmic increase toward saturation near FI 0.94, indicating that output fat… view at source ↗
Figure 6
Figure 6. Figure 6: Per-family cognitive stamina (mean ± 95% CI) under matched decoding conditions. Base models show a gradual mono￾tonic decline in final entropy with scale, while instruction-tuned models exhibit a non-monotonic pattern, collapsing earlier at sub￾3B scales, recovering near 7B, then collapsing sharply at 13B. This divergence suggests that alignment fine-tuning imposes a scale-dependent cost on output diversit… view at source ↗
Figure 7
Figure 7. Figure 7: Embedding drift slope per model, sorted in ascending order (lower is better). Despite spanning 1B to 13B parameters, most models cluster tightly between drift slopes of 0.08–0.14, with BLOOM-3B as a high-variance outlier. The absence of a clear size-dependent trend confirms that drift stability is largely architecture- and training-regime-dependent rather than a function of scale. Larger models do not drif… view at source ↗
Figure 8
Figure 8. Figure 8: Robustness landscape of the Fatigue Index (FI) definition over the weight simplex (A + D + E = 1). Each point represents a candidate weighting of attention (A), drift (D), and entropy (E), colored by its AUROC against downstream degeneration labels. The landscape is smooth and structured rather than erratic, with a broad high-AUROC region concentrated at moderate-to-high attention weights. Our chosen defin… view at source ↗
Figure 9
Figure 9. Figure 9: Supplementary precision stress results on HotpotQA (FP16 vs. 4-bit NF4, 100 generation steps). Entropy under 4-bit quantization exhibits deeper troughs and higher variance than FP16, consistent with increased susceptibility to predictive calibration collapse. Attention-to-prompt and embedding drift trajectories remain closely matched across precisions, replicating the main finding that quantization selecti… view at source ↗
Figure 10
Figure 10. Figure 10: Supplementary precision stress results on TriviaQA (FP16 vs. 4-bit NF4, 100 generation steps). Entropy trajectories under 4-bit quantization exhibit greater depth and variability of collapse compared to FP16, replicating the pattern observed on HotpotQA. Attention￾to-prompt and embedding drift remain closely matched across precisions, confirming that reduced precision selectively destabilizes predictive c… view at source ↗
read the original abstract

Autoregressive language models frequently degrade during long-horizon generation, producing repetitive text, losing instruction adherence, and exhibiting unstable entropy. Despite the prevalence of these failures, practitioners lack online diagnostics to detect them in real-time as they occur. We formalize this degradation as cognitive fatigue, a measurable generation-time state characterized by decay in attention to the original prompt, representational drift, and entropy miscalibration. We introduce the Fatigue Index (FI), a lightweight, model-agnostic diagnostic that aggregates these three signals under explicit axioms (monotonicity, boundedness, interpretability) enabling reliable runtime monitoring. Across nine models (1B-13B parameters), FI trajectories exhibit structured temporal dynamics, predict task degradation (AUROC = 0.95) and repetition (Spearman rho = 0.94), and reveal non-monotonic scaling behavior: instruction-tuned models below 3B exhibit faster collapse than base models, with this trend reversing at 7B. Stress analyses further show that FI onset accelerates under longer contexts, middle-positioned evidence, and reduced numerical precision. These results establish cognitive fatigue as a coherent and measurable phenomenon, and position FI as a principled tool for runtime reliability monitoring in production LLM systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript formalizes degradation during long-horizon autoregressive generation as 'cognitive fatigue,' a state characterized by decay in attention to the original prompt, representational drift, and entropy miscalibration. It defines the Fatigue Index (FI) as the aggregation of these three signals under the axioms of monotonicity, boundedness, and interpretability. Experiments on nine models (1B–13B) report that FI trajectories show structured dynamics, achieve AUROC=0.95 in predicting task degradation and Spearman rho=0.94 in predicting repetition, and exhibit non-monotonic scaling (faster collapse in small instruction-tuned models, reversal at 7B) plus sensitivity to context length, evidence position, and numerical precision.

Significance. If the axioms independently justify signal selection and aggregation without reference to the outcome metrics, and if the reported predictive performance is robust to the validation procedure, the work would supply a lightweight, model-agnostic runtime diagnostic with clear practical value for production LLM monitoring. The scaling and stress-test results would further contribute to understanding of generation dynamics. The significance is currently limited by the unresolved circularity risk between definition, construction, and validation.

major comments (2)
  1. [Abstract] Abstract: The definition of cognitive fatigue is given directly in terms of the three signals whose aggregation forms FI; FI is then shown to predict the same class of degradation phenomena (task degradation, repetition) that the signals were chosen to capture. Unless the formalization section demonstrates that the axioms of monotonicity, boundedness, and interpretability alone suffice to select and combine exactly these signals a priori, the high AUROC and rho values risk being circular rather than confirmatory.
  2. [Formalization] Formalization and Methods: The manuscript must supply the explicit derivation of the FI aggregation function from the three axioms without reference to the empirical targets (AUROC, rho). If any weighting, normalization, or signal inclusion decision was informed by observed correlations with task degradation or repetition, the claim that FI is 'axiom-driven' and 'principled' does not hold.
minor comments (1)
  1. [Abstract] Abstract: The phrase 'nine models (1B-13B parameters)' should list the exact model families and sizes to allow immediate assessment of the scaling claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the risk of circularity between the definition of cognitive fatigue, the construction of the Fatigue Index, and its empirical validation. We address each major comment below and commit to revisions that strengthen the a priori justification of the formalization.

read point-by-point responses
  1. Referee: [Abstract] The definition of cognitive fatigue is given directly in terms of the three signals whose aggregation forms FI; FI is then shown to predict the same class of degradation phenomena (task degradation, repetition) that the signals were chosen to capture. Unless the formalization section demonstrates that the axioms of monotonicity, boundedness, and interpretability alone suffice to select and combine exactly these signals a priori, the high AUROC and rho values risk being circular rather than confirmatory.

    Authors: We agree that the current presentation risks appearing circular if signal selection is not clearly separated from validation. The three signals (attention decay to prompt, representational drift, entropy miscalibration) are introduced in the formalization as the observable components of the degradation state described in the introduction, prior to any aggregation or predictive experiments. The axioms apply only to the aggregation step to ensure the resulting index is monotonic, bounded, and interpretable. The AUROC and rho results are computed on held-out generation trajectories and tasks not used in any design choice. To eliminate ambiguity, we will revise the abstract and formalization section to state the signal selection rationale explicitly before introducing the axioms and to emphasize that predictive metrics constitute independent validation. revision: partial

  2. Referee: [Formalization] The manuscript must supply the explicit derivation of the FI aggregation function from the three axioms without reference to the empirical targets (AUROC, rho). If any weighting, normalization, or signal inclusion decision was informed by observed correlations with task degradation or repetition, the claim that FI is 'axiom-driven' and 'principled' does not hold.

    Authors: We accept this requirement. The manuscript currently states that the aggregation satisfies the three axioms but does not provide a step-by-step derivation of the specific functional form (including any normalization or weighting) solely from monotonicity, boundedness, and interpretability. We will add a dedicated subsection deriving the FI formula from the axioms alone, with no reference to predictive performance. If any design decision was influenced by preliminary correlations, we will disclose it and revise the 'axiom-driven' language accordingly; otherwise, we will confirm that all choices follow directly from the axioms. revision: yes

Circularity Check

1 steps flagged

Fatigue Index aggregates the exact signals used to define cognitive fatigue by construction

specific steps
  1. self definitional [Abstract]
    "We formalize this degradation as cognitive fatigue, a measurable generation-time state characterized by decay in attention to the original prompt, representational drift, and entropy miscalibration. We introduce the Fatigue Index (FI), a lightweight, model-agnostic diagnostic that aggregates these three signals under explicit axioms (monotonicity, boundedness, interpretability)"

    The phenomenon (cognitive fatigue) is defined directly in terms of the three signals; FI is then defined as the aggregation of precisely those same signals. The measure is therefore equivalent to the definition by construction rather than derived from independent first principles or external validation criteria.

full rationale

The abstract explicitly defines cognitive fatigue as the degradation state characterized by the three signals and then defines FI as their direct aggregation under axioms. This creates a self-definitional loop: the diagnostic is constructed from the defining characteristics of the phenomenon it diagnoses, and its reported predictive power (AUROC 0.95 on task degradation) is then measured against outcomes that overlap with the original degradation symptoms. No independent derivation or external grounding separates the inputs from the constructed measure.

Axiom & Free-Parameter Ledger

0 free parameters · 3 axioms · 1 invented entities

The central claim rests on the newly introduced concept of cognitive fatigue defined by three internal signals and the Fatigue Index constructed as their aggregation under three explicit axioms. No free parameters are mentioned. The entity cognitive fatigue is postulated without independent evidence outside the paper.

axioms (3)
  • ad hoc to paper monotonicity
    Explicit axiom stated for the Fatigue Index to enable reliable runtime monitoring.
  • ad hoc to paper boundedness
    Explicit axiom stated for the Fatigue Index.
  • ad hoc to paper interpretability
    Explicit axiom stated for the Fatigue Index.
invented entities (1)
  • cognitive fatigue no independent evidence
    purpose: To name and formalize the observed degradation state during long-horizon autoregressive generation
    New postulated state characterized by the three signals; no falsifiable handle outside the paper is provided.

pith-pipeline@v0.9.1-grok · 5763 in / 1425 out tokens · 38989 ms · 2026-06-28T22:25:15.564796+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

19 extracted references · 14 canonical work pages · 8 internal anchors

  1. [1]

    and Luqman, H

    Alansari, A. and Luqman, H. Large language models hal- lucination: A comprehensive survey.arXiv preprint arXiv:2510.06265,

  2. [2]

    Layer Normalization

    Ba, J. L., Kiros, J. R., and Hinton, G. E. Layer normalization. arXiv preprint arXiv:1607.06450,

  3. [3]

    Extending Context Window of Large Language Models via Positional Interpolation

    Chen, S., Wong, S., Chen, L., and Tian, Y . Extending context window of large language models via positional interpolation.arXiv preprint arXiv:2306.15595,

  4. [4]

    QLoRA: Efficient Finetuning of Quantized LLMs

    Dettmers, T., Pagnoni, A., Holtzman, A., and Zettlemoyer, L. QLoRA: Efficient finetuning of quantized LLMs.arXiv preprint arXiv:2305.14314,

  5. [5]

    Farquhar, S., Kossen, J., Kuhn, L., and Gal, Y

    Verify this is the intended citation; paper text references chain-of-thought faithfulness but arXiv may correspond to Faith and Fate (NeurIPS 2023). Farquhar, S., Kossen, J., Kuhn, L., and Gal, Y . Detecting hallucinations in large language models using semantic entropy.Nature, 630(8017):625–630,

  6. [6]

    How (not) to Train your Generative Model: Scheduled Sampling, Likelihood, Adversary?

    Husz´ar, F. How (not) to train your generative model: Sched- uled sampling, likelihood, adversary?arXiv preprint arXiv:1511.05101,

  7. [7]

    Training LLMs for honesty via con- fessions.arXiv preprint arXiv:2512.08093,

    Joglekar, M., Chen, J., Wu, G., Yosinski, J., Wang, J., Barak, B., and Glaese, A. Training LLMs for honesty via con- fessions.arXiv preprint arXiv:2512.08093,

  8. [8]

    and Zheng, R

    Kadem, M. and Zheng, R. Interpreting transformers through attention head intervention.arXiv preprint arXiv:2601.04398,

  9. [9]

    Kayan, C

    Cited as CREST in text; verify arXiv ID matches intended paper before final submission. Kayan, C. E. and Zhang, L. Prototype-based dynamic steering for large language models.arXiv preprint arXiv:2510.05498,

  10. [10]

    A Diversity-Promoting Objective Function for Neural Conversation Models

    Li, J., Galley, M., Brockett, C., Gao, J., and Dolan, B. A diversity-promoting objective function for neural conver- sation models.arXiv preprint arXiv:1510.03055,

  11. [11]

    Measuring and controlling instruction instability in large language model dialogs

    Li, K., Liu, T., Bashkansky, N., Bau, D., Vi ´egas, F., Pfis- ter, H., and Wattenberg, M. Measuring and controlling instruction instability in large language model dialogs. arXiv preprint arXiv:2402.10962, 2024a. Li, Y ., Mi, F., Li, Y ., Wang, Y ., Sun, B., Feng, S., and Li, K. Dynamic stochastic decoding strategy for open-domain dialogue generation.arXi...

  12. [12]

    YaRN: Efficient Context Window Extension of Large Language Models

    Peng, B., Quesnelle, J., Fan, H., and Shippole, E. Yarn: Efficient context window extension of large language models.arXiv preprint arXiv:2309.00071,

  13. [13]

    InFoBench: Evaluating instruction following ability in large language models

    Qin, Y ., Song, K., Hu, Y ., Yao, W., Cho, S., Wang, X., Wu, X., Liu, F., Liu, P., and Yu, D. InFoBench: Evaluating instruction following ability in large language models. InFindings of the Association for Computational Lin- guistics: ACL 2024, pp. 13025–13048. Association for Computational Linguistics, August

  14. [14]

    SQuAD: 100,000+ questions for machine comprehension of text

    Rajpurkar, P., Zhang, J., Lopyrev, K., and Liang, P. SQuAD: 100,000+ questions for machine comprehension of text. InProceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 2383–2392,

  15. [15]

    RoFormer: Enhanced Transformer with Rotary Position Embedding

    Su, J., Lu, Y ., Pan, S., Wen, B., and Liu, Y . RoFormer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:2104.09864,

  16. [16]

    W., Salakhutdinov, R., and Manning, C

    Yang, Z., Qi, P., Zhang, S., Bengio, Y ., Cohen, W. W., Salakhutdinov, R., and Manning, C. D. HotpotQA: A dataset for diverse, explainable multi-hop question answering. InProceedings of the 2018 Conference on Empirical Methods in Natural Language Process- ing (EMNLP), pp. 2369–2380,

  17. [17]

    Recursive Language Models

    doi: 10.18653/ v1/D18-1259. Zhang, A. L., Kraska, T., and Khattab, O. Recursive lan- guage models.arXiv preprint arXiv:2512.24601,

  18. [18]

    EDT: Improving large language models’ generation by entropy-based dynamic temperature sampling.arXiv preprint arXiv:2403.14541,

    Zhang, S., Bao, Y ., and Huang, S. EDT: Improving large language models’ generation by entropy-based dynamic temperature sampling.arXiv preprint arXiv:2403.14541,

  19. [19]

    Fixed Defaults and Calibration Table 6.Fixed defaults used in all Fatigue Index (FI) experiments

    11 Cognitive Fatigue in Autoregressive Transformers A. Fixed Defaults and Calibration Table 6.Fixed defaults used in all Fatigue Index (FI) experiments. Values were selected once from a small preliminary pass and then frozen; no per-item or per-dataset tuning was performed. Parameter Value Prompt sliceK64 Entropy band[H ℓ, Hu][3.8, 5.0] FI smoothing windo...