pith. machine review for the scientific record. sign in

arxiv: 2605.01853 · v1 · submitted 2026-05-03 · 💻 cs.CL · cs.AI

Recognition: unknown

Spatiotemporal Hidden-State Dynamics as a Signature of Internal Reasoning in Large Language Models

Kotaro Furuya, Takahito Tanimura

Authors on Pith no claims yet

Pith reviewed 2026-05-09 17:17 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords hidden stateslarge reasoning modelsspatiotemporal dynamicsinternal reasoningcorrectness signallabel-free evaluationlarge language modelstrajectory analysis
0
0 comments X

The pith

Spatiotemporal patterns in hidden states distinguish correct reasoning trajectories in large language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies how hidden states change across decoding steps and layers inside large reasoning models. It observes that correct solutions display wide temporal shifts concentrated in particular layers, a structure that appears weaker in non-reasoning models or knowledge-recall tasks. From this observation the authors define StALT, a simple statistic that sums token-to-token hidden-state changes weighted by layer importance inside each token. StALT separates correct from incorrect answers across several models and reasoning benchmarks without any training or ground-truth labels. Intervention tests show the statistic rises or falls when the model is prompted in ways that increase or decrease the need for step-by-step internal computation.

Core claim

Successful reasoning trajectories exhibit broad temporal dynamics with localized layer-wise concentration in hidden states; this structure is weaker in non-reasoning models and knowledge-heavy domains. The authors formalize the pattern as Spatiotemporal Amplitude of Latent Transition (StALT), a training-free trajectory statistic that reliably separates correct from incorrect outputs in reasoning-intensive regimes and responds systematically to manipulations that alter internal reasoning demand.

What carries the argument

Spatiotemporal Amplitude of Latent Transition (StALT), a statistic that aggregates temporal changes between adjacent tokens weighted by within-token layer saliency to summarize hidden-state dynamics.

If this is right

  • StALT supplies a competitive label-free correctness signal that works alongside length-based and output-space baselines.
  • The statistic changes in the expected direction under interventions that raise or lower the demand for internal reasoning.
  • The same spatiotemporal signature is weaker in non-reasoning models and in knowledge-heavy domains.
  • The findings supply direct empirical evidence that large reasoning models produce measurable hidden-state dynamics during extended generation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If StALT genuinely indexes reasoning effort, it could be monitored in real time to decide when to stop or continue generation.
  • The metric might help compare the internal computation profiles of models trained with different reasoning objectives or reinforcement schedules.
  • Similar layer-and-time analyses could be applied to other sequential generation tasks where correctness is hard to verify from output alone.

Load-bearing premise

The measured spatiotemporal amplitude in hidden states tracks genuine internal reasoning computation rather than surface features such as output length or token distribution.

What would settle it

An experiment in which StALT loses its ability to separate correct from incorrect trajectories after matching or regressing out solution length and token-frequency statistics.

Figures

Figures reproduced from arXiv: 2605.01853 by Kotaro Furuya, Takahito Tanimura.

Figure 1
Figure 1. Figure 1: Difference heatmaps between correct and incorrect hidden-state trajectories. Each cell view at source ↗
Figure 2
Figure 2. Figure 2: Standardized gap in hidden-state dynamics between correct and incorrect trajectories on view at source ↗
Figure 3
Figure 3. Figure 3: Conceptual overview of StALT. Temporal changes ( view at source ↗
Figure 4
Figure 4. Figure 4: Standardized gap between correct and incorrect trajectories, measured by StALT, across view at source ↗
Figure 5
Figure 5. Figure 5: Mean AUROC of StALT and representative baselines averaged within model groups. view at source ↗
Figure 6
Figure 6. Figure 6: Length-stratified AUROC on MATH-500 for Qwen3-4B. Error bars denote one standard deviation across runs. In the LRM groups, StALT is competitive with strong label-free predictors, particularly on reasoning-intensive datasets such as GSM8K, MATH-500, and MMLU-Pro STEM. The effect is clearest in the Qwen3 group, but the same tendency remains in other LRMs, although with smaller margins. By contrast, the non-L… view at source ↗
Figure 7
Figure 7. Figure 7: Within-family comparison on s1K￾1.1 for Qwen3-4B across base, no-thinking, and thinking-mode settings. We first consider interventions that should amplify internal dynamics, either by tuning the model for reasoning or by activating thinking mode at inference time. We compare Qwen3-4B-Base, Qwen3-4B with thinking mode disabled, and Qwen3-4B with thinking mode enabled. If our probe captures latent reasoning,… view at source ↗
Figure 8
Figure 8. Figure 8: Effect of in-context CoT scaffolding on hidden-state dynamics for Qwen3-4B. We first consider prompt-space interventions. Instead of encouraging the model to reason more internally, we provide part of the solution process externally. We use the DeepSeek-R1 reasoning traces and answers included in s1K-1.1 as externally supplied solution traces and compare four prompting conditions: a baseline prompt without… view at source ↗
Figure 9
Figure 9. Figure 9: Effect of supervised fine-tuning on hidden-state dynamics for s1K-1.1. Supervised fine-tuning provides a second reasoning￾reducing intervention, but in parameter space. We train Qwen3-4B on the DeepSeek-R1 reasoning traces and responses for the integer problems in s1K￾1.1, and evaluate the resulting checkpoints on the same set of problems. This allows us to examine how parameterizing the solution traces ch… view at source ↗
Figure 10
Figure 10. Figure 10: Within-family comparison on GPQA-Diamond for Qwen3-4B across base, no-thinking, view at source ↗
Figure 11
Figure 11. Figure 11: Effect of in-context CoT scaffolding on hidden-state dynamics for Qwen3-8B on integer view at source ↗
Figure 9
Figure 9. Figure 9: Here, StALT declines over training checkpoints, but unlike the in-domain setting, accuracy view at source ↗
Figure 12
Figure 12. Figure 12: Effect of supervised fine-tuning (trained on s1K-1.1 integer problems) evaluated on view at source ↗
read the original abstract

Large reasoning models (LRMs) generate extended solutions, yet it remains unclear whether these traces reflect substantive internal computation or merely verbosity and overthinking. Although recent hidden-state analyses suggest that internal representations carry correctness-related signals, their coarse aggregations may obscure the token and layer structure underlying reasoning computation. We investigate hidden-state transitions across decoding steps and layers, and identify a distinct spatiotemporal pattern in LRMs: successful trajectories exhibit broad temporal dynamics with localized layer-wise concentration, while this structure is weaker in non-reasoning models and knowledge-heavy domains. We formalize this characteristic as Spatiotemporal Amplitude of Latent Transition (StALT), a training-free trajectory statistic that summarizes temporal changes between adjacent tokens weighted by within-token layer saliency. Across diverse models and benchmarks, StALT reliably separates correct from incorrect trajectories in reasoning-intensive regimes, providing a competitive label-free correctness signal alongside strong output-space and length-based baselines. Intervention analyses further show that this spatiotemporal amplitude responds systematically to manipulations that increase or reduce the demand for internal reasoning, supporting its association with latent reasoning dynamics in LRMs. These findings provide empirical evidence that LRMs exhibit measurable hidden-state dynamics and offer a practical probe for understanding internal computation beyond output-based evaluation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper claims that large reasoning models exhibit distinct spatiotemporal patterns in hidden-state transitions during decoding—successful trajectories show broad temporal dynamics with localized layer-wise concentration—while non-reasoning models and knowledge-heavy domains do not. It formalizes this as the training-free Spatiotemporal Amplitude of Latent Transition (StALT) statistic, which summarizes temporal changes between adjacent tokens weighted by within-token layer saliency. Across models and benchmarks, StALT separates correct from incorrect trajectories in reasoning-intensive regimes, remains competitive with length-based and output-space baselines, and responds systematically to interventions that increase or reduce reasoning demand.

Significance. If the central empirical claims hold after controls, the work supplies a practical, label-free probe for latent reasoning dynamics that goes beyond output evaluation. Credit is due for the training-free construction of StALT, the use of intervention analyses to test association with reasoning demand, and the cross-model/cross-benchmark consistency reported in the abstract.

major comments (2)
  1. [Abstract] Abstract: the claim of reliable separation across models and benchmarks rests on unexamined empirical robustness; no details are supplied on statistical controls, multiple-testing correction, or exact data-exclusion rules, which are load-bearing for the central correctness-signal claim.
  2. [Intervention analyses] The weakest assumption—that observed spatiotemporal amplitude directly indexes substantive internal reasoning rather than correlated surface features such as output length or token distribution—requires explicit ablations or controls; without them the intervention results cannot yet distinguish the two interpretations.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address each major comment below and indicate the revisions that will be incorporated to strengthen the empirical claims and controls.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the claim of reliable separation across models and benchmarks rests on unexamined empirical robustness; no details are supplied on statistical controls, multiple-testing correction, or exact data-exclusion rules, which are load-bearing for the central correctness-signal claim.

    Authors: We agree that the abstract would be strengthened by explicit reference to the statistical procedures supporting the separation claims. In the revised manuscript we will update the abstract to note that separation is assessed via paired t-tests with Bonferroni correction for multiple comparisons across the 12 model-benchmark combinations, with exact data-exclusion rules (trajectories shorter than 8 tokens or containing NaN hidden states) stated in Section 3.2. These procedures are already detailed in the methods and supplementary results; the revision will simply surface them in the abstract for clarity. revision: yes

  2. Referee: [Intervention analyses] The weakest assumption—that observed spatiotemporal amplitude directly indexes substantive internal reasoning rather than correlated surface features such as output length or token distribution—requires explicit ablations or controls; without them the intervention results cannot yet distinguish the two interpretations.

    Authors: We acknowledge that stronger isolation from surface confounds is desirable. The existing intervention suite already includes length-matched prompt variants and shows that StALT changes track reasoning demand (CoT vs. direct answer) even when output length is statistically controlled via regression residuals. Nevertheless, we agree that additional explicit ablations are warranted. In the revision we will add (i) a length-matched subset analysis and (ii) a token-distribution control that permutes within-layer activations while preserving marginal token statistics. These will be reported in an expanded Section 4.3 and will allow readers to directly compare the reasoning-specific versus surface-feature accounts. revision: yes

Circularity Check

0 steps flagged

No significant circularity; StALT is an independent statistic

full rationale

The paper defines StALT explicitly as a training-free summary of temporal changes between adjacent tokens weighted by layer saliency, constructed solely from hidden-state observations. This definition precedes and does not reference correctness labels, output length, or any fitted parameters. The subsequent evaluation of StALT's separation power on correct versus incorrect trajectories is an external test rather than a definitional input, so the statistic does not reduce to its evaluation targets by construction. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work are invoked to justify the core formulation. The derivation chain remains self-contained and falsifiable against independent benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review performed on abstract only; no explicit free parameters, axioms, or invented entities are stated. The central claim implicitly assumes hidden states encode computation and that layer-wise saliency can be meaningfully aggregated without additional fitting.

pith-pipeline@v0.9.0 · 5514 in / 1007 out tokens · 26094 ms · 2026-05-09T17:17:53.516562+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

46 extracted references · 29 canonical work pages · 10 internal anchors

  1. [1]

    Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, et al. Chain-of-thought prompting elicits reasoning in large language models, 2023. URLhttps://arxiv.org/abs/2201.11903

  2. [2]

    Scaling llm test-time compute optimally can be more effective than scaling parameters for reasoning

    Charlie Victor Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling llm test-time compute optimally can be more effective than scaling parameters for reasoning. InThe Thirteenth International Conference on Learning Representations, 2025

  3. [3]

    s1: Simple test-time scaling

    Niklas Muennighoff, Zitong Yang, Weijia Shi, et al. s1: Simple test-time scaling. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 20286– 20332, 2025

  4. [4]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

  5. [5]

    Abhimanyu Dubey et al

    Daya Guo, Dejian Yang, Haowei Zhang, et al. Deepseek-r1 incentivizes reasoning in llms through reinforcement learning.Nature, 645(8081):633–638, September 2025. ISSN 1476-4687. doi: 10.1038/s41586-025-09422-z. URL http://dx.doi.org/10.1038/ s41586-025-09422-z

  6. [6]

    Trustllm: Trustworthiness in large language models.arXiv preprint arXiv:2401.05561,

    Yue Huang, Lichao Sun, Haoran Wang, et al. Trustllm: Trustworthiness in large language models, 2024. URLhttps://arxiv.org/abs/2401.05561

  7. [7]

    Let’s verify step by step

    Hunter Lightman, Vineet Kosaraju, Yuri Burda, et al. Let’s verify step by step. InThe Twelfth International Conference on Learning Representations, 2023

  8. [8]

    Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models

    Yue Zhang, Yafu Li, Leyang Cui, et al. Siren’s song in the ai ocean: A survey on hallucination in large language models, 2025. URLhttps://arxiv.org/abs/2309.01219

  9. [9]

    Teaching models to express their uncertainty in words

    Stephanie Lin, Jacob Hilton, and Owain Evans. Teaching models to express their uncertainty in words, 2022. URLhttps://arxiv.org/abs/2205.14334

  10. [10]

    Katherine Tian, Eric Mitchell, Allan Zhou, Archit Sharma, Rafael Rafailov, Huaxiu Yao, Chelsea Finn, and Christopher Manning

    Katherine Tian, Eric Mitchell, Allan Zhou, Archit Sharma, Rafael Rafailov, Huaxiu Yao, Chelsea Finn, and Christopher Manning. Just ask for calibration: Strategies for eliciting calibrated confidence scores from language models fine-tuned with human feedback. In Houda Bouamor, Juan Pino, and Kalika Bali, editors,Proceedings of the 2023 Conference on Empiri...

  11. [11]

    The unreasonable effectiveness of entropy minimization in LLM reasoning

    Shivam Agarwal, Zimin Zhang, Lifan Yuan, Jiawei Han, and Hao Peng. The unreasonable effectiveness of entropy minimization in LLM reasoning. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URL https://openreview.net/forum? id=UfFTBEsLgI

  12. [12]

    Learning to reason without external rewards

    Xuandong Zhao, Zhewei Kang, Aosong Feng, Sergey Levine, and Dawn Song. Learning to reason without external rewards. InThe Fourteenth International Conference on Learning Representations, 2026. URLhttps://openreview.net/forum?id=OU9nFEYR2M

  13. [13]

    Confidence is all you need: Few-shot rl fine-tuning of language models.arXiv preprint arXiv:2506.06395, 2025

    Pengyi Li, Matvey Skripkin, Alexander Zubrey, Andrey Kuznetsov, and Ivan Oseledets. Confidence is all you need: Few-shot rl fine-tuning of language models, 2025. URL https://arxiv.org/abs/2506.06395

  14. [14]

    Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs

    Miao Xiong, Zhiyuan Hu, Xinyang Lu, Yifei Li, Jie Fu, Junxian He, and Bryan Hooi. Can llms express their uncertainty? an empirical evaluation of confidence elicitation in llms, 2024. URL https://arxiv.org/abs/2306.13063

  15. [15]

    Reasoning models know when they’re right: Probing hidden states for self-verification.arXiv preprint arXiv:2504.05419, 2025

    Anqi Zhang, Yulin Chen, Jane Pan, Chen Zhao, Aurojit Panda, Jinyang Li, and He He. Reasoning models know when they’re right: Probing hidden states for self-verification, 2025. URL https://arxiv.org/abs/2504.05419

  16. [16]

    2506.07240 , archivePrefix=

    Roy Eisenstadt, Itamar Zimerman, and Lior Wolf. Overclocking llm reasoning: Monitoring and controlling thinking path lengths in llms, 2025. URLhttps://arxiv.org/abs/2506.07240. 10

  17. [17]

    arXiv preprint arXiv:2502.07266 (2025)

    Yuyang Wu, Yifei Wang, Ziyu Ye, Tianqi Du, Stefanie Jegelka, and Yisen Wang. When more is less: Understanding chain-of-thought length in llms, 2025. URL https://arxiv.org/abs/ 2502.07266

  18. [18]

    Think deep, not just long: Measuring llm reasoning effort via deep-thinking tokens, 2026

    Wei-Lin Chen, Liqian Peng, Tian Tan, Chao Zhao, Blake JianHang Chen, Ziqian Lin, Alec Go, and Yu Meng. Think deep, not just long: Measuring llm reasoning effort via deep-thinking tokens, 2026. URLhttps://arxiv.org/abs/2602.13517

  19. [19]

    Wong, and Rui Wang

    Yiming Wang, Pei Zhang, Baosong Yang, Derek F. Wong, and Rui Wang. Latent space chain-of-embedding enables output-free LLM self-evaluation. InThe Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum? id=jxo70B9fQo

  20. [20]

    Clue: Non-parametric verification from experience via hidden-state clustering, 2025

    Zhenwen Liang, Ruosen Li, Yujun Zhou, Linfeng Song, Dian Yu, Xinya Du, Haitao Mi, and Dong Yu. Clue: Non-parametric verification from experience via hidden-state clustering, 2025. URLhttps://arxiv.org/abs/2510.01591

  21. [21]

    Can llms predict their own failures? self-awareness via internal circuits.arXiv preprint arXiv:2512.20578, 2025

    Amirhosein Ghasemabadi and Di Niu. Can llms predict their own failures? self-awareness via internal circuits, 2026. URLhttps://arxiv.org/abs/2512.20578

  22. [22]

    Vilas, Safoora Yousefi, Besmira Nushi, Eric Horvitz, and Vidhisha Balachandran

    Martina G. Vilas, Safoora Yousefi, Besmira Nushi, Eric Horvitz, and Vidhisha Balachandran. Tracing the traces: Latent temporal signals for efficient and accurate reasoning, 2025. URL https://arxiv.org/abs/2510.10494

  23. [23]

    Decoupling knowledge and reasoning in llms: An exploration using cognitive dual-system theory

    Mutian Yang, Jiandong Gao, and Ji Wu. Decoupling knowledge and reasoning in llms: An exploration using cognitive dual-system theory. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 34268–34276, 2026

  24. [24]

    Knowledge or reasoning? a close look at how llms think across domains, 2025

    Juncheng Wu, Sheng Liu, Haoqin Tu, Hang Yu, Xiaoke Huang, James Zou, Cihang Xie, and Yuyin Zhou. Knowledge or reasoning? a close look at how llms think across domains, 2025. URLhttps://arxiv.org/abs/2506.02126

  25. [25]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, et al. Qwen3 technical report, 2025. URL https: //arxiv.org/abs/2505.09388

  26. [26]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, et al. The llama 3 herd of models, 2024. URLhttps://arxiv.org/abs/2407.21783

  27. [27]

    Chong Wu, Jiawang Cao, Renjie Xu, Zhuoheng Ran, Maolin Che, Wenbo Zhu, and Hong Yan

    Yubo Wang, Xueguang Ma, Ge Zhang, et al. Mmlu-pro: A more robust and challenging multi-task language understanding benchmark. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors,Advances in Neural Information Processing Systems, volume 37, pages 95266–95290. Curran Associates, Inc., 2024. doi: 10.52202/079017-3018...

  28. [28]

    Larry V . Hedges. Distribution theory for glass’s estimator of effect size and related estimators. Journal of Educational Statistics, 6(2):107–128, 1981. ISSN 03629791. URL http://www. jstor.org/stable/1164588

  29. [29]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, et al. Training verifiers to solve math word problems, 2021. URLhttps://arxiv.org/abs/2110.14168

  30. [30]

    Measuring mathematical problem solving with the math dataset

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. In J. Vanschoren and S. Yeung, editors,Proceedings of the Neural Infor- mation Processing Systems Track on Datasets and Benchmarks, volume 1, 2021. URL https://datasets-benchmark...

  31. [31]

    David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA: A graduate-level google- proof q&a benchmark. InFirst Conference on Language Modeling, 2024. URL https: //openreview.net/forum?id=Ti67584b98. 11

  32. [32]

    SmolLM3: smol, multilingual, long- context reasoner.https://huggingface.co/blog/smollm3, 2025

    Elie Bakouch, Loubna Ben Allal, Anton Lozhkov, et al. SmolLM3: smol, multilingual, long- context reasoner.https://huggingface.co/blog/smollm3, 2025

  33. [33]

    gpt-oss-120b & gpt-oss-20b Model Card

    OpenAI, Sandhini Agarwal, Lama Ahmad, et al. gpt-oss-120b & gpt-oss-20b model card, 2025. URLhttps://arxiv.org/abs/2508.10925

  34. [34]

    Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement

    An Yang, Beichen Zhang, Binyuan Hui, et al. Qwen2.5-math technical report: Toward mathemat- ical expert model via self-improvement, 2024. URL https://arxiv.org/abs/2409.12122

  35. [35]

    Large language models are zero-shot reasoners

    Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. InProceedings of the 36th International Conference on Neural Information Processing Systems, NIPS ’22, Red Hook, NY , USA, 2022. Curran Associates Inc. ISBN 9781713871088

  36. [36]

    Jianhao Huang, Zixuan Wang, and Jason D. Lee. Transformers learn to implement multi-step gradient descent with chain of thought. InThe Thirteenth International Conference on Learning Representations, 2025. URLhttps://openreview.net/forum?id=r3DF5sOo5B

  37. [37]

    Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation

    Lorenz Kuhn, Yarin Gal, and Sebastian Farquhar. Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation. InThe Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum? id=VD-AYtP0dve

  38. [38]

    The internal state of an LLM knows when it ' s lying

    Amos Azaria and Tom Mitchell. The internal state of an LLM knows when it’s lying. In Houda Bouamor, Juan Pino, and Kalika Bali, editors,Findings of the Association for Com- putational Linguistics: EMNLP 2023, pages 967–976, Singapore, December 2023. Asso- ciation for Computational Linguistics. doi: 10.18653/v1/2023.findings-emnlp.68. URL https://aclanthol...

  39. [39]

    Kevin Liu, Stephen Casper, Dylan Hadfield-Menell, and Jacob Andreas. Cognitive dissonance: Why do language model outputs disagree with internal representations of truthfulness? In Houda Bouamor, Juan Pino, and Kalika Bali, editors,Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 4791–4797, Singapore, December

  40. [40]

    Cognitive Dissonance: Why Do Language Model Outputs Disagree with Internal Representations of Truthfulness?

    Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.291. URLhttps://aclanthology.org/2023.emnlp-main.291/

  41. [41]

    Hidden States as Early Signals: Step-level Trace Evaluation and Pruning for Efficient Test-Time Scaling

    Zhixiang Liang, Beichen Huang, Zheng Wang, and Minjia Zhang. Hidden states as early signals: Step-level trace evaluation and pruning for efficient test-time scaling, 2026. URL https://arxiv.org/abs/2601.09093

  42. [42]

    Cot-kinetics: A theoretical modeling assessing lrm reasoning process

    Jinhe Bi, Danqi Yan, Yifan Wang, et al. Cot-kinetics: A theoretical modeling assessing lrm reasoning process, 2025. URLhttps://arxiv.org/abs/2505.13408

  43. [43]

    Measuring massive multitask language understanding.Proceedings of the International Conference on Learning Representations (ICLR), 2021

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding.Proceedings of the International Conference on Learning Representations (ICLR), 2021

  44. [44]

    Efficient memory management for large lan- guage model serving with pagedattention

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, et al. Efficient memory management for large lan- guage model serving with pagedattention. InProceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023

  45. [45]

    Transformers: State-of-the-art natural language processing

    Thomas Wolf, Lysandre Debut, Victor Sanh, et al. Transformers: State-of-the-art natural language processing. InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online, October 2020. Associ- ation for Computational Linguistics. URL https://www.aclweb.org/anthology/2020. emnlp-demos.6

  46. [46]

    the answer is (X)

    Leandro von Werra, Younes Belkada, Lewis Tunstall, et al. TRL: Transformers Reinforcement Learning, 2020. URLhttps://github.com/huggingface/trl. 12 Method / ModelQwen3-1.7B Qwen3-4B Qwen3-8B SmolLM3-3B gpt-oss-20bQwen2.5-Math-7B Llama-3.1-8B-Instruct MATH-500 (AUROC↑/ FPR95↓/ AUPR↑) Gen. Tokens0.78/0.60/0.910.71/0.64/0.920.70/0.65/0.920.74/0.63/0.910.63/0...