pith. machine review for the scientific record.

arxiv: 2604.15726 · v1 · submitted 2026-04-17 · 💻 cs.AI


LLM Reasoning Is Latent, Not the Chain of Thought


Pith reviewed 2026-05-10 08:46 UTC · model grok-4.3

classification 💻 cs.AI
keywords reasoning · object · should · surface · latent · latent-state · compute · default

The pith

LLM reasoning is primarily mediated by latent-state trajectories rather than by explicit surface chain-of-thought outputs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models process text by updating a set of internal numbers at each step. These updates trace paths through a high-dimensional space of possible states. The paper proposes that the useful reasoning happens along these hidden paths. The words the model eventually prints, called chain-of-thought, are only a surface trace that may or may not match the internal path. A third factor is simply that printing more words uses more total computation. The authors separate these three elements and review existing experiments on whether the printed steps are faithful to the model's actual computation, whether editing internal states while leaving the printed text unchanged alters the answer, and how performance changes when compute is increased without changing the output format. They add their own small examples that try to hold compute constant while varying the internal versus surface aspects. From this reorganized evidence they conclude that the hidden trajectories are the main thing worth studying. They recommend that future work treat latent dynamics as the default target and build evaluations that keep the three factors apart.
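To make the three-way split concrete, here is a minimal sketch of one compute-matched comparison in the spirit of the paper's exemplars: three prompts with an identical generation budget, differing only in whether the extra serial steps carry informative surface content. The model name, prompts, and budget are illustrative assumptions, not the paper's protocol.

```python
# Minimal sketch of a compute-matched probe that separates surface content
# (H2) from generic serial compute (H0). Model, prompts, and budget are
# illustrative assumptions; the paper's own exemplars are not reproduced here.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-0.5B-Instruct"  # assumption: any small causal LM works
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

QUESTION = "If 3 pens cost $4.50, how much do 7 pens cost?"
BUDGET = 64  # identical max_new_tokens across conditions

conditions = {
    # Informative surface trace (H2-style condition).
    "cot": f"{QUESTION}\nLet's think step by step, then give the answer.",
    # Same serial budget, content-free filler (H0-style, cf. 'dot by dot').
    "filler": f"{QUESTION}\nPrint 40 dots on one line, then give the answer.",
    # No extra serial steps requested.
    "direct": f"{QUESTION}\nGive only the final answer.",
}

for name, prompt in conditions.items():
    inputs = tok(prompt, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=BUDGET, do_sample=False)
    completion = tok.decode(out[0, inputs.input_ids.shape[1]:],
                            skip_special_tokens=True)
    print(f"[{name}] {completion!r}")

# Caveat: a real audit must also force equal generated length (early EOS
# breaks the match) and patch latents to test H1, which this sketch omits.
```

If the filler condition tracks the chain-of-thought condition at the same budget, gains look like generic serial compute (H0); if chain-of-thought pulls ahead, the surface trace itself carries information (H2). Neither outcome bears on H1, which needs latent interventions.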

Core claim

Current evidence most strongly supports H1 as a default working hypothesis rather than as a task-independent verdict. We therefore make two recommendations: the field should treat latent-state dynamics as the default object of study for LLM reasoning, and it should evaluate reasoning with designs that explicitly disentangle surface traces, latent states, and serial compute.

Load-bearing premise

That prior empirical, mechanistic, and survey results can be reorganized under the three hypotheses without material selection bias, and that the added compute-audited exemplars cleanly factorize surface traces, latent interventions, and matched budget expansions.

read the original abstract

This position paper argues that large language model (LLM) reasoning should be studied as latent-state trajectory formation rather than as faithful surface chain-of-thought (CoT). This matters because claims about faithfulness, interpretability, reasoning benchmarks, and inference-time intervention all depend on what the field takes the primary object of reasoning to be. We ask what that object should be once three often-confounded factors are separated and formalize three competing hypotheses: H1, reasoning is primarily mediated by latent-state trajectories; H2, reasoning is primarily mediated by explicit surface CoT; and H0, most apparent reasoning gains are better explained by generic serial compute than by any privileged representational object. Reorganizing recent empirical, mechanistic, and survey work under this framework, and adding compute-audited worked exemplars that factorize surface traces, latent interventions, and matched budget expansions, we find that current evidence most strongly supports H1 as a default working hypothesis rather than as a task-independent verdict. We therefore make two recommendations: the field should treat latent-state dynamics as the default object of study for LLM reasoning, and it should evaluate reasoning with designs that explicitly disentangle surface traces, latent states, and serial compute.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

full rationale

The paper is a position paper that defines three hypotheses (H1 latent trajectories, H2 surface CoT, H0 generic compute) and reorganizes external empirical/mechanistic/survey literature under them while adding new compute-audited exemplars that are described as factorizing the relevant quantities. The central claim—that evidence supports H1 as default—is an interpretive synthesis of cited prior work rather than a first-principles derivation, fitted parameter, or prediction that reduces to the paper's own inputs by construction. No equations, self-definitional loops, load-bearing self-citations, or renamed results are present; the argument treats the reorganized studies as independent external evidence. This is a standard non-finding for a conceptual reorganization paper whose claims remain falsifiable against the cited literature.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the premise that latent trajectories, surface CoT, and serial compute are separable quantities that can be independently varied in experiments; no new free parameters or invented entities are introduced.

axioms (1)
  • domain assumption: The three factors of latent-state trajectories, surface chain-of-thought, and generic serial compute can be experimentally isolated and factorized.
    Invoked when defining H1, H2, and H0 and when describing the added worked exemplars.
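Read operationally, the axiom commits an evaluator to a full factorial crossing in which each factor can be varied while the other two are held fixed. A minimal sketch, with illustrative level names rather than the paper's actual conditions:

```python
# Enumerate the factorized design the axiom presupposes: every cell fixes
# two of the three factors while varying the third. Level names are
# illustrative assumptions, not the paper's actual interventions.
from itertools import product

FACTORS = {
    "surface_trace": ("original_cot", "paraphrased_cot", "filler_tokens"),
    "latent_state": ("unmodified", "patched_at_layer_k"),
    "serial_compute": ("matched_budget", "expanded_budget"),
}

for cell in product(*FACTORS.values()):
    print(dict(zip(FACTORS, cell)))

# 3 x 2 x 2 = 12 cells. The axiom holds only if each factor is independently
# realizable, e.g. editing latents without changing the emitted tokens or
# the number of forward passes.
```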

pith-pipeline@v0.9.0 · 5502 in / 1390 out tokens · 62797 ms · 2026-05-10T08:46:40.288747+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

57 extracted references · 47 canonical work pages · 9 internal anchors

  1. [1] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q. V., and Zhou, D. (2022). Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems, 35:24824–24837.
  2. [2] Wang, X., Wei, J., Schuurmans, D., Le, Q. V., Chi, E. H., Narang, S., Chowdhery, A., and Zhou, D. (2023). Self-consistency improves chain of thought reasoning in language models. In International Conference on Learning Representations.
  3. [3] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T. L., Cao, Y., and Narasimhan, K. (2023). Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601.
  4. [4] Li, Z., Liu, H., Zhou, D., and Ma, T. (2024). Chain of thought empowers transformers to solve inherently serial problems. arXiv preprint arXiv:2402.12875.
  5. [5] Snell, C., Lee, J., Xu, K., and Kumar, A. (2024). Scaling LLM test-time compute optimally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314.
  6. [6] Turpin, M., Michael, J., Perez, E., and Bowman, S. R. (2023). Language models don't always say what they think: Unfaithful explanations in chain-of-thought prompting. arXiv preprint arXiv:2305.04388.
  7. [7] Lanham, T., Chen, A., Radhakrishnan, A., Steiner, B., Denison, C., Hernandez, D., Li, D., Durmus, E., Hubinger, E., Kernion, J., et al. (2023). Measuring faithfulness in chain-of-thought reasoning. arXiv preprint arXiv:2307.13702.
  8. [8] Feng, J., Russell, S., and Steinhardt, J. (2024). Monitoring latent world states in language models with propositional probes. arXiv preprint arXiv:2406.19501.
  9. [9] Hao, S., Sukhbaatar, S., Su, D., Li, X., Hu, Z., Weston, J., and Tian, Y. (2024). Training large language models to reason in a continuous latent space. arXiv preprint arXiv:2412.06769.
  10. [10] He, Z., Xiong, G., Liu, B., Sinha, S., and Zhang, A. (2026). Reasoning beyond chain-of-thought: A latent computational mode in large language models. arXiv preprint arXiv:2601.08058.
  11. [11] Creswell, A. and Shanahan, M. (2022). Faithful reasoning using large language models. arXiv preprint arXiv:2208.14271.
  12. [12] Lyu, Q., Havaldar, S., Stein, A., Zhang, L., Rao, D., Wong, E., Apidianaki, M., and Callison-Burch, C. (2023). Faithful chain-of-thought reasoning. arXiv preprint arXiv:2301.13379.
  13. [13] Arakelyan, E., Minervini, P., Verga, P., Lewis, P., and Augenstein, I. (2024). FLARE: Faithful logic-aided reasoning and exploration. arXiv preprint arXiv:2410.11900.
  14. [14] Pan, L., Albalak, A., Wang, X., and Wang, W. Y. (2023). Logic-LM: Empowering large language models with symbolic solvers for faithful logical reasoning. arXiv preprint arXiv:2305.12295.
  15. [15] Shi, Y., Sun, M., Liu, Z., Yang, M., Fang, Y., Sun, T., and Gu, X. (2026). Reasoning in trees: Improving retrieval-augmented generation for multi-hop question answering. arXiv preprint arXiv:2601.11255.
  16. [16] Li, X., Wang, R., Wang, Y., Guo, M., Li, C., Sheng, T., Ravi, S., and Roth, D. (2026). PAR2-RAG: Planned active retrieval and reasoning for multi-hop question answering. arXiv preprint arXiv:2603.29085.
  17. [17] Wei, K., Shan, R., Zou, D., Yang, J., Zhao, B., Zhu, J., and Zhong, J. (2025). MIRAGE: Scaling test-time inference with parallel graph-retrieval-augmented reasoning chains. arXiv preprint arXiv:2508.18260.
  18. [18] Ferguson, N., Bundy, A., and Nuamah, K. (2026). Exploring the meta-level reasoning of large language models via a tool-based multi-hop tabular question answering task. arXiv preprint arXiv:2601.07696.
  19. [19] Huang, Y., Huang, Z., Xiang, L., Yang, Q., and Yin, H. (2025). PathoHR: Hierarchical reasoning for vision-language models in pathology. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 2296–2311.
  20. [20] Yang, J., Li, Y., and Huang, Z. (2025). ReLoop: "Seeing twice and thinking backwards" via closed-loop training to mitigate hallucinations in multimodal understanding. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 4162–4179.
  21. [21] Radhakrishnan, A., Nguyen, K., Chen, A., Chen, C., Denison, C., Hernandez, D., Durmus, E., Hubinger, E., Kernion, J., Lukošiūtė, K., et al. (2023). Question decomposition improves the faithfulness of model-generated reasoning. arXiv preprint arXiv:2307.11768.
  22. [22] Chua, J. and Evans, O. (2025). Are DeepSeek R1 and other reasoning models more faithful? arXiv preprint arXiv:2501.08156.
  23. [23] Emmons, S., Jenner, E., Elson, D. K., Saurous, R. A., Rajamanoharan, S., Chen, H., Shafkat, I., and Shah, R. (2025a). When chain of thought is necessary, language models struggle to evade monitors. arXiv preprint arXiv:2507.05246.
  24. [24] Lee, S., Shin, J., Ahn, Y., Seo, S., Kwon, O., and Kim, K.-E. (2024). Zero-shot multi-hop question answering via monte-carlo tree search with large language models. arXiv preprint arXiv:2409.19382.
  25. [25] Singhi, N., Bansal, H., Hosseini, A., Grover, A., Chang, K.-W., Rohrbach, M., and Rohrbach, A. (2025). When to solve, when to verify: Compute-optimal problem solving and generative verification for LLM reasoning. arXiv preprint arXiv:2504.01005.
  26. [26] Pfau, J., Merrill, W., and Bowman, S. R. (2024). Let's think dot by dot: Hidden computation in transformer language models. arXiv preprint arXiv:2404.15758.
  27. [27] Zhang, A., Chen, Y., Pan, J., Zhao, C., Panda, A., Li, J., and He, H. (2025). Reasoning models know when they're right: Probing hidden states for self-verification. arXiv preprint arXiv:2504.05419.
  28. [28] Barsalou, L. W., Simmons, W. K., Barbey, A. K., and Wilson, C. D. (2003). Grounding conceptual knowledge in modality-specific systems. Trends in Cognitive Sciences, 7(2):84–93.
  29. [29] Arcuschin, I., Janiak, J., Krzyzanowski, R., Rajamanoharan, S., Nanda, N., and Conmy, A. (2025). Chain-of-thought reasoning in the wild is not always faithful. arXiv preprint arXiv:2503.08679.
  30. [30] Chen, Y., Benton, J., Radhakrishnan, A., Uesato, J., Denison, C., Schulman, J., Somani, A., Hase, P., Wagner, M., Roger, F., Mikulik, V., Bowman, S. R., Leike, J., Kaplan, J., and Perez, E. (2025). Reasoning models don't always say what they think. arXiv preprint arXiv:2505.05410.
  31. [31] Xiong, Z., Chen, S., Qi, Z., and Lakkaraju, H. (2025). Measuring the faithfulness of thinking drafts in large reasoning models. arXiv preprint arXiv:2505.13774.
  32. [32] Ye, D., Loffgren, M., Kotadia, O., and Wong, L. (2026). Mechanistic evidence for faithfulness decay in chain-of-thought reasoning. arXiv preprint arXiv:2602.11201.
  33. [33] Yang, S., Gribovskaya, E., Kassner, N., Geva, M., and Riedel, S. (2024). Do large language models latently perform multi-hop reasoning? In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, pages 10210–10229.
  34. [34] Kazama, K., Shirafuji, D., and Saito, T. (2026). GeoSteer: Faithful chain-of-thought steering via latent manifold gradients. arXiv preprint arXiv:2601.10229.
  35. [35] Nguyen, T. and Le, T. (2026). ATLAS: Adaptive test-time latent steering with external verifiers for enhancing LLMs reasoning. arXiv preprint arXiv:2601.03093.
  36. [36] Li, C., Zhang, K., Xu, H., Shi, Y., Zhang, Z., Song, K., and Ren, K. (2026). Interpreting and controlling LLM reasoning through integrated policy gradient. arXiv preprint arXiv:2602.02313.
  37. [37] Saunshi, N., Dikkala, N., Li, Z., Kumar, S., and Reddi, S. J. (2025). Reasoning with latent thoughts: On the power of looped transformers. arXiv preprint arXiv:2502.17416.
  38. [38] Boppana, S., Ma, A., Loeffler, M., Sarfati, R., Bigelow, E., Geiger, A., Lewis, O., and Merullo, J. (2026). Reasoning theater: Disentangling model beliefs from chain-of-thought. arXiv preprint arXiv:2603.05488.
  39. [39] Wang, M., Vu, T.-T., Shareghi, E., and Haffari, G. (2025). Towards inference-time scaling for continuous space reasoning. arXiv preprint arXiv:2510.12167.
  40. [40] Sheikhi, S. (2026). Chain of simulation: A dual-mode reasoning framework for large language models with dynamic problem routing. arXiv preprint arXiv:2602.02842.
  41. [41] Chen, S. and Niu, D. (2025). iCLP: Large language model reasoning with implicit cognition latent planning. arXiv preprint arXiv:2512.24014.
  42. [42] Zaman, K. and Srivastava, S. (2025). Is chain-of-thought really not explainability? Chain-of-thought can be faithful without hint verbalization. arXiv preprint arXiv:2512.23032.
  43. [43] Kambhampati, S., Valmeekam, K., Bhambri, S., Palod, V., Saldyt, L., Stechly, K., Samineni, S. R., Kalwar, D., and Biswas, U. (2025). Position: Stop anthropomorphizing intermediate tokens as reasoning/thinking traces! arXiv preprint arXiv:2504.09762.
  44. [44] Barez, F., Wu, T.-Y., Arcuschin, I., Lan, M., Wang, V., Siegel, N., Collignon, N., Neo, C., Lee, I., Paren, A., Bibi, A., Trager, R., Fornasiere, D., Yan, J., Elazar, Y., and Bengio, Y. (2025). Chain-of-thought is not explainability. Working paper.
  45. [45] Cabannes, V., Arnal, C., Bouaziz, W., Yang, A., Charton, F., and Kempe, J. (2024). Iteration head: A mechanistic study of chain-of-thought. arXiv preprint arXiv:2406.02128.
  46. [46] Pan, L., Liang, J., Ye, J., Yang, M., Lu, X., and Zhu, F. (2026). Opening the black box: A survey on the mechanisms of multi-step reasoning in large language models. arXiv preprint arXiv:2601.14270.
  47. [47] Chen, X., Zhao, A., Xia, H., Lu, X., Wang, H., Chen, Y., Zhang, W., Wang, J., Li, W., and Shen, X. (2025a). Reasoning beyond language: A comprehensive survey on latent chain-of-thought reasoning. arXiv preprint arXiv:2505.16782.
  48. [48] Zhu, R.-J., Peng, T., Cheng, T., Qu, X., Huang, J., Zhu, D., Wang, H., Xue, K., Zhang, X., Shan, Y., et al. (2025). A survey on latent reasoning. arXiv preprint arXiv:2507.06203.
  49. [49] Hu, Y., Gu, J., Wang, R., Yao, Z., Peng, H., Wu, X., Chen, J., Zhang, M., and Pan, L. (2026). Towards a mechanistic understanding of large reasoning models: A survey of training, inference, and failures. arXiv preprint arXiv:2601.19928.
  50. [50] Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., et al. (2021). Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.
  51. [51] Vendrow, J., Vendrow, E., Beery, S., and Madry, A. (2025). Do large language model benchmarks test reliability? arXiv preprint arXiv:2502.03461.
  52. [52] MadryLab (2025). GSM8K-Platinum dataset card. Hugging Face dataset card. https://huggingface.co/datasets/madrylab/gsm8k-platinum
  53. [53] Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al. (2025). Qwen3 technical report. arXiv preprint arXiv:2505.09388.
  54. [54] Yang, Z., Qi, P., Zhang, S., Bengio, Y., Cohen, W. W., Salakhutdinov, R., and Manning, C. D. (2018). HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369–2380.
  55. [55] Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., Song, D., and Steinhardt, J. (2021). Measuring mathematical problem solving with the MATH dataset. arXiv preprint arXiv:2103.03874.
  56. [56] Chen, M., Tworek, J., Jun, H., Yuan, Q., de Oliveira Pinto, H. P., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., Ray, A., Puri, R., Krueger, G., Petrov, M., Khlaaf, H., Sastry, G., Mishkin, P., Chan, B., Gray, S., Ryder, N., Pavlov, M., Power, A., Kaiser, L., Bavarian, M., Winter, C., Tillet, P., Such, F. P., Cummings, D. W., Plappert, M....
  57. [57] Liu, J., Xia, C. S., Wang, Y., and Zhang, L. (2023). Is your code generated by ChatGPT really correct? Rigorous evaluation of large language models for code generation. In Thirty-seventh Conference on Neural Information Processing Systems.