pith. machine review for the scientific record.

arxiv: 2605.02442 · v1 · submitted 2026-05-04 · 💻 cs.AI · cs.CL

Recognition: 3 theorem links · Lean Theorem

Measuring AI Reasoning: A Guide for Researchers

Kareem Ali, Kentaro Inui, Munachiso Samuel Nwadike, Rifo Genadi, Zangir Iklassov

Pith reviewed 2026-05-08 18:49 UTC · model grok-4.3

classification 💻 cs.AI cs.CL
keywords reasoning evaluation · language models · process-based assessment · intermediate traces · search procedures · AI diagnostics · multi-step computation

The pith

Evaluating AI reasoning requires assessing intermediate steps rather than final answers alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper makes the case that final-answer accuracy provides insufficient insight into how language models reason. Reasoning is defined as an adaptive, multi-step search process that selects intermediate steps and halts based on input-specific conditions. A single forward pass cannot realize this variable-depth computation, so intermediate decoding and externalized traces are needed. Assessing the faithfulness and validity of those traces then becomes the primary way to diagnose and debug model processes.

Core claim

Under an evaluation-oriented definition, reasoning requires selecting intermediate steps and halting according to input-dependent conditions, formalized as a search-like procedure. Single forward passes in scalable architectures are structurally limited in realizing such variable-depth computation. Final-answer accuracy alone is insufficient because it provides little ability to diagnose or debug the underlying processes that produce individual solutions in frontier models. Therefore, reasoning should be assessed through the faithfulness and validity of intermediate reasoning traces as first-class evaluation targets.

What carries the argument

The formalization of reasoning as a search-like procedure, with input-dependent selection of steps and halting conditions, carries the argument for shifting to process-based evaluation.
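A minimal sketch of that search-like procedure (the names reason, next_state, and should_halt are illustrative, not the paper's notation): reasoning maps an input to a target through intermediate states s_t, and both the transition choice and the decision to halt depend on the current state, so the number of steps varies with the input.

    from typing import Callable, TypeVar

    State = TypeVar("State")

    def reason(
        s0: State,
        next_state: Callable[[State], State],   # input-dependent transition choice
        should_halt: Callable[[State], bool],   # input-conditioned stopping criterion
        max_steps: int = 1000,
    ) -> tuple[State, list[State]]:
        """Iterate s_{t+1} = next_state(s_t) until should_halt(s_t) holds.
        Returns the final state plus the externalized trace of intermediate states."""
        trace = [s0]
        s = s0
        for _ in range(max_steps):
            if should_halt(s):
                break
            s = next_state(s)
            trace.append(s)
        return s, trace

    # Toy instance: halve an integer until it is odd. The trace length depends
    # on the input; that is the variable-depth computation a fixed-depth
    # forward pass cannot vary.
    final, trace = reason(96, lambda n: n // 2, lambda n: n % 2 == 1)
    print(trace)  # [96, 48, 24, 12, 6, 3]

Evaluating only final discards trace, and the trace is precisely the object the paper wants judged for faithfulness and validity.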

If this is right

  • Researchers gain better diagnostic tools for understanding why models succeed or fail on specific instances.
  • Evaluation protocols should prioritize interfaces that expose and assess reasoning traces.
  • Model development may need to focus on architectures supporting multi-step, adaptive computation.
  • Debugging becomes feasible at the level of individual steps rather than only outcomes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Training objectives could be updated to reward faithful trace generation in addition to correct answers.
  • Existing benchmarks might be reanalyzed to check for correlation between accuracy and trace quality.
  • New tasks designed for variable computation depth could better test reasoning capabilities.
  • Automated methods for validating traces might emerge as a complementary research area.

Load-bearing premise

Intermediate reasoning traces can be reliably judged for faithfulness and validity, and single forward passes cannot support the variable-depth computation required for reasoning.

What would settle it

A large-scale study finding that final-answer accuracy does not predict the presence of valid reasoning traces, or one showing that models achieve high accuracy exclusively through flawed intermediate processes, with trace validity verified independently of the final answer.
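A hedged sketch of the analysis such a study would run, assuming per-item labels for final-answer correctness and independently verified trace validity (the items and numbers below are invented for illustration, not results from the paper):

    # Does final-answer accuracy predict valid reasoning traces?
    # Items with answer_correct=True but trace_valid=False are the
    # "right answer, flawed process" cases that accuracy alone cannot see,
    # e.g. "16/64 = 1/4 by cancelling the 6s": correct value, invalid step.
    items = [
        (True, True), (True, False), (False, False), (True, True),
        (True, False), (False, True), (True, True), (True, False),
    ]  # (answer_correct, trace_valid), one pair per benchmark item

    accuracy = sum(a for a, _ in items) / len(items)
    valid_rate = sum(v for _, v in items) / len(items)
    flawed_but_correct = sum(1 for a, v in items if a and not v) / len(items)

    print(f"final-answer accuracy:        {accuracy:.2f}")            # 0.75
    print(f"valid-trace rate:             {valid_rate:.2f}")          # 0.50
    print(f"correct answer, flawed trace: {flawed_but_correct:.2f}")  # 0.38

A persistent gap between the first two quantities, driven by the third, would support the paper's position; near-perfect agreement across tasks and models would weaken it.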

Figures

Figures reproduced from arXiv: 2605.02442 by Kareem Ali, Kentaro Inui, Munachiso Samuel Nwadike, Rifo Genadi, Zangir Iklassov.

Figure 1: Recurring training-phase patterns reported in (Liu et al., 2022). The key implication for evaluation is that models can shift between qualitatively different solution regimes under the same task, motivating diagnostics beyond aggregate accuracy. From the perspective of evaluation, we treat grokking as a special case of comprehension, characterized by delayed generalization to held-out data. In this framewo… view at source ↗
Figure 2: Reasoning as search: We define reasoning as a search process that maps input concepts A to target concepts B through a sequence of intermediate states s_t. Both the choice of the next state transition and when to halt depend on intermediate states, and the process terminates once the input-conditioned stopping criterion is satisfied. A simple search task can be described as a sequence of input-dependent sta… view at source ↗
Figure 3: Contamination provides an operational lens for organizing capabilities in this hierarchy. With task contamination, apparent reasoning can collapse into comprehension, defined as token-level factual associations. With dataset contamination, evaluation can further collapse into memorization, a degenerate case of comprehension involving near token-exact reproduction. The concentric structure denotes procedu… view at source ↗
Original abstract

In this paper, we offer a guide for researchers on evaluating reasoning in language models, building the case that reasoning should be assessed through evidence of adaptive, multi-step search rather than final-answer accuracy alone. Under an evaluation-oriented definition, reasoning requires selecting intermediate steps and halting according to input-dependent conditions, which we formalize as a search-like procedure. We show that single forward passes in scalable architectures are structurally limited in their ability to realize such variable-depth computation, motivating intermediate decoding and externalized reasoning traces as appropriate evaluation interfaces. Central to our argument is that final-answer accuracy alone is an insufficient measure of reasoning, because it provides little ability to diagnose or debug the underlying processes that produce individual solutions in frontier models. We therefore argue for a shift toward process-based evaluation, in which reasoning is assessed through the faithfulness and validity of intermediate reasoning traces as first-class evaluation targets.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper offers a guide for researchers on evaluating reasoning in language models, arguing that reasoning should be assessed via evidence of adaptive, multi-step search rather than final-answer accuracy alone. It defines reasoning under an evaluation-oriented lens as requiring input-dependent selection of intermediate steps and halting conditions, formalized as a search-like procedure. The authors claim that single forward passes in scalable architectures are structurally limited for variable-depth computation, motivating intermediate decoding and externalized reasoning traces. The central thesis is that final-answer accuracy provides little diagnostic power for process failures in frontier models, so evaluation should treat the faithfulness and validity of intermediate traces as first-class targets.

Significance. If the definitional framework is adopted, the paper could usefully redirect evaluation practices in AI toward more process-oriented methods that better support debugging and diagnosis. The argument is internally consistent once the search-procedure definition is granted, and the absence of empirical claims or parameter fitting is appropriate for a prescriptive guide. Credit is due for the explicit, non-circular formalization of reasoning and the clear linkage between the variable-depth requirement and the call for externalized traces.

major comments (2)
  1. [Abstract] The claim that single forward passes are 'structurally limited' in realizing variable-depth computation is load-bearing for the recommendation of process-based evaluation, yet the manuscript provides only a definitional derivation rather than a formal complexity argument or a reference explaining why fixed-depth feed-forward computation cannot simulate input-dependent halting without external mechanisms.
  2. [process-based evaluation discussion] The section motivating process-based evaluation: the assumption that intermediate reasoning traces can be reliably assessed for faithfulness and validity is presented as enabling the shift away from final-answer metrics, but no discussion is given of verification methods, inter-annotator reliability, or failure modes when traces are unfaithful, which directly affects the practicality of the proposed evaluation target.
minor comments (2)
  1. [Abstract] The abstract introduces 'scalable architectures' without specifying whether this refers exclusively to transformers or includes other fixed-depth models; a brief clarification would improve precision.
  2. [Introduction or motivation section] A short concrete example illustrating how final-answer accuracy alone fails to diagnose a specific process failure (e.g., an incorrect intermediate step that happens to yield a correct final answer) would strengthen the central claim without altering the paper's scope.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and positive feedback on our guide. We address each major comment below and outline revisions that will strengthen the manuscript without altering its prescriptive focus.

Point-by-point responses
  1. Referee: [Abstract] The claim that single forward passes are 'structurally limited' in realizing variable-depth computation is load-bearing for the recommendation of process-based evaluation, yet the manuscript provides only a definitional derivation rather than a formal complexity argument or a reference explaining why fixed-depth feed-forward computation cannot simulate input-dependent halting without external mechanisms.

    Authors: We agree the structural-limitation claim is central. Our argument derives directly from the fixed per-pass depth of transformer architectures and the input-dependent halting requirement in the search-procedure definition. To address the request for additional support, we will add brief references to established results on the computational limitations of fixed-depth feed-forward networks and transformers (e.g., their inability to simulate arbitrary-depth computation without external mechanisms). We will revise the abstract and the relevant section to include these citations and a short clarifying sentence. revision: yes
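For context, this is the shape of the result being referenced (our gloss of the log-precision analysis in reference [28] of the list at the end of this page, not a statement from the paper under review):

    % A log-precision transformer computes, per forward pass, a function in
    % uniform TC^0 (constant-depth threshold circuits), so per-pass depth is
    % fixed by the architecture rather than by the input:
    \[
      f_\theta : \Sigma^{n} \to \Sigma, \qquad f_\theta \in \mathrm{TC}^0 .
    \]
    % An input-dependent halting time $T(x)$ that grows without bound cannot
    % be realized inside one pass; it must be externalized as $T(x)$ decoding
    % steps, each itself a fixed-depth map, which is what motivates
    % intermediate decoding as an evaluation interface.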

  2. Referee: [process-based evaluation discussion] The section motivating process-based evaluation: the assumption that intermediate reasoning traces can be reliably assessed for faithfulness and validity is presented as enabling the shift away from final-answer metrics, but no discussion is given of verification methods, inter-annotator reliability, or failure modes when traces are unfaithful, which directly affects the practicality of the proposed evaluation target.

    Authors: We acknowledge that practical assessment of traces requires explicit treatment of verification challenges. In the revised manuscript we will expand the process-based evaluation section with a short subsection covering verification methods (human step-wise validation, automated consistency checks), inter-annotator reliability considerations, and common failure modes such as unfaithful or post-hoc rationalization traces. This addition will better equip readers to implement the proposed evaluation targets. revision: yes
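A minimal sketch of one such automated consistency check, restricted to traces whose steps are machine-checkable arithmetic identities (the trace format and the check_step helper are our assumptions for illustration, not the paper's protocol):

    def check_step(step: str) -> bool:
        """Verify one 'lhs = rhs' arithmetic step by evaluating both sides.
        Malformed or numerically false steps return False."""
        parts = step.split("=")
        if len(parts) != 2:
            return False
        try:
            env = {"__builtins__": {}}
            return eval(parts[0], env) == eval(parts[1], env)
        except Exception:
            return False

    # A trace that reaches the right answer (17 * 6 = 102) through an
    # invalid intermediate step: the unfaithful-trace failure mode above.
    trace = ["17 * 6 = 17 * 5", "17 * 5 = 85", "85 + 17 = 102"]
    for step in trace:
        print("ok " if check_step(step) else "BAD", step)
    # BAD 17 * 6 = 17 * 5   <- step is false as written, yet 102 is correct

Note that a step-level check like this still passes arithmetically true but procedurally unjustified steps (the "cancel the 6s" case), which is why human step-wise validation and inter-annotator reliability remain part of the proposed subsection.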

Circularity Check

0 steps flagged

No significant circularity identified

Full rationale

The paper advances an explicit evaluation-oriented definition of reasoning as requiring adaptive, input-dependent selection of intermediate steps with halting conditions, formalized as a variable-depth search procedure. From this definition it directly infers that fixed-depth single forward passes cannot realize the required computation and that final-answer accuracy therefore supplies insufficient diagnostic information. This chain is self-contained and definitional; no claim reduces by construction to a fitted parameter, a self-citation, an ansatz smuggled via prior work, or a renamed empirical pattern. The central recommendation for process-based evaluation follows logically once the definition is granted and does not loop back to any unverified input or external result.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on a domain-specific definition of reasoning and an assumption about architectural limitations, with no free parameters or new entities introduced.

axioms (1)
  • Domain assumption: Reasoning requires selecting intermediate steps and halting according to input-dependent conditions, formalized as a search-like procedure.
    This definition is invoked to distinguish reasoning from final-answer accuracy and to motivate the need for intermediate decoding.

pith-pipeline@v0.9.0 · 5455 in / 1120 out tokens · 75379 ms · 2026-05-08T18:49:22.278350+00:00 · methodology


Reference graph

Works this paper leans on

160 extracted references · 58 canonical work pages · 10 internal anchors

  1. [1] Talmor, Alon and Herzig, Jonathan and Lourie, Nicholas and Berant, Jonathan. CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge. NAACL 2019.
  2. [2] Zellers, Rowan and Holtzman, Ari and Bisk, Yonatan and Farhadi, Ali and Choi, Yejin. HellaSwag: Can a Machine Really Finish Your Sentence? ACL 2019.
  3. [3] Mostafazadeh, Nasrin and Chambers, Nathanael and He, Xiaodong and Parikh, Devi and Batra, Dhruv and Vanderwende, Lucy and Kohli, Pushmeet and Allen, James. A Corpus and Cloze Evaluation for Deeper Understanding of Commonsense Stories. NAACL 2016.
  4. [4] Schwartz, Roy and Sap, Maarten and Konstas, Ioannis and Zilles, Leila and Choi, Yejin and Smith, Noah A. The Effect of Different Writing Tasks on Linguistic Style: A Case Study of the ROC Story Cloze Task. CoNLL 2017.
  5. [5] Habernal, Ivan and Wachsmuth, Henning and Gurevych, Iryna and Stein, Benno. The Argument Reasoning Comprehension Task: Identification and Reconstruction of Implicit Warrants. NAACL 2018.
  6. [6] Niven, Timothy and Kao, Hung-Yu. Probing Neural Network Comprehension of Natural Language Arguments. ACL 2019.
  7. [7] Bowman, Samuel and Angeli, Gabor and Potts, Christopher and Manning, Christopher D. A Large Annotated Corpus for Learning Natural Language Inference. EMNLP 2015.
  8. [8] Williams, Adina and Nangia, Nikita and Bowman, Samuel. A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference. NAACL 2018.
  9. [9] Gururangan, Suchin and Swayamdipta, Swabha and Levy, Omer and Schwartz, Roy and Bowman, Samuel and Smith, Noah A. Annotation Artifacts in Natural Language Inference Data. NAACL 2018.
  10. [10] Zellers, Rowan and Bisk, Yonatan and Schwartz, Roy and Choi, Yejin. SWAG: A Large-Scale Adversarial Dataset for Grounded Commonsense Inference. EMNLP 2018.
  11. [11] Levesque, Hector and Davis, Ernest and Morgenstern, Leora. The Winograd Schema Challenge. KR 2012.
  12. [12] Trichelair, Paul and Emami, Ali and Trischler, Adam and Suleman, Kaheer and Cheung, Jackie Chi Kit. On the Evaluation of Common-Sense Reasoning in Natural Language Understanding. Workshop on Generalization in the Age of Deep Learning, 2018.
  13. [13] Pearl, Judea. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. 2014.
  14. [14] Pearl, Judea. Causality. 2009.
  15. [15] The Routledge Dictionary of Philosophy. 2009. doi:10.4324/9780203428467.
  16. [16] Descartes, René. Rules for the Direction of the Mind. 1985.
  17. [17] Aristotle. c. 350 BCE.
  18. [18] Kant, Immanuel. 1781.
  19. [19] Locke, John. 2004.
  20. [20] Less is More: Recursive Reasoning with Tiny Networks. arXiv:2510.04871. https://arxiv.org/abs/2510.04871
  21. [21] Mitchell, Tom M. 1997.
  22. [22] The Platonic Representation Hypothesis. arXiv:2405.07987, 2024.
  23. [23] Task Contamination: Language Models May Not Be Few-Shot Anymore. Proceedings of the AAAI Conference on Artificial Intelligence.
  24. [24] Singh and Kocyigit, Muhammed Yusuf and Poulton, Andrew and Esiobu, David and Lomeli, Maria and Szilvasy, Gergely and Hupkes, Dieuwke. Evaluation Data Contamination in LLMs: How Do We Measure It and (When) Does It Matter? arXiv:2411.03923.
  25. [25] Investigating Data Contamination in Modern Benchmarks for Large Language Models. NAACL 2024.
  26. [26] Liu et al. Towards Understanding Grokking: An Effective Theory of Representation Learning. NeurIPS 2022.
  27. [27] Saturated Transformers are Constant-Depth Circuits. Findings of the ACL: EMNLP 2022.
  28. [28] The Parallelism Tradeoff: Limitations of Log-Precision Transformers. TACL 2023.
  29. [29] Transformers as Decision Makers: Provable Guarantees for Bandits and Reinforcement Learning. arXiv:2402.09548.
  30. [30] Wei, Jason and Wang, Xuezhi and Schuurmans, Dale and Bosma, Maarten and Chi, Ed and Le, Quoc and Zhou, Denny. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. NeurIPS 2022.
  31. [31] The Expressive Power of Transformers with Chain of Thought. arXiv:2310.07923, 2024.
  32. [32] On Uniformity within NC1. Journal of Computer and System Sciences, 1990.
  33. [33] The Illusion of State in State-Space Models. arXiv:2404.08819.
  34. [34] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. arXiv:2402.01339.
  35. [35] DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. Nature, 2025.
  36. [36] Making Reasoning Matter: Measuring and Improving Faithfulness of Chain-of-Thought Reasoning. Findings of the ACL: EMNLP 2024.
  37. [37] PINTO: Faithful Language Reasoning Using Prompt-Generated Rationales. 2022.
  38. [38] FRIT: Using Causal Importance to Improve Chain-of-Thought Faithfulness. 2025.
  39. [39] RECALL: Library-Like Behavior in Language Models is Enhanced by Self-Referencing Causal Cycles. arXiv:2501.13491.
  40. [40] The Reversal Curse: LLMs Trained on "A is B" Fail to Learn "B is A". arXiv:2309.12288.
  41. [41] Chain of Thought Empowers Transformers to Solve Inherently Serial Problems. arXiv:2402.12875, 2024.
  42. [42] Dong, Yihong and Jiang, Xue and Liu, Huanyu and Jin, Zhi and Gu, Bin and Yang, Mengfei and Li, Ge. Generalization or Memorization: Data Contamination and Trustworthy Evaluation for Large Language Models. Findings of the ACL 2024.
  43. [43] Oren, Yonatan and Meister, Nicole and Chatterji, Niladri S. and Ladhak, Faisal and Hashimoto, Tatsunori. Proving Test Set Contamination in Black-Box Language Models. ICLR 2024.
  44. [44] Sainz, Oscar and Campos, Jon and García-Ferrero, Iker and Etxaniz, Julen and de Lacalle, Oier Lopez and Agirre, Eneko. NLP Evaluation in Trouble: On the Need to Measure LLM Data Contamination for Each Benchmark. Findings of the ACL: EMNLP 2023.
  45. [45] Yang, Shuo and Chiang, Wei-Lin and Zheng, Lianmin and Gonzalez, Joseph E. and Stoica, Ion. Rethinking Benchmark and Contamination for Language Models with Rephrased Samples. arXiv:2311.04850, 2023.
  46. [46] Cheng, Yuxing and Chang, Yi and Wu, Yuan. A Survey on Data Contamination for Large Language Models. arXiv:2502.14425, 2025.
  47. [47] Deng, Chunyuan and Zhao, Yilun and Tang, Xiangru and Gerstein, Mark and Cohan, Arman. Investigating Data Contamination in Modern Benchmarks for Large Language Models. NAACL 2024.
  48. [48] Golchin, Shahriar and Surdeanu, Mihai. Data Contamination Quiz: A Tool to Detect and Estimate Contamination in Large Language Models. TACL 2025. doi:10.1162/tacl_a_00720.
  49. [49] CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning. CVPR.
  50. [50] GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering. CVPR.
  51. [51] Neural Module Networks. CVPR.
  52. [52] Compositional Attention Networks for Machine Reasoning. ICLR.
  53. [53] Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering. CVPR.
  54. [54] Overcoming Language Priors in Visual Question Answering with Adversarial Regularization. NeurIPS.
  55. [55] MoReVQA: Exploring Modular Reasoning Models for Video Question Answering. CVPR.
  56. [56] Video Diffusion Models. NeurIPS.
  57. [57] Think-Then-Generate: Reasoning-Aware Text-to-Image Diffusion with LLM Encoders. arXiv:2601.10332.
  58. [58] Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs. ICML.
  59. [59] VideoCoF: Unified Video Editing with Temporal Reasoner. arXiv:2512.07469.
  60. [60] Google DeepMind. 2025.
  61. [61] Anthropic. 2025.
  62. [62] OpenAI. 2025.
  63. [63] DeepSeek-AI.
  64. [64] The Llama 3 Herd of Models. arXiv:2407.21783.
  65. [65] Meta Llama 3 Model Card.
  66. [66] Llama 3 Evaluation Details.
  67. [67] Benchmark Data Contamination of Large Language Models: A Survey. 2024.
  68. [68] Inference-Time Decontamination: Reusing Leaked Benchmarks for Large Language Model Evaluation. Findings of the ACL: EMNLP 2024. doi:10.18653/v1/2024.findings-emnlp.532.
  69. [69] Quantifying Memorization Across Neural Language Models. ICLR. arXiv:2202.07646.
  70. [70] Extracting Training Data from Large Language Models. USENIX Security Symposium.
  71. [71] Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting. 2023.
  72. [72] Measuring Faithfulness in Chain-of-Thought Reasoning. 2023.
  73. [73] Reasoning Models Don't Always Say What They Think. 2025.
  74. [74] Leaving the Barn Door Open for Clever Hans: Simple Features Predict LLM Benchmark Answers. 2024.
  75. [75] Large Language Models Are Not Robust Multiple Choice Selectors. ICLR.
  76. [76] Large Language Models Sensitivity to the Order of Options in Multiple-Choice Questions. Findings of the ACL: NAACL 2024.
  77. [77] Yuan, Yu and Zhao, Lili and Zhang, Kai and Zheng, Guangting and Liu, Qi. Do… EMNLP 2024. doi:10.18653/v1/2024.emnlp-main.679.
  78. [78] Annotation Artifacts in Natural Language Inference Data. NAACL 2018 (Volume 2: Short Papers). doi:10.18653/v1/N18-2017.
  79. [79] Hypothesis Only Baselines in Natural Language Inference. Seventh Joint Conference on Lexical and Computational Semantics (*SEM), 2018. doi:10.18653/v1/S18-2023.
  80. [80] Shortcut Learning in Deep Neural Networks. Nature Machine Intelligence.

Showing first 80 references.