pith. machine review for the scientific record.

arxiv: 2605.02442 · v1 · submitted 2026-05-04 · 💻 cs.AI · cs.CL

Recognition: 3 theorem links · Lean Theorem

Measuring AI Reasoning: A Guide for Researchers

Kareem Ali, Kentaro Inui, Munachiso Samuel Nwadike, Rifo Genadi, Zangir Iklassov

Pith reviewed 2026-05-08 18:49 UTC · model grok-4.3

classification 💻 cs.AI cs.CL
keywords reasoning evaluation · language models · process-based assessment · intermediate traces · search procedures · AI diagnostics · multi-step computation

The pith

Evaluating AI reasoning requires assessing intermediate steps rather than final answers alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper makes the case that final-answer accuracy provides insufficient insight into how language models reason. Reasoning is defined as an adaptive, multi-step search process that selects intermediate steps and halts based on input-specific conditions. A single forward pass cannot realize this variable-depth computation, so intermediate decoding and externalized traces are needed. Assessing the faithfulness and validity of those traces then becomes the primary way to diagnose and debug model processes.

Core claim

Under an evaluation-oriented definition, reasoning requires selecting intermediate steps and halting according to input-dependent conditions, formalized as a search-like procedure. Single forward passes in scalable architectures are structurally limited in realizing such variable-depth computation. Final-answer accuracy alone is insufficient because it provides little ability to diagnose or debug the underlying processes that produce individual solutions in frontier models. Therefore, reasoning should be assessed through the faithfulness and validity of intermediate reasoning traces as first-class evaluation targets.

What carries the argument

The formalization of reasoning as a search-like procedure, with input-dependent selection of steps and halting conditions, carries the argument for shifting to process-based evaluation.
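A minimal sketch of that search-like procedure (the names reason, next_state, and should_halt are illustrative, not the paper's notation): reasoning maps an input to a target through intermediate states s_t, and both the transition choice and the decision to halt depend on the current state, so the number of steps varies with the input.

    from typing import Callable, TypeVar

    State = TypeVar("State")

    def reason(
        s0: State,
        next_state: Callable[[State], State],   # input-dependent transition choice
        should_halt: Callable[[State], bool],   # input-conditioned stopping criterion
        max_steps: int = 1000,
    ) -> tuple[State, list[State]]:
        """Iterate s_{t+1} = next_state(s_t) until should_halt(s_t) holds.
        Returns the final state plus the externalized trace of intermediate states."""
        trace = [s0]
        s = s0
        for _ in range(max_steps):
            if should_halt(s):
                break
            s = next_state(s)
            trace.append(s)
        return s, trace

    # Toy instance: halve an integer until it is odd. The trace length depends
    # on the input; that is the variable-depth computation a fixed-depth
    # forward pass cannot vary.
    final, trace = reason(96, lambda n: n // 2, lambda n: n % 2 == 1)
    print(trace)  # [96, 48, 24, 12, 6, 3]

Evaluating only final discards trace, and the trace is precisely the object the paper wants judged for faithfulness and validity.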

If this is right

  • Researchers gain better diagnostic tools for understanding why models succeed or fail on specific instances.
  • Evaluation protocols should prioritize interfaces that expose and assess reasoning traces.
  • Model development may need to focus on architectures supporting multi-step, adaptive computation.
  • Debugging becomes feasible at the level of individual steps rather than only outcomes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Training objectives could be updated to reward faithful trace generation in addition to correct answers.
  • Existing benchmarks might be reanalyzed to check for correlation between accuracy and trace quality.
  • New tasks designed for variable computation depth could better test reasoning capabilities.
  • Automated methods for validating traces might emerge as a complementary research area.

Load-bearing premise

Intermediate reasoning traces can be reliably judged for faithfulness and validity, and single forward passes cannot support the variable-depth computation required for reasoning.

What would settle it

A large-scale study finding that final-answer accuracy does not predict the presence of valid reasoning traces, or one showing that models achieve high accuracy exclusively through flawed intermediate processes, with trace validity verified independently of the final answer.
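A hedged sketch of the analysis such a study would run, assuming per-item labels for final-answer correctness and independently verified trace validity (the items and numbers below are invented for illustration, not results from the paper):

    # Does final-answer accuracy predict valid reasoning traces?
    # Items with answer_correct=True but trace_valid=False are the
    # "right answer, flawed process" cases that accuracy alone cannot see,
    # e.g. "16/64 = 1/4 by cancelling the 6s": correct value, invalid step.
    items = [
        (True, True), (True, False), (False, False), (True, True),
        (True, False), (False, True), (True, True), (True, False),
    ]  # (answer_correct, trace_valid), one pair per benchmark item

    accuracy = sum(a for a, _ in items) / len(items)
    valid_rate = sum(v for _, v in items) / len(items)
    flawed_but_correct = sum(1 for a, v in items if a and not v) / len(items)

    print(f"final-answer accuracy:        {accuracy:.2f}")            # 0.75
    print(f"valid-trace rate:             {valid_rate:.2f}")          # 0.50
    print(f"correct answer, flawed trace: {flawed_but_correct:.2f}")  # 0.38

A persistent gap between the first two quantities, driven by the third, would support the paper's position; near-perfect agreement across tasks and models would weaken it.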

Figures

Figures reproduced from arXiv: 2605.02442 by Kareem Ali, Kentaro Inui, Munachiso Samuel Nwadike, Rifo Genadi, Zangir Iklassov.

Figure 1: Recurring training-phase patterns reported in (Liu et al., 2022). The key implication for evaluation is that models can shift between qualitatively different solution regimes under the same task, motivating diagnostics beyond aggregate accuracy. From the perspective of evaluation, we treat grokking as a special case of comprehension, characterized by delayed generalization to held-out data. In this framewo… view at source ↗
Figure 2: Reasoning as search: We define reasoning as a search process that maps input concepts A to target concepts B through a sequence of intermediate states s_t. Both the choice of the next state transition and when to halt depend on intermediate states, and the process terminates once the input-conditioned stopping criterion is satisfied. A simple search task can be described as a sequence of input-dependent sta… view at source ↗
Figure 3: Contamination provides an operational lens for organizing capabilities in this hierarchy. With task contamination, apparent reasoning can collapse into comprehension, defined as token-level factual associations. With dataset contamination, evaluation can further collapse into memorization, a degenerate case of comprehension involving near token-exact reproduction. The concentric structure denotes procedu… view at source ↗
Original abstract

In this paper, we offer a guide for researchers on evaluating reasoning in language models, building the case that reasoning should be assessed through evidence of adaptive, multi-step search rather than final-answer accuracy alone. Under an evaluation-oriented definition, reasoning requires selecting intermediate steps and halting according to input-dependent conditions, which we formalize as a search-like procedure. We show that single forward passes in scalable architectures are structurally limited in their ability to realize such variable-depth computation, motivating intermediate decoding and externalized reasoning traces as appropriate evaluation interfaces. Central to our argument is that final-answer accuracy alone is an insufficient measure of reasoning, because it provides little ability to diagnose or debug the underlying processes that produce individual solutions in frontier models. We therefore argue for a shift toward process-based evaluation, in which reasoning is assessed through the faithfulness and validity of intermediate reasoning traces as first-class evaluation targets.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper offers a guide for researchers on evaluating reasoning in language models, arguing that reasoning should be assessed via evidence of adaptive, multi-step search rather than final-answer accuracy alone. It defines reasoning under an evaluation-oriented lens as requiring input-dependent selection of intermediate steps and halting conditions, formalized as a search-like procedure. The authors claim that single forward passes in scalable architectures are structurally limited for variable-depth computation, motivating intermediate decoding and externalized reasoning traces. The central thesis is that final-answer accuracy provides little diagnostic power for process failures in frontier models, so evaluation should treat the faithfulness and validity of intermediate traces as first-class targets.

Significance. If the definitional framework is adopted, the paper could usefully redirect evaluation practices in AI toward more process-oriented methods that better support debugging and diagnosis. The argument is internally consistent once the search-procedure definition is granted, and the absence of empirical claims or parameter fitting is appropriate for a prescriptive guide. Credit is due for the explicit, non-circular formalization of reasoning and the clear linkage between the variable-depth requirement and the call for externalized traces.

major comments (2)
  1. [Abstract] The claim that single forward passes are 'structurally limited' in realizing variable-depth computation is load-bearing for the recommendation of process-based evaluation, yet the manuscript provides only a definitional derivation rather than a formal complexity argument or a reference explaining why fixed-depth feed-forward computation cannot simulate input-dependent halting without external mechanisms.
  2. [process-based evaluation discussion] The section motivating process-based evaluation: the assumption that intermediate reasoning traces can be reliably assessed for faithfulness and validity is presented as enabling the shift away from final-answer metrics, but no discussion is given of verification methods, inter-annotator reliability, or failure modes when traces are unfaithful, which directly affects the practicality of the proposed evaluation target.
minor comments (2)
  1. [Abstract] The abstract introduces 'scalable architectures' without specifying whether this refers exclusively to transformers or includes other fixed-depth models; a brief clarification would improve precision.
  2. [Introduction or motivation section] A short concrete example illustrating how final-answer accuracy alone fails to diagnose a specific process failure (e.g., an incorrect intermediate step that happens to yield a correct final answer) would strengthen the central claim without altering the paper's scope.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and positive feedback on our guide. We address each major comment below and outline revisions that will strengthen the manuscript without altering its prescriptive focus.

Point-by-point responses
  1. Referee: [Abstract] The claim that single forward passes are 'structurally limited' in realizing variable-depth computation is load-bearing for the recommendation of process-based evaluation, yet the manuscript provides only a definitional derivation rather than a formal complexity argument or a reference explaining why fixed-depth feed-forward computation cannot simulate input-dependent halting without external mechanisms.

    Authors: We agree the structural-limitation claim is central. Our argument derives directly from the fixed per-pass depth of transformer architectures and the input-dependent halting requirement in the search-procedure definition. To address the request for additional support, we will add brief references to established results on the computational limitations of fixed-depth feed-forward networks and transformers (e.g., their inability to simulate arbitrary-depth computation without external mechanisms). We will revise the abstract and the relevant section to include these citations and a short clarifying sentence. revision: yes
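For context, this is the shape of the result being referenced (our gloss of the log-precision analysis in reference [28] of the list at the end of this page, not a statement from the paper under review):

    % A log-precision transformer computes, per forward pass, a function in
    % uniform TC^0 (constant-depth threshold circuits), so per-pass depth is
    % fixed by the architecture rather than by the input:
    \[
      f_\theta : \Sigma^{n} \to \Sigma, \qquad f_\theta \in \mathrm{TC}^0 .
    \]
    % An input-dependent halting time $T(x)$ that grows without bound cannot
    % be realized inside one pass; it must be externalized as $T(x)$ decoding
    % steps, each itself a fixed-depth map, which is what motivates
    % intermediate decoding as an evaluation interface.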

  2. Referee: [process-based evaluation discussion] The section motivating process-based evaluation: the assumption that intermediate reasoning traces can be reliably assessed for faithfulness and validity is presented as enabling the shift away from final-answer metrics, but no discussion is given of verification methods, inter-annotator reliability, or failure modes when traces are unfaithful, which directly affects the practicality of the proposed evaluation target.

    Authors: We acknowledge that practical assessment of traces requires explicit treatment of verification challenges. In the revised manuscript we will expand the process-based evaluation section with a short subsection covering verification methods (human step-wise validation, automated consistency checks), inter-annotator reliability considerations, and common failure modes such as unfaithful or post-hoc rationalization traces. This addition will better equip readers to implement the proposed evaluation targets. revision: yes
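A minimal sketch of one such automated consistency check, restricted to traces whose steps are machine-checkable arithmetic identities (the trace format and the check_step helper are our assumptions for illustration, not the paper's protocol):

    def check_step(step: str) -> bool:
        """Verify one 'lhs = rhs' arithmetic step by evaluating both sides.
        Malformed or numerically false steps return False."""
        parts = step.split("=")
        if len(parts) != 2:
            return False
        try:
            env = {"__builtins__": {}}
            return eval(parts[0], env) == eval(parts[1], env)
        except Exception:
            return False

    # A trace that reaches the right answer (17 * 6 = 102) through an
    # invalid intermediate step: the unfaithful-trace failure mode above.
    trace = ["17 * 6 = 17 * 5", "17 * 5 = 85", "85 + 17 = 102"]
    for step in trace:
        print("ok " if check_step(step) else "BAD", step)
    # BAD 17 * 6 = 17 * 5   <- step is false as written, yet 102 is correct

Note that a step-level check like this still passes arithmetically true but procedurally unjustified steps (the "cancel the 6s" case), which is why human step-wise validation and inter-annotator reliability remain part of the proposed subsection.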

Circularity Check

0 steps flagged

No significant circularity identified

Full rationale

The paper advances an explicit evaluation-oriented definition of reasoning as requiring adaptive, input-dependent selection of intermediate steps with halting conditions, formalized as a variable-depth search procedure. From this definition it directly infers that fixed-depth single forward passes cannot realize the required computation and that final-answer accuracy therefore supplies insufficient diagnostic information. This chain is self-contained and definitional; no claim reduces by construction to a fitted parameter, a self-citation, an ansatz smuggled via prior work, or a renamed empirical pattern. The central recommendation for process-based evaluation follows logically once the definition is granted and does not loop back to any unverified input or external result.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on a domain-specific definition of reasoning and an assumption about architectural limitations, with no free parameters or new entities introduced.

axioms (1)
  • Domain assumption: Reasoning requires selecting intermediate steps and halting according to input-dependent conditions, formalized as a search-like procedure.
    This definition is invoked to distinguish reasoning from final-answer accuracy and to motivate the need for intermediate decoding.

pith-pipeline@v0.9.0 · 5455 in / 1120 out tokens · 75379 ms · 2026-05-08T18:49:22.278350+00:00 · methodology


Reference graph

Works this paper leans on

160 extracted references · 58 canonical work pages · 10 internal anchors

  1. [1] Talmor, Alon and Herzig, Jonathan and Lourie, Nicholas and Berant, Jonathan. CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge. NAACL 2019.
  2. [2] Zellers, Rowan and Holtzman, Ari and Bisk, Yonatan and Farhadi, Ali and Choi, Yejin. HellaSwag: Can a Machine Really Finish Your Sentence? ACL 2019.
  3. [3] Mostafazadeh, Nasrin and Chambers, Nathanael and He, Xiaodong and Parikh, Devi and Batra, Dhruv and Vanderwende, Lucy and Kohli, Pushmeet and Allen, James. A Corpus and Cloze Evaluation for Deeper Understanding of Commonsense Stories. NAACL 2016.
  4. [4] Schwartz, Roy and Sap, Maarten and Konstas, Ioannis and Zilles, Leila and Choi, Yejin and Smith, Noah A. The Effect of Different Writing Tasks on Linguistic Style: A Case Study of the ROC Story Cloze Task. CoNLL 2017.
  5. [5] Habernal, Ivan and Wachsmuth, Henning and Gurevych, Iryna and Stein, Benno. The Argument Reasoning Comprehension Task: Identification and Reconstruction of Implicit Warrants. NAACL 2018.
  6. [6] Niven, Timothy and Kao, Hung-Yu. Probing Neural Network Comprehension of Natural Language Arguments. ACL 2019.
  7. [7] Bowman, Samuel and Angeli, Gabor and Potts, Christopher and Manning, Christopher D. A Large Annotated Corpus for Learning Natural Language Inference. EMNLP 2015.
  8. [8] Williams, Adina and Nangia, Nikita and Bowman, Samuel. A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference. NAACL 2018.
  9. [9] Gururangan, Suchin and Swayamdipta, Swabha and Levy, Omer and Schwartz, Roy and Bowman, Samuel and Smith, Noah A. Annotation Artifacts in Natural Language Inference Data. NAACL 2018.
  10. [10] Zellers, Rowan and Bisk, Yonatan and Schwartz, Roy and Choi, Yejin. SWAG: A Large-Scale Adversarial Dataset for Grounded Commonsense Inference. EMNLP 2018.
  11. [11] Levesque, Hector and Davis, Ernest and Morgenstern, Leora. The Winograd Schema Challenge. KR 2012.
  12. [12] Trichelair, Paul and Emami, Ali and Trischler, Adam and Suleman, Kaheer and Cheung, Jackie Chi Kit. On the Evaluation of Common-Sense Reasoning in Natural Language Understanding. Workshop on Generalization in the Age of Deep Learning, 2018.
  13. [13] Pearl, Judea. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. 2014.
  14. [14] Pearl, Judea. Causality. 2009.
  15. [15] The Routledge Dictionary of Philosophy. 2009. doi:10.4324/9780203428467.
  16. [16] Descartes, René. Rules for the Direction of the Mind. 1985.
  17. [17] Aristotle. c. 350 BCE.
  18. [18] Kant, Immanuel. 1781.
  19. [19] Locke, John. 2004.
  20. [20] Less is More: Recursive Reasoning with Tiny Networks. arXiv:2510.04871. https://arxiv.org/abs/2510.04871
  21. [21] Mitchell, Tom M. 1997.
  22. [22] The Platonic Representation Hypothesis. arXiv:2405.07987, 2024.
  23. [23] Task Contamination: Language Models May Not Be Few-Shot Anymore. Proceedings of the AAAI Conference on Artificial Intelligence.
  24. [24] Singh and Kocyigit, Muhammed Yusuf and Poulton, Andrew and Esiobu, David and Lomeli, Maria and Szilvasy, Gergely and Hupkes, Dieuwke. Evaluation Data Contamination in LLMs: How Do We Measure It and (When) Does It Matter? arXiv:2411.03923.
  25. [25] Investigating Data Contamination in Modern Benchmarks for Large Language Models. NAACL 2024.
  26. [26] Liu et al. Towards Understanding Grokking: An Effective Theory of Representation Learning. NeurIPS 2022.
  27. [27] Saturated Transformers are Constant-Depth Circuits. Findings of the ACL: EMNLP 2022.
  28. [28] The Parallelism Tradeoff: Limitations of Log-Precision Transformers. TACL 2023.
  29. [29] Transformers as Decision Makers: Provable Guarantees for Bandits and Reinforcement Learning. arXiv:2402.09548.
  30. [30] Wei, Jason and Wang, Xuezhi and Schuurmans, Dale and Bosma, Maarten and Chi, Ed and Le, Quoc and Zhou, Denny. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. NeurIPS 2022.
  31. [31] The Expressive Power of Transformers with Chain of Thought. arXiv:2310.07923, 2024.
  32. [32] On Uniformity within NC1. Journal of Computer and System Sciences, 1990.
  33. [33] The Illusion of State in State-Space Models. arXiv:2404.08819.
  34. [34] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. arXiv:2402.01339.
  35. [35] DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. Nature, 2025.
  36. [36] Making Reasoning Matter: Measuring and Improving Faithfulness of Chain-of-Thought Reasoning. Findings of the ACL: EMNLP 2024.
  37. [37] PINTO: Faithful Language Reasoning Using Prompt-Generated Rationales. 2022.
  38. [38] FRIT: Using Causal Importance to Improve Chain-of-Thought Faithfulness. 2025.
  39. [39] RECALL: Library-Like Behavior in Language Models is Enhanced by Self-Referencing Causal Cycles. arXiv:2501.13491.
  40. [40] The Reversal Curse: LLMs Trained on "A is B" Fail to Learn "B is A". arXiv:2309.12288.
  41. [41] Chain of Thought Empowers Transformers to Solve Inherently Serial Problems. arXiv:2402.12875, 2024.
  42. [42] Dong, Yihong and Jiang, Xue and Liu, Huanyu and Jin, Zhi and Gu, Bin and Yang, Mengfei and Li, Ge. Generalization or Memorization: Data Contamination and Trustworthy Evaluation for Large Language Models. Findings of the ACL 2024.
  43. [43] Oren, Yonatan and Meister, Nicole and Chatterji, Niladri S. and Ladhak, Faisal and Hashimoto, Tatsunori. Proving Test Set Contamination in Black-Box Language Models. ICLR 2024.
  44. [44] Sainz, Oscar and Campos, Jon and García-Ferrero, Iker and Etxaniz, Julen and de Lacalle, Oier Lopez and Agirre, Eneko. NLP Evaluation in Trouble: On the Need to Measure LLM Data Contamination for Each Benchmark. Findings of the ACL: EMNLP 2023.
  45. [45] Yang, Shuo and Chiang, Wei-Lin and Zheng, Lianmin and Gonzalez, Joseph E. and Stoica, Ion. Rethinking Benchmark and Contamination for Language Models with Rephrased Samples. arXiv:2311.04850, 2023.
  46. [46] Cheng, Yuxing and Chang, Yi and Wu, Yuan. A Survey on Data Contamination for Large Language Models. arXiv:2502.14425, 2025.
  47. [47] Deng, Chunyuan and Zhao, Yilun and Tang, Xiangru and Gerstein, Mark and Cohan, Arman. Investigating Data Contamination in Modern Benchmarks for Large Language Models. NAACL 2024.
  48. [48] Golchin, Shahriar and Surdeanu, Mihai. Data Contamination Quiz: A Tool to Detect and Estimate Contamination in Large Language Models. TACL 2025. doi:10.1162/tacl_a_00720.
  49. [49] CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning. CVPR.
  50. [50] GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering. CVPR.
  51. [51] Neural Module Networks. CVPR.
  52. [52] Compositional Attention Networks for Machine Reasoning. ICLR.
  53. [53] Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering. CVPR.
  54. [54] Overcoming Language Priors in Visual Question Answering with Adversarial Regularization. NeurIPS.
  55. [55] MoReVQA: Exploring Modular Reasoning Models for Video Question Answering. CVPR.
  56. [56] Video Diffusion Models. NeurIPS.
  57. [57] Think-Then-Generate: Reasoning-Aware Text-to-Image Diffusion with LLM Encoders. arXiv:2601.10332.
  58. [58] Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs. ICML.
  59. [59] VideoCoF: Unified Video Editing with Temporal Reasoner. arXiv:2512.07469.
  60. [60] Google DeepMind. 2025.
  61. [61] Anthropic. 2025.
  62. [62] OpenAI. 2025.
  63. [63] DeepSeek-AI.
  64. [64] The Llama 3 Herd of Models. arXiv:2407.21783.
  65. [65] Meta Llama 3 Model Card.
  66. [66] Llama 3 Evaluation Details.
  67. [67] Benchmark Data Contamination of Large Language Models: A Survey. 2024.
  68. [68] Inference-Time Decontamination: Reusing Leaked Benchmarks for Large Language Model Evaluation. Findings of the ACL: EMNLP 2024. doi:10.18653/v1/2024.findings-emnlp.532.
  69. [69] Quantifying Memorization Across Neural Language Models. ICLR. arXiv:2202.07646.
  70. [70] Extracting Training Data from Large Language Models. USENIX Security Symposium.
  71. [71] Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting. 2023.
  72. [72] Measuring Faithfulness in Chain-of-Thought Reasoning. 2023.
  73. [73] Reasoning Models Don't Always Say What They Think. 2025.
  74. [74] Leaving the Barn Door Open for Clever Hans: Simple Features Predict LLM Benchmark Answers. 2024.
  75. [75] Large Language Models Are Not Robust Multiple Choice Selectors. ICLR.
  76. [76] Large Language Models Sensitivity to the Order of Options in Multiple-Choice Questions. Findings of the ACL: NAACL 2024.
  77. [77] Yuan, Yu and Zhao, Lili and Zhang, Kai and Zheng, Guangting and Liu, Qi. Do… EMNLP 2024. doi:10.18653/v1/2024.emnlp-main.679.
  78. [78] Annotation Artifacts in Natural Language Inference Data. NAACL 2018 (Volume 2: Short Papers). doi:10.18653/v1/N18-2017.
  79. [79] Hypothesis Only Baselines in Natural Language Inference. Seventh Joint Conference on Lexical and Computational Semantics (*SEM), 2018. doi:10.18653/v1/S18-2023.
  80. [80] Shortcut Learning in Deep Neural Networks. Nature Machine Intelligence.

Showing first 80 references.