Measuring AI Reasoning: A Guide for Researchers
Pith reviewed 2026-05-08 18:49 UTC · model grok-4.3
The pith
Evaluating AI reasoning requires assessing intermediate steps rather than final answers alone.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Under an evaluation-oriented definition, reasoning requires selecting intermediate steps and halting according to input-dependent conditions, formalized as a search-like procedure. Single forward passes in scalable architectures are structurally limited in realizing such variable-depth computation. Final-answer accuracy alone is insufficient because it provides little ability to diagnose or debug the underlying processes that produce individual solutions in frontier models. Therefore, reasoning should be assessed through the faithfulness and validity of intermediate reasoning traces as first-class evaluation targets.
What carries the argument
The formalization of reasoning as a search-like procedure involving input-dependent selection of steps and halting conditions, which carries the argument for shifting to process-based evaluation.
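The search-like procedure can be sketched in a few lines (a toy rendering of the definition, not the paper's formalism; names like `select_step` and `should_halt` are illustrative): the number of steps taken is chosen by the input, not fixed in advance.

```python
def reason(state, select_step, should_halt, max_steps=1000):
    """Run an input-dependent search: apply a step-selection rule and
    test a halting condition until the condition fires. The trace of
    intermediate states is returned as a first-class object."""
    trace = [state]
    for _ in range(max_steps):          # safety bound, not a fixed depth
        if should_halt(state):
            return state, trace          # depth varied with the input
        state = select_step(state)
        trace.append(state)
    raise RuntimeError("no halting condition reached")

# Example: repeatedly halve until below 1 -- the number of steps
# depends on the input's magnitude.
result, trace = reason(
    37.0,
    select_step=lambda x: x / 2,
    should_halt=lambda x: x < 1,
)
```

Returning the trace alongside the answer is what makes process-based evaluation possible: the intermediate states, not just `result`, become evaluation targets.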
If this is right
- Researchers gain better diagnostic tools for understanding why models succeed or fail on specific instances.
- Evaluation protocols should prioritize interfaces that expose and assess reasoning traces.
- Model development may need to focus on architectures supporting multi-step, adaptive computation.
- Debugging becomes feasible at the level of individual steps rather than only outcomes.
Where Pith is reading between the lines
- Training objectives could be updated to reward faithful trace generation in addition to correct answers.
- Existing benchmarks might be reanalyzed to check for correlation between accuracy and trace quality.
- New tasks designed for variable computation depth could better test reasoning capabilities.
- Automated methods for validating traces might emerge as a complementary research area.
Load-bearing premise
Intermediate reasoning traces can be reliably judged for faithfulness and validity, and single forward passes cannot support the variable-depth computation required for reasoning.
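The second half of the premise can be made concrete with a toy contrast (our illustration, not the paper's argument): an adaptive loop whose depth tracks the input, versus a fixed-depth unrolling that caps out on inputs requiring more steps than it has layers.

```python
def steps_to_one(n):
    """Adaptive computation: integer-halve until 1, returning the
    step count -- the computation depth depends on n."""
    count = 0
    while n > 1:
        n //= 2
        count += 1
    return count

def steps_to_one_fixed(n, depth=4):
    """Fixed-depth 'unrolled' analogue: exactly `depth` layers, so it
    cannot count past `depth` no matter what the input requires."""
    count = 0
    for _ in range(depth):
        if n > 1:
            n //= 2
            count += 1
    return count

assert steps_to_one(8) == steps_to_one_fixed(8)   # fits within 4 layers
assert steps_to_one(1024) == 10                   # needs 10 steps
assert steps_to_one_fixed(1024) == 4              # fixed depth caps out
```

The analogy is loose (a real forward pass is not a halving loop), but it shows why any fixed per-pass depth has inputs it cannot serve without an external mechanism such as intermediate decoding.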
What would settle it
A large-scale study finding that final-answer accuracy does not predict the presence of valid reasoning traces, or one showing that models achieve high accuracy exclusively through intermediate processes that independent verification reveals to be flawed.
Original abstract
In this paper, we offer a guide for researchers on evaluating reasoning in language models, building the case that reasoning should be assessed through evidence of adaptive, multi-step search rather than final-answer accuracy alone. Under an evaluation-oriented definition, reasoning requires selecting intermediate steps and halting according to input-dependent conditions, which we formalize as a search-like procedure. We show that single forward passes in scalable architectures are structurally limited in their ability to realize such variable-depth computation, motivating intermediate decoding and externalized reasoning traces as appropriate evaluation interfaces. Central to our argument is that final-answer accuracy alone is an insufficient measure of reasoning, because it provides little ability to diagnose or debug the underlying processes that produce individual solutions in frontier models. We therefore argue for a shift toward process-based evaluation, in which reasoning is assessed through the faithfulness and validity of intermediate reasoning traces as first-class evaluation targets.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper offers a guide for researchers on evaluating reasoning in language models, arguing that reasoning should be assessed via evidence of adaptive, multi-step search rather than final-answer accuracy alone. It defines reasoning under an evaluation-oriented lens as requiring input-dependent selection of intermediate steps and halting conditions, formalized as a search-like procedure. The authors claim that single forward passes in scalable architectures are structurally limited for variable-depth computation, motivating intermediate decoding and externalized reasoning traces. The central thesis is that final-answer accuracy provides little diagnostic power for process failures in frontier models, so evaluation should treat the faithfulness and validity of intermediate traces as first-class targets.
Significance. If the definitional framework is adopted, the paper could usefully redirect evaluation practices in AI toward more process-oriented methods that better support debugging and diagnosis. The argument is internally consistent once the search-procedure definition is granted, and the absence of empirical claims or parameter fitting is appropriate for a prescriptive guide. Credit is due for the explicit, non-circular formalization of reasoning and the clear linkage between the variable-depth requirement and the call for externalized traces.
major comments (2)
- [Abstract] Abstract: the claim that single forward passes are 'structurally limited' in realizing variable-depth computation is load-bearing for the recommendation of process-based evaluation, yet the manuscript provides only a definitional derivation rather than a formal complexity argument or reference to why fixed-depth feed-forward computation cannot simulate input-dependent halting without external mechanisms.
- [process-based evaluation discussion] The section motivating process-based evaluation: the assumption that intermediate reasoning traces can be reliably assessed for faithfulness and validity is presented as enabling the shift away from final-answer metrics, but no discussion is given of verification methods, inter-annotator reliability, or failure modes when traces are unfaithful, which directly affects the practicality of the proposed evaluation target.
minor comments (2)
- [Abstract] The abstract introduces 'scalable architectures' without specifying whether this refers exclusively to transformers or includes other fixed-depth models; a brief clarification would improve precision.
- [Introduction or motivation section] A short concrete example illustrating how final-answer accuracy alone fails to diagnose a specific process failure (e.g., an incorrect intermediate step that happens to yield a correct final answer) would strengthen the central claim without altering the paper's scope.
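The example the referee asks for can be rendered as a toy check (our illustration, not the paper's): two arithmetic errors cancel, so an answer-only metric passes while a per-step validity check fails.

```python
def check_trace(steps):
    """Each step is (a, op, b, claimed). Return indices of steps whose
    claimed result does not follow from the operands."""
    ops = {"+": lambda a, b: a + b, "*": lambda a, b: a * b}
    return [i for i, (a, op, b, claimed) in enumerate(steps)
            if ops[op](a, b) != claimed]

# Claimed reasoning for 3 * 4 + 2, with two compensating errors:
trace = [
    (3, "*", 4, 11),   # wrong: 3 * 4 is 12, the trace claims 11
    (11, "+", 2, 14),  # wrong: 11 + 2 is 13, yet the claimed 14
]                      # happens to be the correct final answer

final_correct = trace[-1][-1] == 3 * 4 + 2   # True: answer metric passes
invalid_steps = check_trace(trace)           # [0, 1]: process check fails
```

A model producing this trace would score full marks under final-answer accuracy while every intermediate step is invalid, which is precisely the diagnostic gap the paper targets.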
Simulated Author's Rebuttal
We thank the referee for their constructive and positive feedback on our guide. We address each major comment below and outline revisions that will strengthen the manuscript without altering its prescriptive focus.
Point-by-point responses
-
Referee: [Abstract] Abstract: the claim that single forward passes are 'structurally limited' in realizing variable-depth computation is load-bearing for the recommendation of process-based evaluation, yet the manuscript provides only a definitional derivation rather than a formal complexity argument or reference to why fixed-depth feed-forward computation cannot simulate input-dependent halting without external mechanisms.
Authors: We agree the structural-limitation claim is central. Our argument derives directly from the fixed per-pass depth of transformer architectures and the input-dependent halting requirement in the search-procedure definition. To address the request for additional support, we will add brief references to established results on the computational limitations of fixed-depth feed-forward networks and transformers (e.g., their inability to simulate arbitrary-depth computation without external mechanisms). We will revise the abstract and the relevant section to include these citations and a short clarifying sentence. revision: yes
-
Referee: [process-based evaluation discussion] The section motivating process-based evaluation: the assumption that intermediate reasoning traces can be reliably assessed for faithfulness and validity is presented as enabling the shift away from final-answer metrics, but no discussion is given of verification methods, inter-annotator reliability, or failure modes when traces are unfaithful, which directly affects the practicality of the proposed evaluation target.
Authors: We acknowledge that practical assessment of traces requires explicit treatment of verification challenges. In the revised manuscript we will expand the process-based evaluation section with a short subsection covering verification methods (human step-wise validation, automated consistency checks), inter-annotator reliability considerations, and common failure modes such as unfaithful or post-hoc rationalization traces. This addition will better equip readers to implement the proposed evaluation targets. revision: yes
Circularity Check
No significant circularity identified
Full rationale
The paper advances an explicit evaluation-oriented definition of reasoning as requiring adaptive, input-dependent selection of intermediate steps with halting conditions, formalized as a variable-depth search procedure. From this definition it directly infers that fixed-depth single forward passes cannot realize the required computation and that final-answer accuracy therefore supplies insufficient diagnostic information. This chain is self-contained and definitional; no claim reduces by construction to a fitted parameter, a self-citation, an ansatz smuggled via prior work, or a renamed empirical pattern. The central recommendation for process-based evaluation follows logically once the definition is granted and does not loop back to any unverified input or external result.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Reasoning requires selecting intermediate steps and halting according to input-dependent conditions, formalized as a search-like procedure.