pith. machine review for the scientific record.

arxiv: 2605.11388 · v1 · submitted 2026-05-12 · 💻 cs.CL · cs.AI


Deep Reasoning in General Purpose Agents via Structured Meta-Cognition

Aylin Caliskan, Benjamin Newman, Chirag Shah, Dan Suciu, Dean Light, Kshitish Ghate, Michael Theologitis, Pang Wei Koh, Shuyue Stella Li, Yulia Tsvetkov

Pith reviewed 2026-05-13 02:41 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords meta-reasoning · LLM agents · task decomposition · inference-time scaffolding · adaptive reasoning · DOLORES · multi-hop reasoning · general-purpose agents

The pith

LLM agents can dynamically construct their own task-specific reasoning scaffolds at inference time using structured meta-reasoning in a formal language.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Deep Reasoning, an inference-time method that lets agents adapt the structure of their own reasoning instead of depending on scaffolds whose structure is fixed in advance. It defines a formal language in which meta-reasoning appears as executable decompositions over associative inference, formal computation, and recursive subproblem solving; these decomposition principles are supplied as in-context examples that guide the model to build a custom scaffold for each new task. The approach is realized in the DOLORES agent, which distributes work across multiple lower-load reasoning threads. On four demanding benchmarks covering multi-hop reasoning, long-chain QA, long-context aggregation, and research-style information seeking, DOLORES improves over the strongest baseline scaffold by 24.8% on average across three model sizes and two model families. The gains even bridge the scaling gap: an 8B model surpasses all evaluated 32B baselines from the same family in more than half the settings.

Core claim

Deep Reasoning treats scaffolding itself as adaptive reasoning: a formal language represents meta-reasoning as executable decompositions over associative inference, formal computation, and recursive subproblem solving, so that decomposition principles can be encoded once as in-context examples and then used at test time to construct a task-specific scaffold on the fly. When instantiated in the DOLORES agent, this produces better performance than any fixed scaffold on hard benchmarks while reducing premature termination and hallucinations by spreading cognition across more controlled threads.
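As a concrete (and entirely hypothetical) illustration of what "executable decompositions" over those three modes could look like, the sketch below encodes them as a small tree interpreter. The names `Associate`, `Compute`, and `Recurse`, and the toy plan, are invented for illustration; they are not the paper's actual formal language.

```python
# Hypothetical sketch of the three decomposition primitives the paper names
# (associative inference, formal computation, recursive subproblem solving)
# as an executable tree. All names here are illustrative.
from dataclasses import dataclass
from typing import Callable, Union

Node = Union["Associate", "Compute", "Recurse"]

@dataclass
class Associate:          # associative inference: delegate to a model call
    prompt: str

@dataclass
class Compute:            # formal computation: run an explicit procedure
    fn: Callable[[], object]

@dataclass
class Recurse:            # recursive subproblem solving: solve children, combine
    children: list
    combine: Callable[[list], object]

def run(node: Node, llm: Callable[[str], str]) -> object:
    """Execute a decomposition tree; `llm` stands in for a model call."""
    if isinstance(node, Associate):
        return llm(node.prompt)
    if isinstance(node, Compute):
        return node.fn()
    return node.combine([run(c, llm) for c in node.children])

# Toy scaffold: combine one "associative" step with one formal step.
plan = Recurse(
    children=[Associate("first number in: 3 cats"), Compute(lambda: 4)],
    combine=lambda xs: int(xs[0]) + int(xs[1]),
)
print(run(plan, llm=lambda p: "3"))  # stub model always answers "3"; prints 7
```

The point of the sketch is only that a scaffold can be a data structure the model emits at test time, rather than code an engineer writes in advance.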

What carries the argument

A formal language for structured meta-reasoning that encodes decompositions over associative inference, formal computation, and recursive subproblem solving as executable in-context examples for test-time scaffold construction.

If this is right

  • DOLORES outperforms every evaluated fixed scaffold, improving over the strongest baseline by 24.8% on average across model sizes and families.
  • An 8B-parameter DOLORES agent surpasses all evaluated 32B baselines from the same family in more than half of the tested settings.
  • Distributing work across structured, lower-load reasoning threads reduces premature termination and hallucinations.
  • Scaffolding can be treated as just-in-time adaptive reasoning rather than pre-engineered structure.
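The thread-distribution bullet is mechanistic rather than evaluative. A minimal sketch of that idea, assumed rather than taken from DOLORES, gives each subproblem its own short-context worker so no single trace has to carry the whole task:

```python
# Minimal sketch (an assumption, not the paper's implementation) of
# distributing subtasks across lower-load reasoning threads: each worker's
# context holds only its own subproblem, not one long monolithic trace.
from concurrent.futures import ThreadPoolExecutor

def solve_subtask(subtask: str) -> str:
    # Stand-in for a model call scoped to just this subtask.
    return subtask.upper()

def scaffold(subtasks: list[str]) -> list[str]:
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(solve_subtask, subtasks))

print(scaffold(["find date", "find place"]))  # ['FIND DATE', 'FIND PLACE']
```

The claimed reliability gain would then come from each thread facing a smaller context and a narrower goal, which is where the reduction in premature termination and hallucination is attributed.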

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same formal language could be used to inspect or debug an agent's reasoning plan before execution.
  • If the language is sufficiently general, it might reduce the amount of task-specific prompt engineering needed when deploying agents to new domains.
  • Extending the decomposition primitives to include uncertainty estimation or tool-use planning would be a direct next step within the same framework.

Load-bearing premise

That examples of meta-reasoning decompositions written in the formal language will let the model reliably invent effective task-specific scaffolds for diverse and previously unseen problems without any additional training or hand-crafted prompts.
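A minimal sketch of that premise, with invented example text: one fixed block of decomposition examples is prepended verbatim to every task, and only the task line varies.

```python
# Hedged sketch of the load-bearing setup: the same fixed set of
# decomposition examples is reused for every benchmark and task type.
# The example text and plan syntax below are invented, not the paper's.
FIXED_EXAMPLES = """\
TASK: total population of two cities
PLAN: recurse(associate("population of city A"),
              associate("population of city B"),
              combine=sum)
"""

def build_prompt(task: str) -> str:
    # Only the final TASK line changes; FIXED_EXAMPLES never does.
    return f"{FIXED_EXAMPLES}\nTASK: {task}\nPLAN:"

p = build_prompt("oldest of three listed buildings")
assert FIXED_EXAMPLES in p and p.endswith("PLAN:")
```

If the examples were instead rewritten per benchmark, the premise would be doing far less work and the gains would partly reflect prompt engineering.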

What would settle it

A controlled test on a suite of novel tasks whose required reasoning structure differs markedly from the in-context decomposition examples, where DOLORES either matches or underperforms the strongest fixed-scaffold baseline.
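Operationally, that settling experiment reduces to a simple comparison harness. The sketch below is one hypothetical shape for it, with trivial arithmetic tasks standing in for structurally novel benchmarks:

```python
# One possible shape (an assumption, not the paper's protocol) for the
# settling experiment: score an adaptive scaffold against the strongest
# fixed scaffold on tasks unlike the in-context decomposition examples.
def accuracy(solver, tasks):
    return sum(solver(q) == a for q, a in tasks) / len(tasks)

def compare(adaptive, fixed, novel_tasks) -> str:
    gap = accuracy(adaptive, novel_tasks) - accuracy(fixed, novel_tasks)
    return "claim holds" if gap > 0 else "claim fails on novel structure"

# Toy stand-ins: an "adaptive" solver that actually computes, and a
# "fixed" solver locked into one canned answer.
tasks = [("2+2", "4"), ("3+3", "6")]
print(compare(lambda q: str(eval(q)), lambda q: "4", tasks))  # claim holds
```

The decisive design choice is task selection: the novel tasks must require reasoning structures absent from the fixed example set, otherwise the test cannot distinguish genuine meta-reasoning from interpolation.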

Figures

Figures reproduced from arXiv: 2605.11388 by Aylin Caliskan, Benjamin Newman, Chirag Shah, Dan Suciu, Dean Light, Kshitish Ghate, Michael Theologitis, Pang Wei Koh, Shuyue Stella Li, Yulia Tsvetkov.

Figure 1. Deep Reasoning leverages human meta-reasoning traces to build “just-in-time” scaffolds. [PITH_FULL_IMAGE:figures/full_fig_p002_1.png]
Figure 2. Informal and formal reasoning describe how reasoning is carried out, while meta and object levels describe what the reasoning is about. Associative vs. Formal (how): when solving a task, some steps rely on intuition and associations, while others follow explicit rules. Associative reasoning generally operates through intuitive proximity shaped by memory and context [Mednick, 1962, Sloman, 1996]. …
Figure 3. Low-level overview of Deep Reasoning on the running example. The original task is decom… [PITH_FULL_IMAGE:figures/full_fig_p023_3.png]
Original abstract

Humans intuitively solve complex problems by flexibly shifting among reasoning modes: they plan, execute, revise intermediate goals, resolve ambiguity through associative judgment, and apply formal procedures to well-specified subproblems. Current LLM agents lack this flexibility, as their scaffolds hard-code such reasoning decisions in advance. These scaffolds are effective when their prescribed structure matches the task, but brittle when solving the task requires adapting the structure of reasoning itself. We introduce Deep Reasoning -- an inference-time approach for constructing task-specific scaffolds through structured meta-reasoning. Deep Reasoning uses a formal language that represents meta-reasoning as executable decompositions over associative inference, formal computation, and recursive subproblem solving, enabling decomposition principles to be encoded as in-context examples that guide test-time scaffold construction. We instantiate this approach in a general-purpose agent (DOLORES) that distributes complex tasks across more controlled reasoning threads. We evaluate it against state-of-the-art scaffolding methods across four hard benchmarks: multi-hop reasoning, long-chain question answering, long-context aggregation, and deep research-style information seeking. DOLORES outperforms all evaluated scaffolds across three model sizes and two model families, improving over the strongest evaluated scaffold baseline by 24.8% on average. DOLORES distributes cognition across structured, lower-load reasoning threads, thereby reducing premature termination and hallucinations. This advantage can even bridge the scaling gap, with an 8B version surpassing all evaluated 32B baselines from the same family in more than half the settings. These results point toward future agentic systems that treat scaffolding as adaptive reasoning, constructing the structure each task requires just-in-time.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper claims to introduce 'Deep Reasoning' as an inference-time method for general-purpose LLM agents to construct task-specific scaffolds via structured meta-reasoning in a formal language representing decompositions over associative inference, formal computation, and recursive solving. The DOLORES agent implements this and is evaluated on four benchmarks (multi-hop reasoning, long-chain question answering, long-context aggregation, and deep research-style information seeking), outperforming state-of-the-art scaffolding methods by 24.8% on average across three model sizes and two families, with an 8B model surpassing 32B baselines in over half the settings.

Significance. Should the empirical results prove robust with truly fixed in-context examples, this would be significant for LLM agent research by demonstrating that structured meta-cognition can enable adaptive, just-in-time scaffolding without per-task engineering, addressing the brittleness of current hard-coded approaches. The broad evaluation across model sizes/families and the observation of scaling-gap bridging provide useful evidence of potential impact, while the mechanism of distributing cognition to reduce hallucinations offers a practical direction for more reliable agents.

major comments (1)
  1. Abstract: The description states that the formal language enables 'decomposition principles to be encoded as in-context examples that guide test-time scaffold construction' without clarifying whether the same fixed set of examples is used across all four benchmarks or adapted per task type. This is load-bearing for the central claim of reliable meta-reasoning on novel tasks; if examples differ by benchmark, the 24.8% average gain and scaling-gap bridging may reflect benchmark-specific prompt engineering rather than the proposed general approach.
minor comments (2)
  1. Abstract: The new term 'Deep Reasoning' is introduced alongside the title's 'Structured Meta-Cognition' without an explicit statement of their relationship; a one-sentence clarification in the introduction would improve readability.
  2. Abstract: The acronym DOLORES is used without expansion; define it on first use for clarity.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for highlighting the need for greater precision in the abstract regarding the in-context examples. This is a substantive point about the generality of the approach, and we address it directly below while committing to a revision.

Point-by-point responses
  1. Referee: Abstract: The description states that the formal language enables 'decomposition principles to be encoded as in-context examples that guide test-time scaffold construction' without clarifying whether the same fixed set of examples is used across all four benchmarks or adapted per task type. This is load-bearing for the central claim of reliable meta-reasoning on novel tasks; if examples differ by benchmark, the 24.8% average gain and scaling-gap bridging may reflect benchmark-specific prompt engineering rather than the proposed general approach.

    Authors: The in-context examples consist of a single fixed set that encodes general decomposition principles over associative inference, formal computation, and recursive solving. These examples are not adapted or rewritten per benchmark or task type; the same examples are used for all four evaluation settings (multi-hop reasoning, long-chain QA, long-context aggregation, and deep research-style seeking) to test the claim of general-purpose meta-reasoning. Section 3.2 and the supplementary prompt appendix describe the construction of this fixed prompt template. We agree that the abstract's phrasing leaves this ambiguous and will revise it to state explicitly that a fixed set of examples is employed across benchmarks. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical evaluation of meta-reasoning scaffold

Full rationale

The paper introduces DOLORES as an inference-time method using a formal language for meta-reasoning encoded in fixed in-context examples, then reports direct empirical performance gains (24.8% average) over scaffolding baselines on four standard benchmarks across model sizes. No equations, fitted parameters, or first-principles derivations are presented whose outputs reduce by construction to the inputs; the central claims rest on benchmark comparisons rather than any self-referential prediction or self-citation load-bearing step. The evaluation is therefore self-contained against external benchmarks, consistent with the assigned circularity score of 1.0.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on the premise that LLMs can perform reliable meta-reasoning from in-context examples of decomposition principles and that the introduced formal language adequately captures the flexibility needed for arbitrary tasks.

axioms (1)
  • domain assumption Large language models can execute structured meta-reasoning and construct effective task-specific scaffolds when provided with in-context examples of decomposition principles.
    This is invoked to justify the inference-time construction of scaffolds without task-specific fine-tuning.
invented entities (2)
  • Deep Reasoning no independent evidence
    purpose: Inference-time construction of task-specific reasoning scaffolds via structured meta-reasoning
    New approach introduced to address brittleness of fixed scaffolds.
  • DOLORES no independent evidence
    purpose: General-purpose agent that distributes tasks across structured reasoning threads using Deep Reasoning
    Specific implementation of the proposed method.

pith-pipeline@v0.9.0 · 5623 in / 1547 out tokens · 65109 ms · 2026-05-13T02:41:23.616828+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

142 extracted references · 142 canonical work pages · 9 internal anchors

  1. [1]

    Meta-reasoning: Monitoring and control of thinking and reasoning

    Rakefet Ackerman and Valerie A Thompson. Meta-reasoning: Monitoring and control of thinking and reasoning. Trends in cognitive sciences, 21(8): 607--617, 2017

  2. [3]

    Dual process theory: Embodied and predictive; symbolic and classical

    Samuel C Bellini-Leite. Dual process theory: Embodied and predictive; symbolic and classical. Frontiers in Psychology, 13: 805386, 2022

  3. [5]

    Perception and Communication

    Donald E. Broadbent. Perception and Communication. Pergamon Press, London, 1958

  4. [6]

    Bias, prevalence and kappa

    Ted Byrt, Janet Bishop, and John B Carlin. Bias, prevalence and kappa. Journal of clinical epidemiology, 46(5): 423--429, 1993

  5. [7]

    M3-embedding: Multi-linguality, multi-functionality, multi-granularity text embeddings through self-knowledge distillation

    Jianlyu Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. M3-embedding: Multi-linguality, multi-functionality, multi-granularity text embeddings through self-knowledge distillation. In Findings of the association for computational linguistics: ACL 2024, pages 2318--2335, 2024 a

  6. [8]

    Unlocking the capabilities of thought: A reasoning boundary framework to quantify and optimize chain-of-thought

    Qiguang Chen, Libo Qin, Jiaqi Wang, Jingxuan Zhou, and Wanxiang Che. Unlocking the capabilities of thought: A reasoning boundary framework to quantify and optimize chain-of-thought. In Amir Globersons, Lester Mackey, Danielle Belgrave, Angela Fan, Ulrich Paquet, Jakub M. Tomczak, and Cheng Zhang, editors, Advances in Neural Information Processing Systems ...

  7. [9]

    Do not think that much for 2+ 3=? on the overthinking of long reasoning models

    Xingyu Chen, Jiahao Xu, Tian Liang, Zhiwei He, Jianhui Pang, Dian Yu, Linfeng Song, Qiuzhi Liu, Mengfei Zhou, Zhuosheng Zhang, et al. Do not think that much for 2+ 3=? on the overthinking of long reasoning models. In Forty-second International Conference on Machine Learning, 2025

  8. [10]

    The llama 3 herd of models

    Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv e-prints, pages arXiv--2407, 2024

  9. [11]

    Metacognition and cognitive monitoring: A new area of cognitive--developmental inquiry

    John H Flavell. Metacognition and cognitive monitoring: A new area of cognitive--developmental inquiry. American psychologist, 34(10): 906, 1979

  10. [12]

    Agentrefine: Enhancing agent generalization through refinement tuning

    Dayuan Fu, Keqing He, Yejie Wang, Wentao Hong, Zhuoma GongQue, Weihao Zeng, Wei Wang, Jingang Wang, Xunliang Cai, and Weiran Xu. Agentrefine: Enhancing agent generalization through refinement tuning. In The Thirteenth International Conference on Learning Representations, 2025 a . URL https://openreview.net/forum?id=FDimWzmcWn

  11. [14]

    Phantomwiki: On-demand datasets for reasoning and retrieval evaluation

    Albert Gong, Kamilė Stankevičiūtė, Chao Wan, Anmol Kabra, Raphael Thesmar, Johann Lee, Julius Klenke, Carla P Gomes, and Kilian Q Weinberger. Phantomwiki: On-demand datasets for reasoning and retrieval evaluation. In International Conference on Machine Learning, pages 19964--19995. PMLR, 2025

  12. [16]

    Synthworlds: Controlled parallel worlds for disentangling reasoning and knowledge in language models

    Ken Gu, Advait Bhat, Mike A Merrill, Robert West, Xin Liu, Daniel McDuff, and Tim Althoff. Synthworlds: Controlled parallel worlds for disentangling reasoning and knowledge in language models. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=46AQ4qaWqQ

  13. [21]

    Deduction

    Philip Nicholas Johnson-Laird and Ruth MJ Byrne. Deduction. Lawrence Erlbaum Associates, Inc, 1991

  14. [23]

    Efficient memory management for large language model serving with pagedattention

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023

  15. [24]

    KRAMABENCH: A benchmark for AI systems on data-to-insight pipelines over data lakes

    Eugenie Lai, Gerardo Vitagliano, Ziyu Zhang, Om Chabra, SIVAPRASAD SUDHIR, Anna Zeng, Anton A. Zabreyko, Chenning Li, Ferdi Kossmann, Jialin Ding, Jun Chen, Markos Markakis, Matthew Russo, Weiyang Wang, Ziniu Wu, Mike Cafarella, Lei Cao, Samuel Madden, and Tim Kraska. KRAMABENCH : A benchmark for AI systems on data-to-insight pipelines over data lakes. In...

  16. [28]

    Faithful chain-of-thought reasoning

    Qing Lyu, Shreya Havaldar, Adam Stein, Li Zhang, Delip Rao, Eric Wong, Marianna Apidianaki, and Chris Callison-Burch. Faithful chain-of-thought reasoning. In Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Lo...

  17. [31]

    Chartqapro: A more diverse and challenging benchmark for chart question answering

    Ahmed Masry, Mohammed Saidul Islam, Mahir Ahmed, Aayush Bajaj, Firoz Kabir, Aaryaman Kartha, Md. Tahmid Rahman Laskar, Mizanur Rahman, Shadikur Rahman, Mehrad Shahmohammadi, Megh Thakkar, Md. Rizwan Parvez, Enamul Hoque, and Shafiq Joty. Chartqapro: A more diverse and challenging benchmark for chart question answering. In Wanxiang Che, Joyce Nabende, Ekat...

  18. [32]

    The associative basis of the creative process

    Sarnoff Mednick. The associative basis of the creative process. Psychological review, 69(3): 220, 1962

  19. [33]

    Report on a general problem solving program

    Allen Newell, John C Shaw, and Herbert A Simon. Report on a general problem solving program. In IFIP congress, volume 256, page 1959. Pittsburgh, PA, 1959

  20. [34]

    NVIDIA AI-Q Blueprint for Intelligent Agents , 2026

    NVIDIA . NVIDIA AI-Q Blueprint for Intelligent Agents , 2026. URL https://build.nvidia.com/nvidia/aiq. Accessed: 2026-04-14

  21. [35]

    Introducing deep research, February 2025

    OpenAI . Introducing deep research, February 2025. URL https://openai.com/index/introducing-deep-research/. Accessed: 2026-05-06

  22. [36]

    Cognitive processes in propositional reasoning

    Lance J Rips. Cognitive processes in propositional reasoning. Psychol. Rev., 90(1): 38--71, January 1983

  23. [37]

    Principles of categorization

    Eleanor Rosch. Principles of categorization. In Eleanor Rosch and Barbara Bloom Lloyd, editors, Cognition and Categorization, pages 27--48. Lawrence Erlbaum Associates, 1978

  24. [38]

    Agentbreeder: Mitigating the AI safety risks of multi-agent scaffolds via self-improvement

    J Rosser and Jakob Nicolaus Foerster. Agentbreeder: Mitigating the AI safety risks of multi-agent scaffolds via self-improvement. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URL https://openreview.net/forum?id=mlU9KqdZUS

  25. [39]

    `smolagents`: a smol library to build great agentic systems

    Aymeric Roucher, Albert Villanova del Moral, Thomas Wolf, Leandro von Werra, and Erik Kaunismäki. `smolagents`: a smol library to build great agentic systems. https://github.com/huggingface/smolagents, 2025 a

  26. [40]

    Open-source deepresearch -- freeing our search agents, February 2025 b

    Aymeric Roucher, Albert Villanova del Moral, Merve, Thomas Wolf, and Clémentine Fourrier. Open-source deepresearch -- freeing our search agents, February 2025 b. URL https://huggingface.co/blog/open-deep-research. Accessed: 2026-04-13

  27. [42]

    The empirical case for two systems of reasoning

    Steven A Sloman. The empirical case for two systems of reasoning. Psychological bulletin, 119(1): 3, 1996

  28. [43]

    Rationality and the reflective mind

    Keith Stanovich. Rationality and the reflective mind. Oxford University Press, 2011

  29. [46]

    Monitoring and storage of irrelevant messages in selective attention

    Anne Treisman. Monitoring and storage of irrelevant messages in selective attention. Journal of Verbal Learning and Verbal Behavior, 3(6): 449--459, 1964

  30. [47]

    Executable code actions elicit better llm agents

    Xingyao Wang, Yangyi Chen, Lifan Yuan, Yizhe Zhang, Yunzhu Li, Hao Peng, and Heng Ji. Executable code actions elicit better llm agents. In Forty-first International Conference on Machine Learning, 2024

  31. [48]

    Reasoning about a rule

    Peter C Wason. Reasoning about a rule. Quarterly journal of experimental psychology, 20(3): 273--281, 1968

  32. [51]

    React: Synergizing reasoning and acting in language models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. In The eleventh international conference on learning representations, 2022

  33. [55]

    Judging llm-as-a-judge with mt-bench and chatbot arena

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in neural information processing systems, 36: 46595--46623, 2023

  34. [56]

    arXiv:2511.13646

    Chunqiu Steven Xia, Zhe Wang, Yan Yang, Yuxiang Wei, and Lingming Zhang. CoRR, 2025. doi:10.48550/ARXIV.2511.13646

  35. [57]

    arXiv:2602.13671

    Guangyi Liu, Haojun Lin, Huan Zeng, Heng Wang, and Quanming Yao. CoRR, 2026. doi:10.48550/ARXIV.2602.13671

  36. [58]

    Shengran Hu, Cong Lu, and Jeff Clune (ICLR 2025)

    Shengran Hu, Cong Lu, and Jeff Clune. In The Thirteenth International Conference on Learning Representations, 2025

  37. [59]

    ReCreate: Reasoning and Creating Domain Agents Driven by Experience

    Zhezheng Hao, Hong Wang, Jian Luo, Jianqing Zhang, Yuyan Zhou, Qiang Lin, Can Wang, Hande Dong, and Jiawei Chen. ReCreate: Reasoning and Creating Domain Agents Driven by Experience. CoRR, 2026. doi:10.48550/ARXIV.2601.11100

  38. [60]

    Eugenie Lai et al. (2026)

    Eugenie Lai, Gerardo Vitagliano, Ziyu Zhang, Om Chabra, SIVAPRASAD SUDHIR, Anna Zeng, Anton A. Zabreyko, Chenning Li, Ferdi Kossmann, Jialin Ding, Jun Chen, Markos Markakis, Matthew Russo, Weiyang Wang, Ziniu Wu, Mike Cafarella, Lei Cao, Samuel Madden, and Tim Kraska. 2026

  39. [61]

    arXiv:2502.01635

    Stephen Casper, Luke Bailey, Rosco Hunter, Carson Ezell, and Emma Cabal. CoRR, 2025. doi:10.48550/ARXIV.2502.01635

  40. [62]

    The OpenHands Software Agent SDK: A Composable and Extensible Foundation for Production Agents

    Xingyao Wang, Simon Rosenberg, Juan Michelini, Calvin Smith, Hoang H. Tran, Engel Nyst, Rohit Malhotra, Xuhui Zhou, Valerie Chen, Robert Brennan, and Graham Neubig. The OpenHands Software Agent SDK: A Composable and Extensible Foundation for Production Agents. CoRR, 2025. doi:10.48550/ARXIV.2511.03690

  41. [63]

    Frontier LLMs Still Struggle with Simple Reasoning Tasks

    Alan Malek, Jiawei Ge, Nevena Lazic, Chi Jin, and Andr… Frontier LLMs Still Struggle with Simple Reasoning Tasks. 2025. doi:10.48550/ARXIV.2507.07313

  42. [64]

    Cognitive Architecture and Instructional Design: 20 Years Later

    John Sweller, Jeroen J. G. van Merriënboer, and Fred Paas. Cognitive Architecture and Instructional Design: 20 Years Later. Educational Psychology Review, 2019. doi:10.1007/s10648-019-09465-5

  43. [65]

    Doing more with less: meta-reasoning and meta-learning in humans and machines

    Thomas L Griffiths, Frederick Callaway, Michael B Chang, Erin Grant, Paul M Krueger, and Falk Lieder. Doing more with less: meta-reasoning and meta-learning in humans and machines. 2019. doi:10.1016/j.cobeha.2019.01.005

  44. [66]

    Unlocking the capabilities of thought: A reasoning boundary framework to quantify and optimize chain-of-thought

    Qiguang Chen, Libo Qin, Jiaqi Wang, Jingxuan Zhou, and Wanxiang Che. Unlocking the capabilities of thought: A reasoning boundary framework to quantify and optimize chain-of-thought. In Advances in Neural Information Processing Systems 38 (NeurIPS 2024), Vancouver, BC, Canada, December 10--15, 2024

  45. [67]

    Exclusion of Thought: Mitigating Cognitive Load in Large Language Models for Enhanced Reasoning in Multiple-Choice Tasks

    Qihang Fu, Yongbin Qin, Ruizhang Huang, Yanping Chen, Yulin Zhou, and Lintao Long. Exclusion of Thought: Mitigating Cognitive Load in Large Language Models for Enhanced Reasoning in Multiple-Choice Tasks. 2025. doi:10.18653/V1/2025.ACL-LONG.1051

  46. [68]

    Working Memory Identifies Reasoning Limits in Language Models

    Chunhui Zhang, Yiren Jian, Zhongyu Ouyang, and Soroush Vosoughi. Working Memory Identifies Reasoning Limits in Language Models. 2024. doi:10.18653/V1/2024.EMNLP-MAIN.938

  47. [69]

    AgentRefine: Enhancing Agent Generalization through Refinement Tuning

    AgentRefine: Enhancing Agent Generalization through Refinement Tuning. In The Thirteenth International Conference on Learning Representations, 2025

  48. [70]

    Agent-R1: Training Powerful LLM Agents with End-to-End Reinforcement Learning

    Agent-R1: Training Powerful LLM Agents with End-to-End Reinforcement Learning. 2025

  49. [71]

    Meta-Harness: End-to-End Optimization of Model Harnesses

    Yoonho Lee, Roshen Nair, Qizheng Zhang, Kangwook Lee, Omar Khattab, and Chelsea Finn. Meta-Harness: End-to-End Optimization of Model Harnesses. CoRR, 2026. doi:10.48550/ARXIV.2603.28052

  50. [72]

    SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering

    John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering. 2024

  51. [73]

    Frontier AI Trends Report

  52. [74]

    Anthropic Raises $30 Billion in Series G Funding at $380 Billion Post-Money Valuation

  53. [75]

    arXiv:2309.13638

    R. Thomas McCoy, Shunyu Yao, Dan Friedman, Matthew Hardy, and Thomas L. Griffiths. CoRR, 2023. doi:10.48550/ARXIV.2309.13638

  54. [76]

    AgentBreeder: Mitigating the AI safety risks of multi-agent scaffolds via self-improvement

    J Rosser and Jakob Nicolaus Foerster. AgentBreeder: Mitigating the AI safety risks of multi-agent scaffolds via self-improvement. 2026

  55. [77]

    Thang Luong and Edward Lockhart (2025)

    Thang Luong and Edward Lockhart. 2025

  56. [78]

    Carlos E Jimenez et al. (2024)

    Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R Narasimhan. 2024

  57. [79]

    Measuring …

    Thomas Kwa, Ben West, Joel Becker, Amy Deng, Katharyn Garcia, Max Hasin, Sami Jawhar, Megan Kinniment, Nate Rush, Sydney Von Arx, Ryan Bloom, Thomas Broadley, Haoxing Du, Brian Goodrich, Nikola Jurkovic, Luke Harold Miles, Seraphina Nix, Tao Roa Lin, Neev Parikh, David Rein, Lucas Jun Koba Sato …

  58. [80]

    Course in General Linguistics

    Ferdinand de Saussure. Course in General Linguistics

  59. [81]

    Yoshihiko Futamura (Higher-Order and Symbolic Computation, 1999)

    Yoshihiko Futamura. Higher-Order and Symbolic Computation, 1999. doi:10.1023/A:1010043619517

  60. [82]

    Proceedings of the Symposium on Computers and Automata

    Dana Scott and Christopher Strachey. In Proceedings of the Symposium on Computers and Automata, 1971

  61. [83]

    Formal Philosophy: Selected Papers of Richard Montague

    Richard Montague. Formal Philosophy: Selected Papers of Richard Montague. 1974

  62. [84]

    Attention is All you Need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is All you Need

  63. [85]

    Measuring Faithfulness in Chain-of-Thought Reasoning

    Measuring faithfulness in chain-of-thought reasoning. arXiv preprint arXiv:2307.13702

  64. [86]

    Dissociation of faithful and unfaithful reasoning in LLMs

    Dissociation of faithful and unfaithful reasoning in llms. arXiv preprint arXiv:2405.15092

  65. [87]

    Faithful chain-of-thought reasoning

    Faithful chain-of-thought reasoning. In Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

  66. [88]

    Chain-of-thought reasoning in the wild is not always faithful

    Iván Arcuschin, Jett Janiak, Robert Krzyzanowski, Senthooran Rajamanoharan, Neel Nanda, and Arthur Conmy. Chain-of-thought reasoning in the wild is not always faithful. arXiv preprint arXiv:2503.08679

  67. [89]

    Do NOT think that much for 2+3=? On the overthinking of long reasoning models

    Do NOT think that much for 2+3=? On the overthinking of long reasoning models. In Forty-second International Conference on Machine Learning, 2025

  68. [90]

    Don't Overthink it. Preferring Shorter Thinking Chains for Improved LLM Reasoning

    Don't Overthink it. Preferring Shorter Thinking Chains for Improved LLM Reasoning. arXiv preprint arXiv:2505.17813

  69. [91]

    Between Underthinking and Overthinking: An Empirical Study of Reasoning Length and Correctness in LLMs

    Between Underthinking and Overthinking: An Empirical Study of Reasoning Length and Correctness in LLMs. ArXiv

  70. [92]

    Building machines that learn and think with people

    Building machines that learn and think with people. Nature Human Behaviour, 2024

  71. [93]

    Mental models and human reasoning

    Mental models and human reasoning. Proceedings of the National Academy of Sciences, 2010

  72. [94]

    Metacognition and cognitive monitoring: A new area of cognitive--developmental inquiry

    Metacognition and cognitive monitoring: A new area of cognitive--developmental inquiry. American psychologist, 1979

  73. [95]

    Meta-reasoning: Monitoring and control of thinking and reasoning

    Meta-reasoning: Monitoring and control of thinking and reasoning. Trends in cognitive sciences, 2017

  74. [96]

    Rationality and the reflective mind

    Rationality and the reflective mind. 2011

  75. [97]

    Dual process theory: Embodied and predictive; symbolic and classical

    Dual process theory: Embodied and predictive; symbolic and classical. Frontiers in Psychology, 2022

  76. [98]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv preprint arXiv:2501.12948

  77. [99]

    OpenAI o1 System Card

    OpenAI o1 System Card. arXiv preprint arXiv:2412.16720

  78. [100]

    Does Reinforcement Learning Really Incentivize Reasoning Capacity in …

    Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Yang Yue, Shiji Song, and Gao Huang. Does Reinforcement Learning Really Incentivize Reasoning Capacity in … 2026

  79. [101]

    React: Synergizing reasoning and acting in language models

    React: Synergizing reasoning and acting in language models. In The eleventh international conference on learning representations

  80. [102]

    Executable code actions elicit better llm agents

    Executable code actions elicit better llm agents. In Forty-first International Conference on Machine Learning, 2024

Showing first 80 references.