pith. machine review for the scientific record.

arxiv: 2605.13414 · v1 · submitted 2026-05-13 · 💻 cs.AI

Recognition: unknown

TRIAGE: Evaluating Prospective Metacognitive Control in LLMs under Resource Constraints

Shubhashis Roy Dipta, Zabir Al Nazi

Pith reviewed 2026-05-14 19:24 UTC · model grok-4.3

classification 💻 cs.AI
keywords prospective metacognitive control · LLM evaluation · token budget planning · resource-efficient agents · TRIAGE framework · task selection under constraints · oracle benchmarking · metacognition in language models

The pith

Language models lack the ability to prospectively plan task selection and compute allocation under fixed token budgets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces TRIAGE, a framework that gives a model a pool of problems plus a token budget set to its own average cost, then requires it to output one ordered plan that chooses which problems to attempt, their sequence, and the token allocation for each before any solving begins. Plans are scored by comparing the value they achieve against an oracle that already knows the model's true success rate and cost on every problem, producing a triage efficiency ratio. Evaluation on competition math, graduate science, code generation, and expert knowledge tasks shows that frontier and open-source models, with or without reasoning enabled, produce plans far below the oracle optimum. This gap points to a missing capability needed for language models to act as resource-efficient autonomous agents.
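
To make the protocol concrete, here is a minimal sketch of how a committed plan might be scored under a binding budget, in the spirit of the enforced regime the figure captions describe. All names are illustrative assumptions, not the paper's interfaces.

    from dataclasses import dataclass

    @dataclass
    class PlanItem:
        task_id: int   # which problem in the pool to attempt
        tokens: int    # self-declared token allocation for that problem

    def execute_plan(plan, budget, attempt):
        """Score a committed plan under a binding token budget.

        `attempt(task_id, cap)` runs the model on one problem with its
        output capped at `cap` tokens and returns whether it solved it.
        The plan is fixed before any solving begins: no reordering or
        re-allocation once execution starts.
        """
        value = spent = 0
        for item in plan:                      # committed order
            if spent + item.tokens > budget:
                break                          # cannot afford the next allocation
            solved = attempt(item.task_id, item.tokens)
            spent += item.tokens               # charge the declared allocation
            value += int(solved)
        return value, spent

One design note: charging the declared allocation rather than actual usage treats each allocation as a hard reservation; a refund-unused-tokens variant is equally plausible, and the paper's exact accounting is not quoted above.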

Core claim

TRIAGE measures prospective metacognitive control by requiring models to commit to a single ordered plan that jointly encodes selection, sequencing, and per-problem token allocation under a budget calibrated to the model's baseline cost; the resulting triage efficiency ratio quantifies how closely the plan matches the value an oracle with full knowledge of solvability and cost would achieve.

What carries the argument

The TRIAGE efficiency ratio, computed by scoring a model's committed plan against an oracle that knows each problem's solvability and exact cost for that model.
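
Figure 2's caption states the regret formula exactly; Figure 1's caption anchors the efficiency ratio at η = 1 for the oracle and η = 0 for a random plan without quoting its formula, so the normalization shown for η below is a reconstruction consistent with those anchors, not a quoted definition:

    \tilde{R} = \frac{V_{\mathrm{oracle}} - V_{\mathrm{plan}}}{V_{\mathrm{oracle}}},
    \qquad
    \eta = \frac{V_{\mathrm{plan}} - V_{\mathrm{random}}}{V_{\mathrm{oracle}} - V_{\mathrm{random}}}

Here V_plan is the value the committed plan achieves, V_oracle the optimum under full knowledge of solvability and cost, and V_random the expected value of a random plan; η < 0 then means worse than random, matching the Figure 1 legend.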

If this is right

  • Agents built on models with stronger prospective control could complete more problems within the same token budget by avoiding low-yield tasks.
  • The measured capability is distinct from single-task accuracy and directly affects deployment cost in queued problem settings.
  • Both reasoning-enabled and base models show the same deficit, suggesting the gap is not fixed by adding chain-of-thought at inference time.
  • Closing the gap would require training objectives that reward joint planning over isolated problem solving.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Models that improve on TRIAGE could be paired with lighter verification steps, freeing tokens for harder problems.
  • The framework could be extended to dynamic queues where new problems arrive after partial execution.
  • Training data that includes explicit triage examples might narrow the gap faster than accuracy-only fine-tuning.
  • Human performance on analogous triage tasks could serve as an upper bound for future model comparisons.

Load-bearing premise

An oracle that already knows the model's success rate and cost on every problem supplies a fair and unbiased benchmark without introducing hindsight bias or selection effects from the budget calibration.
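
Under one natural reading, an oracle that knows exactly which problems the model solves and at what cost faces a 0/1 knapsack: discard unsolvable problems, then pack solvable ones (unit value, token cost as weight) into the budget. A minimal dynamic-programming sketch under that assumption, with illustrative names:

    def oracle_value(solvable, costs, budget):
        """Maximum problems an omniscient planner solves within budget.

        solvable[i] -- whether the model actually solves problem i
        costs[i]    -- the model's true token cost on problem i

        Unsolvable problems contribute nothing, so the oracle drops
        them; what remains is a standard 0/1 knapsack.
        """
        items = [c for ok, c in zip(solvable, costs) if ok]
        best = [0] * (budget + 1)  # best[b]: max problems solved with budget b
        for c in items:
            for b in range(budget, c - 1, -1):
                best[b] = max(best[b], best[b - c] + 1)
        return best[budget]

If solvability is probabilistic rather than binary, the same recurrence applies with expected success probability in place of the unit value.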

What would settle it

A model that consistently produces plans whose achieved value reaches at least 85 percent of the oracle optimum across held-out task pools of varying difficulty would falsify the reported capability gap.
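
The criterion is mechanical to evaluate once plan and oracle values are in hand. A hedged sketch (the 85 percent threshold is the one named above; everything else is illustrative):

    def meets_bar(plan_values, oracle_values, threshold=0.85):
        """True if every held-out pool reaches the threshold fraction
        of its oracle optimum."""
        return all(v >= threshold * v_star
                   for v, v_star in zip(plan_values, oracle_values))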

Figures

Figures reproduced from arXiv: 2605.13414 by Shubhashis Roy Dipta, Zabir Al Nazi.

Figure 1. Triage efficiency across models and benchmarks at moderate budget (α = 0.5). Solid bars show η_U (unconstrained regime, advisory allocations), hatched bars show η_E (constrained regime, binding allocations), and red dashed lines mark base accuracy. η = 1 is oracle triage; η = 0 is random; η < 0 is worse than random. Bar color distinguishes standard inference from extended reasoning.

Figure 2. Normalized triage regret across models and benchmarks at moderate budget (α = 0.5). Solid bars show R̃_U = (V_oracle − V_U)/V_oracle (unconstrained regime), hatched bars show R̃_E (constrained regime), and red dashed lines mark base accuracy. R̃ = 0 is oracle (no value lost); R̃ = 1 is full regret (no value captured). Lower is better.

Figure 3. Trajectories through (D, W) space as the unsolvable-injection ratio r increases from 0.25 to 1.00 (marker size). The star marks the ideal corner: high detection rate, low waste rate.

Figure 4. Per-(model, mode) accuracy on each benchmark. Accuracy is the proportion of solvable problems …

Figure 5. Triage skill in the advisory-budget regime …

Figure 6. Triage skill in the enforced-budget regime …

Figure 7. Budget-aware re-solve, aggregate per model. Cells colored by intensity within each metric. N is the number of (problem, allocation) pairs re-issued. Compliance is the fraction of problems for which the model's actual output length stays within its self-declared budget a_i. The four right-hand counts (newly correct, lost correct, kept correct, still wrong) sum to N and decompose the change between Acc_baseline …

Figure 8. Budget-aware re-solve, per (model × dataset). HLE sub-domains are aggregated by problem-count weighted mean. Rows within each model block: original baseline accuracy, accuracy with the budget banner, and compliance rate.
Original abstract

Deploying language models as autonomous agents requires more than per-task accuracy: when an agent faces a queue of problems under a finite token budget, it must decide which to attempt, in what order, and how much compute to commit to each, all before any execution feedback is available. This is the prospective form of metacognitive control studied for decades in human cognition, yet whether language models possess it remains untested. We introduce TRIAGE, an evaluation framework in which a model receives a task pool and a token budget calibrated to its own baseline cost, and commits to a single ordered plan that jointly encodes selection, sequencing, and per-problem allocation. Plans are scored against an oracle with full knowledge of the model's solvability and cost on each problem, yielding a triage efficiency ratio on a common scale. We evaluate frontier and open-source models, with and without reasoning enabled, across competition mathematics, graduate-level science, code generation, and expert multidisciplinary knowledge, and find that current language models exhibit substantial gaps in prospective metacognitive control, revealing a previously unmeasured capability dimension with direct implications for resource-efficient agent deployment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces TRIAGE, a framework to evaluate prospective metacognitive control in LLMs. A model is given a task pool and a token budget calibrated to its baseline cost on those tasks. It must produce one ordered plan encoding selection, sequencing, and allocation decisions before any execution. The plan is scored against an oracle possessing full knowledge of the model's solvability and costs per problem, producing a triage efficiency ratio. Evaluations across math, science, code, and knowledge tasks indicate substantial gaps in current models' prospective control abilities.

Significance. If the central findings hold, the work identifies a new, previously unmeasured capability dimension relevant to resource-efficient deployment of LLMs as agents. The framework offers a concrete, scalable method to quantify selection, sequencing, and allocation reasoning under constraints, with direct implications for practical agent systems. The multi-domain evaluation and comparison of frontier and open-source models with and without reasoning are strengths.

major comments (1)
  1. [Abstract] The triage efficiency ratio is defined against an oracle with complete knowledge of solvability and per-problem costs, while the model's plan is formed prospectively without feedback. This creates an information asymmetry that may cause the reported gaps to partly reflect the oracle's hindsight rather than a pure deficit in metacognitive control. The token budget is calibrated to the model's baseline cost on the evaluation task pool, which embeds task-specific statistics unavailable at planning time.
minor comments (2)
  1. The abstract reports high-level findings without specific quantitative results, error bars, or detailed task descriptions, which limits immediate assessment of effect sizes.
  2. Clarify whether the baseline cost measurement uses the same task pool or a held-out set to avoid circularity in calibration (see the sketch after this list).
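
Minor comment 2 can be stated concretely: the two calibration choices differ in whether pool-specific cost statistics leak into the budget. A sketch of both, assuming the budget is α times total baseline cost (α = 0.5 is the moderate-budget setting in the figure captions; the exact calibration rule is otherwise an assumption):

    def same_pool_budget(pool_costs, alpha=0.5):
        # baseline measured on the evaluation pool itself
        return int(alpha * sum(pool_costs))

    def held_out_budget(holdout_costs, pool_size, alpha=0.5):
        # baseline estimated on disjoint problems, then scaled to pool size
        mean_cost = sum(holdout_costs) / len(holdout_costs)
        return int(alpha * mean_cost * pool_size)

The held-out variant removes the circularity at the cost of a noisier budget estimate.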

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and positive assessment of TRIAGE's significance. We address the major comment below.

Point-by-point responses
  1. Referee: [Abstract] The triage efficiency ratio is defined against an oracle with complete knowledge of solvability and per-problem costs, while the model's plan is formed prospectively without feedback. This creates an information asymmetry that may cause the reported gaps to partly reflect the oracle's hindsight rather than a pure deficit in metacognitive control. The token budget is calibrated to the model's baseline cost on the evaluation task pool, which embeds task-specific statistics unavailable at planning time.

    Authors: The information asymmetry is a deliberate feature of the evaluation design. The oracle establishes the theoretical maximum triage efficiency given the model's actual solvability and per-problem costs, providing a normalized measure of how closely the prospective plan approaches optimality. This approach is standard in planning and resource-allocation benchmarks to quantify deviation from the best achievable outcome under uncertainty. The gaps therefore capture limitations in the model's prospective selection, sequencing, and allocation reasoning. Regarding budget calibration, the total token budget is derived from the model's baseline costs on the pool to ensure realism and model-specificity; however, at planning time the model is given only the aggregate budget and task pool, with no per-task cost or solvability information disclosed. We will revise the abstract and methods section to explicitly clarify this design rationale and the oracle's role as an upper-bound reference.

    revision: yes

Circularity Check

0 steps flagged

No significant circularity in TRIAGE evaluation framework

Full rationale

The paper defines the triage efficiency ratio by scoring a model's single pre-execution plan against an independent oracle that holds full knowledge of solvability and per-problem costs on the task pool. Token-budget calibration to the model's measured baseline cost on the same pool serves only as normalization to place results on a common scale; it does not embed the target ratio or any fitted parameter into the reported metric. No equations, self-citations, or uniqueness claims reduce the central result to a definition, a prior fit, or an author-supplied ansatz. The evaluation draws on external benchmarks across mathematics, science, code, and knowledge domains, keeping the derivation self-contained against verifiable external oracles.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The framework rests on the assumption that prospective planning without feedback is the relevant form of metacognition for agents and that the oracle represents an achievable optimum; no free parameters or new entities are explicitly introduced beyond the triage efficiency ratio metric.

free parameters (1)
  • token budget calibration
    Budget is set to the model's own baseline cost, which requires an empirical measurement that may involve choices in how the baseline is computed.
axioms (1)
  • domain assumption: Prospective metacognitive control (planning before any execution feedback) is a necessary capability for resource-efficient autonomous agents.
    Stated in the opening motivation for the framework.

pith-pipeline@v0.9.0 · 5495 in / 1329 out tokens · 51454 ms · 2026-05-14T19:24:34.695836+00:00 · methodology

discussion (0)

