MEDLEY-BENCH: Scale Buys Evaluation but Not Control in AI Metacognition
Pith reviewed 2026-05-10 08:46 UTC · model grok-4.3
The pith
MEDLEY-BENCH reveals an evaluation/control dissociation in AI metacognition: scale improves reflective evaluation but not proportional belief revision, and a knowing/doing gap appears consistently across all 35 models tested.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Results show a robust evaluation/control dissociation: evaluation ability increases with model size within families, whereas control does not. ... Smaller and cheaper models often matched or outperformed larger counterparts, suggesting that metacognitive competence is not simply a function of scale.
Load-bearing premise
That the constructed tasks and tier-based MMS/MAS scoring genuinely isolate metacognitive control and revision behavior rather than measuring surface-level response patterns or prompt sensitivity.
Original abstract
Metacognition, the ability to monitor and regulate one's own reasoning, remains under-evaluated in AI benchmarking. We introduce MEDLEY-BENCH, a benchmark of behavioural metacognition that separates independent reasoning, private self-revision, and socially influenced revision under genuine inter-model disagreement. The benchmark evaluates 35 models from 12 families on 130 ambiguous instances across five domains and reports two complementary scores: the Medley Metacognition Score (MMS), a tier-based aggregate of reflective updating, social robustness, and epistemic articulation, and the Medley Ability Score (MAS), derived from four metacognitive sub-abilities. Results show a robust evaluation/control dissociation: evaluation ability increases with model size within families, whereas control does not. In a follow-up progressive adversarial analysis of 11 models, we observed two behavioural profiles, i.e., models that revise primarily in response to argument quality and models that track consensus statistics. Under within-model relative profiling (ipsative scoring), evaluation was the weakest relative ability in all 35 models, indicating a systematic knowing/doing gap. Smaller and cheaper models often matched or outperformed larger counterparts, suggesting that metacognitive competence is not simply a function of scale. These findings position MEDLEY-BENCH as a tool for measuring belief revision under social pressure and suggest that future training should reward calibrated, proportional updating rather than output quality alone.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces MEDLEY-BENCH, a benchmark for behavioral metacognition that separates independent reasoning, private self-revision, and socially influenced revision under genuine inter-model disagreement. It evaluates 35 models from 12 families on 130 ambiguous instances across five domains, reporting the Medley Metacognition Score (MMS) as a tier-based aggregate of reflective updating, social robustness, and epistemic articulation, plus the Medley Ability Score (MAS) derived from four metacognitive sub-abilities. Central results claim a robust evaluation/control dissociation: evaluation ability increases with model size within families while control does not; smaller/cheaper models often match or outperform larger ones; ipsative (within-model relative) profiling shows evaluation as the weakest ability across all 35 models, indicating a systematic knowing/doing gap; and a follow-up analysis of 11 models identifies two behavioral profiles (argument-quality-driven vs. consensus-tracking revision).
Significance. If the dissociation and ipsative gap hold after proper validation, the work would be significant for AI metacognition research by providing a tool to measure belief revision under social pressure and showing that scale improves monitoring more than regulation. The within-family comparisons and identification of distinct revision profiles offer concrete, falsifiable patterns that could guide training objectives focused on calibrated updating rather than output quality alone. The benchmark's emphasis on ambiguous instances and inter-model disagreement addresses a gap in current evaluations.
major comments (2)
- [Benchmark construction and scoring (Methods/Results sections)] The dissociation claim (evaluation scales with size; control does not) is load-bearing but rests on MMS/MAS tier scoring whose rules for reflective updating, social robustness, and the four MAS sub-abilities are not specified with sufficient detail to confirm isolation from prompt compliance or general instruction-following. Without baseline compliance tasks, matched prompt-complexity controls, or inter-rater reliability statistics for the 130 instances, larger models' higher evaluation scores could arise from superior adherence to revision prompts rather than metacognitive control per se.
- [Ipsative scoring and within-model profiling (Results section)] The ipsative profiling result—that evaluation is the weakest relative ability in all 35 models—is central to the knowing/doing gap claim, yet the paper provides no explicit description of how the four MAS sub-abilities are normalized or ranked within each model, nor any statistical test confirming the gap exceeds what would be expected from random variation in tier assignments.
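The ipsative normalization questioned above can at least be illustrated. A minimal sketch, assuming the within-model profile is computed as each sub-ability's deviation from the model's own mean; the two sub-ability names besides evaluation and control are hypothetical placeholders, and this reading is exactly what the comment asks the paper to document, not the paper's confirmed procedure:

```python
# Hypothetical sketch of ipsative (within-model relative) profiling.
# "robustness" and "articulation" are placeholder names: the abstract names
# evaluation and control, but the full MAS sub-ability list is not given here.

def ipsative_profile(scores):
    """Rank one model's sub-ability scores against its own mean.

    Returns (sub_ability, deviation) pairs sorted weakest-first, so the
    first entry is the model's weakest *relative* ability regardless of
    its absolute performance level.
    """
    mean = sum(scores.values()) / len(scores)
    return sorted(((k, v - mean) for k, v in scores.items()),
                  key=lambda kv: kv[1])

# Illustrative scores for one model on a 0-1 scale (invented numbers):
model_scores = {
    "evaluation": 0.52,    # monitoring / self-assessment
    "control": 0.61,       # proportional belief revision
    "robustness": 0.66,    # resistance to social pressure (placeholder)
    "articulation": 0.70,  # stating uncertainty (placeholder)
}
profile = ipsative_profile(model_scores)
weakest = profile[0][0]  # "evaluation" under these invented numbers
```

Under this reading, a model can score well in absolute terms and still show evaluation as its weakest relative ability, which is exactly the pattern the knowing/doing gap claim rests on.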
minor comments (2)
- [Abstract] The abstract states '130 ambiguous instances' and 'five domains' but does not list the domains or example instances; adding one or two concrete examples would improve clarity without lengthening the paper.
- [Follow-up analysis] The progressive adversarial analysis on 11 models is mentioned but lacks a table or figure summarizing the two behavioral profiles (argument-quality vs. consensus-tracking); a small summary table would make the finding easier to evaluate.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on MEDLEY-BENCH. The comments on scoring transparency and statistical rigor for the ipsative analysis are well-taken, and we will revise the manuscript to improve clarity while preserving the core claims.
Point-by-point responses
Referee: The dissociation claim (evaluation scales with size; control does not) is load-bearing but rests on MMS/MAS tier scoring whose rules for reflective updating, social robustness, and the four MAS sub-abilities are not specified with sufficient detail to confirm isolation from prompt compliance or general instruction-following. Without baseline compliance tasks, matched prompt-complexity controls, or inter-rater reliability statistics for the 130 instances, larger models' higher evaluation scores could arise from superior adherence to revision prompts rather than metacognitive control per se.
Authors: We agree that the Methods section would benefit from expanded detail on tier scoring. In revision we will add explicit criteria, decision rules, and examples for assigning tiers to reflective updating, social robustness, and epistemic articulation, plus the exact mapping to the four MAS sub-abilities. The benchmark deliberately uses genuine inter-model disagreement on pre-validated ambiguous items rather than direct instruction prompts; this design choice reduces (though does not eliminate) simple compliance confounds. We will add a limitations paragraph acknowledging the absence of separate baseline compliance controls and note that future extensions could include them. The 130 instances were selected via objective ambiguity heuristics across domains rather than multi-rater subjective scoring, so traditional inter-rater reliability does not directly apply; we will clarify the selection protocol and report any available validation checks. revision: partial
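As a reading aid for what "explicit criteria and decision rules" might look like, a minimal sketch of a tier-based aggregate in the spirit of MMS follows; the tier labels, the equal weights, and the rescaling are assumptions for illustration, not the paper's rubric:

```python
# Minimal sketch of a tier-based aggregate in the spirit of MMS. The tier
# boundaries, labels, and equal component weights are hypothetical; the
# paper's actual rubric is what the promised revision would specify.

TIERS = {0: "absent", 1: "partial", 2: "adequate", 3: "strong"}

def mms_like_aggregate(tiers, weights):
    """Weighted mean of per-component tiers, rescaled to [0, 1]."""
    max_tier = max(TIERS)
    total = sum(weights[c] * tiers[c] for c in tiers)
    return total / (max_tier * sum(weights.values()))

# One model's component tiers for the three MMS components named in the
# abstract, with equal (assumed) weights:
example = mms_like_aggregate(
    {"reflective_updating": 2, "social_robustness": 3, "epistemic_articulation": 1},
    {"reflective_updating": 1.0, "social_robustness": 1.0, "epistemic_articulation": 1.0},
)
# example == (2 + 3 + 1) / (3 * 3), i.e. 6/9
```

The referee's compliance worry maps directly onto this sketch: unless the per-component decision rules are published, a high tier could reward prompt adherence rather than metacognitive control.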
Referee: The ipsative profiling result—that evaluation is the weakest relative ability in all 35 models—is central to the knowing/doing gap claim, yet the paper provides no explicit description of how the four MAS sub-abilities are normalized or ranked within each model, nor any statistical test confirming the gap exceeds what would be expected from random variation in tier assignments.
Authors: We accept that the ipsative procedure requires fuller documentation. The revised manuscript will include a dedicated paragraph describing the within-model normalization (ranking the four sub-ability scores relative to each model's own MAS distribution) and will report a statistical test (e.g., a one-sample sign test or permutation test on the rank positions) demonstrating that evaluation's consistent lowest rank across all 35 models exceeds chance expectation under random tier assignment. These additions will be placed in the Results section alongside the existing profiling figure. revision: yes
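The chance-expectation arithmetic behind the proposed test can be sketched. Under random tier assignment, each of the four sub-abilities is equally likely to rank lowest in a given model, so one pre-specified ability ranking lowest in all 35 models has probability (1/4)^35, roughly 8.5e-22. A small Monte Carlo check of that baseline (our illustration, not code from the paper):

```python
import random

# Monte Carlo version of the chance-level check for the knowing/doing gap:
# under random within-model rankings of 4 sub-abilities, how often does one
# pre-specified ability (e.g. evaluation) rank lowest in all 35 models?

N_MODELS, N_ABILITIES, N_SIMS = 35, 4, 100_000
random.seed(0)

hits = 0
for _ in range(N_SIMS):
    # One simulated benchmark run: each model's lowest-ranked ability is
    # drawn uniformly; the run "hits" if ability 0 is lowest in every model.
    all_lowest = all(
        random.randrange(N_ABILITIES) == 0
        for _ in range(N_MODELS)
    )
    hits += all_lowest

p_chance = hits / N_SIMS
exact = (1 / N_ABILITIES) ** N_MODELS  # analytic value, about 8.5e-22
```

Both the simulation and the analytic value say that a 35-for-35 lowest rank is essentially impossible by chance, so reporting either the exact probability or a permutation test, as the authors propose, would settle this comment.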
Reference graph
Works this paper leans on
- [1] Jon-Paul Cacioli. Do LLMs know what they know? Measuring metacognitive efficiency with signal detection theory. arXiv preprint arXiv:2603.25112, 2026. URL https://arxiv.org/abs/2603.25112
- [2] Richard Servajean and Philippe Servajean. Measuring the metacognition of AI. arXiv preprint arXiv:2603.29693, 2026. URL https://arxiv.org/abs/2603.29693
- [3] John H. Flavell. Metacognition and cognitive monitoring: A new area of cognitive-developmental inquiry. American Psychologist, 34(10):906–911, 1979. doi: 10.1037/0003-066X.34.10.906
- [4] Thomas O. Nelson and Louis Narens. Metamemory: A theoretical framework and new findings. In Psychology of Learning and Motivation, volume 26, pages 125–173. Academic Press, 1990. doi: 10.1016/S0079-7421(08)60053-5
- [5] Ryan Burnell, Yumeya Yamamori, Orhan Firat, Kate Olszewska, Steph Hughes-Fitt, Oran Kelly, Isaac R. Galatzer-Levy, Meredith Ringel Morris, Allan Dafoe, Alison M. Snyder, Noah D. Goodman, Matthew Botvinick, and Shane Legg. Measuring progress toward AGI: A cognitive framework. Technical report, Google DeepMind, 2026. URL https://storage.googleapis.com/deepmi...
- [6] Saurav Kadavath et al. Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221, 2022. doi: 10.48550/arXiv.2207.05221. URL https://arxiv.org/abs/2207.05221
- [7] Sewon Min, Julian Michael, Hannaneh Hajishirzi, and Luke Zettlemoyer. AmbigQA: Answering ambiguous open-domain questions. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 5783–5797, Online, 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.466. URL https://aclanthology.o...
- [8] Ivan Stelmakh, Yi Luan, Bhuwan Dhingra, and Ming-Wei Chang. ASQA: Factoid questions meet long-form answers. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 8273–8288, Abu Dhabi, United Arab Emirates, 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.emnlp-main.566. URL https://aclantholo...
- [9] Yuntao Bai et al. Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073, 2022. doi: 10.48550/arXiv.2212.08073. URL https://arxiv.org/abs/2212.08073
- [10] Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D. Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. In Advances in Neural Information Processing Systems 36 (NeurIPS 2023), 2023. URL https://openreview.net/forum?id=HPuSIXJaa9
- [11] Aaron Fanous, Jacob Goldberg, Ank A. Agarwal, Joanna Lin, Anson Zhou, Roxana Daneshjou, and Sanmi Koyejo. SycEval: Evaluating LLM sycophancy. Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, 8(1):893–900, 2025. doi: 10.1609/aies.v8i1.36598
- [12] Mrinank Sharma, Katie Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R. Bowman, Nandi Cheng, Esin Durmus, Zac Hatfield-Dodds, Scott Johnston, Shauna Kravec, Timothy Maxwell, Sam McCandlish, Kamal Ndousse, Sam Ringer, Rose E. Yan, Ethan Zhang, Ethan Perez, and Nicholas Schiefer. Towards understanding sycophancy in language models. In The Twelfth..., 2024
- [13] Morton Deutsch and Harold B. Gerard. A study of normative and informational social influences upon individual judgment. The Journal of Abnormal and Social Psychology, 51(3):629–636, 1955. doi: 10.1037/h0046408
- [14] Farhad Abtahi, Mehdi Astaraki, and Fernando Seoane. Leveraging imperfection with MEDLEY: a multi-model approach harnessing bias in medical AI. Frontiers in Artificial Intelligence, 9:1701665, 2026. doi: 10.3389/frai.2026.1701665. URL https://www.frontiersin.org/journals/artificial-intelligence/articles/10.3389/frai.2026.1701665/full
- [15] Stephen M. Fleming and Hakwan C. Lau. How to measure metacognition. Frontiers in Human Neuroscience, 8:443, 2014. doi: 10.3389/fnhum.2014.00443. URL https://www.frontiersin.org/journals/human-neuroscience/articles/10.3389/fnhum.2014.00443/full
- [16] Daniel Kahneman. Thinking, Fast and Slow. Farrar, Straus and Giroux, New York, 2011
- [17] Emmanuel Dupoux, Yann LeCun, and Jitendra Malik. Why AI systems don’t learn and what to do about it: Lessons on autonomous learning from cognitive science. arXiv preprint arXiv:2603.15381, 2026. URL https://arxiv.org/abs/2603.15381
- [18] Tamera Lanham et al. Measuring faithfulness in chain-of-thought reasoning. arXiv preprint arXiv:2307.13702, 2023. doi: 10.48550/arXiv.2307.13702. URL https://arxiv.org/abs/2307.13702
- [19] Miles Turpin et al. Language models don’t always say what they think: Unfaithful explanations in chain-of-thought prompting. In Advances in Neural Information Processing Systems 36 (NeurIPS 2023), 2023. URL https://proceedings.neurips.cc/paper_files/paper/2023/hash/f2543511e5f4d4764857f9ad833a977d-Abstract-Conference.html
- [20] European Union. Regulation (EU) 2024/1689 of the European Parliament and of the Council of 13 June 2024 laying down harmonised rules on artificial intelligence (Artificial Intelligence Act). Official Journal of the European Union, 2024. URL https://eur-lex.europa.eu/eli/reg/2024/1689/oj
- [21] Liv G. d’Aliberti and Manoel Horta Ribeiro. The illusion of insight in reasoning models. arXiv preprint arXiv:2601.00514, 2026. doi: 10.48550/arXiv.2601.00514. URL https://arxiv.org/abs/2601.00514