pith. machine review for the scientific record.

arxiv: 2605.13414 · v1 · submitted 2026-05-13 · 💻 cs.AI

Recognition: unknown

TRIAGE: Evaluating Prospective Metacognitive Control in LLMs under Resource Constraints

Shubhashis Roy Dipta, Zabir Al Nazi

Pith reviewed 2026-05-14 19:24 UTC · model grok-4.3

classification 💻 cs.AI
keywords prospective metacognitive control · LLM evaluation · token budget planning · resource-efficient agents · TRIAGE framework · task selection under constraints · oracle benchmarking · metacognition in language models

The pith

Language models lack the ability to prospectively plan task selection and compute allocation under fixed token budgets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces TRIAGE, a framework that gives a model a pool of problems plus a token budget set to its own average cost, then requires it to output one ordered plan that chooses which problems to attempt, their sequence, and the token allocation for each before any solving begins. Plans are scored by comparing the value they achieve against an oracle that already knows the model's true success rate and cost on every problem, producing a triage efficiency ratio. Evaluation on competition math, graduate science, code generation, and expert knowledge tasks shows that frontier and open-source models, with or without reasoning enabled, produce plans far below the oracle optimum. This gap points to a missing capability needed for language models to act as resource-efficient autonomous agents.
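
To make the protocol concrete, here is a minimal sketch of how a committed plan might be scored under a binding budget, in the spirit of the enforced regime the figure captions describe. All names are illustrative assumptions, not the paper's interfaces.

    from dataclasses import dataclass

    @dataclass
    class PlanItem:
        task_id: int   # which problem in the pool to attempt
        tokens: int    # self-declared token allocation for that problem

    def execute_plan(plan, budget, attempt):
        """Score a committed plan under a binding token budget.

        `attempt(task_id, cap)` runs the model on one problem with its
        output capped at `cap` tokens and returns whether it solved it.
        The plan is fixed before any solving begins: no reordering or
        re-allocation once execution starts.
        """
        value = spent = 0
        for item in plan:                      # committed order
            if spent + item.tokens > budget:
                break                          # cannot afford the next allocation
            solved = attempt(item.task_id, item.tokens)
            spent += item.tokens               # charge the declared allocation
            value += int(solved)
        return value, spent

One design note: charging the declared allocation rather than actual usage treats each allocation as a hard reservation; a refund-unused-tokens variant is equally plausible, and the paper's exact accounting is not quoted above.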

Core claim

TRIAGE measures prospective metacognitive control by requiring models to commit to a single ordered plan that jointly encodes selection, sequencing, and per-problem token allocation under a budget calibrated to the model's baseline cost; the resulting triage efficiency ratio quantifies how closely the plan matches the value an oracle with full knowledge of solvability and cost would achieve.

What carries the argument

The TRIAGE efficiency ratio, computed by scoring a model's committed plan against an oracle that knows each problem's solvability and exact cost for that model.
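
Figure 2's caption states the regret formula exactly; Figure 1's caption anchors the efficiency ratio at η = 1 for the oracle and η = 0 for a random plan without quoting its formula, so the normalization shown for η below is a reconstruction consistent with those anchors, not a quoted definition:

    \tilde{R} = \frac{V_{\mathrm{oracle}} - V_{\mathrm{plan}}}{V_{\mathrm{oracle}}},
    \qquad
    \eta = \frac{V_{\mathrm{plan}} - V_{\mathrm{random}}}{V_{\mathrm{oracle}} - V_{\mathrm{random}}}

Here V_plan is the value the committed plan achieves, V_oracle the optimum under full knowledge of solvability and cost, and V_random the expected value of a random plan; η < 0 then means worse than random, matching the Figure 1 legend.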

If this is right

  • Agents built on models with stronger prospective control could complete more problems within the same token budget by avoiding low-yield tasks.
  • The measured capability is distinct from single-task accuracy and directly affects deployment cost in queued problem settings.
  • Both reasoning-enabled and base models show the same deficit, suggesting the gap is not fixed by adding chain-of-thought at inference time.
  • Closing the gap would require training objectives that reward joint planning over isolated problem solving.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Models that improve on TRIAGE could be paired with lighter verification steps, freeing tokens for harder problems.
  • The framework could be extended to dynamic queues where new problems arrive after partial execution.
  • Training data that includes explicit triage examples might narrow the gap faster than accuracy-only fine-tuning.
  • Human performance on analogous triage tasks could serve as an upper bound for future model comparisons.

Load-bearing premise

An oracle that already knows the model's success rate and cost on every problem supplies a fair and unbiased benchmark without introducing hindsight bias or selection effects from the budget calibration.
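
Under one natural reading, an oracle that knows exactly which problems the model solves and at what cost faces a 0/1 knapsack: discard unsolvable problems, then pack solvable ones (unit value, token cost as weight) into the budget. A minimal dynamic-programming sketch under that assumption, with illustrative names:

    def oracle_value(solvable, costs, budget):
        """Maximum problems an omniscient planner solves within budget.

        solvable[i] -- whether the model actually solves problem i
        costs[i]    -- the model's true token cost on problem i

        Unsolvable problems contribute nothing, so the oracle drops
        them; what remains is a standard 0/1 knapsack.
        """
        items = [c for ok, c in zip(solvable, costs) if ok]
        best = [0] * (budget + 1)  # best[b]: max problems solved with budget b
        for c in items:
            for b in range(budget, c - 1, -1):
                best[b] = max(best[b], best[b - c] + 1)
        return best[budget]

If solvability is probabilistic rather than binary, the same recurrence applies with expected success probability in place of the unit value.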

What would settle it

A model that consistently produces plans whose achieved value reaches at least 85 percent of the oracle optimum across held-out task pools of varying difficulty would falsify the reported capability gap.
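
The criterion is mechanical to evaluate once plan and oracle values are in hand. A hedged sketch (the 85 percent threshold is the one named above; everything else is illustrative):

    def meets_bar(plan_values, oracle_values, threshold=0.85):
        """True if every held-out pool reaches the threshold fraction
        of its oracle optimum."""
        return all(v >= threshold * v_star
                   for v, v_star in zip(plan_values, oracle_values))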

Figures

Figures reproduced from arXiv: 2605.13414 by Shubhashis Roy Dipta, Zabir Al Nazi.

Figure 1. Triage efficiency across models and benchmarks at moderate budget (α = 0.5). Solid bars show η_U (unconstrained regime, advisory allocations), hatched bars show η_E (constrained regime, binding allocations), and red dashed lines mark base accuracy. η = 1 is oracle triage; η = 0 is random; η < 0 is worse than random. Bar color distinguishes standard inference from extended reasoning.

Figure 2. Normalized triage regret across models and benchmarks at moderate budget (α = 0.5). Solid bars show R̃_U = (V_oracle − V_U)/V_oracle (unconstrained regime), hatched bars show R̃_E (constrained regime), and red dashed lines mark base accuracy. R̃ = 0 is oracle (no value lost); R̃ = 1 is full regret (no value captured). Lower is better.

Figure 3. Trajectories through (D, W) space as the unsolvable-injection ratio r increases from 0.25 to 1.00 (marker size). The star marks the ideal corner: high detection rate, low waste rate.

Figure 4. Per-(model, mode) accuracy on each benchmark. Accuracy is the proportion of solvable problems …

Figure 5. Triage skill in the advisory-budget regime …

Figure 6. Triage skill in the enforced-budget regime …

Figure 7. Budget-aware re-solve, aggregate per model. Cells colored by intensity within each metric. N is the number of (problem, allocation) pairs re-issued. Compliance is the fraction of problems for which the model's actual output length stays within its self-declared budget a_i. The four right-hand counts (newly correct, lost correct, kept correct, still wrong) sum to N and decompose the change between Acc_baseline …

Figure 8. Budget-aware re-solve, per (model × dataset). HLE sub-domains are aggregated by problem-count weighted mean. Rows within each model block: original baseline accuracy, accuracy with the budget banner, and compliance rate.
Original abstract

Deploying language models as autonomous agents requires more than per-task accuracy: when an agent faces a queue of problems under a finite token budget, it must decide which to attempt, in what order, and how much compute to commit to each, all before any execution feedback is available. This is the prospective form of metacognitive control studied for decades in human cognition, yet whether language models possess it remains untested. We introduce TRIAGE, an evaluation framework in which a model receives a task pool and a token budget calibrated to its own baseline cost, and commits to a single ordered plan that jointly encodes selection, sequencing, and per-problem allocation. Plans are scored against an oracle with full knowledge of the model's solvability and cost on each problem, yielding a triage efficiency ratio on a common scale. We evaluate frontier and open-source models, with and without reasoning enabled, across competition mathematics, graduate-level science, code generation, and expert multidisciplinary knowledge, and find that current language models exhibit substantial gaps in prospective metacognitive control, revealing a previously unmeasured capability dimension with direct implications for resource-efficient agent deployment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces TRIAGE, a framework to evaluate prospective metacognitive control in LLMs. A model is given a task pool and a token budget calibrated to its baseline cost on those tasks. It must produce one ordered plan encoding selection, sequencing, and allocation decisions before any execution. The plan is scored against an oracle possessing full knowledge of the model's solvability and costs per problem, producing a triage efficiency ratio. Evaluations across math, science, code, and knowledge tasks indicate substantial gaps in current models' prospective control abilities.

Significance. If the central findings hold, the work identifies a new, previously unmeasured capability dimension relevant to resource-efficient deployment of LLMs as agents. The framework offers a concrete, scalable method to quantify selection, sequencing, and allocation reasoning under constraints, with direct implications for practical agent systems. The multi-domain evaluation and comparison of frontier and open-source models with and without reasoning are strengths.

major comments (1)
  1. [Abstract] The triage efficiency ratio is defined against an oracle with complete knowledge of solvability and per-problem costs, while the model's plan is formed prospectively without feedback. This creates an information asymmetry that may cause the reported gaps to partly reflect the oracle's hindsight rather than a pure deficit in metacognitive control. The token budget is calibrated to the model's baseline cost on the evaluation task pool, which embeds task-specific statistics unavailable at planning time.
minor comments (2)
  1. The abstract reports high-level findings without specific quantitative results, error bars, or detailed task descriptions, which limits immediate assessment of effect sizes.
  2. Clarify whether the baseline cost measurement uses the same task pool or a held-out set to avoid circularity in calibration (see the sketch after this list).
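
Minor comment 2 can be stated concretely: the two calibration choices differ in whether pool-specific cost statistics leak into the budget. A sketch of both, assuming the budget is α times total baseline cost (α = 0.5 is the moderate-budget setting in the figure captions; the exact calibration rule is otherwise an assumption):

    def same_pool_budget(pool_costs, alpha=0.5):
        # baseline measured on the evaluation pool itself
        return int(alpha * sum(pool_costs))

    def held_out_budget(holdout_costs, pool_size, alpha=0.5):
        # baseline estimated on disjoint problems, then scaled to pool size
        mean_cost = sum(holdout_costs) / len(holdout_costs)
        return int(alpha * mean_cost * pool_size)

The held-out variant removes the circularity at the cost of a noisier budget estimate.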

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and positive assessment of TRIAGE's significance. We address the major comment below.

Point-by-point responses
  1. Referee: [Abstract] The triage efficiency ratio is defined against an oracle with complete knowledge of solvability and per-problem costs, while the model's plan is formed prospectively without feedback. This creates an information asymmetry that may cause the reported gaps to partly reflect the oracle's hindsight rather than a pure deficit in metacognitive control. The token budget is calibrated to the model's baseline cost on the evaluation task pool, which embeds task-specific statistics unavailable at planning time.

    Authors: The information asymmetry is a deliberate feature of the evaluation design. The oracle establishes the theoretical maximum triage efficiency given the model's actual solvability and per-problem costs, providing a normalized measure of how closely the prospective plan approaches optimality. This approach is standard in planning and resource-allocation benchmarks to quantify deviation from the best achievable outcome under uncertainty. The gaps therefore capture limitations in the model's prospective selection, sequencing, and allocation reasoning. Regarding budget calibration, the total token budget is derived from the model's baseline costs on the pool to ensure realism and model-specificity; however, at planning time the model is given only the aggregate budget and task pool, with no per-task cost or solvability information disclosed. We will revise the abstract and methods section to explicitly clarify this design rationale and the oracle's role as an upper-bound reference.

    revision: yes

Circularity Check

0 steps flagged

No significant circularity in TRIAGE evaluation framework

Full rationale

The paper defines the triage efficiency ratio by scoring a model's single pre-execution plan against an independent oracle that holds full knowledge of solvability and per-problem costs on the task pool. Token-budget calibration to the model's measured baseline cost on the same pool serves only as normalization to place results on a common scale; it does not embed the target ratio or any fitted parameter into the reported metric. No equations, self-citations, or uniqueness claims reduce the central result to a definition, a prior fit, or an author-supplied ansatz. The evaluation draws on external benchmarks across mathematics, science, code, and knowledge domains, keeping the derivation self-contained against verifiable external oracles.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The framework rests on the assumption that prospective planning without feedback is the relevant form of metacognition for agents and that the oracle represents an achievable optimum; no free parameters or new entities are explicitly introduced beyond the triage efficiency ratio metric.

free parameters (1)
  • token budget calibration
    Budget is set to the model's own baseline cost, which requires an empirical measurement that may involve choices in how the baseline is computed.
axioms (1)
  • domain assumption: Prospective metacognitive control (planning before any execution feedback) is a necessary capability for resource-efficient autonomous agents.
    Stated in the opening motivation for the framework.

pith-pipeline@v0.9.0 · 5495 in / 1329 out tokens · 51454 ms · 2026-05-14T19:24:34.695836+00:00 · methodology

discussion (0)

