pith. machine review for the scientific record.

arxiv: 2605.06522 · v1 · submitted 2026-05-07 · 💻 cs.LG · cs.CV

Recognition: unknown

Agentic AIs Are the Missing Paradigm for Out-of-Distribution Generalization in Foundation Models

Haibo Chen, Wenwu Zhu, Wenxuan Liu, Xin Wang

Pith reviewed 2026-05-08 12:26 UTC · model grok-4.3

classification 💻 cs.LG cs.CV
keywords out-of-distribution generalization · foundation models · agentic AI · parameter coverage ceiling · open-world settings · model-centric methods · distribution shift

The pith

Foundation models face a parameter coverage ceiling on out-of-distribution inputs that agentic systems can extend beyond.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that out-of-distribution challenges for foundation models differ from prior settings and cannot be resolved through model-centric methods alone, whether at training or test time. It establishes a parameter coverage ceiling showing that some practically relevant inputs lie outside what any parameter-based representation can handle within a given tolerance. Agentic systems add four structural properties that expand the set of reachable behaviors past this limit. The position treats agentic and model-centric approaches as complementary rather than competing, and calls for research that recognizes the agentic paradigm explicitly.

Core claim

We prove a parameter coverage ceiling: there exist practically relevant inputs that no model-centric method (training-time or test-time) can handle within tolerance ε, for reasons intrinsic to parameter-based representation. Agentic OOD systems are characterized by four structural properties—perception, strategy selection, external action, and closed-loop verification—and these properties strictly extend the reachable set beyond the ceiling.

What carries the argument

The parameter coverage ceiling, a limit on what parameter-based representations can achieve for certain OOD inputs, together with the four structural properties of agentic systems that extend the reachable set.
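The ceiling and the extension claim can be stated compactly. The following is our own sketch, not the paper's formalism; every symbol here is an assumption introduced for illustration:

```latex
% Our notation, not the paper's. Let \Theta be the parameter space,
% f_\theta a model, \varepsilon the tolerance, and
% \mathcal{X}_{\mathrm{rel}} the set of practically relevant inputs.

% Reachable set of model-centric methods (training- or test-time):
R_{\mathrm{param}} = \{\, x : \exists\, \theta \in \Theta,\;
  \mathrm{err}(f_\theta, x) \le \varepsilon \,\}

% Parameter coverage ceiling: some relevant inputs are unreachable.
\exists\, x \in \mathcal{X}_{\mathrm{rel}} :\; x \notin R_{\mathrm{param}}

% Claimed strict extension by agentic systems:
R_{\mathrm{param}} \subsetneq R_{\mathrm{agentic}}, \qquad
\mathcal{X}_{\mathrm{rel}} \cap \bigl(R_{\mathrm{agentic}} \setminus R_{\mathrm{param}}\bigr) \neq \emptyset
```

On this reading, the paper's burden is the last line: showing the set difference is non-empty on practically relevant inputs, not merely by definition.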

If this is right

  • Model-centric methods alone are insufficient for the full range of OOD phenomena faced by foundation models in open-world settings.
  • Agentic systems must be studied as a distinct and necessary research direction rather than an add-on.
  • Progress on foundation-model OOD requires treating the two paradigms as complementary.
  • A research agenda should focus on integrating the four agentic properties with existing foundation-model pipelines.
  • Partially observed multi-stage training distributions must be formalized stage-by-stage to assess coverage limits accurately.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Hybrid architectures that alternate between parameter updates and external action loops could become standard for deployed foundation models.
  • Domains with high open-ended task variation, such as interactive agents or robotic planning, offer natural test beds for measuring extension beyond the ceiling.
  • The argument implies that evaluation benchmarks for OOD should include explicit tests for closed-loop verification rather than single-pass prediction.
  • Similar coverage ceilings may appear in other parameter-heavy systems outside language or vision, suggesting the result generalizes.

Load-bearing premise

The four structural properties of agentic systems strictly extend the reachable set beyond the parameter coverage ceiling without introducing new unaddressed limitations.

What would settle it

A concrete demonstration of either a model-centric method that handles, within tolerance ε, an input shown to lie beyond the parameter coverage ceiling, or an agentic system that fails to reach such an input because of its own structural constraints.

Figures

Figures reproduced from arXiv: 2605.06522 by Haibo Chen, Wenwu Zhu, Wenxuan Liu, Xin Wang.

Figure 1
Figure 1. Three paradigms of OOD generalization for foundation models. Training-time and test-time model-centric methods both adjust the model. Agentic methods keep the model fixed and wrap it in a perceive–reason–act–verify loop with strategies including retrieval, tools, decomposition, verification, and abstention. The paradigms overlap on inference-time model adjustment but each contains actions outside the other…
Figure 2
Figure 2. The unified agentic OOD framework. An input passes through PERCEIVE (diagnose OOD type w.r.t. relevant D(k)), REASON (select strategy or composition), ACT (execute), and VERIFY (check reliability; loop back if unreliable). TTA corresponds to the SELF-ADAPT branch; training-time methods are complementary, shaping fθ before the loop runs. Retrieval-augmented generation (RAG) [42] addresses knowledge-boundar…
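The loop described for Figure 2 can be sketched in code. This is a minimal illustration under our own assumptions: the function names, the keyword-based OOD check, and the strategy table are ours, not the paper's API.

```python
# Minimal sketch of a perceive -> reason -> act -> verify loop, as
# described for Figure 2. All names and stubs here are illustrative.

def perceive(x: str) -> str:
    """Diagnose the OOD type of input x (stub: naive keyword check)."""
    return "knowledge_boundary" if "2026" in x else "in_distribution"

def reason(ood_type: str) -> str:
    """Select a handling strategy for the diagnosed OOD type."""
    strategies = {
        "knowledge_boundary": "retrieve",   # e.g. retrieval augmentation
        "in_distribution": "answer",        # direct model call
    }
    return strategies.get(ood_type, "abstain")

def act(strategy: str, x: str):
    """Execute the chosen strategy (external actions are stubbed)."""
    if strategy == "retrieve":
        return f"answer({x}) grounded in retrieved evidence"
    if strategy == "answer":
        return f"answer({x})"
    return None  # abstain

def verify(output) -> bool:
    """Check reliability; a real system might use self-consistency."""
    return output is not None

def agentic_loop(x: str, max_iters: int = 3) -> str:
    """Run perceive -> reason -> act -> verify, looping back if unreliable."""
    for _ in range(max_iters):
        strategy = reason(perceive(x))
        output = act(strategy, x)
        if verify(output):
            return output
    return "abstain"
```

The structural point survives the stubs: the model itself is never updated; only the strategy wrapped around it changes per input.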
Original abstract

Foundation models (FMs) are increasingly deployed in open-world settings where distribution shift is the rule rather than the exception. The out-of-distribution (OOD) phenomena they face -- knowledge boundaries, capability ceilings, compositional shifts, and open-ended task variation -- differ in kind from the settings that have shaped prior OOD research, and are further complicated because the pretraining and post-training distributions of modern FMs are often only partially observed. Our position is that OOD for foundation models is a structurally distinct problem that cannot be solved within the prevailing model-centric paradigm, and that agentic systems constitute the missing paradigm required to address it. We defend this claim through four steps. First, we give a stage-aware formalization of OOD that accommodates partially observed multi-stage training distributions. Second, we prove a parameter coverage ceiling: there exist practically relevant inputs that no model-centric method (training-time or test-time) can handle within tolerance $\varepsilon$, for reasons intrinsic to parameter-based representation. Third, we characterize agentic OOD systems by four structural properties -- perception, strategy selection, external action, and closed-loop verification -- and show that they strictly extend the reachable set beyond the ceiling. Fourth, we respond to seven counterarguments, conceding two, and outline a research agenda. We do not claim that agentic methods subsume model-centric ones; we argue that the two are complementary, and that progress on FM-OOD requires explicit recognition of the agentic paradigm as a first-class research direction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that out-of-distribution (OOD) generalization for foundation models constitutes a structurally distinct problem that cannot be solved within the model-centric paradigm. It supports this position via four steps: a stage-aware formalization of OOD that handles partially observed multi-stage training distributions; a proof of a parameter coverage ceiling showing that certain practically relevant inputs lie outside the reach of any training-time or test-time model-centric method within tolerance ε; a characterization of agentic OOD systems by four structural properties (perception, strategy selection, external action, closed-loop verification) that strictly extend the reachable set; and responses to seven counterarguments with an outline of a research agenda. The authors emphasize complementarity rather than replacement of model-centric methods.

Significance. If the formalization and extension argument hold, the work would provide a clear conceptual framework for why intrinsic limits exist in parameter-based representations and motivate treating agentic systems as a first-class research direction alongside model improvements. The stage-aware OOD definition and explicit ceiling result could serve as useful reference points for future empirical and theoretical work on open-world deployment of foundation models.

major comments (2)
  1. [Step 3] The central claim that the four structural properties strictly extend the reachable set beyond the parameter coverage ceiling is load-bearing. The manuscript must explicitly address whether closed-loop verification is implemented via non-parameter mechanisms or remains subject to the stage-aware OOD definition and coverage ceiling from steps 1–2. If verification reuses the same foundation model (or any parameter-based component), the extension is not guaranteed to be strict; a formal argument or counterexample showing immunity to the ceiling is required.
  2. [Step 2] The proof of the parameter coverage ceiling asserts the existence of inputs no model-centric method can handle within ε for intrinsic representational reasons. The derivation, including all assumptions on partially observed distributions, the precise definition of the reachable set, and error bounds, should be presented in full so that readers can verify the claim is not tautological to the chosen formalization.
minor comments (2)
  1. The abstract is information-dense; expanding the four-step outline with one sentence each would improve immediate readability without lengthening the abstract excessively.
  2. Notation for the tolerance parameter ε and the reachable set should be introduced consistently when first used in the formal sections.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and precise comments, which highlight important areas for strengthening the formal arguments in our manuscript. We address each major comment below and will make the indicated revisions to improve clarity and rigor.

Point-by-point responses
  1. Referee: [Step 3] The central claim that the four structural properties strictly extend the reachable set beyond the parameter coverage ceiling is load-bearing. The manuscript must explicitly address whether closed-loop verification is implemented via non-parameter mechanisms or remains subject to the stage-aware OOD definition and coverage ceiling from steps 1–2. If verification reuses the same foundation model (or any parameter-based component), the extension is not guaranteed to be strict; a formal argument or counterexample showing immunity to the ceiling is required.

    Authors: We agree that the strict extension claim requires explicit formal justification, especially concerning closed-loop verification when it may reuse foundation model components. In the revised manuscript we will add a new subsection in Step 3 that supplies a formal argument showing how the combination of external action and closed-loop verification extends the reachable set. The argument proceeds by demonstrating that agentic interaction allows the system to generate new observations and modify the effective input distribution through environmental actions; this process is not available to any fixed-parameter model-centric method. We will include a proof sketch establishing that, for any input outside the coverage ceiling, there exists a finite sequence of actions and verifications that reaches it within ε, even when the underlying model is reused for verification steps. This relies on the non-stationary input distribution induced by external actions rather than on non-parameter mechanisms per se. revision: yes
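The rebuttal's proof-sketch claim can be rendered formally. This is our own notation for what the simulated authors assert, not anything stated in the paper:

```latex
% Our rendering of the rebuttal's claim. R_{\mathrm{param}} is the
% reachable set of model-centric methods, \mathcal{X}_{\mathrm{rel}}
% the practically relevant inputs, and a_1, \dots, a_n external actions.
% For any input beyond the ceiling, some finite action/verification
% sequence brings it within tolerance, even reusing f_\theta:
\forall\, x \in \mathcal{X}_{\mathrm{rel}} \setminus R_{\mathrm{param}}
  \;\; \exists\, n < \infty,\; (a_1, \dots, a_n) :\;
  \mathrm{err}\bigl(f_\theta,\; x \mid a_1, \dots, a_n\bigr) \le \varepsilon
% Conditioning on the actions captures the non-stationary input
% distribution induced by external interaction.
```

Note that the quantifier over finite action sequences is exactly what the referee's circularity concern targets: it must be proved realizable, not assumed.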

  2. Referee: [Step 2] The proof of the parameter coverage ceiling asserts the existence of inputs no model-centric method can handle within ε for intrinsic representational reasons. The derivation, including all assumptions on partially observed distributions, the precise definition of the reachable set, and error bounds, should be presented in full so that readers can verify the claim is not tautological to the chosen formalization.

    Authors: We concur that the proof of the parameter coverage ceiling must be expanded to allow independent verification. In the revised manuscript we will present the complete derivation, either in the main text or as a self-contained appendix. The expanded version will explicitly list all assumptions on partially observed multi-stage training distributions, provide the precise set-theoretic definition of the reachable set for model-centric methods (training-time and test-time), and detail the error bounds with respect to tolerance ε. The derivation will be structured as a sequence of lemmas showing that the ceiling follows from the fixed-parameter representational capacity under partial observability, rather than being a direct restatement of the stage-aware OOD definition. revision: yes

Circularity Check

1 step flagged

Agentic extension claim reduces to definitional choice of properties that bypass the ceiling by construction

specific steps
  1. self definitional [Abstract (step 3)]
    "we characterize agentic OOD systems by four structural properties -- perception, strategy selection, external action, and closed-loop verification -- and show that they strictly extend the reachable set beyond the ceiling."

    The four properties are selected to include 'external action' and 'closed-loop verification,' which are defined as operating outside parameter-based representation. The claim that these properties 'strictly extend' the reachable set therefore follows directly from the definitional inclusion of non-parameter mechanisms rather than from a separate proof that the properties can be implemented while remaining immune to the stage-aware OOD ceiling established earlier.

full rationale

The paper's central derivation proceeds from a parameter coverage ceiling (step 2) to the claim that agentic systems strictly extend the reachable set (step 3). The extension is obtained by characterizing agentic systems via four properties that explicitly incorporate external mechanisms; this characterization makes the strict extension hold by the definition of the properties rather than by an independent argument that such properties can be realized without reintroducing the ceiling. The abstract states the properties 'strictly extend' the set, but the load-bearing move is the choice of definition itself. No other circular steps are present; the formalization of the ceiling and the counterargument responses appear self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The position rests on an unshown proof of a parameter coverage ceiling and the assumption that agentic properties extend reachability; no free parameters or invented entities are visible in the abstract.

axioms (1)
  • domain assumption There exist practically relevant inputs outside the coverage of any parameter-based representation within tolerance ε
    This is the load-bearing premise of the coverage ceiling claim stated in the abstract.

pith-pipeline@v0.9.0 · 5581 in / 1224 out tokens · 49756 ms · 2026-05-08T12:26:48.460130+00:00 · methodology


Reference graph

Works this paper leans on

63 extracted references · 20 canonical work pages · 11 internal anchors

  1. [1]

    Dataset shift in machine learning,

    J. Quiñonero-Candela, M. Sugiyama, A. Schwaighofer, and N. D. Lawrence, Eds., “Dataset shift in machine learning.” MIT Press, 2009

  2. [2]

    Towards out-of-distribution generalization: A survey,

    J. Liu, Z. Shen, Y . He, X. Zhang, R. Xu, H. Yu, and P. Cui, “Towards out-of-distribution generalization: A survey,”arXiv preprint arXiv:2108.13624, 2021

  3. [3]

    The clinician and dataset shift in artificial intelligence,

    S. G. Finlayson, A. Subbaswamy, K. Singh, J. Bowers, A. Kupke, J. Zittrain, I. S. Kohane, and S. Saria, “The clinician and dataset shift in artificial intelligence,”New England Journal of Medicine, vol. 385, no. 3, pp. 283–286, 2021

  4. [4]

    Can autonomous vehicles identify, recover from, and adapt to distribution shifts?

    A. Filos, P. Tigkas, R. McAllister, N. Rhinehart, S. Levine, and Y . Gal, “Can autonomous vehicles identify, recover from, and adapt to distribution shifts?” inInternational Conference on Machine Learning. PMLR, 2020, pp. 3145–3153

  5. [5]

    Large language models struggle to learn long-tail knowledge,

    N. Kandpal, H. Deng, A. Roberts, E. Wallace, and C. Raffel, “Large language models struggle to learn long-tail knowledge,” inInternational conference on machine learning. PMLR, 2023, pp. 15 696–15 707

  6. [6]

    On the Opportunities and Risks of Foundation Models

    R. Bommasani, D. A. Hudson, E. Adeli, R. Altman, S. Arora, S. von Arx, M. S. Bernstein, J. Bohg, A. Bosselut, E. Brunskillet al., “On the opportunities and risks of foundation models,”arXiv preprint arXiv:2108.07258, 2021

  7. [7]

    Language models are few-shot learners,

    T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askellet al., “Language models are few-shot learners,”Advances in neural information processing systems, vol. 33, pp. 1877–1901, 2020

  8. [8]

    LLaMA: Open and Efficient Foundation Language Models

    H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azharet al., “Llama: Open and efficient foundation language models,”arXiv preprint arXiv:2302.13971, 2023

  9. [9]

    Learning transferable visual models from natural language supervision,

    A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clarket al., “Learning transferable visual models from natural language supervision,” inInternational conference on machine learning. PmLR, 2021, pp. 8748–8763

  10. [10]

    Segment anything,

    A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y . Loet al., “Segment anything,” inProceedings of the IEEE/CVF international conference on computer vision, 2023, pp. 4015–4026

  11. [11]

    Towards graph foundation models: A survey and beyond,

    J. Liu, C. Yang, Z. Lu, J. Chen, Y . Li, M. Zhang, T. Bai, Y . Fang, L. Sun, P. S. Yuet al., “Towards graph foundation models: A survey and beyond,”arXiv preprint arXiv:2310.11829, 2023

  12. [12]

    Position: Graph foundation models are already here,

    H. Mao, Z. Chen, W. Tang, J. Zhao, Y . Ma, T. Zhao, N. Shah, M. Galkin, and J. Tang, “Position: Graph foundation models are already here,” inForty-first International Conference on Machine Learning, 2024

  13. [13]

    Training language models to follow instructions with human feedback,

    L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Rayet al., “Training language models to follow instructions with human feedback,”Advances in neural information processing systems, vol. 35, pp. 27 730–27 744, 2022

  14. [14]

    Wilds: A benchmark of in-the-wild distribution shifts,

    P. W. Koh, S. Sagawa, H. Marklund, S. M. Xie, M. Zhang, A. Balsubramani, W. Hu, M. Yasunaga, R. L. Phillips, I. Gaoet al., “Wilds: A benchmark of in-the-wild distribution shifts,” inInternational conference on machine learning. PMLR, 2021, pp. 5637–5664

  15. [15]

    Benchmarking Neural Network Robustness to Common Corruptions and Perturbations

    D. Hendrycks and T. Dietterich, “Benchmarking neural network robustness to common corruptions and perturbations,”arXiv preprint arXiv:1903.12261, 2019

  16. [16]

    Survey of hallucination in natural language generation,

    Z. Ji, N. Lee, R. Frieske, T. Yu, D. Su, Y . Xu, E. Ishii, Y . J. Bang, A. Madotto, and P. Fung, “Survey of hallucination in natural language generation,”ACM computing surveys, vol. 55, no. 12, pp. 1–38, 2023

  17. [18]

    Invariant Risk Minimization

    M. Arjovsky, L. Bottou, I. Gulrajani, and D. Lopez-Paz, “Invariant risk minimization,”arXiv preprint arXiv:1907.02893, 2019

  18. [19]

    Out-of-distribution generalization via risk extrapolation (rex),

    D. Krueger, E. Caballero, J.-H. Jacobsen, A. Zhang, J. Binas, D. Zhang, R. Le Priol, and A. Courville, “Out-of-distribution generalization via risk extrapolation (rex),” in International conference on machine learning. PMLR, 2021, pp. 5815–5826

  19. [20]

    Distributionally Robust Neural Networks for Group Shifts: On the Importance of Regularization for Worst-Case Generalization

    S. Sagawa, P. W. Koh, T. B. Hashimoto, and P. Liang, “Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization,”arXiv preprint arXiv:1911.08731, 2019

  20. [21]

    Augmix: A simple data processing method to improve robustness and uncertainty

    D. Hendrycks, N. Mu, E. D. Cubuk, B. Zoph, J. Gilmer, and B. Lakshminarayanan, “Augmix: A simple data processing method to improve robustness and uncertainty,”arXiv preprint arXiv:1912.02781, 2019

  21. [22]

    Tent: Fully test-time adaptation by entropy minimization,

    D. Wang, E. Shelhamer, S. Liu, B. Olshausen, and T. Darrell, “Tent: Fully test-time adaptation by entropy minimization,”arXiv preprint arXiv:2006.10726, 2020

  22. [23]

    Do we really need to access the source data? source hypothesis transfer for unsupervised domain adaptation,

    J. Liang, D. Hu, and J. Feng, “Do we really need to access the source data? source hypothesis transfer for unsupervised domain adaptation,” inInternational conference on machine learning. PMLR, 2020, pp. 6028–6039

  23. [24]

    Test-time training with self-supervision for generalization under distribution shifts,

    Y . Sun, X. Wang, Z. Liu, J. Miller, A. Efros, and M. Hardt, “Test-time training with self-supervision for generalization under distribution shifts,” inInternational conference on machine learning. PMLR, 2020, pp. 9229–9248

  24. [25]

    Domain-adversarial training of neural networks,

    Y . Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. March, and V . Lempitsky, “Domain-adversarial training of neural networks,”Journal of machine learning research, vol. 17, no. 59, pp. 1–35, 2016

  25. [26]

    Faith and fate: Limits of transformers on compositionality,

    N. Dziri, X. Lu, M. Sclar, X. L. Li, L. Jiang, B. Y . Lin, S. Welleck, P. West, C. Bhagavatula, R. Le Bras et al., “Faith and fate: Limits of transformers on compositionality,”Advances in neural information processing systems, vol. 36, pp. 70 293–70 332, 2023

  26. [27]

    A fine-grained analysis on distribution shift,

    O. Wiles, S. Gowal, F. Stimberg, S. Alvise-Rebuffi, I. Ktena, K. Dvijotham, and T. Cemgil, “A fine-grained analysis on distribution shift,”arXiv preprint arXiv:2110.11328, 2021

  27. [28]

    A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions,

    L. Huang, W. Yu, W. Ma, W. Zhong, Z. Feng, H. Wang, Q. Chen, W. Peng, X. Feng, B. Qinet al., “A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions,” ACM Transactions on Information Systems, vol. 43, no. 2, pp. 1–55, 2025

  28. [29]

    Realtime qa: What’s the answer right now?

    J. Kasai, K. Sakaguchi, R. Le Bras, A. Asai, X. Yu, D. Radev, N. A. Smith, Y . Choi, K. Inuiet al., “Realtime qa: What’s the answer right now?”Advances in neural information processing systems, vol. 36, pp. 49 025–49 043, 2023

  29. [30]

    Super-naturalinstructions: Generalization via declarative instructions on 1600+ nlp tasks,

    Y . Wang, S. Mishra, P. Alipoormolabashi, Y . Kordi, A. Mirzaei, A. Naik, A. Ashok, A. S. Dhanasekaran, A. Arunkumar, D. Stapet al., “Super-naturalinstructions: Generalization via declarative instructions on 1600+ nlp tasks,” inProceedings of the 2022 conference on empirical methods in natural language processing, 2022, pp. 5085–5109

  30. [31]

    Detecting spurious correlations with sanity tests for artificial intelligence guided radiology systems,

    U. Mahmood, R. Shrestha, D. D. Bates, L. Mannelli, G. Corrias, Y . E. Erdi, and C. Kanan, “Detecting spurious correlations with sanity tests for artificial intelligence guided radiology systems,”Frontiers in digital health, vol. 3, p. 671015, 2021

  31. [32]

    Satmae: Pre-training transformers for temporal and multi-spectral satellite imagery,

    Y. Cong, S. Khanna, C. Meng, P. Liu, E. Rozi, Y. He, M. Burke, D. Lobell, and S. Ermon, “Satmae: Pre-training transformers for temporal and multi-spectral satellite imagery,” Advances in Neural Information Processing Systems, vol. 35, pp. 197–211, 2022

  32. [33]

    Mvtec ad–a comprehensive real-world dataset for unsupervised anomaly detection,

    P. Bergmann, M. Fauser, D. Sattlegger, and C. Steger, “Mvtec ad–a comprehensive real-world dataset for unsupervised anomaly detection,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 9592–9600

  33. [34]

    Vision-language models for vision tasks: A survey,

    J. Zhang, J. Huang, S. Jin, and S. Lu, “Vision-language models for vision tasks: A survey,”IEEE transactions on pattern analysis and machine intelligence, vol. 46, no. 8, pp. 5625–5644, 2024

  34. [35]

    Out-of-distribution generalization on graphs: A survey,

    H. Li, X. Wang, Z. Zhang, and W. Zhu, “Out-of-distribution generalization on graphs: A survey,”IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

  35. [36]

    The risks of invariant risk minimization,

    E. Rosenfeld, P. Ravikumar, and A. Risteski, “The risks of invariant risk minimization,”arXiv preprint arXiv:2010.05761, 2020

  36. [37]

    Learning models with uniform performance via distributionally robust optimization,

    J. C. Duchi and H. Namkoong, “Learning models with uniform performance via distributionally robust optimization,”The Annals of Statistics, vol. 49, no. 3, pp. 1378–1406, 2021

  37. [38]

    mixup: Beyond Empirical Risk Minimization

    H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz, “mixup: Beyond empirical risk minimization,” arXiv preprint arXiv:1710.09412, 2017

  38. [39]

    Domain generalization: A survey,

    K. Zhou, Z. Liu, Y . Qiao, T. Xiang, and C. C. Loy, “Domain generalization: A survey,”IEEE transactions on pattern analysis and machine intelligence, vol. 45, no. 4, pp. 4396–4415, 2022

  39. [40]

    Conditional adversarial domain adaptation,

    M. Long, Z. Cao, J. Wang, and M. I. Jordan, “Conditional adversarial domain adaptation,”Advances in neural information processing systems, vol. 31, 2018

  40. [41]

    Test-time prompt tuning for zero-shot generalization in vision-language models,

    M. Shu, W. Nie, D.-A. Huang, Z. Yu, T. Goldstein, A. Anandkumar, and C. Xiao, “Test-time prompt tuning for zero-shot generalization in vision-language models,”Advances in Neural Information Processing Systems, vol. 35, pp. 14 274–14 289, 2022

  41. [42]

    Retrieval-augmented generation for knowledge-intensive nlp tasks,

    P. Lewis, E. Perez, A. Piktus, F. Petroni, V . Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschelet al., “Retrieval-augmented generation for knowledge-intensive nlp tasks,”Advances in neural information processing systems, vol. 33, pp. 9459–9474, 2020

  42. [43]

    Toolformer: Language models can teach themselves to use tools,

    T. Schick, J. Dwivedi-Yu, R. Dessì, R. Raileanu, M. Lomeli, E. Hambro, L. Zettlemoyer, N. Cancedda, and T. Scialom, “Toolformer: Language models can teach themselves to use tools,”Advances in neural information processing systems, vol. 36, pp. 68 539–68 551, 2023

  43. [44]

    Chain-of-thought prompting elicits reasoning in large language models,

    J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V . Le, D. Zhouet al., “Chain-of-thought prompting elicits reasoning in large language models,”Advances in neural information processing systems, vol. 35, pp. 24 824–24 837, 2022

  44. [45]

    On the foundations of noise-free selective classification

    R. El-Yanivet al., “On the foundations of noise-free selective classification.”Journal of Machine Learning Research, vol. 11, no. 5, 2010

  45. [46]

    Reflexion: Language agents with verbal reinforcement learning,

    N. Shinn, F. Cassano, A. Gopinath, K. Narasimhan, and S. Yao, “Reflexion: Language agents with verbal reinforcement learning,”Advances in neural information processing systems, vol. 36, pp. 8634–8652, 2023

  46. [47]

    Siren’s song in the ai ocean: A survey on hallucination in large language models,

    Y . Zhang, Y . Li, L. Cui, D. Cai, L. Liu, T. Fu, X. Huang, E. Zhao, Y . Zhang, Y . Chenet al., “Siren’s song in the ai ocean: A survey on hallucination in large language models,”Computational Linguistics, vol. 51, no. 4, pp. 1373–1418, 2025

  47. [48]

    Self-Consistency Improves Chain of Thought Reasoning in Language Models

    X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, S. Narang, A. Chowdhery, and D. Zhou, “Self-consistency improves chain of thought reasoning in language models,”arXiv preprint arXiv:2203.11171, 2022

  48. [49]

    Chain-of-verification reduces hallucination in large language models,

    S. Dhuliawala, M. Komeili, J. Xu, R. Raileanu, X. Li, A. Celikyilmaz, and J. Weston, “Chain-of-verification reduces hallucination in large language models,” inFindings of the association for computational linguistics: ACL 2024, 2024, pp. 3563–3578

  49. [50]

    Retrieval augmentation reduces hallucination in conversation,

    K. Shuster, S. Poff, M. Chen, D. Kiela, and J. Weston, “Retrieval augmentation reduces hallucination in conversation,” inFindings of the Association for Computational Linguistics: EMNLP 2021, 2021, pp. 3784–3803

  50. [51]

    ReAct: Synergizing Reasoning and Acting in Language Models

    S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y . Cao, “React: Synergizing reasoning and acting in language models,”arXiv preprint arXiv:2210.03629, 2022

  51. [52]

    Toolllm: Facilitating large language models to master 16000+ real-world apis,

    Y. Qin, S. Liang, Y. Ye, K. Zhu, L. Yan, Y. Lu, Y. Lin, X. Cong, X. Tang, B. Qian et al., “Toolllm: Facilitating large language models to master 16000+ real-world apis,” in The twelfth international conference on learning representations, 2023

  52. [53]

    Selective question answering under domain shift,

    A. Kamath, R. Jia, and P. Liang, “Selective question answering under domain shift,” inProceedings of the 58th annual meeting of the association for computational linguistics, 2020, pp. 5684–5696

  53. [54]

    A stitch in time saves nine: Detecting and mitigating hallucinations of llms by validating low-confidence generation,

    N. Varshney, W. Yao, H. Zhang, J. Chen, and D. Yu, “A stitch in time saves nine: Detecting and mitigating hallucinations of llms by validating low-confidence generation,”arXiv preprint arXiv:2307.03987, 2023

  54. [55]

    Least-to-Most Prompting Enables Complex Reasoning in Large Language Models

    D. Zhou, N. Schärli, L. Hou, J. Wei, N. Scales, X. Wang, D. Schuurmans, C. Cui, O. Bousquet, Q. Le et al., “Least-to-most prompting enables complex reasoning in large language models,”arXiv preprint arXiv:2205.10625, 2022

  55. [56]

    Decomposed prompting: A modular approach for solving complex tasks,

    T. Khot, H. Trivedi, M. Finlayson, Y . Fu, K. Richardson, P. Clark, and A. Sabharwal, “Decomposed prompting: A modular approach for solving complex tasks,”arXiv preprint arXiv:2210.02406, 2022

  56. [57]

    Sus-x: Training-free name-only transfer of vision-language models,

    V . Udandarao, A. Gupta, and S. Albanie, “Sus-x: Training-free name-only transfer of vision-language models,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 2725–2736

  57. [58]

    Rma: Rapid motor adaptation for legged robots,

    A. Kumar, Z. Fu, D. Pathak, and J. Malik, “Rma: Rapid motor adaptation for legged robots,” arXiv preprint arXiv:2107.04034, 2021

  58. [59]

    Consistent estimators for learning to defer to an expert,

    H. Mozannar and D. Sontag, “Consistent estimators for learning to defer to an expert,” inInternational conference on machine learning. PMLR, 2020, pp. 7076–7087

  59. [60]

    Autonomous chemical research with large language models,

    D. A. Boiko, R. MacKnight, B. Kline, and G. Gomes, “Autonomous chemical research with large language models,”Nature, vol. 624, no. 7992, pp. 570–578, 2023

  60. [61]

    Scaling Laws for Neural Language Models

    J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei, “Scaling laws for neural language models,”arXiv preprint arXiv:2001.08361, 2020

  61. [62]

    Emergent Abilities of Large Language Models

    J. Wei, Y . Tay, R. Bommasani, C. Raffel, B. Zoph, S. Borgeaud, D. Yogatama, M. Bosma, D. Zhou, D. Metzleret al., “Emergent abilities of large language models,”arXiv preprint arXiv:2206.07682, 2022

  62. [63]

    How is chatgpt’s behavior changing over time?

    L. Chen, M. Zaharia, and J. Zou, “How is chatgpt’s behavior changing over time?”Harvard Data Science Review, vol. 6, no. 2, 2024

  63. [64]

    Confident adaptive language modeling,

    T. Schuster, A. Fisch, J. Gupta, M. Dehghani, D. Bahri, V. Tran, Y. Tay, and D. Metzler, “Confident adaptive language modeling,” Advances in Neural Information Processing Systems, vol. 35, pp. 17456–17472, 2022