Recognition: unknown
Agentic AIs Are the Missing Paradigm for Out-of-Distribution Generalization in Foundation Models
Pith reviewed 2026-05-08 12:26 UTC · model grok-4.3
The pith
Foundation models face a parameter coverage ceiling on out-of-distribution inputs, a limit that agentic systems can extend beyond.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We prove a parameter coverage ceiling: there exist practically relevant inputs that no model-centric method (training-time or test-time) can handle within tolerance ε, for reasons intrinsic to parameter-based representation. Agentic OOD systems are characterized by four structural properties—perception, strategy selection, external action, and closed-loop verification—and these properties strictly extend the reachable set beyond the ceiling.
What carries the argument
The parameter coverage ceiling, a limit on what parameter-based representations can achieve for certain OOD inputs, together with the four structural properties of agentic systems that extend the reachable set.
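The review never states the formal shape of the ceiling or the extension claim. One plausible rendering, using symbols introduced here for illustration only (the reachable set \(\mathcal{R}_\varepsilon\), the adaptation operator \(A\), and the agentic system \(S\) are not the paper's own notation), is:

```latex
% Hypothetical formalization; symbols are illustrative, not the paper's.
% Reachable set of a model-centric method f_theta at tolerance epsilon:
\[
  \mathcal{R}_\varepsilon(f_\theta) = \{\, x : \operatorname{err}(f_\theta, x) \le \varepsilon \,\}
\]
% Parameter coverage ceiling: some practically relevant input escapes every
% training-time or test-time adaptation A of every parameterization theta:
\[
  \exists\, x^\ast :\; x^\ast \notin \bigcup_{\theta,\,A} \mathcal{R}_\varepsilon\bigl(A(f_\theta)\bigr)
\]
% Strict agentic extension: a system S with the four structural properties
% reaches strictly more than any model-centric method:
\[
  \bigcup_{\theta,\,A} \mathcal{R}_\varepsilon\bigl(A(f_\theta)\bigr) \subsetneq \mathcal{R}_\varepsilon(S)
\]
```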
If this is right
- Model-centric methods alone are insufficient for the full range of OOD phenomena faced by foundation models in open-world settings.
- Agentic systems must be studied as a distinct and necessary research direction rather than an add-on.
- Progress on foundation-model OOD requires treating the two paradigms as complementary.
- A research agenda should focus on integrating the four agentic properties with existing foundation-model pipelines.
- Partially observed multi-stage training distributions must be formalized stage-by-stage to assess coverage limits accurately.
Where Pith is reading between the lines
- Hybrid architectures that alternate between parameter updates and external action loops could become standard for deployed foundation models.
- Domains with high open-ended task variation, such as interactive agents or robotic planning, offer natural test beds for measuring extension beyond the ceiling.
- The argument implies that evaluation benchmarks for OOD should include explicit tests for closed-loop verification rather than single-pass prediction.
- Similar coverage ceilings may appear in other parameter-heavy systems outside language or vision, suggesting the result generalizes.
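The benchmark point above can be made concrete. The sketch below contrasts single-pass scoring with a closed-loop protocol in which the system may verify and revise before committing; every name (`single_pass_score`, `closed_loop_score`, the toy task) is invented here, not taken from the paper.

```python
# Hypothetical benchmark harness: score closed-loop verification, not just
# one-shot prediction. All function names and the toy task are illustrative.

def single_pass_score(predict, cases):
    """Standard OOD evaluation: one prediction per input, no feedback."""
    return sum(predict(x) == y for x, y in cases) / len(cases)

def closed_loop_score(predict, verify, revise, cases, max_rounds=3):
    """Closed-loop evaluation: the system may verify its answer against an
    external check and revise it a bounded number of times."""
    correct = 0
    for x, y in cases:
        answer = predict(x)
        for _ in range(max_rounds):
            if verify(x, answer):        # closed-loop verification gate
                break
            answer = revise(x, answer)   # revision driven by feedback
        correct += answer == y
    return correct / len(cases)

# Toy task: double the input. The first-pass predictor is systematically
# wrong, but an external check plus a crude repair step recovers every case.
cases = [(1, 2), (2, 4), (3, 6)]
predict = lambda x: 0              # always wrong on the first pass
verify = lambda x, a: a == 2 * x   # assumes an external check is available
revise = lambda x, a: a + x        # crude repair step

print(single_pass_score(predict, cases))                  # 0.0
print(closed_loop_score(predict, verify, revise, cases))  # 1.0
```

The gap between the two scores is exactly what a single-pass benchmark cannot see: the same frozen predictor, wrapped in a verify-and-revise loop, covers inputs it misses on its own.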
Load-bearing premise
The four structural properties of agentic systems strictly extend the reachable set beyond the parameter coverage ceiling without introducing new unaddressed limitations.
What would settle it
A concrete demonstration of either a model-centric method that processes within tolerance ε an input shown to lie beyond the parameter coverage ceiling, or an agentic system that fails to reach such an input because of its own structural constraints.
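A minimal toy illustrates the shape such a demonstration would take. Nothing here is from the paper: `FixedModel`, `Agent`, and the fact store are invented, and the "model" is a frozen lookup standing in for any parameter-based method.

```python
# Toy sketch (all names invented): a frozen parameter-based "model" versus an
# agent that wraps the same model but may act on an external environment.

class FixedModel:
    """Parameter-based: answers only from what was stored at training time."""
    def __init__(self, training_facts):
        self.params = dict(training_facts)  # frozen after "training"

    def predict(self, query):
        return self.params.get(query)       # None = outside its coverage

class Agent:
    """Reuses the frozen model, plus one external action (querying the world)."""
    def __init__(self, model, environment):
        self.model = model
        self.environment = environment      # external, queryable at test time

    def answer(self, query):
        guess = self.model.predict(query)   # perception + first attempt
        if guess is not None:               # verification: accept only if covered
            return guess
        return self.environment.get(query)  # external action extends the reach

training_facts = {"capital_of_france": "Paris"}
world = {"capital_of_france": "Paris", "todays_weather": "rain"}  # shifts daily

model = FixedModel(training_facts)
agent = Agent(model, world)

print(model.predict("todays_weather"))  # None: beyond the frozen parameters
print(agent.answer("todays_weather"))   # "rain": reached via external action
```

A model-centric refutation would amount to a `FixedModel` that answers `todays_weather` without the environment; an agentic refutation would be a query the `Agent`'s own action repertoire cannot reach.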
read the original abstract
Foundation models (FMs) are increasingly deployed in open-world settings where distribution shift is the rule rather than the exception. The out-of-distribution (OOD) phenomena they face -- knowledge boundaries, capability ceilings, compositional shifts, and open-ended task variation -- differ in kind from the settings that have shaped prior OOD research, and are further complicated because the pretraining and post-training distributions of modern FMs are often only partially observed. Our position is that OOD for foundation models is a structurally distinct problem that cannot be solved within the prevailing model-centric paradigm, and that agentic systems constitute the missing paradigm required to address it. We defend this claim through four steps. First, we give a stage-aware formalization of OOD that accommodates partially observed multi-stage training distributions. Second, we prove a parameter coverage ceiling: there exist practically relevant inputs that no model-centric method (training-time or test-time) can handle within tolerance $\varepsilon$, for reasons intrinsic to parameter-based representation. Third, we characterize agentic OOD systems by four structural properties -- perception, strategy selection, external action, and closed-loop verification -- and show that they strictly extend the reachable set beyond the ceiling. Fourth, we respond to seven counterarguments, conceding two, and outline a research agenda. We do not claim that agentic methods subsume model-centric ones; we argue that the two are complementary, and that progress on FM-OOD requires explicit recognition of the agentic paradigm as a first-class research direction.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that out-of-distribution (OOD) generalization for foundation models constitutes a structurally distinct problem that cannot be solved within the model-centric paradigm. It supports this position via four steps: a stage-aware formalization of OOD that handles partially observed multi-stage training distributions; a proof of a parameter coverage ceiling showing that certain practically relevant inputs lie outside the reach of any training-time or test-time model-centric method within tolerance ε; a characterization of agentic OOD systems by four structural properties (perception, strategy selection, external action, closed-loop verification) that strictly extend the reachable set; and responses to seven counterarguments with an outline of a research agenda. The authors emphasize complementarity rather than replacement of model-centric methods.
Significance. If the formalization and extension argument hold, the work would provide a clear conceptual framework for why intrinsic limits exist in parameter-based representations and motivate treating agentic systems as a first-class research direction alongside model improvements. The stage-aware OOD definition and explicit ceiling result could serve as useful reference points for future empirical and theoretical work on open-world deployment of foundation models.
major comments (2)
- [Step 3] The central claim that the four structural properties strictly extend the reachable set beyond the parameter coverage ceiling is load-bearing. The manuscript must explicitly address whether closed-loop verification is implemented via non-parameter mechanisms or remains subject to the stage-aware OOD definition and coverage ceiling from steps 1–2. If verification reuses the same foundation model (or any parameter-based component), the extension is not guaranteed to be strict; a formal argument or counterexample showing immunity to the ceiling is required.
- [Step 2] The proof of the parameter coverage ceiling asserts the existence of inputs no model-centric method can handle within ε for intrinsic representational reasons. The derivation, including all assumptions on partially observed distributions, the precise definition of the reachable set, and error bounds, should be presented in full so that readers can verify the claim is not tautological to the chosen formalization.
minor comments (2)
- The abstract is information-dense; expanding the four-step outline with one sentence each would improve immediate readability without lengthening the abstract excessively.
- Notation for the tolerance parameter ε and the reachable set should be introduced consistently when first used in the formal sections.
Simulated Author's Rebuttal
We thank the referee for the constructive and precise comments, which highlight important areas for strengthening the formal arguments in our manuscript. We address each major comment below and will make the indicated revisions to improve clarity and rigor.
read point-by-point responses
-
Referee: [Step 3] The central claim that the four structural properties strictly extend the reachable set beyond the parameter coverage ceiling is load-bearing. The manuscript must explicitly address whether closed-loop verification is implemented via non-parameter mechanisms or remains subject to the stage-aware OOD definition and coverage ceiling from steps 1–2. If verification reuses the same foundation model (or any parameter-based component), the extension is not guaranteed to be strict; a formal argument or counterexample showing immunity to the ceiling is required.
Authors: We agree that the strict extension claim requires explicit formal justification, especially concerning closed-loop verification when it may reuse foundation model components. In the revised manuscript we will add a new subsection in Step 3 that supplies a formal argument showing how the combination of external action and closed-loop verification extends the reachable set. The argument proceeds by demonstrating that agentic interaction allows the system to generate new observations and modify the effective input distribution through environmental actions; this process is not available to any fixed-parameter model-centric method. We will include a proof sketch establishing that, for any input outside the coverage ceiling, there exists a finite sequence of actions and verifications that reaches it within ε, even when the underlying model is reused for verification steps. This relies on the non-stationary input distribution induced by external actions rather than on non-parameter mechanisms per se. revision: yes
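The rebuttal's proof idea, that reusing the same frozen model for verification is safe so long as external actions keep shifting the effective input until the error falls within ε, can be sketched as a loop. This is a hypothetical illustration, not the authors' construction; `agentic_solve`, the toy model, and the error schedule are all invented here.

```python
# Hypothetical sketch of the rebuttal's argument (all names invented): the
# same frozen model is reused for verification, and external actions change
# the effective input until the error falls within tolerance eps.

def agentic_solve(model, environment, query, eps, max_steps=10):
    """Iterate: predict, verify against eps, and if verification fails, act
    externally to gather an observation that shifts the effective input."""
    context = []
    for _ in range(max_steps):
        answer, error = model(query, context)   # same frozen model, reused
        if error <= eps:                        # closed-loop verification
            return answer, len(context)         # reached within tolerance
        context.append(environment(query, len(context)))  # external action
    return None, len(context)                   # structural failure case

# Toy instantiation: the model's error shrinks only as observations accrue,
# so no zero-action (purely model-centric) call can pass verification.
def toy_model(query, context):
    return f"answer({query})", 1.0 / (1 + len(context))

def toy_environment(query, step):
    return f"observation_{step}"

answer, actions = agentic_solve(toy_model, toy_environment, "q", eps=0.25)
print(answer, actions)  # error 1/(1+k) first reaches 0.25 at k = 3 actions
```

The referee's worry maps onto the failure branch: if the environment cannot shift the effective input (or the reused model's error never drops), the loop exits without reaching the input, and the extension is not strict.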
-
Referee: [Step 2] The proof of the parameter coverage ceiling asserts the existence of inputs no model-centric method can handle within ε for intrinsic representational reasons. The derivation, including all assumptions on partially observed distributions, the precise definition of the reachable set, and error bounds, should be presented in full so that readers can verify the claim is not tautological to the chosen formalization.
Authors: We concur that the proof of the parameter coverage ceiling must be expanded to allow independent verification. In the revised manuscript we will present the complete derivation, either in the main text or as a self-contained appendix. The expanded version will explicitly list all assumptions on partially observed multi-stage training distributions, provide the precise set-theoretic definition of the reachable set for model-centric methods (training-time and test-time), and detail the error bounds with respect to tolerance ε. The derivation will be structured as a sequence of lemmas showing that the ceiling follows from the fixed-parameter representational capacity under partial observability, rather than being a direct restatement of the stage-aware OOD definition. revision: yes
Circularity Check
Agentic extension claim reduces to definitional choice of properties that bypass the ceiling by construction
specific steps
-
self-definitional
[Abstract (step 3)]
"we characterize agentic OOD systems by four structural properties -- perception, strategy selection, external action, and closed-loop verification -- and show that they strictly extend the reachable set beyond the ceiling."
The four properties are selected to include 'external action' and 'closed-loop verification,' which are defined as operating outside parameter-based representation. The claim that these properties 'strictly extend' the reachable set therefore follows directly from the definitional inclusion of non-parameter mechanisms rather than from a separate proof that the properties can be implemented while remaining immune to the stage-aware OOD ceiling established earlier.
full rationale
The paper's central derivation proceeds from a parameter coverage ceiling (step 2) to the claim that agentic systems strictly extend the reachable set (step 3). The extension is obtained by characterizing agentic systems via four properties that explicitly incorporate external mechanisms; this characterization makes the strict extension hold by the definition of the properties rather than by an independent argument that such properties can be realized without reintroducing the ceiling. The abstract states the properties 'strictly extend' the set, but the load-bearing move is the choice of definition itself. No other circular steps are present; the formalization of the ceiling and the counterargument responses appear self-contained.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: There exist practically relevant inputs that no parameter-based representation can cover within tolerance ε.
Reference graph
Works this paper leans on
- [1] J. Quiñonero-Candela, M. Sugiyama, A. Schwaighofer, and N. D. Lawrence (eds.), Dataset Shift in Machine Learning. MIT Press, 2009.
- [2] J. Liu, Z. Shen, Y. He, X. Zhang, R. Xu, H. Yu, and P. Cui, "Towards out-of-distribution generalization: A survey," arXiv preprint arXiv:2108.13624, 2021.
- [3] S. G. Finlayson, A. Subbaswamy, K. Singh, J. Bowers, A. Kupke, J. Zittrain, I. S. Kohane, and S. Saria, "The clinician and dataset shift in artificial intelligence," New England Journal of Medicine, vol. 385, no. 3, pp. 283–286, 2021.
- [4] A. Filos, P. Tigkas, R. McAllister, N. Rhinehart, S. Levine, and Y. Gal, "Can autonomous vehicles identify, recover from, and adapt to distribution shifts?" in International Conference on Machine Learning. PMLR, 2020, pp. 3145–3153.
- [5] N. Kandpal, H. Deng, A. Roberts, E. Wallace, and C. Raffel, "Large language models struggle to learn long-tail knowledge," in International Conference on Machine Learning. PMLR, 2023, pp. 15696–15707.
- [6] R. Bommasani, D. A. Hudson, E. Adeli, R. Altman, S. Arora, S. von Arx, M. S. Bernstein, J. Bohg, A. Bosselut, E. Brunskill et al., "On the opportunities and risks of foundation models," arXiv preprint arXiv:2108.07258, 2021.
- [7] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., "Language models are few-shot learners," Advances in Neural Information Processing Systems, vol. 33, pp. 1877–1901, 2020.
- [8] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar et al., "LLaMA: Open and efficient foundation language models," arXiv preprint arXiv:2302.13971, 2023.
- [9] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., "Learning transferable visual models from natural language supervision," in International Conference on Machine Learning. PMLR, 2021, pp. 8748–8763.
- [10] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo et al., "Segment anything," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 4015–4026.
- [11] J. Liu, C. Yang, Z. Lu, J. Chen, Y. Li, M. Zhang, T. Bai, Y. Fang, L. Sun, P. S. Yu et al., "Towards graph foundation models: A survey and beyond," arXiv preprint arXiv:2310.11829, 2023.
- [12] H. Mao, Z. Chen, W. Tang, J. Zhao, Y. Ma, T. Zhao, N. Shah, M. Galkin, and J. Tang, "Position: Graph foundation models are already here," in Forty-first International Conference on Machine Learning, 2024.
- [13] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray et al., "Training language models to follow instructions with human feedback," Advances in Neural Information Processing Systems, vol. 35, pp. 27730–27744, 2022.
- [14] P. W. Koh, S. Sagawa, H. Marklund, S. M. Xie, M. Zhang, A. Balsubramani, W. Hu, M. Yasunaga, R. L. Phillips, I. Gao et al., "WILDS: A benchmark of in-the-wild distribution shifts," in International Conference on Machine Learning. PMLR, 2021, pp. 5637–5664.
- [15] D. Hendrycks and T. Dietterich, "Benchmarking neural network robustness to common corruptions and perturbations," arXiv preprint arXiv:1903.12261, 2019.
- [16] Z. Ji, N. Lee, R. Frieske, T. Yu, D. Su, Y. Xu, E. Ishii, Y. J. Bang, A. Madotto, and P. Fung, "Survey of hallucination in natural language generation," ACM Computing Surveys, vol. 55, no. 12, pp. 1–38, 2023.
- [18] M. Arjovsky, L. Bottou, I. Gulrajani, and D. Lopez-Paz, "Invariant risk minimization," arXiv preprint arXiv:1907.02893, 2019.
- [19] D. Krueger, E. Caballero, J.-H. Jacobsen, A. Zhang, J. Binas, D. Zhang, R. Le Priol, and A. Courville, "Out-of-distribution generalization via risk extrapolation (REx)," in International Conference on Machine Learning. PMLR, 2021, pp. 5815–5826.
- [20] S. Sagawa, P. W. Koh, T. B. Hashimoto, and P. Liang, "Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization," arXiv preprint arXiv:1911.08731, 2019.
- [21] D. Hendrycks, N. Mu, E. D. Cubuk, B. Zoph, J. Gilmer, and B. Lakshminarayanan, "AugMix: A simple data processing method to improve robustness and uncertainty," arXiv preprint arXiv:1912.02781, 2019.
- [22] D. Wang, E. Shelhamer, S. Liu, B. Olshausen, and T. Darrell, "Tent: Fully test-time adaptation by entropy minimization," arXiv preprint arXiv:2006.10726, 2020.
- [23] J. Liang, D. Hu, and J. Feng, "Do we really need to access the source data? Source hypothesis transfer for unsupervised domain adaptation," in International Conference on Machine Learning. PMLR, 2020, pp. 6028–6039.
- [24] Y. Sun, X. Wang, Z. Liu, J. Miller, A. Efros, and M. Hardt, "Test-time training with self-supervision for generalization under distribution shifts," in International Conference on Machine Learning. PMLR, 2020, pp. 9229–9248.
- [25] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky, "Domain-adversarial training of neural networks," Journal of Machine Learning Research, vol. 17, no. 59, pp. 1–35, 2016.
- [26] N. Dziri, X. Lu, M. Sclar, X. L. Li, L. Jiang, B. Y. Lin, S. Welleck, P. West, C. Bhagavatula, R. Le Bras et al., "Faith and fate: Limits of transformers on compositionality," Advances in Neural Information Processing Systems, vol. 36, pp. 70293–70332, 2023.
- [27] O. Wiles, S. Gowal, F. Stimberg, S. Alvise-Rebuffi, I. Ktena, K. Dvijotham, and T. Cemgil, "A fine-grained analysis on distribution shift," arXiv preprint arXiv:2110.11328, 2021.
- [28] L. Huang, W. Yu, W. Ma, W. Zhong, Z. Feng, H. Wang, Q. Chen, W. Peng, X. Feng, B. Qin et al., "A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions," ACM Transactions on Information Systems, vol. 43, no. 2, pp. 1–55, 2025.
- [29] J. Kasai, K. Sakaguchi, R. Le Bras, A. Asai, X. Yu, D. Radev, N. A. Smith, Y. Choi, K. Inui et al., "RealTime QA: What's the answer right now?" Advances in Neural Information Processing Systems, vol. 36, pp. 49025–49043, 2023.
- [30] Y. Wang, S. Mishra, P. Alipoormolabashi, Y. Kordi, A. Mirzaei, A. Naik, A. Ashok, A. S. Dhanasekaran, A. Arunkumar, D. Stap et al., "Super-NaturalInstructions: Generalization via declarative instructions on 1600+ NLP tasks," in Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022, pp. 5085–5109.
- [31] U. Mahmood, R. Shrestha, D. D. Bates, L. Mannelli, G. Corrias, Y. E. Erdi, and C. Kanan, "Detecting spurious correlations with sanity tests for artificial intelligence guided radiology systems," Frontiers in Digital Health, vol. 3, p. 671015, 2021.
- [32] Y. Cong, S. Khanna, C. Meng, P. Liu, E. Rozi, Y. He, M. Burke, D. Lobell, and S. Ermon, "SatMAE: Pre-training transformers for temporal and multi-spectral satellite imagery," Advances in Neural Information Processing Systems, vol. 35, pp. 197–211, 2022.
- [33] P. Bergmann, M. Fauser, D. Sattlegger, and C. Steger, "MVTec AD: A comprehensive real-world dataset for unsupervised anomaly detection," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 9592–9600.
- [34] J. Zhang, J. Huang, S. Jin, and S. Lu, "Vision-language models for vision tasks: A survey," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, no. 8, pp. 5625–5644, 2024.
- [35] H. Li, X. Wang, Z. Zhang, and W. Zhu, "Out-of-distribution generalization on graphs: A survey," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025.
- [36] E. Rosenfeld, P. Ravikumar, and A. Risteski, "The risks of invariant risk minimization," arXiv preprint arXiv:2010.05761, 2020.
- [37] J. C. Duchi and H. Namkoong, "Learning models with uniform performance via distributionally robust optimization," The Annals of Statistics, vol. 49, no. 3, pp. 1378–1406, 2021.
- [38] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz, "mixup: Beyond empirical risk minimization," arXiv preprint arXiv:1710.09412, 2017.
- [39] K. Zhou, Z. Liu, Y. Qiao, T. Xiang, and C. C. Loy, "Domain generalization: A survey," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 4, pp. 4396–4415, 2022.
- [40] M. Long, Z. Cao, J. Wang, and M. I. Jordan, "Conditional adversarial domain adaptation," Advances in Neural Information Processing Systems, vol. 31, 2018.
- [41] M. Shu, W. Nie, D.-A. Huang, Z. Yu, T. Goldstein, A. Anandkumar, and C. Xiao, "Test-time prompt tuning for zero-shot generalization in vision-language models," Advances in Neural Information Processing Systems, vol. 35, pp. 14274–14289, 2022.
- [42] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel et al., "Retrieval-augmented generation for knowledge-intensive NLP tasks," Advances in Neural Information Processing Systems, vol. 33, pp. 9459–9474, 2020.
- [43] T. Schick, J. Dwivedi-Yu, R. Dessì, R. Raileanu, M. Lomeli, E. Hambro, L. Zettlemoyer, N. Cancedda, and T. Scialom, "Toolformer: Language models can teach themselves to use tools," Advances in Neural Information Processing Systems, vol. 36, pp. 68539–68551, 2023.
- [44] J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou et al., "Chain-of-thought prompting elicits reasoning in large language models," Advances in Neural Information Processing Systems, vol. 35, pp. 24824–24837, 2022.
- [45] R. El-Yaniv et al., "On the foundations of noise-free selective classification," Journal of Machine Learning Research, vol. 11, no. 5, 2010.
- [46] N. Shinn, F. Cassano, A. Gopinath, K. Narasimhan, and S. Yao, "Reflexion: Language agents with verbal reinforcement learning," Advances in Neural Information Processing Systems, vol. 36, pp. 8634–8652, 2023.
- [47] Y. Zhang, Y. Li, L. Cui, D. Cai, L. Liu, T. Fu, X. Huang, E. Zhao, Y. Zhang, Y. Chen et al., "Siren's song in the AI ocean: A survey on hallucination in large language models," Computational Linguistics, vol. 51, no. 4, pp. 1373–1418, 2025.
- [48] X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, S. Narang, A. Chowdhery, and D. Zhou, "Self-consistency improves chain of thought reasoning in language models," arXiv preprint arXiv:2203.11171, 2022.
- [49] S. Dhuliawala, M. Komeili, J. Xu, R. Raileanu, X. Li, A. Celikyilmaz, and J. Weston, "Chain-of-verification reduces hallucination in large language models," in Findings of the Association for Computational Linguistics: ACL 2024, 2024, pp. 3563–3578.
- [50] K. Shuster, S. Poff, M. Chen, D. Kiela, and J. Weston, "Retrieval augmentation reduces hallucination in conversation," in Findings of the Association for Computational Linguistics: EMNLP 2021, 2021, pp. 3784–3803.
- [51] S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y. Cao, "ReAct: Synergizing reasoning and acting in language models," arXiv preprint arXiv:2210.03629, 2022.
- [52] Y. Qin, S. Liang, Y. Ye, K. Zhu, L. Yan, Y. Lu, Y. Lin, X. Cong, X. Tang, B. Qian et al., "ToolLLM: Facilitating large language models to master 16000+ real-world APIs," in The Twelfth International Conference on Learning Representations, 2023.
- [53] A. Kamath, R. Jia, and P. Liang, "Selective question answering under domain shift," in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 5684–5696.
- [54] N. Varshney, W. Yao, H. Zhang, J. Chen, and D. Yu, "A stitch in time saves nine: Detecting and mitigating hallucinations of LLMs by validating low-confidence generation," arXiv preprint arXiv:2307.03987, 2023.
- [55] D. Zhou, N. Schärli, L. Hou, J. Wei, N. Scales, X. Wang, D. Schuurmans, C. Cui, O. Bousquet, Q. Le et al., "Least-to-most prompting enables complex reasoning in large language models," arXiv preprint arXiv:2205.10625, 2022.
- [56] T. Khot, H. Trivedi, M. Finlayson, Y. Fu, K. Richardson, P. Clark, and A. Sabharwal, "Decomposed prompting: A modular approach for solving complex tasks," arXiv preprint arXiv:2210.02406, 2022.
- [57] V. Udandarao, A. Gupta, and S. Albanie, "SuS-X: Training-free name-only transfer of vision-language models," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 2725–2736.
- [58] A. Kumar, Z. Fu, D. Pathak, and J. Malik, "RMA: Rapid motor adaptation for legged robots," arXiv preprint arXiv:2107.04034, 2021.
- [59] H. Mozannar and D. Sontag, "Consistent estimators for learning to defer to an expert," in International Conference on Machine Learning. PMLR, 2020, pp. 7076–7087.
- [60] D. A. Boiko, R. MacKnight, B. Kline, and G. Gomes, "Autonomous chemical research with large language models," Nature, vol. 624, no. 7992, pp. 570–578, 2023.
- [61] J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei, "Scaling laws for neural language models," arXiv preprint arXiv:2001.08361, 2020.
- [62] J. Wei, Y. Tay, R. Bommasani, C. Raffel, B. Zoph, S. Borgeaud, D. Yogatama, M. Bosma, D. Zhou, D. Metzler et al., "Emergent abilities of large language models," arXiv preprint arXiv:2206.07682, 2022.
- [63] L. Chen, M. Zaharia, and J. Zou, "How is ChatGPT's behavior changing over time?" Harvard Data Science Review, vol. 6, no. 2, 2024.
- [64] T. Schuster, A. Fisch, J. Gupta, M. Dehghani, D. Bahri, V. Tran, Y. Tay, and D. Metzler, "Confident adaptive language modeling," Advances in Neural Information Processing Systems, vol. 35, pp. 17456–17472, 2022.