Hallucination is Inevitable: An Innate Limitation of Large Language Models
Pith reviewed 2026-05-15 20:34 UTC · model grok-4.3
The pith
LLMs cannot learn all computable functions and will therefore inevitably hallucinate when used as general problem solvers.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In a formal world, hallucination is defined as any inconsistency between a computable LLM and a computable ground-truth function. Results from learning theory show that no LLM can learn all computable functions, so an LLM used as a general problem solver will inevitably hallucinate. Since the formal world is a part of the real world, hallucinations are also inevitable for real-world LLMs. For LLMs constrained by provable time complexity, the paper characterizes hallucination-prone tasks and validates its claims empirically.
What carries the argument
The formal computable world, in which hallucination is any inconsistency between the LLM's computable function and the ground-truth function, together with learning-theory results showing that no single computable function can match all others.
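The diagonalization behind that learning-theory result can be sketched in a few lines of Python. This is an illustrative toy, not the paper's formal construction: `candidate` stands in for an arbitrary computable enumeration of candidate "LLM" functions, and the diagonal `ground_truth` is built to disagree with every one of them.

```python
# Diagonalization sketch (a toy, not the paper's proof): given ANY
# computable enumeration h_0, h_1, ... of candidate "LLM" functions,
# construct a computable ground truth that differs from every h_i.

def candidate(i):
    """Toy enumeration of total computable candidate functions."""
    return lambda x: (x + i) % 7  # arbitrary illustrative family

def ground_truth(x):
    """Diagonal function: disagrees with candidate x on input x."""
    return candidate(x)(x) + 1

# Every candidate h_i is wrong -- i.e. 'hallucinates' -- on input i:
for i in range(100):
    assert candidate(i)(i) != ground_truth(i)
```

Because the argument only needs one disagreeing input per candidate, no training procedure that ultimately outputs a single fixed computable function can escape it.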
If this is right
- Any LLM used for general problem solving will produce hallucinations on some inputs.
- Hallucinations cannot be completely eliminated through any finite training process.
- Tasks requiring computation beyond the LLM's time complexity limits are especially prone to hallucination.
- Existing mitigation methods can reduce but not remove the possibility of hallucination.
- Safe deployment of LLMs requires acknowledging this limitation rather than assuming it can be trained away.
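The time-complexity point above can be made concrete with a toy compute-bounded solver (the names and the fallback rule are illustrative assumptions, not from the paper): once an input needs more steps than the solver's fixed budget, it must emit some answer anyway, and that forced guess is where errors become unavoidable.

```python
# Toy model (not from the paper) of a compute-bounded solver: it may
# spend at most `budget` steps, so larger inputs force a guess.

def budgeted_parity(bits, budget):
    """Exact parity when the input fits the budget; otherwise guess 0."""
    if len(bits) > budget:
        return 0  # forced guess: no time left to inspect every bit
    acc = 0
    for b in bits:
        acc ^= b
    return acc

assert budgeted_parity([1, 0, 1], budget=10) == 0   # fits: exact answer
assert budgeted_parity([1] * 11, budget=10) == 0    # guess; true parity is 1
```

The second call returns a wrong answer by construction: any fixed budget admits inputs it cannot check, mirroring the claim that tasks beyond an LLM's provable time bound are hallucination-prone.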
Where Pith is reading between the lines
- If the result holds, LLMs are best suited for tasks within a restricted class of functions rather than open-ended use.
- Hybrid systems that pair LLMs with symbolic verifiers or exact algorithms could address the gaps left by this limitation.
- Research into the precise boundary of learnable functions for current architectures might identify safer application domains.
- Real-world complexity may make the formal bound even stricter, suggesting hallucinations are more frequent than the minimal theoretical rate.
Load-bearing premise
The formal computable world is representative enough of the real world for the impossibility result to apply to practical LLMs.
What would settle it
Finding or constructing an LLM that correctly computes every function in a broad class of computable functions without any hallucinations would disprove the inevitability claim.
read the original abstract
Hallucination has been widely recognized to be a significant drawback for large language models (LLMs). There have been many works that attempt to reduce the extent of hallucination. These efforts have mostly been empirical so far, which cannot answer the fundamental question whether it can be completely eliminated. In this paper, we formalize the problem and show that it is impossible to eliminate hallucination in LLMs. Specifically, we define a formal world where hallucination is defined as inconsistencies between a computable LLM and a computable ground truth function. By employing results from learning theory, we show that LLMs cannot learn all the computable functions and will therefore inevitably hallucinate if used as general problem solvers. Since the formal world is a part of the real world which is much more complicated, hallucinations are also inevitable for real world LLMs. Furthermore, for real world LLMs constrained by provable time complexity, we describe the hallucination-prone tasks and empirically validate our claims. Finally, using the formal world framework, we discuss the possible mechanisms and efficacies of existing hallucination mitigators as well as the practical implications on the safe deployment of LLMs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper formalizes hallucination as inconsistency between a computable LLM and a computable ground-truth function in a restricted formal world. It invokes learning-theory results to prove that no fixed computable LLM can match every computable function, hence any LLM used as a general solver must hallucinate on some inputs. The authors then argue that because the formal world is a subset of the real world, the same inevitability holds for practical LLMs; they further characterize hallucination-prone tasks under provable time bounds, supply empirical checks, and evaluate existing mitigation techniques.
Significance. If the formal-to-real bridge can be made rigorous, the result would supply a clean theoretical explanation for the persistence of hallucinations and would constrain expectations for general-purpose LLM deployment. The work correctly applies standard no-free-lunch style arguments to the LLM setting and supplies an empirical component, both of which are strengths.
major comments (2)
- [§4] §4 (Extension to real-world LLMs): The sentence 'Since the formal world is a part of the real world which is much more complicated, hallucinations are also inevitable for real world LLMs' is load-bearing for the central claim yet supplies no measure over task distributions. The learning-theoretic impossibility shows existence of failing functions; it does not show that the functions arising in practical NLP tasks lie outside the approximable subset.
- [§5] §5 (Time-constrained case and empirical validation): The characterization of hallucination-prone tasks under provable time complexity is stated without an explicit reduction showing that the chosen tasks correspond to functions that are unlearnable by any polynomial-time LLM. The empirical results therefore test a weaker claim than the theoretical one.
minor comments (2)
- [Abstract] Abstract: the phrase 'provable time complexity' is introduced without definition or forward reference; a brief parenthetical or citation to the relevant section would improve readability.
- [§2] Notation: the paper should clarify whether 'computable LLM' means a Turing machine with finite description or a fixed neural architecture with fixed weights; the distinction affects which learning-theoretic theorems apply directly.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. We address each major comment below and indicate the revisions we will make to strengthen the manuscript.
read point-by-point responses
-
Referee: [§4] §4 (Extension to real-world LLMs): The sentence 'Since the formal world is a part of the real world which is much more complicated, hallucinations are also inevitable for real world LLMs' is load-bearing for the central claim yet supplies no measure over task distributions. The learning-theoretic impossibility shows existence of failing functions; it does not show that the functions arising in practical NLP tasks lie outside the approximable subset.
Authors: We thank the referee for this observation. The core result establishes that no fixed computable LLM can learn every computable function, so any LLM deployed as a general solver must hallucinate on some inputs. Because the formal world is a subset of the real world, this limitation extends to practical LLMs when they are used for broad problem-solving. We agree that the manuscript does not supply a probability measure over real-world task distributions demonstrating that typical NLP tasks fall outside the learnable subset. We will revise §4 to clarify that the inevitability claim applies specifically to general-purpose use rather than asserting that every practical task is unlearnable, and we will add a short discussion of how the existence result constrains expectations for universal solvers. revision: partial
-
Referee: [§5] §5 (Time-constrained case and empirical validation): The characterization of hallucination-prone tasks under provable time complexity is stated without an explicit reduction showing that the chosen tasks correspond to functions that are unlearnable by any polynomial-time LLM. The empirical results therefore test a weaker claim than the theoretical one.
Authors: We appreciate the referee noting the missing link. In §5 the tasks are selected because they are known to require super-polynomial time under standard complexity assumptions. While we did not include an explicit reduction connecting these tasks to the learning-theoretic unlearnable functions, the section is intended to illustrate the practical consequences of time bounds. The empirical experiments show performance degradation consistent with the theoretical expectations. We will revise the section with a clarifying paragraph that explicitly relates the chosen tasks to the time-bounded unlearnability result and will note that the empirical component is illustrative and complementary to the theory. revision: partial
Circularity Check
No circularity; derivation applies external learning-theory results to a self-contained formal model
full rationale
The paper defines hallucination explicitly as inconsistency between a computable LLM and a computable ground-truth function, then invokes standard results from learning theory (no-free-lunch style) to conclude that no fixed LLM can match every computable function. This step is a direct logical consequence of the chosen definition plus external theorems, not a reduction by construction or self-reference. The extension to real-world LLMs is stated as an explicit modeling assumption ('the formal world is a part of the real world'), not smuggled in via self-citation or ansatz. No equations or parameters are fitted and then relabeled as predictions, and no uniqueness theorem is imported from the authors' prior work. The central claim therefore remains independent of its own inputs.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption LLMs and ground truth are both computable functions
- standard math Results from learning theory apply to show no LLM can learn all computable functions
Forward citations
Cited by 20 Pith papers
-
SciIntegrity-Bench: A Benchmark for Evaluating Academic Integrity in AI Scientist Systems
SciIntegrity-Bench shows state-of-the-art LLMs violate academic integrity in 34.2% of dilemmatic scenarios, primarily by fabricating data rather than refusing impossible tasks.
-
EquiMem: Calibrating Shared Memory in Multi-Agent Debate via Game-Theoretic Equilibrium
EquiMem calibrates shared memory in multi-agent debate by computing a game-theoretic equilibrium from agent queries and paths, outperforming heuristics and LLM validators across benchmarks while remaining robust to ad...
-
Green Shielding: A User-Centric Approach Towards Trustworthy AI
Green Shielding introduces CUE criteria and the HCM-Dx benchmark to demonstrate that routine prompt variations systematically alter LLM diagnostic behavior along clinically relevant dimensions, producing Pareto-like t...
-
Beyond Explicit Refusals: Soft-Failure Attacks on Retrieval-Augmented Generation
DEJA uses evolutionary optimization guided by an LLM-based Answer Utility Score to induce soft-failure responses in RAG systems, achieving over 79% soft attack success rate with under 15% hard failures and high stealt...
-
Navig-AI-tion: Navigation by Contextual AI and Spatial Audio
A system combining VLM landmark instructions with real-time corrective spatial audio reduces route deviations in a small user study compared to VLM-only and Google Maps audio baselines.
-
Integrating Domain-Specialized Language Models with AI Measurement Tools for Deterministic Atomic-Resolution Experimentation
Domain-specialized small language models enable deterministic atomic-resolution scanning probe microscopy control with 99.3% command accuracy, lower computational cost, and better domain performance than larger genera...
-
Dimension-Level Intent Fidelity Evaluation for Large Language Models: Evidence from Structured Prompt Ablation
Dimension-level evaluation reveals that 25-58% of LLM outputs with perfect holistic scores still show measurable intent deficits across languages and domains.
-
A Unified Benchmark for Evaluating Knowledge Graph Construction Methods and Graph Neural Networks
A dual-purpose benchmark supplies two text-derived knowledge graphs and one expert reference graph on the same biomedical corpus to jointly measure construction method quality and GNN robustness via semi-supervised no...
-
Using Large Language Models as a Co-Author in Undergraduate Quantum Group Research
An AI model produced a new formula for a central element of U_q(so_12) at the quality level of advanced undergraduate research, along with faster computation via SageMath, prompting changes in mentorship practices.
-
Hallucinations Undermine Trust; Metacognition is a Way Forward
LLMs need metacognition to align expressed uncertainty with their actual knowledge boundaries, moving beyond knowledge expansion to reduce confident errors.
-
Adaptive Test-Time Compute Allocation with Evolving In-Context Demonstrations
An adaptive test-time framework uses a warm-up phase on the test set to build evolving in-context examples, then concentrates compute on unresolved queries to outperform static baselines on math, coding, and reasoning...
-
FineSteer: A Unified Framework for Fine-Grained Inference-Time Steering in Large Language Models
FineSteer decomposes inference-time steering into Subspace-guided Conditional Steering and Mixture-of-Steering-Experts to deliver stronger control over LLM behaviors with less utility loss than prior methods.
-
Scaling Synthetic Data Creation with 1,000,000,000 Personas
A curated set of one billion personas enables scalable, diverse synthetic data generation for LLM training across reasoning, instructions, knowledge, NPCs, and tools.
-
AgentReputation: A Decentralized Agentic AI Reputation Framework
AgentReputation proposes separating AI agent task execution, reputation management, and secure record-keeping into distinct layers, with context-specific reputation cards and a risk-based policy engine to handle verif...
-
Beyond Accuracy: LLM Variability in Evidence Screening for Software Engineering SLRs
LLMs exhibit substantial heterogeneity and non-determinism in SLR evidence screening, abstracts are decisive for performance, and they show no reliable superiority over classical classifiers on two real SLRs.
-
A pragmatic approach to regulating AI agents
AI agents require distinct regulation as AI systems under the EU AI Act with orchestration-layer oversight and a risk-based traffic light authorization system in contract law to preserve human accountability.
-
V2E: Validating Smart Contract Vulnerabilities through Profit-driven Exploit Generation and Execution
V2E automates PoC generation, triggerability and profitability validation, and iterative refinement using LLMs to confirm exploitable smart contract vulnerabilities, outperforming baselines on 264 labeled contracts.
-
Learning Project-wise Subsequent Code Edits via Interleaving Neural-based Induction and Tool-based Deduction
TRACE improves project-wise subsequent code editing by interleaving neural-based induction for semantic edits and tool-based deduction for syntactic edits.
-
Compiled AI: Deterministic Code Generation for LLM-Based Workflow Automation
Compiled AI generates deterministic code artifacts from LLMs in a one-time compilation step, enabling reliable workflow execution with zero runtime tokens after break-even.
-
Designing for Error Recovery in Human-Robot Interaction
Position paper calls for designing robotic AI to detect and recover from its own errors in continuous interactions, using nuclear glovebox operations as an illustrative case.
Reference graph
Works this paper leans on
-
[1]
Characterizing Attribution and Fluency Tradeoffs for Retrieval-Augmented Large Language Models
Renat Aksitov, Chung-Ching Chang, David Reitter, Siamak Shakeri, and Yunhsuan Sung. “Characterizing Attribution and Fluency Tradeoffs for Retrieval-Augmented Large Language Models”. In: arXiv preprint 2302.05578 (2023)
-
[2]
Computational Complexity - A Modern Approach
Sanjeev Arora and Boaz Barak. Computational Complexity - A Modern Approach. Cambridge University Press, 2009
work page 2009
-
[3]
On the prediction of General Recursive Functions
J. Bārzdiņš and R. Freivalds. “On the prediction of General Recursive Functions”. In: Soviet Mathematics Doklady (Dokl. Akad. Nauk SSSR) 13 (1972), pp. 1224–1228
work page 1972
-
[4]
Learning families of algebraic structures from informant
Nikolay Bazhenov, Ekaterina B. Fokina, and Luca San Mauro. “Learning families of algebraic structures from informant”. In: Information and Computation 275 (2020), p. 104590
work page 2020
-
[5]
Airline held liable for its chatbot giving passenger bad advice - what this means for travellers
BBC. Airline held liable for its chatbot giving passenger bad advice - what this means for travellers. Accessed: 2024-02-22. 2024. URL: https://web.archive.org/web/20240224015400/https://www.bbc.com/travel/article/20240222-air-canada-chatbot-misinformation-what-travellers-should-know
-
[6]
Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks
Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. “Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks”. In: Advances in Neural Information Processing Systems. Vol. 28. 2015
work page 2015
-
[7]
Language Models are Few-Shot Learners
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Lit...
work page 2020
-
[8]
Ein beitrag zur mannigfaltigkeitslehre
Georg Cantor. “Ein beitrag zur mannigfaltigkeitslehre”. In: Journal für die reine und angewandte Mathematik (Crelles Journal) 1878.84 (1878), pp. 242–258
-
[9]
Ueber eine elementare Frage der Mannigfaltigkeitslehre
Georg Cantor. “Ueber eine elementare Frage der Mannigfaltigkeitslehre”. In: Jahresbericht der Deutschen Mathematiker-Vereinigung 1 (1890/91), pp. 72–78
-
[10]
Softmax Bottleneck Makes Language Models Unable to Represent Multi-mode Word Distributions
Haw-Shiuan Chang and Andrew McCallum. “Softmax Bottleneck Makes Language Models Unable to Represent Multi-mode Word Distributions”. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). May 2022, pp. 8048–8073
work page 2022
-
[11]
Benchmarking Large Language Models in Retrieval-Augmented Generation
Jiawei Chen, Hongyu Lin, Xianpei Han, and Le Sun. “Benchmarking Large Language Models in Retrieval-Augmented Generation”. In: AAAI. AAAI Press, 2024, pp. 17754–17762
work page 2024
-
[12]
Overcoming a Theoretical Limitation of Self-Attention
David Chiang and Peter Cholak. “Overcoming a Theoretical Limitation of Self-Attention”. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). May 2022, pp. 7654–7664
work page 2022
-
[13]
The Complexity of Theorem-Proving Procedures
Stephen A. Cook. “The Complexity of Theorem-Proving Procedures”. In: Proceedings of the Third Annual ACM Symposium on Theory of Computing. 1971, pp. 151–158
work page 1971
-
[14]
The Pitfalls of Defining Hallucination
Kees van Deemter. “The Pitfalls of Defining Hallucination”. In: arXiv preprint 2401.07897 (2024)
-
[15]
Chain-of-Verification Reduces Hallucination in Large Language Models
Shehzaad Dhuliawala, Mojtaba Komeili, Jing Xu, Roberta Raileanu, Xian Li, Asli Celikyilmaz, and Jason Weston. “Chain-of-Verification Reduces Hallucination in Large Language Models”. In: arXiv preprint 2309.11495 (2023)
-
[16]
Neural Path Hunter: Reducing Hallucination in Dialogue Systems via Path Grounding
Nouha Dziri, Andrea Madotto, Osmar Zaiane, and Avishek Joey Bose. “Neural Path Hunter: Reducing Hallucination in Dialogue Systems via Path Grounding”. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Nov. 2021, pp. 2197–2214
work page 2021
-
[17]
From Kant to Hilbert: A Source Book in the Foundations of Mathematics
William Ewald. From Kant to Hilbert: A Source Book in the Foundations of Mathematics. Oxford University Press, Apr. 2005. ISBN: 9780198505358
work page 2005
-
[18]
Super-exponential complexity of Presburger arithmetic
Michael J Fischer and Michael O Rabin. “Super-exponential complexity of Presburger arithmetic”. In: 1974
work page 1974
-
[19]
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. “The Pile: An 800GB Dataset of Diverse Text for Language Modeling”. In: arXiv preprint 2101.00027 (2020)
work page 2020
-
[20]
Language Identification in the Limit
E. Mark Gold. “Language Identification in the Limit”. In: Information and Control 10.5 (1967), pp. 447–474
work page 1967
-
[21]
Assessing The Factual Accuracy of Generated Text
Ben Goodrich, Vinay Rao, Peter J. Liu, and Mohammad Saleh. “Assessing The Factual Accuracy of Generated Text”. In: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2019, pp. 166–175
work page 2019
-
[22]
Nuno M. Guerreiro, Elena Voita, and André Martins. “Looking for a Needle in a Haystack: A Comprehensive Study of Hallucinations in Neural Machine Translation”. In: Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics. May 2023, pp. 1059–1075
work page 2023
-
[23]
Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Harkirat Singh Behl, Xin Wang, Sébastien Bubeck, Ronen Eldan, Adam Tauman Kalai, Yin Tat Lee, and Yuanzhi Li. “Textbooks Are All You Need”. In: arXiv preprint 2306...
work page 2023
-
[24]
Theoretical Limitations of Self-Attention in Neural Sequence Models
Michael Hahn. “Theoretical Limitations of Self-Attention in Neural Sequence Models”. In: Transactions of the Association for Computational Linguistics 8 (2020), pp. 156–171
work page 2020
-
[25]
Pengfei Hong, Deepanway Ghosal, Navonil Majumder, Somak Aditya, Rada Mihalcea, and Soujanya Poria. “Stuck in the Quicksand of Numeracy, Far from AGI Summit: Evaluating LLMs’ Mathematical Competency through Ontology-guided Perturbations”. In: arXiv 2401.09395 (2024)
-
[26]
Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, and Ting Liu. “A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions”. In: arXiv preprint 2311.05232 (2023)
work page 2023
-
[27]
The Factual Inconsistency Problem in Abstractive Text Summarization: A Survey
Yichong Huang, Xiachong Feng, Xiaocheng Feng, and Bing Qin. “The Factual Inconsistency Problem in Abstractive Text Summarization: A Survey”. In: arXiv preprint 2104.14839 (2023)
-
[30]
Llama-3-70b-Instruct Model on HuggingFace
HuggingFace. Llama-3-70b-Instruct Model on HuggingFace. Accessed: 2023-12-15. URL: https://web.archive.org/web/20240518083325/https://huggingface.co/unsloth/llama-3-70b-Instruct-bnb-4bit
-
[31]
Sanjay Jain, Daniel N. Osherson, James S. Royer, and Arun Sharma. Systems That Learn: An Introduction to Learning Theory. The MIT Press, Feb. 1999. ISBN: 9780262276252
work page 1999
-
[32]
Survey of Hallucination in Natural Language Generation
Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. “Survey of Hallucination in Natural Language Generation”. In: ACM Computing Surveys 55.12 (Mar. 2023)
work page 2023
-
[33]
Calibrated Language Models Must Hallucinate
Adam Tauman Kalai and Santosh S. Vempala. “Calibrated Language Models Must Hallucinate”. In: Proceedings of the 56th Annual ACM Symposium on Theory of Computing (STOC). 2024
work page 2024
-
[34]
Large Language Models Struggle to Learn Long-Tail Knowledge
Nikhil Kandpal, Haikang Deng, Adam Roberts, Eric Wallace, and Colin Raffel. “Large Language Models Struggle to Learn Long-Tail Knowledge”. In: Proceedings of the 40th International Conference on Machine Learning. Vol. 202. July 2023, pp. 15696–15707
work page 2023
-
[35]
Deduplicating Training Data Makes Language Models Better
Katherine Lee, Daphne Ippolito, Andrew Nystrom, Chiyuan Zhang, Douglas Eck, Chris Callison-Burch, and Nicholas Carlini. “Deduplicating Training Data Makes Language Models Better”. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). May 2022, pp. 8424–8445
work page 2022
-
[36]
Factuality Enhanced Language Models for Open-Ended Text Generation
Nayeon Lee, Wei Ping, Peng Xu, Mostofa Patwary, Pascale N Fung, Mohammad Shoeybi, and Bryan Catanzaro. “Factuality Enhanced Language Models for Open-Ended Text Generation”. In: Advances in Neural Information Processing Systems. Vol. 35. 2022, pp. 34586–34599
work page 2022
-
[37]
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
Patrick S. H. Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks”. In: Proceedings of the 34th International Conference on Neural Information Processing Systems. 2020
work page 2020
-
[38]
Large Language Models with Controllable Working Memory
Daliang Li, Ankit Singh Rawat, Manzil Zaheer, Xin Wang, Michal Lukasik, Andreas Veit, Felix Yu, and Sanjiv Kumar. “Large Language Models with Controllable Working Memory”. In: Findings of the Association for Computational Linguistics: ACL 2023. July 2023, pp. 1774–1793
work page 2023
-
[39]
HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models
Junyi Li, Xiaoxue Cheng, Xin Zhao, Jian-Yun Nie, and Ji-Rong Wen. “HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models”. In: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2023, pp. 6449–6464
work page 2023
-
[40]
TruthfulQA: Measuring How Models Mimic Human Falsehoods
Stephanie Lin, Jacob Hilton, and Owain Evans. “TruthfulQA: Measuring How Models Mimic Human Falsehoods”. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). May 2022, pp. 3214–3252
work page 2022
-
[41]
Learning Quickly When Irrelevant Attributes Abound: A New Linear-threshold Algorithm
Nick Littlestone. “Learning Quickly When Irrelevant Attributes Abound: A New Linear-threshold Algorithm”. In: Mach. Learn. 2.4 (1987), pp. 285–318
work page 1987
-
[42]
Exposing Attention Glitches with Flip-Flop Language Modeling
Bingbin Liu, Jordan T. Ash, Surbhi Goel, Akshay Krishnamurthy, and Cyril Zhang. “Exposing Attention Glitches with Flip-Flop Language Modeling”. In: Advances in Neural Information Processing Systems. Vol. 36. 2023
work page 2023
-
[43]
Hallucination Detection and Hallucination Mitigation: An Investigation
Junliang Luo, Tianyu Li, Di Wu, Michael Jenkin, Steve Liu, and Gregory Dudek. “Hallucination Detection and Hallucination Mitigation: An Investigation”. In: arXiv preprint 2401.08358 (2024)
-
[44]
Alex Mallen, Akari Asai, Victor Zhong, Rajarshi Das, Daniel Khashabi, and Hannaneh Hajishirzi. “When Not to Trust Language Models: Investigating Effectiveness of Parametric and Non-Parametric Memories”. In: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). July 2023, pp. 9802–9822
work page 2023
-
[45]
Knowledge Injection to Counter Large Language Model (LLM) Hallucination
Ariana Martino, Michael Iannelli, and Coleen Truong. “Knowledge Injection to Counter Large Language Model (LLM) Hallucination”. In: The Semantic Web: ESWC 2023 Satellite Events. 2023, pp. 182–185
work page 2023
-
[46]
Introducing Meta Llama 3: The most capable openly available LLM to date
Meta Platforms, Inc. Introducing Meta Llama 3: The most capable openly available LLM to date. Accessed: 2024-04-30. URL: https://web.archive.org/web/20231207183448/https://huggingface.co/blog/llama2
-
[47]
Anshuman Mishra, Dhruvesh Patel, Aparna Vijayakumar, Xiang Lorraine Li, Pavan Kapanipathi, and Kartik Talamadupula. “Looking Beyond Sentence-Level Natural Language Inference for Question Answering and Text Summarization”. In: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language ...
work page 2021
-
[48]
Nationality Bias in Text Generation
Pranav Narayanan Venkit, Sanjana Gautam, Ruchi Panchanadikar, Ting-Hao Huang, and Shomir Wilson. “Nationality Bias in Text Generation”. In: Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics. May 2023, pp. 116–122
work page 2023
-
[49]
A Simple Recipe towards Reducing Hallucination in Neural Surface Realisation
Feng Nie, Jin-Ge Yao, Jinpeng Wang, Rong Pan, and Chin-Yew Lin. “A Simple Recipe towards Reducing Hallucination in Neural Surface Realisation”. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. July 2019, pp. 2673–2679
work page 2019
-
[50]
Entity Cloze By Date: What LMs Know About Unseen Entities
Yasumasa Onoe, Michael J. Q. Zhang, Eunsol Choi, and Greg Durrett. “Entity Cloze By Date: What LMs Know About Unseen Entities”. In: NAACL-HLT (Findings). Association for Computational Linguistics, 2022, pp. 693–702
work page 2022
-
[51]
OpenAI. “ChatGPT Release Notes”. In: (2023). Accessed: 2023-12-16. URL: https://web.archive.org/web/20231214021113/https://help.openai.com/en/articles/6825453-chatgpt-release-notes
-
[52]
OpenAI. “GPT-4 Technical Report”. In: arXiv preprint 2303.08774 (2023)
work page 2023
-
[53]
Liangming Pan, Michael Saxon, Wenda Xu, Deepak Nathani, Xinyi Wang, and William Yang Wang. “Automatically Correcting Large Language Models: Surveying the landscape of diverse self-correction strategies”. In: arXiv preprint 2308.03188 (2023)
-
[54]
Data and its (dis)contents: A survey of dataset development and use in machine learning research
Amandalynne Paullada, Inioluwa Deborah Raji, Emily M. Bender, Emily Denton, and Alex Hanna. “Data and its (dis)contents: A survey of dataset development and use in machine learning research”. In: Patterns 2.11 (2021), p. 100336
work page 2021
-
[55]
Baolin Peng, Michel Galley, Pengcheng He, Hao Cheng, Yujia Xie, Yu Hu, Qiuyuan Huang, Lars Liden, Zhou Yu, Weizhu Chen, and Jianfeng Gao. “Check Your Facts and Try Again: Improving Large Language Models with External Knowledge and Automated Feedback”. In: arXiv preprint 2302.12813 (2023)
-
[56]
Mojzesz Presburger. “Uber die vollstandigkeit eines gewissen systems der arithmetik ganzer zahlen, in welchen die addition als einzige operation hervortritt”. In: Comptes-Rendus du ler Congres des Mathematiciens des Pays Slavs. 1929
work page 1929
-
[57]
The Curious Case of Hallucinations in Neural Machine Translation
Vikas Raunak, Arul Menezes, and Marcin Junczys-Dowmunt. “The Curious Case of Hallucinations in Neural Machine Translation”. In: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. June 2021, pp. 1172–1183
[58]
Vipula Rawte, Swagata Chakraborty, Agnibh Pathak, Anubhav Sarkar, S. M Towhidul Islam Tonmoy, Aman Chadha, Amit P. Sheth, and Amitava Das. “The Troubling Emergence of Hallucination in Large Language Models – An Extensive Definition, Quantification, and Prescriptive Remediations”. In: arXiv preprint 2310.04988 (2023)
[59]
Traian Rebedea, Razvan Dinu, Makesh Narsimhan Sreedhar, Christopher Parisien, and Jonathan Cohen. “NeMo Guardrails: A Toolkit for Controllable and Safe LLM Applications with Programmable Rails”. In: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. Dec. 2023, pp. 431–445
[60]
Lei Shen, Haolan Zhan, Xin Shen, Hongshen Chen, Xiaofang Zhao, and Xiaodan Zhu. “Identifying Untrustworthy Samples: Data Filtering for Open-Domain Dialogues with Bayesian Optimization”. In: Proceedings of the 30th ACM International Conference on Information & Knowledge Management. 2021, pp. 1598–1608
[61]
Weijia Shi, Xiaochuang Han, Mike Lewis, Yulia Tsvetkov, Luke Zettlemoyer, and Scott Wen-tau Yih. “Trusting Your Evidence: Hallucinate Less with Context-aware Decoding”. In: arXiv preprint 2305.14739 (2023)
[62]
Weijia Shi, Sewon Min, Maria Lomeli, Chunting Zhou, Margaret Li, Rich James, Xi Victoria Lin, Noah A. Smith, Luke Zettlemoyer, Scott Yih, and Mike Lewis. “In-Context Pretraining: Language Modeling Beyond Document Boundaries”. In: arXiv preprint 2310.10638 (2023)
[63]
Kurt Shuster, Spencer Poff, Moya Chen, Douwe Kiela, and Jason Weston. “Retrieval Augmentation Reduces Hallucination in Conversation”. In: Findings of the Association for Computational Linguistics: EMNLP 2021. Nov. 2021, pp. 3784–3803
[64]
Jeff Speaks. “Theories of Meaning”. In: The Stanford Encyclopedia of Philosophy. Ed. by Edward N. Zalta. Spring 2021. Metaphysics Research Lab, Stanford University, 2021. URL: https://plato.stanford.edu/archives/spr2021/entries/meaning/
[65]
Ryan Stansifer. Presburger’s article on integer arithmetic: Remarks and translation. Tech. rep. Cornell University, 1984. URL: https://dl.acm.org/doi/book/10.5555/867696
[66]
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. “LLaMA: Open and Efficient Foundation Language Models”. In: arXiv preprint 2302.13971 (2023)
[67]
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Harts...
[68]
A. M. Turing. “On Computable Numbers, with an Application to the Entscheidungsproblem”. In: Proceedings of the London Mathematical Society s2-42.1 (1937), pp. 230–265
[69]
Miles Turpin, Julian Michael, Ethan Perez, and Samuel R. Bowman. “Language Models Don’t Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting”. In: NeurIPS. 2023
[70]
L. G. Valiant. “A Theory of the Learnable”. In: Communications of the ACM 27.11 (Nov. 1984), pp. 1134–1142
[71]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. “Attention is All You Need”. In: Proceedings of the 31st International Conference on Neural Information Processing Systems. 2017, pp. 6000–6010
[72]
Tu Vu, Mohit Iyyer, Xuezhi Wang, Noah Constant, Jerry Wei, Jason Wei, Chris Tar, Yun-Hsuan Sung, Denny Zhou, Quoc Le, and Thang Luong. “FreshLLMs: Refreshing Large Language Models with Search Engine Augmentation”. In: arXiv preprint 2310.03214 (2023)
[73]
Cunxiang Wang, Xiaoze Liu, Yuanhao Yue, Xiangru Tang, Tianhang Zhang, Cheng Jiayang, Yunzhi Yao, Wenyang Gao, Xuming Hu, Zehan Qi, Yidong Wang, Linyi Yang, Jindong Wang, Xing Xie, Zheng Zhang, and Yue Zhang. “Survey on Factuality in Large Language Models: Knowledge, Retrieval and Domain-Specificity”. In: arXiv preprint 2310.07521 (2023)
[74]
Peifeng Wang, Zhengyang Wang, Zheng Li, Yifan Gao, Bing Yin, and Xiang Ren. “SCOTT: Self-Consistent Chain-of-Thought Distillation”. In: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). July 2023, pp. 5546–5558
[75]
Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus. “Emergent Abilities of Large Language Models”. In: Transactions on Machine Learning Research (2022)
[76]
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc V. Le, and Denny Zhou. “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models”. In: Advances in Neural Information Processing Systems. Vol. 35. 2022, pp. 24824–24837
[77]
Wikipedia contributors. “Cantor’s diagonal argument — Wikipedia, The Free Encyclopedia”. In: (2023). [Online; accessed 6-December-2023]. URL: https://en.wikipedia.org/w/index.php?title=Cantor%27s_diagonal_argument&oldid=1173962712
[78]
Zhilin Yang, Zihang Dai, Ruslan Salakhutdinov, and William W. Cohen. “Breaking the Softmax Bottleneck: A High-Rank RNN Language Model”. In: International Conference on Learning Representations. 2018
[79]
Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik Narasimhan. “Tree of Thoughts: Deliberate Problem Solving with Large Language Models”. In: arXiv preprint 2305.10601 (2023)
[80]
Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, Enbo Zhao, Yu Zhang, Yulong Chen, Longyue Wang, Anh Tuan Luu, Wei Bi, Freda Shi, and Shuming Shi. “Siren’s Song in the AI Ocean: A Survey on Hallucination in Large Language Models”. In: arXiv preprint 2309.01219 (2023)
[81]
Ruochen Zhao, Xingxuan Li, Shafiq Joty, Chengwei Qin, and Lidong Bing. “Verify-and-Edit: A Knowledge-Enhanced Chain-of-Thought Framework”. In: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, July 2023, pp. 5823–5840
Appendix A Notation and Ter...
with modifications made to prompt the LLMs to answer directly instead of providing high-level descriptions or code snippets.
Base Prompt
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. ...