pith. machine review for the scientific record.

arxiv: 2401.11817 · v2 · submitted 2024-01-22 · 💻 cs.CL · cs.AI · cs.LG

Recognition: no theorem link

Hallucination is Inevitable: An Innate Limitation of Large Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 20:34 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords hallucination · large language models · computable functions · learning theory · impossibility results · AI limitations

The pith

LLMs cannot learn all computable functions and will therefore inevitably hallucinate when used as general problem solvers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that hallucination in large language models is not merely a practical shortcoming but a provable inevitability. It defines a formal computable world in which an LLM is a computable function and hallucination occurs whenever the LLM's output differs from that of the computable ground-truth function. Applying results from learning theory, it proves that no LLM can learn every computable function, so when LLMs are applied to arbitrary problems, some inconsistencies with the ground truth are unavoidable. Because this formal setting is a simplified part of the real world, the same limitation applies to actual LLMs.

Core claim

In a formal world, hallucination is defined as any inconsistency between a computable LLM and a computable ground-truth function. Results from learning theory show that LLMs cannot learn all computable functions and will therefore inevitably hallucinate if used as general problem solvers. Since the formal world is a part of the real world, hallucinations are also inevitable for real-world LLMs. For LLMs constrained by provable time complexity, the paper describes hallucination-prone tasks and validates its claims empirically.
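
A compact restatement of the formal setup, in notation of our choosing (the symbols h, f, s, and the enumeration h_0, h_1, … are editorial labels, not necessarily the paper's):

    % Formal world: LLMs and ground truths are computable functions on strings.
    % An LLM h hallucinates with respect to a ground truth f iff they disagree
    % on at least one input:
    \[ \exists\, s \in \Sigma^{*} \;:\; h(s) \neq f(s) \]
    % Inevitability in diagonal form: for any computable enumeration
    % h_0, h_1, \dots of candidate LLMs, a ground truth f can be defined
    % pointwise so that
    \[ f(s_i) \neq h_i(s_i) \quad \text{for every } i, \]
    % making each h_i hallucinate on its own diagonal input s_i.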

What carries the argument

The formal computable world in which hallucination is an inconsistency between the LLM's computable function and the ground truth function, together with learning theory results that no single computable function can approximate all others.
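
The diagonal step can be run directly at toy scale; the three-element enumeration below is a stand-in for the countable enumeration of all computable LLMs, and every name in it is an editorial invention rather than the paper's construction:

    # Diagonalization sketch: given any enumeration of candidate "LLMs"
    # (toy computable functions from str to str), build a ground truth
    # that every one of them gets wrong on at least one input.

    def make_llm_enumeration():
        """A toy stand-in for an enumeration of all computable LLMs."""
        return [
            lambda s: s,               # echo
            lambda s: s.upper(),       # uppercase
            lambda s: str(len(s)),     # input length
        ]

    def diagonal_ground_truth(llms):
        """A computable f that disagrees with llms[i] on the input str(i)."""
        def f(s: str) -> str:
            i = int(s)
            return llms[i](s) + "#"    # any answer other than LLM i's output
        return f

    llms = make_llm_enumeration()
    f = diagonal_ground_truth(llms)
    for i, h in enumerate(llms):
        assert h(str(i)) != f(str(i))  # every candidate errs on its diagonal input
    print("each toy LLM disagrees with the diagonal ground truth")

No finite check establishes the theorem, of course; the point is only that the diagonal recipe mechanically defeats any fixed, enumerable family of solvers.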

If this is right

  • Any LLM used for general problem solving will produce hallucinations on some inputs.
  • Hallucinations cannot be completely eliminated through any finite training process.
  • Tasks requiring computation beyond the LLM's time complexity limits are especially prone to hallucination (a concrete instance is sketched after this list).
  • Existing mitigation methods can reduce but not remove the possibility of hallucination.
  • Safe deployment of LLMs requires acknowledging this limitation rather than assuming it can be trained away.
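
One concrete instance of the third point, with the task chosen by us in the spirit of the paper's string-enumeration experiments: listing all binary strings of length n is trivially computable, yet the correct answer has 2^n entries, so any model whose output is polynomially bounded in the prompt length must omit or invent strings once n is large enough.

    # A computable but hallucination-prone task for any time-bounded model:
    # "list all binary strings of length n". The exact answer doubles in size
    # with each increment of n, while the prompt grows only as len(str(n)).

    from itertools import product

    def ground_truth(n: int) -> list[str]:
        """Exact answer: all 2**n binary strings of length n."""
        return ["".join(bits) for bits in product("01", repeat=n)]

    def is_correct(answer: list[str], n: int) -> bool:
        """Efficient verifier: exactly the 2**n distinct valid strings."""
        return (len(set(answer)) == 2 ** n
                and all(len(s) == n and set(s) <= {"0", "1"} for s in answer))

    print(len(ground_truth(4)))   # 16; at n = 80 the answer exceeds 10**24 strings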

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • If the result holds, LLMs are best suited for tasks within a restricted class of functions rather than open-ended use.
  • Hybrid systems that pair LLMs with symbolic verifiers or exact algorithms could address the gaps left by this limitation.
  • Research into the precise boundary of learnable functions for current architectures might identify safer application domains.
  • Real-world complexity may make the formal bound even stricter, suggesting hallucinations occur more often than the bare existence result requires.

Load-bearing premise

The formal computable world is representative enough of the real world for the impossibility result to apply to practical LLMs.

What would settle it

Finding or constructing an LLM that correctly computes every function in a broad class of computable functions without any hallucinations would disprove the inevitability claim.
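
Operationally, a falsification attempt would look like the hedged harness below; the task class and the llm callable are placeholders of ours, and since no finite probe can certify correctness on all computable functions, only the failure direction is decisive:

    # Sketch of a falsification harness: probe a candidate LLM against exact
    # computable references. A non-empty result is a confirmed hallucination;
    # an empty result proves nothing about the full class.

    def probe(llm, tasks, inputs):
        """tasks maps a name to an exact computable reference function."""
        failures = []
        for name, ref in tasks.items():
            for x in inputs:
                if llm(name, x) != ref(x):
                    failures.append((name, x))
        return failures

    # Placeholder task class: exact string functions with known answers.
    tasks = {
        "reverse": lambda s: s[::-1],
        "parity":  lambda s: str(s.count("1") % 2),
    }
    mock_llm = lambda name, x: tasks[name](x)           # stand-in "LLM"
    print(probe(mock_llm, tasks, ["0110", "111", ""]))  # [] by construction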

read the original abstract

Hallucination has been widely recognized to be a significant drawback for large language models (LLMs). There have been many works that attempt to reduce the extent of hallucination. These efforts have mostly been empirical so far, which cannot answer the fundamental question whether it can be completely eliminated. In this paper, we formalize the problem and show that it is impossible to eliminate hallucination in LLMs. Specifically, we define a formal world where hallucination is defined as inconsistencies between a computable LLM and a computable ground truth function. By employing results from learning theory, we show that LLMs cannot learn all the computable functions and will therefore inevitably hallucinate if used as general problem solvers. Since the formal world is a part of the real world which is much more complicated, hallucinations are also inevitable for real world LLMs. Furthermore, for real world LLMs constrained by provable time complexity, we describe the hallucination-prone tasks and empirically validate our claims. Finally, using the formal world framework, we discuss the possible mechanisms and efficacies of existing hallucination mitigators as well as the practical implications on the safe deployment of LLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper formalizes hallucination as inconsistency between a computable LLM and a computable ground-truth function in a restricted formal world. It invokes learning-theory results to prove that no fixed computable LLM can match every computable function, hence any LLM used as a general solver must hallucinate on some inputs. The authors then argue that because the formal world is a subset of the real world, the same inevitability holds for practical LLMs; they further characterize hallucination-prone tasks under provable time bounds, supply empirical checks, and evaluate existing mitigation techniques.

Significance. If the formal-to-real bridge can be made rigorous, the result would supply a clean theoretical explanation for the persistence of hallucinations and would constrain expectations for general-purpose LLM deployment. The work correctly applies standard no-free-lunch style arguments to the LLM setting and supplies an empirical component, both of which are strengths.

major comments (2)
  1. [§4] Extension to real-world LLMs: The sentence 'Since the formal world is a part of the real world which is much more complicated, hallucinations are also inevitable for real world LLMs' is load-bearing for the central claim yet supplies no measure over task distributions. The learning-theoretic impossibility shows existence of failing functions; it does not show that the functions arising in practical NLP tasks lie outside the approximable subset.
  2. [§5] Time-constrained case and empirical validation: The characterization of hallucination-prone tasks under provable time complexity is stated without an explicit reduction showing that the chosen tasks correspond to functions that are unlearnable by any polynomial-time LLM. The empirical results therefore test a weaker claim than the theoretical one.
minor comments (2)
  1. [Abstract] The phrase 'provable time complexity' is introduced without definition or forward reference; a brief parenthetical or citation to the relevant section would improve readability.
  2. [§2] Notation: the paper should clarify whether 'computable LLM' means a Turing machine with finite description or a fixed neural architecture with fixed weights; the distinction affects which learning-theoretic theorems apply directly.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major comment below and indicate the revisions we will make to strengthen the manuscript.

read point-by-point responses
  1. Referee: [§4] Extension to real-world LLMs: The sentence 'Since the formal world is a part of the real world which is much more complicated, hallucinations are also inevitable for real world LLMs' is load-bearing for the central claim yet supplies no measure over task distributions. The learning-theoretic impossibility shows existence of failing functions; it does not show that the functions arising in practical NLP tasks lie outside the approximable subset.

    Authors: We thank the referee for this observation. The core result establishes that no fixed computable LLM can learn every computable function, so any LLM deployed as a general solver must hallucinate on some inputs. Because the formal world is a subset of the real world, this limitation extends to practical LLMs when they are used for broad problem-solving. We agree that the manuscript does not supply a probability measure over real-world task distributions demonstrating that typical NLP tasks fall outside the learnable subset. We will revise §4 to clarify that the inevitability claim applies specifically to general-purpose use rather than asserting that every practical task is unlearnable, and we will add a short discussion of how the existence result constrains expectations for universal solvers. revision: partial

  2. Referee: [§5] Time-constrained case and empirical validation: The characterization of hallucination-prone tasks under provable time complexity is stated without an explicit reduction showing that the chosen tasks correspond to functions that are unlearnable by any polynomial-time LLM. The empirical results therefore test a weaker claim than the theoretical one.

    Authors: We appreciate the referee noting the missing link. In §5 the tasks are selected because they are known to require super-polynomial time under standard complexity assumptions. While we did not include an explicit reduction connecting these tasks to the learning-theoretic unlearnable functions, the section is intended to illustrate the practical consequences of time bounds. The empirical experiments show performance degradation consistent with the theoretical expectations. We will revise the section with a clarifying paragraph that explicitly relates the chosen tasks to the time-bounded unlearnability result and will note that the empirical component is illustrative and complementary to the theory. revision: partial

Circularity Check

0 steps flagged

No circularity; derivation applies external learning-theory results to a self-contained formal model

full rationale

The paper defines hallucination explicitly as inconsistency between a computable LLM and a computable ground-truth function, then invokes standard results from learning theory (no-free-lunch style) to conclude that no fixed LLM can match every computable function. This step is a direct logical consequence of the chosen definition plus external theorems, not a reduction by construction or self-reference. The extension to real-world LLMs is stated as an explicit modeling assumption ('the formal world is a part of the real world'), not smuggled in via self-citation or ansatz. No equations or parameters are fitted and then relabeled as predictions, and no uniqueness theorem is imported from the authors' prior work. The central claim therefore remains independent of its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that both the LLM and ground truth can be modeled as computable functions and that learning theory applies directly to this setting.

axioms (2)
  • domain assumption: LLMs and ground truth are both computable functions
    Stated in the formalization of the problem in the abstract.
  • standard math: Results from learning theory apply to show no LLM can learn all computable functions
    Invoked to conclude inevitability of hallucination.

pith-pipeline@v0.9.0 · 5509 in / 1149 out tokens · 34972 ms · 2026-05-15T20:34:45.866926+00:00 · methodology

discussion (0)


Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. SciIntegrity-Bench: A Benchmark for Evaluating Academic Integrity in AI Scientist Systems

    cs.AI 2026-05 unverdicted novelty 8.0

    SciIntegrity-Bench shows state-of-the-art LLMs violate academic integrity in 34.2% of dilemmatic scenarios, primarily by fabricating data rather than refusing impossible tasks.

  2. EquiMem: Calibrating Shared Memory in Multi-Agent Debate via Game-Theoretic Equilibrium

    cs.AI 2026-05 unverdicted novelty 7.0

    EquiMem calibrates shared memory in multi-agent debate by computing a game-theoretic equilibrium from agent queries and paths, outperforming heuristics and LLM validators across benchmarks while remaining robust to ad...

  3. Green Shielding: A User-Centric Approach Towards Trustworthy AI

    cs.CL 2026-04 unverdicted novelty 7.0

    Green Shielding introduces CUE criteria and the HCM-Dx benchmark to demonstrate that routine prompt variations systematically alter LLM diagnostic behavior along clinically relevant dimensions, producing Pareto-like t...

  4. Beyond Explicit Refusals: Soft-Failure Attacks on Retrieval-Augmented Generation

    cs.CR 2026-04 unverdicted novelty 7.0

    DEJA uses evolutionary optimization guided by an LLM-based Answer Utility Score to induce soft-failure responses in RAG systems, achieving over 79% soft attack success rate with under 15% hard failures and high stealt...

  5. Navig-AI-tion: Navigation by Contextual AI and Spatial Audio

    cs.HC 2026-03 unverdicted novelty 7.0

    A system combining VLM landmark instructions with real-time corrective spatial audio reduces route deviations in a small user study compared to VLM-only and Google Maps audio baselines.

  6. Integrating Domain-Specialized Language Models with AI Measurement Tools for Deterministic Atomic-Resolution Experimentation

    physics.app-ph 2026-02 unverdicted novelty 7.0

    Domain-specialized small language models enable deterministic atomic-resolution scanning probe microscopy control with 99.3% command accuracy, lower computational cost, and better domain performance than larger genera...

  7. Dimension-Level Intent Fidelity Evaluation for Large Language Models: Evidence from Structured Prompt Ablation

    cs.CL 2026-05 unverdicted novelty 6.0

    Dimension-level evaluation reveals that 25-58% of LLM outputs with perfect holistic scores still show measurable intent deficits across languages and domains.

  8. A Unified Benchmark for Evaluating Knowledge Graph Construction Methods and Graph Neural Networks

    cs.LG 2026-05 unverdicted novelty 6.0

    A dual-purpose benchmark supplies two text-derived knowledge graphs and one expert reference graph on the same biomedical corpus to jointly measure construction method quality and GNN robustness via semi-supervised no...

  9. Using Large Language Models as a Co-Author in Undergraduate Quantum Group Research

    math.HO 2026-05 unverdicted novelty 6.0

    An AI model produced a new formula for a central element of U_q(so_12) at the quality level of advanced undergraduate research, along with faster computation via SageMath, prompting changes in mentorship practices.

  10. Hallucinations Undermine Trust; Metacognition is a Way Forward

    cs.CL 2026-05 unverdicted novelty 6.0

    LLMs need metacognition to align expressed uncertainty with their actual knowledge boundaries, moving beyond knowledge expansion to reduce confident errors.

  11. Adaptive Test-Time Compute Allocation with Evolving In-Context Demonstrations

    cs.AI 2026-04 unverdicted novelty 6.0

    An adaptive test-time framework uses a warm-up phase on the test set to build evolving in-context examples, then concentrates compute on unresolved queries to outperform static baselines on math, coding, and reasoning...

  12. FineSteer: A Unified Framework for Fine-Grained Inference-Time Steering in Large Language Models

    cs.LG 2026-04 unverdicted novelty 6.0

    FineSteer decomposes inference-time steering into Subspace-guided Conditional Steering and Mixture-of-Steering-Experts to deliver stronger control over LLM behaviors with less utility loss than prior methods.

  13. Scaling Synthetic Data Creation with 1,000,000,000 Personas

    cs.CL 2024-06 unverdicted novelty 6.0

    A curated set of one billion personas enables scalable, diverse synthetic data generation for LLM training across reasoning, instructions, knowledge, NPCs, and tools.

  14. AgentReputation: A Decentralized Agentic AI Reputation Framework

    cs.AI 2026-04 unverdicted novelty 5.0

    AgentReputation proposes separating AI agent task execution, reputation management, and secure record-keeping into distinct layers, with context-specific reputation cards and a risk-based policy engine to handle verif...

  15. Beyond Accuracy: LLM Variability in Evidence Screening for Software Engineering SLRs

    cs.SE 2026-04 unverdicted novelty 5.0

    LLMs exhibit substantial heterogeneity and non-determinism in SLR evidence screening, abstracts are decisive for performance, and they show no reliable superiority over classical classifiers on two real SLRs.

  16. A pragmatic approach to regulating AI agents

    cs.CY 2026-04 unverdicted novelty 5.0

    AI agents require distinct regulation as AI systems under the EU AI Act with orchestration-layer oversight and a risk-based traffic light authorization system in contract law to preserve human accountability.

  17. V2E: Validating Smart Contract Vulnerabilities through Profit-driven Exploit Generation and Execution

    cs.SE 2026-04 unverdicted novelty 5.0

    V2E automates PoC generation, triggerability and profitability validation, and iterative refinement using LLMs to confirm exploitable smart contract vulnerabilities, outperforming baselines on 264 labeled contracts.

  18. Learning Project-wise Subsequent Code Edits via Interleaving Neural-based Induction and Tool-based Deduction

    cs.SE 2026-04 unverdicted novelty 5.0

    TRACE improves project-wise subsequent code editing by interleaving neural-based induction for semantic edits and tool-based deduction for syntactic edits.

  19. Compiled AI: Deterministic Code Generation for LLM-Based Workflow Automation

    cs.SE 2026-04 unverdicted novelty 4.0

    Compiled AI generates deterministic code artifacts from LLMs in a one-time compilation step, enabling reliable workflow execution with zero runtime tokens after break-even.

  20. Designing for Error Recovery in Human-Robot Interaction

    cs.RO 2026-04 unverdicted novelty 3.0

    Position paper calls for designing robotic AI to detect and recover from its own errors in continuous interactions, using nuclear glovebox operations as an illustrative case.

Reference graph

Works this paper leans on

89 extracted references · 89 canonical work pages · cited by 20 Pith papers · 9 internal anchors

  1. [1]

    Characterizing Attribution and Fluency Tradeoffs for Retrieval-Augmented Large Language Models

    Renat Aksitov, Chung-Ching Chang, David Reitter, Siamak Shakeri, and Yunhsuan Sung. “Characterizing Attribution and Fluency Tradeoffs for Retrieval-Augmented Large Language Models”. In: arXiv preprint 2302.05578 (2023)

  2. [2]

    Computational Complexity - A Modern Approach

    Sanjeev Arora and Boaz Barak. Computational Complexity - A Modern Approach. Cambridge University Press, 2009

  3. [3]

    On the prediction of General Recursive Functions

    J. Bārzdiņš and R. Freivalds. “On the prediction of General Recursive Functions”. In: Soviet Mathematics Doklady (Dokl. Akad. Nauk SSSR) 13 (1972), pp. 1224–1228

  4. [4]

    Learning families of algebraic structures from informant

    Nikolay Bazhenov, Ekaterina B. Fokina, and Luca San Mauro. “Learning families of algebraic structures from informant”. In: Information and Computation 275 (2020), p. 104590

  5. [5]

    Airline held liable for its chatbot giving passenger bad advice - what this means for travellers

    BBC. Airline held liable for its chatbot giving passenger bad advice - what this means for travellers. Accessed: 2024-02-22. 2024. URL: https://web.archive.org/web/20240224015400/https://www.bbc.com/travel/article/20240222-air-canada-chatbot-misinformation-what-travellers-should-know

  6. [6]

    Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks

    Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. “Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks”. In: Advances in Neural Information Processing Systems. Vol. 28. 2015

  7. [7]

    Language Models are Few-Shot Learners

    Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Lit...

  8. [8]

    Ein beitrag zur mannigfaltigkeitslehre

    Georg Cantor. “Ein beitrag zur mannigfaltigkeitslehre”. In: Journal für die reine und angewandte Mathematik (Crelles Journal) 1878.84 (1878), pp. 242–258

  9. [9]

    Ueber eine elementare Frage der Mannigfaltigkeitslehre

    Georg Cantor. “Ueber eine elementare Frage der Mannigfaltigkeitslehre.” In: Jahresbericht der Deutschen Mathematiker-Vereinigung 1 (1890/91), pp. 72–78

  10. [10]

    Softmax Bottleneck Makes Language Models Unable to Represent Multi-mode Word Distributions

    Haw-Shiuan Chang and Andrew McCallum. “Softmax Bottleneck Makes Language Models Unable to Represent Multi-mode Word Distributions”. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). May 2022, pp. 8048–8073

  11. [11]

    Benchmarking Large Language Models in Retrieval-Augmented Generation

    Jiawei Chen, Hongyu Lin, Xianpei Han, and Le Sun. “Benchmarking Large Language Models in Retrieval-Augmented Generation”. In: AAAI. AAAI Press, 2024, pp. 17754–17762

  12. [12]

    Overcoming a Theoretical Limitation of Self-Attention

    David Chiang and Peter Cholak. “Overcoming a Theoretical Limitation of Self-Attention”. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). May 2022, pp. 7654–7664

  13. [13]

    The Complexity of Theorem-Proving Procedures

    Stephen A. Cook. “The Complexity of Theorem-Proving Procedures”. In: Proceedings of the Third Annual ACM Symposium on Theory of Computing. 1971, pp. 151–158

  14. [14]

    The Pitfalls of Defining Hallucination

    Kees van Deemter. “The Pitfalls of Defining Hallucination”. In: arXiv preprint 2401.07897 (2024)

  15. [15]

    Chain-of-Verification Reduces Hallucination in Large Language Models

    Shehzaad Dhuliawala, Mojtaba Komeili, Jing Xu, Roberta Raileanu, Xian Li, Asli Celikyilmaz, and Jason Weston. “Chain-of-Verification Reduces Hallucination in Large Language Models”. In: arXiv preprint 2309.11495 (2023)

  16. [16]

    Neural Path Hunter: Reducing Hallucination in Dialogue Systems via Path Grounding

    Nouha Dziri, Andrea Madotto, Osmar Zaiane, and Avishek Joey Bose. “Neural Path Hunter: Reducing Hallucination in Dialogue Systems via Path Grounding”. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Nov. 2021, pp. 2197–2214

  17. [17]

    From Kant to Hilbert: A Source Book in the Foundations of Mathematics

    William Ewald. From Kant to Hilbert: A Source Book in the Foundations of Mathematics. Oxford University Press, Apr. 2005. ISBN: 9780198505358

  18. [18]

    Super-exponential complexity of Presburger arithmetic

    Michael J Fischer and Michael O Rabin. “Super-exponential complexity of Presburger arithmetic”. In: 1974

  19. [19]

    The Pile: An 800GB Dataset of Diverse Text for Language Modeling

    Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. “The Pile: An 800GB Dataset of Diverse Text for Language Modeling”. In: arXiv preprint 2101.00027 (2020)

  20. [20]

    Language Identification in the Limit

    E. Mark Gold. “Language Identification in the Limit”. In: Information and Control 10.5 (1967), pp. 447–474

  21. [21]

    Assessing The Factual Accuracy of Generated Text

    Ben Goodrich, Vinay Rao, Peter J. Liu, and Mohammad Saleh. “Assessing The Factual Accuracy of Generated Text”. In: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2019, pp. 166–175

  22. [22]

    Looking for a Needle in a Haystack: A Comprehensive Study of Hallucinations in Neural Machine Translation

    Nuno M. Guerreiro, Elena Voita, and André Martins. “Looking for a Needle in a Haystack: A Comprehensive Study of Hallucinations in Neural Machine Translation”. In: Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics. May 2023, pp. 1059–1075

  23. [23]

    Textbooks Are All You Need

    Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Harkirat Singh Behl, Xin Wang, Sébastien Bubeck, Ronen Eldan, Adam Tauman Kalai, Yin Tat Lee, and Yuanzhi Li. “Textbooks Are All You Need”. In: arXiv preprint 2306...

  24. [24]

    Theoretical Limitations of Self-Attention in Neural Sequence Models

    Michael Hahn. “Theoretical Limitations of Self-Attention in Neural Sequence Models”. In: Transactions of the Association for Computational Linguistics 8 (2020), pp. 156–171

  25. [25]

    Stuck in the Quicksand of Numeracy, Far from AGI Summit: Evaluating LLMs’ Mathematical Competency through Ontology-guided Perturbations

    Pengfei Hong, Deepanway Ghosal, Navonil Majumder, Somak Aditya, Rada Mihalcea, and Soujanya Poria. “Stuck in the Quicksand of Numeracy, Far from AGI Summit: Evaluating LLMs’ Mathematical Competency through Ontology-guided Perturbations”. In: arXiv preprint 2401.09395 (2024)

  26. [26]

    A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions

    Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, and Ting Liu. “A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions”. In: arXiv preprint 2311.05232 (2023)

  27. [27]

    The Factual Inconsistency Problem in Abstractive Text Summarization: A Survey

    Yichong Huang, Xiachong Feng, Xiaocheng Feng, and Bing Qin. “The Factual Inconsistency Problem in Abstractive Text Summarization: A Survey”. In: arXiv preprint 2104.14839 (2023)

  28. [30]

    Llama-3-70b-Instruct Model on HuggingFace

    HuggingFace. Llama-3-70b-Instruct Model on HuggingFace. Accessed: 2023-12-15. URL: https://web.archive.org/web/20240518083325/https://huggingface.co/unsloth/llama-3-70b-Instruct-bnb-4bit

  29. [31]

    Systems That Learn: An Introduction to Learning Theory

    Sanjay Jain, Daniel N. Osherson, James S. Royer, and Arun Sharma. Systems That Learn: An Introduction to Learning Theory. The MIT Press, Feb. 1999. ISBN: 9780262276252

  30. [32]

    Survey of Hallucination in Natural Language Generation

    Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. “Survey of Hallucination in Natural Language Generation”. In: ACM Computing Surveys 55.12 (Mar. 2023)

  31. [33]

    Calibrated Language Models Must Hallucinate

    Adam Tauman Kalai and Santosh S. Vempala. “Calibrated Language Models Must Hallucinate”. In: Proceedings of the 56th Annual ACM Symposium on Theory of Computing (STOC). 2024

  32. [34]

    Large Language Models Struggle to Learn Long-Tail Knowledge

    Nikhil Kandpal, Haikang Deng, Adam Roberts, Eric Wallace, and Colin Raffel. “Large Language Models Struggle to Learn Long-Tail Knowledge”. In: Proceedings of the 40th International Conference on Machine Learning. Vol. 202. July 2023, pp. 15696–15707

  33. [35]

    Deduplicating Training Data Makes Language Models Better

    Katherine Lee, Daphne Ippolito, Andrew Nystrom, Chiyuan Zhang, Douglas Eck, Chris Callison-Burch, and Nicholas Carlini. “Deduplicating Training Data Makes Language Models Better”. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). May 2022, pp. 8424–8445

  34. [36]

    Factuality Enhanced Language Models for Open-Ended Text Generation

    Nayeon Lee, Wei Ping, Peng Xu, Mostofa Patwary, Pascale N Fung, Mohammad Shoeybi, and Bryan Catanzaro. “Factuality Enhanced Language Models for Open-Ended Text Generation”. In: Advances in Neural Information Processing Systems. Vol. 35. 2022, pp. 34586–34599

  35. [37]

    Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

    Patrick S. H. Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks”. In: Proceedings of the 34th International Conference on Neural Information Processing Systems. 2020

  36. [38]

    Large Language Models with Controllable Working Memory

    Daliang Li, Ankit Singh Rawat, Manzil Zaheer, Xin Wang, Michal Lukasik, Andreas Veit, Felix Yu, and Sanjiv Kumar. “Large Language Models with Controllable Working Memory”. In: Findings of the Association for Computational Linguistics: ACL 2023. July 2023, pp. 1774–1793

  37. [39]

    HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models

    Junyi Li, Xiaoxue Cheng, Xin Zhao, Jian-Yun Nie, and Ji-Rong Wen. “HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models”. In: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2023, pp. 6449–6464

  38. [40]

    TruthfulQA: Measuring How Models Mimic Human Falsehoods

    Stephanie Lin, Jacob Hilton, and Owain Evans. “TruthfulQA: Measuring How Models Mimic Human Falsehoods”. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). May 2022, pp. 3214–3252

  39. [41]

    Learning Quickly When Irrelevant Attributes Abound: A New Linear-threshold Algorithm

    Nick Littlestone. “Learning Quickly When Irrelevant Attributes Abound: A New Linear-threshold Algorithm”. In: Mach. Learn. 2.4 (1987), pp. 285–318

  40. [42]

    Exposing Attention Glitches with Flip-Flop Language Modeling

    Bingbin Liu, Jordan T. Ash, Surbhi Goel, Akshay Krishnamurthy, and Cyril Zhang. “Exposing Attention Glitches with Flip-Flop Language Modeling”. In: Advances in Neural Information Processing Systems. Vol. 36. 2023

  41. [43]

    Hallucination Detection and Hallucination Mitigation: An Investigation

    Junliang Luo, Tianyu Li, Di Wu, Michael Jenkin, Steve Liu, and Gregory Dudek. “Hallucination Detection and Hallucination Mitigation: An Investigation”. In: arXiv preprint 2401.08358 (2024)

  42. [44]

    When Not to Trust Language Models: Investigating Effectiveness of Parametric and Non-Parametric Memories

    Alex Mallen, Akari Asai, Victor Zhong, Rajarshi Das, Daniel Khashabi, and Hannaneh Hajishirzi. “When Not to Trust Language Models: Investigating Effectiveness of Parametric and Non-Parametric Memories”. In: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). July 2023, pp. 9802–9822

  43. [45]

    Knowledge Injection to Counter Large Language Model (LLM) Hallucination

    Ariana Martino, Michael Iannelli, and Coleen Truong. “Knowledge Injection to Counter Large Language Model (LLM) Hallucination”. In: The Semantic Web: ESWC 2023 Satellite Events. 2023, pp. 182–185

  44. [46]

    Introducing Meta Llama 3: The most capable openly available LLM to date

    Meta Platforms, Inc. Introducing Meta Llama 3: The most capable openly available LLM to date. Accessed: 2024-04-30. URL: https://web.archive.org/web/20231207183448/https://huggingface.co/blog/llama2

  45. [47]

    Looking Beyond Sentence-Level Natural Language Inference for Question Answering and Text Summarization

    Anshuman Mishra, Dhruvesh Patel, Aparna Vijayakumar, Xiang Lorraine Li, Pavan Kapanipathi, and Kartik Talamadupula. “Looking Beyond Sentence-Level Natural Language Inference for Question Answering and Text Summarization”. In: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language ...

  46. [48]

    Nationality Bias in Text Generation

    Pranav Narayanan Venkit, Sanjana Gautam, Ruchi Panchanadikar, Ting-Hao Huang, and Shomir Wilson. “Nationality Bias in Text Generation”. In: Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics. May 2023, pp. 116–122

  47. [49]

    A Simple Recipe towards Reducing Hallucination in Neural Surface Realisation

    Feng Nie, Jin-Ge Yao, Jinpeng Wang, Rong Pan, and Chin-Yew Lin. “A Simple Recipe towards Reducing Hallucination in Neural Surface Realisation”. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. July 2019, pp. 2673–2679

  48. [50]

    Entity Cloze By Date: What LMs Know About Unseen Entities

    Yasumasa Onoe, Michael J. Q. Zhang, Eunsol Choi, and Greg Durrett. “Entity Cloze By Date: What LMs Know About Unseen Entities”. In: NAACL-HLT (Findings). Association for Computational Linguistics, 2022, pp. 693–702

  49. [51]

    ChatGPT Release Notes

    OpenAI. “ChatGPT Release Notes”. In: (2023). Accessed: 2023-12-16. URL: https://web.archive.org/web/20231214021113/https://help.openai.com/en/articles/6825453-chatgpt-release-notes

  50. [52]

    GPT-4 Technical Report

    OpenAI. “GPT-4 Technical Report”. In: arXiv preprint 2303.08774 (2023)

  51. [53]

    Automatically Correcting Large Language Models: Surveying the landscape of diverse self-correction strategies

    Liangming Pan, Michael Saxon, Wenda Xu, Deepak Nathani, Xinyi Wang, and William Yang Wang. “Automatically Correcting Large Language Models: Surveying the landscape of diverse self-correction strategies”. In: arXiv preprint 2308.03188 (2023)

  52. [54]

    Data and its (dis)contents: A survey of dataset development and use in machine learning research

    Amandalynne Paullada, Inioluwa Deborah Raji, Emily M. Bender, Emily Denton, and Alex Hanna. “Data and its (dis)contents: A survey of dataset development and use in machine learning research”. In: Patterns 2.11 (2021), p. 100336

  53. [55]

    Check your facts and try again: Improving large language models with external knowledge and automated feedback

    Baolin Peng, Michel Galley, Pengcheng He, Hao Cheng, Yujia Xie, Yu Hu, Qiuyuan Huang, Lars Liden, Zhou Yu, Weizhu Chen, and Jianfeng Gao. “Check Your Facts and Try Again: Improving Large Language Models with External Knowledge and Automated Feedback”. In: arXiv preprint 2302.12813 (2023)

  54. [56]

    Uber die vollstandigkeit eines gewissen systems der arithmetik ganzer zahlen, in welchen die addition als einzige operation hervortritt

    Mojzesz Presburger. “Uber die vollstandigkeit eines gewissen systems der arithmetik ganzer zahlen, in welchen die addition als einzige operation hervortritt”. In: Comptes-Rendus du 1er Congres des Mathematiciens des Pays Slaves. 1929

  55. [57]

    The Curious Case of Hallucinations in Neural Machine Translation

    Vikas Raunak, Arul Menezes, and Marcin Junczys-Dowmunt. “The Curious Case of Hallucinations in Neural Machine Translation”. In: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. June 2021, pp. 1172–1183

  56. [58]

    The Troubling Emergence of Hallucination in Large Language Models – An Extensive Definition, Quantification, and Prescriptive Remediations

    Vipula Rawte, Swagata Chakraborty, Agnibh Pathak, Anubhav Sarkar, S. M Towhidul Islam Tonmoy, Aman Chadha, Amit P. Sheth, and Amitava Das. “The Troubling Emergence of Hallucination in Large Language Models – An Extensive Definition, Quantification, and Prescriptive Remediations”. In: arXiv preprint 2310.04988 (2023)

  57. [59]

    NeMo Guardrails: A Toolkit for Controllable and Safe LLM Applications with Programmable Rails

    Traian Rebedea, Razvan Dinu, Makesh Narsimhan Sreedhar, Christopher Parisien, and Jonathan Cohen. “NeMo Guardrails: A Toolkit for Controllable and Safe LLM Applications with Programmable Rails”. In: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. Dec. 2023, pp. 431–445

  58. [60]

    Identifying Untrustworthy Samples: Data Filtering for Open-Domain Dialogues with Bayesian Optimization

    Lei Shen, Haolan Zhan, Xin Shen, Hongshen Chen, Xiaofang Zhao, and Xiaodan Zhu. “Identifying Untrustworthy Samples: Data Filtering for Open-Domain Dialogues with Bayesian Optimization”. In: Proceedings of the 30th ACM International Conference on Information & Knowledge Management. 2021, pp. 1598–1608

  59. [61]

    Trusting Your Evidence: Hallucinate Less with Context-aware Decoding

    Weijia Shi, Xiaochuang Han, Mike Lewis, Yulia Tsvetkov, Luke Zettlemoyer, and Scott Wen-tau Yih. “Trusting Your Evidence: Hallucinate Less with Context-aware Decoding”. In: arXiv preprint 2305.14739 (2023)

  60. [62]

    In-Context Pretraining: Language Modeling Beyond Document Boundaries

    Weijia Shi, Sewon Min, Maria Lomeli, Chunting Zhou, Margaret Li, Rich James, Xi Victoria Lin, Noah A. Smith, Luke Zettlemoyer, Scott Yih, and Mike Lewis. “In-Context Pretraining: Language Modeling Beyond Document Boundaries”. In: arXiv preprint 2310.10638 (2023)

  61. [63]

    Retrieval Augmentation Reduces Hallucination in Conversation

    Kurt Shuster, Spencer Poff, Moya Chen, Douwe Kiela, and Jason Weston. “Retrieval Augmentation Reduces Hallucination in Conversation”. In: Findings of the Association for Computational Linguistics: EMNLP 2021. Nov. 2021, pp. 3784–3803

  62. [64]

    Theories of Meaning

    Jeff Speaks. “Theories of Meaning”. In: The Stanford Encyclopedia of Philosophy . Ed. by Edward N. Zalta. Spring 2021. Metaphysics Research Lab, Stanford University, 2021. URL: https://plato.stanford.edu/archives/spr2021/entries/meaning/

  63. [65]

    Presburger’s article on integer arithmetic: Remarks and translation

    Ryan Stansifer. Presburger’s article on integer arithmetic: Remarks and translation. Tech. rep. Cornell University, 1984. URL: https://dl.acm.org/doi/book/10.5555/867696

  64. [66]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. “LLaMA: Open and Efficient Foundation Language Models”. In: arXiv preprint 2302.13971 (2023)

  65. [67]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Harts...

  66. [68]

    On Computable Numbers, with an Application to the Entscheidungsproblem

    A. M. Turing. “On Computable Numbers, with an Application to the Entscheidungsproblem”. In: Proceedings of the London Mathematical Society s2-42.1 (1937), pp. 230–265

  67. [69]

    Language Models Don’t Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting

    Miles Turpin, Julian Michael, Ethan Perez, and Samuel R. Bowman. “Language Models Don’t Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting”. In: NeurIPS. 2023

  68. [70]

    A Theory of the Learnable

    L. G. Valiant. “A Theory of the Learnable”. In: Communications of the ACM 27.11 (Nov. 1984), pp. 1134–1142

  69. [71]

    Attention is All You Need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. “Attention is All You Need”. In: Proceedings of the 31st International Conference on Neural Information Processing Systems. 2017, pp. 6000–6010

  70. [72]

    FreshLLMs: Refreshing Large Language Models with Search Engine Augmentation

    Tu Vu, Mohit Iyyer, Xuezhi Wang, Noah Constant, Jerry Wei, Jason Wei, Chris Tar, Yun-Hsuan Sung, Denny Zhou, Quoc Le, and Thang Luong. “FreshLLMs: Refreshing Large Language Models with Search Engine Augmentation”. In: arXiv preprint 2310.03214 (2023)

  71. [73]

    Survey on Factuality in Large Language Models: Knowledge, Retrieval and Domain-Specificity

    Cunxiang Wang, Xiaoze Liu, Yuanhao Yue, Xiangru Tang, Tianhang Zhang, Cheng Jiayang, Yunzhi Yao, Wenyang Gao, Xuming Hu, Zehan Qi, Yidong Wang, Linyi Yang, Jindong Wang, Xing Xie, Zheng Zhang, and Yue Zhang. “Survey on Factuality in Large Language Models: Knowledge, Retrieval and Domain-Specificity”. In: arXiv preprint 2310.07521 (2023)

  72. [74]

    SCOTT: Self-Consistent Chain-of-Thought Distillation

    Peifeng Wang, Zhengyang Wang, Zheng Li, Yifan Gao, Bing Yin, and Xiang Ren. “SCOTT: Self-Consistent Chain-of-Thought Distillation”. In: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). July 2023, pp. 5546–5558

  73. [75]

    Emergent Abilities of Large Language Models

    Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus. “Emergent Abilities of Large Language Models”. In: Transactions on Machine Learning Research (2022)

  74. [76]

    Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc V Le, and Denny Zhou. “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models”. In: Advances in Neural Information Processing Systems. Vol. 35. 2022, pp. 24824–24837

  75. [77]

    Cantor’s diagonal argument — Wikipedia, The Free Encyclopedia

    Wikipedia contributors. “Cantor’s diagonal argument — Wikipedia, The Free Encyclopedia”. In: (2023). [Online; accessed 6-December-2023]. URL: https://en.wikipedia.org/w/index.php?title=Cantor%27s_diagonal_argument&oldid=1173962712

  76. [78]

    Breaking the Softmax Bottleneck: A High-Rank RNN Language Model

    Zhilin Yang, Zihang Dai, Ruslan Salakhutdinov, and William W. Cohen. “Breaking the Softmax Bottleneck: A High-Rank RNN Language Model”. In: International Conference on Learning Representations. 2018

  77. [79]

    Tree of Thoughts: Deliberate Problem Solving with Large Language Models

    Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik Narasimhan. “Tree of Thoughts: Deliberate Problem Solving with Large Language Models”. In: arXiv preprint 2305.10601 (2023)

  78. [80]

    Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models

    Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, Enbo Zhao, Yu Zhang, Yulong Chen, Longyue Wang, Anh Tuan Luu, Wei Bi, Freda Shi, and Shuming Shi. “Siren’s Song in the AI Ocean: A Survey on Hallucination in Large Language Models”. In: arXiv preprint 2309.01219 (2023)

  79. [81]

    Verify-and-Edit: A Knowledge-Enhanced Chain-of-Thought Framework

    Ruochen Zhao, Xingxuan Li, Shafiq Joty, Chengwei Qin, and Lidong Bing. “Verify-and-Edit: A Knowledge-Enhanced Chain-of-Thought Framework”. In: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, July 2023, pp. 5823–5840


Showing first 79 references.