pith. machine review for the scientific record.

arxiv: 2401.11817 · v2 · submitted 2024-01-22 · 💻 cs.CL · cs.AI · cs.LG

Recognition: no theorem link

Hallucination is Inevitable: An Innate Limitation of Large Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 20:34 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords hallucination · large language models · computable functions · learning theory · impossibility results · AI limitations

The pith

LLMs cannot learn all computable functions and will therefore inevitably hallucinate when used as general problem solvers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that hallucination in large language models is not merely a practical shortcoming but a provable inevitability. It defines a formal computable world in which an LLM is a computable function and hallucination occurs whenever the LLM's output differs from that of the computable ground-truth function. Applying results from learning theory, it proves that no LLM can learn every computable function, so when LLMs are applied to arbitrary problems, some inconsistencies with the ground truth are unavoidable. Because this formal setting is a simplified part of the real world, the same limitation applies to actual LLMs.

Core claim

In a formal world, hallucination is defined as any inconsistency between a computable LLM and a computable ground-truth function. Results from learning theory show that LLMs cannot learn all computable functions and will therefore inevitably hallucinate if used as general problem solvers. Since the formal world is a part of the real world, hallucinations are also inevitable for real-world LLMs. For LLMs constrained by provable time complexity, the paper describes hallucination-prone tasks and validates its claims empirically.
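
A compact restatement of the formal setup, in notation of our choosing (the symbols h, f, s, and the enumeration h_0, h_1, … are editorial labels, not necessarily the paper's):

    % Formal world: LLMs and ground truths are computable functions on strings.
    % An LLM h hallucinates with respect to a ground truth f iff they disagree
    % on at least one input:
    \[ \exists\, s \in \Sigma^{*} \;:\; h(s) \neq f(s) \]
    % Inevitability in diagonal form: for any computable enumeration
    % h_0, h_1, \dots of candidate LLMs, a ground truth f can be defined
    % pointwise so that
    \[ f(s_i) \neq h_i(s_i) \quad \text{for every } i, \]
    % making each h_i hallucinate on its own diagonal input s_i.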

What carries the argument

The formal computable world in which hallucination is an inconsistency between the LLM's computable function and the ground truth function, together with learning theory results that no single computable function can approximate all others.
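
The diagonal step can be run directly at toy scale; the three-element enumeration below is a stand-in for the countable enumeration of all computable LLMs, and every name in it is an editorial invention rather than the paper's construction:

    # Diagonalization sketch: given any enumeration of candidate "LLMs"
    # (toy computable functions from str to str), build a ground truth
    # that every one of them gets wrong on at least one input.

    def make_llm_enumeration():
        """A toy stand-in for an enumeration of all computable LLMs."""
        return [
            lambda s: s,               # echo
            lambda s: s.upper(),       # uppercase
            lambda s: str(len(s)),     # input length
        ]

    def diagonal_ground_truth(llms):
        """A computable f that disagrees with llms[i] on the input str(i)."""
        def f(s: str) -> str:
            i = int(s)
            return llms[i](s) + "#"    # any answer other than LLM i's output
        return f

    llms = make_llm_enumeration()
    f = diagonal_ground_truth(llms)
    for i, h in enumerate(llms):
        assert h(str(i)) != f(str(i))  # every candidate errs on its diagonal input
    print("each toy LLM disagrees with the diagonal ground truth")

No finite check establishes the theorem, of course; the point is only that the diagonal recipe mechanically defeats any fixed, enumerable family of solvers.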

If this is right

  • Any LLM used for general problem solving will produce hallucinations on some inputs.
  • Hallucinations cannot be completely eliminated through any finite training process.
  • Tasks requiring computation beyond the LLM's time complexity limits are especially prone to hallucination (a concrete instance is sketched after this list).
  • Existing mitigation methods can reduce but not remove the possibility of hallucination.
  • Safe deployment of LLMs requires acknowledging this limitation rather than assuming it can be trained away.
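
One concrete instance of the third point, with the task chosen by us in the spirit of the paper's string-enumeration experiments: listing all binary strings of length n is trivially computable, yet the correct answer has 2^n entries, so any model whose output is polynomially bounded in the prompt length must omit or invent strings once n is large enough.

    # A computable but hallucination-prone task for any time-bounded model:
    # "list all binary strings of length n". The exact answer doubles in size
    # with each increment of n, while the prompt grows only as len(str(n)).

    from itertools import product

    def ground_truth(n: int) -> list[str]:
        """Exact answer: all 2**n binary strings of length n."""
        return ["".join(bits) for bits in product("01", repeat=n)]

    def is_correct(answer: list[str], n: int) -> bool:
        """Efficient verifier: exactly the 2**n distinct valid strings."""
        return (len(set(answer)) == 2 ** n
                and all(len(s) == n and set(s) <= {"0", "1"} for s in answer))

    print(len(ground_truth(4)))   # 16; at n = 80 the answer exceeds 10**24 strings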

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • If the result holds, LLMs are best suited for tasks within a restricted class of functions rather than open-ended use.
  • Hybrid systems that pair LLMs with symbolic verifiers or exact algorithms could address the gaps left by this limitation.
  • Research into the precise boundary of learnable functions for current architectures might identify safer application domains.
  • Real-world complexity may make the formal bound even stricter, suggesting hallucinations occur more often than the bare existence result requires.

Load-bearing premise

The formal computable world is representative enough of the real world for the impossibility result to apply to practical LLMs.

What would settle it

Finding or constructing an LLM that correctly computes every function in a broad class of computable functions without any hallucinations would disprove the inevitability claim.
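
Operationally, a falsification attempt would look like the hedged harness below; the task class and the llm callable are placeholders of ours, and since no finite probe can certify correctness on all computable functions, only the failure direction is decisive:

    # Sketch of a falsification harness: probe a candidate LLM against exact
    # computable references. A non-empty result is a confirmed hallucination;
    # an empty result proves nothing about the full class.

    def probe(llm, tasks, inputs):
        """tasks maps a name to an exact computable reference function."""
        failures = []
        for name, ref in tasks.items():
            for x in inputs:
                if llm(name, x) != ref(x):
                    failures.append((name, x))
        return failures

    # Placeholder task class: exact string functions with known answers.
    tasks = {
        "reverse": lambda s: s[::-1],
        "parity":  lambda s: str(s.count("1") % 2),
    }
    mock_llm = lambda name, x: tasks[name](x)           # stand-in "LLM"
    print(probe(mock_llm, tasks, ["0110", "111", ""]))  # [] by construction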

read the original abstract

Hallucination has been widely recognized to be a significant drawback for large language models (LLMs). There have been many works that attempt to reduce the extent of hallucination. These efforts have mostly been empirical so far, which cannot answer the fundamental question whether it can be completely eliminated. In this paper, we formalize the problem and show that it is impossible to eliminate hallucination in LLMs. Specifically, we define a formal world where hallucination is defined as inconsistencies between a computable LLM and a computable ground truth function. By employing results from learning theory, we show that LLMs cannot learn all the computable functions and will therefore inevitably hallucinate if used as general problem solvers. Since the formal world is a part of the real world which is much more complicated, hallucinations are also inevitable for real world LLMs. Furthermore, for real world LLMs constrained by provable time complexity, we describe the hallucination-prone tasks and empirically validate our claims. Finally, using the formal world framework, we discuss the possible mechanisms and efficacies of existing hallucination mitigators as well as the practical implications on the safe deployment of LLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper formalizes hallucination as inconsistency between a computable LLM and a computable ground-truth function in a restricted formal world. It invokes learning-theory results to prove that no fixed computable LLM can match every computable function, hence any LLM used as a general solver must hallucinate on some inputs. The authors then argue that because the formal world is a subset of the real world, the same inevitability holds for practical LLMs; they further characterize hallucination-prone tasks under provable time bounds, supply empirical checks, and evaluate existing mitigation techniques.

Significance. If the formal-to-real bridge can be made rigorous, the result would supply a clean theoretical explanation for the persistence of hallucinations and would constrain expectations for general-purpose LLM deployment. The work correctly applies standard no-free-lunch style arguments to the LLM setting and supplies an empirical component, both of which are strengths.

major comments (2)
  1. [§4] Extension to real-world LLMs: The sentence 'Since the formal world is a part of the real world which is much more complicated, hallucinations are also inevitable for real world LLMs' is load-bearing for the central claim yet supplies no measure over task distributions. The learning-theoretic impossibility shows existence of failing functions; it does not show that the functions arising in practical NLP tasks lie outside the approximable subset.
  2. [§5] Time-constrained case and empirical validation: The characterization of hallucination-prone tasks under provable time complexity is stated without an explicit reduction showing that the chosen tasks correspond to functions that are unlearnable by any polynomial-time LLM. The empirical results therefore test a weaker claim than the theoretical one.
minor comments (2)
  1. [Abstract] The phrase 'provable time complexity' is introduced without definition or forward reference; a brief parenthetical or citation to the relevant section would improve readability.
  2. [§2] Notation: the paper should clarify whether 'computable LLM' means a Turing machine with finite description or a fixed neural architecture with fixed weights; the distinction affects which learning-theoretic theorems apply directly.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major comment below and indicate the revisions we will make to strengthen the manuscript.

read point-by-point responses
  1. Referee: [§4] Extension to real-world LLMs: The sentence 'Since the formal world is a part of the real world which is much more complicated, hallucinations are also inevitable for real world LLMs' is load-bearing for the central claim yet supplies no measure over task distributions. The learning-theoretic impossibility shows existence of failing functions; it does not show that the functions arising in practical NLP tasks lie outside the approximable subset.

    Authors: We thank the referee for this observation. The core result establishes that no fixed computable LLM can learn every computable function, so any LLM deployed as a general solver must hallucinate on some inputs. Because the formal world is a subset of the real world, this limitation extends to practical LLMs when they are used for broad problem-solving. We agree that the manuscript does not supply a probability measure over real-world task distributions demonstrating that typical NLP tasks fall outside the learnable subset. We will revise §4 to clarify that the inevitability claim applies specifically to general-purpose use rather than asserting that every practical task is unlearnable, and we will add a short discussion of how the existence result constrains expectations for universal solvers. revision: partial

  2. Referee: [§5] Time-constrained case and empirical validation: The characterization of hallucination-prone tasks under provable time complexity is stated without an explicit reduction showing that the chosen tasks correspond to functions that are unlearnable by any polynomial-time LLM. The empirical results therefore test a weaker claim than the theoretical one.

    Authors: We appreciate the referee noting the missing link. In §5 the tasks are selected because they are known to require super-polynomial time under standard complexity assumptions. While we did not include an explicit reduction connecting these tasks to the learning-theoretic unlearnable functions, the section is intended to illustrate the practical consequences of time bounds. The empirical experiments show performance degradation consistent with the theoretical expectations. We will revise the section with a clarifying paragraph that explicitly relates the chosen tasks to the time-bounded unlearnability result and will note that the empirical component is illustrative and complementary to the theory. revision: partial

Circularity Check

0 steps flagged

No circularity; derivation applies external learning-theory results to a self-contained formal model

full rationale

The paper defines hallucination explicitly as inconsistency between a computable LLM and a computable ground-truth function, then invokes standard results from learning theory (no-free-lunch style) to conclude that no fixed LLM can match every computable function. This step is a direct logical consequence of the chosen definition plus external theorems, not a reduction by construction or self-reference. The extension to real-world LLMs is stated as an explicit modeling assumption ('the formal world is a part of the real world'), not smuggled in via self-citation or ansatz. No equations or parameters are fitted and then relabeled as predictions, and no uniqueness theorem is imported from the authors' prior work. The central claim therefore remains independent of its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that both the LLM and ground truth can be modeled as computable functions and that learning theory applies directly to this setting.

axioms (2)
  • domain assumption: LLMs and ground truth are both computable functions
    Stated in the formalization of the problem in the abstract.
  • standard math: Results from learning theory apply to show no LLM can learn all computable functions
    Invoked to conclude inevitability of hallucination.

pith-pipeline@v0.9.0 · 5509 in / 1149 out tokens · 34972 ms · 2026-05-15T20:34:45.866926+00:00 · methodology

discussion (0)


Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. SciIntegrity-Bench: A Benchmark for Evaluating Academic Integrity in AI Scientist Systems

    cs.AI 2026-05 unverdicted novelty 8.0

    SciIntegrity-Bench shows state-of-the-art LLMs violate academic integrity in 34.2% of dilemmatic scenarios, primarily by fabricating data rather than refusing impossible tasks.

  2. EquiMem: Calibrating Shared Memory in Multi-Agent Debate via Game-Theoretic Equilibrium

    cs.AI 2026-05 unverdicted novelty 7.0

    EquiMem calibrates shared memory in multi-agent debate by computing a game-theoretic equilibrium from agent queries and paths, outperforming heuristics and LLM validators across benchmarks while remaining robust to ad...

  3. Green Shielding: A User-Centric Approach Towards Trustworthy AI

    cs.CL 2026-04 unverdicted novelty 7.0

    Green Shielding introduces CUE criteria and the HCM-Dx benchmark to demonstrate that routine prompt variations systematically alter LLM diagnostic behavior along clinically relevant dimensions, producing Pareto-like t...

  4. Beyond Explicit Refusals: Soft-Failure Attacks on Retrieval-Augmented Generation

    cs.CR 2026-04 unverdicted novelty 7.0

    DEJA uses evolutionary optimization guided by an LLM-based Answer Utility Score to induce soft-failure responses in RAG systems, achieving over 79% soft attack success rate with under 15% hard failures and high stealt...

  5. Navig-AI-tion: Navigation by Contextual AI and Spatial Audio

    cs.HC 2026-03 unverdicted novelty 7.0

    A system combining VLM landmark instructions with real-time corrective spatial audio reduces route deviations in a small user study compared to VLM-only and Google Maps audio baselines.

  6. Integrating Domain-Specialized Language Models with AI Measurement Tools for Deterministic Atomic-Resolution Experimentation

    physics.app-ph 2026-02 unverdicted novelty 7.0

    Domain-specialized small language models enable deterministic atomic-resolution scanning probe microscopy control with 99.3% command accuracy, lower computational cost, and better domain performance than larger genera...

  7. Dimension-Level Intent Fidelity Evaluation for Large Language Models: Evidence from Structured Prompt Ablation

    cs.CL 2026-05 unverdicted novelty 6.0

    Dimension-level evaluation reveals that 25-58% of LLM outputs with perfect holistic scores still show measurable intent deficits across languages and domains.

  8. A Unified Benchmark for Evaluating Knowledge Graph Construction Methods and Graph Neural Networks

    cs.LG 2026-05 unverdicted novelty 6.0

    A dual-purpose benchmark supplies two text-derived knowledge graphs and one expert reference graph on the same biomedical corpus to jointly measure construction method quality and GNN robustness via semi-supervised no...

  9. Using Large Language Models as a Co-Author in Undergraduate Quantum Group Research

    math.HO 2026-05 unverdicted novelty 6.0

    An AI model produced a new formula for a central element of U_q(so_12) at the quality level of advanced undergraduate research, along with faster computation via SageMath, prompting changes in mentorship practices.

  10. Hallucinations Undermine Trust; Metacognition is a Way Forward

    cs.CL 2026-05 unverdicted novelty 6.0

    LLMs need metacognition to align expressed uncertainty with their actual knowledge boundaries, moving beyond knowledge expansion to reduce confident errors.

  11. Adaptive Test-Time Compute Allocation with Evolving In-Context Demonstrations

    cs.AI 2026-04 unverdicted novelty 6.0

    An adaptive test-time framework uses a warm-up phase on the test set to build evolving in-context examples, then concentrates compute on unresolved queries to outperform static baselines on math, coding, and reasoning...

  12. FineSteer: A Unified Framework for Fine-Grained Inference-Time Steering in Large Language Models

    cs.LG 2026-04 unverdicted novelty 6.0

    FineSteer decomposes inference-time steering into Subspace-guided Conditional Steering and Mixture-of-Steering-Experts to deliver stronger control over LLM behaviors with less utility loss than prior methods.

  13. Scaling Synthetic Data Creation with 1,000,000,000 Personas

    cs.CL 2024-06 unverdicted novelty 6.0

    A curated set of one billion personas enables scalable, diverse synthetic data generation for LLM training across reasoning, instructions, knowledge, NPCs, and tools.

  14. AgentReputation: A Decentralized Agentic AI Reputation Framework

    cs.AI 2026-04 unverdicted novelty 5.0

    AgentReputation proposes separating AI agent task execution, reputation management, and secure record-keeping into distinct layers, with context-specific reputation cards and a risk-based policy engine to handle verif...

  15. Beyond Accuracy: LLM Variability in Evidence Screening for Software Engineering SLRs

    cs.SE 2026-04 unverdicted novelty 5.0

    LLMs exhibit substantial heterogeneity and non-determinism in SLR evidence screening, abstracts are decisive for performance, and they show no reliable superiority over classical classifiers on two real SLRs.

  16. A pragmatic approach to regulating AI agents

    cs.CY 2026-04 unverdicted novelty 5.0

    AI agents require distinct regulation as AI systems under the EU AI Act with orchestration-layer oversight and a risk-based traffic light authorization system in contract law to preserve human accountability.

  17. V2E: Validating Smart Contract Vulnerabilities through Profit-driven Exploit Generation and Execution

    cs.SE 2026-04 unverdicted novelty 5.0

    V2E automates PoC generation, triggerability and profitability validation, and iterative refinement using LLMs to confirm exploitable smart contract vulnerabilities, outperforming baselines on 264 labeled contracts.

  18. Learning Project-wise Subsequent Code Edits via Interleaving Neural-based Induction and Tool-based Deduction

    cs.SE 2026-04 unverdicted novelty 5.0

    TRACE improves project-wise subsequent code editing by interleaving neural-based induction for semantic edits and tool-based deduction for syntactic edits.

  19. Compiled AI: Deterministic Code Generation for LLM-Based Workflow Automation

    cs.SE 2026-04 unverdicted novelty 4.0

    Compiled AI generates deterministic code artifacts from LLMs in a one-time compilation step, enabling reliable workflow execution with zero runtime tokens after break-even.

  20. Designing for Error Recovery in Human-Robot Interaction

    cs.RO 2026-04 unverdicted novelty 3.0

    Position paper calls for designing robotic AI to detect and recover from its own errors in continuous interactions, using nuclear glovebox operations as an illustrative case.

Reference graph

Works this paper leans on

89 extracted references · 89 canonical work pages · cited by 20 Pith papers · 9 internal anchors

  1. [1]

    Characterizing Attribution and Fluency Tradeoffs for Retrieval-Augmented Large Language Models

    Renat Aksitov, Chung-Ching Chang, David Reitter, Siamak Shakeri, and Yunhsuan Sung. “Characterizing Attribution and Fluency Tradeoffs for Retrieval-Augmented Large Language Models”. In: arXiv preprint 2302.05578 (2023)

  2. [2]

    Computational Complexity - A Modern Approach

    Sanjeev Arora and Boaz Barak. Computational Complexity - A Modern Approach. Cambridge University Press, 2009

  3. [3]

    On the prediction of General Recursive Functions

    J. Bārzdiņš and R. Freivalds. “On the prediction of General Recursive Functions”. In: Soviet Mathematics Doklady (Dokl. Akad. Nauk SSSR) 13 (1972), pp. 1224–1228

  4. [4]

    Learning families of algebraic structures from informant

    Nikolay Bazhenov, Ekaterina B. Fokina, and Luca San Mauro. “Learning families of algebraic structures from informant”. In: Information and Computation 275 (2020), p. 104590

  5. [5]

    Airline held liable for its chatbot giving passenger bad advice - what this means for travellers

    BBC. Airline held liable for its chatbot giving passenger bad advice - what this means for travellers. Accessed: 2024-02-22. 2024. URL: https://web.archive.org/web/20240224015400/https://www.bbc.com/travel/article/20240222-air-canada-chatbot-misinformation-what-travellers-should-know

  6. [6]

    Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks

    Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. “Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks”. In: Advances in Neural Information Processing Systems. Vol. 28. 2015

  7. [7]

    Language Models are Few-Shot Learners

    Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Lit...

  8. [8]

    Ein beitrag zur mannigfaltigkeitslehre

    Georg Cantor. “Ein beitrag zur mannigfaltigkeitslehre”. In: Journal für die reine und angewandte Mathematik (Crelles Journal) 1878.84 (1878), pp. 242–258

  9. [9]

    Ueber eine elementare Frage der Mannigfaltigkeitslehre

    Georg Cantor. “Ueber eine elementare Frage der Mannigfaltigkeitslehre.” In: Jahresbericht der Deutschen Mathematiker-Vereinigung 1 (1890/91), pp. 72–78

  10. [10]

    Softmax Bottleneck Makes Language Models Unable to Represent Multi-mode Word Distributions

    Haw-Shiuan Chang and Andrew McCallum. “Softmax Bottleneck Makes Language Models Unable to Represent Multi-mode Word Distributions”. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). May 2022, pp. 8048–8073

  11. [11]

    Benchmarking Large Language Models in Retrieval-Augmented Generation

    Jiawei Chen, Hongyu Lin, Xianpei Han, and Le Sun. “Benchmarking Large Language Models in Retrieval-Augmented Generation”. In: AAAI. AAAI Press, 2024, pp. 17754–17762

  12. [12]

    Overcoming a Theoretical Limitation of Self-Attention

    David Chiang and Peter Cholak. “Overcoming a Theoretical Limitation of Self-Attention”. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). May 2022, pp. 7654–7664

  13. [13]

    The Complexity of Theorem-Proving Procedures

    Stephen A. Cook. “The Complexity of Theorem-Proving Procedures”. In: Proceedings of the Third Annual ACM Symposium on Theory of Computing. 1971, pp. 151–158

  14. [14]

    The Pitfalls of Defining Hallucination

    Kees van Deemter. “The Pitfalls of Defining Hallucination”. In: arXiv preprint 2401.07897 (2024)

  15. [15]

    Chain-of-Verification Reduces Hallucination in Large Language Models

    Shehzaad Dhuliawala, Mojtaba Komeili, Jing Xu, Roberta Raileanu, Xian Li, Asli Celikyilmaz, and Jason Weston. “Chain-of-Verification Reduces Hallucination in Large Language Models”. In: arXiv preprint 2309.11495 (2023)

  16. [16]

    Neural Path Hunter: Reducing Hallucination in Dialogue Systems via Path Grounding

    Nouha Dziri, Andrea Madotto, Osmar Zaiane, and Avishek Joey Bose. “Neural Path Hunter: Reducing Hallucination in Dialogue Systems via Path Grounding”. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Nov. 2021, pp. 2197–2214

  17. [17]

    From Kant to Hilbert: A Source Book in the Foundations of Mathematics

    William Ewald. From Kant to Hilbert: A Source Book in the Foundations of Mathematics. Oxford University Press, Apr. 2005. ISBN: 9780198505358

  18. [18]

    Super-exponential complexity of Presburger arithmetic

    Michael J Fischer and Michael O Rabin. “Super-exponential complexity of Presburger arithmetic”. In: 1974

  19. [19]

    The Pile: An 800GB Dataset of Diverse Text for Language Modeling

    Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. “The Pile: An 800GB Dataset of Diverse Text for Language Modeling”. In: arXiv preprint 2101.00027 (2020)

  20. [20]

    Language Identification in the Limit

    E. Mark Gold. “Language Identification in the Limit”. In: Information and Control 10.5 (1967), pp. 447–474

  21. [21]

    Assessing The Factual Accuracy of Generated Text

    Ben Goodrich, Vinay Rao, Peter J. Liu, and Mohammad Saleh. “Assessing The Factual Accuracy of Generated Text”. In: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2019, pp. 166–175

  22. [22]

    Looking for a Needle in a Haystack: A Comprehensive Study of Hallucinations in Neural Machine Translation

    Nuno M. Guerreiro, Elena Voita, and André Martins. “Looking for a Needle in a Haystack: A Comprehensive Study of Hallucinations in Neural Machine Translation”. In: Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics. May 2023, pp. 1059–1075

  23. [23]

    Textbooks Are All You Need

    Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Harkirat Singh Behl, Xin Wang, Sébastien Bubeck, Ronen Eldan, Adam Tauman Kalai, Yin Tat Lee, and Yuanzhi Li. “Textbooks Are All You Need”. In: arXiv preprint 2306...

  24. [24]

    Theoretical Limitations of Self-Attention in Neural Sequence Models

    Michael Hahn. “Theoretical Limitations of Self-Attention in Neural Sequence Models”. In: Transactions of the Association for Computational Linguistics 8 (2020), pp. 156–171

  25. [25]

    Stuck in the Quicksand of Numeracy, Far from AGI Summit: Evaluating LLMs’ Mathematical Competency through Ontology-guided Perturbations

    Pengfei Hong, Deepanway Ghosal, Navonil Majumder, Somak Aditya, Rada Mihalcea, and Soujanya Poria. “Stuck in the Quicksand of Numeracy, Far from AGI Summit: Evaluating LLMs’ Mathematical Competency through Ontology-guided Perturbations”. In: arXiv preprint 2401.09395 (2024)

  26. [26]

    A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions

    Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, and Ting Liu. “A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions”. In: arXiv preprint 2311.05232 (2023)

  27. [27]

    The Factual Inconsistency Problem in Abstractive Text Summarization: A Survey

    Yichong Huang, Xiachong Feng, Xiaocheng Feng, and Bing Qin. “The Factual Inconsistency Problem in Abstractive Text Summarization: A Survey”. In: arXiv preprint 2104.14839 (2023)

  28. [30]

    Llama-3-70b-Instruct Model on HuggingFace

    HuggingFace. Llama-3-70b-Instruct Model on HuggingFace. Accessed: 2023-12-15. URL: https://web.archive.org/web/20240518083325/https://huggingface.co/unsloth/llama-3-70b-Instruct-bnb-4bit

  29. [31]

    Systems That Learn: An Introduction to Learning Theory

    Sanjay Jain, Daniel N. Osherson, James S. Royer, and Arun Sharma. Systems That Learn: An Introduction to Learning Theory. The MIT Press, Feb. 1999. ISBN: 9780262276252

  30. [32]

    Survey of Hallucination in Natural Language Generation

    Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. “Survey of Hallucination in Natural Language Generation”. In: ACM Computing Surveys 55.12 (Mar. 2023)

  31. [33]

    Calibrated Language Models Must Hallucinate

    Adam Tauman Kalai and Santosh S. Vempala. “Calibrated Language Models Must Hallucinate”. In: Proceedings of the 56th Annual ACM Symposium on Theory of Computing (STOC). 2024

  32. [34]

    Large Language Models Struggle to Learn Long-Tail Knowledge

    Nikhil Kandpal, Haikang Deng, Adam Roberts, Eric Wallace, and Colin Raffel. “Large Language Models Struggle to Learn Long-Tail Knowledge”. In: Proceedings of the 40th International Conference on Machine Learning. Vol. 202. July 2023, pp. 15696–15707

  33. [35]

    Deduplicating Training Data Makes Language Models Better

    Katherine Lee, Daphne Ippolito, Andrew Nystrom, Chiyuan Zhang, Douglas Eck, Chris Callison-Burch, and Nicholas Carlini. “Deduplicating Training Data Makes Language Models Better”. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). May 2022, pp. 8424–8445

  34. [36]

    Factuality Enhanced Language Models for Open-Ended Text Generation

    Nayeon Lee, Wei Ping, Peng Xu, Mostofa Patwary, Pascale N Fung, Mohammad Shoeybi, and Bryan Catanzaro. “Factuality Enhanced Language Models for Open-Ended Text Generation”. In: Advances in Neural Information Processing Systems. Vol. 35. 2022, pp. 34586–34599

  35. [37]

    Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

    Patrick S. H. Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks”. In: Proceedings of the 34th International Conference on Neural Information Processing Systems. 2020

  36. [38]

    Large Language Models with Controllable Working Memory

    Daliang Li, Ankit Singh Rawat, Manzil Zaheer, Xin Wang, Michal Lukasik, Andreas Veit, Felix Yu, and Sanjiv Kumar. “Large Language Models with Controllable Working Memory”. In: Findings of the Association for Computational Linguistics: ACL 2023. July 2023, pp. 1774–1793

  37. [39]

    HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models

    Junyi Li, Xiaoxue Cheng, Xin Zhao, Jian-Yun Nie, and Ji-Rong Wen. “HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models”. In: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2023, pp. 6449–6464

  38. [40]

    TruthfulQA: Measuring How Models Mimic Human Falsehoods

    Stephanie Lin, Jacob Hilton, and Owain Evans. “TruthfulQA: Measuring How Models Mimic Human Falsehoods”. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). May 2022, pp. 3214–3252

  39. [41]

    Learning Quickly When Irrelevant Attributes Abound: A New Linear-threshold Algorithm

    Nick Littlestone. “Learning Quickly When Irrelevant Attributes Abound: A New Linear-threshold Algorithm”. In: Mach. Learn. 2.4 (1987), pp. 285–318

  40. [42]

    Exposing Attention Glitches with Flip-Flop Language Modeling

    Bingbin Liu, Jordan T. Ash, Surbhi Goel, Akshay Krishnamurthy, and Cyril Zhang. “Exposing Attention Glitches with Flip-Flop Language Modeling”. In: Advances in Neural Information Processing Systems. Vol. 36. 2023

  41. [43]

    Hallucination Detection and Hallucination Mitigation: An Investigation

    Junliang Luo, Tianyu Li, Di Wu, Michael Jenkin, Steve Liu, and Gregory Dudek. “Hallucination Detection and Hallucination Mitigation: An Investigation”. In: arXiv preprint 2401.08358 (2024)

  42. [44]

    When Not to Trust Language Models: Investigating Effectiveness of Parametric and Non-Parametric Memories

    Alex Mallen, Akari Asai, Victor Zhong, Rajarshi Das, Daniel Khashabi, and Hannaneh Hajishirzi. “When Not to Trust Language Models: Investigating Effectiveness of Parametric and Non-Parametric Memories”. In: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). July 2023, pp. 9802–9822

  43. [45]

    Knowledge Injection to Counter Large Language Model (LLM) Hallucination

    Ariana Martino, Michael Iannelli, and Coleen Truong. “Knowledge Injection to Counter Large Language Model (LLM) Hallucination”. In: The Semantic Web: ESWC 2023 Satellite Events. 2023, pp. 182–185

  44. [46]

    Introducing Meta Llama 3: The most capable openly available LLM to date

    Meta Platforms, Inc. Introducing Meta Llama 3: The most capable openly available LLM to date. Accessed: 2024-04-30. URL: https://web.archive.org/web/20231207183448/https://huggingface.co/blog/llama2

  45. [47]

    Looking Beyond Sentence-Level Natural Language Inference for Question Answering and Text Summarization

    Anshuman Mishra, Dhruvesh Patel, Aparna Vijayakumar, Xiang Lorraine Li, Pavan Kapanipathi, and Kartik Talamadupula. “Looking Beyond Sentence-Level Natural Language Inference for Question Answering and Text Summarization”. In: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language ...

  46. [48]

    Nationality Bias in Text Generation

    Pranav Narayanan Venkit, Sanjana Gautam, Ruchi Panchanadikar, Ting-Hao Huang, and Shomir Wilson. “Nationality Bias in Text Generation”. In: Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics. May 2023, pp. 116–122

  47. [49]

    A Simple Recipe towards Reducing Hallucination in Neural Surface Realisation

    Feng Nie, Jin-Ge Yao, Jinpeng Wang, Rong Pan, and Chin-Yew Lin. “A Simple Recipe towards Reducing Hallucination in Neural Surface Realisation”. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. July 2019, pp. 2673–2679

  48. [50]

    Entity Cloze By Date: What LMs Know About Unseen Entities

    Yasumasa Onoe, Michael J. Q. Zhang, Eunsol Choi, and Greg Durrett. “Entity Cloze By Date: What LMs Know About Unseen Entities”. In: NAACL-HLT (Findings). Association for Computational Linguistics, 2022, pp. 693–702

  49. [51]

    ChatGPT Release Notes

    OpenAI. “ChatGPT Release Notes”. In: (2023). Accessed: 2023-12-16. URL: https://web.archive.org/web/20231214021113/https://help.openai.com/en/articles/6825453-chatgpt-release-notes

  50. [52]

    GPT-4 Technical Report

    OpenAI. “GPT-4 Technical Report”. In: arXiv preprint 2303.08774 (2023)

  51. [53]

    Automatically Correcting Large Language Models: Surveying the landscape of diverse self-correction strategies

    Liangming Pan, Michael Saxon, Wenda Xu, Deepak Nathani, Xinyi Wang, and William Yang Wang. “Automatically Correcting Large Language Models: Surveying the landscape of diverse self-correction strategies”. In: arXiv preprint 2308.03188 (2023)

  52. [54]

    Data and its (dis)contents: A survey of dataset development and use in machine learning research

    Amandalynne Paullada, Inioluwa Deborah Raji, Emily M. Bender, Emily Denton, and Alex Hanna. “Data and its (dis)contents: A survey of dataset development and use in machine learning research”. In: Patterns 2.11 (2021), p. 100336

  53. [55]

    Check your facts and try again: Improving large language models with external knowledge and automated feedback

    Baolin Peng, Michel Galley, Pengcheng He, Hao Cheng, Yujia Xie, Yu Hu, Qiuyuan Huang, Lars Liden, Zhou Yu, Weizhu Chen, and Jianfeng Gao. “Check Your Facts and Try Again: Improving Large Language Models with External Knowledge and Automated Feedback”. In: arXiv preprint 2302.12813 (2023)

  54. [56]

    Uber die vollstandigkeit eines gewissen systems der arithmetik ganzer zahlen, in welchen die addition als einzige operation hervortritt

    Mojzesz Presburger. “Uber die vollstandigkeit eines gewissen systems der arithmetik ganzer zahlen, in welchen die addition als einzige operation hervortritt”. In: Comptes-Rendus du 1er Congres des Mathematiciens des Pays Slaves. 1929

  55. [57]

    The Curious Case of Hallucinations in Neural Machine Translation

    Vikas Raunak, Arul Menezes, and Marcin Junczys-Dowmunt. “The Curious Case of Hallucinations in Neural Machine Translation”. In: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. June 2021, pp. 1172–1183

  56. [58]

    The Troubling Emergence of Hallucination in Large Language Models – An Extensive Definition, Quantification, and Prescriptive Remediations

    Vipula Rawte, Swagata Chakraborty, Agnibh Pathak, Anubhav Sarkar, S. M Towhidul Islam Tonmoy, Aman Chadha, Amit P. Sheth, and Amitava Das. “The Troubling Emergence of Hallucination in Large Language Models – An Extensive Definition, Quantification, and Prescriptive Remediations”. In: arXiv preprint 2310.04988 (2023)

  57. [59]

    NeMo Guardrails: A Toolkit for Controllable and Safe LLM Applications with Programmable Rails

    Traian Rebedea, Razvan Dinu, Makesh Narsimhan Sreedhar, Christopher Parisien, and Jonathan Cohen. “NeMo Guardrails: A Toolkit for Controllable and Safe LLM Applications with Programmable Rails”. In: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. Dec. 2023, pp. 431–445

  58. [60]

    Identifying Untrustworthy Samples: Data Filtering for Open-Domain Dialogues with Bayesian Optimization

    Lei Shen, Haolan Zhan, Xin Shen, Hongshen Chen, Xiaofang Zhao, and Xiaodan Zhu. “Identifying Untrustworthy Samples: Data Filtering for Open-Domain Dialogues with Bayesian Optimization”. In: Proceedings of the 30th ACM International Conference on Information & Knowledge Management. 2021, pp. 1598–1608

  59. [61]

    Trusting Your Evidence: Hallucinate Less with Context-aware Decoding

    Weijia Shi, Xiaochuang Han, Mike Lewis, Yulia Tsvetkov, Luke Zettlemoyer, and Scott Wen-tau Yih. “Trusting Your Evidence: Hallucinate Less with Context-aware Decoding”. In: arXiv preprint 2305.14739 (2023)

  60. [62]

    In-Context Pretraining: Language Modeling Beyond Document Boundaries

    Weijia Shi, Sewon Min, Maria Lomeli, Chunting Zhou, Margaret Li, Rich James, Xi Victoria Lin, Noah A. Smith, Luke Zettlemoyer, Scott Yih, and Mike Lewis. “In-Context Pretraining: Language Modeling Beyond Document Boundaries”. In: arXiv preprint 2310.10638 (2023)

  61. [63]

    Retrieval Augmentation Reduces Hallucination in Conversation

    Kurt Shuster, Spencer Poff, Moya Chen, Douwe Kiela, and Jason Weston. “Retrieval Augmentation Reduces Hallucination in Conversation”. In: Findings of the Association for Computational Linguistics: EMNLP 2021. Nov. 2021, pp. 3784–3803

  62. [64]

    Theories of Meaning

    Jeff Speaks. “Theories of Meaning”. In: The Stanford Encyclopedia of Philosophy . Ed. by Edward N. Zalta. Spring 2021. Metaphysics Research Lab, Stanford University, 2021. URL: https://plato.stanford.edu/archives/spr2021/entries/meaning/

  63. [65]

    Presburger’s article on integer arithmetic: Remarks and translation

    Ryan Stansifer. Presburger’s article on integer arithmetic: Remarks and translation. Tech. rep. Cornell University, 1984. URL: https://dl.acm.org/doi/book/10.5555/867696

  64. [66]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. “LLaMA: Open and Efficient Foundation Language Models”. In: arXiv preprint 2302.13971 (2023)

  65. [67]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Harts...

  66. [68]

    On Computable Numbers, with an Application to the Entscheidungsproblem

    A. M. Turing. “On Computable Numbers, with an Application to the Entscheidungsproblem”. In: Proceedings of the London Mathematical Society s2-42.1 (1937), pp. 230–265

  67. [69]

    Language Models Don’t Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting

    Miles Turpin, Julian Michael, Ethan Perez, and Samuel R. Bowman. “Language Models Don’t Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting”. In: NeurIPS. 2023

  68. [70]

    A Theory of the Learnable

    L. G. Valiant. “A Theory of the Learnable”. In: Communications of the ACM 27.11 (Nov. 1984), pp. 1134–1142

  69. [71]

    Attention is All You Need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. “Attention is All You Need”. In: Proceedings of the 31st International Conference on Neural Information Processing Systems. 2017, pp. 6000–6010

  70. [72]

    FreshLLMs: Refreshing Large Language Models with Search Engine Augmentation

    Tu Vu, Mohit Iyyer, Xuezhi Wang, Noah Constant, Jerry Wei, Jason Wei, Chris Tar, Yun-Hsuan Sung, Denny Zhou, Quoc Le, and Thang Luong. “FreshLLMs: Refreshing Large Language Models with Search Engine Augmentation”. In: arXiv preprint 2310.03214 (2023)

  71. [73]

    Survey on Factuality in Large Language Models: Knowledge, Retrieval and Domain-Specificity

    Cunxiang Wang, Xiaoze Liu, Yuanhao Yue, Xiangru Tang, Tianhang Zhang, Cheng Jiayang, Yunzhi Yao, Wenyang Gao, Xuming Hu, Zehan Qi, Yidong Wang, Linyi Yang, Jindong Wang, Xing Xie, Zheng Zhang, and Yue Zhang. “Survey on Factuality in Large Language Models: Knowledge, Retrieval and Domain-Specificity”. In: arXiv preprint 2310.07521 (2023)

  72. [74]

    SCOTT: Self-Consistent Chain-of-Thought Distillation

    Peifeng Wang, Zhengyang Wang, Zheng Li, Yifan Gao, Bing Yin, and Xiang Ren. “SCOTT: Self-Consistent Chain-of-Thought Distillation”. In: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). July 2023, pp. 5546–5558

  73. [75]

    Emergent Abilities of Large Language Models

    Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus. “Emergent Abilities of Large Language Models”. In: Transactions on Machine Learning Research (2022)

  74. [76]

    Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc V Le, and Denny Zhou. “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models”. In: Advances in Neural Information Processing Systems. Vol. 35. 2022, pp. 24824–24837

  75. [77]

    Cantor’s diagonal argument — Wikipedia, The Free Encyclopedia

    Wikipedia contributors. “Cantor’s diagonal argument — Wikipedia, The Free Encyclopedia”. In: (2023). [Online; accessed 6-December-2023]. URL: https://en.wikipedia.org/w/index.php?title=Cantor%27s_diagonal_argument&oldid=1173962712

  76. [78]

    Breaking the Softmax Bottleneck: A High-Rank RNN Language Model

    Zhilin Yang, Zihang Dai, Ruslan Salakhutdinov, and William W. Cohen. “Breaking the Softmax Bottleneck: A High-Rank RNN Language Model”. In: International Conference on Learning Representations. 2018

  77. [79]

    Tree of Thoughts: Deliberate Problem Solving with Large Language Models

    Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik Narasimhan. “Tree of Thoughts: Deliberate Problem Solving with Large Language Models”. In: arXiv preprint 2305.10601 (2023)

  78. [80]

    Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models

    Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, Enbo Zhao, Yu Zhang, Yulong Chen, Longyue Wang, Anh Tuan Luu, Wei Bi, Freda Shi, and Shuming Shi. “Siren’s Song in the AI Ocean: A Survey on Hallucination in Large Language Models”. In: arXiv preprint 2309.01219 (2023)

  79. [81]

    Verify-and-Edit: A Knowledge-Enhanced Chain-of-Thought Framework

    Ruochen Zhao, Xingxuan Li, Shafiq Joty, Chengwei Qin, and Lidong Bing. “Verify-and-Edit: A Knowledge-Enhanced Chain-of-Thought Framework”. In: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, July 2023, pp. 5823–5840


Showing first 79 references.