A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions
Pith reviewed 2026-05-13 02:40 UTC · model grok-4.3
The pith
Large language models produce plausible but false content known as hallucinations, and this survey introduces a dedicated taxonomy while reviewing causes, detection methods, mitigation strategies, and open challenges.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
This survey introduces a taxonomy of hallucination tailored to the LLM era, examines the factors that contribute to hallucinations, provides a thorough overview of detection methods and benchmarks, reviews representative methodologies for mitigating LLM hallucinations, analyzes the current limitations of retrieval-augmented LLMs, and highlights promising research directions, including hallucination in large vision-language models and the understanding of knowledge boundaries.
What carries the argument
The innovative taxonomy of hallucination tailored to LLMs, which organizes distinct types of non-factual generation and serves as the framework for examining causes, detection, and mitigation.
If this is right
- Clarifying contributing factors can guide changes in training procedures that reduce non-factual outputs.
- Detection methods and benchmarks allow consistent evaluation of how well different models and techniques control hallucinations (a minimal detection sketch follows this list).
- Mitigation methodologies provide concrete steps that practitioners can apply to improve factual reliability in deployed systems.
- Analysis of retrieval-augmented limitations points to needed improvements in hybrid architectures for information retrieval.
- Attention to knowledge boundaries suggests models that more accurately signal when they should refrain from answering.
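As a concrete illustration of the detection point above, here is a minimal sketch of a sampling-based self-consistency check in the spirit of detectors this literature reviews (e.g., SelfCheckGPT). The `generate` function is a hypothetical stand-in for any LLM sampling API, and the lexical-overlap scorer is a deliberately simple proxy for the NLI- or QA-based scorers used in practice.

```python
from difflib import SequenceMatcher


def generate(prompt: str, temperature: float = 1.0) -> str:
    """Hypothetical LLM call; replace with a real client (e.g., an HTTP API)."""
    raise NotImplementedError


def consistency_score(prompt: str, answer: str, n_samples: int = 5) -> float:
    """Return a score in [0, 1]; low agreement between the original answer
    and independently resampled answers is treated as evidence of hallucination."""
    samples = [generate(prompt, temperature=1.0) for _ in range(n_samples)]
    overlaps = [SequenceMatcher(None, answer.lower(), s.lower()).ratio()
                for s in samples]
    return sum(overlaps) / len(overlaps)


# Usage: flag low-consistency answers for retrieval, verification, or abstention.
# if consistency_score(question, answer) < 0.5: escalate(question, answer)
```

The design rests on a common observation in this literature: hallucinated content tends to be unstable under resampling, whereas well-grounded facts recur across samples.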
Where Pith is reading between the lines
- The taxonomy could be tested on new multimodal models to check whether hallucination patterns transfer across text and vision.
- Ongoing updates to the survey may become necessary as new detection benchmarks and mitigation methods appear.
- Work on knowledge boundaries could connect to separate research on uncertainty estimation and model abstention.
- Standardized protocols for measuring hallucination rates in interactive settings would strengthen the practical value of the reviewed benchmarks.
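On the last point, a standardized protocol could at minimum fix how point estimates and their uncertainty are reported. The sketch below is a hypothetical illustration, not something proposed in the survey: it turns per-item hallucination verdicts (from any detector or human annotator) into a rate with a bootstrap confidence interval so that benchmark numbers are comparable across studies.

```python
import random


def hallucination_rate(verdicts: list[bool], n_boot: int = 1000,
                       alpha: float = 0.05) -> tuple[float, float, float]:
    """verdicts[i] is True if output i was judged hallucinated.
    Returns (rate, ci_low, ci_high) using a percentile bootstrap."""
    rate = sum(verdicts) / len(verdicts)
    boots = sorted(sum(random.choices(verdicts, k=len(verdicts))) / len(verdicts)
                   for _ in range(n_boot))
    return rate, boots[int(n_boot * alpha / 2)], boots[int(n_boot * (1 - alpha / 2))]


# Example: 7 of 50 evaluated responses judged hallucinated.
rate, lo, hi = hallucination_rate([True] * 7 + [False] * 43)
print(f"rate={rate:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```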
Load-bearing premise
The proposed taxonomy and the selected body of literature together give a comprehensive and unbiased account of LLM hallucination research despite the field's rapid evolution.
What would settle it
Publication of a later survey or empirical study that identifies major hallucination categories, detection techniques, or mitigation approaches absent from this taxonomy would indicate the overview is incomplete.
Original abstract
The emergence of large language models (LLMs) has marked a significant breakthrough in natural language processing (NLP), fueling a paradigm shift in information acquisition. Nevertheless, LLMs are prone to hallucination, generating plausible yet nonfactual content. This phenomenon raises significant concerns over the reliability of LLMs in real-world information retrieval (IR) systems and has attracted intensive research to detect and mitigate such hallucinations. Given the open-ended general-purpose attributes inherent to LLMs, LLM hallucinations present distinct challenges that diverge from prior task-specific models. This divergence highlights the urgency for a nuanced understanding and comprehensive overview of recent advances in LLM hallucinations. In this survey, we begin with an innovative taxonomy of hallucination in the era of LLM and then delve into the factors contributing to hallucinations. Subsequently, we present a thorough overview of hallucination detection methods and benchmarks. Our discussion then transfers to representative methodologies for mitigating LLM hallucinations. Additionally, we delve into the current limitations faced by retrieval-augmented LLMs in combating hallucinations, offering insights for developing more robust IR systems. Finally, we highlight the promising research directions on LLM hallucinations, including hallucination in large vision-language models and understanding of knowledge boundaries in LLM hallucinations.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper is a survey on hallucination in large language models (LLMs). It introduces an innovative taxonomy for LLM-era hallucinations, examines contributing factors, reviews detection methods and benchmarks, discusses representative mitigation methodologies, analyzes limitations of retrieval-augmented LLMs for combating hallucinations, and outlines promising research directions including hallucinations in large vision-language models and understanding knowledge boundaries.
Significance. If the taxonomy provides a clear and useful organizing framework and the overviews accurately capture the state of the field, the survey would serve as a valuable reference for NLP and IR researchers working on LLM reliability. The logical progression from taxonomy through detection, mitigation, and open questions, combined with attention to RAG-specific challenges, could help standardize terminology and prioritize future work on trustworthy information systems.
Major comments (2)
- The central claim of an 'innovative taxonomy' (abstract and opening section) would be strengthened by an explicit side-by-side comparison table or subsection contrasting the new taxonomy with at least two prior hallucination or factuality taxonomies from the cited literature; without this, it is difficult to assess what specific distinctions are novel versus incremental.
- In the detection-methods and benchmarks overview, the absence of a systematic literature-search protocol or inclusion/exclusion criteria (e.g., date range, venues, or keyword strategy) risks selection bias in a fast-moving area; this directly affects the reliability of the 'thorough overview' claim.
Minor comments (3)
- Figure captions and table headers should explicitly state the source or year of each cited benchmark or method to allow readers to judge currency.
- The section on limitations of retrieval-augmented LLMs would benefit from a short summary paragraph at the end that ties the listed limitations back to the taxonomy introduced earlier.
- A small number of citations appear to be preprints or workshop papers; the authors should verify that all references have stable DOIs or arXiv identifiers for long-term accessibility.
Simulated Author's Rebuttal
We thank the referee for the positive evaluation and the recommendation of minor revision. The comments help clarify the presentation of our taxonomy's novelty and improve the transparency of our literature coverage. We address each major comment below and commit to the corresponding revisions.
Point-by-point responses
- Referee: The central claim of an 'innovative taxonomy' (abstract and opening section) would be strengthened by an explicit side-by-side comparison table or subsection contrasting the new taxonomy with at least two prior hallucination or factuality taxonomies from the cited literature; without this, it is difficult to assess what specific distinctions are novel versus incremental.
  Authors: We agree that an explicit comparison would better substantiate the claim of innovation. In the revised manuscript we will add a new subsection (immediately following the presentation of our taxonomy) that includes a side-by-side comparison table with at least two representative prior taxonomies from the cited literature. The table will enumerate core dimensions (e.g., granularity, scope, and LLM-specific considerations) and explicitly note the distinctions introduced by our framework. revision: yes
- Referee: In the detection-methods and benchmarks overview, the absence of a systematic literature-search protocol or inclusion/exclusion criteria (e.g., date range, venues, or keyword strategy) risks selection bias in a fast-moving area; this directly affects the reliability of the 'thorough overview' claim.
  Authors: We acknowledge that documenting the selection process would strengthen the survey's reliability. In the revision we will insert a concise paragraph in the introduction (or a new 'Literature Selection' subsection) that describes the search strategy: primary keywords ('LLM hallucination', 'factuality evaluation', 'hallucination detection'), time window (primarily post-2022), venues considered, and inclusion criteria focused on works that address LLM-specific rather than task-specific hallucinations. This addition will mitigate selection-bias concerns without converting the survey into a formal systematic review. revision: yes
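To make the commitment above concrete, the following is a toy sketch of the stated inclusion criteria as an executable filter; the `Paper` record and the two example entries are hypothetical stand-ins for real bibliographic data, not part of the survey.

```python
from dataclasses import dataclass

# Keywords and time window taken from the rebuttal's stated search strategy.
KEYWORDS = ("llm hallucination", "factuality evaluation", "hallucination detection")


@dataclass
class Paper:
    title: str
    abstract: str
    year: int


def include(paper: Paper) -> bool:
    """Include post-2022 works whose title or abstract matches a survey keyword."""
    text = f"{paper.title} {paper.abstract}".lower()
    return paper.year >= 2022 and any(kw in text for kw in KEYWORDS)


corpus = [
    Paper("Sampling-based LLM hallucination detection", "zero-resource method", 2023),
    Paper("Task-specific summarization study", "faithfulness in abstractive models", 2020),
]
selected = [p for p in corpus if include(p)]  # keeps only the first entry
```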
Circularity Check
No significant circularity in this literature survey
Full rationale
This paper is a survey that organizes existing literature into a taxonomy of LLM hallucinations, reviews contributing factors, detection methods, benchmarks, mitigation strategies, limitations of retrieval-augmented models, and open questions. It contains no equations, derivations, fitted parameters, or predictive claims that could reduce to inputs by construction. The central contribution is a structured overview rather than a self-referential argument, so no load-bearing step reduces to a self-definition, self-citation chain, or renamed input. Standard survey self-citations do not create circularity here.
Axiom & Free-Parameter Ledger
Forward citations
Cited by 31 Pith papers
- Where Does Reasoning Break? Step-Level Hallucination Detection via Hidden-State Transport Geometry
  Hallucination is detected as a transport-cost excursion in hidden-state trajectories, localized via contrastive PCA in a teacher model and distilled to a BiLSTM student.
- GSAR: Typed Grounding for Hallucination Detection and Recovery in Multi-Agent LLMs
  GSAR is a grounding-evaluation framework for multi-agent LLMs that uses a four-way claim typology, evidence-weighted asymmetric scoring, and tiered recovery decisions to detect and mitigate hallucinations.
- Evaluating Tool-Using Language Agents: Judge Reliability, Propagation Cascades, and Runtime Mitigation in AgentProp-Bench
  AgentProp-Bench shows substring judging agrees with humans at kappa=0.049, LLM ensemble at 0.432, bad-parameter injection propagates with ~0.62 probability, rejection and recovery are independent, and a runtime fix cu...
- BibTeX Citation Hallucinations in Scientific Publishing Agents: Evaluation and Mitigation
  Frontier LLMs generate BibTeX entries at 83.6% field accuracy but only 50.9% fully correct; two-stage clibib revision raises accuracy to 91.5% and fully correct entries to 78.3% with 0.8% regression.
- TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching
  TBPO derives a token-level preference optimization objective from sequence-level pairwise data via Bregman divergence ratio matching that generalizes DPO and improves alignment quality.
- Template-as-Ontology: Configurable Synthetic Data Infrastructure for Cross-Domain Manufacturing AI Validation
  A single configuration file generates causally coherent synthetic MES data across domains and guarantees zero tool-parameter hallucination when AI tools are ontology-constrained.
- Low-Cost Black-Box Detection of LLM Hallucinations via Dynamical System Prediction
  A single-pass black-box method models LLM outputs as dynamical systems via Koopman operators to detect hallucinations with claimed state-of-the-art accuracy and lower cost.
- CAST: Mitigating Object Hallucination in Large Vision-Language Models via Caption-Guided Visual Attention Steering
  CAST reduces object hallucination in LVLMs by 6.03% on average across five models and five benchmarks by identifying caption-sensitive attention heads and applying optimized steering directions to their outputs, with ...
- The Surprising Universality of LLM Outputs: A Real-Time Verification Primitive
  LLM token rank-frequency distributions converge to a shared Mandelbrot distribution across models and domains, enabling a microsecond-scale statistical primitive for provenance verification and black-box anomaly triage.
- The Provenance Gap in Clinical AI: Evidence-Traceable Temporal Knowledge Graphs for Rare Disease Reasoning
  HEG-TKG grounds LLM clinical reasoning in hierarchical evidence-based temporal knowledge graphs from 4,512 PubMed records, delivering 100% citation verifiability and error detectability where standard RAG and unprompt...
- When Agents Go Quiet: Output Generation Capacity and Format-Cost Separation for LLM Document Synthesis
  LLM agents avoid output stalling and reduce generation tokens by 48-72% via deferred template rendering guided by Output Generation Capacity and a Format-Cost Separation Theorem.
- EvoRAG: Making Knowledge Graph-based RAG Automatically Evolve through Feedback-driven Backpropagation
  EvoRAG adds a feedback-driven backpropagation step that attributes response quality to individual knowledge-graph triplets and updates the graph to raise reasoning accuracy by 7.34 percent over prior KG-RAG methods.
- Hallucination as Trajectory Commitment: Causal Evidence for Asymmetric Attractor Dynamics in Transformer Generation
  Hallucination is an early trajectory commitment in transformers governed by asymmetric attractor dynamics, with prompt encoding selecting the basin and correction needing multi-step intervention.
- FocalLens: Visualizing Narratives through Focalization
  FocalLens is a new visualization system that captures focalization to display character perceptions, direct/indirect involvement, and narration in narratives, evaluated qualitatively with writers and scholars.
- Phase-Associative Memory: Sequence Modeling in Complex Hilbert Space
  PAM, a complex-valued associative memory model, exhibits steeper power-law scaling in loss and perplexity than a matched real-valued baseline when trained on WikiText-103 from 5M to 100M parameters.
- Memory in the LLM Era: Modular Architectures and Strategies in a Unified Framework
  A unified framework for LLM agent memory is benchmarked, with a new hybrid method outperforming state-of-the-art on standard tasks.
- The Semantic Training Gap: Ontology-Grounded Tool Architectures for Industrial AI Agent Systems
  Ontology-grounded tool architectures eliminate hallucination of domain identifiers in industrial AI agents by enforcing semantic constraints through a typed relational configuration and three-operation interface.
- EmoS: A High-Fidelity Multimodal Benchmark for Fine-grained Streaming Emotional Understanding
  EmoS is a new high-fidelity benchmark for fine-grained streaming emotional understanding that produces measurable gains when used to fine-tune multimodal large language models.
- HalluScan: A Systematic Benchmark for Detecting and Mitigating Hallucinations in Instruction-Following LLMs
  HalluScan benchmark tests hallucination detectors on LLMs, identifies NLI Verification as top performer with 0.88 AUROC, and introduces HalluScore (r=0.41 with humans) plus a routing method for 2x cost savings.
- Grounding Multi-Hop Reasoning in Structural Causal Models via Group Relative Policy Optimization
  SCM-GRPO grounds multi-hop fact verification in structural causal models and applies GRPO reinforcement learning to optimize reasoning chain length, outperforming baselines on HoVer and EX-FEVER.
- Budget-Constrained Online Retrieval-Augmented Generation: The Chunk-as-a-Service Model
  Chunk-as-a-Service with the UCOSA online algorithm enables budget-constrained selection of prompts for chunk enrichment in RAG, outperforming random selection by 52% on a combined performance metric and delivering hig...
- Calibrating Model-Based Evaluation Metrics for Summarization
  A reference-free proxy scoring framework combined with GIRB calibration produces better-aligned evaluation metrics for summarization and outperforms baselines across seven datasets.
- Beyond Literal Summarization: Redefining Hallucination for Medical SOAP Note Evaluation
  Redefining hallucination evaluation for medical SOAP notes to credit clinical reasoning reduces reported hallucination rates from 35% to 9%.
- LLM2Manim: Pedagogy-Aware AI Generation of STEM Animations
  LLM2Manim pipeline generates pedagogy-aware Manim animations for STEM, producing slightly better student post-test scores (83% vs 78%), learning gains (d=0.67), and engagement than PowerPoint in a controlled study.
- Hallucination of Multimodal Large Language Models: A Survey
  The survey organizes causes of hallucinations in MLLMs, reviews evaluation benchmarks and metrics, and outlines mitigation approaches plus open questions.
- Reliable AI Needs to Externalize Implicit Knowledge: A Human-AI Collaboration Perspective
  Reliable AI needs structured Knowledge Objects to externalize and enable human validation of implicit knowledge that current methods cannot verify.
- Grounding Multi-Hop Reasoning in Structural Causal Models via Group Relative Policy Optimization
  The SCM-GRPO framework models multi-hop fact verification as causal inference and applies reinforcement learning to optimize reasoning depth, reporting outperformance on HoVer and EX-FEVER.
- Multi-Agent Collaboration Mechanisms: A Survey of LLMs
  The survey organizes LLM-based multi-agent collaboration mechanisms into a framework with dimensions of actors, types, structures, strategies, and coordination protocols, reviews applications across domains, and ident...
- Understanding the planning of LLM agents: A survey
  A survey that provides a taxonomy of methods for improving planning in LLM-based agents across task decomposition, plan selection, external modules, reflection, and memory.
- Large Language Model based Multi-Agents: A Survey of Progress and Challenges
  The paper surveys LLM-based multi-agent systems, covering simulated domains, agent profiling and communication, mechanisms for capacity growth, and common benchmarks.
- A Survey on Hallucination in Large Vision-Language Models
  This survey reviews the definition, symptoms, evaluation benchmarks, root causes, and mitigation methods for hallucinations in large vision-language models.