Across 30 LLMs and 205 TLA+ tasks, syntactic correctness reaches at most 26.6% and semantic correctness 8.6%, with all successes limited to progressive prompting and no advantage from larger models.
Cognitive mirage: A review of hallucinations in large language models
6 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
background 3polarities
background 3representative citing papers
LLM judges exhibit high stability under neutral re-evaluation but substantial reversibility under targeted post-decision challenges, quantified via a new Evaluation Robustness Score (ERS).
This paper introduces a systems-level conceptual framing and a three-level taxonomy (intra-model, system-level, socio-technical) for uncertainty propagation in compound LLM applications, along with engineering insights and open challenges.
RRCM trains an LLM to dynamically retrieve from collaborative and meta memories using group relative policy optimization driven by final top-k recommendation quality.
Trajel introduces a five-type taxonomy and benchmark for trajectory-level hallucinations in multi-agent LLM workflows, showing existing final-answer benchmarks miss common failures.
A systematic review of memory designs, evaluation methods, applications, limitations, and future directions for LLM-based agents.
citing papers explorer
-
Stability vs. Manipulability: Evaluating Robustness Under Post-Decision Interaction in LLM Judges
LLM judges exhibit high stability under neutral re-evaluation but substantial reversibility under targeted post-decision challenges, quantified via a new Evaluation Robustness Score (ERS).
-
Uncertainty Propagation in LLM-Based Systems
This paper introduces a systems-level conceptual framing and a three-level taxonomy (intra-model, system-level, socio-technical) for uncertainty propagation in compound LLM applications, along with engineering insights and open challenges.
-
RRCM: Ranking-Driven Retrieval over Collaborative and Meta Memories for LLM Recommendation
RRCM trains an LLM to dynamically retrieve from collaborative and meta memories using group relative policy optimization driven by final top-k recommendation quality.
-
Beyond Final Answers: Auditing Trajectory-Level Hallucinations in Multi-Agent Industrial Workflows
Trajel introduces a five-type taxonomy and benchmark for trajectory-level hallucinations in multi-agent LLM workflows, showing existing final-answer benchmarks miss common failures.