ISBN 979-8-89176-251-0
7 Pith papers cite this work. Polarity classification is still indexing.
verdicts: UNVERDICTED 7
roles: background 1
polarities: background 1
citing papers explorer
- PhantomBench: Benchmarking the Non-existential Threat of Language Models
  PhantomBench is a new benchmark of 60K+ non-existent terms showing language models hallucinate at rates up to 86.7 percent even when inputs assume the concepts exist.
- Code-on-Graph: Iterative Programmatic Reasoning via Large Language Models on Knowledge Graphs
  Code-on-Graph lets LLMs turn retrieved KG facts into Python class instances and generate executable code for reasoning, outperforming prior LLM-KG methods by up to 10.5% on WebQSP, CWQ, and GrailQA.
- Decision Potential Surface: A Theoretical and Practical Approximation of Large Language Model Decision Boundary
  Defines Decision Potential Surface (DPS) whose zero isohypse equals an LLM decision boundary and supplies a K-sample approximation algorithm with derived upper bounds on absolute, expected, and concentration errors.
- Localizing RL-Induced Tool Use to a Single Crosscoder Feature
  Dedicated Feature Crosscoders localize RL-induced tool use to a compact feature set in Qwen2.5-3B, yielding +31.1 pp tool correctness gains and +6.8 pp spillover to the base model.
- Staying In Character: Perspective-Bounded Memory For Book-Based Role-Playing Agents
  REVERIEMEM is a three-layer perspective-bounded memory system that raises knowledge boundary fidelity by 34.6 points and wins ~79% of narrative comparisons on a new book-based role-playing benchmark.
- A Case Study on the Impact of Anonymization Along the RAG Pipeline
  Anonymization placement in RAG, at the dataset or at the generated answer, creates observable differences in privacy protection versus response utility.
- An End-to-End Ukrainian RAG for Local Deployment. Optimized Hybrid Search and Lightweight Generation
  A two-stage hybrid search pipeline paired with a synthetic-data fine-tuned and compressed Ukrainian language model delivers competitive local question answering under strict compute limits.