Introduces a unified span-level hallucination detection benchmark over code, tool output, and documents; a fine-tuned Qwen3.5-2B reaches 0.689 span-F1 and outperforms baselines, including on code-agent data.
When Models Lie, We Learn: Multilingual Span-Level Hallucination Detection with PsiloQA
3 Pith papers cite this work. Polarity classification is still indexing.
years
2026 (3)
verdicts
UNVERDICTED (3)
representative citing papers
OCC-RAG develops task-specialized SLMs (0.6B and 1.7B) via a new synthetic data pipeline for multi-hop reasoning and context faithfulness, claiming to match or exceed general models 2-6x larger on HotpotQA, MuSiQue, TAT-QA, ConFiQA, and MuSiQue-Un.
citing papers explorer
- Automatic Layer Selection for Hallucination Detection
FEPoID automatically selects optimal or near-optimal intermediate layers for hallucination detection across LLM architectures and tasks, outperforming prior criteria and baselines, with an added truncation step that further improves performance.