Large Language Models Help Humans Verify Truthfulness -- Except When They Are Convincingly Wrong
6 Pith papers cite this work.
Citing papers by year: 2026 (6)
Citing papers
- Embodied Explainability and Ontological Obstacles: Why We Struggle to Explain the Answers of Large Language Models (LLMs)
  An argument paper reframes LLM explainability as an embodied, situated practice based on Dourish and enactivist cognition, identifying ontological obstacles in internal explanations and advocating affordance-based designs.
- HANSEL: Extracting Breadcrumbs from Web Agent Trajectories for Interactive Verification
  HANSEL extracts navigable evidence from agent trajectories with 83.7% precision and 88.8% recall on 45 tasks, reduces volume by 61.6%, and improves verification metrics in a 14-participant study.
- Explaining Too Much? Understanding How Large Language Model Reasoning Traces Influence Performance and Metacognition
  Summary reasoning traces from LLMs maintain task performance and increase trust and appeal relative to answer-only or full-trace conditions, but none of the formats improve users' metacognitive calibration on reasoning tasks.
- When LLM Rationales Become User-Facing: Effects on Trust Perception, Decision-Making, and Gaze Behaviors
  Two linked user studies find that LLM rationale correctness and certainty framing affect trust and decision confidence while presentation format does not, and incorrect rationales increase gaze attention and pupil size.
- The Decision to Verify: How Warmth and User Characteristics Shape Reliance on Conversational Agents for Information Search
  An experiment finds that overreliance on chatbots persists in hybrid AI-plus-web-search setups and is driven primarily by user characteristics rather than answer properties, with warmth increasing agreement on incorrect answers.
- Benchmarked Yet Not Measured -- Generative AI Should be Evaluated Against Real-World Utility
  Generative AI evaluation must shift from static benchmark scores to measuring sustained improvements in human capabilities within specific deployment contexts.