An argument paper reframes LLM explainability as an embodied, situated practice based on Dourish and enactivist cognition, identifying ontological obstacles in internal explanations and advocating affordance-based designs.
Large Language Models Help Humans Verify Truthfulness -- Except When They Are Convincingly Wrong
4 Pith papers cite this work. Polarity classification is still indexing.
years
2026 4verdicts
UNVERDICTED 4representative citing papers
HANSEL extracts navigable evidence from agent trajectories with 83.7% precision and 88.8% recall on 45 tasks, reduces volume by 61.6%, and improves verification metrics in a 14-participant study.
Two linked user studies find that LLM rationale correctness and certainty framing affect trust and decision confidence while presentation format does not, and incorrect rationales increase gaze attention and pupil size.
Generative AI evaluation must shift from static benchmark scores to measuring sustained improvements in human capabilities within specific deployment contexts.
citing papers explorer
-
Embodied Explainability and Ontological Obstacles: Why We Struggle to Explain the Answers of Large Language Models (LLMs)
An argument paper reframes LLM explainability as an embodied, situated practice based on Dourish and enactivist cognition, identifying ontological obstacles in internal explanations and advocating affordance-based designs.
-
HANSEL: Extracting Breadcrumbs from Web Agent Trajectories for Interactive Verification
HANSEL extracts navigable evidence from agent trajectories with 83.7% precision and 88.8% recall on 45 tasks, reduces volume by 61.6%, and improves verification metrics in a 14-participant study.
-
When LLM Rationales Become User-Facing: Effects on Trust Perception, Decision-Making, and Gaze Behaviors
Two linked user studies find that LLM rationale correctness and certainty framing affect trust and decision confidence while presentation format does not, and incorrect rationales increase gaze attention and pupil size.
-
Benchmarked Yet Not Measured -- Generative AI Should be Evaluated Against Real-World Utility
Generative AI evaluation must shift from static benchmark scores to measuring sustained improvements in human capabilities within specific deployment contexts.