An argument paper reframes LLM explainability as an embodied, situated practice based on Dourish and enactivist cognition, identifying ontological obstacles in internal explanations and advocating affordance-based designs.
Large Language Models Help Humans Verify Truthfulness -- Except When They Are Convincingly Wrong
6 Pith papers cite this work.
years: 2026 (6)
representative citing papers
HANSEL extracts navigable evidence from agent trajectories with 83.7% precision and 88.8% recall on 45 tasks, reduces volume by 61.6%, and improves verification metrics in a 14-participant study.
Summary reasoning traces from LLMs maintain task performance and increase trust and appeal relative to answer-only or full-trace conditions, but none of the formats improve users' metacognitive calibration on reasoning tasks.
Two linked user studies find that LLM rationale correctness and certainty framing affect trust and decision confidence while presentation format does not, and incorrect rationales increase gaze attention and pupil size.
An experiment finds that overreliance on chatbots persists in hybrid AI-plus-web-search setups and is driven primarily by user characteristics rather than answer properties, with warmth increasing agreement on incorrect answers.
Generative AI evaluation must shift from static benchmark scores to measuring sustained improvements in human capabilities within specific deployment contexts.
citing papers explorer
- Benchmarked Yet Not Measured -- Generative AI Should be Evaluated Against Real-World Utility