Large Language Models Help Humans Verify Truthfulness -- Except When They Are Convincingly Wrong

Chenglei Si, Navita Goyal, Tongshuang Wu, Chen Zhao, Shi Feng, Hal Daumé III, Jordan Boyd-Graber · 2024 · DOI 10.18653/v1/2024.naacl-long.81

6 Pith papers cite this work. Polarity classification is still indexing.


representative citing papers

Embodied Explainability and Ontological Obstacles: Why We Struggle to Explain the Answers of Large Language Models (LLMs)

cs.HC · 2026-06-22 · unverdicted · novelty 7.0

An argument paper reframes LLM explainability as an embodied, situated practice based on Dourish and enactivist cognition, identifying ontological obstacles in internal explanations and advocating affordance-based designs.

HANSEL: Extracting Breadcrumbs from Web Agent Trajectories for Interactive Verification

cs.HC · 2026-06-17 · unverdicted · novelty 6.0

HANSEL extracts navigable evidence from agent trajectories with 83.7% precision and 88.8% recall on 45 tasks, reduces volume by 61.6%, and improves verification metrics in a 14-participant study.

Explaining Too Much? Understanding How Large Language Model Reasoning Traces Influence Performance and Metacognition

cs.HC · 2026-05-25 · conditional · novelty 6.0

Summary reasoning traces from LLMs maintain task performance and increase trust and appeal relative to answer-only or full-trace conditions, but none of the formats improve users' metacognitive calibration on reasoning tasks.

When LLM Rationales Become User-Facing: Effects on Trust Perception, Decision-Making, and Gaze Behaviors

cs.HC · 2026-06-24 · unverdicted · novelty 5.0

Two linked user studies find that LLM rationale correctness and certainty framing affect trust and decision confidence while presentation format does not, and incorrect rationales increase gaze attention and pupil size.

The Decision to Verify: How Warmth and User Characteristics Shape Reliance on Conversational Agents for Information Search

cs.HC · 2026-05-27 · unverdicted · novelty 4.0

An experiment finds that overreliance on chatbots persists in hybrid AI-plus-web-search setups and is driven primarily by user characteristics rather than answer properties, with warmth increasing agreement on incorrect answers.

Benchmarked Yet Not Measured -- Generative AI Should be Evaluated Against Real-World Utility

cs.LG · 2026-05-07 · unverdicted · novelty 4.0 · 2 refs

Generative AI evaluation must shift from static benchmark scores to measuring sustained improvements in human capabilities within specific deployment contexts.
