arxiv: 2605.05401 · v1 · submitted 2026-05-06 · 💻 cs.HC

Recognition: unknown

Why Someone Asked "Why": Foil Inference in Human and LLM Question Interpretation

Britt Besch, Tobias Gerstenberg

Authors on Pith no claims yet

Pith reviewed 2026-05-08 15:52 UTC · model grok-4.3

classification 💻 cs.HC

keywords foil inferencewhy-questionshindsight expectationcontrastive explanationsquestion interpretationLLM pragmaticshuman-AI dialogue

0 comments

The pith

People infer the unmentioned contrast in a why-question from what the asker would find surprising in hindsight.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how listeners figure out the intended foil, the alternative outcome that did not happen, when someone asks why about an event. Through vignette experiments, it measures prior expectations about what would happen, closeness to the actual outcome, hindsight expectations about what could have surprised the asker, and direct foil selections. Hindsight expectations turn out to be the strongest predictor of which foil people choose. This matters because accurate foil inference shapes whether explanations address the right contrast in both human conversations and AI interactions. The study also tests large language models on the same task and finds their expectation judgments do not consistently align with their foil choices.

Core claim

Participants selected the intended foil for why-questions in close alignment with their hindsight expectation judgments, which capture what they believed the asker would have expected to happen differently once the actual outcome was known. This alignment was stronger than the link to prior expectation judgments or closeness judgments. The results indicate that foil inference relies on reconstructing the asker's retrospective sense of surprise rather than prospective beliefs or similarity to the observed event.

What carries the argument

Hindsight expectation judgments that measure what the question asker would find surprising given the known outcome.

If this is right

Explanations become more relevant when they target the contrast that hindsight expectations identify as surprising to the asker.
Dialogue systems can generate better answers to why-questions by first estimating what the asker would have found unexpected after the fact.
Large language models require additional alignment between their expectation modeling and foil selection to match human patterns.
Training data for pragmatic reasoning should include explicit hindsight surprise signals when teaching contrastive question interpretation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same hindsight mechanism could be tested in live multi-turn dialogues to check whether vignette results generalize beyond written scenarios.
LLM performance might improve if models are fine-tuned to output hindsight expectation scores before selecting a foil.
This inference route may intersect with counterfactual reasoning tasks where agents must explain events by ruling out alternatives that would not have surprised the observer.

Load-bearing premise

That the explicit judgments participants provide in controlled vignette tasks accurately reflect the implicit processes used to infer foils in natural conversation.

What would settle it

A new set of vignettes in which foil selections correlate more strongly with prior expectation or closeness judgments than with hindsight expectation judgments would falsify the central claim.

Figures

Figures reproduced from arXiv: 2605.05401 by Britt Besch, Tobias Gerstenberg.

**Figure 1.** Figure 1: Experiment overview and Scenario 1 data. (a) Experiment procedure. All participants read the same vignettes and were assigned to either one of the three predictor conditions (‘prior expectation’, ‘closeness’, or ‘hindsight expectation’) or the ‘foil selection’ condition. (b) Experiment 1 (Humans) results for Scenario 1 with three context variations. Bars show percentage of participants selecting each optio… view at source ↗

**Figure 2.** Figure 2: Experiment 1 (Humans). Scatterplots of the percentage with which each response option of a vignette was selected in the predictor conditions, and the foil selection condition. Error bars are 95% bootstrapped CIs view at source ↗

**Figure 3.** Figure 3: Experiment 1 (Humans). Participant selections (bars) for all vignettes and model predictions of linear models with different predictors (points). Error bars on bars are 95% bootstrapped confidence intervals. constrained generation to a single token so that the output was a single choice label. No conversational memory was used. Results Data analyses for Experiment 2 mirror those of Experiment 1. Linear reg… view at source ↗

**Figure 4.** Figure 4: Experiment 2 (LLMs). LLM and participants foil selections for all vignettes. Error bars are 95% bootstrapped CIs. implies that LLMs struggle with inferring the implicit meaning of why-questions when the foil is unstated. General discussion How do we figure out what implicit contrast someone had in mind when they asked a why-question? The experiments reported here support a view of foil inference as a soci… view at source ↗

read the original abstract

Explanations are inherently contrastive: E happened rather than E' because of C rather than C'. However, these contrasts, or "foils", are rarely mentioned explicitly but have to be inferred in context. Here, we investigate how people select the intended foil E' of a why-question. Participants read vignettes and judged, for each foil, their prior expectation (what will happen next), closeness (what is most similar to what happened), and hindsight expectation (what could have happened instead), as well as which foil they thought the question asker had in mind when they asked the why-question. We found that foil selections were best predicted by hindsight expectation judgments. This suggests that people infer the foil by considering what a question asker finds surprising after the outcome occurred. Since correct foil selection is relevant not only in human-human interaction but also increasingly in dialogues with large language models, we investigated their performance on the same task. The coupling between LLMs' explicit expectation judgments and their foil selections is inconsistent.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Hindsight expectation predicts foil selection better than prior or closeness in the experiments, but the design leaves open whether those ratings capture the actual inference process or just task artifacts.

read the letter

The main takeaway is that participants' hindsight expectation ratings (what could have happened instead, after seeing the outcome) lined up best with which foil they attributed to the question asker, outperforming prior expectation and closeness. The LLM part shows messier coupling between their judgments and foil choices. That's the concrete empirical result the paper delivers on why-questions and contrastive explanation.

Referee Report

3 major / 2 minor

Summary. The manuscript reports an empirical study on how humans and LLMs infer the intended foil (alternative contrast) in why-questions. Using vignettes, human participants rate foils on prior expectation (pre-outcome), closeness (similarity to outcome), hindsight expectation (post-outcome alternatives), and select the foil they believe the asker had in mind. The central claim is that hindsight expectation judgments best predict foil selections, suggesting inference via post-outcome surprise. The study also tests LLMs on the same task and reports inconsistent coupling between their explicit judgments and foil selections.

Significance. If the central result holds, the work advances understanding of contrastive reasoning and pragmatic inference in explanations, with direct relevance to improving LLM handling of why-questions in human-AI dialogue. A clear strength is the multi-predictor design that pits prior expectation, closeness, and hindsight against each other on the same items, enabling a comparative test rather than isolated measures. The LLM extension is also valuable for applied HCI. Significance is limited by the absence of validation that explicit ratings map to implicit conversational processes.

major comments (3)

[Results] Results section: The claim that hindsight expectation judgments best predict foil selections is presented without sample sizes, statistical tests (e.g., regression or correlation coefficients, p-values), effect sizes, or controls for order/vignette effects. This makes it impossible to evaluate whether hindsight truly outperforms the other predictors or whether the result is robust.
[Methods] Methods: All measures (prior expectation, closeness, hindsight expectation, and foil selection) are elicited sequentially from the same participants on identical vignettes. This design cannot distinguish whether hindsight judgments causally drive foil inference or whether both are parallel downstream products of reading the vignette and performing the task, weakening the interpretation that people infer foils via post-outcome surprise.
[LLM Experiments] LLM experiments: The finding of inconsistent coupling between LLMs' expectation judgments and foil selections lacks quantitative metrics (e.g., per-model correlations or accuracy differences), model specifications, and prompt details. Without these, the claim cannot be assessed for reliability or compared to the human data.

minor comments (2)

[Abstract] Abstract: Omits participant count, vignette count, and analysis methods, which would allow readers to gauge the empirical scope immediately.
[Methods] The operational definitions of 'closeness' and 'hindsight expectation' would benefit from a concrete vignette example in the main text to clarify distinctions for readers.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments on our manuscript. We address each of the major comments below and indicate the revisions made to strengthen the paper.

read point-by-point responses

Referee: [Results] Results section: The claim that hindsight expectation judgments best predict foil selections is presented without sample sizes, statistical tests (e.g., regression or correlation coefficients, p-values), effect sizes, or controls for order/vignette effects. This makes it impossible to evaluate whether hindsight truly outperforms the other predictors or whether the result is robust.

Authors: We agree that the original manuscript did not provide sufficient statistical details in the Results section. We have revised the manuscript to include the participant sample size, Pearson correlation coefficients between each predictor and foil selection (hindsight expectation showing the strongest association), multiple regression analyses with coefficients, p-values, and effect sizes, as well as mixed-effects models controlling for vignette and order effects. These additions demonstrate the robustness of the finding that hindsight expectation best predicts foil selections. revision: yes
Referee: [Methods] Methods: All measures (prior expectation, closeness, hindsight expectation, and foil selection) are elicited sequentially from the same participants on identical vignettes. This design cannot distinguish whether hindsight judgments causally drive foil inference or whether both are parallel downstream products of reading the vignette and performing the task, weakening the interpretation that people infer foils via post-outcome surprise.

Authors: We acknowledge that the sequential within-subjects design does not permit strong causal inferences regarding the role of hindsight expectations in driving foil selection. This was a deliberate choice to enable direct comparison of the predictors using the same vignettes and participants. In the revised manuscript, we have expanded the Discussion to explicitly address this limitation, clarifying that our results show a strong predictive relationship rather than a causal mechanism. We also suggest directions for future research involving experimental manipulations to test causality. revision: yes
Referee: [LLM Experiments] LLM experiments: The finding of inconsistent coupling between LLMs' expectation judgments and foil selections lacks quantitative metrics (e.g., per-model correlations or accuracy differences), model specifications, and prompt details. Without these, the claim cannot be assessed for reliability or compared to the human data.

Authors: We agree that the LLM section required more detail for proper evaluation. The revised manuscript now includes the specific LLMs tested with their versions, the full prompt templates used for eliciting judgments and selections, and quantitative metrics such as per-model correlation coefficients between explicit judgments and foil selections, as well as comparisons of selection accuracy to human performance. These revisions enable direct assessment and comparison with the human results. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical study with independent data collection and statistical findings

full rationale

This paper reports an empirical psychology experiment in which participants provide multiple explicit judgments (prior expectation, closeness, hindsight expectation, foil selection) on the same set of vignettes. The central result—that hindsight expectation judgments best predict foil selections—is a statistical outcome from participant data rather than a mathematical derivation, fitted parameter, or self-referential definition. No equations, ansatzes, uniqueness theorems, or load-bearing self-citations appear in the abstract or described methods. The design collects all measures from the same individuals, which raises questions of validity and task demand, but does not constitute circularity under the specified patterns because the finding is not forced by construction or by renaming inputs as outputs. The study is self-contained against its own external participant data.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The work relies on standard assumptions of vignette-based psychology experiments rather than new axioms or entities.

axioms (2)

domain assumption Participants' explicit ratings of expectation and foil selection reflect their implicit inference processes in natural language understanding.
Invoked when treating the judgment tasks as direct measures of foil inference.
domain assumption The selected vignettes create realistic why-questions whose intended foils can be reliably judged by readers.
Required for the experimental paradigm to generalize beyond the lab materials.

pith-pipeline@v0.9.0 · 5469 in / 1396 out tokens · 42075 ms · 2026-05-08T15:52:30.601745+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

67 extracted references · 9 canonical work pages · 1 internal anchor

[1]

jsPsych: Enabling an open-source collaborative ecosystem of behavioral experiments , volume =

De Leeuw, Joshua R and Gilbert, Rebecca A and Luchterhandt, Bj. jsPsych: Enabling an open-source collaborative ecosystem of behavioral experiments , volume =. Journal of Open Source Software , number =
[2]

Royal Institute of Philosophy Supplement , volume=

Contrastive Explanation , author=. Royal Institute of Philosophy Supplement , volume=. 1990 , doi=

1990
[3]

Cognition , volume=

Perceived similarity of imagined possible worlds affects judgments of counterfactual plausibility , author=. Cognition , volume=. 2021 , doi=

2021
[4]

1995 , month = feb, day =

Congressional Record: Senate, page S2457 , howpublished =. 1995 , month = feb, day =

1995
[5]

and Saxe, Rebecca and Tenenbaum, Joshua B

Baker, Chris L. and Saxe, Rebecca and Tenenbaum, Joshua B. , title =. Cognition , year =
[6]

The Philosophical Quarterly , year =

Barnes, Annette , title =. The Philosophical Quarterly , year =
[7]

Using contrastive inferences to learn about new words and categories , volume =

Bergey, Claire and Yurovsky, Daniel , year =. Using contrastive inferences to learn about new words and categories , volume =. Cognition , doi =
[8]

arXiv , year =

Bigelow, John and Morris, Adam and Gerstenberg, Tobias , title =. arXiv , year =
[9]

Cognition , year =

Brigard, Felipe , title =. Cognition , year =
[10]

and Gerstenberg, Tobias , title =

Brockbank, Erik and Yang, Justin and Govil, Mishika and Fan, Judith E. and Gerstenberg, Tobias , title =. Proceedings of the 46th Annual Conference of the Cognitive Science Society , year =
[11]

Contrastive Explanations That Anticipate Human Misconceptions Can Improve Human Decision-Making Skills , year =

Bu. Contrastive Explanations That Anticipate Human Misconceptions Can Improve Human Decision-Making Skills , year =. doi:10.1145/3706598.3713229 , booktitle =

work page doi:10.1145/3706598.3713229
[12]

Cognitive Computation , year =

Castelnovo, Alessandro and Crupi, Riccardo and Mombelli, Nicolò and Nanino, Gabriele and Regoli, Daniele , title =. Cognitive Computation , year =
[13]

Proceedings of the CHI Conference on Human Factors in Computing Systems , year =

Chandra, Rohan and others , title =. Proceedings of the CHI Conference on Human Factors in Computing Systems , year =
[14]

Cognitive Psychology , year =

Chin-Parker, Stacey and Cantelon, Jessica , title =. Cognitive Psychology , year =
[15]

Psychonomic Bulletin & Review , year =

Coenen, Anna and Rehder, Bob and Ware, Emily , title =. Psychonomic Bulletin & Review , year =
[16]

Annual Review of Linguistics , year =

Degen, Judith , title =. Annual Review of Linguistics , year =
[17]

Social Contract

Fr. Social Contract. arXiv preprint arXiv:2310.17769 , year =

work page arXiv
[18]

Human-like Affective Cognition in Foundation Models , journal =

Gandhi, Kanishk and Lynch, Zoe and Fr. Human-like Affective Cognition in Foundation Models , journal =
[19]

Journal of Experimental Psychology: General , year =

Gerstenberg, Tobias and Icard, Thomas , title =. Journal of Experimental Psychology: General , year =
[20]

and Chater, Nick and Tenenbaum, Joshua B

Griffiths, Thomas L. and Chater, Nick and Tenenbaum, Joshua B. , title =
[21]

Contemporary Science and Natural Explanation , editor =

Hesslow, Germund , title =. Contemporary Science and Natural Explanation , editor =. 1988 , pages =

1988
[22]

, title =

Hilton, Denis J. , title =. Psychological Bulletin , year =
[23]

Philosophy of Science , year =

Hitchcock, Christopher , title =. Philosophy of Science , year =
[24]

and Abel, David and Correa, Carlos G

Ho, Mark K. and Abel, David and Correa, Carlos G. and Littman, Michael L. and Cohen, Jonathan D. and Griffiths, Thomas L. , title =. Nature , year =
[25]

, title =

Jin, X. , title =. arXiv preprint , year =
[26]

, title =

Kahneman, Daniel and Varey, Carol A. , title =. Journal of Personality and Social Psychology , year =
[27]

A Survey of Algorithmic Recourse: Contrastive Explanations and Consequential Recommendations , journal =

Karimi, Amir-Hossein and Barthe, Gilles and Sch. A Survey of Algorithmic Recourse: Contrastive Explanations and Consequential Recommendations , journal =. 2022 , volume =

2022
[28]

Cognitive Science , year =

Keshmirian, Anita and Willig, Moritz and Hemmatian, Babak and Hahn, Ulrike and Kersting, Kristian and Gerstenberg, Tobias , title =. Cognitive Science , year =
[29]

Cognitive Science , year =

Kirfel, Lara and Harding, Jacqueline and Shin, Jeong and Xin, Cindy and Icard, Thomas and Gerstenberg, Tobias , title =. Cognitive Science , year =
[30]

Journal of Experimental Psychology: General , year =

Kirfel, Lara and Icard, Thomas and Gerstenberg, Tobias , title =. Journal of Experimental Psychology: General , year =
[31]

Cognitive Psychology , year =

Lombrozo, Tania , title =. Cognitive Psychology , year =
[32]

, title =

Poesia Reis e Silva, Gabriel and Goodman, Noah D. , title =. Proceedings of the 44th Annual Conference of the Cognitive Science Society , year =
[33]

and Josephson, Susan G

Josephson, John R. and Josephson, Susan G. , title =
[34]

and Erb, Hans-Peter , title =

Hilton, Denis J. and Erb, Hans-Peter , title =. Thinking & Reasoning , year =
[35]

Proceedings of the 30th ACM Conference on User Modeling, Adaptation and Personalization , year =

Riveiro, Maria and Thill, Serge , title =. Proceedings of the 30th ACM Conference on User Modeling, Adaptation and Personalization , year =
[36]

and Phillips, Jonathan , title =

Kominsky, Jonathan F. and Phillips, Jonathan , title =. Cognitive Science , year =
[37]

, title =

Krarup, Benjamin and Krivic, Senka and Magazzeni, Daniele and Long, Derek and Cashmore, Michael and Smith, David E. , title =. Journal of Artificial Intelligence Research , year =
[38]

Revisiting the Evaluation of Theory of Mind through Question Answering , booktitle =

Le, Matthew and Boureau, Y. Revisiting the Evaluation of Theory of Mind through Question Answering , booktitle =. 2019 , pages =

2019
[39]

IEEE Transactions on Knowledge and Data Engineering , year =

Malandri, Luca and Guidotti, Riccardo and Monreale, Anna and Turini, Franco and Pedreschi, Dino , title =. IEEE Transactions on Knowledge and Data Engineering , year =
[40]

and Klein, Jill Gabrielle , title =

McGill, Ann L. and Klein, Jill Gabrielle , title =. Journal of Personality and Social Psychology , year =
[41]

and Tenbrunsel, Ann E

McGill, Ann L. and Tenbrunsel, Ann E. , title =. Journal of Personality and Social Psychology , year =
[42]

Artificial Intelligence , year =

Miller, Tim , title =. Artificial Intelligence , year =
[43]

Journal of Artificial Intelligence Research , year =

Miller, Tim , title =. Journal of Artificial Intelligence Research , year =
[44]

Advances in Neural Information Processing Systems , year =

Nie, Allen and Zhang, Yuhui and Amdekar, Atharva and Piech, Chris and Hashimoto, Tatsunori and Gerstenberg, Tobias , title =. Advances in Neural Information Processing Systems , year =
[45]

Proceedings of the National Academy of Sciences , year =

Phillips, Jonathan and Cushman, Fiery , title =. Proceedings of the National Academy of Sciences , year =
[46]

Mind & Language , year =

Phillips, Jonathan and Knobe, Joshua , title =. Mind & Language , year =
[47]

The Philosophical Review , year =

Schaffer, Jonathan , title =. The Philosophical Review , year =
[48]

Legal Theory , year =

Schaffer, Jonathan , title =. Legal Theory , year =
[49]

Journal for the Theory of Social Behaviour , volume=

Remote causes, bad explanations? , author=. Journal for the Theory of Social Behaviour , volume=. 2002 , url =

2002
[50]

1988 , pages=

The Pragmatic Theory of Explanation , author =. 1988 , pages=

1988
[51]

Trends in Cognitive Sciences , volume =

Pragmatic Language Interpretation as Probabilistic Inference , author =. Trends in Cognitive Sciences , volume =
[52]

Proceedings of the Annual Meeting of the Cognitive Science Society , year =

Cooperative Explanation as Rational Communication , author =. Proceedings of the Annual Meeting of the Cognitive Science Society , year =
[53]

Trends in Cognitive Sciences , volume =

The Structure and Function of Explanations , author =. Trends in Cognitive Sciences , volume =
[54]

The Oxford Handbook of Thinking and Reasoning , editor =

Explanation and Abductive Inference , author =. The Oxford Handbook of Thinking and Reasoning , editor =
[55]

Prompting Contrastive Explanations for Commonsense Reasoning Tasks

Paranjape, Bhargavi and Michael, Julian and Ghazvininejad, Marjan and Hajishirzi, Hannaneh and Zettlemoyer, Luke. Prompting Contrastive Explanations for Commonsense Reasoning Tasks. Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021. 2021. doi:10.18653/v1/2021.findings-acl.366

work page doi:10.18653/v1/2021.findings-acl.366 2021
[56]

2026 , issn =

Large language models are contrastive reasoners , journal =. 2026 , issn =. doi:https://doi.org/10.1016/j.eswa.2025.130407 , url =

work page doi:10.1016/j.eswa.2025.130407 2026
[57]

Toward Structured Knowledge Reasoning: Contrastive Retrieval-Augmented Generation on Experience

Gu, Jiawei and Xian, Ziting and Xie, Yuanzhen and Liu, Ye and Liu, Enjie and Zhong, Ruichao and Gao, Mochi and Tan, Yunzhi and Hu, Bo and Li, Zang. Toward Structured Knowledge Reasoning: Contrastive Retrieval-Augmented Generation on Experience. Findings of the Association for Computational Linguistics: ACL 2025. 2025. doi:10.18653/v1/2025.findings-acl.1224

work page doi:10.18653/v1/2025.findings-acl.1224 2025
[58]

Eliciting Critical Reasoning in Retrieval-Augmented Generation via Contrastive Explanations

Ranaldi, Leonardo and Valentino, Marco and Freitas, Andre. Eliciting Critical Reasoning in Retrieval-Augmented Generation via Contrastive Explanations. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). 2025. doi:10.18653/v1/2025.n...

work page doi:10.18653/v1/2025.naacl-long.557 2025
[59]

Proceedings of the Annual Meeting of the Cognitive Science Society , volume=

The shape of option generation in open-ended decision problems , author=. Proceedings of the Annual Meeting of the Cognitive Science Society , volume=
[60]

, author=

Reconciling truthfulness and relevance as epistemic and decision-theoretic utility. , author=. Psychological review , volume=. 2024 , publisher=

2024
[61]

arXiv preprint arXiv:2505.03732 , year=

A communication-first account of explanation , author=. arXiv preprint arXiv:2505.03732 , year=

work page arXiv
[62]

Cognitive Science , volume =

Colombo, Matteo , title =. Cognitive Science , volume =. doi:https://doi.org/10.1111/cogs.12340 , url =. https://onlinelibrary.wiley.com/doi/pdf/10.1111/cogs.12340 , year =

work page doi:10.1111/cogs.12340
[63]

Psychonomic Bulletin & Review , volume=

A contrastive account of explanation generation , author=. Psychonomic Bulletin & Review , volume=. 2017 , publisher=

2017
[64]

Cognitive processing , volume=

Background shifts affect explanatory style: How a pragmatic theory of explanation accounts for background effects in the generation of explanations , author=. Cognitive processing , volume=. 2010 , publisher=

2010
[65]

The Llama 3 Herd of Models

The Llama 3 Herd of Models , author=. arXiv preprint arXiv:2407.21783 , url=. 2024 , eprint=

work page internal anchor Pith review arXiv 2024
[66]

2023 , eprint=

Mistral 7B , author=. 2023 , eprint=

2023
[67]

2025 , eprint=

Qwen2.5 Technical Report , author=. 2025 , eprint=

2025