arxiv: 2605.18352 · v1 · pith:UWBBJBPYnew · submitted 2026-05-18 · 💻 cs.CL

Presupposition and Reasoning in Conditionals: A Theory-Based Study of Humans and LLMs

Pith reviewed 2026-05-20 11:42 UTC · model grok-4.3

classification 💻 cs.CL
keywords presupposition projection · conditionals · pragmatic reasoning · large language models · human judgments · surface pattern matching · linguistic benchmarks

The pith

LLMs that best match human ratings on conditional presuppositions often lack coherent pragmatic reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper compares human and LLM judgments on presupposition projection through conditional sentences, using a normed dataset that varies the link between the antecedent and the projected content. Humans appear to combine probabilistic information with pragmatic considerations when giving likelihood ratings. LLMs align with humans to varying degrees. Models whose outputs most closely resemble human ratings typically show little evidence of using pragmatic principles when their reasoning is examined, while models that demonstrate stronger pragmatic reasoning diverge more from human patterns. Taken together, this dissociation suggests that human-like ratings may come from statistical patterns in training data rather than genuine competence in pragmatics.

Core claim

Humans integrate probabilistic and pragmatic cues in their judgments of presupposition projection in conditionals, whereas LLMs show variable alignment with these patterns. Models that best match human ratings often lack coherent pragmatic reasoning when evaluated with a linguistically motivated checklist, while models with stronger reasoning produce less human-like judgments. These findings indicate that LLM performance on such tasks may result from surface pattern matching rather than pragmatic competence.

What carries the argument

A linguistically motivated checklist applied within an LLM-as-a-Judge framework to evaluate the presence and coherence of pragmatic reasoning in model responses to conditional sentences.
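One way such checklist judgments could be turned into a reasoning score is sketched below. This is an illustration under stated assumptions, not the paper's procedure: the item names in `CHECKLIST` are hypothetical placeholders, the judge model's output is assumed to arrive as one boolean per item, and the score is simply the fraction of items satisfied.

```python
from dataclasses import dataclass

# Hypothetical checklist items; the paper's actual items are not listed here.
CHECKLIST = (
    "identifies_presupposition",
    "relates_antecedent_to_presupposition",
    "uses_world_knowledge",
)


@dataclass
class JudgedResponse:
    model: str
    verdicts: dict  # item name -> bool, as returned by a judge model


def coherence_score(response: JudgedResponse) -> float:
    """Fraction of checklist items the judge marked as satisfied."""
    missing = [item for item in CHECKLIST if item not in response.verdicts]
    if missing:
        raise ValueError(f"missing verdicts for: {missing}")
    satisfied = sum(bool(response.verdicts[item]) for item in CHECKLIST)
    return satisfied / len(CHECKLIST)


def mean_coherence(responses):
    """Average coherence per model over all judged responses."""
    per_model = {}
    for r in responses:
        per_model.setdefault(r.model, []).append(coherence_score(r))
    return {model: sum(s) / len(s) for model, s in per_model.items()}
```

Keeping the verdicts separate from the likelihood rating means a response cannot raise its score merely by restating its rating.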

If this is right

  • Theory-grounded benchmarks are required to distinguish surface-level matching from actual pragmatic competence in language models.
  • Human-like output on presupposition tasks does not imply that models employ the same underlying reasoning processes as people.
  • LLM performance on linguistic judgment tasks may fail to generalize to novel cases outside patterns seen during training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The observed mismatch between rating alignment and reasoning quality may extend to other pragmatic phenomena such as implicature calculation.
  • Model training that rewards explicit reasoning steps could shift the trade-off between human-like outputs and genuine pragmatic competence.
  • Separate evaluation tracks for output similarity and for traceable reasoning would give clearer signals about where current models fall short.

Load-bearing premise

The normed dataset successfully isolates the relation between the antecedent and the projected presupposition without unmeasured confounds that could drive both human and model responses.

What would settle it

Testing whether models given explicit pragmatic reasoning instructions or modules produce more coherent reasoning traces while simultaneously reducing their alignment with human likelihood ratings on the same conditional items.
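The comparison described above can be sketched as a per-model check on two conditions. This is a minimal illustration with hypothetical names (`ConditionResult`, `tradeoff_observed`, `summarize`); it assumes each model already has an alignment score and a checklist coherence score for a baseline run and for a run with explicit pragmatic instructions.

```python
from typing import NamedTuple


class ConditionResult(NamedTuple):
    alignment_rho: float  # Spearman correlation with human mean ratings
    coherence: float      # mean checklist coherence in [0, 1]


def tradeoff_observed(baseline: ConditionResult, instructed: ConditionResult) -> bool:
    """True when instruction raises coherence while lowering human alignment."""
    return (
        instructed.coherence > baseline.coherence
        and instructed.alignment_rho < baseline.alignment_rho
    )


def summarize(results):
    """Map each model name to (delta_rho, delta_coherence, tradeoff flag).

    `results` maps a model name to a (baseline, instructed) pair.
    """
    summary = {}
    for model, (baseline, instructed) in results.items():
        summary[model] = (
            instructed.alignment_rho - baseline.alignment_rho,
            instructed.coherence - baseline.coherence,
            tradeoff_observed(baseline, instructed),
        )
    return summary
```

A flag that is true for most models would be consistent with the trade-off; a flag that is mostly false would count against it.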

Figures

Figures reproduced from arXiv: 2605.18352 by Olessia Jouravlev, Raj Singh, Tara Azin, Yongan Yu.

Figure 1: Human mean Likert scores in the main pre…
Figure 2: Mean Likert scores for human participants…
Figure 3: Human-LLM alignment measured using Spearman's rank correlation (ρ) and mean absolute error (MAE) under with-context and without-context conditions. Higher Spearman values and lower MAE indicate closer alignment between model predictions and human mean Likert scores. (Adjacent section text: "Stronger Theoretical Alignment in Larger Instruction-Tuned Models. Across evaluation dimensions, larger and more heavily instruction-tuned mod…")
Figure 4: System prompts used for likelihood judgment in the without-context and with-context conditions. The…
Figure 5: System prompts used by the LLM-as-a-judge model for checklist-based evaluation in the without-context…
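The two alignment measures named in the Figure 3 caption can be computed per model from paired item-level values. The sketch below uses only the standard library and is an illustration, not the authors' code; it assumes one human mean rating and one model rating per item, and it handles tied ratings by average ranks.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j share one rank
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(human, model):
    """Spearman's rho: Pearson correlation computed on average ranks."""
    return pearson(average_ranks(human), average_ranks(model))


def mean_absolute_error(human, model):
    """Mean absolute difference between paired ratings."""
    return sum(abs(h - m) for h, m in zip(human, model)) / len(human)
```

The two measures answer different questions: ρ checks whether a model orders items the way humans do, while MAE checks how far its ratings sit from the human means on the rating scale.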
Original abstract

Presupposition projection in conditionals is central to theories of meaning and pragmatics, yet it remains largely unevaluated in large language models. We address this gap through a parallel behavioral study comparing human judgments and LLM predictions on a normed dataset of conditional sentences that controls the relation between the antecedent and the projected presupposition. We collect likelihood ratings from 120 participants and four LLMs under matched contextual conditions. Results show that humans integrate probabilistic and pragmatic cues in their judgment, whereas LLMs show variable alignment with human patterns. Using a linguistically motivated checklist within an LLM-as-a-Judge framework, we further evaluate model reasoning. We observe models that best match human ratings often lack coherent pragmatic reasoning, while models with stronger reasoning produce less human-like judgments. These findings suggest that LLMs' performance on such tasks may result from surface pattern matching rather than pragmatic competence. Our findings highlight the importance of benchmarks grounded in linguistic theory for comparing humans and models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper conducts a parallel behavioral study on presupposition projection in conditionals, using a normed dataset that controls the relation between antecedent and projected presupposition. It collects likelihood ratings from 120 human participants and four LLMs under matched conditions, then applies a linguistically motivated checklist inside an LLM-as-Judge framework to evaluate reasoning. The central claim is that models best matching human ratings often lack coherent pragmatic reasoning while stronger reasoners produce less human-like judgments, suggesting LLM performance arises from surface pattern matching rather than pragmatic competence.

Significance. If the dissociation holds after verification, the work is significant for providing a theory-grounded benchmark that distinguishes pattern matching from pragmatic competence in LLMs. The parallel human-LLM design with controlled conditions and the explicit use of linguistic theory for the checklist are strengths that could guide future evaluations of pragmatic abilities.

major comments (2)
  1. §4.2 (LLM-as-Judge checklist): The manuscript does not specify whether the judge model is held out from the set of rated models or detail the exact scoring procedure for checklist items. This is load-bearing for the central claim because if checklist items can be satisfied by the model echoing its own likelihood rating or by surface pattern completion, the observed mismatch no longer isolates pragmatic competence from the cues driving the ratings.
  2. §3.1 and §4.1: No statistical tests, effect sizes, confidence intervals, or participant exclusion criteria are reported for the human likelihood ratings or the LLM-human alignment comparisons. This undermines verification of the claim that humans integrate probabilistic and pragmatic cues while LLMs show variable alignment.
minor comments (2)
  1. [Figure 2] Figure 2 or 3 (whichever shows the rating distributions): axis labels and error bars should be clarified to allow direct visual comparison of human vs. model variance.
  2. [Introduction] The abstract and introduction could add one sentence explicitly stating the number of checklist items and their linguistic motivation to improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments identify two areas where additional clarity and reporting will strengthen the manuscript's claims about the dissociation between human-like ratings and pragmatic reasoning in LLMs. We address each point below and indicate the revisions we will make.

  1. Referee: §4.2 (LLM-as-Judge checklist): The manuscript does not specify whether the judge model is held out from the set of rated models or detail the exact scoring procedure for checklist items. This is load-bearing for the central claim because if checklist items can be satisfied by the model echoing its own likelihood rating or by surface pattern completion, the observed mismatch no longer isolates pragmatic competence from the cues driving the ratings.

    Authors: We agree that the current description in §4.2 is insufficient to fully isolate pragmatic competence. The judge model (Claude-Haiku-4) was in fact held out from the four evaluated models, and checklist items were scored as independent binary judgments of the reasoning trace rather than of the likelihood rating itself. However, these details were not stated explicitly. In the revision we will add: (1) explicit confirmation that the judge is held out, (2) the full prompt template used for each checklist item, and (3) a description of the scoring procedure showing that items target coherence of pragmatic inference (e.g., recognition of accommodation vs. projection) rather than restatement of the rating. We will also include example traces in the appendix to demonstrate the distinction from surface pattern matching. revision: yes

  2. Referee: §3.1 and §4.1: No statistical tests, effect sizes, confidence intervals, or participant exclusion criteria are reported for the human likelihood ratings or the LLM-human alignment comparisons. This undermines verification of the claim that humans integrate probabilistic and pragmatic cues while LLMs show variable alignment.

    Authors: We accept that the lack of inferential statistics weakens the evidential support for the reported patterns. In the revised manuscript we will add, in §3.1, the participant exclusion criteria (failed attention checks or completion time <5 min; final N=120), and report mixed-effects models or appropriate non-parametric tests comparing likelihood ratings across the controlled antecedent-presupposition relations, together with effect sizes and 95% confidence intervals. In §4.1 we will report Pearson or Spearman correlations between human and LLM ratings per model, with bootstrap confidence intervals and statistical tests for differences in alignment across models. These additions will allow readers to verify the integration of probabilistic and pragmatic cues in humans versus the variable alignment in LLMs. revision: yes
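The bootstrap confidence intervals mentioned in the second response could be obtained by resampling items with replacement. The sketch below is one possible version under stated assumptions, not the authors' analysis: the function name `bootstrap_ci`, the percentile method, the 2000 resamples, and the fixed seed are all illustrative choices, and `statistic` can be any function of paired ratings (MAE is used here, and a Spearman function could be passed in the same way).

```python
import random


def mean_absolute_error(human, model):
    """Mean absolute difference between paired ratings."""
    return sum(abs(h - m) for h, m in zip(human, model)) / len(human)


def bootstrap_ci(human, model, statistic, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for statistic(human, model).

    Items are resampled with replacement, keeping each human rating
    paired with the model rating for the same item.
    """
    if len(human) != len(model):
        raise ValueError("human and model ratings must be paired per item")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(human)
    estimates = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        estimates.append(
            statistic([human[i] for i in idx], [model[i] for i in idx])
        )
    estimates.sort()
    lower = estimates[int((alpha / 2) * n_resamples)]
    upper = estimates[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper
```

With a rank correlation as the statistic, a resample can occasionally contain only one distinct value, in which case the correlation is undefined; such draws would need to be skipped or redrawn.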

Circularity Check

0 steps flagged

Empirical behavioral comparison with independent human benchmark shows no circular derivation

full rationale

The paper reports a parallel study collecting human likelihood ratings from 120 participants on a normed dataset of conditionals and comparing them to outputs from four LLMs under matched conditions. It then applies a linguistically motivated checklist inside an LLM-as-Judge framework to assess reasoning. Human ratings function as an external benchmark independent of model outputs; the observed dissociation between rating alignment and checklist-based reasoning coherence is presented as an empirical result rather than a quantity derived from fitted parameters, self-definitions, or self-citation chains. No equations, uniqueness theorems, or ansatzes appear in the provided text that reduce the central claims to the inputs by construction. The derivation chain is therefore self-contained against the external human data.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The study rests on standard linguistic assumptions about presupposition projection and on the validity of likelihood ratings as a measure of human pragmatic inference; no free parameters or new entities are introduced.

axioms (1)
  • domain assumption: A normed dataset can control the relation between antecedent and projected presupposition so that human judgments reflect probabilistic and pragmatic cues.
    The abstract states that the dataset 'controls the relation between the antecedent and the projected presupposition' and that humans integrate both cue types.



