pith. machine review for the scientific record.

arxiv: 2603.20133 · v2 · submitted 2026-03-20 · 💻 cs.CL

Recognition: no theorem link

Reasoning Gets Harder for LLMs Inside A Dialogue

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 08:07 UTC · model grok-4.3

classification 💻 cs.CL
keywords LLM reasoning · task-oriented dialogue · multi-turn evaluation · benchmark construction · performance gap · arithmetic reasoning · spatial reasoning · temporal reasoning

The pith

LLMs show a substantial and consistent drop in reasoning performance when tasks are embedded inside multi-turn dialogues rather than presented in isolation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that current reasoning benchmarks, which test models on isolated problems, overestimate LLM capabilities in realistic task-oriented dialogue settings, where reasoning must occur alongside text generation and strict adherence to role, format, and style instructions. To demonstrate this, the authors introduce the BOULDER benchmark, which presents eight travel-related tasks involving arithmetic, spatial, and temporal reasoning in matched isolated and dialogue-based formats. Experiments across eight LLMs reveal a clear performance gap that ablations attribute primarily to the multi-turn structure of dialogue, with secondary contributions from role conditioning and tool-use demands. This matters for anyone relying on benchmarks to predict how models will behave in interactive applications such as travel assistants or customer-service agents.
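The gap described above is, at bottom, a paired comparison: the same problem instances scored once in each framing, aggregated per model. A minimal sketch of that bookkeeping (model names and accuracy values below are invented for illustration, not taken from the paper):

```python
# Hypothetical per-model accuracies on the same problem instances,
# scored once in the isolated framing and once inside a dialogue.
results = {
    "model-a": {"isolated": 0.82, "dialogue": 0.61},
    "model-b": {"isolated": 0.74, "dialogue": 0.58},
}

# The "performance gap" is the drop from isolated to dialogue framing.
gaps = {m: r["isolated"] - r["dialogue"] for m, r in results.items()}
for model, gap in sorted(gaps.items(), key=lambda kv: -kv[1]):
    print(f"{model}: gap = {gap:+.2f}")
```

Because both conditions share the exact same instances, any per-instance difficulty variation cancels out of the comparison, which is what makes the paired design informative.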

Core claim

When the same reasoning problems are placed inside task-oriented dialogues instead of given as standalone questions, LLMs exhibit a substantial and consistent performance decline; controlled ablations indicate the multi-turn nature of dialogue is the dominant factor, with additional contributions from role instructions and tool-use requirements.

What carries the argument

BOULDER, a dynamic benchmark that supplies matched isolated and dialogue variants for eight travel tasks requiring arithmetic, spatial, temporal, commonsense, and formal reasoning.
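BOULDER's pairing idea — one rule-generated problem rendered both as a standalone question and as a turn inside a travel dialogue — can be sketched roughly as follows. The task, templates, value ranges, and message schema here are invented for illustration; they are not the benchmark's actual generation code:

```python
import random

def make_ticket_problem(rng: random.Random) -> dict:
    """Rule-based instance generation: sample values, compute the gold answer."""
    price = rng.randint(10, 40)   # hypothetical ticket price in pounds
    tickets = rng.randint(2, 6)   # hypothetical number of travellers
    return {"price": price, "tickets": tickets, "answer": price * tickets}

def render_isolated(p: dict) -> str:
    """Standalone-question framing of the instance."""
    return (f"A train ticket costs £{p['price']}. "
            f"What is the total for {p['tickets']} tickets?")

def render_dialogue(p: dict) -> list[dict]:
    """Same instance, embedded in a multi-turn travel-assistant exchange."""
    return [
        {"role": "system", "content": "You are a helpful travel assistant."},
        {"role": "user", "content": "Hi, I need a train to Cambridge on Friday."},
        {"role": "assistant", "content": f"Sure — tickets are £{p['price']} each."},
        {"role": "user", "content": f"Great, book {p['tickets']} tickets. "
                                    "How much will that be in total?"},
    ]

rng = random.Random(0)
problem = make_ticket_problem(rng)
isolated_prompt = render_isolated(problem)
dialogue_messages = render_dialogue(problem)
```

Because instances are sampled fresh from value ranges rather than drawn from a fixed question pool, the benchmark is "dynamic": new examples can be generated at will, which is also what mitigates data contamination.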

If this is right

  • Isolated-benchmark scores cannot be treated as reliable proxies for reasoning ability in live interactive systems.
  • Model evaluation protocols should routinely include multi-turn dialogue variants to surface hidden weaknesses.
  • Training objectives may need explicit exposure to interleaved reasoning and instruction-following within conversations.
  • Tool-augmented dialogue agents will likely underperform on reasoning steps unless the dialogue context itself is part of the training signal.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Developers of dialogue systems may need new fine-tuning regimes that simulate full conversation histories rather than single prompts.
  • The gap could widen further in longer sessions or with more complex role constraints, suggesting a scaling law for dialogue depth.
  • Existing safety and alignment techniques tuned on isolated prompts may transfer poorly once models must maintain persona across turns.

Load-bearing premise

The observed performance gap stems mainly from the multi-turn interactive format rather than from benchmark artifacts, differing levels of data contamination, or variations in prompt length and structure.

What would settle it

Re-running the eight tasks with dialogue variants that are forced to single-turn responses while exactly matching token length, role instructions, and output format constraints, then finding no remaining performance gap.
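The decisive ablation resembles the paper's Single-turn dialogue setup: merge all user queries into one message while leaving the system prompt, and with it the role, format, and style instructions, untouched. A rough sketch of such a collapse, assuming a standard chat-message schema (not the authors' code):

```python
def collapse_to_single_turn(messages: list[dict]) -> list[dict]:
    """Merge every user turn into a single user message, preserving the
    system prompt so role/format/style instructions stay identical."""
    system = [m for m in messages if m["role"] == "system"]
    user_parts = [m["content"] for m in messages if m["role"] == "user"]
    merged = {"role": "user", "content": " ".join(user_parts)}
    return system + [merged]

dialogue = [
    {"role": "system", "content": "You are a travel assistant."},
    {"role": "user", "content": "Which trains leave after 20:00?"},
    {"role": "assistant", "content": "There are 12 trains after 20:00."},
    {"role": "user", "content": "How frequent are they on average?"},
]
single = collapse_to_single_turn(dialogue)
```

To fully settle the question, one would additionally pad or trim so that the collapsed prompt matches the multi-turn variant in token length, which this sketch does not attempt.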

Figures

Figures reproduced from arXiv:2603.20133 by Ivan Kartáč, Mateusz Lango, Ondřej Dušek.

Figure 1: An example from the BOULDER benchmark, showing the same problem instance in two variants: as an isolated task and within a task-oriented dialogue.
Figure 2: Evaluation results of 8 LLMs in the three …
Figure 5: Results for the Baseline with dialogue role and Dialogue with reasoning instruction ablations, micro-averaged over all tasks.
Figure 6: Detailed per-task results for the Baseline (b), Dialogue (d), and Dialogue-concise (c) setups. Asterisks indicate significant differences between setups in neighboring columns (t-test, ∗: p < 0.05, ∗∗: p < 0.01, ∗∗∗: p < 0.001).
Figure 7: Example conversation template in JSON format for the …
Figure 8: Prompt template used for generating paraphrases of conversation templates.
Figure 9: Prompt template for the task-oriented dialogue system used in the …
Figure 10: Prompt template for the task-oriented dialogue system used in the …
Figure 11: Changes made to the prompt for the Dialogue with reduced domains ablation. Text highlighted in red indicates removed content; the example covers the distance and opening hours tasks, where the attractions and trains domains are removed.
Figure 12: Changes made to the prompt and the message history for the …
Figure 13: Changes made to the prompt and the message history for the …
Figure 14: Changes made to the prompt and the message history for the …
Figure 15: Changes made to the prompt and the message history for the …
Figure 16: Changes made to the prompt and the message history for the …
Figure 17: Prompt template for the amounts parser used to extract values in the …
Figure 18: Prompt template for the time parser used to extract values in the …
Figure 19: Prompt template for the time parser used to extract values in the …
Figure 20: Prompt template for the restaurant names parser used to extract values in the …
Figure 21: Prompt template for the time parser used to extract values in the …
Figure 22: Prompt template for the time parser used to extract values in the …
Figure 23: Prompt template for the time parser used to extract values in the …
Figure 24: Detailed results for the Dialogue with reduced domains (d(d)) ablation, compared with the Baseline (b) and Dialogue (d) setups.
Figure 25: Detailed results for the Dialogue without tools (d(t)) ablation, compared with the Baseline (b) and Dialogue (d) setups.
Figure 26: Detailed results for the Multi-turn baseline (b(m)) ablation, compared with the Baseline (b) and Dialogue (d) setups.
Figure 27: Detailed results for the Single-turn dialogue (d(s)) ablation, compared with the Baseline (b) and Dialogue (d) setups.
Figure 28: Detailed results for the Baseline with dialogue role (b(r)) ablation, compared with the Baseline (b) and Dialogue (d) setups.
Figure 29: Detailed results for the Dialogue with reasoning instructions (d(r)) ablation, compared with the Baseline (b) and Dialogue (d) setups.
Figure 30: Detailed results for the parser evaluation by task, model, and evaluation setting.
Figure 31: Scores and average response lengths in characters by LLM averaged over all tasks.
Figure 32: Comparison of scores and average response lengths in characters by LLM for the …
Figure 33: Comparison of scores and average response lengths in characters by LLM for the …
Figure 34: Comparison of responses from Qwen3 30B A3B for the …
Figure 35: Comparison of responses from Command A 111B for the …
Figure 36: Comparison of responses from Qwen3 235B A22B for the …
Figure 37: Comparison of responses from Claude 4.5 Sonnet for the …
Original abstract

Large Language Models (LLMs) achieve strong performance on many reasoning benchmarks, yet these evaluations typically focus on isolated tasks that differ from real-world usage in task-oriented dialogue (TOD). In this setting, LLMs must perform reasoning inherently while generating text and adhering to instructions on role, format, and style. This mismatch raises concerns about whether benchmark performance accurately reflects models' reasoning robustness in TOD setting. We investigate how framing reasoning tasks within TOD affects LLM performance by introducing BOULDER, a new dynamic benchmark covering eight travel-related tasks that require arithmetic, spatial, and temporal reasoning with both commonsense and formal aspects. Each problem is presented in both isolated and dialogue-based variants, enabling controlled comparison while mitigating data contamination. Experiments on eight LLMs reveal a substantial and consistent performance gap between isolated and dialogue settings. Through ablations and qualitative analysis, we show that this gap is largely driven by the multi-turn nature of dialogue, with additional effects from role conditioning and tool-use requirements. Our results highlight the need to evaluate LLM reasoning in realistic interactive scenarios.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces BOULDER, a dynamic benchmark of eight travel-related tasks requiring arithmetic, spatial, and temporal reasoning. Each task is instantiated in matched isolated and dialogue-based variants; experiments on eight LLMs report a consistent performance drop in the dialogue setting, which ablations and qualitative analysis attribute primarily to multi-turn interaction, with secondary contributions from role conditioning and tool-use requirements.

Significance. If the controlled comparison holds, the work demonstrates that standard isolated reasoning benchmarks can overestimate LLM capabilities relative to realistic task-oriented dialogue, providing a concrete motivation for interactive evaluation protocols. The benchmark design that re-uses the same underlying problems across framings is a methodological strength that reduces contamination concerns.

major comments (2)
  1. §4 (Results) and §5 (Ablations): the central claim that the gap is 'largely driven by the multi-turn nature' requires explicit confirmation that isolated baselines receive identical role, format, and style instructions on every turn; without a side-by-side prompt template comparison, the ablation results risk conflating turn count with differences in instruction complexity and output constraints.
  2. Abstract and §4: the reported 'substantial and consistent performance gap' is stated without numerical values, per-task accuracies, error bars, or exact problem counts; the results tables must include these quantities together with statistical tests to allow readers to assess effect size and variability across the eight models.
minor comments (2)
  1. §3 (Benchmark construction): specify the exact number of problems generated per task and the precise mechanism used to ensure the isolated and dialogue variants remain semantically identical while varying only the interaction framing.
  2. Figure 2 and Table 1: axis labels and legend entries should explicitly distinguish 'isolated' from 'dialogue' conditions and indicate whether error bars represent standard deviation across models or across problem instances.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important areas for improving clarity in our experimental design and results presentation. We address each point below and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: §4 (Results) and §5 (Ablations): the central claim that the gap is 'largely driven by the multi-turn nature' requires explicit confirmation that isolated baselines receive identical role, format, and style instructions on every turn; without a side-by-side prompt template comparison, the ablation results risk conflating turn count with differences in instruction complexity and output constraints.

    Authors: We agree that explicit confirmation is necessary to strengthen the claim. In the isolated setting, each problem receives the identical role, format, and style instructions as the first turn of the corresponding dialogue variant (with no subsequent turns). To eliminate any ambiguity and allow readers to verify that instruction complexity does not confound the multi-turn factor, we will add side-by-side prompt templates for both conditions in the revised appendix. This addition will directly support the ablation results attributing the gap primarily to multi-turn interaction. revision: yes

  2. Referee: Abstract and §4: the reported 'substantial and consistent performance gap' is stated without numerical values, per-task accuracies, error bars, or exact problem counts; the results tables must include these quantities together with statistical tests to allow readers to assess effect size and variability across the eight models.

    Authors: We will revise §4 to expand the results tables with per-task accuracies, exact problem counts per task and model, error bars (standard deviation across multiple runs), and statistical significance tests (paired t-tests with p-values) for the isolated vs. dialogue gaps. These details will enable assessment of effect sizes and variability. The abstract will remain a high-level summary, but the main text will now contain all requested quantitative information. revision: yes

Circularity Check

0 steps flagged

No circularity: direct empirical comparison on newly introduced paired benchmark variants

Full rationale

The paper introduces the BOULDER benchmark with explicitly paired isolated and dialogue-based variants of the same travel-related reasoning problems. Performance gaps are measured through controlled experiments on eight LLMs, supported by ablations that vary role conditioning, tool-use, and turn count. No equations, predictions, or first-principles derivations are present that reduce to their own inputs by construction. The central claim rests on observable experimental differences rather than self-referential definitions, fitted parameters renamed as predictions, or load-bearing self-citations. The derivation chain is self-contained and externally falsifiable via replication on the released benchmark.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the construction of the BOULDER benchmark and the empirical comparison; no free parameters, axioms, or invented entities are described in the abstract.

pith-pipeline@v0.9.0 · 5487 in / 1025 out tokens · 40203 ms · 2026-05-15T08:07:36.638798+00:00 · methodology

